Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50  # number of observations
sig = 0.25  # standard deviation of the noise
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)  # add an intercept column
beta = [5.0, 0.5, 0.5, -0.02]  # true coefficients
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)  # add Gaussian noise
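
The data-generating process is y = 5 + 0.5*x1 + 0.5*sin(x1) - 0.02*(x1 - 5)**2 + e, where the noise e is drawn from a normal distribution with standard deviation 0.25.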

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.982
Model:                            OLS   Adj. R-squared:                  0.981
Method:                 Least Squares   F-statistic:                     831.6
Date:                Thu, 03 Nov 2022   Prob (F-statistic):           4.63e-40
Time:                        06:34:53   Log-Likelihood:                0.30166
No. Observations:                  50   AIC:                             7.397
Df Residuals:                      46   BIC:                             15.04
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.2132      0.085     60.996      0.000       5.041       5.385
x1             0.4696      0.013     35.623      0.000       0.443       0.496
x2             0.5091      0.052      9.826      0.000       0.405       0.613
x3            -0.0179      0.001    -15.493      0.000      -0.020      -0.016
==============================================================================
Omnibus:                        0.916   Durbin-Watson:                   1.512
Prob(Omnibus):                  0.633   Jarque-Bera (JB):                0.897
Skew:                           0.136   Prob(JB):                        0.639
Kurtosis:                       2.403   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
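
The quantities in the summary are also available directly as attributes of the results object, which is convenient for programmatic use. A minimal sketch using standard OLSResults attributes (only the fit above is assumed):

[ ]:
print(olsres.params)      # estimated coefficients
print(olsres.bse)         # standard errors of the coefficients
print(olsres.conf_int())  # confidence intervals (95% by default)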

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.76491656  5.2288618   5.65362959  6.01147205  6.28465536  6.4683734
  6.57153733  6.61531158  6.62963654  6.64830945  6.70343163  6.82013432
  7.01244964  7.28100524  7.6129211   7.98392572  8.36234383  8.71430061
  9.00928715  9.22517301  9.35184221  9.39285523  9.36486403  9.2948762
  9.21581769  9.16112313  9.15924267  9.22896767  9.37634333  9.5936755
  9.86079478 10.14837033 10.42272881 10.65138684 10.8083879  10.87856696
 10.86004199 10.76452323 10.61538917 10.4438456  10.28379854 10.16628357
 10.11436682 10.13935522 10.23893883 10.39757035 10.5890195  10.78068257
 10.93893914 11.0346758 ]
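
Because these are predictions at the original design matrix, they match the fitted values stored on the results object. A quick sanity check (using only objects defined above):

[ ]:
np.allclose(ypred, olsres.fittedvalues)  # expected: True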

Create a new sample of explanatory variables Xnew, predict, and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[11.03871909 10.90953611 10.66709354 10.35689295 10.03883038  9.77253176
  9.60275436  9.54842865  9.59802348  9.71236931]
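
predict returns point predictions only. For uncertainty around the out-of-sample predictions, get_prediction returns an object exposing confidence and prediction intervals. A minimal sketch reusing the Xnew built above:

[ ]:
pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # mean, mean_ci_*, and obs_ci_* columns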

Plot comparison

[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7fbc7091c4f0>
[Figure: data points, the true curve, and the OLS prediction over the in-sample and out-of-sample range]

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we do not want any formula expansion magic from using **2; the expression inside I() is evaluated as ordinary Python code instead.
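
For comparison, the same model can be fit without I() by precomputing the transformed column and passing it through the data dictionary. A minimal sketch (the column name x1_sq is introduced here only for illustration):

[ ]:
data2 = {"x1": x1, "x1_sq": (x1 - 5) ** 2, "y": y}
res2 = ols("y ~ x1 + np.sin(x1) + x1_sq", data=data2).fit()
# res2.params should agree with res.params, apart from the parameter name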

[9]:
res.params
[9]:
Intercept           5.213184
x1                  0.469554
np.sin(x1)          0.509144
I((x1 - 5) ** 2)   -0.017931
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    11.038719
1    10.909536
2    10.667094
3    10.356893
4    10.038830
5     9.772532
6     9.602754
7     9.548429
8     9.598023
9     9.712369
dtype: float64
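
As a consistency check, the formula-based out-of-sample predictions should agree with the array-based predictions computed earlier (using only objects defined above):

[ ]:
np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred)  # expected: True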