Prediction (out of sample)

In [1]:
%matplotlib inline

import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50                      # number of observations
sig = 0.25                        # standard deviation of the noise
x1 = np.linspace(0, 20, nsample)
# design matrix: x1, sin(x1), (x1 - 5)**2, plus a constant column
X = np.column_stack((x1, np.sin(x1), (x1 - 5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]      # true coefficients
y_true = np.dot(X, beta)          # noiseless response
y = y_true + sig * np.random.normal(size=nsample)
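
Note that y includes a fresh random draw, so re-running the cell gives slightly different numbers below. A minimal sketch for a reproducible draw, seeding the generator before the noise is drawn (the seed value is arbitrary):

In [ ]:
np.random.seed(12345)  # arbitrary seed, only for reproducibility
y = y_true + sig * np.random.normal(size=nsample)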

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.988
Model:                            OLS   Adj. R-squared:                  0.988
Method:                 Least Squares   F-statistic:                     1298.
Date:                Sat, 02 Nov 2019   Prob (F-statistic):           1.94e-44
Time:                        05:19:07   Log-Likelihood:                 8.4067
No. Observations:                  50   AIC:                            -8.813
Df Residuals:                      46   BIC:                            -1.165
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9154      0.073     67.633      0.000       4.769       5.062
x1             0.5142      0.011     45.876      0.000       0.492       0.537
x2             0.5593      0.044     12.693      0.000       0.471       0.648
x3            -0.0209      0.001    -21.248      0.000      -0.023      -0.019
==============================================================================
Omnibus:                        0.027   Durbin-Watson:                   2.031
Prob(Omnibus):                  0.987   Jarque-Bera (JB):                0.203
Skew:                          -0.013   Prob(JB):                        0.903
Kurtosis:                       2.689   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
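
Since the data are simulated, we can sanity-check the fit by comparing the estimated coefficients with the true beta used to generate y (a minimal check; olsres.params holds the estimates in column order):

In [ ]:
print(olsres.params)  # estimated [const, x1, sin(x1), (x1 - 5)**2] coefficients
print(beta)           # true values used in the simulation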

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.39266487  4.90641333  5.37672019  5.77310377  6.07608309  6.2803785
  6.39577912  6.44553454  6.46253496  6.48390741  6.5449159   6.67316766
  6.88407731  7.17833431  7.54178968  7.94778063  8.36151118  8.74576906
  9.06703956  9.30101189  9.43657343  9.47763542  9.44249044  9.36080693
  9.26875476  9.20306265  9.19498372  9.26516069  9.42023437  9.65175253
  9.93755867 10.24543217 10.53838214 10.78072483 10.94394645 11.01138784
 10.98098158 10.8655916  10.69089973 10.49118621 10.30369756 10.16252708
 10.09301342 10.10757749 10.20368178 10.36424722 10.56045927 10.75650202
 10.91544269 11.00529953]
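
For predictions at the original design matrix, predict(X) gives the same numbers as the fitted values stored on the results object. A quick sketch verifying this and computing the in-sample RMSE, which should be close to sig:

In [ ]:
print(np.allclose(ypred, olsres.fittedvalues))  # identical numbers, two spellings
print(np.sqrt(np.mean((y - ypred)**2)))         # in-sample RMSE, roughly sig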

Create a new sample of explanatory variables, Xnew, predict out of sample, and plot the result.

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.99043297 10.82860765 10.54175745 10.17986683  9.80873297  9.49385626
  9.28440357  9.20117046  9.23148971  9.33233268]
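
Point predictions carry no uncertainty information. For standard errors and intervals, the results object also provides get_prediction; a minimal sketch (summary_frame reports both the confidence interval for the mean and the wider prediction interval for new observations):

In [ ]:
pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # mean, mean_se, mean CI, observation PI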

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
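
Building on the get_prediction sketch above, the 95% prediction interval could be shaded onto the same axes (this assumes pred from that sketch is available):

In [ ]:
frame = pred.summary_frame(alpha=0.05)
ax.fill_between(x1n, frame["obs_ci_lower"], frame["obs_ci_upper"],
                color="r", alpha=0.2, label="95% prediction interval")
ax.legend(loc="best")
fig  # re-display the updated figure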

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we don't want any formula expansion magic applied to **2; the expression inside I() is evaluated as plain Python.
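
To see what I() does, the design column can be built directly with patsy, the formula engine behind statsmodels' formula API; a minimal sketch checking that the column is just the plain Python expression (x1 - 5)**2:

In [ ]:
from patsy import dmatrix

dm = dmatrix("I((x1 - 5)**2) + 0", data)  # "+ 0" drops the intercept column
print(np.allclose(np.asarray(dm).ravel(), (x1 - 5)**2))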

In [8]:
res.params
Out[8]:
Intercept           4.915431
x1                  0.514207
np.sin(x1)          0.559306
I((x1 - 5) ** 2)   -0.020911
dtype: float64
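
These estimates should agree with the array-based fit from above; a quick check (res.params is a pandas Series, olsres.params a plain array, in the same column order):

In [ ]:
print(np.allclose(res.params.values, olsres.params))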

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.990433
1    10.828608
2    10.541757
3    10.179867
4     9.808733
5     9.493856
6     9.284404
7     9.201170
8     9.231490
9     9.332333
dtype: float64
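
get_prediction works with the formula interface as well, so interval estimates for the new points require only the raw x1 values; a minimal sketch:

In [ ]:
print(res.get_prediction(exog=dict(x1=x1n)).summary_frame(alpha=0.05))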