Prediction (out of sample)

In [1]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

We simulate 50 noisy observations from y = 5 + 0.5*x1 + 0.5*sin(x1) - 0.02*(x1 - 5)**2, with Gaussian noise of standard deviation 0.25.

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.981
Model:                            OLS   Adj. R-squared:                  0.980
Method:                 Least Squares   F-statistic:                     796.5
Date:                Wed, 27 Apr 2016   Prob (F-statistic):           1.22e-39
Time:                        04:34:15   Log-Likelihood:                -1.1372
No. Observations:                  50   AIC:                             10.27
Df Residuals:                      46   BIC:                             17.92
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          5.0978      0.088     57.954      0.000         4.921     5.275
x1             0.4924      0.014     36.295      0.000         0.465     0.520
x2             0.5335      0.053     10.003      0.000         0.426     0.641
x3            -0.0204      0.001    -17.137      0.000        -0.023    -0.018
==============================================================================
Omnibus:                        1.849   Durbin-Watson:                   2.631
Prob(Omnibus):                  0.397   Jarque-Bera (JB):                1.018
Skew:                          -0.270   Prob(JB):                        0.601
Kurtosis:                       3.444   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
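
Since we simulated the data, we can sanity-check the fit by comparing the estimated coefficients with the true beta (using only objects defined above):

print(olsres.params)  # estimated coefficients [const, x1, x2, x3]
print(beta)           # true coefficients [5.0, 0.5, 0.5, -0.02]

The estimates in the summary (5.0978, 0.4924, 0.5335, -0.0204) are close to the true values.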

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[  4.58747959   5.08011438   5.53115769   5.91153513   6.20266511
   6.39951171   6.51141214   6.56054259   6.57827479   6.60002171
   6.65941931   6.78280024   6.98486748   7.26627864   7.61353798
   8.00121367   8.39611617   8.76275122   9.06915141   9.29212832
   9.42108238   9.45974425   9.42556199   9.34683431   9.2580612
   9.19427571   9.18528806   9.25078795   9.3971096    9.61619101
   9.88689857  10.17849923  10.45570988  10.68449421  10.83765484
  10.89930174  10.86746291  10.75440841  10.58463439  10.39083844
  10.20854751  10.07028088  10.00020736  10.01017445  10.09776175
  10.24667873  10.42944113  10.61188658  10.7587874   10.83963837]
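
In-sample predictions are the same as the fitted values stored on the results object; a minimal check:

print(np.allclose(ypred, olsres.fittedvalues))  # True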

Create a new sample of explanatory variables Xnew, predict and plot

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[ 10.819296    10.65854633  10.37831058  10.02626558   9.66517077
   9.35750256   9.15015789   9.06297224   9.0838632    9.17178867]

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I() function to indicate the identity transform, i.e., we don't want any formula-expansion magic applied to the **2 term.
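
For instance (a minimal sketch, fitting a hypothetical purely quadratic model just for illustration): inside a formula, ** denotes interaction order rather than Python exponentiation, so arithmetic like squaring must be wrapped in I() to be evaluated as plain Python:

res_sq = ols("y ~ I(x1**2)", data=data).fit()  # I(...) evaluates x1**2 numerically
print(res_sq.params)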

In [8]:
res.params
Out[8]:
Intercept           5.097776
x1                  0.492374
np.sin(x1)          0.533484
I((x1 - 5) ** 2)   -0.020412
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are created for us automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
array([ 10.819296  ,  10.65854633,  10.37831058,  10.02626558,
         9.66517077,   9.35750256,   9.15015789,   9.06297224,
         9.0838632 ,   9.17178867])
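
Beyond point predictions, newer statsmodels releases also provide a get_prediction method that returns standard errors and confidence/prediction intervals for new observations (a minimal sketch, assuming a version in which this method is available):

pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # predicted mean, its std err, and 95% mean/observation intervals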