Prediction (out of sample)

In [1]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
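# Note: no random seed is set, so the exact numbers in the outputs below
# depend on this particular draw; calling np.random.seed(...) beforehand
# would make the notebook reproducible.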

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.981
Model:                            OLS   Adj. R-squared:                  0.980
Method:                 Least Squares   F-statistic:                     808.7
Date:                Wed, 24 Dec 2014   Prob (F-statistic):           8.70e-40
Time:                        13:07:57   Log-Likelihood:                -3.7832
No. Observations:                  50   AIC:                             15.57
Df Residuals:                      46   BIC:                             23.21
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          4.8950      0.093     52.780      0.000         4.708     5.082
x1             0.5222      0.014     36.510      0.000         0.493     0.551
x2             0.3781      0.056      6.725      0.000         0.265     0.491
x3            -0.0216      0.001    -17.196      0.000        -0.024    -0.019
==============================================================================
Omnibus:                        8.517   Durbin-Watson:                   1.657
Prob(Omnibus):                  0.014   Jarque-Bera (JB):                7.697
Skew:                          -0.900   Prob(JB):                       0.0213
Kurtosis:                       3.677   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[  4.35511791   4.80289944   5.21882741   5.58229511   5.88013272
   6.10877099   6.27482773   6.39402048   6.48858417   6.58361801
   6.7029617    6.86527872   7.08098989   7.35056132   7.66442779
   8.0045643    8.34744756   8.66792073   8.94332669   9.1572305
   9.3021196    9.38063817   9.4051526    9.39571983   9.37679198
   9.37319894   9.40606895   9.48935722   9.62755315   9.81494276
  10.03654745  10.27058496  10.49204811  10.67681345  10.80560463
  10.86715948  10.86008033  10.79306367  10.68347133  10.55447795
  10.43126351  10.33687628  10.28844609  10.29436992  10.35293232
  10.45258723  10.57385501  10.69252282  10.78362281  10.82553416]
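
These in-sample predictions are identical to the fitted values stored on the results object; a quick sanity check (a minimal sketch reusing the objects above):

print(np.allclose(ypred, olsres.fittedvalues))  # True: predict(X) reproduces fittedvalues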

Create a new sample of explanatory variables Xnew, predict, and plot

In [5]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[ 10.78902362  10.64943394  10.42159319  10.13929264   9.84701351
   9.58903639   9.39859984   9.28976237   9.2539605    9.26210536]

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
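
To put error bands around these out-of-sample predictions, statsmodels provides wls_prediction_std (living in the sandbox module at the time of writing); a minimal sketch that reuses olsres, Xnew, and the axes from above:

from statsmodels.sandbox.regression.predstd import wls_prediction_std

# standard error of the prediction and the lower/upper 95% interval bounds
prstd, iv_l, iv_u = wls_prediction_std(olsres, exog=Xnew)
ax.plot(x1n, iv_l, 'r--', label="95% interval")
ax.plot(x1n, iv_u, 'r--')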

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the Identity transform, i.e., we don't want any expansion magic from using **2.
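
To see what I() buys us, we can build a small design matrix directly with patsy (the formula machinery behind statsmodels); a minimal sketch on made-up values:

from patsy import dmatrix

# '- 1' drops the intercept; I(...) evaluates ordinary Python arithmetic
demo = dmatrix("x1 + I((x1 - 5)**2) - 1", {"x1": np.array([0., 5., 10.])})
print(np.asarray(demo))  # second column is (x1 - 5)**2: 25., 0., 25.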

In [8]:
res.params
Out[8]:
Intercept           4.895017
x1                  0.522221
np.sin(x1)          0.378110
I((x1 - 5) ** 2)   -0.021596
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are generated automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
array([ 10.78902362,  10.64943394,  10.42159319,  10.13929264,
         9.84701351,   9.58903639,   9.39859984,   9.28976237,
         9.2539605 ,   9.26210536])
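
A pandas DataFrame works just as well as the dict for new data; a minimal sketch that should reproduce the numbers above:

import pandas as pd

print(res.predict(exog=pd.DataFrame({"x1": x1n})))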
