Prediction (out of sample)

In [1]:
%matplotlib inline

from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
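
The noise draws above are unseeded, so a re-run will not reproduce the exact numbers printed below. A minimal sketch, assuming any fixed seed (the value 12345 is arbitrary, not from the original), placed before the draw:

In [ ]:
np.random.seed(12345)  # hypothetical seed; fixes the normal draws so results reproduce
y = y_true + sig * np.random.normal(size=nsample)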

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.983
Model:                            OLS   Adj. R-squared:                  0.982
Method:                 Least Squares   F-statistic:                     877.4
Date:                Wed, 27 Jun 2018   Prob (F-statistic):           1.38e-40
Time:                        23:25:22   Log-Likelihood:                -1.4283
No. Observations:                  50   AIC:                             10.86
Df Residuals:                      46   BIC:                             18.50
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9664      0.088     56.132      0.000       4.788       5.144
x1             0.5034      0.014     36.891      0.000       0.476       0.531
x2             0.4067      0.054      7.582      0.000       0.299       0.515
x3            -0.0196      0.001    -16.323      0.000      -0.022      -0.017
==============================================================================
Omnibus:                        0.171   Durbin-Watson:                   1.680
Prob(Omnibus):                  0.918   Jarque-Bera (JB):                0.202
Skew:                          -0.125   Prob(JB):                        0.904
Kurtosis:                       2.816   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.47745236  4.92091305  5.33133464  5.68655172  5.97239827  6.18503507
  6.3315805   6.42894101  6.50103341  6.57485545  6.67605012  6.8246926
  7.03199199  7.29844964  7.61377681  7.95858507  8.30757176  8.63367717
  8.91253034  9.12645316  9.26736483  9.33810952  9.35198906  9.3305775
  9.30017657  9.28749434  9.3152572   9.39847596  9.54197963  9.73962199
  9.97529144 10.22555808 10.46352315 10.66323846 10.80396956 10.8736024
 10.8706336  10.80441745 10.69362904 10.56319601 10.44020298 10.34944163
 10.30933751 10.32892276 10.40635217 10.52920634 10.67653212 10.82228521
 10.93960912 11.00524764]
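
For in-sample fitted values, predict can also be called with no argument, in which case statsmodels falls back to the design matrix used at estimation time; the fittedvalues attribute holds the same numbers. A quick check (not in the original):

In [ ]:
# predict() with no exog reuses the estimation-time X
assert np.allclose(ypred, olsres.predict())
assert np.allclose(ypred, olsres.fittedvalues)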

Create a new sample of explanatory variables, Xnew, predict, and plot.

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.99282559 10.871364   10.65681255 10.38551853 10.10532781  9.86387041
  9.69689911  9.61953585  9.62256937  9.67471047]
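
The point predictions above carry no uncertainty information. A minimal sketch using get_prediction on the fitted results to obtain confidence and prediction intervals for the new sample:

In [ ]:
pred = olsres.get_prediction(Xnew)
# summary_frame columns include mean, mean_ci_lower/upper (CI for the mean)
# and obs_ci_lower/upper (prediction interval for new observations)
print(pred.summary_frame(alpha=0.05))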

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform. I.e., we don't want any formula-expansion magic from using **2.
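
Without I(), patsy reads ** as the formula power operator, under which a numeric term interacted with itself collapses back to the plain term, so the squared column would silently disappear. A small illustrative check (not in the original):

In [ ]:
# x1**2 in formula algebra reduces to x1, so only Intercept and x1 are fit
print(ols("y ~ x1**2", data=data).fit().params)
# wrapping in I() keeps the literal squared column
print(ols("y ~ I(x1**2)", data=data).fit().params)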

In [8]:
res.params
Out[8]:
Intercept           4.966357
x1                  0.503387
np.sin(x1)          0.406711
I((x1 - 5) ** 2)   -0.019556
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.992826
1    10.871364
2    10.656813
3    10.385519
4    10.105328
5     9.863870
6     9.696899
7     9.619536
8     9.622569
9     9.674710
dtype: float64
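
As a consistency check (not part of the original), the formula route reproduces the manual design-matrix predictions from above:

In [ ]:
assert np.allclose(ynewpred, res.predict(exog=dict(x1=x1n)))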