Prediction (out of sample)

In [1]:
%matplotlib inline

import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25  # standard deviation of the noise
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]  # true coefficients: const, x1, sin(x1), (x1-5)**2
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)  # add Gaussian noise
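
Since y includes random noise, the numbers shown below will vary from run to run. If you want the example to be reproducible, you could seed NumPy's generator before drawing the noise, for example:

np.random.seed(9876789)  # any fixed seed makes the noise draws repeatable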

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.983
Model:                            OLS   Adj. R-squared:                  0.982
Method:                 Least Squares   F-statistic:                     872.2
Date:                Sat, 03 Nov 2018   Prob (F-statistic):           1.58e-40
Time:                        12:14:34   Log-Likelihood:                -1.3441
No. Observations:                  50   AIC:                             10.69
Df Residuals:                      46   BIC:                             18.34
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9445      0.088     55.979      0.000       4.767       5.122
x1             0.5220      0.014     38.321      0.000       0.495       0.549
x2             0.4796      0.054      8.955      0.000       0.372       0.587
x3            -0.0220      0.001    -18.398      0.000      -0.024      -0.020
==============================================================================
Omnibus:                        0.331   Durbin-Watson:                   2.180
Prob(Omnibus):                  0.847   Jarque-Bera (JB):                0.189
Skew:                          -0.148   Prob(JB):                        0.910
Kurtosis:                       2.944   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
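
Warning [1] above means the reported standard errors are only valid if the errors are homoskedastic. If that assumption is in doubt, a heteroskedasticity-robust covariance can be requested at fit time; a minimal sketch using the HC3 sandwich estimator:

# Sketch: same model, but with heteroskedasticity-robust (HC3) standard errors
olsres_robust = sm.OLS(y, X).fit(cov_type="HC3")
print(olsres_robust.bse)  # robust standard errors for const, x1, x2, x3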

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.39439088  4.88396437  5.33493173  5.72115694  6.02593635  6.24474299
  6.38597042  6.4695536   6.5236937   6.5802247   6.66938317  6.81484047
  7.02981363  7.3148938   7.65794917  8.03611817  8.41956555  8.77638423
  9.07783756  9.30308067  9.4425852   9.4997048   9.49012412  9.43928191
  9.37819144  9.33834515  9.34654052  9.42047734  9.56584964  9.77541022
 10.03016131 10.3024759  10.5606367  10.7740473  10.91825898 10.97898762
 10.95446054 10.85570772 10.70474962 10.53097928 10.36633314 10.24004395
 10.17383791 10.17836514 10.2514499  10.37844801 10.53465291 10.68935482
 10.81088632 10.87182519]
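
For the original design matrix, predict(X) simply reproduces the fitted values that the results object already stores, which you can verify:

# In-sample predictions coincide with the stored fitted values
assert np.allclose(ypred, olsres.fittedvalues)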

Create a new sample of explanatory variables, Xnew, then predict and plot

In [5]:
x1n = np.linspace(20.5, 25, 10)  # new x values beyond the estimation range
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.83749602 10.67512185 10.40350952 10.06551741  9.71756223  9.41580627
  9.20240685  9.09519479  9.08330878  9.12985462]
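
predict returns point predictions only. If you also need confidence or prediction intervals, the results object exposes get_prediction; a sketch of how the interval summary could be pulled for the new sample:

# Point predictions plus 95% interval estimates for the new observations
pred = olsres.get_prediction(Xnew)
frame = pred.summary_frame(alpha=0.05)  # columns include mean, mean_ci_*, obs_ci_*
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])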

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")        # noisy observations
ax.plot(x1, y_true, 'b-', label="True")  # noise-free curve
# in-sample fit followed by the out-of-sample extrapolation
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we don't want any formula expansion magic applied to **2.

In [8]:
res.params
Out[8]:
Intercept           4.944508
x1                  0.522027
np.sin(x1)          0.479568
I((x1 - 5) ** 2)   -0.022005
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.837496
1    10.675122
2    10.403510
3    10.065517
4     9.717562
5     9.415806
6     9.202407
7     9.095195
8     9.083309
9     9.129855
dtype: float64
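
The same interval machinery works through the formula interface: get_prediction accepts the dict of raw variables and applies the formula transformations for you. A sketch:

# 95% prediction intervals for the formula-based model, out of sample
pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))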