Prediction (out of sample)

In [1]:
%matplotlib inline
In [2]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
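
The noise is drawn from a random normal, so the exact numbers shown below will differ from run to run. For a reproducible example one could seed NumPy's generator before drawing the noise; a minimal sketch (the seed value 12345 is arbitrary):

np.random.seed(12345)  # fix the RNG state so y is the same on every run
y = y_true + sig * np.random.normal(size=nsample)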

Estimation

In [4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.981
Model:                            OLS   Adj. R-squared:                  0.980
Method:                 Least Squares   F-statistic:                     785.7
Date:                Tue, 18 Feb 2020   Prob (F-statistic):           1.67e-39
Time:                        09:47:20   Log-Likelihood:                -2.6839
No. Observations:                  50   AIC:                             13.37
Df Residuals:                      46   BIC:                             21.02
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0546      0.091     55.712      0.000       4.872       5.237
x1             0.4932      0.014     35.250      0.000       0.465       0.521
x2             0.4909      0.055      8.924      0.000       0.380       0.602
x3            -0.0195      0.001    -15.903      0.000      -0.022      -0.017
==============================================================================
Omnibus:                        0.610   Durbin-Watson:                   2.191
Prob(Omnibus):                  0.737   Jarque-Bera (JB):                0.426
Skew:                          -0.224   Prob(JB):                        0.808
Kurtosis:                       2.936   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
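
Because the data were simulated with a known coefficient vector beta, it can be instructive to place the estimates and their confidence intervals next to the true values. A small sketch using the objects defined above:

print(beta)               # true coefficients used to simulate y
print(olsres.params)      # OLS estimates
print(olsres.conf_int())  # 95% confidence intervals for each coefficient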

In-sample prediction

In [5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.56614888  5.03879541  5.47292095  5.84177376  6.12825668  6.3277361
  6.44880329  6.51186287  6.54578039  6.58313979  6.65488996  6.78525992
  6.98777808  7.26304952  7.59865662  7.97119921  8.35013918  8.70281778
  8.99982135  9.21981395  9.35304274  9.40294039  9.38556105  9.32694277
  9.25882944  9.21345542  9.21824951  9.29132846  9.43852045  9.65240755
  9.91354443 10.19365323 10.46026945 10.68207586 10.83404778 10.90156458
 10.88281181 10.7890793  10.64290628 10.47437826 10.31618403 10.19824502
 10.14279929 10.16074786 10.2498637  10.39515767 10.57134114 10.74698087
 10.88966339 10.97132045]
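
For the estimation sample, predict(X) returns the same values as the stored fitted values, so the following sanity check should print True:

print(np.allclose(ypred, olsres.fittedvalues))  # in-sample predictions equal the fitted values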

Create a new sample of explanatory variables Xnew, predict and plot

In [6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.9611592  10.821429   10.57137974 10.25487949  9.92967406  9.65324882
  9.46875426  9.39444128  9.41919268  9.50524509]
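
predict returns point predictions only. If uncertainty around the out-of-sample forecasts is also of interest, the results object exposes get_prediction, whose summary_frame includes standard errors and confidence/prediction intervals. A sketch (the column names are those used by recent statsmodels versions):

pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # mean, mean_se, mean_ci_*, obs_ci_* columns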

Plot comparison

In [7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
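
To make the boundary between the fitted range and the forecast region easier to see, one could additionally mark the last in-sample x value. A minimal sketch, assuming fig and ax from the cell above are still in scope:

ax.axvline(x1[-1], color="k", linestyle="--", linewidth=1)  # end of the estimation sample
fig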

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we don't want any formula expansion magic from using **2; the expression inside I() is evaluated as ordinary Python/NumPy arithmetic.
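
To see what I() protects, one can build the design matrix directly with patsy (the formula backend used by statsmodels). A small illustrative sketch; x1_demo is a made-up toy array:

import patsy

x1_demo = np.array([0.0, 5.0, 10.0])
# I() evaluates the expression literally, so the second column is (x1_demo - 5)**2
dm = patsy.dmatrix("I((x1_demo - 5)**2)", {"x1_demo": x1_demo})
print(np.asarray(dm))  # expected: [[1., 25.], [1., 0.], [1., 25.]]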

In [9]:
res.params
Out[9]:
Intercept           5.054596
x1                  0.493231
np.sin(x1)          0.490866
I((x1 - 5) ** 2)   -0.019538
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

In [10]:
res.predict(exog=dict(x1=x1n))
Out[10]:
0    10.961159
1    10.821429
2    10.571380
3    10.254879
4     9.929674
5     9.653249
6     9.468754
7     9.394441
8     9.419193
9     9.505245
dtype: float64
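
These numbers agree with the array-based out-of-sample predictions computed earlier, which can be verified directly with the objects defined above:

print(np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred))  # True: formula and array interfaces agree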