Prediction (out of sample)

In [1]:
%matplotlib inline
In [2]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
# design matrix: x1, sin(x1) and a quadratic term (x1 - 5)**2, plus a constant
X = np.column_stack((x1, np.sin(x1), (x1 - 5)**2))
X = sm.add_constant(X)
# true coefficients, then observations = signal + Gaussian noise
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.990
Model:                            OLS   Adj. R-squared:                  0.989
Method:                 Least Squares   F-statistic:                     1484.
Date:                Sat, 09 Nov 2019   Prob (F-statistic):           9.22e-46
Time:                        22:31:23   Log-Likelihood:                 12.039
No. Observations:                  50   AIC:                            -16.08
Df Residuals:                      46   BIC:                            -8.429
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9232      0.068     72.843      0.000       4.787       5.059
x1             0.5178      0.010     49.677      0.000       0.497       0.539
x2             0.4388      0.041     10.708      0.000       0.356       0.521
x3            -0.0216      0.001    -23.575      0.000      -0.023      -0.020
==============================================================================
Omnibus:                        0.761   Durbin-Watson:                   2.332
Prob(Omnibus):                  0.684   Jarque-Bera (JB):                0.357
Skew:                          -0.200   Prob(JB):                        0.837
Kurtosis:                       3.109   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
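
The quantities in the summary table are also available programmatically as attributes of the results object; a minimal sketch:

# estimated coefficients, their standard errors, and the R-squared of the fit
print(olsres.params)
print(olsres.bse)
print(olsres.rsquared)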

In-sample prediction

In [5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.3837814   4.8537578   5.28793124  5.66238875  5.96184742  6.18216534
  6.33102213  6.42665716  6.49487291  6.56479568  6.66409022  6.8144145
  7.02786133  7.30497162  7.63464542  7.99596568  8.36163484  8.70245972
  8.99214792  9.21162753  9.35218065  9.41687569  9.42006322  9.38501825
  9.34011617  9.3141708   9.33170033  9.40889904  9.55097667  9.75130243
  9.99249444 10.24927554 10.4926262  10.69455242 10.83268501 10.89395482
 10.87673992 10.79113213 10.65727894 10.50207332 10.35473529 10.24201119
 10.18377926 10.18978377 10.25803402 10.37513143 10.51847088 10.65995466
 10.77060886 10.8253437 ]
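
For in-sample values, predict(X) agrees with the fitted values stored on the results object; a quick check:

# fittedvalues holds the in-sample predictions X @ params
print(np.allclose(ypred, olsres.fittedvalues))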

Create a new sample of explanatory variables Xnew, predict, and plot.

In [6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.79202912 10.6408395  10.38898204 10.07566974  9.75252073  9.47092029
  9.26944003  9.16439414  9.14584512  9.18003685]
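
If you also need uncertainty around the predictions, get_prediction returns an object with confidence and prediction intervals; a short sketch:

# point predictions plus intervals for the mean and for new observations
pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))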

Plot comparison

In [7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform. That is, we don't want any formula expansion magic from using **2; the expression is evaluated as plain arithmetic.
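
To see what I(...) protects, one way is to build the design matrix directly with patsy; a minimal sketch:

import patsy

# the last design-matrix column, produced by I((x1-5)**2), is just the plain arithmetic result
dm = patsy.dmatrix("x1 + np.sin(x1) + I((x1-5)**2)", data)
print(np.allclose(np.asarray(dm)[:, -1], (x1 - 5)**2))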

In [9]:
res.params
Out[9]:
Intercept           4.923161
x1                  0.517801
np.sin(x1)          0.438777
I((x1 - 5) ** 2)   -0.021575
dtype: float64

Now we only have to pass the single variable x1 and we get the transformed right-hand-side variables automatically.

In [10]:
res.predict(exog=dict(x1=x1n))
Out[10]:
0    10.792029
1    10.640840
2    10.388982
3    10.075670
4     9.752521
5     9.470920
6     9.269440
7     9.164394
8     9.145845
9     9.180037
dtype: float64
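
As expected, these formula-based predictions match the array-based out-of-sample predictions computed earlier; a quick consistency check:

# formula interface and manual design matrix give the same out-of-sample predictions
print(np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred))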