Prediction (out of sample)

In [1]:
%matplotlib inline
In [2]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [3]:
nsample = 50
sig = 0.25  # standard deviation of the noise
x1 = np.linspace(0, 20, nsample)
# regressors: linear, sine, and quadratic-in-x1 terms, plus a constant
X = np.column_stack((x1, np.sin(x1), (x1 - 5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]  # true coefficients
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.985
Model:                            OLS   Adj. R-squared:                  0.984
Method:                 Least Squares   F-statistic:                     988.4
Date:                Sun, 10 Nov 2019   Prob (F-statistic):           9.32e-42
Time:                        16:32:41   Log-Likelihood:                 1.9555
No. Observations:                  50   AIC:                             4.089
Df Residuals:                      46   BIC:                             11.74
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9957      0.083     60.417      0.000       4.829       5.162
x1             0.4993      0.013     39.155      0.000       0.474       0.525
x2             0.4471      0.050      8.919      0.000       0.346       0.548
x3            -0.0194      0.001    -17.314      0.000      -0.022      -0.017
==============================================================================
Omnibus:                        1.153   Durbin-Watson:                   2.183
Prob(Omnibus):                  0.562   Jarque-Bera (JB):                0.440
Skew:                          -0.018   Prob(JB):                        0.802
Kurtosis:                       3.458   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
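
The numbers in the coefficient table are also available programmatically on the results object; a minimal sketch using standard statsmodels results attributes:

print(olsres.params)  # estimated coefficients (const, x1, x2, x3)
print(olsres.bse)     # their standard errors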

In-sample prediction

In [5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.51102379  4.96820507  5.38976772  5.75134322  6.03735756  6.24358992
  6.37786621  6.45877325  6.51260501  6.56904262  6.6562778   6.79638108
  7.00167563  7.27271264  7.59818076  7.95676447  8.32064622  8.66007675
  8.94826281  9.16576922  9.30371199  9.36521798  9.36491113  9.3265098
  9.27892992  9.25153415  9.26930772  9.34875342  9.4951802   9.70183098
  9.95099277 10.2169069  10.47000107 10.68174803 10.82935248 10.89949631
 10.8905267  10.81272772 10.68663063 10.53964067 10.40153449 10.2995679
 10.25399779 10.27475405 10.35980807 10.49550604 10.65881222 10.8210937
 10.95282465 11.02843734]
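
olsres.predict returns point predictions only. If interval estimates are needed as well, the results object exposes get_prediction; a minimal sketch, assuming a statsmodels version that provides this method (0.8 or later):

pred = olsres.get_prediction(X)  # in-sample prediction results
# summary_frame columns: mean, mean_se, mean_ci_lower/upper, obs_ci_lower/upper
print(pred.summary_frame(alpha=0.05).head())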

Create a new sample of explanatory variables, Xnew, then predict and plot

In [6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[11.01996692 10.89267611 10.66409985 10.37419812 10.0755723   9.82058652
  9.6485471   9.57607901  9.59305545  9.6650773 ]
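
For a linear model, out-of-sample prediction is simply the matrix product of the new design matrix with the fitted coefficients; a quick check by hand:

manual = Xnew.dot(olsres.params)      # Xnew @ beta_hat
print(np.allclose(manual, ynewpred))  # expected: True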

Plot comparison

In [7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform; that is, we don't want any formula-expansion magic from using **2.
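
As a quick check of what I() does, the transformed column can be built directly with patsy (the library that handles statsmodels formulas) and compared with the plain NumPy computation; a minimal sketch:

from patsy import dmatrix

dm = dmatrix("I((x1 - 5)**2)", {"x1": x1})  # intercept + identity term
print(np.allclose(np.asarray(dm)[:, 1], (x1 - 5)**2))  # expected: True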

In [9]:
res.params
Out[9]:
Intercept           4.995678
x1                  0.499322
np.sin(x1)          0.447136
I((x1 - 5) ** 2)   -0.019386
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

In [10]:
res.predict(exog=dict(x1=x1n))
Out[10]:
0    11.019967
1    10.892676
2    10.664100
3    10.374198
4    10.075572
5     9.820587
6     9.648547
7     9.576079
8     9.593055
9     9.665077
dtype: float64
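
The formula interface also works with get_prediction, so interval estimates for the new x1 values come with the same automatic transformation of the right-hand side; a minimal sketch, assuming the get_prediction API shown earlier:

pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))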