Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16,8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
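
The noise is drawn at random, so the numbers below will change from run to run. To reproduce them exactly you could draw from a seeded generator instead; a minimal sketch, where the seed value is arbitrary:

rng = np.random.default_rng(12345)  # arbitrary seed; any fixed value makes y reproducible
y = y_true + sig * rng.standard_normal(nsample)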

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.981
Model:                            OLS   Adj. R-squared:                  0.980
Method:                 Least Squares   F-statistic:                     807.2
Date:                Sat, 06 Feb 2021   Prob (F-statistic):           9.07e-40
Time:                        06:11:05   Log-Likelihood:                -1.7637
No. Observations:                  50   AIC:                             11.53
Df Residuals:                      46   BIC:                             19.18
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.1284      0.089     57.576      0.000       4.949       5.308
x1             0.4914      0.014     35.774      0.000       0.464       0.519
x2             0.5294      0.054      9.804      0.000       0.421       0.638
x3            -0.0195      0.001    -16.191      0.000      -0.022      -0.017
==============================================================================
Omnibus:                        4.125   Durbin-Watson:                   1.356
Prob(Omnibus):                  0.127   Jarque-Bera (JB):                3.024
Skew:                           0.534   Prob(JB):                        0.220
Kurtosis:                       3.556   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
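
The summary table is meant for reading; the same quantities are also available programmatically on the results object, for example:

print(olsres.params)      # estimated coefficients
print(olsres.bse)         # standard errors
print(olsres.conf_int())  # 95% confidence intervals
print(olsres.rsquared)    # R-squared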

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.64017827  5.12736695  5.57352199  5.94978914  6.23772749  6.43233928
  6.54289099  6.59139092  6.60897326  6.63078284  6.69020095  6.81336084
  7.01485416  7.29533358  7.6414055   8.02783068  8.42167087  8.78770056
  9.09419439  9.31813952  9.44901661  9.49052812  9.45998995  9.38548657
  9.30125675  9.24206826  9.23750562  9.30710946  9.457166    9.67967432
  9.95366081 10.24862513 10.52955131 10.76266077 10.92096204 10.98868541
 10.96387382 10.85870441 10.69748787 10.5126742  10.33952136 10.2103024
 10.14900314 10.16738136 10.26303501 10.41979688 10.61039082 10.80091314
 10.95640288 11.0465858 ]
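
Calling predict with the estimation sample simply reproduces the fitted values, which the results object also stores directly; a sketch of the equivalent forms:

# All three return the same in-sample predictions
assert np.allclose(olsres.predict(X), olsres.predict())  # exog defaults to the estimation sample
assert np.allclose(olsres.predict(X), olsres.fittedvalues)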

Create a new sample of explanatory variables Xnew, predict out of sample, and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[11.03891005 10.89225466 10.62738244 10.29160921  9.94721922  9.65621584
  9.46514092  9.39367974  9.42984138  9.5328945 ]
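
predict returns point predictions only. If you also need the uncertainty around them, get_prediction returns an object whose summary_frame contains standard errors plus confidence and prediction intervals; a sketch:

predn = olsres.get_prediction(Xnew)
frame = predn.summary_frame(alpha=0.05)
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])  # point prediction and 95% prediction interval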

Plot comparison

[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
[Figure: data points, the true curve, and the OLS in- and out-of-sample prediction]
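
The out-of-sample part of the red line is an extrapolation, so it is natural to also show its uncertainty. Building on the get_prediction sketch above, the 95% prediction interval can be shaded with fill_between:

frame = olsres.get_prediction(Xnew).summary_frame(alpha=0.05)
fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1n, ynewpred, 'r', label="OLS prediction")
ax.fill_between(x1n, frame["obs_ci_lower"], frame["obs_ci_upper"],
                color='r', alpha=0.2, label="95% prediction interval")
ax.legend(loc="best");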

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I() function to indicate the identity transform; i.e., we do not want any expansion magic from using **2.
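
To see exactly which columns the formula builds, you can construct the design matrix directly with patsy, which statsmodels' formula interface uses under the hood; a sketch:

from patsy import dmatrix

dm = dmatrix("x1 + np.sin(x1) + I((x1-5)**2)", data=data)
print(dm.design_info.column_names)
# ['Intercept', 'x1', 'np.sin(x1)', 'I((x1 - 5) ** 2)']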

[9]:
res.params
[9]:
Intercept           5.128384
x1                  0.491435
np.sin(x1)          0.529444
I((x1 - 5) ** 2)   -0.019528
dtype: float64

Now we only have to pass the single variable x1, and we get the transformed right-hand-side variables automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    11.038910
1    10.892255
2    10.627382
3    10.291609
4     9.947219
5     9.656216
6     9.465141
7     9.393680
8     9.429841
9     9.532895
dtype: float64
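
A pandas DataFrame with an x1 column works just as well as the dict, which is convenient when the new data already live in a frame; a sketch:

import pandas as pd

res.predict(pd.DataFrame({"x1": x1n}))  # same result as the dict above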