Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import statsmodels.api as sm

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
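The noise draw above is not seeded, so re-running the notebook yields slightly different numbers than the output shown below. To make a run reproducible, one can fix NumPy's seed before drawing (the seed value here is arbitrary):

np.random.seed(12345)  # arbitrary seed; makes y and all downstream results repeatable
y = y_true + sig * np.random.normal(size=nsample)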

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.985
Model:                            OLS   Adj. R-squared:                  0.984
Method:                 Least Squares   F-statistic:                     1022.
Date:                Sun, 09 Aug 2020   Prob (F-statistic):           4.39e-42
Time:                        21:12:11   Log-Likelihood:                 4.3139
No. Observations:                  50   AIC:                           -0.6279
Df Residuals:                      46   BIC:                             7.020
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0700      0.079     64.277      0.000       4.911       5.229
x1             0.4948      0.012     40.678      0.000       0.470       0.519
x2             0.4804      0.048     10.045      0.000       0.384       0.577
x3            -0.0201      0.001    -18.796      0.000      -0.022      -0.018
==============================================================================
Omnibus:                        2.516   Durbin-Watson:                   2.383
Prob(Omnibus):                  0.284   Jarque-Bera (JB):                2.002
Skew:                          -0.342   Prob(JB):                        0.367
Kurtosis:                       2.298   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.56811682  5.03936666  5.47260018  5.84163719  6.12974582  6.33239146
  6.45798183  6.52648559  6.56615156  6.60886751  6.6849211   6.81802367
  7.0214146   7.29568604  7.62868568  7.99751331  8.37228327  8.72103458
  9.01498204  9.23324553  9.36628061  9.41744659  9.40245464  9.34678646
  9.28150774  9.23816414  9.24359842  9.31554026  9.45969332  9.66879815
  9.92382489 10.19709984 10.45685213 10.67243363 10.81935433 10.88330597
 10.86251287 10.76802365 10.62189597 10.45357256 10.29504379 10.17559181
 10.11697955 10.12987511 10.21209908 10.34898253 10.51577716 10.6817213
 10.81509401 10.88842675]
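predict returns point predictions only. When standard errors or interval estimates are needed as well, get_prediction returns a results object whose summary_frame method collects them in one table (a short sketch; alpha sets the coverage level):

pred = olsres.get_prediction(X)
frame = pred.summary_frame(alpha=0.05)  # mean, mean_se, confidence and prediction bounds
print(frame.head())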

Create a new sample of explanatory variables Xnew, predict, and plot.

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.86998862 10.72427211 10.47011581 10.15045054  9.82178835  9.54038638
  9.34847322  9.26390981  9.27581649  9.3472367 ]

Plot comparison

[7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
[Figure: data points, true curve, and OLS prediction over the sample and the new x1 values]
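The same machinery works out of sample. A sketch that shades the 95% prediction interval around the new points, reusing the axes from the cell above (the obs_ci_* columns come from summary_frame):

framen = olsres.get_prediction(Xnew).summary_frame(alpha=0.05)
ax.fill_between(x1n, framen["obs_ci_lower"], framen["obs_ci_upper"],
                color="r", alpha=0.2, label="95% prediction interval")
ax.legend(loc="best")
fig  # re-display the updated figure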

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we do not want any expansion magic from using **2 inside the formula.

[9]:
res.params
[9]:
Intercept           5.070015
x1                  0.494847
np.sin(x1)          0.480378
I((x1 - 5) ** 2)   -0.020076
dtype: float64
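These estimates match the array-based fit above up to floating-point noise, which is easy to verify since both fits used the same y:

np.allclose(res.params, olsres.params)  # same model, two interfaces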

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    10.869989
1    10.724272
2    10.470116
3    10.150451
4     9.821788
5     9.540386
6     9.348473
7     9.263910
8     9.275816
9     9.347237
dtype: float64
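As a quick sanity check, these formula-based predictions agree with ynewpred from the array interface:

np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred)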