Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16,8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
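
The noise draw above is not seeded, so the exact numbers shown below will differ from run to run. A minimal sketch if you want reproducible output (the seed value here is arbitrary):

# Call once, before the noise draw in the cell above, to make y and
# everything downstream reproducible (the seed value is arbitrary)
np.random.seed(12345)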

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.985
Model:                            OLS   Adj. R-squared:                  0.984
Method:                 Least Squares   F-statistic:                     1034.
Date:                Mon, 26 Oct 2020   Prob (F-statistic):           3.38e-42
Time:                        07:28:11   Log-Likelihood:                 2.1453
No. Observations:                  50   AIC:                             3.709
Df Residuals:                      46   BIC:                             11.36
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.8261      0.082     58.588      0.000       4.660       4.992
x1             0.5217      0.013     41.064      0.000       0.496       0.547
x2             0.4193      0.050      8.397      0.000       0.319       0.520
x3            -0.0213      0.001    -19.125      0.000      -0.024      -0.019
==============================================================================
Omnibus:                        0.719   Durbin-Watson:                   2.493
Prob(Omnibus):                  0.698   Jarque-Bera (JB):                0.760
Skew:                          -0.087   Prob(JB):                        0.684
Kurtosis:                       2.422   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
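
The estimated coefficients are also available directly as an array; comparing them against the beta used to simulate the data is a quick sanity check:

# Estimated coefficients (same order as the columns of X) vs. the true values
print(olsres.params)
print(beta)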

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.2927808   4.7556813   5.18412681  5.55526333  5.85448478  6.07783268
  6.23264654  6.336358    6.41362686  6.49228952  6.5987856   6.75381391
  6.96893174  7.24465603  7.57037841  7.92610816  8.2857566   8.6214234
  8.90798052  9.1272007   9.27075209  9.34156721  9.35336112  9.32837807
  9.29373684  9.2769752   9.30152552  9.38286492  9.52597242  9.72451088
  9.9618682  10.2138866  10.45283175 10.65194935 10.78986086 10.8540759
 10.84304438 10.76641119 10.64343156 10.4998075  10.36346501 10.25996592
 10.20830818 10.21780451 10.28655237 10.40174654 10.541783   10.67980836
 10.78813184 10.84277477]
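
As a check, in-sample predictions are identical to the fitted values stored on the results object, and predict() with no argument defaults to the estimation sample:

# Both comparisons print True
print(np.allclose(ypred, olsres.fittedvalues))
print(np.allclose(ypred, olsres.predict()))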

Create a new sample of explanatory variables Xnew, predict, and plot.

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.81346852 10.67115085 10.43226692 10.13429316  9.82656174  9.55818234
  9.36601836  9.26566152  9.24761441  9.27961584]
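
Point predictions carry no measure of uncertainty, which matters most when extrapolating beyond the estimation sample. The results object's get_prediction method also returns standard errors and intervals:

# summary_frame contains the point prediction, its standard error,
# 95% confidence intervals for the mean, and prediction intervals
# for new observations
pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))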

Plot comparison

[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
[Figure: scatter of the data with the true curve and the OLS in-sample fit plus out-of-sample prediction]

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform, i.e., we do not want any expansion magic from using **2.
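
Inside a formula, ** denotes interaction expansion, not exponentiation; I() switches back to ordinary Python arithmetic. A small illustration using patsy directly, with hypothetical variables a and b:

from patsy import dmatrix
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# (a + b)**2 expands to a + b + a:b -- no squaring happens
print(dmatrix("(a + b)**2", df).design_info.column_names)

# Wrapped in I(), the expression is evaluated as plain arithmetic
print(dmatrix("I((a - 2)**2)", df).design_info.column_names)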

[9]:
res.params
[9]:
Intercept           4.826102
x1                  0.521686
np.sin(x1)          0.419346
I((x1 - 5) ** 2)   -0.021333
dtype: float64

Now we only have to pass the single variable and we get the transformed right-hand side variables automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    10.813469
1    10.671151
2    10.432267
3    10.134293
4     9.826562
5     9.558182
6     9.366018
7     9.265662
8     9.247614
9     9.279616
dtype: float64
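
The same automatic transformation applies with get_prediction, so out-of-sample intervals likewise only need the raw x1 values:

# Point predictions plus confidence and prediction intervals;
# the formula's transforms are applied to x1n automatically
pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))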