Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import statsmodels.api as sm
Artificial data¶
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
# Design matrix: linear trend, sine term, and shifted quadratic, plus a constant
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
# Observed response: true mean plus Gaussian noise with standard deviation sig
y = y_true + sig * np.random.normal(size=nsample)
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.990
Model:                            OLS   Adj. R-squared:                  0.989
Method:                 Least Squares   F-statistic:                     1475.
Date:                Fri, 24 Apr 2020   Prob (F-statistic):           1.05e-45
Time:                        14:17:04   Log-Likelihood:                 12.748
No. Observations:                  50   AIC:                            -17.50
Df Residuals:                      46   BIC:                            -9.849
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9817      0.067     74.763      0.000       4.848       5.116
x1             0.5085      0.010     49.481      0.000       0.488       0.529
x2             0.4889      0.040     12.103      0.000       0.408       0.570
x3            -0.0211      0.001    -23.426      0.000      -0.023      -0.019
==============================================================================
Omnibus:                        0.644   Durbin-Watson:                   1.979
Prob(Omnibus):                  0.725   Jarque-Bera (JB):                0.699
Skew:                          -0.009   Prob(JB):                        0.705
Kurtosis:                       2.421   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.45329077 4.93765885 5.38309946 5.76296633 6.0602297 6.27027423
6.40165732 6.47470315 6.51816345 6.56449368 6.64452072 6.78237806
6.99154075 7.27261152 7.61322168 7.99006334 8.37271878 8.72865802
9.02858338 9.25124303 9.3869226 9.4390412 9.42358971 9.3665035
9.29940131 9.2543904 9.25879143 9.33064972 9.47577047 9.686765
9.94426465 10.22010303 10.48194367 10.6985928 10.84512441 10.90697553
10.88233884 10.78245966 10.62978837 10.45429185 10.28853001 10.1623063
10.0977711 10.10578245 10.18412212 10.31786004 10.4818073 10.64465459
10.77411625 10.84223468]
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[10.81516213 10.65819358 10.39050303 10.05578558 9.71155937 9.41508312
9.20933719 9.1125005 9.11349971 9.17472033]
Plot comparison¶
[7]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier.
[8]:
from statsmodels.formula.api import ols
data = {"x1" : x1, "y" : y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use I() to indicate use of the identity transform, i.e., we do not want any expansion magic from using **2.
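To see why the identity wrapper matters, here is a minimal sketch (assuming patsy, the formula engine used by statsmodels formulas, is importable; the variable name demo_data is just for illustration) comparing the design-matrix columns built with and without I():

import numpy as np
import patsy

demo_data = {"x1": np.linspace(0, 20, 5)}

# Without I(), ** is patsy's interaction operator: x1**2 expands to
# x1 + x1:x1, which collapses back to a single x1 column.
print(patsy.dmatrix("x1**2", demo_data).design_info.column_names)

# Inside I(), the expression is evaluated as ordinary numpy code, so the
# squared values enter the design matrix as their own column.
print(patsy.dmatrix("I(x1**2)", demo_data).design_info.column_names)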
[9]:
res.params
[9]:
Intercept 4.981707
x1 0.508495
np.sin(x1) 0.488930
I((x1 - 5) ** 2) -0.021137
dtype: float64
Now we only have to pass the single variable x1 and we get the transformed right-hand side variables automatically.
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 10.815162
1 10.658194
2 10.390503
3 10.055786
4 9.711559
5 9.415083
6 9.209337
7 9.112501
8 9.113500
9 9.174720
dtype: float64
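As a quick consistency check (a sketch reusing res, x1n and ynewpred from the cells above), the formula-based predictions should agree with the ones computed earlier from the hand-built Xnew:

import numpy as np

# The formula interface rebuilds np.sin(x1) and I((x1-5)**2) from x1n itself,
# so its out-of-sample predictions should match the array-based ones.
print(np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred))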