Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
plt.rc("figure", figsize=(16,8))
plt.rc("font", size=14)
Artificial data¶
[3]:
nsample = 50                      # number of observations
sig = 0.25                        # noise standard deviation
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)            # add intercept column
beta = [5.0, 0.5, 0.5, -0.02]     # true coefficients
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
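Because the noise is drawn randomly, the numeric output below will differ slightly from run to run. A minimal sketch for reproducible draws (the seed value here is an arbitrary choice, not part of the original example):

np.random.seed(12345)  # any fixed seed makes the simulated y repeatable
y = y_true + sig * np.random.normal(size=nsample)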
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.981
Model:                            OLS   Adj. R-squared:                  0.980
Method:                 Least Squares   F-statistic:                     807.2
Date:                Sat, 06 Feb 2021   Prob (F-statistic):           9.07e-40
Time:                        06:11:05   Log-Likelihood:                -1.7637
No. Observations:                  50   AIC:                             11.53
Df Residuals:                      46   BIC:                             19.18
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.1284      0.089     57.576      0.000       4.949       5.308
x1             0.4914      0.014     35.774      0.000       0.464       0.519
x2             0.5294      0.054      9.804      0.000       0.421       0.638
x3            -0.0195      0.001    -16.191      0.000      -0.022      -0.017
==============================================================================
Omnibus:                        4.125   Durbin-Watson:                   1.356
Prob(Omnibus):                  0.127   Jarque-Bera (JB):                3.024
Skew:                           0.534   Prob(JB):                        0.220
Kurtosis:                       3.556   Cond. No.                         221.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
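Since the data were simulated, a quick sanity check (added here, not part of the original output) is to compare the fitted coefficients with the true beta used above:

print(olsres.params)  # estimated [const, x1, x2, x3]
print(beta)           # true values [5.0, 0.5, 0.5, -0.02]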
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.64017827 5.12736695 5.57352199 5.94978914 6.23772749 6.43233928
6.54289099 6.59139092 6.60897326 6.63078284 6.69020095 6.81336084
7.01485416 7.29533358 7.6414055 8.02783068 8.42167087 8.78770056
9.09419439 9.31813952 9.44901661 9.49052812 9.45998995 9.38548657
9.30125675 9.24206826 9.23750562 9.30710946 9.457166 9.67967432
9.95366081 10.24862513 10.52955131 10.76266077 10.92096204 10.98868541
10.96387382 10.85870441 10.69748787 10.5126742 10.33952136 10.2103024
10.14900314 10.16738136 10.26303501 10.41979688 10.61039082 10.80091314
10.95640288 11.0465858 ]
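predict returns point predictions only. If confidence or prediction intervals are also wanted, results objects provide get_prediction; a sketch of the usual pattern:

pred = olsres.get_prediction(X)
print(pred.summary_frame(alpha=0.05).head())  # mean, mean_ci_*, obs_ci_* columns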
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[11.03891005 10.89225466 10.62738244 10.29160921 9.94721922 9.65621584
9.46514092 9.39367974 9.42984138 9.5328945 ]
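Because the data-generating process is known here, the out-of-sample predictions can be checked against the true mean function at the new points (an added check, not part of the original example):

ynew_true = np.dot(Xnew, beta)   # true conditional mean at x1n
print(ynewpred - ynew_true)      # out-of-sample prediction errors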
Plot comparison¶
[7]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
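As a possible extension (a sketch, not part of the original notebook), the 95% prediction interval for the out-of-sample points can be overlaid using get_prediction, whose summary_frame exposes the obs_ci_lower and obs_ci_upper columns:

pred_new = olsres.get_prediction(Xnew).summary_frame(alpha=0.05)
ax.fill_between(x1n, pred_new["obs_ci_lower"], pred_new["obs_ci_upper"],
                color="r", alpha=0.2, label="95% prediction interval")
ax.legend(loc="best")
fig  # re-display the updated figure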

Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier.
[8]:
from statsmodels.formula.api import ols
data = {"x1" : x1, "y" : y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use I() to indicate use of the identity transform, i.e., we do not want any expansion magic from using **2.
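To see how patsy expanded this formula, the generated design-matrix column names can be inspected on the fitted model (a small check added for illustration):

print(res.model.exog_names)  # ['Intercept', 'x1', 'np.sin(x1)', 'I((x1 - 5) ** 2)']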
[9]:
res.params
[9]:
Intercept 5.128384
x1 0.491435
np.sin(x1) 0.529444
I((x1 - 5) ** 2) -0.019528
dtype: float64
Now we only have to pass the single variable and we get the transformed right-hand-side variables automatically.
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 11.038910
1 10.892255
2 10.627382
3 10.291609
4 9.947219
5 9.656216
6 9.465141
7 9.393680
8 9.429841
9 9.532895
dtype: float64
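The formula interface also works with get_prediction, so interval estimates at the new points can be obtained from the same dict of raw variables; a sketch assuming the default transform handling:

pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))  # mean and 95% confidence / prediction bounds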