Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import statsmodels.api as sm
Artificial data¶
[3]:
nsample = 50
sig = 0.25  # noise standard deviation
x1 = np.linspace(0, 20, nsample)
# Explanatory variables: x1, sin(x1), and (x1 - 5)**2, plus a constant column
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
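In other words, the simulated response follows y = 5 + 0.5*x1 + 0.5*sin(x1) - 0.02*(x1 - 5)**2 plus Gaussian noise with standard deviation 0.25. As an illustrative sanity check, reusing x1 and y_true from the cell above:

# Illustrative check: the mean function written out term by term matches y_true
assert np.allclose(y_true, 5.0 + 0.5*x1 + 0.5*np.sin(x1) - 0.02*(x1 - 5)**2)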
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.985
Model:                            OLS   Adj. R-squared:                  0.984
Method:                 Least Squares   F-statistic:                     1022.
Date:                Sun, 09 Aug 2020   Prob (F-statistic):           4.39e-42
Time:                        21:12:11   Log-Likelihood:                 4.3139
No. Observations:                  50   AIC:                           -0.6279
Df Residuals:                      46   BIC:                             7.020
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0700      0.079     64.277      0.000       4.911       5.229
x1             0.4948      0.012     40.678      0.000       0.470       0.519
x2             0.4804      0.048     10.045      0.000       0.384       0.577
x3            -0.0201      0.001    -18.796      0.000      -0.022      -0.018
==============================================================================
Omnibus:                        2.516   Durbin-Watson:                   2.383
Prob(Omnibus):                  0.284   Jarque-Bera (JB):                2.002
Skew:                          -0.342   Prob(JB):                        0.367
Kurtosis:                       2.298   Cond. No.                         221.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
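The numbers in the summary are also available programmatically; a minimal sketch using standard attributes of the results object:

# Fitted quantities as arrays instead of formatted text
print(olsres.params)      # coefficient estimates
print(olsres.bse)         # standard errors
print(olsres.conf_int())  # 95% confidence intervals
print(olsres.rsquared)    # R-squared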
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.56811682 5.03936666 5.47260018 5.84163719 6.12974582 6.33239146
6.45798183 6.52648559 6.56615156 6.60886751 6.6849211 6.81802367
7.0214146 7.29568604 7.62868568 7.99751331 8.37228327 8.72103458
9.01498204 9.23324553 9.36628061 9.41744659 9.40245464 9.34678646
9.28150774 9.23816414 9.24359842 9.31554026 9.45969332 9.66879815
9.92382489 10.19709984 10.45685213 10.67243363 10.81935433 10.88330597
10.86251287 10.76802365 10.62189597 10.45357256 10.29504379 10.17559181
10.11697955 10.12987511 10.21209908 10.34898253 10.51577716 10.6817213
10.81509401 10.88842675]
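Because the data were simulated with a known noise level, a quick illustrative check of the in-sample fit is to compare the residual standard deviation with sig = 0.25:

# Residual scale of the fit; should be close to the simulated sig
resid = y - ypred
print(np.sqrt(olsres.mse_resid))   # df-corrected residual standard error
print(resid.std(ddof=X.shape[1]))  # same quantity computed by hand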
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[10.86998862 10.72427211 10.47011581 10.15045054 9.82178835 9.54038638
9.34847322 9.26390981 9.27581649 9.3472367 ]
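predict returns point predictions only. For uncertainty around the out-of-sample values, the results object also offers get_prediction, whose summary_frame includes confidence and prediction intervals; a short sketch:

# Out-of-sample point predictions together with confidence and prediction intervals
pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # mean, mean_ci_lower/upper, obs_ci_lower/upper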
Plot comparison¶
[7]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier.
[8]:
from statsmodels.formula.api import ols
data = {"x1" : x1, "y" : y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use I() to indicate use of the identity transform, i.e., we do not want any formula-expansion magic from using **2.
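As an illustrative check, the design matrix that the formula builds (Intercept, x1, np.sin(x1), I((x1 - 5)**2)) matches the hand-built X column for column:

# Formula-built exog equals the manually constructed design matrix
print(np.allclose(res.model.exog, X))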
[9]:
res.params
[9]:
Intercept 5.070015
x1 0.494847
np.sin(x1) 0.480378
I((x1 - 5) ** 2) -0.020076
dtype: float64
Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 10.869989
1 10.724272
2 10.470116
3 10.150451
4 9.821788
5 9.540386
6 9.348473
7 9.263910
8 9.275816
9 9.347237
dtype: float64
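As a final illustrative check, these formula-based predictions agree with ynewpred computed earlier from the hand-built Xnew:

# Formula-based and array-based out-of-sample predictions coincide
print(np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred))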