Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
plt.rc("figure", figsize=(16,8))
plt.rc("font", size=14)
Artificial data¶
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
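The noise above is drawn without a fixed seed, so the numbers in the output cells below correspond to one particular draw and will change on re-execution. If a reproducible draw is wanted, one option (shown here as a sketch; the seed value is arbitrary) is a seeded Generator:

rng = np.random.default_rng(12345)  # arbitrary seed, for reproducibility only
y = y_true + sig * rng.normal(size=nsample)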
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.985
Model:                            OLS   Adj. R-squared:                  0.984
Method:                 Least Squares   F-statistic:                     1034.
Date:                Mon, 26 Oct 2020   Prob (F-statistic):           3.38e-42
Time:                        07:28:11   Log-Likelihood:                 2.1453
No. Observations:                  50   AIC:                             3.709
Df Residuals:                      46   BIC:                             11.36
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.8261      0.082     58.588      0.000       4.660       4.992
x1             0.5217      0.013     41.064      0.000       0.496       0.547
x2             0.4193      0.050      8.397      0.000       0.319       0.520
x3            -0.0213      0.001    -19.125      0.000      -0.024      -0.019
==============================================================================
Omnibus:                        0.719   Durbin-Watson:                   2.493
Prob(Omnibus):                  0.698   Jarque-Bera (JB):                0.760
Skew:                          -0.087   Prob(JB):                        0.684
Kurtosis:                       2.422   Cond. No.                         221.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
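The quantities reported in the summary are also available as attributes of the results object, which is handy when they are needed programmatically, for example:

print(olsres.params)    # estimated coefficients (const, x1, x2, x3)
print(olsres.bse)       # standard errors of the coefficients
print(olsres.rsquared)  # R-squared of the fit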
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.2927808 4.7556813 5.18412681 5.55526333 5.85448478 6.07783268
6.23264654 6.336358 6.41362686 6.49228952 6.5987856 6.75381391
6.96893174 7.24465603 7.57037841 7.92610816 8.2857566 8.6214234
8.90798052 9.1272007 9.27075209 9.34156721 9.35336112 9.32837807
9.29373684 9.2769752 9.30152552 9.38286492 9.52597242 9.72451088
9.9618682 10.2138866 10.45283175 10.65194935 10.78986086 10.8540759
10.84304438 10.76641119 10.64343156 10.4998075 10.36346501 10.25996592
10.20830818 10.21780451 10.28655237 10.40174654 10.541783 10.67980836
10.78813184 10.84277477]
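Predicting with the original design matrix simply reproduces the fitted values, so the following sanity check (not part of the original example) should pass:

# in-sample predictions equal the fitted values stored on the results object
assert np.allclose(ypred, olsres.fittedvalues)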
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[10.81346852 10.67115085 10.43226692 10.13429316 9.82656174 9.55818234
9.36601836 9.26566152 9.24761441 9.27961584]
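predict only returns point predictions. If uncertainty around the out-of-sample predictions is also needed, get_prediction returns an object whose summary_frame contains confidence and prediction intervals; a minimal sketch:

pred = olsres.get_prediction(Xnew)
frame = pred.summary_frame(alpha=0.05)  # mean, mean_se, mean_ci_*, obs_ci_* columns
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])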
Plot comparison¶
[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

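As a possible extension (not in the original example), the 95% prediction interval from get_prediction can be shaded around the out-of-sample part of the red line:

frame = olsres.get_prediction(Xnew).summary_frame(alpha=0.05)
ax.fill_between(x1n, frame["obs_ci_lower"], frame["obs_ci_upper"],
                color="r", alpha=0.2, label="95% prediction interval")
ax.legend(loc="best")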
Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier.
[8]:
from statsmodels.formula.api import ols
data = {"x1" : x1, "y" : y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use the I() function to indicate use of the identity transform, i.e., we do not want any expansion magic from using **2. Inside a formula, - and ** are formula operators rather than arithmetic, so wrapping the expression in I() ensures that (x1-5)**2 is evaluated as a plain Python expression.
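The column names of the design matrix that patsy builds from the formula show which transforms were applied:

print(res.model.exog_names)  # ['Intercept', 'x1', 'np.sin(x1)', 'I((x1 - 5) ** 2)']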
[9]:
res.params
[9]:
Intercept 4.826102
x1 0.521686
np.sin(x1) 0.419346
I((x1 - 5) ** 2) -0.021333
dtype: float64
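Since the formula builds the same design matrix as the hand-constructed X above, the estimates agree with the array-based fit up to floating point; a quick check (not part of the original example):

assert np.allclose(res.params.values, olsres.params)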
Now we only have to pass the single variable and we get the transformed right-hand side variables automatically.
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 10.813469
1 10.671151
2 10.432267
3 10.134293
4 9.826562
5 9.558182
6 9.366018
7 9.265662
8 9.247614
9 9.279616
dtype: float64
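get_prediction works with the formula interface as well, so confidence and prediction intervals for the new points likewise only require the raw x1 values; a minimal sketch:

pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))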