Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)
Artificial data¶
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.982
Model: OLS Adj. R-squared: 0.981
Method: Least Squares F-statistic: 831.6
Date: Thu, 03 Nov 2022 Prob (F-statistic): 4.63e-40
Time: 06:34:53 Log-Likelihood: 0.30166
No. Observations: 50 AIC: 7.397
Df Residuals: 46 BIC: 15.04
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.2132 0.085 60.996 0.000 5.041 5.385
x1 0.4696 0.013 35.623 0.000 0.443 0.496
x2 0.5091 0.052 9.826 0.000 0.405 0.613
x3 -0.0179 0.001 -15.493 0.000 -0.020 -0.016
==============================================================================
Omnibus: 0.916 Durbin-Watson: 1.512
Prob(Omnibus): 0.633 Jarque-Bera (JB): 0.897
Skew: 0.136 Prob(JB): 0.639
Kurtosis: 2.403 Cond. No. 221.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
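The quantities shown in the summary are also available programmatically on the fitted results object. A minimal sketch (attribute names as provided by statsmodels):
# Pull a few fitted quantities directly from the results object.
print(olsres.params)     # estimated coefficients (const, x1, x2, x3)
print(olsres.bse)        # standard errors of the coefficients
print(olsres.rsquared)   # R-squared of the fit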
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.76491656 5.2288618 5.65362959 6.01147205 6.28465536 6.4683734
6.57153733 6.61531158 6.62963654 6.64830945 6.70343163 6.82013432
7.01244964 7.28100524 7.6129211 7.98392572 8.36234383 8.71430061
9.00928715 9.22517301 9.35184221 9.39285523 9.36486403 9.2948762
9.21581769 9.16112313 9.15924267 9.22896767 9.37634333 9.5936755
9.86079478 10.14837033 10.42272881 10.65138684 10.8083879 10.87856696
10.86004199 10.76452323 10.61538917 10.4438456 10.28379854 10.16628357
10.11436682 10.13935522 10.23893883 10.39757035 10.5890195 10.78068257
10.93893914 11.0346758 ]
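For the estimation sample these predictions are simply the fitted values, so they should agree with olsres.fittedvalues. A quick sanity check, assuming the cells above were run as shown:
# In-sample predictions coincide with the stored fitted values.
print(np.allclose(ypred, olsres.fittedvalues))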
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[11.03871909 10.90953611 10.66709354 10.35689295 10.03883038 9.77253176
9.60275436 9.54842865 9.59802348 9.71236931]
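Point forecasts alone do not convey uncertainty. get_prediction returns standard errors along with confidence and prediction intervals for the new observations; a minimal sketch (the alpha level here is chosen only for illustration):
# Out-of-sample prediction with 95% confidence and prediction intervals.
pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))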
Plot comparison¶
[7]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7fbc7091c4f0>

Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier.
[8]:
from statsmodels.formula.api import ols
data = {"x1": x1, "y": y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use the I() function to indicate use of the identity transform, i.e., we do not want any expansion magic from using **2 inside the formula.
[9]:
res.params
[9]:
Intercept 5.213184
x1 0.469554
np.sin(x1) 0.509144
I((x1 - 5) ** 2) -0.017931
dtype: float64
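Up to the names of the regressors, these estimates should match the array-based fit from above; a quick check, assuming both models were fit on the same y:
# The formula fit and the array-based fit use the same design matrix,
# so the coefficient vectors should agree up to floating-point noise.
print(np.allclose(res.params.values, olsres.params))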
Now we only have to pass the single variable x1, and the transformed right-hand side variables are constructed automatically.
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 11.038719
1 10.909536
2 10.667094
3 10.356893
4 10.038830
5 9.772532
6 9.602754
7 9.548429
8 9.598023
9 9.712369
dtype: float64
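These formula-based forecasts should match ynewpred from the array interface, and get_prediction also works through the formula interface if interval estimates are needed. A minimal sketch:
# The formula interface rebuilds the design matrix from x1n internally,
# so its forecasts should match the array-based ones computed earlier.
print(np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred))
# Confidence/prediction intervals via the formula interface.
print(res.get_prediction(exog=dict(x1=x1n)).summary_frame(alpha=0.05))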