Robust Linear Models¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
Estimation¶
Load data:
[3]:
data = sm.datasets.stackloss.load(as_pandas=False)
data.exog = sm.add_constant(data.exog)
Huber’s T norm with the (default) median absolute deviation scaling
[4]:
huber_t = sm.RLM(data.endog, data.exog, M=sm.robust.norms.HuberT())
hub_results = huber_t.fit()
print(hub_results.params)
print(hub_results.bse)
print(hub_results.summary(yname='y',
xname=['var_%d' % i for i in range(len(hub_results.params))]))
[-41.02649835 0.82938433 0.92606597 -0.12784672]
[9.79189854 0.11100521 0.30293016 0.12864961]
Robust linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 21
Model: RLM Df Residuals: 17
Method: IRLS Df Model: 3
Norm: HuberT
Scale Est.: mad
Cov Type: H1
Date: Sun, 17 Jan 2021
Time: 14:02:15
No. Iterations: 19
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
var_0 -41.0265 9.792 -4.190 0.000 -60.218 -21.835
var_1 0.8294 0.111 7.472 0.000 0.612 1.047
var_2 0.9261 0.303 3.057 0.002 0.332 1.520
var_3 -0.1278 0.129 -0.994 0.320 -0.380 0.124
==============================================================================
If the model instance has been used for another fit with different fit parameters, then the fit options might not be the correct ones anymore .
Huber’s T norm with ‘H2’ covariance matrix
[5]:
hub_results2 = huber_t.fit(cov="H2")
print(hub_results2.params)
print(hub_results2.bse)
[-41.02649835 0.82938433 0.92606597 -0.12784672]
[9.08950419 0.11945975 0.32235497 0.11796313]
Andrew’s Wave norm with Huber’s Proposal 2 scaling and ‘H3’ covariance matrix
[6]:
andrew_mod = sm.RLM(data.endog, data.exog, M=sm.robust.norms.AndrewWave())
andrew_results = andrew_mod.fit(scale_est=sm.robust.scale.HuberScale(), cov="H3")
print('Parameters: ', andrew_results.params)
Parameters: [-40.8817957 0.79276138 1.04857556 -0.13360865]
See help(sm.RLM.fit)
for more options and module sm.robust.scale
for scale options
Comparing OLS and RLM¶
Artificial data with outliers:
[7]:
nsample = 50
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, (x1-5)**2))
X = sm.add_constant(X)
sig = 0.3 # smaller error variance makes OLS<->RLM contrast bigger
beta = [5, 0.5, -0.0]
y_true2 = np.dot(X, beta)
y2 = y_true2 + sig*1. * np.random.normal(size=nsample)
y2[[39,41,43,45,48]] -= 5 # add some outliers (10% of nsample)
Example 1: quadratic function with linear truth¶
Note that the quadratic term in OLS regression will capture outlier effects.
[8]:
res = sm.OLS(y2, X).fit()
print(res.params)
print(res.bse)
print(res.predict())
[ 5.02206203 0.52277793 -0.01332124]
[0.4731087 0.0730416 0.00646305]
[ 4.6890311 4.95456296 5.21565626 5.47231099 5.72452717 5.97230477
6.21564382 6.4545443 6.68900622 6.91902958 7.14461437 7.3657606
7.58246827 7.79473737 8.00256792 8.20595989 8.40491331 8.59942816
8.78950445 8.97514218 9.15634134 9.33310194 9.50542398 9.67330745
9.83675236 9.99575871 10.15032649 10.30045571 10.44614637 10.58739847
10.724212 10.85658697 10.98452338 11.10802122 11.2270805 11.34170122
11.45188337 11.55762696 11.65893199 11.75579845 11.84822636 11.93621569
12.01976647 12.09887868 12.17355233 12.24378742 12.30958394 12.3709419
12.4278613 12.48034214]
Estimate RLM:
[9]:
resrlm = sm.RLM(y2, X).fit()
print(resrlm.params)
print(resrlm.bse)
[ 4.95713094e+00 5.00427653e-01 -8.65734908e-04]
[0.11168887 0.01724325 0.00152576]
Draw a plot to compare OLS estimates to the robust estimates:
[10]:
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111)
ax.plot(x1, y2, 'o',label="data")
ax.plot(x1, y_true2, 'b-', label="True")
prstd, iv_l, iv_u = wls_prediction_std(res)
ax.plot(x1, res.fittedvalues, 'r-', label="OLS")
ax.plot(x1, iv_u, 'r--')
ax.plot(x1, iv_l, 'r--')
ax.plot(x1, resrlm.fittedvalues, 'g.-', label="RLM")
ax.legend(loc="best")
[10]:
<matplotlib.legend.Legend at 0x7f9b3ef4baf0>

Example 2: linear function with linear truth¶
Fit a new OLS model using only the linear term and the constant:
[11]:
X2 = X[:,[0,1]]
res2 = sm.OLS(y2, X2).fit()
print(res2.params)
print(res2.bse)
[5.55898946 0.38956555]
[0.40807626 0.0351615 ]
Estimate RLM:
[12]:
resrlm2 = sm.RLM(y2, X2).fit()
print(resrlm2.params)
print(resrlm2.bse)
[4.98688118 0.49245455]
[0.09120802 0.00785885]
Draw a plot to compare OLS estimates to the robust estimates:
[13]:
prstd, iv_l, iv_u = wls_prediction_std(res2)
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(x1, y2, 'o', label="data")
ax.plot(x1, y_true2, 'b-', label="True")
ax.plot(x1, res2.fittedvalues, 'r-', label="OLS")
ax.plot(x1, iv_u, 'r--')
ax.plot(x1, iv_l, 'r--')
ax.plot(x1, resrlm2.fittedvalues, 'g.-', label="RLM")
legend = ax.legend(loc="best")
