Robust Linear Models

In [1]:
%matplotlib inline

from __future__ import print_function
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
/build/statsmodels-GE7Zhw/statsmodels-0.8.0/.pybuild/cpython3_3.6_statsmodels/build/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools

Estimation

Load data:

In [2]:
data = sm.datasets.stackloss.load()
data.exog = sm.add_constant(data.exog)

Huber's T norm with the (default) median absolute deviation scaling

In [3]:
huber_t = sm.RLM(data.endog, data.exog, M=sm.robust.norms.HuberT())
hub_results = huber_t.fit()
print(hub_results.params)
print(hub_results.bse)
print(hub_results.summary(yname='y',
            xname=['var_%d' % i for i in range(len(hub_results.params))]))
[-41.02649835   0.82938433   0.92606597  -0.12784672]
[9.79189854 0.11100521 0.30293016 0.12864961]
                    Robust linear Model Regression Results                    
==============================================================================
Dep. Variable:                      y   No. Observations:                   21
Model:                            RLM   Df Residuals:                       17
Method:                          IRLS   Df Model:                            3
Norm:                          HuberT                                         
Scale Est.:                       mad                                         
Cov Type:                          H1                                         
Date:                Wed, 27 Jun 2018                                         
Time:                        23:25:01                                         
No. Iterations:                    19                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
var_0        -41.0265      9.792     -4.190      0.000     -60.218     -21.835
var_1          0.8294      0.111      7.472      0.000       0.612       1.047
var_2          0.9261      0.303      3.057      0.002       0.332       1.520
var_3         -0.1278      0.129     -0.994      0.320      -0.380       0.124
==============================================================================

If the model instance has been used for another fit with different fit
parameters, then the fit options might not be the correct ones anymore .

Huber's T norm with 'H2' covariance matrix

In [4]:
hub_results2 = huber_t.fit(cov="H2")
print(hub_results2.params)
print(hub_results2.bse)
[-41.02649835   0.82938433   0.92606597  -0.12784672]
[9.08950419 0.11945975 0.32235497 0.11796313]

Andrew's Wave norm with Huber's Proposal 2 scaling and 'H3' covariance matrix

In [5]:
andrew_mod = sm.RLM(data.endog, data.exog, M=sm.robust.norms.AndrewWave())
andrew_results = andrew_mod.fit(scale_est=sm.robust.scale.HuberScale(), cov="H3")
print('Parameters: ', andrew_results.params)
Parameters:  [-40.8817957    0.79276138   1.04857556  -0.13360865]

See help(sm.RLM.fit) for more options and module sm.robust.scale for scale options

Comparing OLS and RLM

Artificial data with outliers:

In [6]:
nsample = 50
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, (x1-5)**2))
X = sm.add_constant(X)
sig = 0.3   # smaller error variance makes OLS<->RLM contrast bigger
beta = [5, 0.5, -0.0]
y_true2 = np.dot(X, beta)
y2 = y_true2 + sig*1. * np.random.normal(size=nsample)
y2[[39,41,43,45,48]] -= 5   # add some outliers (10% of nsample)

Example 1: quadratic function with linear truth

Note that the quadratic term in OLS regression will capture outlier effects.

In [7]:
res = sm.OLS(y2, X).fit()
print(res.params)
print(res.bse)
print(res.predict())
[ 5.34732195  0.48482493 -0.01051869]
[0.46294788 0.0714729  0.00632425]
[ 5.08435467  5.32342344  5.55898745  5.79104668  6.01960115  6.24465084
  6.46619576  6.68423592  6.8987713   7.10980191  7.31732776  7.52134883
  7.72186514  7.91887667  8.11238343  8.30238543  8.48888265  8.6718751
  8.85136279  9.0273457   9.19982384  9.36879722  9.53426582  9.69622966
  9.85468872 10.00964301 10.16109254 10.30903729 10.45347727 10.59441249
 10.73184293 10.8657686  10.99618951 11.12310564 11.246517   11.3664236
 11.48282542 11.59572247 11.70511475 11.81100227 11.91338501 12.01226298
 12.10763619 12.19950462 12.28786828 12.37272718 12.4540813  12.53193065
 12.60627523 12.67711505]

Estimate RLM:

In [8]:
resrlm = sm.RLM(y2, X).fit()
print(resrlm.params)
print(resrlm.bse)
[5.28274263e+00 4.67502972e-01 6.67470120e-04]
[0.12590252 0.01943765 0.00171993]

Draw a plot to compare OLS estimates to the robust estimates:

In [9]:
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111)
ax.plot(x1, y2, 'o',label="data")
ax.plot(x1, y_true2, 'b-', label="True")
prstd, iv_l, iv_u = wls_prediction_std(res)
ax.plot(x1, res.fittedvalues, 'r-', label="OLS")
ax.plot(x1, iv_u, 'r--')
ax.plot(x1, iv_l, 'r--')
ax.plot(x1, resrlm.fittedvalues, 'g.-', label="RLM")
ax.legend(loc="best")
Out[9]:
<matplotlib.legend.Legend at 0x7fa754569860>

Example 2: linear function with linear truth

Fit a new OLS model using only the linear term and the constant:

In [10]:
X2 = X[:,[0,1]] 
res2 = sm.OLS(y2, X2).fit()
print(res2.params)
print(res2.bse)
[5.77128959 0.37963802]
[0.39349633 0.03390523]

Estimate RLM:

In [11]:
resrlm2 = sm.RLM(y2, X2).fit()
print(resrlm2.params)
print(resrlm2.bse)
[5.26197684 0.47316981]
[0.10381327 0.00894497]

Draw a plot to compare OLS estimates to the robust estimates:

In [12]:
prstd, iv_l, iv_u = wls_prediction_std(res2)

fig, ax = plt.subplots(figsize=(8,6))
ax.plot(x1, y2, 'o', label="data")
ax.plot(x1, y_true2, 'b-', label="True")
ax.plot(x1, res2.fittedvalues, 'r-', label="OLS")
ax.plot(x1, iv_u, 'r--')
ax.plot(x1, iv_l, 'r--')
ax.plot(x1, resrlm2.fittedvalues, 'g.-', label="RLM")
legend = ax.legend(loc="best")