M-Estimators for Robust Linear ModelingΒΆ

Link to Notebook GitHub

In [1]:
from __future__ import print_function
from statsmodels.compat import lmap
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

import statsmodels.api as sm
  • An M-estimator minimizes the function

$$Q(e_i, \rho) = \sum_i~\rho \left (\frac{e_i}{s}\right )$$

where $\rho$ is a symmetric function of the residuals

  • The effect of $\rho$ is to reduce the influence of outliers
  • $s$ is an estimate of scale.
  • The robust estimates $\hat{\beta}$ are computed by the iteratively re-weighted least squares algorithm
  • We have several choices available for the weighting functions to be used
In [2]:
norms = sm.robust.norms
In [3]:
def plot_weights(support, weights_func, xlabels, xticks):
    fig = plt.figure(figsize=(12,8))
    ax = fig.add_subplot(111)
    ax.plot(support, weights_func(support))
    ax.set_xticks(xticks)
    ax.set_xticklabels(xlabels, fontsize=16)
    ax.set_ylim(-.1, 1.1)
    return ax

Andrew's Wave

In [4]:
help(norms.AndrewWave.weights)
Help on method weights in module statsmodels.robust.norms:

weights(self, z) unbound statsmodels.robust.norms.AndrewWave method
    Andrew's wave weighting function for the IRLS algorithm

    The psi function scaled by z

    Parameters
    ----------
    z : array-like
        1d array

    Returns
    -------
    weights : array
        weights(z) = sin(z/a)/(z/a)     for \|z\| <= a*pi

        weights(z) = 0                  for \|z\| > a*pi


In [5]:
a = 1.339
support = np.linspace(-np.pi*a, np.pi*a, 100)
andrew = norms.AndrewWave(a=a)
plot_weights(support, andrew.weights, ['$-\pi*a$', '0', '$\pi*a$'], [-np.pi*a, 0, np.pi*a]);

Hampel's 17A

In [6]:
help(norms.Hampel.weights)
Help on method weights in module statsmodels.robust.norms:

weights(self, z) unbound statsmodels.robust.norms.Hampel method
    Hampel weighting function for the IRLS algorithm

    The psi function scaled by z

    Parameters
    ----------
    z : array-like
        1d array

    Returns
    -------
    weights : array
        weights(z) = 1                            for \|z\| <= a

        weights(z) = a/\|z\|                        for a < \|z\| <= b

        weights(z) = a*(c - \|z\|)/(\|z\|*(c-b))      for b < \|z\| <= c

        weights(z) = 0                            for \|z\| > c


In [7]:
c = 8
support = np.linspace(-3*c, 3*c, 1000)
hampel = norms.Hampel(a=2., b=4., c=c)
plot_weights(support, hampel.weights, ['3*c', '0', '3*c'], [-3*c, 0, 3*c]);

Huber's t

In [8]:
help(norms.HuberT.weights)
Help on method weights in module statsmodels.robust.norms:

weights(self, z) unbound statsmodels.robust.norms.HuberT method
    Huber's t weighting function for the IRLS algorithm

    The psi function scaled by z

    Parameters
    ----------
    z : array-like
        1d array

    Returns
    -------
    weights : array
        weights(z) = 1          for \|z\| <= t

        weights(z) = t/\|z\|      for \|z\| > t


In [9]:
t = 1.345
support = np.linspace(-3*t, 3*t, 1000)
huber = norms.HuberT(t=t)
plot_weights(support, huber.weights, ['-3*t', '0', '3*t'], [-3*t, 0, 3*t]);

Least Squares

In [10]:
help(norms.LeastSquares.weights)
Help on method weights in module statsmodels.robust.norms:

weights(self, z) unbound statsmodels.robust.norms.LeastSquares method
    The least squares estimator weighting function for the IRLS algorithm.

    The psi function scaled by the input z

    Parameters
    ----------
    z : array-like
        1d array

    Returns
    -------
    weights : array
        weights(z) = np.ones(z.shape)


In [11]:
support = np.linspace(-3, 3, 1000)
lst_sq = norms.LeastSquares()
plot_weights(support, lst_sq.weights, ['-3', '0', '3'], [-3, 0, 3]);

Ramsay's Ea

In [12]:
help(norms.RamsayE.weights)
Help on method weights in module statsmodels.robust.norms:

weights(self, z) unbound statsmodels.robust.norms.RamsayE method
    Ramsay's Ea weighting function for the IRLS algorithm

    The psi function scaled by z

    Parameters
    ----------
    z : array-like
        1d array

    Returns
    -------
    weights : array
        weights(z) = exp(-a*\|z\|)


In [13]:
a = .3
support = np.linspace(-3*a, 3*a, 1000)
ramsay = norms.RamsayE(a=a)
plot_weights(support, ramsay.weights, ['-3*a', '0', '3*a'], [-3*a, 0, 3*a]);

Trimmed Mean

In [14]:
help(norms.TrimmedMean.weights)
Help on method weights in module statsmodels.robust.norms:

weights(self, z) unbound statsmodels.robust.norms.TrimmedMean method
    Least trimmed mean weighting function for the IRLS algorithm

    The psi function scaled by z

    Parameters
    ----------
    z : array-like
        1d array

    Returns
    -------
    weights : array
        weights(z) = 1             for \|z\| <= c

        weights(z) = 0             for \|z\| > c


In [15]:
c = 2
support = np.linspace(-3*c, 3*c, 1000)
trimmed = norms.TrimmedMean(c=c)
plot_weights(support, trimmed.weights, ['-3*c', '0', '3*c'], [-3*c, 0, 3*c]);

Tukey's Biweight

In [16]:
help(norms.TukeyBiweight.weights)
Help on method weights in module statsmodels.robust.norms:

weights(self, z) unbound statsmodels.robust.norms.TukeyBiweight method
    Tukey's biweight weighting function for the IRLS algorithm

    The psi function scaled by z

    Parameters
    ----------
    z : array-like
        1d array

    Returns
    -------
    weights : array
        psi(z) = (1 - (z/c)**2)**2          for \|z\| <= R

        psi(z) = 0                          for \|z\| > R


In [17]:
c = 4.685
support = np.linspace(-3*c, 3*c, 1000)
tukey = norms.TukeyBiweight(c=c)
plot_weights(support, tukey.weights, ['-3*c', '0', '3*c'], [-3*c, 0, 3*c]);

Scale Estimators

  • Robust estimates of the location
In [18]:
x = np.array([1, 2, 3, 4, 500])
  • The mean is not a robust estimator of location
In [19]:
x.mean()
Out[19]:
102.0
  • The median, on the other hand, is a robust estimator with a breakdown point of 50%
In [20]:
np.median(x)
Out[20]:
3.0
  • Analagously for the scale
  • The standard deviation is not robust
In [21]:
x.std()
Out[21]:
199.00251254695254

Median Absolute Deviation

$$ median_i |X_i - median_j(X_j)|) $$

Standardized Median Absolute Deviation is a consistent estimator for $\hat{\sigma}$

$$\hat{\sigma}=K \cdot MAD$$

where $K$ depends on the distribution. For the normal distribution for example,

$$K = \Phi^{-1}(.75)$$

In [22]:
stats.norm.ppf(.75)
Out[22]:
0.67448975019608171
In [23]:
print(x)
[  1   2   3   4 500]

In [24]:
sm.robust.scale.stand_mad(x)
/build/statsmodels-coKHG6/statsmodels-0.6.1/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/robust/scale.py:49: FutureWarning: stand_mad is deprecated and will be removed in 0.7.0. Use mad instead.
  "instead.", FutureWarning)

Out[24]:
1.482602218505602
In [25]:
np.array([1,2,3,4,5.]).std()
Out[25]:
1.4142135623730951
  • The default for Robust Linear Models is MAD
  • another popular choice is Huber's proposal 2
In [26]:
np.random.seed(12345)
fat_tails = stats.t(6).rvs(40)
In [27]:
kde = sm.nonparametric.KDE(fat_tails)
kde.fit()
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111)
ax.plot(kde.support, kde.density);
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-123-605f8dff1422> in <module>()
----> 1 kde = sm.nonparametric.KDE(fat_tails)
      2 kde.fit()
      3 fig = plt.figure(figsize=(12,8))
      4 ax = fig.add_subplot(111)
      5 ax.plot(kde.support, kde.density);

AttributeError: 'module' object has no attribute 'KDE'
In [28]:
print(fat_tails.mean(), fat_tails.std())
0.0688231044811 1.34716332297

In [29]:
print(stats.norm.fit(fat_tails))
(0.068823104481087499, 1.3471633229698652)

In [30]:
print(stats.t.fit(fat_tails, f0=6))
(6, 0.039009187170278181, 1.0564230978488927)

In [31]:
huber = sm.robust.scale.Huber()
loc, scale = huber(fat_tails)
print(loc, scale)
0.0404898433327 1.15571400476

In [32]:
sm.robust.stand_mad(fat_tails)
Out[32]:
1.1153350011654151
In [33]:
sm.robust.stand_mad(fat_tails, c=stats.t(6).ppf(.75))
Out[33]:
1.0483916565928972
In [34]:
sm.robust.scale.mad(fat_tails)
Out[34]:
1.1153350011654151

Duncan's Occupational Prestige data - M-estimation for outliers

In [35]:
from statsmodels.graphics.api import abline_plot
from statsmodels.formula.api import ols, rlm
In [36]:
prestige = sm.datasets.get_rdataset("Duncan", "car", cache=True).data
---------------------------------------------------------------------------
URLError                                  Traceback (most recent call last)
<ipython-input-132-d85a43c9ce16> in <module>()
----> 1 prestige = sm.datasets.get_rdataset("Duncan", "car", cache=True).data

/build/statsmodels-coKHG6/statsmodels-0.6.1/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/datasets/utils.pyc in get_rdataset(dataname, package, cache)
    284                      "master/doc/"+package+"/rst/")
    285     cache = _get_cache(cache)
--> 286     data, from_cache = _get_data(data_base_url, dataname, cache)
    287     data = read_csv(data, index_col=0)
    288     data = _maybe_reset_index(data)

/build/statsmodels-coKHG6/statsmodels-0.6.1/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/datasets/utils.pyc in _get_data(base_url, dataname, cache, extension)
    215     url = base_url + (dataname + ".%s") % extension
    216     try:
--> 217         data, from_cache = _urlopen_cached(url, cache)
    218     except HTTPError as err:
    219         if '404' in str(err):

/build/statsmodels-coKHG6/statsmodels-0.6.1/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/datasets/utils.pyc in _urlopen_cached(url, cache)
    206     # not using the cache or didn't find it in cache
    207     if not from_cache:
--> 208         data = urlopen(url).read()
    209         if cache is not None:  # then put it in the cache
    210             _cache_it(data, cache_path)

/usr/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    152     else:
    153         opener = _opener
--> 154     return opener.open(url, data, timeout)
    155 
    156 def install_opener(opener):

/usr/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
    427             req = meth(req)
    428 
--> 429         response = self._open(req, data)
    430 
    431         # post-process response

/usr/lib/python2.7/urllib2.pyc in _open(self, req, data)
    445         protocol = req.get_type()
    446         result = self._call_chain(self.handle_open, protocol, protocol +
--> 447                                   '_open', req)
    448         if result:
    449             return result

/usr/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
    405             func = getattr(handler, meth_name)
    406 
--> 407             result = func(*args)
    408             if result is not None:
    409                 return result

/usr/lib/python2.7/urllib2.pyc in https_open(self, req)
   1239         def https_open(self, req):
   1240             return self.do_open(httplib.HTTPSConnection, req,
-> 1241                 context=self._context)
   1242 
   1243         https_request = AbstractHTTPHandler.do_request_

/usr/lib/python2.7/urllib2.pyc in do_open(self, http_class, req, **http_conn_args)
   1196         except socket.error, err: # XXX what error?
   1197             h.close()
-> 1198             raise URLError(err)
   1199         else:
   1200             try:

URLError: <urlopen error [Errno -2] Name or service not known>
In [37]:
print(prestige.head(10))
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-133-d3ced65c396c> in <module>()
----> 1 print(prestige.head(10))

NameError: name 'prestige' is not defined
In [38]:
fig = plt.figure(figsize=(12,12))
ax1 = fig.add_subplot(211, xlabel='Income', ylabel='Prestige')
ax1.scatter(prestige.income, prestige.prestige)
xy_outlier = prestige.ix['minister'][['income','prestige']]
ax1.annotate('Minister', xy_outlier, xy_outlier+1, fontsize=16)
ax2 = fig.add_subplot(212, xlabel='Education',
                           ylabel='Prestige')
ax2.scatter(prestige.education, prestige.prestige);
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-134-ab347cc850d5> in <module>()
      1 fig = plt.figure(figsize=(12,12))
      2 ax1 = fig.add_subplot(211, xlabel='Income', ylabel='Prestige')
----> 3 ax1.scatter(prestige.income, prestige.prestige)
      4 xy_outlier = prestige.ix['minister'][['income','prestige']]
      5 ax1.annotate('Minister', xy_outlier, xy_outlier+1, fontsize=16)

NameError: name 'prestige' is not defined
In [39]:
ols_model = ols('prestige ~ income + education', prestige).fit()
print(ols_model.summary())
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-135-94340cd202f8> in <module>()
----> 1 ols_model = ols('prestige ~ income + education', prestige).fit()
      2 print(ols_model.summary())

NameError: name 'prestige' is not defined
In [40]:
infl = ols_model.get_influence()
student = infl.summary_frame()['student_resid']
print(student)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-136-762835c5010b> in <module>()
----> 1 infl = ols_model.get_influence()
      2 student = infl.summary_frame()['student_resid']
      3 print(student)

AttributeError: 'OLS' object has no attribute 'get_influence'
In [41]:
print(student.ix[np.abs(student) > 2])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-137-4de70e5d7415> in <module>()
----> 1 print(student.ix[np.abs(student) > 2])

NameError: name 'student' is not defined
In [42]:
print(infl.summary_frame().ix['minister'])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-138-1b2a9fb1d022> in <module>()
----> 1 print(infl.summary_frame().ix['minister'])

/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in __getitem__(self, key)
     68             return self._getitem_tuple(key)
     69         else:
---> 70             return self._getitem_axis(key, axis=0)
     71 
     72     def _get_label(self, label, axis=0):

/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _getitem_axis(self, key, axis)
    965                     return self._get_loc(key, axis=axis)
    966 
--> 967             return self._get_label(key, axis=axis)
    968 
    969     def _getitem_iterable(self, key, axis=0):

/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _get_label(self, label, axis)
     84             raise IndexingError('no slices here, handle elsewhere')
     85 
---> 86         return self.obj._xs(label, axis=axis)
     87 
     88     def _get_loc(self, key, axis=0):

/usr/lib/python2.7/dist-packages/pandas/core/generic.pyc in xs(self, key, axis, level, copy, drop_level)
   1484                                                       drop_level=drop_level)
   1485         else:
-> 1486             loc = self.index.get_loc(key)
   1487 
   1488             if isinstance(loc, np.ndarray):

/usr/lib/python2.7/dist-packages/pandas/core/index.pyc in get_loc(self, key, method, tolerance)
   1757                                  'backfill or nearest lookups')
   1758             key = _values_from_object(key)
-> 1759             return self._engine.get_loc(key)
   1760 
   1761         indexer = self.get_indexer([key], method=method,

/usr/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)()

/usr/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3908)()

KeyError: 'minister'
In [43]:
sidak = ols_model.outlier_test('sidak')
sidak.sort('unadj_p', inplace=True)
print(sidak)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-139-88381d4a1ea2> in <module>()
----> 1 sidak = ols_model.outlier_test('sidak')
      2 sidak.sort('unadj_p', inplace=True)
      3 print(sidak)

AttributeError: 'OLS' object has no attribute 'outlier_test'
In [44]:
fdr = ols_model.outlier_test('fdr_bh')
fdr.sort('unadj_p', inplace=True)
print(fdr)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-140-69e9cb113222> in <module>()
----> 1 fdr = ols_model.outlier_test('fdr_bh')
      2 fdr.sort('unadj_p', inplace=True)
      3 print(fdr)

AttributeError: 'OLS' object has no attribute 'outlier_test'
In [45]:
rlm_model = rlm('prestige ~ income + education', prestige).fit()
print(rlm_model.summary())
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-141-f16d19a78b97> in <module>()
----> 1 rlm_model = rlm('prestige ~ income + education', prestige).fit()
      2 print(rlm_model.summary())

NameError: name 'prestige' is not defined
In [46]:
print(rlm_model.weights)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-142-2a0e93853fea> in <module>()
----> 1 print(rlm_model.weights)

NameError: name 'rlm_model' is not defined

Hertzprung Russell data for Star Cluster CYG 0B1 - Leverage Points

  • Data is on the luminosity and temperature of 47 stars in the direction of Cygnus.
In [47]:
dta = sm.datasets.get_rdataset("starsCYG", "robustbase", cache=True).data
---------------------------------------------------------------------------
URLError                                  Traceback (most recent call last)
<ipython-input-143-4880a5abdb55> in <module>()
----> 1 dta = sm.datasets.get_rdataset("starsCYG", "robustbase", cache=True).data

/build/statsmodels-coKHG6/statsmodels-0.6.1/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/datasets/utils.pyc in get_rdataset(dataname, package, cache)
    284                      "master/doc/"+package+"/rst/")
    285     cache = _get_cache(cache)
--> 286     data, from_cache = _get_data(data_base_url, dataname, cache)
    287     data = read_csv(data, index_col=0)
    288     data = _maybe_reset_index(data)

/build/statsmodels-coKHG6/statsmodels-0.6.1/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/datasets/utils.pyc in _get_data(base_url, dataname, cache, extension)
    215     url = base_url + (dataname + ".%s") % extension
    216     try:
--> 217         data, from_cache = _urlopen_cached(url, cache)
    218     except HTTPError as err:
    219         if '404' in str(err):

/build/statsmodels-coKHG6/statsmodels-0.6.1/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/datasets/utils.pyc in _urlopen_cached(url, cache)
    206     # not using the cache or didn't find it in cache
    207     if not from_cache:
--> 208         data = urlopen(url).read()
    209         if cache is not None:  # then put it in the cache
    210             _cache_it(data, cache_path)

/usr/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    152     else:
    153         opener = _opener
--> 154     return opener.open(url, data, timeout)
    155 
    156 def install_opener(opener):

/usr/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
    427             req = meth(req)
    428 
--> 429         response = self._open(req, data)
    430 
    431         # post-process response

/usr/lib/python2.7/urllib2.pyc in _open(self, req, data)
    445         protocol = req.get_type()
    446         result = self._call_chain(self.handle_open, protocol, protocol +
--> 447                                   '_open', req)
    448         if result:
    449             return result

/usr/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
    405             func = getattr(handler, meth_name)
    406 
--> 407             result = func(*args)
    408             if result is not None:
    409                 return result

/usr/lib/python2.7/urllib2.pyc in https_open(self, req)
   1239         def https_open(self, req):
   1240             return self.do_open(httplib.HTTPSConnection, req,
-> 1241                 context=self._context)
   1242 
   1243         https_request = AbstractHTTPHandler.do_request_

/usr/lib/python2.7/urllib2.pyc in do_open(self, http_class, req, **http_conn_args)
   1196         except socket.error, err: # XXX what error?
   1197             h.close()
-> 1198             raise URLError(err)
   1199         else:
   1200             try:

URLError: <urlopen error [Errno -2] Name or service not known>
In [48]:
from matplotlib.patches import Ellipse
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111, xlabel='log(Temp)', ylabel='log(Light)', title='Hertzsprung-Russell Diagram of Star Cluster CYG OB1')
ax.scatter(*dta.values.T)
# highlight outliers
e = Ellipse((3.5, 6), .2, 1, alpha=.25, color='r')
ax.add_patch(e);
ax.annotate('Red giants', xy=(3.6, 6), xytext=(3.8, 6),
            arrowprops=dict(facecolor='black', shrink=0.05, width=2),
            horizontalalignment='left', verticalalignment='bottom',
            clip_on=True, # clip to the axes bounding box
            fontsize=16,
     )
# annotate these with their index
for i,row in dta.ix[dta['log.Te'] < 3.8].iterrows():
    ax.annotate(i, row, row + .01, fontsize=14)
xlim, ylim = ax.get_xlim(), ax.get_ylim()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-144-d061dc5cc6d1> in <module>()
      2 fig = plt.figure(figsize=(12,8))
      3 ax = fig.add_subplot(111, xlabel='log(Temp)', ylabel='log(Light)', title='Hertzsprung-Russell Diagram of Star Cluster CYG OB1')
----> 4 ax.scatter(*dta.values.T)
      5 # highlight outliers
      6 e = Ellipse((3.5, 6), .2, 1, alpha=.25, color='r')

/usr/lib/python2.7/dist-packages/matplotlib/__init__.pyc in inner(ax, *args, **kwargs)
   1812                     warnings.warn(msg % (label_namer, func.__name__),
   1813                                   RuntimeWarning, stacklevel=2)
-> 1814             return func(ax, *args, **kwargs)
   1815         pre_doc = inner.__doc__
   1816         if pre_doc is None:

TypeError: scatter() takes at most 14 arguments (24 given)
In [49]:
from IPython.display import Image
Image(filename='star_diagram.png')
Out[49]:
In [50]:
y = dta['log.light']
X = sm.add_constant(dta['log.Te'], prepend=True)
ols_model = sm.OLS(y, X).fit()
abline_plot(model_results=ols_model, ax=ax)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-146-6c7349bbf59f> in <module>()
----> 1 y = dta['log.light']
      2 X = sm.add_constant(dta['log.Te'], prepend=True)
      3 ols_model = sm.OLS(y, X).fit()
      4 abline_plot(model_results=ols_model, ax=ax)

/usr/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1967             return self._getitem_multilevel(key)
   1968         else:
-> 1969             return self._getitem_column(key)
   1970 
   1971     def _getitem_column(self, key):

/usr/lib/python2.7/dist-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   1974         # get column
   1975         if self.columns.is_unique:
-> 1976             return self._get_item_cache(key)
   1977 
   1978         # duplicate columns & possible reduce dimensionality

/usr/lib/python2.7/dist-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
   1089         res = cache.get(item)
   1090         if res is None:
-> 1091             values = self._data.get(item)
   1092             res = self._box_item_values(item, values)
   1093             cache[item] = res

/usr/lib/python2.7/dist-packages/pandas/core/internals.pyc in get(self, item, fastpath)
   3209 
   3210             if not isnull(item):
-> 3211                 loc = self.items.get_loc(item)
   3212             else:
   3213                 indexer = np.arange(len(self.items))[isnull(self.items)]

/usr/lib/python2.7/dist-packages/pandas/core/index.pyc in get_loc(self, key, method, tolerance)
   1757                                  'backfill or nearest lookups')
   1758             key = _values_from_object(key)
-> 1759             return self._engine.get_loc(key)
   1760 
   1761         indexer = self.get_indexer([key], method=method,

/usr/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)()

/usr/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)()

/usr/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)()

/usr/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)()

KeyError: 'log.light'
In [51]:
rlm_mod = sm.RLM(y, X, sm.robust.norms.TrimmedMean(.5)).fit()
abline_plot(model_results=rlm_mod, ax=ax, color='red')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-147-9210cc7d7045> in <module>()
      1 rlm_mod = sm.RLM(y, X, sm.robust.norms.TrimmedMean(.5)).fit()
----> 2 abline_plot(model_results=rlm_mod, ax=ax, color='red')

/build/statsmodels-coKHG6/statsmodels-0.6.1/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/graphics/regressionplots.pyc in abline_plot(intercept, slope, horiz, vert, model_results, ax, **kwargs)
    654 
    655     if model_results:
--> 656         intercept, slope = model_results.params
    657         if x is None:
    658             x = [model_results.model.exog[:, 1].min(),

ValueError: too many values to unpack
  • Why? Because M-estimators are not robust to leverage points.
In [52]:
infl = ols_model.get_influence()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-148-5ff696623192> in <module>()
----> 1 infl = ols_model.get_influence()

AttributeError: 'OLS' object has no attribute 'get_influence'
In [53]:
h_bar = 2*(ols_model.df_model + 1 )/ols_model.nobs
hat_diag = infl.summary_frame()['hat_diag']
hat_diag.ix[hat_diag > h_bar]
Out[53]:
Series([], Name: hat_diag, dtype: float64)
In [54]:
sidak2 = ols_model.outlier_test('sidak')
sidak2.sort('unadj_p', inplace=True)
print(sidak2)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-150-a6c575f5e012> in <module>()
----> 1 sidak2 = ols_model.outlier_test('sidak')
      2 sidak2.sort('unadj_p', inplace=True)
      3 print(sidak2)

AttributeError: 'OLS' object has no attribute 'outlier_test'
In [55]:
fdr2 = ols_model.outlier_test('fdr_bh')
fdr2.sort('unadj_p', inplace=True)
print(fdr2)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-151-cf26bbbd14f6> in <module>()
----> 1 fdr2 = ols_model.outlier_test('fdr_bh')
      2 fdr2.sort('unadj_p', inplace=True)
      3 print(fdr2)

AttributeError: 'OLS' object has no attribute 'outlier_test'
  • Let's delete that line
In [56]:
del ax.lines[-1]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-152-346dae874d03> in <module>()
----> 1 del ax.lines[-1]

IndexError: list assignment index out of range
In [57]:
weights = np.ones(len(X))
weights[X[X['log.Te'] < 3.8].index.values - 1] = 0
wls_model = sm.WLS(y, X, weights=weights).fit()
abline_plot(model_results=wls_model, ax=ax, color='green')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-153-e3ec53a40864> in <module>()
      1 weights = np.ones(len(X))
----> 2 weights[X[X['log.Te'] < 3.8].index.values - 1] = 0
      3 wls_model = sm.WLS(y, X, weights=weights).fit()
      4 abline_plot(model_results=wls_model, ax=ax, color='green')

/usr/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1967             return self._getitem_multilevel(key)
   1968         else:
-> 1969             return self._getitem_column(key)
   1970 
   1971     def _getitem_column(self, key):

/usr/lib/python2.7/dist-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   1974         # get column
   1975         if self.columns.is_unique:
-> 1976             return self._get_item_cache(key)
   1977 
   1978         # duplicate columns & possible reduce dimensionality

/usr/lib/python2.7/dist-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
   1089         res = cache.get(item)
   1090         if res is None:
-> 1091             values = self._data.get(item)
   1092             res = self._box_item_values(item, values)
   1093             cache[item] = res

/usr/lib/python2.7/dist-packages/pandas/core/internals.pyc in get(self, item, fastpath)
   3209 
   3210             if not isnull(item):
-> 3211                 loc = self.items.get_loc(item)
   3212             else:
   3213                 indexer = np.arange(len(self.items))[isnull(self.items)]

/usr/lib/python2.7/dist-packages/pandas/core/index.pyc in get_loc(self, key, method, tolerance)
   1757                                  'backfill or nearest lookups')
   1758             key = _values_from_object(key)
-> 1759             return self._engine.get_loc(key)
   1760 
   1761         indexer = self.get_indexer([key], method=method,

/usr/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)()

/usr/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)()

/usr/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)()

/usr/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)()

KeyError: 'log.Te'
  • MM estimators are good for this type of problem, unfortunately, we don't yet have these yet.
  • It's being worked on, but it gives a good excuse to look at the R cell magics in the notebook.
In [58]:
yy = y.values[:,None]
xx = X['log.Te'].values[:,None]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-154-a5672cb240e0> in <module>()
      1 yy = y.values[:,None]
----> 2 xx = X['log.Te'].values[:,None]

/usr/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1967             return self._getitem_multilevel(key)
   1968         else:
-> 1969             return self._getitem_column(key)
   1970 
   1971     def _getitem_column(self, key):

/usr/lib/python2.7/dist-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   1974         # get column
   1975         if self.columns.is_unique:
-> 1976             return self._get_item_cache(key)
   1977 
   1978         # duplicate columns & possible reduce dimensionality

/usr/lib/python2.7/dist-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
   1089         res = cache.get(item)
   1090         if res is None:
-> 1091             values = self._data.get(item)
   1092             res = self._box_item_values(item, values)
   1093             cache[item] = res

/usr/lib/python2.7/dist-packages/pandas/core/internals.pyc in get(self, item, fastpath)
   3209 
   3210             if not isnull(item):
-> 3211                 loc = self.items.get_loc(item)
   3212             else:
   3213                 indexer = np.arange(len(self.items))[isnull(self.items)]

/usr/lib/python2.7/dist-packages/pandas/core/index.pyc in get_loc(self, key, method, tolerance)
   1757                                  'backfill or nearest lookups')
   1758             key = _values_from_object(key)
-> 1759             return self._engine.get_loc(key)
   1760 
   1761         indexer = self.get_indexer([key], method=method,

/usr/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)()

/usr/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)()

/usr/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)()

/usr/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)()

KeyError: 'log.Te'
In [59]:
%load_ext rmagic

%R library(robustbase)
%Rpush yy xx
%R mod <- lmrob(yy ~ xx);
%R params <- mod$coefficients;
%Rpull params
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-155-53d6967d0401> in <module>()
----> 1 get_ipython().magic(u'load_ext rmagic')
      2 
      3 get_ipython().magic(u'R library(robustbase)')
      4 get_ipython().magic(u'Rpush yy xx')
      5 get_ipython().magic(u'R mod <- lmrob(yy ~ xx);')

/usr/lib/python2.7/dist-packages/IPython/core/interactiveshell.pyc in magic(self, arg_s)
   2203         magic_name, _, magic_arg_s = arg_s.partition(' ')
   2204         magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2205         return self.run_line_magic(magic_name, magic_arg_s)
   2206 
   2207     #-------------------------------------------------------------------------

/usr/lib/python2.7/dist-packages/IPython/core/interactiveshell.pyc in run_line_magic(self, magic_name, line)
   2124                 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
   2125             with self.builtin_trap:
-> 2126                 result = fn(*args,**kwargs)
   2127             return result
   2128 

<decorator-gen-64> in load_ext(self, module_str)

/usr/lib/python2.7/dist-packages/IPython/core/magic.pyc in <lambda>(f, *a, **k)
    191     # but it's overkill for just that one bit of state.
    192     def magic_deco(arg):
--> 193         call = lambda f, *a, **k: f(*a, **k)
    194 
    195         if callable(arg):

/usr/lib/python2.7/dist-packages/IPython/core/magics/extension.pyc in load_ext(self, module_str)
     61         if not module_str:
     62             raise UsageError('Missing module name.')
---> 63         res = self.shell.extension_manager.load_extension(module_str)
     64 
     65         if res == 'already loaded':

/usr/lib/python2.7/dist-packages/IPython/core/extensions.pyc in load_extension(self, module_str)
     96             if module_str not in sys.modules:
     97                 with prepended_to_syspath(self.ipython_extension_dir):
---> 98                     __import__(module_str)
     99             mod = sys.modules[module_str]
    100             if self._call_load_ipython_extension(mod):

/usr/lib/python2.7/dist-packages/IPython/extensions/rmagic.py in <module>()
     54 import numpy as np
     55 
---> 56 import rpy2.rinterface as ri
     57 import rpy2.robjects as ro
     58 try:

ImportError: No module named rpy2.rinterface
In [60]:
%R print(mod)
ERROR: Line magic function `%R` not found.

In [61]:
print(params)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-157-73ac4b936803> in <module>()
----> 1 print(params)

NameError: name 'params' is not defined
In [62]:
abline_plot(intercept=params[0], slope=params[1], ax=ax, color='green')
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-158-e1a9f35b3320> in <module>()
----> 1 abline_plot(intercept=params[0], slope=params[1], ax=ax, color='green')

NameError: name 'params' is not defined

Exercise: Breakdown points of M-estimator

In [63]:
np.random.seed(12345)
nobs = 200
beta_true = np.array([3, 1, 2.5, 3, -4])
X = np.random.uniform(-20,20, size=(nobs, len(beta_true)-1))
# stack a constant in front
X = sm.add_constant(X, prepend=True) # np.c_[np.ones(nobs), X]
mc_iter = 500
contaminate = .25 # percentage of response variables to contaminate
In [64]:
all_betas = []
for i in range(mc_iter):
    y = np.dot(X, beta_true) + np.random.normal(size=200)
    random_idx = np.random.randint(0, nobs, size=int(contaminate * nobs))
    y[random_idx] = np.random.uniform(-750, 750)
    beta_hat = sm.RLM(y, X).fit().params
    all_betas.append(beta_hat)
In [65]:
all_betas = np.asarray(all_betas)
se_loss = lambda x : np.linalg.norm(x, ord=2)**2
se_beta = lmap(se_loss, all_betas - beta_true)

Squared error loss

In [66]:
np.array(se_beta).mean()
Out[66]:
0.44502948730686182
In [67]:
all_betas.mean(0)
Out[67]:
array([ 2.99711706,  0.99898147,  2.49909344,  2.99712918, -3.99626521])
In [68]:
beta_true
Out[68]:
array([ 3. ,  1. ,  2.5,  3. , -4. ])
In [69]:
se_loss(all_betas.mean(0) - beta_true)
Out[69]:
3.2360913286754188e-05