
Maximum Likelihood Estimation (Generic models)


This tutorial explains how to quickly implement new maximum likelihood models in statsmodels. We give two examples:

  1. Probit model for binary dependent variables
  2. Negative binomial model for count data

The GenericLikelihoodModel class eases the process by providing tools such as automatic numeric differentiation and a unified interface to scipy optimization functions. Using statsmodels, users can fit new MLE models simply by plugging in a log-likelihood function.

Example 1: Probit model

In [1]:
from __future__ import print_function
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.base.model import GenericLikelihoodModel

The Spector dataset is distributed with statsmodels. You can access a vector of values for the dependent variable (endog) and a matrix of regressors (exog) like this:

In [2]:
data = sm.datasets.spector.load_pandas()
exog = data.exog
endog = data.endog
print(sm.datasets.spector.NOTE)
print(data.exog.head())
::

    Number of Observations - 32

    Number of Variables - 4

    Variable name definitions::

        Grade - binary variable indicating whether or not a student's grade
                improved.  1 indicates an improvement.
        TUCE  - Test score on economics test
        PSI   - participation in program
        GPA   - Student's grade point average

    GPA  TUCE  PSI
0  2.66    20    0
1  2.89    22    0
2  3.28    24    0
3  2.92    12    0
4  4.00    21    0

Then, we add a constant to the matrix of regressors:

In [3]:
exog = sm.add_constant(exog, prepend=True)

To create your own Likelihood Model, you simply need to override the loglike method.

In [4]:
class MyProbit(GenericLikelihoodModel):
    def loglike(self, params):
        exog = self.exog
        endog = self.endog
        q = 2 * endog - 1
        return stats.norm.logcdf(q*np.dot(exog, params)).sum()

Estimate the model and print a summary:

In [5]:
sm_probit_manual = MyProbit(endog, exog).fit()
print(sm_probit_manual.summary())
Optimization terminated successfully.
         Current function value: 0.400588
         Iterations: 292
         Function evaluations: 494
                               MyProbit Results
==============================================================================
Dep. Variable:                  GRADE   Log-Likelihood:                -12.819
Model:                       MyProbit   AIC:                             33.64
Method:            Maximum Likelihood   BIC:                             39.50
Date:                Wed, 24 Dec 2014
Time:                        13:10:55
No. Observations:                  32
Df Residuals:                      28
Df Model:                           3
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -7.4523      2.542     -2.931      0.003       -12.435    -2.469
GPA            1.6258      0.694      2.343      0.019         0.266     2.986
TUCE           0.0517      0.084      0.617      0.537        -0.113     0.216
PSI            1.4263      0.595      2.397      0.017         0.260     2.593
==============================================================================
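The fitted parameters behave like those from any other statsmodels results object. As a small aside (a sketch, not part of the original notebook), in-sample predicted probabilities can be computed directly through the probit link:

# Sketch: in-sample predicted probabilities from the manually specified probit
probs = stats.norm.cdf(np.dot(exog, sm_probit_manual.params))
print(probs[:5])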

Compare your Probit implementation to statsmodels' "canned" implementation:

In [6]:
sm_probit_canned = sm.Probit(endog, exog).fit()
Optimization terminated successfully.
         Current function value: 0.400588
         Iterations 6

In [7]:
print(sm_probit_canned.params)
print(sm_probit_manual.params)
const   -7.452320
GPA      1.625810
TUCE     0.051729
PSI      1.426332
dtype: float64
[-7.4523  1.6258  0.0517  1.4263]

In [8]:
print(sm_probit_canned.cov_params())
print(sm_probit_manual.cov_params())
          const       GPA      TUCE       PSI
const  6.464166 -1.169668 -0.101173 -0.594792
GPA   -1.169668  0.481473 -0.018914  0.105439
TUCE  -0.101173 -0.018914  0.007038  0.002472
PSI   -0.594792  0.105439  0.002472  0.354070
[[ 6.4642 -1.1697 -0.1012 -0.5948]
 [-1.1697  0.4815 -0.0189  0.1054]
 [-0.1012 -0.0189  0.007   0.0025]
 [-0.5948  0.1054  0.0025  0.3541]]
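As a quick numerical check (a sketch, not part of the original notebook; the tolerances are loose guesses), the two sets of estimates can be compared with numpy's testing helpers:

# Sketch: manual and canned estimates should agree to optimizer tolerance
np.testing.assert_allclose(sm_probit_manual.params, sm_probit_canned.params, rtol=1e-3)
np.testing.assert_allclose(sm_probit_manual.cov_params(), sm_probit_canned.cov_params(), rtol=1e-2)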

Notice that the GenericLikelihoodModel class provides automatic numerical differentiation, so we did not have to provide Hessian or score functions in order to calculate the covariance estimates.
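If more speed or precision is needed, the same machinery will use an analytic gradient whenever the subclass defines a score method. A hedged sketch of what that could look like for the probit case (the class name MyProbitWithScore is ours; the expression is the standard probit score):

class MyProbitWithScore(MyProbit):
    def score(self, params):
        # Analytic gradient of the probit log-likelihood:
        # sum_i q_i * phi(q_i * x_i'beta) / Phi(q_i * x_i'beta) * x_i
        exog = self.exog
        q = 2 * self.endog - 1
        xb = np.dot(exog, params)
        lam = q * stats.norm.pdf(q * xb) / stats.norm.cdf(q * xb)
        return np.dot(lam, exog)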

Example 2: Negative Binomial Regression for Count Data

Consider a negative binomial regression model for count data with a log-likelihood function of the NB-2 type, expressed as:

$$ \mathcal{L}(\beta_j; y, \alpha) = \sum_{i=1}^n \left[ y_i \ln \left( \frac{\alpha \exp(X_i'\beta)}{1+\alpha \exp(X_i'\beta)} \right) - \frac{1}{\alpha} \ln(1+\alpha \exp(X_i'\beta)) + \ln \Gamma (y_i + 1/\alpha) - \ln \Gamma (y_i+1) - \ln \Gamma (1/\alpha) \right] $$

with a matrix of regressors $X$, a vector of coefficients $\beta$, and the negative binomial heterogeneity parameter $\alpha$.

Using the nbinom distribution from scipy, we can write this likelihood simply as:

In [9]:
import numpy as np
from scipy.stats import nbinom
In [10]:
def _ll_nb2(y, X, beta, alph):
    mu = np.exp(np.dot(X, beta))
    size = 1/alph
    prob = size/(size+mu)
    ll = nbinom.logpmf(y, size, prob)
    return ll
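As a quick check of this parameterization (a sketch using hypothetical values for mu and alph), the implied mean is mu and the implied variance is mu + alph*mu**2, which is the NB-2 variance function:

# Sketch: verify the (size, prob) parameterization used in _ll_nb2
mu, alph = 3.0, 0.5
size = 1 / alph
prob = size / (size + mu)
print(nbinom.mean(size, prob))  # 3.0, equals mu
print(nbinom.var(size, prob))   # 7.5, equals mu + alph * mu**2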

New Model Class

We create a new model class which inherits from GenericLikelihoodModel:

In [11]:
from statsmodels.base.model import GenericLikelihoodModel
In [12]:
class NBin(GenericLikelihoodModel):
    def __init__(self, endog, exog, **kwds):
        super(NBin, self).__init__(endog, exog, **kwds)

    def nloglikeobs(self, params):
        alph = params[-1]
        beta = params[:-1]
        ll = _ll_nb2(self.endog, self.exog, beta, alph)
        return -ll

    def fit(self, start_params=None, maxiter=10000, maxfun=5000, **kwds):
        # we have one additional parameter (alpha) and we need to add it for the summary
        if 'alpha' not in self.exog_names:
            self.exog_names.append('alpha')
        if start_params is None:
            # Reasonable starting values
            start_params = np.append(np.zeros(self.exog.shape[1]), .5)
            # intercept
            start_params[-2] = np.log(self.endog.mean())
        return super(NBin, self).fit(start_params=start_params,
                                     maxiter=maxiter, maxfun=maxfun,
                                     **kwds)

Two important things to notice:

  • nloglikeobs: This method should return one evaluation of the negative log-likelihood function per observation in your dataset (i.e., one value per row of the endog/exog data).
  • start_params: A one-dimensional array of starting values needs to be provided. The size of this array determines the number of parameters that will be used in optimization.

That's it! You're done!
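Before turning to real data, a small simulated-data check (a sketch; all values below are made up) can confirm that nloglikeobs returns one value per observation and that fit runs end to end:

# Sketch: check the model on simulated NB-2 data
np.random.seed(12345)
nobs = 500
X_sim = np.column_stack([np.ones(nobs), np.random.uniform(size=nobs)])
mu_sim = np.exp(np.dot(X_sim, [1.0, 0.5]))
size_sim = 1 / 0.5
y_sim = nbinom.rvs(size_sim, size_sim / (size_sim + mu_sim), size=nobs)
mod_sim = NBin(y_sim, X_sim)
print(mod_sim.nloglikeobs(np.array([1.0, 0.5, 0.5])).shape)  # one value per observation
res_sim = mod_sim.fit(disp=0)
print(res_sim.params)  # should be roughly [1.0, 0.5, 0.5]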

Usage Example

The Medpar dataset is hosted in CSV format at the Rdatasets repository. We use the read_csv function from the Pandas library to load the data in memory. We then print the first few rows:

In [13]:
import statsmodels.api as sm
In [14]:
medpar = sm.datasets.get_rdataset("medpar", "COUNT", cache=True).data

medpar.head()

The model we are interested in has a vector of non-negative integers as dependent variable (los), and 5 regressors: Intercept, type2, type3, hmo, white.

For estimation, we need to create two variables to hold our regressors and the outcome variable. These can be ndarrays or pandas objects.

In [15]:
y = medpar.los
X = medpar[["type2", "type3", "hmo", "white"]]
X["constant"] = 1

Then, we fit the model and extract some information:

In [16]:
mod = NBin(y, X)
res = mod.fit()

Extract parameter estimates, standard errors, p-values, AIC, etc.:

In [17]:
print('Parameters: ', res.params)
print('Standard errors: ', res.bse)
print('P-values: ', res.pvalues)
print('AIC: ', res.aic)

As usual, you can obtain a full list of available information by typing dir(res). We can also look at the summary of the estimation results.

In [18]:
print(res.summary())

Testing

We can check the results by using the statsmodels implementation of the Negative Binomial model, which uses the analytic score function and Hessian.

In [19]:
res_nbin = sm.NegativeBinomial(y, X).fit(disp=0)
print(res_nbin.summary())
In [20]:
print(res_nbin.params)

In [21]:
print(res_nbin.bse)

Or we could compare them to results obtained using the MASS implementation in R:

url = 'http://vincentarelbundock.github.com/Rdatasets/csv/COUNT/medpar.csv'
medpar = read.csv(url)
f = los ~ factor(type) + hmo + white

library(MASS)
mod = glm.nb(f, medpar)
coef(summary(mod))
                 Estimate Std. Error   z value      Pr(>|z|)
(Intercept)    2.31027893 0.06744676 34.253370 3.885556e-257
factor(type)2  0.22124898 0.05045746  4.384861  1.160597e-05
factor(type)3  0.70615882 0.07599849  9.291748  1.517751e-20
hmo           -0.06795522 0.05321375 -1.277024  2.015939e-01
white         -0.12906544 0.06836272 -1.887951  5.903257e-02

Numerical precision

The statsmodels generic MLE and R parameter estimates agree up to the fourth decimal. The standard errors, however, agree only up to the second decimal. This discrepancy is the result of imprecision in our Hessian numerical estimates. In the current context, the difference between MASS and statsmodels standard error estimates is substantively irrelevant, but it highlights the fact that users who need very precise estimates may not always want to rely on default settings when using numerical derivatives. In such cases, it is better to use analytical derivatives with the LikelihoodModel class.
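A rough way to quantify that discrepancy (a sketch that assumes both res from the generic NBin model and res_nbin from sm.NegativeBinomial are still available in the session):

# Sketch: largest absolute differences between the two sets of estimates
print(np.max(np.abs(res.params - res_nbin.params)))  # parameters: very small
print(np.max(np.abs(res.bse - res_nbin.bse)))        # standard errors: somewhat larger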
