Fitting models using R-style formulas

Since version 0.5.0, statsmodels allows users to fit statistical models using R-style formulas. Internally, statsmodels uses the patsy package to convert formulas and data to the matrices that are used in model fitting. The formula framework is quite powerful; this tutorial only scratches the surface. A full description of the formula language can be found in the patsy docs:

Loading modules and functions

In [1]: import statsmodels.formula.api as smf

In [2]: import numpy as np

In [3]: import pandas

Notice that we called statsmodels.formula.api instead of the usual statsmodels.api. The formula.api hosts many of the same functions found in api (e.g. OLS, GLM), but it also holds lower case counterparts for most of these models. In general, lower case models accept formula and df arguments, whereas upper case ones take endog and exog design matrices. formula accepts a string which describes the model in terms of a patsy formula. df takes a pandas data frame.

dir(smf) will print a list of available models.

Formula-compatible models have the following generic call signature: (formula, data, subset=None, *args, **kwargs)

OLS regression using formulas

To begin, we fit the linear model described on the Getting Started page. Download the data, subset columns, and list-wise delete to remove missing observations:

In [4]: df = sm.datasets.get_rdataset("Guerry", "HistData", cache=True).data
---------------------------------------------------------------------------
gaierror                                  Traceback (most recent call last)
/usr/lib/python3.7/urllib/request.py in do_open(self, http_class, req, **http_conn_args)
   1316                 h.request(req.get_method(), req.selector, req.data, headers,
-> 1317                           encode_chunked=req.has_header('Transfer-encoding'))
   1318             except OSError as err: # timeout error

/usr/lib/python3.7/http/client.py in request(self, method, url, body, headers, encode_chunked)
   1228         """Send a complete request to the server."""
-> 1229         self._send_request(method, url, body, headers, encode_chunked)
   1230 

/usr/lib/python3.7/http/client.py in _send_request(self, method, url, body, headers, encode_chunked)
   1274             body = _encode(body, 'body')
-> 1275         self.endheaders(body, encode_chunked=encode_chunked)
   1276 

/usr/lib/python3.7/http/client.py in endheaders(self, message_body, encode_chunked)
   1223             raise CannotSendHeader()
-> 1224         self._send_output(message_body, encode_chunked=encode_chunked)
   1225 

/usr/lib/python3.7/http/client.py in _send_output(self, message_body, encode_chunked)
   1015         del self._buffer[:]
-> 1016         self.send(msg)
   1017 

/usr/lib/python3.7/http/client.py in send(self, data)
    955             if self.auto_open:
--> 956                 self.connect()
    957             else:

/usr/lib/python3.7/http/client.py in connect(self)
   1383 
-> 1384             super().connect()
   1385 

/usr/lib/python3.7/http/client.py in connect(self)
    927         self.sock = self._create_connection(
--> 928             (self.host,self.port), self.timeout, self.source_address)
    929         self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

/usr/lib/python3.7/socket.py in create_connection(address, timeout, source_address)
    706     err = None
--> 707     for res in getaddrinfo(host, port, 0, SOCK_STREAM):
    708         af, socktype, proto, canonname, sa = res

/usr/lib/python3.7/socket.py in getaddrinfo(host, port, family, type, proto, flags)
    747     addrlist = []
--> 748     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    749         af, socktype, proto, canonname, sa = res

gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

URLError                                  Traceback (most recent call last)
<ipython-input-4-cab9fdf84142> in <module>()
----> 1 df = sm.datasets.get_rdataset("Guerry", "HistData", cache=True).data

/build/statsmodels-IFPJo1/statsmodels-0.8.0/.pybuild/cpython3_3.7_statsmodels/build/statsmodels/datasets/utils.py in get_rdataset(dataname, package, cache)
    288                      "master/doc/"+package+"/rst/")
    289     cache = _get_cache(cache)
--> 290     data, from_cache = _get_data(data_base_url, dataname, cache)
    291     data = read_csv(data, index_col=0)
    292     data = _maybe_reset_index(data)

/build/statsmodels-IFPJo1/statsmodels-0.8.0/.pybuild/cpython3_3.7_statsmodels/build/statsmodels/datasets/utils.py in _get_data(base_url, dataname, cache, extension)
    219     url = base_url + (dataname + ".%s") % extension
    220     try:
--> 221         data, from_cache = _urlopen_cached(url, cache)
    222     except HTTPError as err:
    223         if '404' in str(err):

/build/statsmodels-IFPJo1/statsmodels-0.8.0/.pybuild/cpython3_3.7_statsmodels/build/statsmodels/datasets/utils.py in _urlopen_cached(url, cache)
    210     # not using the cache or didn't find it in cache
    211     if not from_cache:
--> 212         data = urlopen(url).read()
    213         if cache is not None:  # then put it in the cache
    214             _cache_it(data, cache_path)

/usr/lib/python3.7/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    220     else:
    221         opener = _opener
--> 222     return opener.open(url, data, timeout)
    223 
    224 def install_opener(opener):

/usr/lib/python3.7/urllib/request.py in open(self, fullurl, data, timeout)
    523             req = meth(req)
    524 
--> 525         response = self._open(req, data)
    526 
    527         # post-process response

/usr/lib/python3.7/urllib/request.py in _open(self, req, data)
    541         protocol = req.type
    542         result = self._call_chain(self.handle_open, protocol, protocol +
--> 543                                   '_open', req)
    544         if result:
    545             return result

/usr/lib/python3.7/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    501         for handler in handlers:
    502             func = getattr(handler, meth_name)
--> 503             result = func(*args)
    504             if result is not None:
    505                 return result

/usr/lib/python3.7/urllib/request.py in https_open(self, req)
   1358         def https_open(self, req):
   1359             return self.do_open(http.client.HTTPSConnection, req,
-> 1360                 context=self._context, check_hostname=self._check_hostname)
   1361 
   1362         https_request = AbstractHTTPHandler.do_request_

/usr/lib/python3.7/urllib/request.py in do_open(self, http_class, req, **http_conn_args)
   1317                           encode_chunked=req.has_header('Transfer-encoding'))
   1318             except OSError as err: # timeout error
-> 1319                 raise URLError(err)
   1320             r = h.getresponse()
   1321         except:

URLError: <urlopen error [Errno -2] Name or service not known>

In [5]: df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-5-caf7541d7376> in <module>()
----> 1 df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()

NameError: name 'df' is not defined

In [6]: df.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-6-c42a15b2c7cf> in <module>()
----> 1 df.head()

NameError: name 'df' is not defined

Fit the model:

In [7]: mod = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-7-1d998c845c19> in <module>()
----> 1 mod = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)

NameError: name 'df' is not defined

In [8]: res = mod.fit()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-8-fa3ccf53f431> in <module>()
----> 1 res = mod.fit()

NameError: name 'mod' is not defined

In [9]: print(res.summary())
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-9-ba064a039ab1> in <module>()
----> 1 print(res.summary())

NameError: name 'res' is not defined

Categorical variables

Looking at the summary printed above, notice that patsy determined that elements of Region were text strings, so it treated Region as a categorical variable. patsy’s default is also to include an intercept, so we automatically dropped one of the Region categories.

If Region had been an integer variable that we wanted to treat explicitly as categorical, we could have done so by using the C() operator:

In [10]: res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-10-708bd6ac0f7c> in <module>()
----> 1 res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()

NameError: name 'df' is not defined

In [11]: print(res.params)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-11-584aa721836e> in <module>()
----> 1 print(res.params)

NameError: name 'res' is not defined

Examples more advanced features patsy’s categorical variables function can be found here: Patsy: Contrast Coding Systems for categorical variables

Operators

We have already seen that “~” separates the left-hand side of the model from the right-hand side, and that “+” adds new columns to the design matrix.

Removing variables

The “-” sign can be used to remove columns/variables. For instance, we can remove the intercept from a model by:

In [12]: res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region) -1 ', data=df).fit()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-12-19c40b5bef49> in <module>()
----> 1 res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region) -1 ', data=df).fit()

NameError: name 'df' is not defined

In [13]: print(res.params)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-13-584aa721836e> in <module>()
----> 1 print(res.params)

NameError: name 'res' is not defined

Multiplicative interactions

“:” adds a new column to the design matrix with the product of the other two columns. “*” will also include the individual columns that were multiplied together:

In [14]: res1 = smf.ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-14-0f7935b7f846> in <module>()
----> 1 res1 = smf.ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()

NameError: name 'df' is not defined

In [15]: res2 = smf.ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-15-cab81ce225a1> in <module>()
----> 1 res2 = smf.ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()

NameError: name 'df' is not defined

In [16]: print(res1.params)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-16-6fb4b2ac7744> in <module>()
----> 1 print(res1.params)

NameError: name 'res1' is not defined

In [17]: print(res2.params)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-17-875b96e014ae> in <module>()
----> 1 print(res2.params)

NameError: name 'res2' is not defined

Many other things are possible with operators. Please consult the patsy docs to learn more.

Functions

You can apply vectorized functions to the variables in your model:

In [18]: res = smf.ols(formula='Lottery ~ np.log(Literacy)', data=df).fit()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-18-623bd3cda057> in <module>()
----> 1 res = smf.ols(formula='Lottery ~ np.log(Literacy)', data=df).fit()

NameError: name 'df' is not defined

In [19]: print(res.params)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-19-584aa721836e> in <module>()
----> 1 print(res.params)

NameError: name 'res' is not defined

Define a custom function:

In [20]: def log_plus_1(x):
   ....:     return np.log(x) + 1.
   ....: 

In [21]: print(res.params)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-21-584aa721836e> in <module>()
----> 1 print(res.params)

NameError: name 'res' is not defined

Namespaces

Notice that all of the above examples use the calling namespace to look for the functions to apply. The namespace used can be controlled via the eval_env keyword. For example, you may want to give a custom namespace using the patsy.EvalEnvironment or you may want to use a “clean” namespace, which we provide by passing eval_func=-1. The default is to use the caller’s namespace. This can have (un)expected consequences, if, for example, someone has a variable names C in the user namespace or in their data structure passed to patsy, and C is used in the formula to handle a categorical variable. See the Patsy API Reference for more information.

Using formulas with models that do not (yet) support them

Even if a given statsmodels function does not support formulas, you can still use patsy’s formula language to produce design matrices. Those matrices can then be fed to the fitting function as endog and exog arguments.

To generate numpy arrays:

In [22]: import patsy

In [23]: f = 'Lottery ~ Literacy * Wealth'

In [24]: y, X = patsy.dmatrices(f, df, return_type='dataframe')
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-24-7628fcd61fc5> in <module>()
----> 1 y, X = patsy.dmatrices(f, df, return_type='dataframe')

NameError: name 'df' is not defined

In [25]: print(y[:5])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-25-165159ff2e7b> in <module>()
----> 1 print(y[:5])

NameError: name 'y' is not defined

In [26]: print(X[:5])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-26-f3111c92c0b2> in <module>()
----> 1 print(X[:5])

NameError: name 'X' is not defined

To generate pandas data frames:

In [27]: f = 'Lottery ~ Literacy * Wealth'

In [28]: y, X = patsy.dmatrices(f, df, return_type='dataframe')
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-28-7628fcd61fc5> in <module>()
----> 1 y, X = patsy.dmatrices(f, df, return_type='dataframe')

NameError: name 'df' is not defined

In [29]: print(y[:5])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-29-165159ff2e7b> in <module>()
----> 1 print(y[:5])

NameError: name 'y' is not defined

In [30]: print(X[:5])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-30-f3111c92c0b2> in <module>()
----> 1 print(X[:5])

NameError: name 'X' is not defined
In [31]: print(smf.OLS(y, X).fit().summary())
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-31-fea027f8890a> in <module>()
----> 1 print(smf.OLS(y, X).fit().summary())

NameError: name 'y' is not defined