Fitting models using R-style formulas¶
Since version 0.5.0, statsmodels
allows users to fit statistical
models using R-style formulas. Internally, statsmodels
uses the
patsy package to convert formulas and
data to the matrices that are used in model fitting. The formula
framework is quite powerful; this tutorial only scratches the surface. A
full description of the formula language can be found in the patsy
docs:
Loading modules and functions¶
In [1]: import statsmodels.formula.api as smf
In [2]: import numpy as np
In [3]: import pandas
Notice that we called statsmodels.formula.api
instead of the usual
statsmodels.api
. The formula.api
hosts many of the same
functions found in api
(e.g. OLS, GLM), but it also holds lower case
counterparts for most of these models. In general, lower case models
accept formula
and df
arguments, whereas upper case ones take
endog
and exog
design matrices. formula
accepts a string
which describes the model in terms of a patsy
formula. df
takes
a pandas data frame.
dir(smf)
will print a list of available models.
Formula-compatible models have the following generic call signature:
(formula, data, subset=None, *args, **kwargs)
OLS regression using formulas¶
To begin, we fit the linear model described on the Getting Started page. Download the data, subset columns, and list-wise delete to remove missing observations:
In [4]: df = sm.datasets.get_rdataset("Guerry", "HistData").data
URLErrorTraceback (most recent call last)
<ipython-input-4-8e82bb04cf4f> in <module>()
----> 1 df = sm.datasets.get_rdataset("Guerry", "HistData").data
/build/statsmodels-lE1Zrp/statsmodels-0.8.0~rc1+git59-gef47cd9/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/datasets/utils.pyc in get_rdataset(dataname, package, cache)
287 "master/doc/"+package+"/rst/")
288 cache = _get_cache(cache)
--> 289 data, from_cache = _get_data(data_base_url, dataname, cache)
290 data = read_csv(data, index_col=0)
291 data = _maybe_reset_index(data)
/build/statsmodels-lE1Zrp/statsmodels-0.8.0~rc1+git59-gef47cd9/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/datasets/utils.pyc in _get_data(base_url, dataname, cache, extension)
218 url = base_url + (dataname + ".%s") % extension
219 try:
--> 220 data, from_cache = _urlopen_cached(url, cache)
221 except HTTPError as err:
222 if '404' in str(err):
/build/statsmodels-lE1Zrp/statsmodels-0.8.0~rc1+git59-gef47cd9/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/datasets/utils.pyc in _urlopen_cached(url, cache)
209 # not using the cache or didn't find it in cache
210 if not from_cache:
--> 211 data = urlopen(url).read()
212 if cache is not None: # then put it in the cache
213 _cache_it(data, cache_path)
/usr/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout, cafile, capath, cadefault, context)
152 else:
153 opener = _opener
--> 154 return opener.open(url, data, timeout)
155
156 def install_opener(opener):
/usr/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
427 req = meth(req)
428
--> 429 response = self._open(req, data)
430
431 # post-process response
/usr/lib/python2.7/urllib2.pyc in _open(self, req, data)
445 protocol = req.get_type()
446 result = self._call_chain(self.handle_open, protocol, protocol +
--> 447 '_open', req)
448 if result:
449 return result
/usr/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
405 func = getattr(handler, meth_name)
406
--> 407 result = func(*args)
408 if result is not None:
409 return result
/usr/lib/python2.7/urllib2.pyc in https_open(self, req)
1239 def https_open(self, req):
1240 return self.do_open(httplib.HTTPSConnection, req,
-> 1241 context=self._context)
1242
1243 https_request = AbstractHTTPHandler.do_request_
/usr/lib/python2.7/urllib2.pyc in do_open(self, http_class, req, **http_conn_args)
1196 except socket.error, err: # XXX what error?
1197 h.close()
-> 1198 raise URLError(err)
1199 else:
1200 try:
URLError: <urlopen error [Errno -2] Name or service not known>
In [5]: df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
NameErrorTraceback (most recent call last)
<ipython-input-5-c0f7df8f22c7> in <module>()
----> 1 df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
NameError: name 'df' is not defined
In [6]: df.head()
NameErrorTraceback (most recent call last)
<ipython-input-6-2569c44faf66> in <module>()
----> 1 df.head()
NameError: name 'df' is not defined
Fit the model:
In [7]: mod = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
NameErrorTraceback (most recent call last)
<ipython-input-7-fc74d7ce0f53> in <module>()
----> 1 mod = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
NameError: name 'df' is not defined
In [8]: res = mod.fit()
NameErrorTraceback (most recent call last)
<ipython-input-8-deef2687e692> in <module>()
----> 1 res = mod.fit()
NameError: name 'mod' is not defined
In [9]: print(res.summary())
NameErrorTraceback (most recent call last)
<ipython-input-9-a8dc848a1f25> in <module>()
----> 1 print(res.summary())
NameError: name 'res' is not defined
Categorical variables¶
Looking at the summary printed above, notice that patsy
determined
that elements of Region were text strings, so it treated Region as a
categorical variable. patsy
‘s default is also to include an
intercept, so we automatically dropped one of the Region categories.
If Region had been an integer variable that we wanted to treat
explicitly as categorical, we could have done so by using the C()
operator:
In [10]: res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
NameErrorTraceback (most recent call last)
<ipython-input-10-512c713f4dd3> in <module>()
----> 1 res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
NameError: name 'df' is not defined
In [11]: print(res.params)
NameErrorTraceback (most recent call last)
<ipython-input-11-c61a950343b4> in <module>()
----> 1 print(res.params)
NameError: name 'res' is not defined
Examples more advanced features patsy
‘s categorical variables
function can be found here: Patsy: Contrast Coding Systems for
categorical variables
Operators¶
We have already seen that “~” separates the left-hand side of the model from the right-hand side, and that “+” adds new columns to the design matrix.
Removing variables¶
The “-” sign can be used to remove columns/variables. For instance, we can remove the intercept from a model by:
In [12]: res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region) -1 ', data=df).fit()
NameErrorTraceback (most recent call last)
<ipython-input-12-01cb7f19a578> in <module>()
----> 1 res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region) -1 ', data=df).fit()
NameError: name 'df' is not defined
In [13]: print(res.params)
NameErrorTraceback (most recent call last)
<ipython-input-13-c61a950343b4> in <module>()
----> 1 print(res.params)
NameError: name 'res' is not defined
Multiplicative interactions¶
”:” adds a new column to the design matrix with the product of the other two columns. “*” will also include the individual columns that were multiplied together:
In [14]: res1 = smf.ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()
NameErrorTraceback (most recent call last)
<ipython-input-14-b102884e9732> in <module>()
----> 1 res1 = smf.ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()
NameError: name 'df' is not defined
In [15]: res2 = smf.ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()
NameErrorTraceback (most recent call last)
<ipython-input-15-e7613c271b74> in <module>()
----> 1 res2 = smf.ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()
NameError: name 'df' is not defined
In [16]: print(res1.params)
NameErrorTraceback (most recent call last)
<ipython-input-16-9e7f9958623f> in <module>()
----> 1 print(res1.params)
NameError: name 'res1' is not defined
In [17]: print(res2.params)
NameErrorTraceback (most recent call last)
<ipython-input-17-5ef235c70f25> in <module>()
----> 1 print(res2.params)
NameError: name 'res2' is not defined
Many other things are possible with operators. Please consult the patsy docs to learn more.
Functions¶
You can apply vectorized functions to the variables in your model:
In [18]: res = smf.ols(formula='Lottery ~ np.log(Literacy)', data=df).fit()
NameErrorTraceback (most recent call last)
<ipython-input-18-4fe1546acec0> in <module>()
----> 1 res = smf.ols(formula='Lottery ~ np.log(Literacy)', data=df).fit()
NameError: name 'df' is not defined
In [19]: print(res.params)
NameErrorTraceback (most recent call last)
<ipython-input-19-c61a950343b4> in <module>()
----> 1 print(res.params)
NameError: name 'res' is not defined
Define a custom function:
In [20]: def log_plus_1(x):
....: return np.log(x) + 1.
....:
In [21]: print(res.params)
NameErrorTraceback (most recent call last)
<ipython-input-21-c61a950343b4> in <module>()
----> 1 print(res.params)
NameError: name 'res' is not defined
Namespaces¶
Notice that all of the above examples use the calling namespace to look for the functions to apply. The namespace used can be controlled via the eval_env
keyword. For example, you may want to give a custom namespace using the patsy:patsy.EvalEnvironment
or you may want to use a “clean” namespace, which we provide by passing eval_func=-1
. The default is to use the caller’s namespace. This can have (un)expected consequences, if, for example, someone has a variable names C
in the user namespace or in their data structure passed to patsy
, and C
is used in the formula to handle a categorical variable. See the Patsy API Reference for more information.
Using formulas with models that do not (yet) support them¶
Even if a given statsmodels
function does not support formulas, you
can still use patsy
‘s formula language to produce design matrices.
Those matrices can then be fed to the fitting function as endog
and
exog
arguments.
To generate numpy
arrays:
In [22]: import patsy
In [23]: f = 'Lottery ~ Literacy * Wealth'
In [24]: y, X = patsy.dmatrices(f, df, return_type='dataframe')
NameErrorTraceback (most recent call last)
<ipython-input-24-418e6636605c> in <module>()
----> 1 y, X = patsy.dmatrices(f, df, return_type='dataframe')
NameError: name 'df' is not defined
In [25]: print(y[:5])
NameErrorTraceback (most recent call last)
<ipython-input-25-8b56dfab7799> in <module>()
----> 1 print(y[:5])
NameError: name 'y' is not defined
In [26]: print(X[:5])
NameErrorTraceback (most recent call last)
<ipython-input-26-7dbdd9618cc6> in <module>()
----> 1 print(X[:5])
NameError: name 'X' is not defined
To generate pandas data frames:
In [27]: f = 'Lottery ~ Literacy * Wealth'
In [28]: y, X = patsy.dmatrices(f, df, return_type='dataframe')
NameErrorTraceback (most recent call last)
<ipython-input-28-418e6636605c> in <module>()
----> 1 y, X = patsy.dmatrices(f, df, return_type='dataframe')
NameError: name 'df' is not defined
In [29]: print(y[:5])
NameErrorTraceback (most recent call last)
<ipython-input-29-8b56dfab7799> in <module>()
----> 1 print(y[:5])
NameError: name 'y' is not defined
In [30]: print(X[:5])
NameErrorTraceback (most recent call last)
<ipython-input-30-7dbdd9618cc6> in <module>()
----> 1 print(X[:5])
NameError: name 'X' is not defined
In [31]: print(smf.OLS(y, X).fit().summary())
NameErrorTraceback (most recent call last)
<ipython-input-31-6bf304b23cf5> in <module>()
----> 1 print(smf.OLS(y, X).fit().summary())
NameError: name 'y' is not defined