The Datasets Package
statsmodels provides data sets (i.e. data and meta-data) for use in examples, tutorials, model testing, etc.
Using Datasets from Stata
webuse(data[, baseurl, as_df]): Download and return an example dataset from Stata.
Using Datasets from R
The Rdatasets project gives access to the datasets available in R’s core datasets package and in many other common R packages. All of these datasets are available to statsmodels through the get_rdataset function, which downloads the requested dataset. The actual data is accessible through the data attribute. For example:
In [1]: import statsmodels.api as sm
In [2]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")
In [3]: print(duncan_prestige.__doc__)
In [4]: duncan_prestige.data.head(5)
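Because get_rdataset fetches the data over the network, the call raises a URLError on a machine that cannot reach the host. The sketch below shows one way to handle that case. The helper try_get_rdataset, its fetch parameter, and the offline_fetch stand-in are hypothetical names introduced for illustration; only get_rdataset itself comes from statsmodels.

```python
from urllib.error import URLError


def try_get_rdataset(dataname, package="datasets", fetch=None):
    """Return the R dataset, or None when the download cannot be completed.

    `fetch` is a hypothetical hook for this sketch; by default it is
    statsmodels' own get_rdataset.
    """
    if fetch is None:
        # Imported lazily so the helper can be tried without statsmodels.
        from statsmodels.datasets import get_rdataset as fetch
    try:
        return fetch(dataname, package)
    except URLError as err:
        # Raised when the host cannot be reached (no network, DNS failure).
        print("Could not download %s/%s: %s" % (package, dataname, err.reason))
        return None


def offline_fetch(dataname, package):
    """Stand-in fetcher that simulates a machine without network access."""
    raise URLError("Name or service not known")


result = try_get_rdataset("Duncan", "car", fetch=offline_fetch)
print(result)  # None, because the simulated download failed
```

Calling try_get_rdataset("Duncan", "car") without the fetch argument uses the real downloader, so on a connected machine it should return the Dataset object instead of None.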
R Datasets Function Reference
get_rdataset(dataname[, package, cache]): Download and return an R dataset.
get_data_home([data_home]): Return the path of the statsmodels data dir.
clear_data_home([data_home]): Delete all the content of the data home cache.
Available Datasets
- American National Election Survey 1996
- Breast Cancer Data
- Bill Greene’s credit scoring data
- Smoking and lung cancer in eight cities in China
- Mauna Loa Weekly Atmospheric CO2 Data
- First 100 days of the US House of Representatives 1995
- World Copper Market 1951-1975 Dataset
- US Capital Punishment dataset
- El Nino - Sea Surface Temperatures
- Engel (1857) food expenditure data
- Affairs dataset
- World Bank Fertility Data
- Grunfeld (1950) Investment Data
- Transplant Survival Data
- Longley dataset
- United States Macroeconomic data
- Travel Mode Choice
- Nile River flows at Aswan 1871-1970
- RAND Health Insurance Experiment Data
- Taxation Powers Vote for the Scottish Parliament 1997
- Spector and Mazzeo (1980) - Program Effectiveness Data
- Stack loss data
- Star98 Educational Dataset
- Statewide Crime Data 2009
- U.S. Strike Duration Data
- Yearly sunspots data 1700-2008
Usage
Load a dataset:
In [5]: import statsmodels.api as sm
In [6]: data = sm.datasets.longley.load()
The Dataset object follows the bunch pattern explained in the datasets proposal. The full dataset is available in the data attribute.
In [7]: data.data
Out[7]:
rec.array([(60323.0, 83.0, 234289.0, 2356.0, 1590.0, 107608.0, 1947.0),
(61122.0, 88.5, 259426.0, 2325.0, 1456.0, 108632.0, 1948.0),
(60171.0, 88.2, 258054.0, 3682.0, 1616.0, 109773.0, 1949.0),
(61187.0, 89.5, 284599.0, 3351.0, 1650.0, 110929.0, 1950.0),
(63221.0, 96.2, 328975.0, 2099.0, 3099.0, 112075.0, 1951.0),
(63639.0, 98.1, 346999.0, 1932.0, 3594.0, 113270.0, 1952.0),
(64989.0, 99.0, 365385.0, 1870.0, 3547.0, 115094.0, 1953.0),
(63761.0, 100.0, 363112.0, 3578.0, 3350.0, 116219.0, 1954.0),
(66019.0, 101.2, 397469.0, 2904.0, 3048.0, 117388.0, 1955.0),
(67857.0, 104.6, 419180.0, 2822.0, 2857.0, 118734.0, 1956.0),
(68169.0, 108.4, 442769.0, 2936.0, 2798.0, 120445.0, 1957.0),
(66513.0, 110.8, 444546.0, 4681.0, 2637.0, 121950.0, 1958.0),
(68655.0, 112.6, 482704.0, 3813.0, 2552.0, 123366.0, 1959.0),
(69564.0, 114.2, 502601.0, 3931.0, 2514.0, 125368.0, 1960.0),
(69331.0, 115.7, 518173.0, 4806.0, 2572.0, 127852.0, 1961.0),
(70551.0, 116.9, 554894.0, 4007.0, 2827.0, 130081.0, 1962.0)],
dtype=[('TOTEMP', '<f8'), ('GNPDEFL', '<f8'), ('GNP', '<f8'), ('UNEMP', '<f8'), ('ARMED', '<f8'), ('POP', '<f8'), ('YEAR', '<f8')])
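To make the bunch pattern concrete, here is a minimal sketch of a dict whose keys can also be read as attributes. The Bunch class and the longley_like object are illustrations only, not the actual statsmodels Dataset implementation.

```python
class Bunch(dict):
    """Hypothetical minimal bunch: a dict whose keys are also attributes."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Point the instance namespace at the dict itself, so that
        # bunch.key and bunch["key"] always refer to the same value.
        self.__dict__ = self


# A toy object with the same kind of fields a Dataset exposes.
longley_like = Bunch(
    data=[(60323.0, 83.0), (61122.0, 88.5)],
    names=["TOTEMP", "GNPDEFL"],
    endog_name="TOTEMP",
)

print(longley_like.endog_name)  # attribute access
print(longley_like["names"])    # equivalent key access
```

The appeal of the pattern is that a dataset can carry an arbitrary set of named fields (data, names, endog, and so on) without a dedicated class for each dataset.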
Most datasets hold convenient representations of the data in the attributes endog and exog:
In [8]: data.endog[:5]
Out[8]: array([ 60323., 61122., 60171., 61187., 63221.])
In [9]: data.exog[:5,:]
Out[9]:
array([[ 83. , 234289. , 2356. , 1590. , 107608. , 1947. ],
[ 88.5, 259426. , 2325. , 1456. , 108632. , 1948. ],
[ 88.2, 258054. , 3682. , 1616. , 109773. , 1949. ],
[ 89.5, 284599. , 3351. , 1650. , 110929. , 1950. ],
[ 96.2, 328975. , 2099. , 3099. , 112075. , 1951. ]])
Univariate datasets, however, do not have an exog attribute.
Variable names can be obtained by typing:
In [10]: data.endog_name
Out[10]: 'TOTEMP'
In [11]: data.exog_name
Out[11]: ['GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']
If a dataset has no clear interpretation of which variables should be endog and which exog, you can always access the data or raw_data attributes. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset with a specific example in mind. The data attribute contains a record array of the full dataset, and the raw_data attribute contains an ndarray whose column names are given by the names attribute.
In [12]: type(data.data)
Out[12]: numpy.recarray
In [13]: type(data.raw_data)
Out[13]: numpy.recarray
In [14]: data.names
Out[14]: ['TOTEMP', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']
Loading data as pandas objects
For many users it may be preferable to get the datasets as a pandas DataFrame or Series object. Each of the dataset modules is equipped with a load_pandas function which returns a Dataset instance with the data readily available as pandas objects:
In [15]: data = sm.datasets.longley.load_pandas()
In [16]: data.exog
Out[16]:
GNPDEFL GNP UNEMP ARMED POP YEAR
0 83.0 234289.0 2356.0 1590.0 107608.0 1947.0
1 88.5 259426.0 2325.0 1456.0 108632.0 1948.0
2 88.2 258054.0 3682.0 1616.0 109773.0 1949.0
3 89.5 284599.0 3351.0 1650.0 110929.0 1950.0
4 96.2 328975.0 2099.0 3099.0 112075.0 1951.0
5 98.1 346999.0 1932.0 3594.0 113270.0 1952.0
6 99.0 365385.0 1870.0 3547.0 115094.0 1953.0
7 100.0 363112.0 3578.0 3350.0 116219.0 1954.0
8 101.2 397469.0 2904.0 3048.0 117388.0 1955.0
9 104.6 419180.0 2822.0 2857.0 118734.0 1956.0
10 108.4 442769.0 2936.0 2798.0 120445.0 1957.0
11 110.8 444546.0 4681.0 2637.0 121950.0 1958.0
12 112.6 482704.0 3813.0 2552.0 123366.0 1959.0
13 114.2 502601.0 3931.0 2514.0 125368.0 1960.0
14 115.7 518173.0 4806.0 2572.0 127852.0 1961.0
15 116.9 554894.0 4007.0 2827.0 130081.0 1962.0
In [17]: data.endog
Out[17]:
0 60323.0
1 61122.0
2 60171.0
3 61187.0
4 63221.0
5 63639.0
6 64989.0
7 63761.0
8 66019.0
9 67857.0
10 68169.0
11 66513.0
12 68655.0
13 69564.0
14 69331.0
15 70551.0
Name: TOTEMP, dtype: float64
The full DataFrame is available in the data attribute of the Dataset object:
In [18]: data.data
Out[18]:
TOTEMP GNPDEFL GNP UNEMP ARMED POP YEAR
0 60323.0 83.0 234289.0 2356.0 1590.0 107608.0 1947.0
1 61122.0 88.5 259426.0 2325.0 1456.0 108632.0 1948.0
2 60171.0 88.2 258054.0 3682.0 1616.0 109773.0 1949.0
3 61187.0 89.5 284599.0 3351.0 1650.0 110929.0 1950.0
4 63221.0 96.2 328975.0 2099.0 3099.0 112075.0 1951.0
5 63639.0 98.1 346999.0 1932.0 3594.0 113270.0 1952.0
6 64989.0 99.0 365385.0 1870.0 3547.0 115094.0 1953.0
7 63761.0 100.0 363112.0 3578.0 3350.0 116219.0 1954.0
8 66019.0 101.2 397469.0 2904.0 3048.0 117388.0 1955.0
9 67857.0 104.6 419180.0 2822.0 2857.0 118734.0 1956.0
10 68169.0 108.4 442769.0 2936.0 2798.0 120445.0 1957.0
11 66513.0 110.8 444546.0 4681.0 2637.0 121950.0 1958.0
12 68655.0 112.6 482704.0 3813.0 2552.0 123366.0 1959.0
13 69564.0 114.2 502601.0 3931.0 2514.0 125368.0 1960.0
14 69331.0 115.7 518173.0 4806.0 2572.0 127852.0 1961.0
15 70551.0 116.9 554894.0 4007.0 2827.0 130081.0 1962.0
With pandas integration in the estimation classes, the metadata is attached to model results.
Extra Information
If you want to know more about the dataset itself, you can access the following, again using the Longley dataset as an example:
>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']
Additional information
- The idea for a datasets package was originally proposed by David Cournapeau, with updates by Skipper Seabold.
- To add datasets, see the notes on adding a dataset.