skbio.stats.distance.bioenv(distance_matrix, data_frame, columns=None)
Find subset of variables maximally correlated with distances.
State: Experimental as of 0.4.0.
Finds subsets of variables whose Euclidean distances (after scaling the variables; see Notes section below for details) are maximally rank-correlated with the distance matrix. For example, the distance matrix might contain distances between communities, and the variables might be numeric environmental variables (e.g., pH). Correlation between the community distance matrix and Euclidean environmental distance matrix is computed using Spearman’s rank correlation coefficient (\(\rho\)).
Subsets of environmental variables range in size from 1 to the total number of variables (inclusive). For example, if there are 3 variables, the “best” variable subsets will be computed for subset sizes 1, 2, and 3.
The “best” subset is chosen by computing the correlation between the community distance matrix and all possible Euclidean environmental distance matrices at the given subset size. The combination of environmental variables with maximum correlation is chosen as the “best” subset.
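The exhaustive search described above can be sketched in a few lines. This is an illustration only, not scikit-bio's implementation: the helper name `best_subsets` and its arguments are hypothetical, the community distances are passed in condensed (upper-triangle) form to match the pair order of `scipy.spatial.distance.pdist`, and the scaling step assumes the sample standard deviation (`ddof=1`).

```python
from itertools import combinations

import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr


def best_subsets(community_condensed, env, names):
    """Illustrative exhaustive search (hypothetical helper, not skbio code).

    community_condensed: 1-D array of pairwise community distances, in the
        same pair order that scipy's pdist produces.
    env: (n_objects, n_variables) array-like of numeric variables.
    names: variable names, one per column of env.
    """
    env = np.asarray(env, dtype=float)
    # Center each column and scale it by its standard deviation
    # (ddof=1 is an assumption made for this sketch).
    scaled = (env - env.mean(axis=0)) / env.std(axis=0, ddof=1)

    results = []
    for size in range(1, len(names) + 1):
        best_rho, best_vars = -np.inf, None
        # Try every combination of variables at this subset size.
        for cols in combinations(range(len(names)), size):
            env_dists = pdist(scaled[:, list(cols)], metric="euclidean")
            rho, _ = spearmanr(community_condensed, env_dists)
            if rho > best_rho:
                best_rho = rho
                best_vars = tuple(names[i] for i in cols)
        results.append((best_vars, size, best_rho))
    return results


# Upper triangle of the 4x4 matrix used in the Examples section,
# ordered (A,B), (A,C), (A,D), (B,C), (B,D), (C,D).
community = np.array([0.5, 0.25, 0.75, 0.1, 0.42, 0.33])
env = [[7.0, 400], [8.0, 530], [7.5, 450], [8.5, 810]]

results = best_subsets(community, env, ["pH", "Elevation"])
for variables, size, rho in results:
    print(size, ", ".join(variables), round(rho, 6))
```

On the data from the Examples section, this sketch should select the same subsets and correlations that `bioenv` reports there.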
Parameters: |
|
---|---|
Returns: | Data frame containing the “best” subset of variables at each subset size, as well as the correlation coefficient of each. |
Return type: | pandas.DataFrame |
Raises: |
|
See also
scipy.stats.spearmanr()
Notes
See [1] for the original method reference (originally called BIO-ENV).
The general algorithm and interface are similar to vegan::bioenv, available in R’s vegan package [2]. This method can also be found in PRIMER-E [3] (originally called BIO-ENV, but now called BEST).
Warning
This method can take a long time to run if a large number of variables is specified, as all possible subsets are evaluated at each subset size. For \(n\) variables, that is \(2^n - 1\) subsets in total.
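To get a feel for how quickly the work grows, the number of candidate subsets can be counted directly. The helper name `n_subsets` is hypothetical and only sums the binomial coefficients over every subset size; it says nothing about the actual running time, which also depends on the number of objects.

```python
from math import comb


def n_subsets(n_variables):
    """Count the non-empty subsets evaluated across all subset sizes."""
    return sum(comb(n_variables, k) for k in range(1, n_variables + 1))


for n in (3, 10, 20):
    # 3 variables -> 7 subsets, 10 -> 1023, 20 -> 1048575
    print(n, n_subsets(n))
```

If only some variables are of interest, restricting the search with the columns parameter keeps this count small.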
The variables are scaled before computing the Euclidean distance: each column is centered and then scaled by its standard deviation.
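A minimal sketch of this scaling step with pandas is shown below. It assumes the sample standard deviation (pandas' default, `ddof=1`); whether `bioenv` uses exactly this convention is an assumption here. The choice does not affect the result, because switching between `ddof=0` and `ddof=1` multiplies every column by the same constant, which leaves the ranks of the Euclidean distances unchanged.

```python
import pandas as pd

df = pd.DataFrame(
    [[7.0, 400], [8.0, 530], [7.5, 450], [8.5, 810]],
    index=["A", "B", "C", "D"],
    columns=["pH", "Elevation"],
)

# Center each column on its mean, then divide by its standard deviation.
# DataFrame.std() defaults to the sample standard deviation (ddof=1).
scaled = (df - df.mean()) / df.std()

# After scaling, every column has mean 0 and standard deviation 1, so
# variables measured in different units contribute comparably to the
# Euclidean distance.
print(scaled.round(3))
```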
References
[1] | Clarke, K. R. & Ainsworth, M. 1993. “A method of linking multivariate community structure to environmental variables”. Marine Ecology Progress Series, 92, 205-219. |
[2] | http://cran.r-project.org/web/packages/vegan/index.html |
[3] | http://www.primer-e.com/primer.htm |
Examples
Import the functionality we’ll use in the following examples:
>>> import pandas as pd
>>> from skbio import DistanceMatrix
>>> from skbio.stats.distance import bioenv
Load a 4x4 community distance matrix:
>>> dm = DistanceMatrix([[0.0, 0.5, 0.25, 0.75],
...                      [0.5, 0.0, 0.1, 0.42],
...                      [0.25, 0.1, 0.0, 0.33],
...                      [0.75, 0.42, 0.33, 0.0]],
...                     ['A', 'B', 'C', 'D'])
Load a pandas.DataFrame with two environmental variables, pH and elevation:
>>> df = pd.DataFrame([[7.0, 400],
...                    [8.0, 530],
...                    [7.5, 450],
...                    [8.5, 810]],
...                   index=['A','B','C','D'],
...                   columns=['pH', 'Elevation'])
Note that the data frame is indexed with the same IDs ('A', 'B', 'C', and 'D') that are in the distance matrix. This is necessary in order to link the environmental variables (metadata) to each of the objects in the distance matrix. In this example, the IDs appear in the same order in both the distance matrix and data frame, but this is not necessary.
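One way to picture this linking by ID, not necessarily how bioenv does it internally, is to reorder the data frame rows by label so they follow the distance matrix's ID order. The names `dm_ids`, `shuffled`, and `aligned` are illustrative only.

```python
import pandas as pd

# ID order of the distance matrix in this example.
dm_ids = ["A", "B", "C", "D"]

# Same values as the data frame above, but with the rows in a different order.
shuffled = pd.DataFrame(
    [[8.5, 810], [7.0, 400], [7.5, 450], [8.0, 530]],
    index=["D", "A", "C", "B"],
    columns=["pH", "Elevation"],
)

# Label-based selection puts the rows back in distance-matrix order.
# It raises a KeyError if an ID from the distance matrix is missing.
aligned = shuffled.loc[dm_ids]
print(aligned)
```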
Find the best subsets of environmental variables that are correlated with community distances:
>>> bioenv(dm, df)  # doctest: +NORMALIZE_WHITESPACE
               size  correlation
vars
pH                1     0.771517
pH, Elevation     2     0.714286
We see that in this simple example, pH alone is maximally rank-correlated with the community distances (\(\rho=0.771517\)).