skbio.stats.distance.permdisp(distance_matrix, grouping, column=None, test='median', permutations=999)
Test for Homogeneity of Multivariate Groups Dispersions using Marti Anderson’s PERMDISP2 procedure.
State: Experimental as of 0.5.2.
PERMDISP is a multivariate analogue of Levene’s test for homogeneity of variances. Distances are handled by reducing the original distances to principal coordinates. PERMDISP calculates an F-statistic to assess whether the dispersions differ significantly between groups.
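The idea behind the statistic can be sketched in plain Python. This is a minimal illustration, not the library’s implementation: it assumes the objects are already expressed as Euclidean coordinates (so the principal-coordinates step is skipped), it uses centroids only, and the helper names `centroid` and `dispersion_f` are hypothetical. Each object’s distance to its group centroid is computed, and a one-way ANOVA F-statistic is then calculated on those distances.

```python
import math


def centroid(points):
    """Arithmetic mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def dispersion_f(coords, grouping):
    """One-way ANOVA F on the distances from each point to its group centroid.

    Hypothetical helper for illustration; assumes Euclidean coordinates.
    """
    # Collect the points belonging to each group label.
    groups = {}
    for point, label in zip(coords, grouping):
        groups.setdefault(label, []).append(point)

    # Distance from every point to the centroid of its own group.
    dists = {}
    for label, pts in groups.items():
        center = centroid(pts)
        dists[label] = [math.dist(p, center) for p in pts]

    n, k = len(coords), len(groups)
    grand_mean = sum(d for ds in dists.values() for d in ds) / n

    ss_between = 0.0
    ss_within = 0.0
    for ds in dists.values():
        mean = sum(ds) / len(ds)
        ss_between += len(ds) * (mean - grand_mean) ** 2
        ss_within += sum((d - mean) ** 2 for d in ds)

    # Ratio of between-group to within-group mean squares.
    return (ss_between / (k - 1)) / (ss_within / (n - k))


coords = [(0, 0), (1, 0), (2, 0), (0, 5), (3, 5), (6, 5)]
grouping = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
f_stat = dispersion_f(coords, grouping)
print(round(f_stat, 6))  # 1.6
```

In this toy data set the second group is three times as spread out as the first, which is what the F-statistic responds to. The locations of the groups do not enter the calculation.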
Parameters: |
|
---|---|
Returns: | Results of the statistical test, including |
Return type: | pandas.Series |
Raises: |
|
See also
Notes
The significance of the results from this function will be the same as that found with vegan’s betadisper; however, due to floating-point variability, the F-statistic may vary slightly.
See [1] for the original method reference, as well as vegan::betadisper, available in R’s vegan package [2].
References
[1] | Anderson, Marti J. “Distance-Based Tests for Homogeneity of Multivariate Dispersions.” Biometrics 62 (2006):245-253 |
[2] | http://cran.r-project.org/web/packages/vegan/index.html |
Examples
Load a 6x6 distance matrix and grouping vector denoting 2 groups of objects:
>>> from skbio import DistanceMatrix
>>> dm = DistanceMatrix([[0, 0.5, 0.75, 1, 0.66, 0.33],
... [0.5, 0, 0.25, 0.33, 0.77, 0.61],
... [0.75, 0.25, 0, 0.1, 0.44, 0.55],
... [1, 0.33, 0.1, 0, 0.75, 0.88],
... [0.66, 0.77, 0.44, 0.75, 0, 0.77],
... [0.33, 0.61, 0.55, 0.88, 0.77, 0]],
... ['s1', 's2', 's3', 's4', 's5', 's6'])
>>> grouping = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
Run PERMDISP using 99 permutations to calculate the p-value:
>>> from skbio.stats.distance import permdisp
>>> import numpy as np
>>> # make output deterministic; should not be included during normal use
>>> np.random.seed(0)
>>> permdisp(dm, grouping, permutations=99)
method name PERMDISP
test statistic name F-value
sample size 6
number of groups 2
test statistic 1.03296
p-value 0.35
number of permutations 99
Name: PERMDISP results, dtype: object
The return value is a pandas.Series object containing the results of the statistical test.
To suppress calculation of the p-value and only obtain the F statistic, specify zero permutations:
>>> permdisp(dm, grouping, permutations=0)
method name PERMDISP
test statistic name F-value
sample size 6
number of groups 2
test statistic 1.03296
p-value NaN
number of permutations 0
Name: PERMDISP results, dtype: object
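How the number of permutations relates to the p-value can be sketched as follows. This is a generic permutation test under stated assumptions, not the library’s code: the helper names `mean_range` and `permutation_p_value` are hypothetical, a simple difference of group means stands in for the F-statistic, and the p-value is taken to be the common estimate (count + 1) / (permutations + 1). That estimate is consistent with the 0.35 reported for 99 permutations above (35 / 100), and the sketch returns NaN when zero permutations are requested.

```python
import random


def mean_range(values, labels):
    """Toy statistic: largest group mean minus smallest group mean."""
    by_group = {}
    for value, label in zip(values, labels):
        by_group.setdefault(label, []).append(value)
    means = [sum(vs) / len(vs) for vs in by_group.values()]
    return max(means) - min(means)


def permutation_p_value(stat_fn, values, grouping, permutations, seed=None):
    """Return (observed statistic, permutation p-value).

    Hypothetical helper: shuffles the group labels, recomputes the
    statistic, and counts how often it is at least as large as observed.
    """
    observed = stat_fn(values, grouping)
    if permutations == 0:
        # No permutations requested: only the statistic is available.
        return observed, float('nan')

    rng = random.Random(seed)
    labels = list(grouping)
    count = 0
    for _ in range(permutations):
        rng.shuffle(labels)
        if stat_fn(values, labels) >= observed:
            count += 1
    # The +1 terms count the observed labelling as one of the arrangements.
    return observed, (count + 1) / (permutations + 1)


values = [1.0, 1.2, 0.9, 3.1, 2.8, 3.3]
labels = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
observed, p_value = permutation_p_value(mean_range, values, labels,
                                        permutations=99, seed=0)
_, no_p = permutation_p_value(mean_range, values, labels, permutations=0)
```

With this estimate the smallest attainable p-value is 1 / (permutations + 1), so the number of permutations sets the resolution of the test.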
PERMDISP measures dispersion as the distance from each object to the center of its group, using one of two definitions of the center: the centroid or the spatial median (also commonly called the geometric median). The spatial median is thought to yield a more robust test statistic and is used by default. Spatial medians are computed with an iterative algorithm that finds the point minimizing the sum of distances to all points in a group, whereas centroids are computed with a closed-form formula. As such, the two tests yield different F statistics.
>>> np.random.seed(0)
>>> permdisp(dm, grouping, test='centroid', permutations=6)
method name PERMDISP
test statistic name F-value
sample size 6
number of groups 2
test statistic 3.67082
p-value 0.428571
number of permutations 6
Name: PERMDISP results, dtype: object
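The difference between the two centers can be illustrated with a short sketch. Weiszfeld’s algorithm is used here as one common iterative scheme for the geometric median. Whether it matches the routine used by the library is an assumption, and the helper names `centroid` and `geometric_median` are hypothetical.

```python
import math


def centroid(points):
    """Closed-form center: the coordinate-wise arithmetic mean."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def geometric_median(points, tol=1e-9, max_iter=1000):
    """Approximate the point minimizing the sum of distances to `points`.

    Weiszfeld iteration: repeatedly re-average the points with weights
    inversely proportional to their distance from the current estimate.
    """
    estimate = centroid(points)
    for _ in range(max_iter):
        weights = []
        for p in points:
            d = math.dist(p, estimate)
            if d < tol:
                # Simplification: stop if the estimate lands on a data point,
                # where the inverse-distance weight is undefined.
                return estimate
            weights.append(1.0 / d)
        total = sum(weights)
        updated = [
            sum(w * p[k] for w, p in zip(weights, points)) / total
            for k in range(len(estimate))
        ]
        if math.dist(updated, estimate) < tol:
            return updated
        estimate = updated
    return estimate


# Three clustered points and one distant outlier.
points = [(0, 0), (1, 0), (0, 1), (10, 10)]
mean_center = centroid(points)            # [2.75, 2.75]
median_center = geometric_median(points)  # approximately [0.5, 0.5]
```

The outlier drags the centroid well away from the cluster, while the geometric median stays close to it. This is the sense in which the spatial median is considered more robust.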
You can also provide a pandas.DataFrame and a column denoting the grouping instead of a grouping vector. The following DataFrame’s Grouping column specifies the same grouping as the vector we used in the previous examples:
>>> import pandas as pd
>>> df = pd.DataFrame.from_dict(
... {'Grouping': {'s1': 'G1', 's2': 'G1', 's3': 'G1', 's4': 'G2',
... 's5': 'G2', 's6': 'G2'}})
>>> permdisp(dm, df, 'Grouping', permutations=6, test='centroid')
method name PERMDISP
test statistic name F-value
sample size 6
number of groups 2
test statistic 3.67082
p-value 0.428571
number of permutations 6
Name: PERMDISP results, dtype: object
Note that when providing a DataFrame, the ordering of rows and/or columns does not affect the grouping vector that is extracted. The DataFrame must be indexed by the distance matrix IDs (i.e., the row labels must be distance matrix IDs).
If IDs (rows) are present in the DataFrame but not in the distance matrix, they are ignored. For example, if the DataFrame contained a seventh row with ID s7, only the 6 objects in the distance matrix would be used in the test (as reported in the “sample size” row of the results). Thus, the DataFrame can be a superset of the distance matrix IDs. Note that the reverse is not true: IDs in the distance matrix must be present in the DataFrame or an error will be raised.
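The alignment rules above can be sketched with pandas alone. The helper `extract_grouping` is hypothetical and only mimics the described behavior. It is not the library’s internal function, and the exception type raised by the library is not stated here, so `ValueError` is an assumption.

```python
import pandas as pd


def extract_grouping(ids, df, column):
    """Return group labels ordered like `ids` (hypothetical helper).

    Extra rows in `df` are ignored; IDs missing from `df` raise an error.
    """
    missing = [i for i in ids if i not in df.index]
    if missing:
        raise ValueError(f"IDs missing from the DataFrame: {missing}")
    # Label-based lookup, so the row order of `df` does not matter.
    return df.loc[list(ids), column].tolist()


ids = ['s1', 's2', 's3', 's4', 's5', 's6']

# Rows are shuffled and include an extra ID, s7, absent from `ids`.
df = pd.DataFrame(
    {'Grouping': ['G2', 'G1', 'G1', 'G2', 'G1', 'G2', 'G1']},
    index=['s6', 's1', 's7', 's4', 's2', 's5', 's3'])

grouping = extract_grouping(ids, df, 'Grouping')
print(grouping)  # ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
```

Calling `extract_grouping(ids + ['s8'], df, 'Grouping')` raises an error because s8 has no row in the DataFrame, which mirrors the superset rule.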
PERMDISP should be used to determine whether the dispersions of the groups in your distance matrix differ significantly. A non-significant test result indicates that group dispersions are similar to each other. PERMANOVA or ANOSIM should then be used in conjunction to determine whether clustering within groups is significant.