skbio.stats.power.
subsample_power
(test, samples, draw_mode='ind', alpha_pwr=0.05, ratio=None, max_counts=50, counts_interval=10, min_counts=None, num_iter=500, num_runs=10)[source]¶Subsamples data to iteratively calculate power
State: Experimental as of 0.4.0.
Parameters: |
|
---|---|
Returns: |
|
Raises: |
|
Examples
Let’s say we wanted to look at the relationship between the presence of a specific bacteria, Gardnerella vaginalis in the vaginal community, and the probability of a pre or post menopausal woman experiencing a urinary tract infection (UTI). Healthy women were enrolled in the study either before or after menopause, and followed for eight weeks. Participants submitted fecal samples at the beginning of the study, and were then followed for clinical symptoms of a UTI. A confirmed UTI was an endpoint in the study.
Using available literature and 16S sequencing, a set of candidate taxa were identified as correlated with UTIs, including G. vaginalis. In the 100 women (50 premenopausal and 50 postmenopausal samples) who had UTIs, the presence or absence of G. vaginalis was confirmed with quantitative PCR.
We can model the probability that detectable G. vaginalis was found in these samples using a binomial model. (Note that this is a simulation.)
>>> import numpy as np
>>> np.random.seed(25)
>>> pre_rate = np.random.binomial(1, 0.85, size=(50,))
>>> pre_rate.sum()
45
>>> pos_rate = np.random.binomial(1, 0.40, size=(50,))
>>> pos_rate.sum()
21
Let’s set up a test function, so we can test the probability of finding a difference in frequency between the two groups. We’ll use scipy.stats.chisquare to look for the difference in frequency between groups.
>>> from scipy.stats import chisquare
>>> test = lambda x: chisquare(np.array([x[i].sum() for i in
... range(len(x))]))[1]
Let’s make sure that our two distributions are different.
>>> print(round(test([pre_rate, pos_rate]), 3))
0.003
Since there are an even number of samples, and we don’t have enough information to try controlling the data, we’ll use skbio.stats.power.subsample_power to compare the two groups. If we had metadata about other risk factors, like a reproductive history, BMI, tobacco use, we might want to use skbio.stats.power.subsample_paired_power. We’ll also use “ind” draw_mode, since there is no linkage between the two groups of samples.
>>> from skbio.stats.power import subsample_power
>>> pwr_est, counts = subsample_power(test=test,
... samples=[pre_rate, pos_rate],
... num_iter=100,
... num_runs=5,
... counts_interval=5)
>>> counts
array([ 5, 10, 15, 20, 25, 30, 35, 40, 45])
>>> np.nanmean(pwr_est, axis=0) # doctest: +NORMALIZE_WHITESPACE
array([ 0.056, 0.074, 0.226, 0.46 , 0.61 , 0.806, 0.952, 1. ,
1. ])
>>> counts[np.nanmean(pwr_est, axis=0) > 0.8].min()
30
So, we can estimate that we will see a significant difference in the presence of G. vaginalis in the stool of pre and post women with UTIs if we have at least 30 samples per group.
If we wanted to test the relationship of a second candidate taxa which is more rare in the population, but may have a similar effect, based on available literature, we might also start by trying to identify 30 samples per group where the second candidate taxa is present.
Suppose, now, that we want to test that a secondary metabolite seen only in the presence of G vaginalis to see if it is also correlated with UTIs. We can model the abundance of the metabolite as a normal distribution.
>>> met_pos = (np.random.randn(pre_rate.sum() + pos_rate.sum()) * 2000 +
... 2500)
>>> met_pos[met_pos < 0] = 0
>>> met_neg = met_neg = (np.random.randn(100 - (pre_rate.sum() +
... pos_rate.sum())) * 2000 + 500)
>>> met_neg[met_neg < 0] = 0
Let’s compare the populations with a kruskal-wallis test. Physically, there cannot be a negative concentration of a chemical, so we’ve set the lower bound at 0. This means that we can no longer assume our distribution is normal.
>>> from scipy.stats import kruskal
>>> def metabolite_test(x):
... return kruskal(x[0], x[1])[1]
>>> print(round(metabolite_test([met_pos, met_neg]), 3))
0.005
When we go to perform the statistical test on all the data, you might notice that there are twice as many samples from women with G. vaginalis than those without. It might make sense to account for this difference when we’re testing power. So, we’re going to set the ratio parameter, which lets us draw twice as many samples from women with G. vaginalis.
>>> pwr_est2, counts2 = subsample_power(test=metabolite_test,
... samples=[met_pos, met_neg],
... counts_interval=5,
... num_iter=100,
... num_runs=5,
... ratio=[2, 1])
>>> counts2
array([ 5., 10., 15., 20., 25., 30.])
>>> np.nanmean(pwr_est2, axis=0)
array([ 0.14 , 0.272, 0.426, 0.646, 0.824, 0.996])
>>> counts2[np.nanmean(pwr_est2, axis=0) > 0.8].min()
25.0
When we consider the number of samples per group needed in the power analysis, we need to look at the ratio. The analysis says that we need 25 samples in the smallest group, in this case, the group of women without G. vaginalis and 50 samples from women with G. vaginalis to see a significant difference in the abundance of our secondary metabolite at 80% power.