skbio.stats.power.
subsample_paired_power
(test, meta, cat, control_cats, order=None, strict_match=True, alpha_pwr=0.05, max_counts=50, counts_interval=10, min_counts=None, num_iter=500, num_runs=10)[source]¶Estimates power iteratively using samples with matching metadata
State: Experimental as of 0.4.0.
Parameters: |
|
---|---|
Returns: |
|
Raises: |
|
Examples
Assume you are interested in the role of a specific cytokine of protein translocation in myeloid-lineage cells. You are able to culture two macrophage lineages (bone marrow derived phagocytes and peritoneally-derived macrophages). Due to unfortunate circumstances, your growth media must be acquired from multiple sources (lab, company A, company B). Also unfortunate, you must use labor-intensive low throughput assays. You have some preliminary measurements, and you’d like to predict how many (more) cells you need to analyze for 80% power.
You have information about 60 cells, which we’ll simulate below. Note that we are setting a random seed value for consistency.
>>> import numpy as np
>>> import pandas as pd
>>> np.random.seed(25)
>>> data = pd.DataFrame.from_dict({
... 'CELL_LINE': np.random.binomial(1, 0.5, size=(60,)),
... 'SOURCE': np.random.binomial(2, 0.33, size=(60,)),
... 'TREATMENT': np.hstack((np.zeros((30)), np.ones((30)))),
... 'INCUBATOR': np.random.binomial(1, 0.2, size=(60,))})
>>> data['OUTCOME'] = (0.25 + data.TREATMENT * 0.25) + \
... np.random.randn(60) * (0.1 + data.SOURCE/10 + data.CELL_LINE/5)
>>> data.loc[data.OUTCOME < 0, 'OUTCOME'] = 0
>>> data.loc[data.OUTCOME > 1, 'OUTCOME'] = 1
We will approach this by assuming that the distribution of our outcome is not normally distributed, and apply a kruskal-wallis test to compare between the cytokine treated and untreated cells.
>>> from scipy.stats import kruskal
>>> f = lambda x: kruskal(*[data.loc[i, 'OUTCOME'] for i in x])[1]
Let’s check that cytokine treatment has a significant effect across all the cells.
>>> treatment_stat = [g for g in data.groupby('TREATMENT').groups.values()]
>>> f(treatment_stat)
0.0019386336266250209
Now, let’s pick the control categories. It seems reasonable to assume there may be an effect of cell line on the treatment outcome, which may be attributed to differences in receptor expression. It may also be possible that there are differences due cytokine source. Incubators were maintained under the same conditions throughout the experiment, within one degree of temperature difference at any given time, and the same level of CO2. So, at least initially, let’s ignore differences due to the incubator.
It’s recommended that as a first pass analysis, control variables be selected based on an idea of what may be biologically relevant to the system, although further iteration might encourage the consideration of variable with effect sizes similar, or larger than the variable of interest.
>>> control_cats = ['SOURCE', 'CELL_LINE']
>>> from skbio.stats.power import subsample_paired_power
>>> pwr, cnt = subsample_paired_power(test=f,
... meta=data,
... cat='TREATMENT',
... control_cats=control_cats,
... counts_interval=5,
... num_iter=25,
... num_runs=5)
>>> cnt
array([ 5., 10., 15., 20.])
>>> pwr.mean(0)
array([ 0.24 , 0.528, 0.68 , 0.88 ])
>>> pwr.std(0).round(3)
array([ 0.088, 0.127, 0.168, 0.08 ])
Estimating off the power curve, it looks like 20 cells per group may provide adequate power for this experiment, although the large variance in power might suggest extending the curves or increasing the number of samples per group.