skbio.stats.isubsample(items, maximum, minimum=1, buf_size=1000, bin_f=None)
Randomly subsample items from bins, without replacement.
State: Experimental as of 0.4.0.
Randomly subsample items without replacement from an unknown number of input items that may fall into an unknown number of bins. This method is intended for a) data that cannot fit into memory or b) collections of arbitrary datatypes.
Returns: (bin, item)
Return type: generator
Notes
Randomly get up to maximum items for each bin. If a bin has fewer than maximum items, it is returned only if it has at least minimum items.
This method holds at most maximum * N items in memory, where N is the number of bins.
All items associated with a bin have an equal probability of being retained.
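The behavior described in these notes can be illustrated with a minimal pure-Python sketch. It is not taken from the scikit-bio source and may differ from the library's implementation; the name bounded_bin_sample and its rng parameter are hypothetical and exist only for this illustration. Each item is given a random key, and each bin keeps a heap of the maximum largest keys seen so far, so memory stays bounded by maximum * N and every item in a bin has the same chance of being kept.

```python
import heapq
import random
from collections import defaultdict


def bounded_bin_sample(items, maximum, minimum=1, bin_f=None, rng=None):
    """Hypothetical helper: yield (bin, item) for up to `maximum` random items per bin."""
    rng = rng or random.Random()
    if bin_f is None:
        # Assumption for this sketch: without bin_f, all items share one bin.
        bin_f = lambda item: True

    # bin -> min-heap of (random key, arrival index, item).
    # The arrival index breaks ties so the items themselves are never compared.
    heaps = defaultdict(list)
    for index, item in enumerate(items):
        entry = (rng.random(), index, item)
        heap = heaps[bin_f(item)]
        if len(heap) < maximum:
            heapq.heappush(heap, entry)
        else:
            # Add the new entry, then drop the smallest key, so the heap
            # always holds the `maximum` largest keys seen for this bin.
            heapq.heappushpop(heap, entry)

    # Only bins that collected at least `minimum` items are emitted.
    for bin_, heap in heaps.items():
        if len(heap) >= minimum:
            for _, _, item in heap:
                yield bin_, item


if __name__ == "__main__":
    records = [("sampleA", "AATTGG"), ("sampleB", "ATATATAT"),
               ("sampleC", "ATGGCC"), ("sampleB", "ATGGCT"),
               ("sampleB", "ATGGCG"), ("sampleA", "ATGGCA")]
    for bin_, item in sorted(bounded_bin_sample(records, 2, bin_f=lambda r: r[0])):
        print(bin_, item[1])
```

Because the keys are independent and uniformly distributed, keeping the largest maximum keys in a bin selects a uniformly random subset of that bin. The input is consumed in a single pass, so items can be any iterable, including a generator over data too large to hold in memory.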
Examples
Randomly keep up to 2 sequences per sample from a set of demultiplexed sequences:
>>> from skbio.stats import isubsample
>>> import numpy as np
>>> np.random.seed(123)
>>> seqs = [('sampleA', 'AATTGG'),
...         ('sampleB', 'ATATATAT'),
...         ('sampleC', 'ATGGCC'),
...         ('sampleB', 'ATGGCT'),
...         ('sampleB', 'ATGGCG'),
...         ('sampleA', 'ATGGCA')]
>>> bin_f = lambda item: item[0]
>>> for bin_, item in sorted(isubsample(seqs, 2, bin_f=bin_f)):
...     print(bin_, item[1])
sampleA AATTGG
sampleA ATGGCA
sampleB ATATATAT
sampleB ATGGCG
sampleC ATGGCC
Now, let’s set the minimum to 2:
>>> bin_f = lambda item: item[0]
>>> for bin_, item in sorted(isubsample(seqs, 2, 2, bin_f=bin_f)):
...     print(bin_, item[1])
sampleA AATTGG
sampleA ATGGCA
sampleB ATATATAT
sampleB ATGGCG