skbio.sequence.Sequence(sequence, metadata=None, positional_metadata=None, interval_metadata=None, lowercase=False)
Store generic sequence data and optional associated metadata.
Sequence objects do not enforce an alphabet or grammar and are thus the most generic objects for storing sequence data. Sequence objects do not necessarily represent biological sequences. For example, Sequence can be used to represent a position in a multiple sequence alignment. Subclasses DNA, RNA, and Protein enforce the IUPAC character set [1] for, and provide operations specific to, each respective molecule type.
Sequence objects consist of the underlying sequence data, as well as optional metadata and positional metadata. The underlying sequence is immutable, while the metadata and positional metadata are mutable.
References
[1] Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res. May 10, 1985; 13(9): 3021-3030. A Cornish-Bowden
Examples
>>> from pprint import pprint
>>> import numpy as np
>>> import pandas as pd
>>> from skbio import Sequence
>>> from skbio.metadata import IntervalMetadata
Creating sequences:
Create a sequence without any metadata:
>>> seq = Sequence('GGUCGUGAAGGA')
>>> seq
Sequence
---------------
Stats:
length: 12
---------------
0 GGUCGUGAAG GA
Create a sequence with metadata and positional metadata:
>>> metadata = {'authors': ['Alice'], 'desc':'seq desc', 'id':'seq-id'}
>>> positional_metadata = {'exons': [True, True, False, True],
...                        'quality': [3, 3, 4, 10]}
>>> interval_metadata = IntervalMetadata(4)
>>> interval = interval_metadata.add([(1, 3)], metadata={'gene': 'sagA'})
>>> seq = Sequence('ACGT', metadata=metadata,
...                positional_metadata=positional_metadata,
...                interval_metadata=interval_metadata)
>>> seq
Sequence
-----------------------------
Metadata:
'authors': <class 'list'>
'desc': 'seq desc'
'id': 'seq-id'
Positional metadata:
'exons': <dtype: bool>
'quality': <dtype: int64>
Interval metadata:
1 interval feature
Stats:
length: 4
-----------------------------
0 ACGT
Retrieving underlying sequence data:
Retrieve underlying sequence:
>>> seq.values # doctest: +NORMALIZE_WHITESPACE
array([b'A', b'C', b'G', b'T'],
dtype='|S1')
Underlying sequence immutable:
>>> seq.values = np.array([b'T', b'C', b'G', b'A'], dtype='|S1')
Traceback (most recent call last):
...
AttributeError: can't set attribute
>>> seq.values[0] = b'T'
Traceback (most recent call last):
...
ValueError: assignment destination is read-only
Retrieving sequence metadata:
Retrieve metadata:
>>> pprint(seq.metadata) # using pprint to display dict in sorted order
{'authors': ['Alice'], 'desc': 'seq desc', 'id': 'seq-id'}
Retrieve positional metadata:
>>> seq.positional_metadata
exons quality
0 True 3
1 True 3
2 False 4
3 True 10
Retrieve interval metadata:
>>> seq.interval_metadata # doctest: +ELLIPSIS
1 interval feature
------------------
Interval(interval_metadata=<...>, bounds=[(1, 3)], fuzzy=[(False, False)], metadata={'gene': 'sagA'})
Updating sequence metadata:
Warning
Be aware that a shallow copy of metadata and positional_metadata is made for performance. Since a deep copy is not made, changes made to mutable Python objects stored as metadata may affect the metadata of other Sequence objects or anything else that shares a reference to the object. The following examples illustrate this behavior.
First, let’s create a sequence and update its metadata:
>>> metadata = {'id':'seq-id', 'desc':'seq desc', 'authors': ['Alice']}
>>> seq = Sequence('ACGT', metadata=metadata)
>>> seq.metadata['id'] = 'new-id'
>>> seq.metadata['pubmed'] = 12345
>>> pprint(seq.metadata)
{'authors': ['Alice'], 'desc': 'seq desc', 'id': 'new-id', 'pubmed': 12345}
Note that the original metadata dictionary (stored in variable metadata) hasn’t changed because a shallow copy was made:
>>> pprint(metadata)
{'authors': ['Alice'], 'desc': 'seq desc', 'id': 'seq-id'}
>>> seq.metadata == metadata
False
Note however that since only a shallow copy was made, updates to mutable objects will also change the original metadata dictionary:
>>> seq.metadata['authors'].append('Bob')
>>> seq.metadata['authors']
['Alice', 'Bob']
>>> metadata['authors']
['Alice', 'Bob']
This behavior can also occur when manipulating a sequence that has been derived from another sequence:
>>> subseq = seq[1:3]
>>> subseq
Sequence
-----------------------------
Metadata:
'authors': <class 'list'>
'desc': 'seq desc'
'id': 'new-id'
'pubmed': 12345
Stats:
length: 2
-----------------------------
0 CG
>>> pprint(subseq.metadata)
{'authors': ['Alice', 'Bob'],
 'desc': 'seq desc',
 'id': 'new-id',
 'pubmed': 12345}
The subsequence has inherited the metadata of its parent sequence. If we update the subsequence’s author list, we see the changes propagated in the parent sequence and original metadata dictionary:
>>> subseq.metadata['authors'].append('Carol')
>>> subseq.metadata['authors']
['Alice', 'Bob', 'Carol']
>>> seq.metadata['authors']
['Alice', 'Bob', 'Carol']
>>> metadata['authors']
['Alice', 'Bob', 'Carol']
The behavior for updating positional metadata is similar. Let’s create a new sequence with positional metadata that is already stored in a pd.DataFrame:
>>> positional_metadata = pd.DataFrame(
...     {'list': [[], [], [], []], 'quality': [3, 3, 4, 10]})
>>> seq = Sequence('ACGT', positional_metadata=positional_metadata)
>>> seq
Sequence
-----------------------------
Positional metadata:
'list': <dtype: object>
'quality': <dtype: int64>
Stats:
length: 4
-----------------------------
0 ACGT
>>> seq.positional_metadata
list quality
0 [] 3
1 [] 3
2 [] 4
3 [] 10
Now let’s update the sequence’s positional metadata by adding a new column and changing a value in another column:
>>> seq.positional_metadata['gaps'] = [False, False, False, False]
>>> seq.positional_metadata.loc[0, 'quality'] = 999
>>> seq.positional_metadata
list quality gaps
0 [] 999 False
1 [] 3 False
2 [] 4 False
3 [] 10 False
Note that the original positional metadata (stored in variable positional_metadata) hasn’t changed because a shallow copy was made:
>>> positional_metadata
list quality
0 [] 3
1 [] 3
2 [] 4
3 [] 10
>>> seq.positional_metadata.equals(positional_metadata)
False
Next let’s create a sequence that has been derived from another sequence:
>>> subseq = seq[1:3]
>>> subseq
Sequence
-----------------------------
Positional metadata:
'list': <dtype: object>
'quality': <dtype: int64>
'gaps': <dtype: bool>
Stats:
length: 2
-----------------------------
0 CG
>>> subseq.positional_metadata
list quality gaps
0 [] 3 False
1 [] 4 False
As described above for metadata, since only a shallow copy was made of the positional metadata, updates to mutable objects will also change the parent sequence’s positional metadata and the original positional metadata pd.DataFrame:
>>> subseq.positional_metadata.loc[0, 'list'].append('item')
>>> subseq.positional_metadata
list quality gaps
0 [item] 3 False
1 [] 4 False
>>> seq.positional_metadata
list quality gaps
0 [] 999 False
1 [item] 3 False
2 [] 4 False
3 [] 10 False
>>> positional_metadata
list quality
0 [] 3
1 [item] 3
2 [] 4
3 [] 10
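The sharing shown in the metadata and positional metadata examples above is ordinary Python shallow-copy behavior. The following is a minimal sketch using a plain dict as a stand-in for the metadata; it does not call scikit-bio, and the variable names are illustrative only. It shows why rebinding a key leaves the original untouched while mutating a nested list does not, and how copy.deepcopy from the standard library avoids the sharing:

```python
import copy

# Plain-dict stand-in for metadata handed to a constructor (illustrative only).
metadata = {'id': 'seq-id', 'authors': ['Alice']}

shallow = copy.copy(metadata)      # new dict, but the inner list is shared
deep = copy.deepcopy(metadata)     # new dict and a new inner list

shallow['id'] = 'new-id'           # rebinding a key: original is unaffected
shallow['authors'].append('Bob')   # mutating the shared list: original changes
deep['authors'].append('Carol')    # independent list: original is unaffected

print(metadata)  # {'id': 'seq-id', 'authors': ['Alice', 'Bob']}
print(deep)      # {'id': 'seq-id', 'authors': ['Alice', 'Carol']}
```

If independent metadata is needed, one option is to deep-copy the dict or DataFrame contents before passing them in; the Built-ins listing for this class also documents copy.deepcopy(sequence) as returning a deep copy of a sequence.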
You can also update the interval metadata. First, let’s re-create a Sequence object with interval metadata:
>>> seq = Sequence('ACGT')
>>> interval = seq.interval_metadata.add(
...     [(1, 3)], metadata={'gene': 'foo'})
You can update the Interval object directly:
>>> interval # doctest: +ELLIPSIS
Interval(interval_metadata=<...>, bounds=[(1, 3)], fuzzy=[(False, False)], metadata={'gene': 'foo'})
>>> interval.bounds = [(0, 2)]
>>> interval # doctest: +ELLIPSIS
Interval(interval_metadata=<...>, bounds=[(0, 2)], fuzzy=[(False, False)], metadata={'gene': 'foo'})
You can also query for the interval features you are interested in and then modify them:
>>> intervals = list(seq.interval_metadata.query(metadata={'gene': 'foo'}))
>>> intervals[0].fuzzy = [(True, False)]
>>> print(intervals[0]) # doctest: +ELLIPSIS
Interval(interval_metadata=<...>, bounds=[(0, 2)], fuzzy=[(True, False)], metadata={'gene': 'foo'})
Attributes
default_write_format |
interval_metadata | IntervalMetadata object containing info about interval features.
metadata | dict containing metadata which applies to the entire object.
observed_chars | Set of observed characters in the sequence.
positional_metadata | pd.DataFrame containing metadata along an axis.
values | Array containing underlying sequence characters.
Built-ins
bool(sequence) | Returns truth value (truthiness) of sequence.
x in sequence | Determine if a subsequence is contained in this sequence.
copy.copy(sequence) | Return a shallow copy of this sequence.
copy.deepcopy(sequence) | Return a deep copy of this sequence.
sequence1 == sequence2 | Determine if this sequence is equal to another.
sequence[x] | Slice this sequence.
__init_subclass__ | This method is called when a class is subclassed.
iter(sequence) | Iterate over positions in this sequence.
len(sequence) | Return the number of characters in this sequence.
sequence1 != sequence2 | Determine if this sequence is not equal to another.
reversed(sequence) | Iterate over positions in this sequence in reverse order.
str(sequence) | Return sequence characters as a string.
Methods
concat(sequences[, how]) | Concatenate an iterable of Sequence objects.
count(subsequence[, start, end]) | Count occurrences of a subsequence in this sequence.
distance(other[, metric]) | Compute the distance to another sequence.
find_with_regex(regex[, ignore]) | Generate slices for patterns matched by a regular expression.
frequencies([chars, relative]) | Compute frequencies of characters in the sequence.
has_interval_metadata() | Determine if the object has interval metadata.
has_metadata() | Determine if the object has metadata.
has_positional_metadata() | Determine if the object has positional metadata.
index(subsequence[, start, end]) | Find position where subsequence first occurs in the sequence.
iter_contiguous(included[, min_length, invert]) | Yield contiguous subsequences based on included.
iter_kmers(k[, overlap]) | Generate kmers of length k from this sequence.
kmer_frequencies(k[, overlap, relative]) | Return counts of words of length k from this sequence.
lowercase(lowercase) | Return a case-sensitive string representation of the sequence.
match_frequency(other[, relative]) | Return count of positions that are the same between two sequences.
matches(other) | Find positions that match with another sequence.
mismatch_frequency(other[, relative]) | Return count of positions that differ between two sequences.
mismatches(other) | Find positions that do not match with another sequence.
read(file[, format]) | Create a new Sequence instance from a file.
replace(where, character) | Replace values in this sequence with a different character.
write(file[, format]) | Write an instance of Sequence to a file.
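The overlap argument listed for iter_kmers and kmer_frequencies controls whether consecutive k-mers share characters. As a rough plain-Python illustration of that idea, the sketch below works on an ordinary str. The helper name kmers is hypothetical and this is not scikit-bio's implementation; it assumes that non-overlapping k-mers are taken in steps of k and that a trailing fragment shorter than k is dropped.

```python
from collections import Counter


def kmers(seq, k, overlap=True):
    """Yield substrings of length k from a plain str (hypothetical helper)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Overlapping k-mers advance one position at a time;
    # non-overlapping k-mers advance k positions at a time.
    step = 1 if overlap else k
    for start in range(0, len(seq) - k + 1, step):
        yield seq[start:start + k]


overlapping = list(kmers('GATTACA', 3))
non_overlapping = list(kmers('GATTACA', 3, overlap=False))
counts = Counter(kmers('GATTACA', 3))  # k-mer -> number of occurrences

print(overlapping)      # ['GAT', 'ATT', 'TTA', 'TAC', 'ACA']
print(non_overlapping)  # ['GAT', 'TAC']
```

Counting the yielded k-mers, as done with Counter here, corresponds in spirit to what kmer_frequencies reports; consult the method's own documentation for its exact return type and the meaning of relative.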