skbio.io.format.blast7
)¶The BLAST+7 format (blast+7
) stores the results of a BLAST [1] database
search. This format is produced by both BLAST+ output
format 7 and legacy BLAST output format 9. The results
are stored in a simple tabular format with headers. Values are separated by the
tab character.
An example BLAST+7-formatted file comparing two nucleotide sequences, taken
from [2] (tab characters represented by <tab>
):
# BLASTN 2.2.18+
# Query: gi|1786181|gb|AE000111.1|AE000111
# Subject: ecoli
# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end
# 5 hits found
AE000111<tab>AE000111<tab>0.0<tab>1<tab>10596<tab>1<tab>10596
AE000111<tab>AE000174<tab>8e-30<tab>5565<tab>5671<tab>6928<tab>6821
AE000111<tab>AE000394<tab>1e-27<tab>5587<tab>5671<tab>135<tab>219
AE000111<tab>AE000425<tab>6e-26<tab>5587<tab>5671<tab>8552<tab>8468
AE000111<tab>AE000171<tab>3e-24<tab>5587<tab>5671<tab>2214<tab>2130
Has Sniffer: Yes
State: Experimental as of 0.4.1.
Reader | Writer | Object Class |
---|---|---|
Yes | No | pandas.DataFrame |
There are two BLAST+7 file formats supported by scikit-bio: BLAST+ output
format 7 (-outfmt 7
) and legacy BLAST output format 9 (-m 9
). Both file
formats are structurally similar, with minor differences.
Example BLAST+ output format 7 file:
# BLASTP 2.2.31+
# Query: query1
# Subject: subject2
# Fields: q. start, q. end, s. start, s. end, identical, mismatches, sbjctframe, query acc.ver, subject acc.ver
# 2 hits found
1 8 3 10 8 0 1 query1 subject2
2 5 2 15 8 0 2 query1 subject2
Note
Database searches without hits may occur in BLAST+ output format 7 files. scikit-bio ignores these “empty” records:
# BLASTP 2.2.31+
# Query: query1
# Subject: subject1
# 0 hits found
Example legacy BLAST output format 9 file:
# BLASTN 2.2.3 [May-13-2002]
# Database: other_vertebrate
# Query: AF178033
# Fields:
Query id,Subject id,% identity,alignment length,mismatches,gap openings,q. start,q. end,s. start,s. end,e-value,bit score
AF178033 EMORG:AF178033 100.00 811 0 0 1 811 1 811 0.0 1566.6
AF178033 EMORG:AF031394 99.63 811 3 0 1 811 99 909 0.0 1542.8
Note
scikit-bio requires fields to be consistent within a file.
The following column types are output by BLAST and supported by scikit-bio.
For more information on these column types, see skbio.io.format.blast6
.
Field Name | DataFrame Column Name |
---|---|
query id | qseqid |
query gi | qgi |
query acc. | qacc |
query acc.ver | qaccver |
query length | qlen |
subject id | sseqid |
subject ids | sallseqid |
subject gi | sgi |
subject gis | sallgi |
subject acc. | sacc |
subject acc.ver | saccver |
subject accs | sallacc |
subject length | slen |
q. start | qstart |
q. end | qend |
s. start | sstart |
s. end | send |
query seq | qseq |
subject seq | sseq |
evalue | evalue |
bit score | bitscore |
score | score |
alignment length | length |
% identity | pident |
identical | nident |
mismatches | mismatch |
positives | positive |
gap opens | gapopen |
gaps | gaps |
% positives | ppos |
query/sbjct frames | frames |
query frame | qframe |
sbjct frame | sframe |
BTOP | btop |
subject tax ids | staxids |
subject sci names | sscinames |
subject com names | scomnames |
subject blast names | sblastnames |
subject super kingdoms | sskingdoms |
subject title | stitle |
subject strand | sstrand |
subject titles | salltitles |
% query coverage per subject | qcovs |
% query coverage per hsp | qcovhsp |
Examples
Suppose we have a BLAST+7 file:
>>> from io import StringIO
>>> import skbio.io
>>> import pandas as pd
>>> fs = '\n'.join([
... '# BLASTN 2.2.18+',
... '# Query: gi|1786181|gb|AE000111.1|AE000111',
... '# Database: ecoli',
... '# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end',
... '# 5 hits found',
... 'AE000111\tAE000111\t0.0\t1\t10596\t1\t10596',
... 'AE000111\tAE000174\t8e-30\t5565\t5671\t6928\t6821',
... 'AE000111\tAE000171\t3e-24\t5587\t5671\t2214\t2130',
... 'AE000111\tAE000425\t6e-26\t5587\t5671\t8552\t8468'
... ])
>>> fh = StringIO(fs)
Read the file into a pd.DataFrame
:
>>> df = skbio.io.read(fh, into=pd.DataFrame)
>>> df # doctest: +NORMALIZE_WHITESPACE
qacc sacc evalue qstart qend sstart send
0 AE000111 AE000111 0.000000e+00 1.0 10596.0 1.0 10596.0
1 AE000111 AE000174 8.000000e-30 5565.0 5671.0 6928.0 6821.0
2 AE000111 AE000171 3.000000e-24 5587.0 5671.0 2214.0 2130.0
3 AE000111 AE000425 6.000000e-26 5587.0 5671.0 8552.0 8468.0
Suppose we have a legacy BLAST 9 file:
>>> from io import StringIO
>>> import skbio.io
>>> import pandas as pd
>>> fs = '\n'.join([
... '# BLASTN 2.2.3 [May-13-2002]',
... '# Database: other_vertebrate',
... '# Query: AF178033',
... '# Fields: ',
... 'Query id,Subject id,% identity,alignment length,mismatches,gap openings,q. start,q. end,s. start,s. end,e-value,bit score',
... 'AF178033\tEMORG:AF178033\t100.00\t811\t0\t0\t1\t811\t1\t811\t0.0\t1566.6',
... 'AF178033\tEMORG:AF178032\t94.57\t811\t44\t0\t1\t811\t1\t811\t0.0\t1217.7',
... 'AF178033\tEMORG:AF178031\t94.82\t811\t42\t0\t1\t811\t1\t811\t0.0\t1233.5'
... ])
>>> fh = StringIO(fs)
Read the file into a pd.DataFrame
:
>>> df = skbio.io.read(fh, into=pd.DataFrame)
>>> df # doctest: +NORMALIZE_WHITESPACE
qseqid sseqid pident length mismatch gapopen qstart qend \
0 AF178033 EMORG:AF178033 100.00 811.0 0.0 0.0 1.0 811.0
1 AF178033 EMORG:AF178032 94.57 811.0 44.0 0.0 1.0 811.0
2 AF178033 EMORG:AF178031 94.82 811.0 42.0 0.0 1.0 811.0
<BLANKLINE>
sstart send evalue bitscore
0 1.0 811.0 0.0 1566.6
1 1.0 811.0 0.0 1217.7
2 1.0 811.0 0.0 1233.5
References
[1] | Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403-410. |
[2] | http://www.ncbi.nlm.nih.gov/books/NBK279682/ |