multifastadb 0.2.0

present a collection of indexed fasta files as a single source

Latest Version: 0.2.10

MultiFastaDB presents a collection of indexed fasta files as a single source. The intent is to simplify accessing a virtual database of sequences that is distributed across multiple files.

$ pip install multifastadb

$ python

>>> from multifastadb import MultiFastaDB

The simplest use is by passing a list of files or directories:

>>> mfdb = MultiFastaDB(['tests/data/ncbi'])

By default, MultiFastaDB looks for files ending in .fasta, .fa, .faa, or .fna, and for compressed versions of these ending in .gz. (NOTE: Compressed files must be created with bgzip; files compressed with plain gzip cannot be read.)
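
The suffix rule described above can be sketched as follows. This is only an illustration of the matching logic, not MultiFastaDB's own code: the helper names looks_like_fasta and find_fasta_files are hypothetical, and scanning directories one level deep is an assumption.

```python
import os

# Suffixes named in the text above; each may also carry a trailing ".gz".
FASTA_SUFFIXES = (".fasta", ".fa", ".faa", ".fna")


def looks_like_fasta(filename):
    """Return True if filename ends in a recognised suffix, optionally followed by .gz."""
    name = filename[:-3] if filename.endswith(".gz") else filename
    return name.endswith(FASTA_SUFFIXES)


def find_fasta_files(sources):
    """Expand a list of files and directories into a list of FASTA paths.

    Hypothetical helper: directories are scanned one level deep (an assumption),
    and explicit file paths are kept only if their suffix matches.
    """
    found = []
    for source in sources:
        if os.path.isdir(source):
            for entry in sorted(os.listdir(source)):
                path = os.path.join(source, entry)
                if os.path.isfile(path) and looks_like_fasta(entry):
                    found.append(path)
        elif looks_like_fasta(source):
            found.append(source)
    return found
```

Note that a .gz suffix says nothing about how the file was compressed, so a check like this cannot tell bgzip output from plain gzip output.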

Fasta files from NCBI contain multiple identifiers for a single sequence encoded in the accession line, such as (gi|53292629|ref|NP_001005405.1|). Optionally, MultiFastaDB will create a meta index to the ref entries:

>>> mfdb = MultiFastaDB(['tests/data/ncbi'], use_meta_index=True)
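
To illustrate what a meta index over ref entries might contain, here is a minimal sketch. It assumes the accession consists of strict tag|value pairs, as in the example above; real NCBI identifiers can be more varied. The function names parse_ncbi_identifiers and build_meta_index are hypothetical and are not part of MultiFastaDB.

```python
def parse_ncbi_identifiers(accession):
    """Split an accession such as 'gi|53292629|ref|NP_001005405.1|' into {tag: value}.

    Assumes the fields strictly alternate tag, value, tag, value, ...
    """
    fields = accession.strip("|").split("|")
    return dict(zip(fields[0::2], fields[1::2]))


def build_meta_index(accessions):
    """Map each 'ref' identifier to the full accession it was found in.

    If the same ref identifier appears twice, the first accession is kept.
    """
    index = {}
    for accession in accessions:
        ref = parse_ncbi_identifiers(accession).get("ref")
        if ref is not None:
            index.setdefault(ref, accession)
    return index
```

With such a mapping, a short name like NP_001005405.1 can be translated to the full accession before the underlying file index is consulted.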

Sequences may be retrieved with the fetch() method, optionally with sequence start and end bounds given in 0-based (interbase) coordinates:

>>> seq = mfdb.fetch('NP_001005405.1')
>>> seq = mfdb.fetch('NP_001005405.1',0,10)

NOTE: Passing bounds to fetch() is much more efficient than fetching the whole sequence and slicing the result:

>>> seq = mfdb.fetch('NP_001005405.1')[0:10]    # Don't do this!
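
The efficiency difference comes from the file index: with a faidx-style index, the byte position of any base can be computed directly, so only the requested range needs to be read. The sketch below shows that arithmetic for an uncompressed file. It is not MultiFastaDB's code; the function names are hypothetical, and the three index values (seq_offset, line_bases, line_width) are assumed to come from the file's index entry for the sequence.

```python
def byte_offset(pos, seq_offset, line_bases, line_width):
    """Byte position in the file of the 0-based sequence position pos.

    seq_offset: byte offset of the first base of the sequence
    line_bases: number of bases on each full line
    line_width: number of bytes on each full line, including the newline
    """
    return seq_offset + (pos // line_bases) * line_width + pos % line_bases


def fetch_range(handle, start, end, seq_offset, line_bases, line_width):
    """Read bases [start, end) from a binary file handle without reading the rest."""
    first = byte_offset(start, seq_offset, line_bases, line_width)
    last = byte_offset(end, seq_offset, line_bases, line_width)
    handle.seek(first)
    raw = handle.read(last - first)
    # Drop the line breaks that fall inside the requested range.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")
```

Fetching ten residues this way touches a few bytes, whereas fetching a whole chromosome and slicing it reads and decodes the entire record first.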

If a sequence occurs more than once in the collection, only the first occurrence is returned; this is intentional.

Dictionary-style retrieval with square brackets is also supported:

>>> seq = mfdb['NP_001005405.1']
>>> seq = mfdb['NP_001005405.1'][0:10]

Dictionary-style retrieval does not fetch any sequence immediately. Instead, it returns a SequenceProxy object that fetches sequence data lazily and transparently. This is particularly useful for accessing large sequences (e.g., chromosomes).
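
The lazy behaviour can be pictured with a small stand-in class. This is a sketch of the general proxy pattern only: LazySequence is a hypothetical name, it is not the library's SequenceProxy, and the fetcher argument is assumed to be any callable with the same (accession, start, end) shape as fetch().

```python
class LazySequence:
    """Hypothetical lazy proxy: remembers what to fetch, fetches only when asked."""

    def __init__(self, fetcher, accession):
        # fetcher is any callable taking (accession, start, end).
        self._fetcher = fetcher
        self._accession = accession

    def __getitem__(self, key):
        # Only contiguous slices are handled in this sketch.
        if not isinstance(key, slice) or key.step not in (None, 1):
            raise TypeError("this sketch supports contiguous slices only")
        # The slice bounds are passed straight to the fetcher, so only the
        # requested range is retrieved.
        return self._fetcher(self._accession, key.start, key.stop)

    def __str__(self):
        # Converting to a string is the point at which the whole sequence is read.
        return self._fetcher(self._accession, None, None)
```

Creating the proxy costs nothing; the fetcher is first called when the proxy is sliced or converted to a string.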

The locations of a given accession may be found with the where_is() method:

>>> mfdb.where_is('gi|53292629|ref|NP_001005405.1|')   # doctest: +ELLIPSIS
[('tests/data/ncbi/f1.human.protein.small.faa.gz', <pysam.cfaidx.Fastafile object at ...>)]
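
The relationship between where_is() and first-occurrence retrieval can be modelled with plain dictionaries. This toy class is an assumption-laden sketch, not the library's implementation: FirstMatchLookup is a hypothetical name, each source is represented as a (name, {accession: sequence}) pair, and raising KeyError for a missing accession is a choice made for the sketch.

```python
class FirstMatchLookup:
    """Toy model of an ordered collection of sequence sources."""

    def __init__(self, sources):
        # sources: ordered list of (name, {accession: sequence}) pairs
        self._sources = list(sources)

    def where_is(self, accession):
        """Return the names of every source containing the accession, in order."""
        return [name for name, seqs in self._sources if accession in seqs]

    def fetch(self, accession, start=None, end=None):
        """Return the sequence from the first source that contains the accession."""
        for _name, seqs in self._sources:
            if accession in seqs:
                return seqs[accession][start:end]
        # Behaviour for a missing accession is an assumption of this sketch.
        raise KeyError(accession)
```

In this model, where_is() reports every location, while fetch() stops at the first one.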

File                                Type        Py Version  Uploaded on  Size
multifastadb-0.2.0-py2.7.egg (md5)  Python Egg  2.7         2014-09-01   2KB
multifastadb-0.2.0.tar.gz (md5)     Source                  2014-09-01   392KB