Skip to main content

fast, memory-efficient, pythonic (and command-line) access to fasta sequence files

Project description

Author:

Brent Pedersen (brentp)

License:

MIT

Implementation

Requires Python >= 2.6. Stores a flattened version of the fasta file without spaces or headers and uses either a mmap of numpy binary format or fseek/fread so the sequence data is never read into memory. Saves a pickle (.gdx) of the start, stop (for fseek/mmap) locations of each header in the fasta file for internal use.

Usage

>>> from pyfasta import Fasta

>>> f = Fasta('tests/data/three_chrs.fasta')
>>> sorted(f.keys())
['chr1', 'chr2', 'chr3']

>>> f['chr1']
NpyFastaRecord(0..80)

Slicing

# get full the sequence:
>>> a = str(f['chr1'])
>>> b = f['chr1'][:]
>>> a == b
True

>>> f['chr1'][:10]
'ACTGACTGAC'

# get the 1st basepair in every codon (it's python yo)
>>> f['chr1'][::3]
'AGTCAGTCAGTCAGTCAGTCAGTCAGT'

# can query by a 'feature' dictionary (note this is one based coordinates)
>>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9})
'CTGACTGA'

# same as:
>>> f['chr1'][1:9]
'CTGACTGA'

# use python, zero based coords
>>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9}, one_based=False)
'TGACTGA'

# with reverse complement (automatic for - strand)
>>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9, 'strand': '-'})
'TCAGTCAG'

Key Function

Sometimes your fasta will have a long header like: “AT1G51370.2 | Symbols: | F-box family protein | chr1:19045615-19046748 FORWARD” when you only want to key off: “AT1G51370.2”. In this case, specify the key_fn argument to the constructor:

>>> fkey = Fasta('tests/data/key.fasta', key_fn=lambda key: key.split()[0])
>>> sorted(fkey.keys())
['a', 'b', 'c']

Numpy

The default is to use a memmaped numpy array as the backend. In which case it’s possible to get back an array directly…

>>> f['chr1'].as_string = False
>>> f['chr1'][:10] # doctest: +NORMALIZE_WHITESPACE
memmap(['A', 'C', 'T', 'G', 'A', 'C', 'T', 'G', 'A', 'C'], dtype='|S1')

>>> import numpy as np
>>> a = np.array(f['chr2'])
>>> a.shape[0] == len(f['chr2'])
True

>>> a[10:14] # doctest: +NORMALIZE_WHITESPACE
array(['A', 'A', 'A', 'A'], dtype='|S1')

mask a sub-sequence

>>> a[11:13] = np.array('N', dtype='S1')
>>> a[10:14].tostring()
'ANNA'

Backends (Record class)

It’s also possible to specify another record class as the underlying work-horse for slicing and reading. Currently, there’s just the default:

  • NpyFastaRecord which uses numpy memmap

  • FastaRecord, which uses using fseek/fread

  • MemoryRecord which reads everything into memory and must reparse the original fasta every time.

  • TCRecord which is identical to NpyFastaRecord except that it saves the index in a TokyoCabinet hash database, for cases when there are enough records that loading the entire index from a pickle into memory is unwise. (NOTE: that the sequence is not loaded into memory in either case).

It’s possible to specify the class used with the record_class kwarg to the Fasta constructor:

>>> from pyfasta import FastaRecord # default is NpyFastaRecord
>>> f = Fasta('tests/data/three_chrs.fasta', record_class=FastaRecord)
>>> f['chr1']
FastaRecord('tests/data/three_chrs.fasta.flat', 0..80)

other than the repr, it should behave exactly like the Npy record class backend

it’s possible to create your own using a sub-class of FastaRecord. see the source in pyfasta/records.py for details.

Flattening

In order to efficiently access the sequence content, pyfasta saves a separate, flattened file with all newlines and headers removed from the sequence. In the case of large fasta files, one may not wish to save 2 copies of a 5GG+ file. In that case, it’s possible to flatten the file “inplace”, keeping all the headers, and retaining the validity of the fasta file – with the only change being that the new-lines are removed from each sequence. This can be specified via flatten_inplace = True

>>> import os
>>> os.unlink('tests/data/three_chrs.fasta.gdx') # cleanup non-inplace idx
>>> f = Fasta('tests/data/three_chrs.fasta', flatten_inplace=True)
>>> f['chr1']  # note the difference in the output from above.
NpyFastaRecord(6..86)

# sequence from is same as when requested from non-flat file above.
>>> f['chr1'][1:9]
'CTGACTGA'

# the flattened file is kept as a place holder without the sequence data.
>>> open('tests/data/three_chrs.fasta.flat').read()
'@flattened@'

Command Line Interface

there’s also a command line interface to manipulate / view fasta files. the pyfasta executable is installed via setuptools, running it will show help text.

split a fasta file into 6 new files of relatively even size:

$ pyfasta split -n 6 original.fasta

split the fasta file into one new file per header with “%(seqid)s” being filled into each filename.:

$ pyfasta split –header “%(seqid)s.fasta” original.fasta

create 1 new fasta file with the sequence split into 10K-mers:

$ pyfasta split -n 1 -k 10000 original.fasta

2 new fasta files with the sequence split into 10K-mers with 2K overlap:

$ pyfasta split -n 2 -k 10000 -o 2000 original.fasta

show some info about the file (and show gc content):

$ pyfasta info –gc test/data/three_chrs.fasta

extract sequence from the file. use the header flag to make a new fasta file. the args are a list of sequences to extract.

$ pyfasta extract –header –fasta test/data/three_chrs.fasta seqa seqb seqc

extract sequence from a file using a file containing the headers not wanted in the new file:

$ pyfasta extract –header –fasta input.fasta –exclude –file seqids_to_exclude.txt

extract sequence from a fasta file with complex keys where we only want to lookup based on the part before the space.

$ pyfasta extract –header –fasta input.with.keys.fasta –space –file seqids.txt

flatten a file inplace, for faster later use by pyfasta, and without creating another copy. (Flattening)

$ pyfasta flatten input.fasta

cleanup

(though for real use these will remain for faster access)

>>> os.unlink('tests/data/three_chrs.fasta.gdx')
>>> os.unlink('tests/data/three_chrs.fasta.flat')

Testing

there is currently > 99% test coverage for the 2 modules and all included record classes. to run the tests:

$ python setup.py nosetests

Changes

0.5.2

fix complement (@mruffalo)

0.5.0

python 3 compatibility thanks to mruffalo

0.4.5

pyfasta split can handle > 52 files. (thanks Devtulya)

0.4.4

fix pyfasta extract

0.4.3

Add 0 or 1-based intervals in sequence() thanks @jamescasbon

0.4.2

update for latest numpy (can’t close memmap)

0.4.1

check for duplicate headers.

0.4.0

  • add key_fn kwarg to constuctor

0.3.9

  • only require ‘r’ (not r+) for memory map.

0.3.8

  • clean up logic for mixing inplace/non-inplace flattened files. if the inplace is available, it is always used.

0.3.6/7

  • dont re-flatten the file every time!

  • allow spaces before and after the header in the orginal fasta.

0.3.5

  • update docs in README.txt for new CLI stuff.

  • allow flattening inplace.

  • get rid of memmap (results in faster parsing).

0.3.4

  • restore python2.5 compatiblity.

  • CLI: add ability to exclude sequence from extract

  • CLI: allow spliting based on header.

0.3.3

  • include this file in the tar ball (thanks wen h.)

0.3.2

  • separate out backends into records.py

  • use nosetests (python setup.py nosetests)

  • add a TCRecord backend for next-gen sequencing availabe if tc is (easy-)installed.

  • improve test coverage.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyfasta-0.5.2.tar.gz (19.1 kB view hashes)

Uploaded source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page