skip to navigation
skip to content

Not Logged In

pyfasta 0.5.0

fast, memory-efficient, pythonic (and command-line) access to fasta sequence files

Latest Version: 0.5.2

Implementation

Requires Python >= 2.6. Stores a flattened version of the fasta file without spaces or headers and uses either a mmap of numpy binary format or fseek/fread so the sequence data is never read into memory. Saves a pickle (.gdx) of the start, stop (for fseek/mmap) locations of each header in the fasta file for internal use.

Usage

>>> from pyfasta import Fasta

>>> f = Fasta('tests/data/three_chrs.fasta')
>>> sorted(f.keys())
['chr1', 'chr2', 'chr3']

>>> f['chr1']
NpyFastaRecord(0..80)

Slicing

# get full the sequence:
>>> a = str(f['chr1'])
>>> b = f['chr1'][:]
>>> a == b
True

>>> f['chr1'][:10]
'ACTGACTGAC'

# get the 1st basepair in every codon (it's python yo)
>>> f['chr1'][::3]
'AGTCAGTCAGTCAGTCAGTCAGTCAGT'

# can query by a 'feature' dictionary (note this is one based coordinates)
>>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9})
'CTGACTGA'

# same as:
>>> f['chr1'][1:9]
'CTGACTGA'

# use python, zero based coords
>>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9}, one_based=False)
'TGACTGA'

# with reverse complement (automatic for - strand)
>>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9, 'strand': '-'})
'TCAGTCAG'

Key Function

Sometimes your fasta will have a long header like: "AT1G51370.2 | Symbols: | F-box family protein | chr1:19045615-19046748 FORWARD" when you only want to key off: "AT1G51370.2". In this case, specify the key_fn argument to the constructor:

>>> fkey = Fasta('tests/data/key.fasta', key_fn=lambda key: key.split()[0])
>>> sorted(fkey.keys())
['a', 'b', 'c']

Numpy

The default is to use a memmaped numpy array as the backend. In which case it's possible to get back an array directly...

>>> f['chr1'].as_string = False
>>> f['chr1'][:10] # doctest: +NORMALIZE_WHITESPACE
memmap(['A', 'C', 'T', 'G', 'A', 'C', 'T', 'G', 'A', 'C'], dtype='|S1')

>>> import numpy as np
>>> a = np.array(f['chr2'])
>>> a.shape[0] == len(f['chr2'])
True

>>> a[10:14] # doctest: +NORMALIZE_WHITESPACE
array(['A', 'A', 'A', 'A'], dtype='|S1')

mask a sub-sequence

>>> a[11:13] = np.array('N', dtype='S1')
>>> a[10:14].tostring()
'ANNA'

Backends (Record class)

It's also possible to specify another record class as the underlying work-horse for slicing and reading. Currently, there's just the default:

  • NpyFastaRecord which uses numpy memmap
  • FastaRecord, which uses using fseek/fread
  • MemoryRecord which reads everything into memory and must reparse the original fasta every time.
  • TCRecord which is identical to NpyFastaRecord except that it saves the index in a TokyoCabinet hash database, for cases when there are enough records that loading the entire index from a pickle into memory is unwise. (NOTE: that the sequence is not loaded into memory in either case).

It's possible to specify the class used with the record_class kwarg to the Fasta constructor:

>>> from pyfasta import FastaRecord # default is NpyFastaRecord
>>> f = Fasta('tests/data/three_chrs.fasta', record_class=FastaRecord)
>>> f['chr1']
FastaRecord('tests/data/three_chrs.fasta.flat', 0..80)

other than the repr, it should behave exactly like the Npy record class backend

it's possible to create your own using a sub-class of FastaRecord. see the source in pyfasta/records.py for details.

Flattening

In order to efficiently access the sequence content, pyfasta saves a separate, flattened file with all newlines and headers removed from the sequence. In the case of large fasta files, one may not wish to save 2 copies of a 5GG+ file. In that case, it's possible to flatten the file "inplace", keeping all the headers, and retaining the validity of the fasta file -- with the only change being that the new-lines are removed from each sequence. This can be specified via flatten_inplace = True

>>> import os
>>> os.unlink('tests/data/three_chrs.fasta.gdx') # cleanup non-inplace idx
>>> f = Fasta('tests/data/three_chrs.fasta', flatten_inplace=True)
>>> f['chr1']  # note the difference in the output from above.
NpyFastaRecord(6..86)

# sequence from is same as when requested from non-flat file above.
>>> f['chr1'][1:9]
'CTGACTGA'

# the flattened file is kept as a place holder without the sequence data.
>>> open('tests/data/three_chrs.fasta.flat').read()
'@flattened@'

Command Line Interface

there's also a command line interface to manipulate / view fasta files. the pyfasta executable is installed via setuptools, running it will show help text.

split a fasta file into 6 new files of relatively even size:

$ pyfasta split -n 6 original.fasta

split the fasta file into one new file per header with "%(seqid)s" being filled into each filename.:

$ pyfasta split --header "%(seqid)s.fasta" original.fasta

create 1 new fasta file with the sequence split into 10K-mers:

$ pyfasta split -n 1 -k 10000 original.fasta

2 new fasta files with the sequence split into 10K-mers with 2K overlap:

$ pyfasta split -n 2 -k 10000 -o 2000 original.fasta

show some info about the file (and show gc content):

$ pyfasta info --gc test/data/three_chrs.fasta

extract sequence from the file. use the header flag to make a new fasta file. the args are a list of sequences to extract.

$ pyfasta extract --header --fasta test/data/three_chrs.fasta seqa seqb seqc

extract sequence from a file using a file containing the headers not wanted in the new file:

$ pyfasta extract --header --fasta input.fasta --exclude --file seqids_to_exclude.txt

extract sequence from a fasta file with complex keys where we only want to lookup based on the part before the space.

$ pyfasta extract --header --fasta input.with.keys.fasta --space --file seqids.txt

flatten a file inplace, for faster later use by pyfasta, and without creating another copy. (Flattening)

$ pyfasta flatten input.fasta

cleanup

(though for real use these will remain for faster access)

>>> os.unlink('tests/data/three_chrs.fasta.gdx')
>>> os.unlink('tests/data/three_chrs.fasta.flat')

Testing

there is currently > 99% test coverage for the 2 modules and all included record classes. to run the tests:

$ python setup.py nosetests

Changes

0.5.0

python 3 compatibility thanks to mruffalo

0.4.5

pyfasta split can handle > 52 files. (thanks Devtulya)

0.4.4

fix pyfasta extract

0.4.3

Add 0 or 1-based intervals in sequence() thanks @jamescasbon

0.4.2

update for latest numpy (can't close memmap)

0.4.1

check for duplicate headers.

0.4.0

  • add key_fn kwarg to constuctor

0.3.9

  • only require 'r' (not r+) for memory map.

0.3.8

  • clean up logic for mixing inplace/non-inplace flattened files. if the inplace is available, it is always used.

0.3.6/7

  • dont re-flatten the file every time!
  • allow spaces before and after the header in the orginal fasta.

0.3.5

  • update docs in README.txt for new CLI stuff.
  • allow flattening inplace.
  • get rid of memmap (results in faster parsing).

0.3.4

  • restore python2.5 compatiblity.
  • CLI: add ability to exclude sequence from extract
  • CLI: allow spliting based on header.

0.3.3

  • include this file in the tar ball (thanks wen h.)

0.3.2

  • separate out backends into records.py
  • use nosetests (python setup.py nosetests)
  • add a TCRecord backend for next-gen sequencing availabe if tc is (easy-)installed.
  • improve test coverage.
 
File Type Py Version Uploaded on Size
pyfasta-0.5.0.tar.gz (md5) Source 2013-08-29 18KB
  • Downloads (All Versions):
  • 96 downloads in the last day
  • 422 downloads in the last week
  • 2817 downloads in the last month