pyfasta

fast, memory-efficient, pythonic access to fasta sequence files

Project description

Author: Brent Pedersen (brentp)
Email: bpederse@gmail.com
License: MIT

Implementation

Requires Python >= 2.5. pyfasta stores a flattened copy of the fasta file, with headers and whitespace stripped, and reads from it through either a numpy memmap of the binary file or fseek/fread, so the full sequence data is never loaded into memory. It also saves a pickle (.gdx) holding the start and stop offsets of each header's sequence within the flattened file; the fseek/mmap backends use these offsets internally.
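
The flatten-and-index idea described above can be sketched with the standard library alone. This is a minimal illustration, not pyfasta's actual code: the helper names `flatten_fasta` and `read_range` are hypothetical, an in-memory `io.BytesIO` stands in for the flattened file on disk, and using the full header text as the key is an assumption of this sketch.

```python
import io
import pickle


def flatten_fasta(lines):
    """Concatenate sequence lines and record (start, stop) offsets per header."""
    flat = bytearray()
    index = {}
    name = None
    start = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if name is not None:
                index[name] = (start, len(flat))
            name = line[1:].strip()
            start = len(flat)
        else:
            flat.extend(line.encode("ascii"))
    if name is not None:
        index[name] = (start, len(flat))
    return bytes(flat), index


def read_range(handle, index, name, lo, hi):
    """Seek into the flattened data and read only bases [lo, hi) of one record.

    lo and hi are non-negative, 0-based offsets within the record.
    """
    start, stop = index[name]
    lo = min(start + lo, stop)
    hi = min(start + hi, stop)
    handle.seek(lo)
    return handle.read(max(hi - lo, 0)).decode("ascii")


fasta_lines = [">chr1", "ACTGACTG", "ACTG", ">chr2", "AAAA", "TTTT"]
flat, index = flatten_fasta(fasta_lines)

# The index is small, so it can be pickled next to the flattened data.
index_blob = pickle.dumps(index)

handle = io.BytesIO(flat)  # stand-in for an open flattened file
piece = read_range(handle, pickle.loads(index_blob), "chr2", 2, 6)
print(index)
print(piece)
```

Only the requested byte range is read; the rest of the flattened data is never touched, which is the point of keeping per-header offsets.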

Usage

>>> from pyfasta import Fasta

>>> f = Fasta('tests/data/three_chrs.fasta')
>>> sorted(f.keys())
['chr1', 'chr2', 'chr3']

>>> f['chr1']
NpyFastaRecord(0..80)

Slicing

>>> f['chr1'][:10]
'ACTGACTGAC'

# get the 1st basepair in every codon (it's python yo)
>>> f['chr1'][::3]
'AGTCAGTCAGTCAGTCAGTCAGTCAGT'


# the index stores the start and stop of each header from the flattened
# fasta file. (you should never need this)
>>> f.index
{'chr3': (160, 3760), 'chr2': (80, 160), 'chr1': (0, 80)}


# can query by a 'feature' dictionary; start and stop are 1-based and inclusive
>>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9})
'CTGACTGA'

# same as:
>>> f['chr1'][1:9]
'CTGACTGA'

# with reverse complement for - strand
>>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9, 'strand': '-'})
'TCAGTCAG'
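
The coordinate handling in the feature queries above can be sketched in plain Python. This is an illustration of the 1-based, inclusive convention and the reverse complement, not pyfasta's implementation: `feature_sequence` is a hypothetical helper, and a plain dict of strings (with `"ACTG" * 20` as a stand-in for chr1) replaces the Fasta object.

```python
# Translation table for complementing bases; other characters (e.g. N) pass through.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def feature_sequence(seqs, feature):
    """Return the sequence for a feature dict with 1-based, inclusive coordinates."""
    seq = seqs[feature["chr"]]
    # 1-based inclusive [start, stop] maps to the Python slice [start - 1:stop].
    sub = seq[feature["start"] - 1:feature["stop"]]
    if feature.get("strand") == "-":
        # Reverse complement for the minus strand.
        sub = sub.translate(COMPLEMENT)[::-1]
    return sub


seqs = {"chr1": "ACTG" * 20}  # stand-in data, 80 bases
plus = feature_sequence(seqs, {"chr": "chr1", "start": 2, "stop": 9})
minus = feature_sequence(seqs, {"chr": "chr1", "start": 2, "stop": 9, "strand": "-"})
print(plus)
print(minus)
```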

Numpy

The default backend is a memmapped numpy array, so it’s possible to get back an array directly instead of a string:

>>> f['chr1'].tostring = False
>>> f['chr1'][:10] # doctest: +NORMALIZE_WHITESPACE
memmap(['A', 'C', 'T', 'G', 'A', 'C', 'T', 'G', 'A', 'C'], dtype='|S1')

>>> import numpy as np
>>> a = np.array(f['chr2'])
>>> a.shape[0] == len(f['chr2'])
True

>>> a[10:14]
array(['A', 'A', 'A', 'A'],
      dtype='|S1')

mask a sub-sequence:

>>> a[11:13] = np.array('N', dtype='c')
>>> a[10:14].tostring()
'ANNA'

Backends (Record class)

It’s also possible to specify another record class as the underlying work-horse for slicing and reading. Currently there are two: the default NpyFastaRecord, which uses a numpy memmap, and FastaRecord, which uses fseek/fread. It’s possible to create your own by sub-classing FastaRecord; see the source for details. The next planned addition is a pytables/hdf5 backend.

>>> from pyfasta import FastaRecord # default is NpyFastaRecord
>>> f = Fasta('tests/data/three_chrs.fasta', record_class=FastaRecord)
>>> f['chr1']
FastaRecord('tests/data/three_chrs.fasta.flat', 0..80)

Other than the repr, it should behave exactly like the NpyFastaRecord backend.
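
To make the fseek/fread idea concrete, here is a rough sketch of a record that answers slices by seeking into flattened data. It is not pyfasta's FastaRecord and does not show the interface a real sub-class must implement: `SeekRecord` is a hypothetical class, and an `io.BytesIO` with made-up contents stands in for the .flat file.

```python
import io


class SeekRecord:
    """Hypothetical seek/read-backed record over a flattened sequence file."""

    def __init__(self, handle, start, stop):
        self.handle = handle
        self.start = start
        self.stop = stop

    def __len__(self):
        return self.stop - self.start

    def __getitem__(self, key):
        if isinstance(key, int):
            # range() normalizes negative indexes and raises IndexError if out of range.
            i = range(len(self))[key]
            key = slice(i, i + 1)
        wanted = range(*key.indices(len(self)))
        if len(wanted) == 0:
            return ""
        # Read one contiguous span covering every requested position.
        lo = min(wanted[0], wanted[-1])
        hi = max(wanted[0], wanted[-1]) + 1
        self.handle.seek(self.start + lo)
        chunk = self.handle.read(hi - lo).decode("ascii")
        # Apply the step (including negative steps) on the small chunk only.
        return "".join(chunk[i - lo] for i in wanted)

    def __repr__(self):
        return "SeekRecord(%d..%d)" % (self.start, self.stop)


# Stand-in flattened data: an 80-base record followed by another 80-base record.
handle = io.BytesIO(b"ACTG" * 20 + b"A" * 80)
chr1 = SeekRecord(handle, 0, 80)
chr2 = SeekRecord(handle, 80, 160)
print(chr1[:10])
print(chr2[10:14])
```

Because only the span between the lowest and highest requested positions is read, slicing a small window costs the same regardless of how large the underlying file is.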

Cleanup (in real use, leave these files in place for faster access next time):

>>> import os
>>> os.unlink('tests/data/three_chrs.fasta.gdx')
>>> os.unlink('tests/data/three_chrs.fasta.npy')
>>> os.unlink('tests/data/three_chrs.fasta.flat')

Release history

0.5.2	Apr 3, 2014
0.5.1	Oct 3, 2013
0.5.0	Aug 29, 2013
0.4.5	Feb 21, 2012
0.4.4	Oct 12, 2011
0.4.3	May 31, 2011
0.4.2	Apr 5, 2011
0.4.1	Dec 1, 2010
0.4.0	Oct 25, 2010
0.3.9	Mar 17, 2010
0.3.7	Dec 21, 2009
0.3.6	Dec 21, 2009
0.3.5	Dec 20, 2009
0.3.4	Dec 15, 2009
0.3.3	Dec 6, 2009
0.3.2	Dec 3, 2009
0.3.1	Nov 17, 2009
0.3.0	Nov 17, 2009
0.2.9	Nov 10, 2009 (this version)
0.2.8	Nov 6, 2009
0.2.5	Sep 23, 2009
0.2.4	Sep 9, 2009
0.2.3	Sep 8, 2009
0.2.2	Sep 8, 2009
0.2.1	Jul 13, 2009
0.2	Jul 13, 2009
0.1	May 27, 2009

Download files

Source Distribution

pyfasta-0.2.9.tar.gz (6.4 kB)

Uploaded Nov 10, 2009 Source

Hashes for pyfasta-0.2.9.tar.gz
Algorithm	Hash digest
SHA256	`ccfa05be44ac9649f732de9efe3d3fdeda67804f424f51941d9758282512e506`
MD5	`294a1b1c77d48ade89be35507d6a8837`
BLAKE2b-256	`ef43f6c4e3bde5be1c0bfed3bfcea7381b0d4591ec4f39fd45b67ed702f11af1`