Skip to main content

pyfaidx: efficient pythonic random access to fasta subsequences

Project description

Travis PyPI Code Health Coveralls

Description

Samtools provides a function “faidx” (FAsta InDeX), which creates a small flat index file “.fai” allowing for fast random access to any subsequence in the indexed FASTA file, while loading a minimal amount of the file in to memory. This python module implements pure Python classes for indexing, retrieval, and in-place modification of FASTA files.

A manuscript is currently under preparation.

Installation

This package is tested under Linux, MacOS, and Windows using Python 3.2-3.4, 2.7, 2.6, and pypy.

pip install pyfaidx

or

python setup.py install

Usage

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta')
>>> genes
Fasta("tests/data/genes.fasta")  # set strict_bounds=True for bounds checking

Acts like a dictionary.

>>> genes.keys() ('AB821309.1', 'KF435150.1', 'KF435149.1', 'NR_104216.1', 'NR_104215.1', 'NR_104212.1', 'NM_001282545.1', 'NM_001282543.1', 'NM_000465.3', 'NM_001282549.1', 'NM_001282548.1', 'XM_005249645.1', 'XM_005249644.1', 'XM_005249643.1', 'XM_005249642.1', 'XM_005265508.1', 'XM_005265507.1', 'XR_241081.1', 'XR_241080.1', 'XR_241079.1')

>>> genes['NM_001282543.1'][200:230]
>NM_001282543.1:201-230
CTCGTTCCGCGCCCGCCATGGAACCGGATG

>>> genes['NM_001282543.1'][200:230].seq
'CTCGTTCCGCGCCCGCCATGGAACCGGATG'

>>> genes['NM_001282543.1'][200:230].name
'NM_001282543.1'

>>> genes['NM_001282543.1'][200:230].start
201

>>> genes['NM_001282543.1'][200:230].end
230

>>> genes['NM_001282543.1'][200:230].longname
'NM_001282543.1:201-230'

>>> len(genes['NM_001282543.1'])
5466

Indexes like a list:

>>> genes[0][:50]
>AB821309.1:1-50
ATGGTCAGCTGGGGTCGTTTCATCTGCCTGGTCGTGGTCACCATGGCAAC

Slices just like a string:

>>> genes['NM_001282543.1'][200:230][:10]
>NM_001282543.1:201-210
CTCGTTCCGC

>>> genes['NM_001282543.1'][200:230][::-1]
>NM_001282543.1:230-201
GTAGGCCAAGGTACCGCCCGCGCCTTGCTC

>>> genes['NM_001282543.1'][200:230][::3]
>NM_001282543.1:201-230
CGCCCCTACA

>>> genes['NM_001282543.1'][:]
>NM_001282543.1:1-5466
CCCCGCCCCT........
  • Start and end coordinates are 0-based, just like Python.

Sequence can be buffered in memory using a read-ahead buffer for fast sequential access:

>>> from timeit import timeit
>>> fetch = "genes['NM_001282543.1'][200:230]"
>>> read_ahead = "import pyfaidx; genes = pyfaidx.Fasta('tests/data/genes.fasta', read_ahead=10000)"
>>> no_read_ahead = "import pyfaidx; genes = pyfaidx.Fasta('tests/data/genes.fasta')"
>>> string_slicing = "genes = {}; genes['NM_001282543.1'] = 'N'*10000"

>>> timeit(fetch, no_read_ahead, number=10000)
0.2204863309962093
>>> timeit(fetch, read_ahead, number=10000)
0.1121859749982832
>>> timeit(fetch, string_slicing, number=10000)
0.0033553699977346696

Read-ahead buffering can reduce runtime by 1/2 for sequential accesses to buffered regions.

Complements and reverse complements just like DNA

>>> genes['NM_001282543.1'][200:230].complement
>NM_001282543.1 (complement):201-230
GAGCAAGGCGCGGGCGGTACCTTGGCCTAC

>>> genes['NM_001282543.1'][200:230].reverse
>NM_001282543.1:230-201
GTAGGCCAAGGTACCGCCCGCGCCTTGCTC

>>> -genes['NM_001282543.1'][200:230]
>NM_001282543.1 (complement):230-201
CATCCGGTTCCATGGCGGGCGCGGAACGAG

Custom key functions provide cleaner access:

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', key_function = lambda x: x.split('.')[0])
>>> genes.keys()
dict_keys(['NR_104212', 'NM_001282543', 'XM_005249644', 'XM_005249645', 'NR_104216', 'XM_005249643', 'NR_104215', 'KF435150', 'AB821309', 'NM_001282549', 'XR_241081', 'KF435149', 'XR_241079', 'NM_000465', 'XM_005265508', 'XR_241080', 'XM_005249642', 'NM_001282545', 'XM_005265507', 'NM_001282548'])
>>> genes['NR_104212'][:10]
>NR_104212:1-10
CCCCGCCCCT

Filter functions (returning True) limit the index:

# new in v0.3.8
>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', filt_function = lambda x: x[0] == 'N')
>>> genes.keys()
dict_keys(['NR_104212', 'NM_001282543', 'NR_104216', 'NR_104215', 'NM_001282549', 'NM_000465', 'NM_001282545', 'NM_001282548'])
>>> genes['XM_005249644']
KeyError: XM_005249644 not in tests/data/genes.fasta.

Or just get a Python string:

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', as_raw=True)
>>> genes
Fasta("tests/data/genes.fasta", as_raw=True)

>>> genes['NM_001282543.1'][200:230]
CTCGTTCCGCGCCCGCCATGGAACCGGATG

You can also perform line-based iteration, receiving the sequence lines as they appear in the FASTA file:

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta')
>>> for line in genes['NM_001282543.1']:
...   print(line)
CCCCGCCCCTCTGGCGGCCCGCCGTCCCAGACGCGGGAAGAGCTTGGCCGGTTTCGAGTCGCTGGCCTGC
AGCTTCCCTGTGGTTTCCCGAGGCTTCCTTGCTTCCCGCTCTGCGAGGAGCCTTTCATCCGAAGGCGGGA
CGATGCCGGATAATCGGCAGCCGAGGAACCGGCAGCCGAGGATCCGCTCCGGGAACGAGCCTCGTTCCGC
...

Sequence names are truncated on any whitespace. This is a limitation of the indexing strategy. However, full names can be recovered:

# new in v0.3.7
>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta')
>>> for record in genes:
...   print(record.name)
...   print(record.long_name)
...
gi|563317589|dbj|AB821309.1|
gi|563317589|dbj|AB821309.1| Homo sapiens FGFR2-AHCYL1 mRNA for FGFR2-AHCYL1 fusion kinase protein, complete cds
gi|557361099|gb|KF435150.1|
gi|557361099|gb|KF435150.1| Homo sapiens MDM4 protein variant Y (MDM4) mRNA, complete cds, alternatively spliced
gi|557361097|gb|KF435149.1|
gi|557361097|gb|KF435149.1| Homo sapiens MDM4 protein variant G (MDM4) mRNA, complete cds
...

If you want to modify the contents of your FASTA file in-place, you can use the mutable argument. Any portion of the FastaRecord can be replaced with an equivalent-length string. Warning: This will change the contents of your file immediately and permanently:

>>> genes = Fasta('tests/data/genes.fasta', mutable=True)
>>> type(genes['NM_001282543.1'])
<class 'pyfaidx.MutableFastaRecord'>

>>> genes['NM_001282543.1'][:10]
>NM_001282543.1:1-10
CCCCGCCCCT
>>> genes['NM_001282543.1'][:10] = 'NNNNNNNNNN'
>>> genes['NM_001282543.1'][:15]
>NM_001282543.1:1-15
NNNNNNNNNNCTGGC

It also provides a command-line script:

cli script: faidx

For usage type faidx -h.

$ faidx tests/data/genes.fasta NM_001282543.1:201-210 NM_001282543.1:300-320
>NM_001282543.1:201-210
CTCGTTCCGC
>NM_001282543.1:300-320
GTAATTGTGTAAGTGACTGCA

$ faidx --full-names tests/data/genes.fasta NM_001282543.1:201-210
>NM_001282543.1| Homo sapiens BRCA1 associated RING domain 1 (BARD1), transcript variant 2, mRNA
CTCGTTCCGC

$ faidx --no-names tests/data/genes.fasta NM_001282543.1:201-210 NM_001282543.1:300-320
CTCGTTCCGC
GTAATTGTGTAAGTGACTGCA

$ faidx --complement tests/data/genes.fasta NM_001282543.1:201-210
>NM_001282543.1:201-210 (complement)
GAGCAAGGCG

$ faidx --reverse tests/data/genes.fasta NM_001282543.1:201-210
>NM_001282543.1:210-201
CGCCTTGCTC

$ faidx --reverse --complement tests/data/genes.fasta NM_001282543.1:201-210
>NM_001282543.1:210-201 (complement)
GCGGAACGAG

$ faidx tests/data/genes.fasta NM_001282543.1
>NM_001282543.1:1-5466
CCCCGCCCCT........
..................
..................
..................

$ faidx --regex "^NM_00128254[35]" genes.fasta
>NM_001282543.1
..................
..................
..................
>NM_001282545.1
..................
..................
..................

$ faidx --lazy tests/data/genes.fasta NM_001282543.1:5460-5480
>NM_001282543.1:5460-5480
AAAAAAANNNNNNNNNNNNNN

$ faidx --lazy --default-seq='Q' tests/data/genes.fasta NM_001282543.1:5460-5480
>NM_001282543.1:5460-5480
AAAAAAAQQQQQQQQQQQQQQ

$ faidx tests/data/genes.fasta --bed regions.bed
...

$ faidx --stats tests/data/genes.fasta
AB821309.1  3510
KF435150.1  481
KF435149.1  642
NR_104216.1 4573
NR_104215.1 5317
NR_104212.1 5374
NM_001282545.1      4170
NM_001282543.1      5466
NM_000465.3 5523
NM_001282549.1      3984
NM_001282548.1      4113
XM_005249645.1      2752
XM_005249644.1      3004
XM_005249643.1      3109
XM_005249642.1      3097
XM_005265508.1      2794
XM_005265507.1      2848
XR_241081.1 1009
XR_241080.1 4884
XR_241079.1 2819

$ faidx --split-files tests/data/genes.fasta
$ ls
AB821309.1.fasta    NM_001282549.1.fasta    XM_005249645.1.fasta
KF435149.1.fasta    NR_104212.1.fasta       XM_005265507.1.fasta
KF435150.1.fasta    NR_104215.1.fasta       XM_005265508.1.fasta
NM_000465.3.fasta   NR_104216.1.fasta       XR_241079.1.fasta
NM_001282543.1.fasta        XM_005249642.1.fasta    XR_241080.1.fasta
NM_001282545.1.fasta        XM_005249643.1.fasta    XR_241081.1.fasta
NM_001282548.1.fasta        XM_005249644.1.fasta

$ faidx --delimiter='_' tests/data/genes.fasta 000465.3
>000465.3
CCCCGCCCCTCTGGCGGCCCGCCGTCCCAGACGCGGGAAGAGCTTGGCCGGTTTCGAGTCGCTGGCCTGC
AGCTTCCCTGTGGTTTCCCGAGGCTTCCTTGCTTCCCGCTCTGCGAGGAGCCTTTCATCCGAAGGCGGGA
.......

Similar syntax as samtools faidx

A lower-level Faidx class is also available:

>>> from pyfaidx import Faidx
>>> fa = Faidx('genes.fa')  # can return str with as_raw=True
>>> fa.index
OrderedDict([('AB821309.1', IndexRecord(rlen=3510, offset=12, lenc=70, lenb=71)), ('KF435150.1', IndexRecord(rlen=481, offset=3585, lenc=70, lenb=71)),... ])

>>> fa.index['AB821309.1'].rlen
3510

fa.fetch('AB821309.1', 1, 10)  # these are 1-based genomic coordinates
>AB821309.1:1-10
ATGGTCAGCT
  • If the FASTA file is not indexed, when Faidx is initialized the build_index method will automatically run, and the index will be written to “filename.fa.fai” with write_fai(). where “filename.fa” is the original FASTA file.

  • Start and end coordinates are 1-based.

Changelog

Please see the releases for a comprehensive list of version changes.

Acknowledgements

This project is freely licensed by the author, Matthew Shirley, and was completed under the mentorship and financial support of Drs. Sarah Wheelan and Vasan Yegnasubramanian at the Sidney Kimmel Comprehensive Cancer Center in the Department of Oncology.

Project details


Release history Release notifications | RSS feed

This version

0.3.9

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyfaidx-0.3.9.tar.gz (15.8 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page