gtf_to_genes 1.08
Fast GTF parser
***************************************
Overview
***************************************
We want an extremely fast, lightweight way to access gene data stored in GTF format.
The parsed data is held in an intuitive hierarchy:

    Gene
        -> transcript
        -> transcript

with exons being stored as intervals.
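
To make the gene / transcript / exon-interval layout concrete, here is a minimal sketch. The names ``Gene``, ``Transcript`` and their fields are hypothetical and chosen only for illustration; they are not the classes that gtf_to_genes actually defines.

```python
from collections import namedtuple

# Hypothetical containers for illustration only (not the gtf_to_genes classes).
Transcript = namedtuple("Transcript", ["transcript_id", "exons"])
Gene = namedtuple("Gene", ["gene_id", "gene_type", "transcripts"])

# Each exon is just a (start, end) pair of integers.
gene = Gene(
    gene_id="GENE1",
    gene_type="protein_coding",
    transcripts=[
        Transcript("TX1", [(100, 200), (300, 400)]),
        Transcript("TX2", [(100, 200), (500, 650)]),
    ],
)

# Walk the hierarchy: gene -> transcripts -> exon intervals.
exon_count = sum(len(t.exons) for t in gene.transcripts)
total_exon_length = sum(
    end - start + 1  # GTF coordinates are 1-based and inclusive
    for t in gene.transcripts
    for (start, end) in t.exons
)
```

Storing exons as plain integer pairs keeps each exon small, which matters when a genome has hundreds of thousands of them.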
Our aim is to

* cache the parsed data in a binary format
* re-read that cache in < 10s for even the largest genomes

Currently, the initial parse of Ensembl Homo sapiens release 56 takes around 4.5 minutes.
The binary data can then be reloaded in < 10s.
This cache retains *all* of the data structure of the original GTF file.
Note that we sacrifice memory usage for speed. This is seldom a problem for modern computers
and genome sizes (there are around 400,000 exons, but they are stored as intervals, i.e. pairs of integers).
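
The parse-once, reload-from-binary idea can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's ``pickle`` as the binary format and a toy ``slow_parse`` function, both of which are stand-ins and not the cache format or parser that gtf_to_genes itself uses. The ``ignore_cache`` flag mirrors the parameter name in the example later in this document.

```python
import os
import pickle
import tempfile

def slow_parse(gtf_path):
    """Stand-in for the expensive text parse: collect (start, end) per feature line."""
    intervals = []
    with open(gtf_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            # GTF columns 4 and 5 hold the start and end coordinates.
            intervals.append((int(fields[3]), int(fields[4])))
    return intervals

def load_with_cache(gtf_path, cache_path, ignore_cache=False):
    """Return (data, from_cache); reuse the cache unless it is missing or stale."""
    cache_is_fresh = (
        os.path.exists(cache_path)
        and os.path.getmtime(cache_path) >= os.path.getmtime(gtf_path)
    )
    if cache_is_fresh and not ignore_cache:
        with open(cache_path, "rb") as handle:
            return pickle.load(handle), True
    data = slow_parse(gtf_path)
    with open(cache_path, "wb") as handle:
        pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return data, False

# Tiny demonstration with a two-line GTF file in a temporary directory.
work_dir = tempfile.mkdtemp()
gtf_path = os.path.join(work_dir, "tiny.gtf")
cache_path = gtf_path + ".cache"
with open(gtf_path, "w") as handle:
    handle.write('chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1";\n')
    handle.write('chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "G1";\n')

first, first_from_cache = load_with_cache(gtf_path, cache_path)    # parses the text
second, second_from_cache = load_with_cache(gtf_path, cache_path)  # reads the cache
```

The first call pays the parsing cost and writes the cache; later calls skip the text parse entirely, which is where the speed-up comes from.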
***************************************
A Simple example
***************************************
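::

    # The wrapper function is added here so that the final ``return`` is valid;
    #   its name is illustrative only. ``t_parse_gtf`` is assumed to have been
    #   imported from the gtf_to_genes package. ``gtf_file`` is the path to the
    #   GTF file and ``logger`` is a logging object supplied by the caller.
    def get_protein_coding_genes(gtf_file, logger):
        gene_structures = t_parse_gtf("Mus musculus")

        #
        #   use cached data for speed
        #
        ignore_cache = False

        #
        #   get protein coding genes only
        #
        genes_by_type = gene_structures.get_genes(gtf_file, logger, ["protein_coding"], ignore_cache = ignore_cache)

        #
        #   print out gene counts
        #
        t_parse_gtf.log_gene_types(logger, genes_by_type)

        return genes_by_type
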
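
The example above takes a ``logger`` without showing where it comes from. Assuming a standard library ``logging.Logger`` is acceptable (this text does not confirm which logger types gtf_to_genes expects), one could be set up roughly like this. The helper name ``make_logger`` and the logger name are hypothetical.

```python
import logging
import sys

def make_logger(name="gtf_to_genes_example", level=logging.INFO):
    """Build a stderr logger; hypothetical helper, not part of gtf_to_genes."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger

logger = make_logger()
logger.info("logger ready")
```

The resulting ``logger`` object would then be passed wherever the example expects its ``logger`` argument.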
- Author: Leo Goodstadt
- Home Page: http://code.google.com/p/gtf-to-genes/
- Keywords: GTF Ensembl gene transcript parser GFF bioinformatics science
- License: MIT
Categories
- Development Status :: 5 - Production/Stable
- Environment :: Console
- Intended Audience :: Developers
- Intended Audience :: End Users/Desktop
- Intended Audience :: Information Technology
- Intended Audience :: Science/Research
- License :: OSI Approved :: MIT License
- Programming Language :: Python
- Topic :: Scientific/Engineering
- Topic :: Scientific/Engineering :: Bio-Informatics
- Package Index Owner: bunbun
