gtf_to_genes 1.09

Fast GTF parser

    We want an extremely fast, lightweight way to access gene data stored in GTF format.

    The parsed data is held in an intuitive hierarchy:
        gene
             -> transcript
             -> transcript
        with exons being stored as intervals
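As a rough illustration of the hierarchy described above, here is a minimal sketch of genes that own transcripts, with exons held as (start, end) integer pairs. The class and field names (`Gene`, `Transcript`, `gene_type`, `exons`) are hypothetical and chosen for illustration; they are not taken from the gtf_to_genes source.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Transcript:
    # Hypothetical container: one transcript and its exons.
    transcript_id: str
    # Each exon is stored as a plain (start, end) integer pair.
    exons: List[Tuple[int, int]] = field(default_factory=list)


@dataclass
class Gene:
    # Hypothetical container: one gene and the transcripts it owns.
    gene_id: str
    gene_type: str
    transcripts: List[Transcript] = field(default_factory=list)


# A made-up gene with two transcripts that share their first exon.
gene = Gene(
    "GENE1",
    "protein_coding",
    [
        Transcript("TX1", [(100, 200), (300, 450)]),
        Transcript("TX2", [(100, 200), (500, 620)]),
    ],
)

# Walk the hierarchy: gene -> transcripts -> exon intervals.
exon_count = sum(len(t.exons) for t in gene.transcripts)
```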

    Our aim is to
       * cache the parsed data in a binary format
       * re-read that cache in < 10s for even the largest genomes

    Currently, the initial parse of Ensembl Homo sapiens release 56 takes around 4.5 minutes.
    The binary data can then be reloaded in < 10s.
    The binary cache preserves *all* of the structure in the original GTF file.
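The parse-once, reload-fast idea can be sketched generically as below. This is not the package's actual cache format or code: it assumes Python's standard `pickle` module as the binary format, and the helper name `load_or_parse` and its parameters are hypothetical. The `ignore_cache` flag mirrors the option of the same name used in the example further down.

```python
import os
import pickle


def load_or_parse(source_path, cache_path, parse, ignore_cache=False):
    """Return parsed data, reusing a binary cache when it is up to date.

    `parse` is any callable that takes a file path and returns the
    parsed data (assumed to be picklable).
    """
    cache_is_fresh = (
        os.path.exists(cache_path)
        and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)
    )
    if cache_is_fresh and not ignore_cache:
        # Fast path: reload the binary cache instead of re-parsing.
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)

    # Slow path: parse the source file, then write the cache for next time.
    data = parse(source_path)
    with open(cache_path, "wb") as handle:
        pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return data
```

Comparing modification times means the cache is rebuilt automatically whenever the source file is newer than the cache.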

    Note that we sacrifice memory usage for speed. This is seldom a problem for modern computers
    and genome sizes (there are around 400,000 exons, but these are stored as intervals / int pairs).
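To give a feel for why integer pairs keep the memory cost modest, the sketch below packs 400,000 synthetic exon intervals into two typed arrays from the standard `array` module and computes the size of the raw data. The coordinates are made up, and storing the pairs in `array` objects is an assumption for illustration, not a description of how gtf_to_genes lays out its data.

```python
from array import array

EXON_COUNT = 400_000  # rough exon count quoted in the text above

# Synthetic, made-up coordinates: one start and one end per exon.
starts = array("i", range(0, EXON_COUNT * 10, 10))
ends = array("i", (s + 5 for s in starts))

# Raw payload size: two integers per exon, itemsize bytes each.
# With the common 4-byte C int this comes to a few megabytes.
packed_bytes = (len(starts) + len(ends)) * starts.itemsize
```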

A Simple example
        gene_structures = t_parse_gtf("Mus musculus")

        #   use cached data for speed
        ignore_cache = False

        #   get protein-coding genes only
        #   (gtf_file and logger are assumed to be defined by the surrounding code)
        genes_by_type = gene_structures.get_genes(gtf_file, logger, ["protein_coding"], ignore_cache = ignore_cache)

        #   print out gene counts
        t_parse_gtf.log_gene_types (logger, genes_by_type)

        return genes_by_type
File                              Type    Py Version  Uploaded on  Size
gtf_to_genes-1.09.tar.gz (md5)    Source              2012-07-12   23KB