Skip to main content

GO enrichment analysis from a list of gene names using a precomputed database

Project description

GO Enrichment package

This package execute GO enrichment analysis froma list of gene names using a precomputed database. The GO terms are analyze using a hypergeometric test.

GO enrichment database

The GO graph structure is created from the Gene Ontology OBO file http://current.geneontology.org/ontology/go.obo

NCBI gene

The NCBI gene database is used to include genes to the GO terms graph. The required files are: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz and ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz

These files can be filter for a specific taxonomy id. This example is for human: 9606

gunzip -c gene_info.gz | grep -P "^9606\t" > gene_info_${taxid}
gzip gene_info_${taxid}
gunzip -c gene2go.gz | grep -P "^9606\t" > gene2go_${taxid}
gzip gene2go_${taxid}

Uniprot GOA

The Uniprot GOA files can be also used to add more genes to the GO graph. The complete file is: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gaf.gz

Uniprot GOA also include some pre-filtered organism: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/

TSV file: gene<tab>GO term

Any TSV file with the relationship between gene names and GO term can also be included into the database. The file just need to include in the first column the gene name and in the second column the GO term. Any other extra column will be ignored.

Ensembl BioMart

The Ensembl data can be alos include using their BioMart tool. Go to the Ensembl Biomart website: http://useast.ensembl.org/biomart/. Using this tool a TSV file can be generated with gene names in the first column and GO term in the second column.

Database creation

This example is for human. Please, note all input files should be gzipped.

goenrich_createdb --gene_info gene_info.gz --gene2go gene2go.gz --goa_uniprot goa_uniprot_all.gaf.gz --gobo go.obo --taxid 9606 --goenrichDB goenrichDB_20190419.pickle

Usage

usage: goenrich_createdb [-h] [--gene_info GENE_INFO] [--gene2go GENE2GO]
                     [--tsv TSV] [--goenrichDB GOENRICHDB]
                     [--goa_uniprot GOA_UNIPROT] [--gobo GOBO] [--taxid TAXID]
                     -o O

Creates pickle data structure used by "goenrich.py"

optional arguments:
    -h, --help            show this help message and exit
    --gene_info GENE_INFO
                        NCBI gene_info file
    --gene2go GENE2GO     NCBI gene2go file
    --tsv TSV             TSV file with at least two columns: Gene_name<tab>GO
                        terms
    --goenrichDB GOENRICHDB
                        Previous created goenrich pickle file. The new genes
                        will be added to this database
    --goa_uniprot GOA_UNIPROT
                        Uniprot GOA file GAF format
    --gobo GOBO           UGO Obo file from Gene Ontology
    --taxid TAXID         Process genes for tax id if it is possible
    -o O                  Pickle output file name

Pre-computed databases

We offer some pre-computed database https://ftp.ncbi.nlm.nih.gov/pub/goenrichment/

Go enrichment analysis

The analysis is executed using the script goenrich.py. The input file is a text file with one gene name per line.

goenrich --goenrichDB gene2GO_human.pickle -i query.tsv -o goenrich.tsv

The gene2GO_human.pickle can be downloaded from https://ftp.ncbi.nlm.nih.gov/pub/goenrichment/goenrichDB_human.pickle

usage: goenrich [-h] -i I -o O [--goenrichDB GOENRICHDB]
                   [--min_category_depth MIN_CATEGORY_DEPTH]
                   [--min_category_size MIN_CATEGORY_SIZE]
                   [--max_category_size MAX_CATEGORY_SIZE] [--alpha ALPHA]

Calculate GO enrichment from a list of genes. Default database organism: human

optional arguments:
    -h, --help            show this help message and exit
    -i I                  Input list of gene names
    -o O                  TSV file with all results
    --goenrichDB GOENRICHDB
                        Gene2GO pickle file created with "goenrichDB.py". If
                        not provided the database is loaded from:
    --min_category_depth MIN_CATEGORY_DEPTH
                        Min GO term graph depth to include in the report.
                        Default: 4
    --min_category_size MIN_CATEGORY_SIZE
                        Min number of gene in a GO term to include in the
                        report. Default: 3
    --max_category_size MAX_CATEGORY_SIZE
                        Max number of gene in a GO term to include in the
                        report. Default: 500
    --alpha ALPHA         Alpha value for p-value correction. Default: 0.05

Requirements

  • Python 3.8
    • numpy
    • scipy
    • statsmodels
    • pandas
    • networkx

Public Domain notice

National Center for Biotechnology Information.

This software is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties as United States Government employees and thus cannot be copyrighted. This software is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction.

Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the NLM and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using this software or data. The NLM and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose.

Please cite NCBI in any work or product based on this material.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

goenrichment-1.0.3.tar.gz (10.6 kB view hashes)

Uploaded Source

Built Distribution

goenrichment-1.0.3-py3-none-any.whl (13.6 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page