"Automated quality control for GenBank genomes."

GenBank Quality Control

Complete documentation lives at genbank-qc.readthedocs.io. It is a work in progress.

GenBankQC is an effort to address the quality control problem for public databases such as the National Center for Biotechnology Information’s GenBank. The goal is to offer a simple, efficient, and automated solution for assessing the quality of your genomes.

Note

GenBankQC is currently in alpha. As a proof of concept for a specific use case, it has limitations that users should be aware of. If there is interest, we will address these issues to make it more convenient to use. Please see Caveats for more details.

Features

  • Labelling/annotation-independent quality control based on:

    • Simple metrics

    • Genome distance estimation using MASH

  • Flag potential outliers so they can be excluded before they pollute your pipelines

The genbankqc workflow consists of the following steps:

  1. Generate statistics for each genome based on the following metrics:

    • Number of unknown bases

    • Number of contigs

    • Assembly size

    • Average MASH distance compared to other genomes

  2. Flag potential outliers based on these statistics:

    • Flag genomes containing more than a certain number of unknown bases.

    • Flag genomes outside of a range based on the median absolute deviation.

      • Applies to number of contigs and assembly size

    • Flag genomes whose average MASH distance is greater than the upper end of the range based on the median absolute deviation.

  3. Visualize the results with a color-coded tree
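The first two steps above can be sketched in a few lines of Python. This is a minimal illustration, not GenBankQC's actual implementation: the function names (`simple_metrics`, `mad_range`, `flag_outliers`), the multiplier `k=3.0`, and the exact form of the range (median plus or minus `k` times the median absolute deviation) are all assumptions made for the example.

```python
import statistics


def simple_metrics(fasta_text):
    """Count contigs, assembly size, and unknown (N) bases in FASTA text.

    Hypothetical helper for illustration only.
    """
    contigs = 0
    assembly_size = 0
    unknown_bases = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            contigs += 1  # each header line starts a new contig
        else:
            assembly_size += len(line)
            unknown_bases += line.upper().count("N")
    return {
        "contigs": contigs,
        "assembly_size": assembly_size,
        "unknown_bases": unknown_bases,
    }


def mad_range(values, k=3.0):
    """Return (lower, upper) as median -/+ k * median absolute deviation.

    The multiplier k is an assumed, user-chosen tolerance.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return med - k * mad, med + k * mad


def flag_outliers(values, k=3.0, upper_only=False):
    """Flag values outside the MAD-based range.

    upper_only=True flags only values above the upper bound, as one
    might do for average MASH distance.
    """
    lower, upper = mad_range(values, k)
    return [v > upper or (not upper_only and v < lower) for v in values]


if __name__ == "__main__":
    # Made-up contig counts for six genomes; the last one stands out.
    contig_counts = [10, 12, 11, 13, 12, 90]
    print(flag_outliers(contig_counts))
```

The median absolute deviation is used here because, unlike the standard deviation, it is not inflated by the very outliers it is meant to detect.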

Usage

genbankqc /path/to/genomes
open /path/to/genomes/Escherichia_coli/qc/200_3.0_3.0_3.0/tree.svg

Installation

If you don’t yet have a functional conda environment, please download and install Miniconda.

conda create -n genbankqc -c etetoolkit -c biocore pip ete3 scikit-bio

source activate genbankqc

pip install genbankqc

Caveats

There are some arbitrary, hard-coded limitations regarding file names. This is because the project began as a part of the NCBI Tool Kit (NCBITK), which we use for downloading genomes from NCBI. NCBITK generates a specific directory structure and file naming scheme, which GenBankQC currently expects.

If you’d like to use GenBankQC without using NCBITK, all that is required is that your file names match the Python regular expression re.compile('.*(GCA_\d+\.\d.*)(.fasta)'). You can quickly test this by following my example at pythex.org.
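As an alternative to an online tester, the pattern can be checked locally. The sketch below uses the regular expression exactly as given above; the helper name `matches_expected_name` and the sample file names are hypothetical, and the use of `match` (anchored at the start of the string) rather than `search` is an assumption about how the check is applied.

```python
import re

# The pattern quoted above, written as a raw string so the backslashes
# reach the regex engine unchanged.
FILENAME_PATTERN = re.compile(r'.*(GCA_\d+\.\d.*)(.fasta)')


def matches_expected_name(file_name):
    """Return True if file_name matches the pattern (hypothetical helper)."""
    return FILENAME_PATTERN.match(file_name) is not None


if __name__ == "__main__":
    # Made-up file names for illustration.
    for name in ["GCA_000005845.2_ASM584v2_genomic.fasta", "genome.fa"]:
        print(name, matches_expected_name(name))
```

Note that the dot in `(.fasta)` is unescaped, so it matches any single character, not only a literal period.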


Source Distribution

GenBankQC-0.2a0.tar.gz (11.8 kB)