GenBankQC

"Automated quality control for GenBank genomes."

Development Status
- 3 - Alpha
Environment
- Console
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- POSIX :: Linux
Programming Language
- Python :: 3.6
Topic
- Scientific/Engineering :: Bio-Informatics

Project description

GenBank Quality Control

Complete documentation lives at genbank-qc.readthedocs.io. It is a work in progress.

GenBankQC is an effort to address the quality control problem for public databases such as the National Center for Biotechnology Information’s GenBank. The goal is to offer a simple, efficient, and automated solution for assessing the quality of your genomes.

Note

GenBankQC is currently in alpha. It began as a proof of concept for a specific use case and has limitations that users should be aware of. If there is interest, we will address these issues to make it more convenient to use. Please see Caveats for more details.

Features

Labelling/annotation-independent quality control based on:
- Simple metrics
- Genome distance estimation using MASH
Flag potential outliers so that they do not pollute your pipelines

The genbankqc workflow consists of the following steps:

1. Generate statistics for each genome based on the following metrics:
   - Number of unknown bases
   - Number of contigs
   - Assembly size
   - Average MASH distance compared to other genomes
2. Flag potential outliers based on these statistics:
   - Flag genomes containing more than a certain number of unknown bases.
   - Flag genomes whose number of contigs or assembly size falls outside a range based on the median absolute deviation.
   - Flag genomes whose average MASH distance exceeds the upper bound of the range based on the median absolute deviation.
3. Visualize the results with a color-coded tree.
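
The flagging step can be illustrated with a short Python sketch. This is not GenBankQC's actual implementation: the function names, the example data, and the thresholds (3.0 deviations, 200 unknown bases) are assumptions chosen for illustration, and the sketch uses the raw median absolute deviation without any scaling constant.

```python
from statistics import median


def mad_range(values, n_deviations=3.0):
    """Return (lower, upper) bounds: median -/+ n_deviations * MAD."""
    center = median(values)
    # Median absolute deviation: median of the distances from the median.
    mad = median([abs(v - center) for v in values])
    return center - n_deviations * mad, center + n_deviations * mad


def flag_outliers(metric_by_genome, n_deviations=3.0, upper_only=False):
    """Return the set of genomes whose metric falls outside the MAD range.

    With upper_only=True, only values above the upper bound are flagged,
    which mirrors how the MASH distance check is described.
    """
    lower, upper = mad_range(list(metric_by_genome.values()), n_deviations)
    flagged = set()
    for genome, value in metric_by_genome.items():
        too_high = value > upper
        too_low = value < lower and not upper_only
        if too_high or too_low:
            flagged.add(genome)
    return flagged


def flag_unknown_bases(unknowns_by_genome, max_unknowns=200):
    """Return the set of genomes with more than max_unknowns unknown bases."""
    return {g for g, n in unknowns_by_genome.items() if n > max_unknowns}


# Made-up contig counts for five hypothetical genomes.
contigs = {
    "genome_a": 10,
    "genome_b": 12,
    "genome_c": 11,
    "genome_d": 13,
    "genome_e": 200,
}
print(sorted(flag_outliers(contigs)))
```

In this sketch, the number of contigs and the assembly size would be checked with the default two-sided range, while the average MASH distance would be checked with `upper_only=True`, since a genome that is unusually close to the others is not a concern.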

Usage

genbankqc /path/to/genomes
open /path/to/genomes/Escherichia_coli/qc/200_3.0_3.0_3.0/tree.svg

Installation

If you don’t yet have a functional conda environment, please download and install Miniconda.

conda create -n genbankqc -c etetoolkit -c biocore pip ete3 scikit-bio

source activate genbankqc

pip install genbankqc

Caveats

There are some arbitrary, hard-coded limitations regarding file names. This is because the project originally began as a part of the NCBI Tool Kit (NCBITK), which we use for downloading genomes from NCBI. NCBITK generates a specific directory structure and file naming scheme that GenBankQC currently expects.

If you’d like to use GenBankQC without using NCBITK, all that is required is that your file names match the Python regular expression re.compile('.*(GCA_\d+\.\d.*)(.fasta)'). You can quickly test this by following my example at pythex.org.
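
If you prefer to check file names locally, a minimal sketch along these lines should work. The helper name and the example file names are hypothetical, and the use of `match` is an assumption about how the pattern is applied inside GenBankQC.

```python
import re

# The pattern quoted above, written as a raw string so the backslashes
# reach the regex engine unchanged.
PATTERN = re.compile(r'.*(GCA_\d+\.\d.*)(.fasta)')


def has_expected_name(file_name):
    """Return True if file_name matches the expected naming scheme."""
    return PATTERN.match(file_name) is not None


# Hypothetical file names used only to exercise the pattern.
examples = [
    "GCA_000005845.2_ASM584v2_genomic.fasta",
    "ecoli_k12.fasta",
    "GCA_000005845.2_ASM584v2_genomic.fna",
]
for name in examples:
    print(name, has_expected_name(name))
```

Note that the `.` before `fasta` in the pattern is not escaped, so it matches any single character rather than only a literal dot.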

Release history

- 0.2a0 (pre-release), Dec 10, 2017 (this version)
- 0.1a0 (pre-release), Dec 9, 2017

Download files

Source Distribution

GenBankQC-0.2a0.tar.gz (11.8 kB), uploaded Dec 10, 2017

Hashes for GenBankQC-0.2a0.tar.gz
Algorithm	Hash digest
SHA256	`f1aad07badc81af3a0b65c14d892fa8716ee8696263e6f99a5f795dd46038180`
MD5	`142daca61f1e4119362fa97ca6fa5f44`
BLAKE2b-256	`55b8f48c8cb01371078f0d8099eadc90fa27d3d65696c0edb7e87860c9e514b6`