Prokaryotic genome assembly and annotation pipeline

Development Status
- 4 - Beta
Environment
- Console
Intended Audience
- Science/Research
License
- OSI Approved :: BSD License
Operating System
- POSIX :: Linux
Programming Language
- Python :: 3
Topic
- Scientific/Engineering :: Bio-Informatics

Project description

ZGA - prokaryotic genome assembly and annotation pipeline

Installation

Installing dependencies

ZGA is written in Python and has been tested with Python 3.6 and 3.7. It depends on several tools and libraries:

FastQC
ea-utils
BBMap
SPAdes (>= 3.12 to support merged paired-end reads, >= 3.5.0 to support Nanopore reads)
Unicycler
CheckM
DFAST
Biopython
NCBI BLASTn
NxTrim
Mash
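The SPAdes version thresholds in the list above can be checked programmatically. The following is a minimal sketch, not part of ZGA: the function names are hypothetical, and the example version string is an assumption about what the tool reports. It extracts the first dotted version number from a string and compares it with the two thresholds.

```python
import re


def parse_version(text):
    """Return the first dotted version number in text as a 3-item tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    parts = [int(part) for part in match.group(0).split(".")]
    # Pad short versions such as "3.5" to (3, 5, 0) so comparisons behave.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def spades_capabilities(version_text):
    """Map a SPAdes version string to the features listed in the README."""
    version = parse_version(version_text)
    return {
        "nanopore_reads": version >= (3, 5, 0),
        "merged_paired_end_reads": version >= (3, 12, 0),
    }


# The version string below is an assumed example, not real program output.
print(spades_capabilities("SPAdes v3.13.0"))
```

Comparing tuples of integers avoids the pitfall of string comparison, where "3.5" would sort after "3.12".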

All of them can be installed with conda. Creating a new conda environment is highly recommended:

conda create -n zga "python>=3.6" fastqc ea-utils "spades>=3.12" unicycler checkm-genome dfast bbmap blast biopython nxtrim mash

and activate it:

conda activate zga

Alternatively, you can install the dependencies into an existing conda environment:

conda install "python>=3.6" fastqc ea-utils spades unicycler checkm-genome dfast bbmap blast biopython nxtrim mash

Of course, you can install the tools in other ways, including compiling them all from source. In that case, make sure the binaries are in your '$PATH' variable.
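One way to check this is with a short Python script. This is a minimal sketch: the executable names below are assumptions about what each package installs and may differ on your system, and the helper name is hypothetical.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
ASSUMED_EXECUTABLES = [
    "fastqc",
    "fastq-mcf",   # assumed to come from ea-utils
    "bbmerge.sh",  # assumed to come from BBMap
    "spades.py",
    "unicycler",
    "checkm",
    "dfast",
    "blastn",
    "nxtrim",
    "mash",
]


def find_missing(executables, which=shutil.which):
    """Return the names that cannot be located on PATH."""
    return [name for name in executables if which(name) is None]


missing = find_missing(ASSUMED_EXECUTABLES)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All listed executables were found on PATH.")
```

The `which` parameter only exists so the lookup can be replaced in tests; by default the standard `shutil.which` is used.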

Install from PyPI

Run pip install zga. It will check whether you have Biopython and install it if not. All other dependencies must be installed manually or with conda. CheckM is available on PyPI, but it's easier to install with conda.

Get source from GitHub

You can get ZGA by cloning the repository with git clone https://github.com/laxeye/zga.git or by downloading an archive. After downloading, enter the directory and run python3 setup.py build && python3 setup.py install.

Operating system requirements

ZGA was tested on Ubuntu 18.04 and 19.10. Most likely any modern 64-bit Linux distribution will work.

Your feedback on other OS is welcome!

Usage

Run zga -h to get a help message.

Examples:

Perform all steps: read QC, read trimming and merging, assembly, CheckM assessment with the default (bacterial) marker set, and DFAST annotation, using 4 CPU threads where possible:

zga -1 R1.fastq.gz -2 R2.fastq.gz --threads 4 -o my_assembly

or use SPAdes and provide it with paired-end and Nanopore reads from an archaeal genome (CheckM will use archaeal markers):

zga -1 R1.fastq.gz -2 R2.fastq.gz --nanopore MiniION.fastq.gz -a spades --threads 4 --domain archaea -o my_assembly

or assemble from Nanopore reads only, using Unicycler:

zga --first-step assembling --nanopore MiniION.fastq.gz -o nanopore_assembly
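To run the first example over several samples, the command line can be assembled in Python. This is a minimal sketch that uses only the options shown above (-1, -2, --threads, -o); the sample names, file names and the helper function are hypothetical.

```python
import shlex


def zga_assembly_command(r1, r2, outdir, threads=4):
    """Build the argument list for a full paired-end ZGA run."""
    return ["zga", "-1", r1, "-2", r2, "--threads", str(threads), "-o", outdir]


# Hypothetical samples: name -> (forward reads, reverse reads).
samples = {
    "sampleA": ("sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz"),
    "sampleB": ("sampleB_R1.fastq.gz", "sampleB_R2.fastq.gz"),
}

for name, (r1, r2) in samples.items():
    command = zga_assembly_command(r1, r2, f"{name}_assembly")
    # Print a shell-safe version of the command instead of running it.
    print(" ".join(shlex.quote(arg) for arg in command))
```

The script only prints the commands. To execute them, each list could be passed to subprocess.run(command, check=True) once the printed output looks right.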

Perform genome assessment and annotation:

With 'Pectobacterium' CheckM marker set:

zga --first-step check_genome -g pectobacterium_sp.fasta --checkm_rank genus --checkm_taxon Pectobacterium -o my_output_dir

Let CheckM infer the right marker set:

zga --first-step check_genome -g my_genome.fa --checkm_mode lineage -o my_output_dir
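The two ways of choosing a marker set can be wrapped in one helper. This is a minimal sketch with hypothetical function names that uses only the options from the two commands above: it falls back to lineage mode when no rank and taxon are given.

```python
def checkm_args(rank=None, taxon=None):
    """Return the CheckM-related options for a zga call."""
    if rank is None and taxon is None:
        # No taxon known: let CheckM infer the marker set.
        return ["--checkm_mode", "lineage"]
    if rank is None or taxon is None:
        raise ValueError("rank and taxon must be given together")
    return ["--checkm_rank", rank, "--checkm_taxon", taxon]


def check_genome_command(genome, outdir, rank=None, taxon=None):
    """Build the argument list for genome assessment and annotation."""
    command = ["zga", "--first-step", "check_genome", "-g", genome]
    command += checkm_args(rank, taxon)
    command += ["-o", outdir]
    return command


print(check_genome_command("my_genome.fa", "my_output_dir"))
print(check_genome_command("pectobacterium_sp.fasta", "my_output_dir",
                           rank="genus", taxon="Pectobacterium"))
```

Raising an error when only one of rank and taxon is supplied is a design choice of this sketch, not documented ZGA behavior.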

Known issues and limitations

ZGA is under active development.

I hope to fix the following issues as soon as possible:

It's not possible to provide multiple read libraries, i.e. two sets of PE reads or two Nanopore runs.
There is no conda package.

Other known limitations:

Unicycler doesn't use mate-pair reads.
It's not possible to install all dependencies with Python 3.8 via conda; please use 3.7 or 3.6.

Don't hesitate to report bugs or request features!

Cite

It's a great pleasure to know that your software is useful. Please cite ZGA:

Korzhenkov A. (2020). ZGA: prokaryotic genome assembly and annotation pipeline.

And, of course, the tools it uses:

Andrews, S. (2010). FastQC: a quality control tool for high throughput sequence data.

Aronesty, E. (2013). Comparison of sequencing utility programs. The open bioinformatics journal, 7(1).

Bushnell, B., Rood, J., & Singer, E. (2017). BBMerge–accurate paired shotgun read merging via overlap. PloS one, 12(10).

Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., ... & Pyshkin, A. V. (2012). SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology, 19(5), 455-477.

Wick, R. R., Judd, L. M., Gorrie, C. L., & Holt, K. E. (2017). Unicycler: resolving bacterial genome assemblies from short and long sequencing reads. PLoS computational biology, 13(6), e1005595.

Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P., & Tyson, G. W. (2015). CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome research, 25(7), 1043-1055.

Tanizawa, Y., Fujisawa, T., & Nakamura, Y. (2018). DFAST: a flexible prokaryotic genome annotation pipeline for faster genome publication. Bioinformatics, 34(6), 1037-1039.

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of molecular biology, 215(3), 403-410.

Cock, P. J., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., ... & De Hoon, M. J. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11), 1422-1423.

O’Connell, J., et al. (2015). NxTrim: optimized trimming of Illumina mate pair reads. Bioinformatics, 31(12), 2035-2037.

Ondov, B. D., Treangen, T. J., Melsted, P., et al. (2016). Mash: fast genome and metagenome distance estimation using MinHash. Genome biology, 17, 132. doi: 10.1186/s13059-016-0997-x

Release history

0.1.0 (Feb 13, 2023)
0.1b3 pre-release (Dec 2, 2022)
0.1b2 pre-release (Dec 2, 2022)
0.1a2 pre-release (Jun 2, 2022)
0.1a1 pre-release (Jun 1, 2022)
0.0.9.post2 (Apr 1, 2021)
0.0.9 (Feb 20, 2021)
0.0.9a0 pre-release (Jan 31, 2021)
0.0.8 (Oct 26, 2020)
0.0.8b2 pre-release (Oct 24, 2020)
0.0.8b1 pre-release (Oct 9, 2020)
0.0.7 (Aug 8, 2020)
0.0.7b2 pre-release (Aug 4, 2020)
0.0.7b1 pre-release (Aug 4, 2020)
0.0.6 (Aug 3, 2020)
0.0.6b3 pre-release (Aug 1, 2020), this version
0.0.5.post1 (Jul 16, 2020)
0.0.5 (Jul 12, 2020)
0.0.4 (Jul 2, 2020)
0.0.3 (May 11, 2020)
0.0.2 (May 6, 2020)
0.0.1 (May 1, 2020)

Download files

Source Distribution

zga-0.0.6b3.tar.gz (12.4 kB)

Uploaded Aug 1, 2020 Source

Built Distributions

zga-0.0.6b3-py3.6.egg (14.9 kB)

Uploaded Aug 1, 2020 Source

zga-0.0.6b3-py3-none-any.whl (16.0 kB)

Uploaded Aug 1, 2020 Python 3

Hashes for zga-0.0.6b3.tar.gz
Algorithm	Hash digest
SHA256	`f9e9868352811dc0a93b865b4f12fbe64f87a0a95362f4d0ba318aa27c4893f4`
MD5	`3796c3a79b9d4ef3280ae17f0b5b16d9`
BLAKE2b-256	`175834e3b9baea6c03682c27813bf309631187bbdcd4955c173d224e09ccb76f`

Hashes for zga-0.0.6b3-py3.6.egg
Algorithm	Hash digest
SHA256	`7bbf2509d9a39c861060d73f5f61df787ffe22a3c821b0137d973e08854a9303`
MD5	`b137efdd65814d6d5d9dddb4a98a7e93`
BLAKE2b-256	`e59f6cca30a59939d1eb875f7cc2e895a80711ac3d8380e3465ab62b03b4a4b0`

Hashes for zga-0.0.6b3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`297b5952ab06cf28dc4c233880b084088e3b8b0cae4e2bf0c481477535148743`
MD5	`7a909fed24bc25f8c3db6b33957eea02`
BLAKE2b-256	`96dc25033803508bad6599ed50bf5becaf1967b1995b2c4d55fd5ebe309a34d1`
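
A downloaded file can be compared against the digests listed above. This is a minimal sketch using Python's standard hashlib module; the function names are hypothetical, the expected value is the SHA256 digest listed for zga-0.0.6b3.tar.gz, and the file is assumed to be in the current directory.

```python
import hashlib

# SHA256 digest listed above for zga-0.0.6b3.tar.gz.
EXPECTED_SHA256 = "f9e9868352811dc0a93b865b4f12fbe64f87a0a95362f4d0ba318aa27c4893f4"


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected one."""
    return sha256_of_file(path) == expected.lower()
```

For example, matches_expected("zga-0.0.6b3.tar.gz") should return True for an intact copy of the source distribution.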