

ZGA - prokaryotic genome assembly and annotation pipeline


Installation

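ZGA is written in Python and tested with Python 3.6 and Python 3.7. It relies on several external tools and libraries, including fastp, BBMap, NxTrim, SPAdes, Unicycler, Flye, racon, samtools, Mash, CheckM, DFAST, BLAST and Biopython.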

Install with conda

The simplest way to install ZGA and all its dependencies is with conda:

  1. Install conda, e.g. miniconda. Python 3.7 is preferred.

  2. After installation, add the channels (conda's software sources):
    conda config --add channels defaults
    conda config --add channels bioconda
    conda config --add channels conda-forge

  3. Finally, install ZGA into an existing active environment (Python 3.6 or 3.7):
    conda install -c laxeye zga
    or create a fresh environment and activate it:
    conda create -n zga -c laxeye zga
    conda activate zga


Installing dependencies

All dependencies may be installed using conda:

It's highly recommended to create a new conda environment:

conda create -n zga "python>=3.6" fastp "spades>=3.12" unicycler checkm-genome dfast bbmap blast biopython nxtrim "mash>=2" flye racon "samtools>=1.9"

and activate it

conda activate zga

Otherwise, you may install the dependencies into an existing conda environment:

conda install "python>=3.6" fastp "spades>=3.12" unicycler checkm-genome dfast bbmap blast biopython nxtrim "mash>=2" flye racon "samtools>=1.9"

Of course, it's possible to install the tools in other ways, or even to compile all of them from source. In that case, check that the binaries are on your '$PATH'.
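For a manual installation, the '$PATH' check can be scripted. The sketch below uses Python's standard shutil.which to report executables that cannot be found. The executable names in the list are assumptions based on how each tool is usually installed; they are not taken from ZGA itself, so adjust them to your setup.

```python
import shutil

# Executable names are assumptions based on how each tool is commonly
# installed; adjust them to match your system and ZGA version.
ASSUMED_EXECUTABLES = [
    "fastp", "spades.py", "unicycler", "checkm", "dfast", "bbmerge.sh",
    "blastn", "nxtrim", "mash", "flye", "racon", "samtools",
]


def find_missing(executables, path=None):
    """Return the names that shutil.which cannot locate on the search path."""
    return [name for name in executables
            if shutil.which(name, path=path) is None]


if __name__ == "__main__":
    missing = find_missing(ASSUMED_EXECUTABLES)
    if missing:
        print("Not found in $PATH:", ", ".join(missing))
    else:
        print("All listed executables were found.")
```

Run it inside the environment you intend to use for ZGA, because conda environments modify '$PATH' on activation.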

Install from PyPI

Run pip install zga. Biopython is the only dependency installed from PyPI. You should install all other dependencies manually or with conda, as described above. CheckM is available on PyPI, but it's easier to install it using conda.

Get source from GitHub

You can get ZGA by cloning from the repository with git clone https://github.com/laxeye/zga.git or by downloading an archive. After downloading enter the directory and run python3 setup.py build && python3 setup.py install.

Operating system requirements

ZGA was tested on Ubuntu 18.04 and 19.10. Most likely any modern 64-bit Linux distribution will work.

Your feedback on other OS is welcome!

Usage

Run zga -h to get a help message.

Examples:

Perform all steps (read QC, read trimming and merging, assembly, CheckM assessment with the default bacterial marker set, and DFAST annotation), using 4 CPU threads where possible:

zga -1 R1.fastq.gz -2 R2.fastq.gz --threads 4 -o my_assembly

Assemble an archaeal genome with SPAdes from paired-end and nanopore reads (CheckM will use archaeal markers), setting the memory limit to 16 GB:

zga -1 R1.fastq.gz -2 R2.fastq.gz --nanopore MiniION.fastq.gz -a spades --threads 4 --memory-limit 16 --domain archaea -o my_assembly

Assemble long reads with Flye, skipping long-read polishing, and perform short-read polishing with racon:

zga -1 R1.fastq.gz -2 R2.fastq.gz --nanopore MiniION.fastq.gz -a flye --threads 4 --domain archaea -o my_assembly --flye-short-polish --skip-flye-long-polish

Assemble from nanopore reads using Unicycler:

zga -a unicycler --nanopore MiniION.fastq -o nanopore_assembly

Perform assessment and annotation of a genome assembly with the 'Pectobacterium' CheckM marker set:

zga --first-step check_genome -g pectobacterium_sp.fasta --checkm_rank genus --checkm_taxon Pectobacterium -o my_output_dir

Let CheckM infer the right marker set:

zga --first-step check_genome -g my_genome.fa --checkm_mode lineage -o my_output_dir
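When ZGA is called from a script, for example to process many samples, the command line can be built programmatically. The sketch below uses only options that appear in the examples above. The function names build_zga_command and run_zga are hypothetical helpers, not part of ZGA.

```python
import shlex
import subprocess


def build_zga_command(outdir, r1=None, r2=None, nanopore=None,
                      assembler=None, threads=None, domain=None):
    """Build a zga argument list from options shown in the usage examples."""
    cmd = ["zga"]
    if r1 and r2:
        cmd += ["-1", r1, "-2", r2]
    if nanopore:
        cmd += ["--nanopore", nanopore]
    if assembler:
        cmd += ["-a", assembler]
    if threads:
        cmd += ["--threads", str(threads)]
    if domain:
        cmd += ["--domain", domain]
    cmd += ["-o", outdir]
    return cmd


def run_zga(cmd):
    """Run the command; check=True raises CalledProcessError on failure."""
    print("Running:", " ".join(shlex.quote(arg) for arg in cmd))
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only prints the command; pass the list to run_zga() to execute it.
    example = build_zga_command("my_assembly", r1="R1.fastq.gz",
                                r2="R2.fastq.gz", threads=4)
    print(" ".join(shlex.quote(arg) for arg in example))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with unusual file names.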

Known issues and limitations

ZGA is under active development.

I hope to fix the following issue ASAP:

  • It's not possible to provide multiple read libraries, e.g. two sets of PE reads or two nanopore runs.

Other limitations:

  • Unicycler doesn't use mate-pair reads.
  • It's not possible to install all dependencies with Python 3.8 via conda; please use 3.7 or 3.6.

Don't hesitate to report bugs or request features!
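Until multiple read libraries are supported, one common workaround for several nanopore runs is to concatenate them into a single file before running ZGA. This is a general practice, not something the ZGA documentation describes, so treat the sketch below as an assumption to verify on your own data. It relies on the fact that concatenated gzip files form a valid gzip file. The inputs must be either all gzipped or all uncompressed, and the file names in the usage note are hypothetical.

```python
import shutil


def concatenate_files(inputs, output):
    """Append the raw bytes of each input file to the output file, in order."""
    with open(output, "wb") as out_handle:
        for path in inputs:
            with open(path, "rb") as in_handle:
                shutil.copyfileobj(in_handle, out_handle)
    return output
```

For example, concatenate_files(["run1.fastq.gz", "run2.fastq.gz"], "all_runs.fastq.gz") produces one file that can be passed to --nanopore. Paired-end libraries with different insert sizes are a different matter, and merging them this way may not be appropriate.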

Cite

It's a great pleasure to know that your software is useful. Please cite ZGA:

Korzhenkov A. (2020). ZGA: prokaryotic genome assembly and annotation pipeline.

And, of course, the tools it uses:

Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884-i890. https://doi.org/10.1093/bioinformatics/bty560

Bushnell, B., Rood, J., & Singer, E. (2017). BBMerge–accurate paired shotgun read merging via overlap. PloS one, 12(10).

Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., ... & Pyshkin, A. V. (2012). SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology, 19(5), 455-477.

Wick, R. R., Judd, L. M., Gorrie, C. L., & Holt, K. E. (2017). Unicycler: resolving bacterial genome assemblies from short and long sequencing reads. PLoS computational biology, 13(6), e1005595.

Vaser, R., Sović, I., Nagarajan, N., & Šikić, M. (2017). Fast and accurate de novo genome assembly from long uncorrected reads. Genome research, 27(5), 737-746.

Kolmogorov, M., Yuan, J., Lin, Y., & Pevzner, P. A. (2019). Assembly of long, error-prone reads using repeat graphs. Nature biotechnology, 37(5), 540-546.

Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P., & Tyson, G. W. (2015). CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome research, 25(7), 1043-1055.

Tanizawa, Y., Fujisawa, T., & Nakamura, Y. (2018). DFAST: a flexible prokaryotic genome annotation pipeline for faster genome publication. Bioinformatics, 34(6), 1037-1039.

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of molecular biology, 215(3), 403-410.

Cock, P. J., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., ... & De Hoon, M. J. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11), 1422-1423.

O’Connell, J., et al. (2015). NxTrim: optimized trimming of Illumina mate pair reads. Bioinformatics, 31(12), 2035-2037.

Ondov, B. D., Treangen, T. J., Melsted, P., et al. (2016). Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17, 132. https://doi.org/10.1186/s13059-016-0997-x
