ReferenceSeeker: rapid determination of appropriate reference genomes.

Development Status
- 5 - Production/Stable
Environment
- Console
Intended Audience
- Science/Research
License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Natural Language
- English
Operating System
- POSIX :: Linux
Programming Language
- Python :: 3 :: Only
Topic
- Scientific/Engineering :: Bio-Informatics

Project description

ReferenceSeeker: rapid determination of appropriate reference genomes.

Description
Input & Output
Installation
Usage
Examples
Databases
Dependencies
Citation

Description

ReferenceSeeker determines closely related reference genomes from RefSeq (https://www.ncbi.nlm.nih.gov/refseq) following a scalable hierarchical approach: an ultra-fast kmer profile-based database lookup of candidate reference genomes is followed by the computation of specific average nucleotide identity (ANI) values for these candidates.

ReferenceSeeker computes kmer-based genome distances between a query genome and a database built on RefSeq genomes via Mash (Ondov et al. 2016). Only complete genomes or those stated as 'representative' or 'reference' genomes are included in the database. ReferenceSeeker offers pre-built databases for a broad spectrum of microbial taxonomic groups, i.e. bacteria, archaea, fungi, protozoa and viruses. For the resulting candidates, ReferenceSeeker subsequently computes ANI values and, by default, picks genomes meeting community standard thresholds (ANI >= 95 % & conserved DNA >= 69 %) (Goris, Konstantinidis et al. 2007), ranked by ANI and conserved DNA.
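The filter-and-rank step described above can be sketched in a few lines of Python. This is only an illustration, not ReferenceSeeker's actual code: the `Candidate` type, the `select_references` function and the genome names are hypothetical, values are expressed as fractions (as the command line options do), and the sort order (ANI first, then conserved DNA) is an assumption about what "ranked by ANI and conserved DNA" means.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical record for one candidate reference genome."""
    assembly_id: str
    ani: float            # average nucleotide identity as a fraction (0..1)
    conserved_dna: float  # conserved DNA as a fraction (0..1)


def select_references(candidates: List[Candidate],
                      min_ani: float = 0.95,
                      min_conserved_dna: float = 0.69) -> List[Candidate]:
    """Keep candidates meeting both thresholds, best first.

    Assumption: ranking is by ANI, with conserved DNA as tie-breaker.
    """
    passing = [c for c in candidates
               if c.ani >= min_ani and c.conserved_dna >= min_conserved_dna]
    return sorted(passing, key=lambda c: (c.ani, c.conserved_dna), reverse=True)


# Made-up values, for illustration only.
candidates = [
    Candidate("genome_A", 0.990, 0.80),
    Candidate("genome_B", 0.999, 0.65),  # fails the conserved DNA threshold
    Candidate("genome_C", 0.940, 0.90),  # fails the ANI threshold
    Candidate("genome_D", 0.990, 0.92),
]

for c in select_references(candidates):
    print(c.assembly_id, c.ani, c.conserved_dna)
```

With these made-up values only `genome_D` and `genome_A` pass both thresholds, in that order.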

The reasoning for subsequently calculating both ANI and conserved DNA values is that Mash distance values correlate well with ANI values for closely related genomes, but the same is not true for conserved DNA values. A kmer fingerprint-based comparison alone cannot distinguish whether a kmer is missing due to a SNP, for instance, or due to the absence of the subsequence comprising that kmer. As DNA conservation (next to DNA identity) is very important for many kinds of analyses, e.g. reference-based SNP detection, ranking potential reference genomes by Mash distance alone is often not sufficient to select the most appropriate reference genomes.
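A toy Python example may help to see why a kmer comparison cannot tell these two cases apart. It is a sketch under simplifying assumptions: it compares complete kmer sets instead of the MinHash sketches Mash actually uses, the kmer length `K`, the random sequence and all function names are hypothetical, and the distance uses the Mash formula D = -1/k * ln(2j / (1 + j)) from Ondov et al. 2016, where j is the Jaccard index.

```python
import math
import random

K = 11  # hypothetical kmer length chosen for this toy example


def kmers(seq, k=K):
    """Return the set of all kmers of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two kmer sets."""
    return len(a & b) / len(a | b)


def mash_distance(j, k=K):
    """Mash distance for a Jaccard index j (Ondov et al. 2016)."""
    if j <= 0.0:
        return 1.0
    if j >= 1.0:
        return 0.0
    return -1.0 / k * math.log(2.0 * j / (1.0 + j))


# Deterministic pseudo-random reference sequence of 200 bases.
rng = random.Random(0)
reference = "".join(rng.choices("ACGT", k=200))

# Case 1: a single SNP in the middle of the sequence.
swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
snp_query = reference[:100] + swap[reference[100]] + reference[101:]

# Case 2: the last K bases are absent from the query.
truncated_query = reference[:-K]

ref_kmers = kmers(reference)
for name, query in (("SNP", snp_query), ("missing segment", truncated_query)):
    query_kmers = kmers(query)
    missing = len(ref_kmers - query_kmers)
    j = jaccard(ref_kmers, query_kmers)
    print(f"{name}: {missing} reference kmers missing, "
          f"Jaccard = {j:.3f}, Mash D = {mash_distance(j):.5f}")
```

In both cases a handful of reference kmers simply do not occur in the query, although only the second query actually lacks part of the reference sequence. This is the kind of ambiguity the conserved DNA value is meant to resolve.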

Figure: Mash distance (D) vs. ANI / conserved DNA (conDNA)

Input & Output

Input:

Path to a taxon database and a draft or finished genome in fasta format:

$ referenceseeker ~/bacteria GCF_000013425.1.fna

Output:

Tab-separated lines written to STDOUT comprising the following columns:

RefSeq Assembly ID
ANI
Conserved DNA
Mash Distance
NCBI Taxonomy ID
Assembly Status
Organism

#ID	ANI	Con. DNA	Mash Distance	Taxonomy ID	Assembly Status	Organism
GCF_000013425.1	100.00	100.00	0.00000	93061	complete	Staphylococcus aureus subsp. aureus NCTC 8325
GCF_001900185.1	100.00	99.89	0.00002	46170	complete	Staphylococcus aureus subsp. aureus HG001
GCF_900475245.1	100.00	99.57	0.00004	93061	complete	Staphylococcus aureus subsp. aureus NCTC 8325 NCTC8325
GCF_001018725.2	100.00	99.28	0.00016	1280	complete	Staphylococcus aureus FDAARGOS_10
GCF_001018915.2	99.99	96.35	0.00056	1280	complete	Staphylococcus aureus NRS133
...
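For downstream scripting, the tab-separated output shown above can be read with a few lines of Python. This is a minimal sketch: the function name `parse_results` and the dictionary keys are hypothetical choices, and the sample text reuses the first two result lines from the example above.

```python
COLUMNS = ("id", "ani", "conserved_dna", "mash_distance",
           "taxonomy_id", "assembly_status", "organism")


def parse_results(text):
    """Parse ReferenceSeeker's tab-separated output into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.rstrip("\n").split("\t")
        # Skip the '#ID ...' header line, blank lines and anything malformed.
        if line.startswith("#") or len(fields) != len(COLUMNS):
            continue
        row = dict(zip(COLUMNS, fields))
        row["ani"] = float(row["ani"])
        row["conserved_dna"] = float(row["conserved_dna"])
        row["mash_distance"] = float(row["mash_distance"])
        rows.append(row)
    return rows


sample = (
    "#ID\tANI\tCon. DNA\tMash Distance\tTaxonomy ID\tAssembly Status\tOrganism\n"
    "GCF_000013425.1\t100.00\t100.00\t0.00000\t93061\tcomplete\t"
    "Staphylococcus aureus subsp. aureus NCTC 8325\n"
    "GCF_001900185.1\t100.00\t99.89\t0.00002\t46170\tcomplete\t"
    "Staphylococcus aureus subsp. aureus HG001\n"
)

for row in parse_results(sample):
    print(row["id"], row["ani"], row["conserved_dna"], row["organism"])
```

Note that ANI and conserved DNA appear as percentages in the output, whereas the `--ani` and `--conserved-dna` options take fractions.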

Installation

ReferenceSeeker can be installed and used in 2 different ways.

In either case, a taxon database must be downloaded. We provide pre-built databases for download at Zenodo; for more information, scroll to Databases.

Conda:

The preferred way to install and run ReferenceSeeker is BioConda:

$ conda install -c conda-forge -c bioconda -c defaults referenceseeker
$ referenceseeker --help

GitHub

Alternatively, you can use this raw GitHub repository:

install the necessary Python dependencies (if not already installed)
clone the latest version of the repository
download and extract a database

Example:

$ pip3 install --user biopython
$ git clone https://github.com/oschwengers/referenceseeker.git
$ ./referenceseeker/bin/referenceseeker --help

Usage

Usage:

usage: referenceseeker [-h] [--crg CRG] [--ani ANI]
                       [--conserved-dna CONSERVED_DNA] [--unfiltered]
                       [--verbose] [--threads THREADS] [--version]
                       <database> <genome>

Rapid determination of appropriate reference genomes.

positional arguments:
  <database>            ReferenceSeeker database path
  <genome>              target draft genome in fasta format

optional arguments:
  -h, --help            show this help message and exit
  --crg CRG, -r CRG     max number of candidate reference genomes to pass kmer
                        prefilter (default = 100)
  --ani ANI, -a ANI     ANI threshold value (default = 0.95)
  --conserved-dna CONSERVED_DNA, -c CONSERVED_DNA
                        Conserved DNA threshold value (default = 0.69)
  --unfiltered, -u      set kmer prefilter to extremely conservative values
                        and skip species level ANI cutoffs (ANI >= 0.95 and
                        conserved DNA >= 0.69
  --verbose, -v         print verbose information
  --threads THREADS, -t THREADS
                        number of threads to use (default = number of
                        available CPUs)
  --version, -V         show program's version number and exit

Examples

Simple:

$ # referenceseeker <REFERENCE_SEEKER_DB> <GENOME>
$ referenceseeker bacteria/ genome.fasta

Expert: verbose output and increased output of candidate reference genomes using a defined number of threads:

$ # referenceseeker --crg 500 --verbose --threads 8 <REFERENCE_SEEKER_DB> <GENOME>
$ referenceseeker --crg 500 --verbose --threads 8 bacteria/ genome.fasta
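When ReferenceSeeker is called from a Python pipeline, the same invocations can be assembled with `subprocess`. The sketch below is not part of ReferenceSeeker; the helper names `build_command` and `run_referenceseeker` are hypothetical, only options listed in the usage text above are used, and `capture_output` requires Python 3.7 or newer.

```python
import subprocess


def build_command(database, genome, crg=None, threads=None, verbose=False):
    """Assemble the argument list for a referenceseeker call."""
    cmd = ["referenceseeker"]
    if crg is not None:
        cmd += ["--crg", str(crg)]
    if verbose:
        cmd.append("--verbose")
    if threads is not None:
        cmd += ["--threads", str(threads)]
    cmd += [database, genome]
    return cmd


def run_referenceseeker(database, genome, **options):
    """Run referenceseeker and return its STDOUT (the tab-separated table).

    Raises subprocess.CalledProcessError if the tool exits with an error.
    """
    result = subprocess.run(build_command(database, genome, **options),
                            check=True, capture_output=True, text=True)
    return result.stdout


# Only print the command here; call run_referenceseeker() to execute it.
print(" ".join(build_command("bacteria/", "genome.fasta",
                             crg=500, threads=8, verbose=True)))
```

Passing the arguments as a list avoids shell quoting issues with database or genome paths that contain spaces.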

Databases

ReferenceSeeker depends on custom databases based on reference, representative as well as complete NCBI RefSeq genomes comprising kmer hash profiles and taxonomic information. We provide the following pre-built databases based on RefSeq 2019-07-02 via Zenodo:

Taxon	URL	# Genomes	Size Zipped	Size Unzipped
bacteria	https://zenodo.org/record/3562005/files/bacteria.tar.gz?download=1	18,229	22 Gb	71 Gb
archaea	https://zenodo.org/record/3562005/files/archaea.tar.gz?download=1	417	364 Mb	1.2 Gb
fungi	https://zenodo.org/record/3562005/files/fungi.tar.gz?download=1	288	2.6 Gb	8 Gb
protozoa	https://zenodo.org/record/3562005/files/protozoa.tar.gz?download=1	88	1 Gb	3.4 Gb
viral	https://zenodo.org/record/3562005/files/viral.tar.gz?download=1	9,264	608 Mb	835 Mb
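Downloading and unpacking one of these archives can be scripted, for example in Python. This is a sketch under assumptions: the helper names are hypothetical, the URL pattern is simply the one visible in the table above, and the returned path assumes that each archive unpacks into a directory named after its taxon. Keep the unzipped sizes listed above in mind before fetching the bacteria database.

```python
import tarfile
import urllib.request
from pathlib import Path

ZENODO_BASE = "https://zenodo.org/record/3562005/files"
TAXA = ("bacteria", "archaea", "fungi", "protozoa", "viral")


def database_url(taxon):
    """Return the download URL for a pre-built database (see table above)."""
    if taxon not in TAXA:
        raise ValueError("unknown taxon: {}".format(taxon))
    return "{}/{}.tar.gz?download=1".format(ZENODO_BASE, taxon)


def fetch_database(taxon, target_dir="."):
    """Download and extract a database archive into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    archive = target / "{}.tar.gz".format(taxon)
    urllib.request.urlretrieve(database_url(taxon), str(archive))
    with tarfile.open(str(archive), "r:gz") as tar:
        tar.extractall(str(target))
    # Assumption: the archive contains a top-level directory named <taxon>.
    return target / taxon


# Only print the URL here; call fetch_database("archaea") to download.
print(database_url("archaea"))
```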

Updated database versions reflecting the latest RefSeq release can be built with a shell script and a Nextflow pipeline.

Download and install Nextflow:

$ curl -fsSL get.nextflow.io | bash

Build database:

$ export REFERENCE_SEEKER_HOME=<REFERENCE_SEEKER_DIR>
$ sh $REFERENCE_SEEKER_HOME/build-db.sh <DB_TYPE_OPTION>

List of available database options:

$ sh build-db.sh
	-b (bacteria)
	-a (archaea)
	-v (viral)
	-f (fungi)
	-p (protozoa)

Dependencies

ReferenceSeeker needs the following dependencies:

Python (3.5.2), Biopython (1.71)
Mash (2.2) https://github.com/marbl/Mash
MUMmer (4.0.0-beta2) https://github.com/gmarcais/mummer

ReferenceSeeker has been tested against the aforementioned versions.
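Before a first run it can be useful to check that these dependencies are reachable. The following Python sketch is not part of ReferenceSeeker. It assumes the executable names `mash` (Mash) as well as `nucmer` and `delta-filter` (both shipped with MUMmer); which of them ReferenceSeeker actually invokes is an assumption here, and the helper names are hypothetical. Versions are not checked.

```python
import importlib.util
import shutil

# Assumed executable names; adjust if your installation differs.
REQUIRED_TOOLS = ("mash", "nucmer", "delta-filter")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the executables from tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def biopython_available():
    """Return True if the Biopython package ('Bio') can be imported."""
    return importlib.util.find_spec("Bio") is not None


missing = missing_tools()
if missing:
    print("Missing executables:", ", ".join(missing))
if not biopython_available():
    print("Biopython is not installed")
```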

Citation

ReferenceSeeker: rapid determination of appropriate reference genomes. Oliver Schwengers, Torsten Hain, Trinad Chakraborty, Alexander Goesmann. bioRxiv 863621; doi: https://doi.org/10.1101/863621

Release history

Version	Release date
1.8.0	Jan 14, 2022
1.7.3	Apr 27, 2021
1.7.1	Apr 7, 2021
1.6.4	Jan 29, 2021
1.6.3	Apr 17, 2020
1.6.2	Apr 15, 2020
1.6.1	Apr 14, 2020
1.6	Feb 4, 2020
1.5	Jan 31, 2020
1.4	Dec 19, 2019
1.3.0	Dec 17, 2019 (this version)
1.2.0	Dec 17, 2019

Download files

Source Distribution

referenceseeker-1.3.0.tar.gz (9.4 kB), uploaded Dec 17, 2019

Built Distribution

referenceseeker-1.3.0-py3-none-any.whl (23.3 kB), uploaded Dec 17, 2019, Python 3

Hashes for referenceseeker-1.3.0.tar.gz
Algorithm	Hash digest
SHA256	`ca1755f74783eb65d9bb4aae20d933341fbafeaa3f6193d02b042d113859cb79`
MD5	`10535dd1d07d48cba958cb44f48839a6`
BLAKE2b-256	`908010f08250a236c93266b27839ebf318a1ab1b2e137447614a6584b1b932d8`

Hashes for referenceseeker-1.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`21b7e0ecd21a376c33c0b8e038b33771c51247f0f314620a5e5bc2e6920296e0`
MD5	`1f00d6b7f748f8a0c0ca32229849cec5`
BLAKE2b-256	`56ec8a6dae875536bf8ff5ad779a45daebea4f632f9dfe753f577a8da601a9a5`