Calculate pairwise allelic distances from cgMLST profiles; implements chewBBACA

Development Status
- 4 - Beta
Intended Audience
- Science/Research
License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Operating System
- OS Independent
Programming Language
- Python :: 3.7
Topic
- Scientific/Engineering :: Bio-Informatics

Project description

COREugate - A pipeline for cgMLST

From contigs to cgMLST profile and single-linkage clustering (SLC).

COREugate has had a small facelift!! Under the hood we are now using Nextflow as our pipeline engine and have introduced some additional functionality for clustering the profiles.

PrepSchema (if necessary) and call alleles using chewBBACA.
Combine profiles and statistics for the whole dataset.
Calculate pairwise allelic distances (missing data is ignored).
Perform SLC to group related profiles, based on user-supplied thresholds.
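
The last two steps can be sketched in a few lines of Python. This is only an illustration, not COREugate's own code: it assumes SLC means single-linkage clustering, that any non-numeric call (for example `LNF`) counts as missing, that `INF-` prefixed calls are valid alleles, and that two profiles are linked when their distance is less than or equal to the threshold. The function names are hypothetical.

```python
from itertools import combinations


def parse_call(call):
    """Return the allele number, or None if the call is treated as missing."""
    call = call.strip()
    if call.startswith("INF-"):  # assumption: inferred alleles are valid calls
        call = call[4:]
    return int(call) if call.isdigit() else None


def allelic_distance(a, b):
    """Count loci where both profiles have a call and the calls differ."""
    return sum(
        1 for x, y in zip(a, b)
        if x is not None and y is not None and x != y
    )


def single_linkage(profiles, threshold):
    """Group isolates into connected components of the 'distance <= threshold' graph."""
    parent = {name: name for name in profiles}

    def find(name):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(profiles, 2):
        if allelic_distance(profiles[a], profiles[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in profiles:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


profiles = {
    "A": [parse_call(c) for c in ["1", "2", "3", "LNF"]],
    "B": [parse_call(c) for c in ["1", "2", "INF-4", "5"]],
    "C": [parse_call(c) for c in ["9", "9", "9", "9"]],
}
print(single_linkage(profiles, threshold=1))
```

Here A and B differ at one locus (the missing call in A is skipped), so they fall into the same cluster at a threshold of 1, while C stays on its own.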

Dependencies

Python >=3.7
Biopython >=1.70
Nextflow >=20.10
chewBBACA >=2.6

Nextflow

Ensure that you have Nextflow installed. Detailed instructions can be found here.

chewBBACA

chewBBACA is used here to prepare the schema (by selecting exemplar alleles for comparison) and to call allele profiles. More information about chewBBACA and how it works can be found here. COREugate can use a Singularity version of chewBBACA, however if you want to install the latest version (>=2.0.16)

Run COREugate

Get COREugate

pip3 install git+https://github.com/kristyhoran/Coreugate

If you are installing COREugate on a server using --user, please ensure that your ~/.local/bin is part of your PATH

export PATH=$PATH:/path/to/.local/bin

Running COREugate

coreugate [-h] [-v] [--input_file INPUT_FILE]
                 [--schema_path SCHEMA_PATH]
                 [--prodigal_training PRODIGAL_TRAINING] [--workdir WORKDIR]
                 [--threads THREADS]
                 [--filter_samples_threshold FILTER_SAMPLES_THRESHOLD]
                 [--cluster] [--cluster_thresholds CLUSTER_THRESHOLDS]
                 [--force] [--report]

Coreugate - a cgMLST pipeline implementing chewBACCA

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --input_file INPUT_FILE, -i INPUT_FILE
                        Input file tab-delimited file3 columns isolate_id
                        path_to_input_file (contigs) (default: )
  --schema_path SCHEMA_PATH, -s SCHEMA_PATH
                        Path to species schema/allele db (or url if using
                        chewie Nomenclature server) (default: )
  --prodigal_training PRODIGAL_TRAINING, -p PRODIGAL_TRAINING
                        Prodigal file to be used in allele calling. See https:
                        //github.com/B-UMMI/chewBBACA/tree/master/CHEWBBACA/pr
                        odigal_training_files for options (default: )
  --workdir WORKDIR, -w WORKDIR
                        Working directory, default is current directory
                        (default: /home/khhor/validation/salmonella_typing/rev
                        erification_20210322)
  --threads THREADS, -t THREADS
                        Number of threads to run chewBACCA (default: 16)
  --filter_samples_threshold FILTER_SAMPLES_THRESHOLD, -ft FILTER_SAMPLES_THRESHOLD
                        The proportion of loci present in a sample for an
                        sample to be included in further analysis (0-1)
                        (default: 0.95)
  --cluster, -c         If you would like to cluster the pairwise distance
                        matrix. If selected you must provide a list of
                        thresholds. (default: False)
  --cluster_thresholds CLUSTER_THRESHOLDS, -ct CLUSTER_THRESHOLDS
                        Provide a comma separate list (NO SPACES) eg 20,40,200
                        (default: )
  --force, -f           If you want to force chewBBACA to re-run. (default:
                        False)
  --report              Save nextflow reports. (default: False)
                                 Display this help message
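
To make the `--filter_samples_threshold` and `--cluster_thresholds` options more concrete, here is a rough sketch of what they describe. It is not taken from the COREugate source: the function names are hypothetical, missing calls are represented as `None`, and a sample is assumed to be kept when its proportion of called loci is greater than or equal to the threshold.

```python
def parse_thresholds(text):
    """Turn a comma-separated string such as '20,40,200' into a list of ints."""
    return [int(value) for value in text.split(",")]


def proportion_called(profile):
    """Fraction of loci in a profile that have an allele call (not None)."""
    if not profile:
        return 0.0
    return sum(call is not None for call in profile) / len(profile)


def filter_samples(profiles, threshold=0.95):
    """Keep samples whose proportion of called loci meets the threshold."""
    return {
        name: profile
        for name, profile in profiles.items()
        if proportion_called(profile) >= threshold
    }


profiles = {
    "good": [1, 2, 3, 4],
    "poor": [1, None, None, 4],
}
print(parse_thresholds("20,40,200"))
print(sorted(filter_samples(profiles, threshold=0.75)))
```

With a threshold of 0.75, the sample with only half of its loci called is dropped before distances are calculated.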

Sample data

Assemblies

isolate_name	path/to/assembly.fa
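
If your assemblies sit in one directory, a small helper like the one below could build this file. It is a sketch under assumptions: the function name is hypothetical, the isolate name is taken from the file name without its extension, the `.fa` suffix is a guess at your naming, and the two-column layout follows the example line above.

```python
import csv
from pathlib import Path


def write_input_file(assembly_dir, out_path, suffix=".fa"):
    """Write a tab-delimited file of isolate name and absolute assembly path."""
    rows = [
        (path.stem, str(path.resolve()))
        for path in sorted(Path(assembly_dir).glob(f"*{suffix}"))
    ]
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
    return rows
```

For example, `write_input_file("assemblies", "input.tab")` would list every `.fa` file in a hypothetical `assemblies` directory, and the resulting file could then be passed to `--input_file`.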

Species cgMLST schema

COREugate requires an existing cgMLST schema. This can be a schema generated by the user or downloaded from one of the publicly available databases. The schema should consist of one fasta file per locus, with each file containing the different alleles for that locus. It should be noted that during allele calling, chewBBACA (implemented by COREugate) will add inferred alleles (more information) to your schema, so it is recommended that the schema path be fixed, that is, that the schema is kept in a central location and a single version is used for each species/study.
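
Because the schema can grow as inferred alleles are added, it may be useful to record how many alleles each locus holds before and after a run. The sketch below only counts fasta header lines per file; the function name and the `.fasta` suffix are assumptions, so adjust them to match your schema.

```python
from pathlib import Path


def count_alleles(schema_dir, suffix=".fasta"):
    """Return {locus_name: number_of_alleles} for every fasta file in the schema."""
    counts = {}
    for fasta in sorted(Path(schema_dir).glob(f"*{suffix}")):
        with open(fasta) as handle:
            # Each allele is one fasta record, so count the header lines.
            counts[fasta.stem] = sum(1 for line in handle if line.startswith(">"))
    return counts
```

Comparing the result of `count_alleles` taken before and after allele calling shows which loci gained alleles during the run.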

Other optional arguments

prodigal_training: a Prodigal training file for allele calling, recommended by the chewBBACA developers. A list of default training files and further information can be found here.

Limitations of the pipeline

COREugate can only derive profiles for isolates using pre-existing schemas that have been prepared as described above.
Possibly more, I just haven't found them yet!!

Release history

- 2.0.4 (this version): May 12, 2021
- 2.0.2: May 5, 2021
- 2.0.1: May 4, 2021
- 2.0.0: May 3, 2021

Download files

Source Distribution

coreugate-2.0.4.tar.gz (1.9 MB)

Uploaded May 12, 2021 Source

Built Distribution

coreugate-2.0.4-py3-none-any.whl (10.4 kB)

Uploaded May 12, 2021 Python 3

Hashes for coreugate-2.0.4.tar.gz
Algorithm	Hash digest
SHA256	`0c0df45eb7a21011bfd62b93ad7f786bf93dcdb40c3cd14f455d95045df493f4`
MD5	`e43a0f55fb818fa47cc0010ab7ce2319`
BLAKE2b-256	`95f2cf1adf80d6418c64e6a62e0a85522ccd1888806ec48c99be2116bd661c4b`

Hashes for coreugate-2.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f792b8d0ba9d55b54fc218f2ddbba7a6312b13b71deca32376092509d7730e3a`
MD5	`95ed102ff33a0e568b6d18487cd132d6`
BLAKE2b-256	`fecb15d1f5b41477f76a0ca0b8fe7023f5f994fe65d257426ac10ea65fb739ad`