PyMLST

A Python Mlst Local Search Tool.

Purpose

Bacterial typing is an important public health task in hospitals. Next-generation sequencing is increasingly used to analyse and monitor epidemic strains. For this purpose, core or whole genome Multilocus Sequence Typing (cgMLST / wgMLST) has become the new standard. It extends the traditional MLST method, which relies on a short list of housekeeping genes, to a large set of genes covering the core or the whole genome. As in MLST, each unique sequence corresponds to a specific allele, and the combination of alleles determines the sequence type (ST) of the strain.
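The relationship between sequences, alleles and sequence types can be illustrated with a minimal Python sketch. The gene names, sequences, allele numbers and ST table below are invented for illustration and do not come from any real scheme; this is not how pyMLST stores its data.

```python
# Toy allele tables: for each gene, every unique sequence has an allele number.
# All values here are hypothetical.
ALLELES = {
    "geneA": {"ATGAAA": 1, "ATGAAG": 2},
    "geneB": {"ATGCCC": 1, "ATGCCT": 2},
}

# Toy profile table: a combination of allele numbers defines a sequence type.
PROFILES = {(1, 1): 1, (1, 2): 2, (2, 1): 3}


def sequence_type(strain_sequences):
    """Return (profile, ST) for a strain given one sequence per gene.

    An unknown sequence yields None in the profile, and an unknown
    profile yields None as ST.
    """
    genes = sorted(ALLELES)
    profile = tuple(ALLELES[g].get(strain_sequences.get(g)) for g in genes)
    return profile, PROFILES.get(profile)


profile, st = sequence_type({"geneA": "ATGAAG", "geneB": "ATGCCC"})
print(profile, st)  # (2, 1) 3
```

In cgMLST and wgMLST the principle is the same, but the profile spans a much larger set of genes.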

pyMLST has been developed to perform this task. Unlike other tools, it uses a local SQLite database to store allele sequences and MLST profiles, which makes it possible to enlarge the collection of strains to compare iteratively. The input is either a draft genome produced by an assembler or the raw reads directly; genomes stored in sequence databases can also be used.

New version

V2.1:

  • Use raw reads (FASTQ) directly through the kma integration (search2 and add2 commands).

V2.0:

  • An automatic database import mechanism to initialise cgMLST and MLST databases.
  • A new process to fill incomplete genes using MAFFT alignment.
  • A more complete command-line interface with a sub-command system.
  • A configuration file to define the paths to external tools.
  • An easy installation from the PyPI repository.

Installation

From the PyPI repository

pip install pymlst

From source

virtualenv venv
source venv/bin/activate
make install
make build

Dependencies

PyMLST uses 3 external tools to run alignments:

  • Mafft (>=7.307)
sudo apt install mafft 

Configuration

Configure the executable locations and the log level:

pyMLST configure --help
Usage: pyMLST configure [OPTIONS]

  Configure executables paths and log level.

Options:
  -b, --blat FILE   Blat executable absolute path.
  -k, --kma FILE    Kma executable absolute path.
  -m, --mafft FILE  Mafft executable absolute path.
  -l, --log [DEBUG|INFO|WARNING|ERROR]
                    Level of logging, default=INFO  
  -r, --reset       Reset the configuration.
  --help            Show this message and exit.

cg/wgMLST Analysis

A complete wgMLST analysis is performed by running a succession of commands.

wgMLST --help
Usage: wgMLST [OPTIONS] COMMAND [ARGS]...

  Whole/Core genome MLST analysis.

Options:
  -v, --version  Prints PyMLST version.

Commands:
  add            Add a strain GENOME to the wgMLST DATABASE.
  create         Create a wgMLST DATABASE from a template COREGENE.
  distance       Extract an distance matrix from a wgMLST DATABASE.
  gene           Extract an genes list from a wgMLST DATABASE.
  import         Create a wgMLST DATABASE from an online resource.
  mlst           Extract an MLST table from a wgMLST DATABASE.
  msa            Compute Multiple Sequence Alignment from a wgMLST...
  recombinaison  Search potential gene re-combinations from wgMLST...
  remove         Remove STRAINS or GENES from a wgMLST DATABASE.
  sequence       Extract sequences from a wgMLST DATABASE.
  stats          Extract stats from a wgMLST DATABASE.
  strain         Extract an strains list from a wgMLST DATABASE.
  subgraph       Search group of strains at a DISTANCE threshold.

Import or Create a database

First, you need to create a database containing the scheme to use.

wgMLST import -h
Usage: wgMLST import [OPTIONS] DATABASE [SPECIES]...

  Create a wgMLST DATABASE from an online resource.

  The research can be filtered by adding a SPECIES name.

Options:
  -f, --force             Override alrealdy existing DATABASE
  --prompt / --no-prompt  Do not prompt if multiple choices are found,
                          fail instead.
  • Alternatively, you can create one from a scheme. The scheme is a multi-FASTA file containing the gene sequences in nucleotide format. You can obtain a scheme for:
    • Core genome analysis, from publications.
    • Whole genome analysis, by using the annotated genes of a published genome close to your strains.
wgMLST create --help
Usage: wgMLST create [OPTIONS] COREGENE DATABASE

  Create a wgMLST DATABASE from a template COREGENE.

Options:
  -f, --force        Override alrealdy existing DATABASE
  -c, --concatenate  Automatically concatenate genes with duplicated sequences
  -r, --remove       Automatically remove genes with duplicated sequences

Add strains

Next, add your strains iteratively to the database. A draft genome can be used (we recommend SPAdes for assembly). You can also add reference genomes for comparison.

wgMLST add --help
Usage: wgMLST add [OPTIONS] GENOME DATABASE

  Add a strain GENOME to the wgMLST DATABASE.

Options:
  -s, --strain TEXT     Name of the strain (default:genome name)
  -i, --identity FLOAT  Minimum identity to search gene (default=0.95)
  -c, --coverage FLOAT  Minimum coverage to search gene (default=0.9)
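Adding many strains one after another can be scripted. The following Python sketch builds one `wgMLST add` call per assembly, using only the argument order and the `-s` option shown in the help above. The directory layout, the `.fasta` extension and the database file name are assumptions made for the example.

```python
import subprocess
from pathlib import Path


def build_add_command(genome, database, strain=None):
    """Build the argument list for: wgMLST add [-s STRAIN] GENOME DATABASE."""
    cmd = ["wgMLST", "add"]
    if strain is not None:
        cmd += ["-s", strain]
    cmd += [str(genome), str(database)]
    return cmd


def add_all(genome_dir, database, dry_run=True):
    """Add every *.fasta file of genome_dir to the database, one call per file.

    The strain name is taken from the file name without its extension.
    With dry_run=True the commands are only printed, not executed.
    """
    commands = [
        build_add_command(path, database, strain=path.stem)
        for path in sorted(Path(genome_dir).glob("*.fasta"))
    ]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands


# Hypothetical usage:
# add_all("assemblies", "my_scheme.db", dry_run=False)
```

Running the commands sequentially keeps a single writer on the database file at any time.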

Alternatively, you can add a strain directly from raw reads, using single or paired FASTQ(.gz) files.

wgMLST add2 --help
Usage: wgMLST add2 [OPTIONS] DATABASE [FASTQS]...

  Add a strain from FASTQS(.gz) reads to the wgMLST DATABASE.

Options:
  -s, --strain TEXT     Name of the strain (default:genome name).
  -i, --identity FLOAT  Minimum identity to search gene (default=0.95).
  -c, --coverage FLOAT  Minimum coverage to search gene (default=0.9).
  -r, --reads INTEGER   Minimum reads coverage to search a gene (default=10).

Export results

When the database is complete, you can export results for further analysis.

  • Distance: A matrix of cgMLST distances can be computed from the database. The distance is defined as the number of different alleles between each pair of strains, omitting missing data. The genes used to compute this distance can be filtered with the options -m, -k, -d and -V.
wgMLST distance --help
Usage: wgMLST distance [OPTIONS] DATABASE

  Extract an distance matrix from a wgMLST DATABASE.
Options:
  -m, --mincover INTEGER  Minimun number of strain found to keep a gene
                          (default:0)

  -k, --keep              Keep only gene with different allele (omit
                          missing).

  -d, --duplicate         Conserve duplicate gene (default remove).
  -V, --inverse           Keep only gene that do not meet the filter of
                          mincover or keep options.

  -o, --output FILENAME   Export distance to (default=stdout).
  • MLST: The MLST profiles can also be exported. A format compatible with GrapeTree can be selected with the -f option.
wgMLST mlst --help
Usage: wgMLST mlst [OPTIONS] DATABASE

  Extract an MLST table from a wgMLST DATABASE.
Options:
  ...
  -f, --form [default|grapetree]  Specify format of output
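The distance described in the Distance item above (number of different alleles between two strains, omitting missing data) can be sketched in a few lines of Python. This is an illustration of the definition only, not pyMLST's implementation; the strain names and profiles are invented, and representing a missing gene as None is an assumption of the example.

```python
from itertools import combinations


def allele_distance(profile_a, profile_b):
    """Count genes whose alleles differ, skipping genes missing (None) in either strain."""
    return sum(
        1
        for a, b in zip(profile_a, profile_b)
        if a is not None and b is not None and a != b
    )


def distance_matrix(profiles):
    """Return a symmetric nested dict of pairwise distances between strains."""
    names = sorted(profiles)
    matrix = {n: {m: 0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d = allele_distance(profiles[n], profiles[m])
        matrix[n][m] = matrix[m][n] = d
    return matrix


# Hypothetical profiles: one allele number per gene, None when the gene is missing.
profiles = {
    "strainA": [1, 1, 2, None],
    "strainB": [1, 2, 2, 3],
    "strainC": [2, 2, None, 3],
}
matrix = distance_matrix(profiles)
print(matrix["strainA"]["strainB"])  # 1
print(matrix["strainA"]["strainC"])  # 2
```

Because missing genes are skipped rather than counted, two strains with many missing genes can appear closer than they are, which is one reason to filter genes with the mincover option.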

Export sequences

You can access the allele sequences present in the database and specify a list of genes to export with the -f option (the gene list can be obtained with the gene command).

  • Sequence: A simple export of the different sequences
wgMLST sequence -h
Usage: wgMLST sequence [OPTIONS] DATABASE

  Extract sequences from a wgMLST DATABASE.

Options:
  -o, --output FILENAME  Output result in fasta format (default:stdout).
  -f, --file FILENAME    File containing list of coregenes to extract
                         (default:all coregenes).
  --reference            Return sequence of the reference instead of strains
                         alleles
  • MSA: A multiple-alignment FASTA file with concatenated genes. The file can be used directly for phylogenetic analysis with maximum likelihood or Bayesian approaches.
wgMLST msa -h
Usage: wgMLST msa [OPTIONS] DATABASE

  Compute Multiple Sequence Alignment from a wgMLST DATABASE.

Options:
  ...
  -r, --realign          Realign genes with same length (Default:No).
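The idea of a concatenated multiple alignment can be sketched as follows in Python. This only illustrates the concept; it is not pyMLST's implementation. Padding a strain that lacks a gene with gap characters is an assumption of the example, and the strain names and sequences are invented.

```python
def concatenate_alignments(gene_alignments):
    """Concatenate per-gene alignments into one sequence per strain.

    gene_alignments maps gene name -> {strain name: aligned sequence}.
    Within a gene, all aligned sequences are expected to have the same length.
    A strain missing from a gene is padded with '-' (assumption of this sketch).
    """
    strains = sorted({s for aln in gene_alignments.values() for s in aln})
    concatenated = {s: "" for s in strains}
    for gene in sorted(gene_alignments):
        aln = gene_alignments[gene]
        length = len(next(iter(aln.values())))
        for s in strains:
            concatenated[s] += aln.get(s, "-" * length)
    return concatenated


def to_fasta(sequences):
    """Format a {name: sequence} dict as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in sorted(sequences.items()))


# Hypothetical aligned genes for two strains.
alignments = {
    "geneA": {"s1": "ATG-", "s2": "ATGC"},
    "geneB": {"s1": "CC"},
}
result = concatenate_alignments(alignments)
print(to_fasta(result), end="")
```

Every strain ends up with a sequence of the same total length, which is what phylogenetic tools expect from an alignment.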

Classical MLST Analysis

pyMLST can also perform classical MLST searches and return the allele numbers and the sequence type (ST).

claMLST --help
Usage: claMLST [OPTIONS] COMMAND [ARGS]...

  Classical MLST commands.

Options:
  -v, --version  Prints PyMLST version.
  --help         Show this message and exit.

Commands:
  create  Create a classical MLST DATABASE from a SCHEME csv and ALLELES...
  import  Create a claMLST DATABASE from an online resource.
  search  Search ST number for assembly GENOMES using an mlst DATABASE

Initialise an MLST database

claMLST import -h
Usage: claMLST import [OPTIONS] DATABASE [SPECIES]...

  Create a claMLST DATABASE from an online resource.

  The research can be filtered by adding a SPECIES name.

Options:
  --prompt / --no-prompt  Do not prompt if multiple choices are found,
                          fail instead.
  -f, --force        	  Override alrealdy existing DATABASE
  -m, --mlst TEXT         Specify the desired MLST scheme name.

  • Create: Alternatively, you can create a database from the allele sequences and the MLST profile of your species of interest.

To create the database, pyMLST requires each gene name in the header of the MLST profile to match the name of the corresponding FASTA file. For example, the rpoB gene in the profile header must match the rpoB.fas file. You also need to remove any additional column corresponding to the clonal complex from the MLST profile file, if present.
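Before running the create command, the naming requirement can be checked with a short Python sketch such as the one below. It is a stand-alone helper, not part of pyMLST. The name of the ST column ("ST") and the example gene and file names other than rpoB are assumptions made for illustration.

```python
from pathlib import Path


def unmatched_genes(header, fasta_names, ignored=("ST",)):
    """Return the profile header columns that have no FASTA file of the same name.

    header      : column names of the MLST profile file
    fasta_names : allele file names, e.g. "rpoB.fas"
    ignored     : non-gene columns to skip (assumed here to be just "ST")
    """
    stems = {Path(name).stem for name in fasta_names}
    return [col for col in header if col not in ignored and col not in stems]


# Hypothetical header and allele files.
header = ["ST", "rpoB", "gyrB", "clonal_complex"]
files = ["rpoB.fas", "gyrB.fas"]
print(unmatched_genes(header, files))  # ['clonal_complex']
```

Any column reported by this check is either a gene without an allele file or an extra column, such as the clonal complex, that should be removed from the profile file.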

claMLST create --help
Usage: claMLST create [OPTIONS] DATABASE SCHEME ALLELES...

  Create a classical MLST DATABASE from a SCHEME csv and ALLELES files.

Options:
  -f, --force        	  Override alrealdy existing DATABASE

Search MLST profile of a strain

As with wgMLST analysis, you need a draft genome to find the MLST profile. If a new allele is present, you can obtain its sequence with the -f option. Multiple genomes can be searched simultaneously.

claMLST search --help
Usage: claMLST search [OPTIONS] DATABASE GENOMES

  Search ST number for assembly GENOMES using an mlst DATABASE

Options:
  -i, --identity FLOAT   Minimum identity to search gene (default=0.9)
  -c, --coverage FLOAT   Minimum coverage to search gene (default=0.9)
  -f, --fasta FILENAME   Write fasta file with gene allele
  -o, --output FILENAME  Write ST search result to (default:stdout)

Alternatively, you can search for the ST directly from raw reads, using single or paired FASTQ(.gz) files.

claMLST search2 --help
Usage: claMLST search2 [OPTIONS] DATABASE [FASTQS]...

  Search ST number from FASTQS(.gz) raw reads using an mlst DATABASE.

Options:
  -i, --identity FLOAT   Minimum identity to search gene (default=0.9).
  -c, --coverage FLOAT   Minimum coverage to search gene (default=0.95).
  -r, --reads INTEGER    Minimum reads coverage to search gene (default=10).
  --paired / --single    Defined type of fastqs files.
  -f, --fasta FILENAME   Write fasta file with gene allele.
  -o, --output FILENAME  Write ST search result to (default:stdout).

Publications

PyMLST v1 has already been used to analyse most clinical bacteria.
