unitig-caller: wrapper around mantis to detect presence of sequence elements

unitig-caller

Determines presence/absence of sequence elements in bacterial sequence data using Bifrost Build and Query functions. Uses assemblies and/or reads as inputs.

unitig-caller is a wrapper around Bifrost that formats files for use with pyseer. It also includes a separate implementation that calls sequences using an FM-index.

Build mode creates a compact de Bruijn graph using Bifrost. Query mode converts the .gfa file produced by Build mode to a .fasta, using an associated colours file to query the presence of unitigs in the source genomes used to build the original de Bruijn graph.

Simple mode finds presence of unitigs in a new population using an FM-index.

Install

Use unitig-caller if installed through pip/conda, or python unitig_caller-runner.py if using a clone of the code.

With conda (recommended)

Get it from bioconda:

conda install unitig-caller

If you haven't set up conda yet, first install miniconda. Then add the correct channels:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

With pip

Get it from PyPI:

pip install unitig-caller

Requires Bifrost version 1.0.3 to be installed and accessible via PATH (see the installation steps on the Bifrost GitHub page).

From source

Requires cmake, pthreads, pybind11 and a C++17 compiler (e.g. gcc >=7.3), in addition to the pip requirements.

git clone https://github.com/johnlees/unitig-caller --recursive
cd unitig-caller
python setup.py install

Usage

There are three ways to use this package:

  1. Build a population graph to extract unitigs for GWAS with pyseer, like unitig-counter (--build, then --query).
  2. Find these unitigs in a new population using a graph (--build and --query).
  3. Find these unitigs in a new population using an index (--simple).

For 1), run --build mode followed by --query mode.

Both 2) and 3) give the same results using different indexing tools: each finds the unitigs in a new population so that pyseer models can be applied to it.

For 2), first run --build mode to make a graph for the new population. Then run --query mode with this graph, but the --unitigs from the original population.

For 3), run --simple mode giving the new genomes as --refs and the --unitigs from the original population.
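The three routes above can be sketched as command lists built in Python. This is only an illustration: it uses the flags documented in this README, while every file name and prefix in the example (new_refs.txt, new_graph, new_calls, original_unitigs.fasta) is a hypothetical placeholder. It also assumes that the --output prefix given to --build is the prefix to pass as --graph-prefix.

```python
def build_cmd(refs, graph_prefix):
    """Command for --build: make a graph from the files listed in refs."""
    return ["unitig-caller", "--build", "--refs", refs, "--output", graph_prefix]


def query_cmd(graph_prefix, out_prefix, unitigs=None):
    """Command for --query: call unitigs against an existing graph.

    If unitigs is None, the graph's own unitigs are used (route 1).
    """
    cmd = ["unitig-caller", "--query", "--graph-prefix", graph_prefix,
           "--output", out_prefix]
    if unitigs is not None:
        cmd += ["--unitigs", unitigs]
    return cmd


def simple_cmd(refs, unitigs, out_prefix):
    """Command for --simple: call unitigs with an FM-index (route 3)."""
    return ["unitig-caller", "--simple", "--refs", refs,
            "--unitigs", unitigs, "--output", out_prefix]


# Route 2 with hypothetical file names: graph the new population, then
# query it with the unitigs from the original population.
route2 = [
    build_cmd("new_refs.txt", "new_graph"),
    query_cmd("new_graph", "new_calls", unitigs="original_unitigs.fasta"),
]
```

Each list can be passed to subprocess.run(cmd, check=True) once unitig-caller is installed.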

These modes are detailed below.

Running Build mode

This uses Bifrost Build to generate a compact de Bruijn graph. By default this is a coloured compact de Bruijn graph.

unitig-caller --build --refs refs.txt --reads reads.txt --output out_prefix

--refs is a required .txt file listing paths of input assemblies or read files (.fasta or .fastq), each on a new line. Must be specified as either 'refs.txt' for assemblies or 'reads.txt' for read files. No header row.

--reads is an optional .txt file listing paths to additional sequence files of a different type to those specified in --refs (e.g. if 'refs.txt' is given in --refs, then 'reads.txt' will be given in --reads and vice versa), each on a new line. No header row.

--output is the prefix for output files.
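As a minimal sketch of preparing the list files described above, the helper below writes one path per line with no header row. The function name write_path_list and the directory layout are assumptions made for illustration; only the file format (one path per line, .fasta or .fastq inputs) comes from this README.

```python
from pathlib import Path


def write_path_list(seq_dir, list_file, suffix):
    """Write the paths of all files in seq_dir ending in suffix, one per line.

    Returns the number of paths written. No header row is written.
    """
    paths = sorted(
        p for p in Path(seq_dir).iterdir()
        if p.is_file() and p.suffix == suffix
    )
    Path(list_file).write_text("".join(f"{p}\n" for p in paths))
    return len(paths)
```

For example, write_path_list("assemblies", "refs.txt", ".fasta") would produce a list of assemblies and write_path_list("reads", "reads.txt", ".fastq") a list of read files, where both directory names are placeholders.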

By default de Bruijn graphs are coloured, with an accompanying .bfg_colors being generated alongside the .gfa file. To turn this off, use --no_colour. Note, Query mode cannot be run without a .bfg_colors file.

To generate a clean de Bruijn graph (clip tips and delete isolated contigs shorter than k k-mers in length), specify --clean.

Build mode automatically generates a .fasta file containing unitigs found within the graph.

Running Query mode

Before running Query mode, generate a coloured compact de Bruijn graph using Build mode. Then run the Query command as below.

unitig-caller --query --graph-prefix in_prefix --unitigs query_unitigs.fasta --output out_prefix

--graph-prefix is the required prefix for the .gfa, .bfg_colors and unitigs .fasta files generated from --build mode applied to the new population.

--unitigs is an optional .fasta file, specifying a separate unitigs .fasta file that was generated by --build mode on another graph. If not specified, unitigs from the graph will be used, generating calls for this population.

--output is the prefix for output files.

The sensitivity of querying can be altered by passing a float argument to --ratiok (between 0 and 1, default 1.0). This sets the proportion of k-mers in a unitig that must be present in a given colour for the unitig to be called in that colour. Specifying --inexact will search the graph for both exact and inexact k-mers (1 substitution or indel) from queries. Lowering --ratiok and/or specifying --inexact will result in more colour hits per unitig, but will increase the probability of false positives and the run-time.
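The effect of the --ratiok threshold can be illustrated with a toy calculation. This is not Bifrost's implementation: it is a simplified sketch that assumes each colour is represented as a plain set of k-mers, and it ignores reverse complements and inexact matching. The function names are hypothetical.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def colour_called(unitig, colour_kmers, k, ratiok=1.0):
    """True if the fraction of the unitig's k-mers found in colour_kmers
    is at least ratiok."""
    query = kmers(unitig, k)
    if not query:
        return False  # unitig shorter than k: nothing to match
    found = sum(1 for kmer in query if kmer in colour_kmers)
    return found / len(query) >= ratiok


# Toy example with k = 3: the unitig has 4 k-mers, 3 of which are in the colour.
colour = {"ACG", "CGT", "GTA"}
strict = colour_called("ACGTAC", colour, k=3, ratiok=1.0)    # 3/4 < 1.0
relaxed = colour_called("ACGTAC", colour, k=3, ratiok=0.75)  # 3/4 >= 0.75
```

With the default threshold of 1.0 the unitig is not called in this colour, whereas lowering the threshold to 0.75 produces a call, which is how a lower value yields more hits per unitig.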

Running simple mode

This uses suffix arrays (FM-index) provided by SeqAn3 to perform string matches:

unitig-caller --simple --refs strain_list.txt --unitigs queries.txt --output calls

--refs is a required file listing input assemblies, name followed by location of fasta file (tab separated), each on a new line. No header row.
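A minimal sketch for producing a list in this format is shown below. Using the file stem as the strain name is an assumption made for illustration, as is the function name; the tab-separated, header-free layout is taken from the description above.

```python
from pathlib import Path


def write_strain_list(fasta_paths, list_file):
    """Write one 'name<TAB>path' line per assembly, with no header row.

    The name is taken from the file stem (an assumption for this sketch).
    """
    with open(list_file, "w") as out:
        for path in fasta_paths:
            path = Path(path)
            out.write(f"{path.stem}\t{path}\n")
```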

--unitigs is a required list of the unitig sequences to call. The unitigs need to be in the first column (tab separated). A header row is assumed, so output from pyseer etc can be directly used.
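To check which sequences such a file would supply, the first column can be read while skipping the header row, as in this small sketch. The function name is hypothetical, and the column names in any example file are placeholders rather than a specification of pyseer output.

```python
import csv


def read_unitig_column(path):
    """Return the first tab-separated column of path, skipping the header row."""
    with open(path, newline="") as handle:
        rows = csv.reader(handle, delimiter="\t")
        next(rows, None)  # a header row is assumed, so discard it
        return [row[0] for row in rows if row and row[0]]
```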

calls_pyseer.txt will contain unitig calls in seer/pyseer k-mer format.
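The calls can be loaded for inspection with a short parser. This sketch assumes the seer/pyseer k-mer layout is one line per sequence of the form `SEQUENCE | sample1:1 sample2:1`; check this against your own output file before relying on it. The function name is hypothetical.

```python
def parse_calls(lines):
    """Map each sequence to the list of samples it was called in.

    Assumes lines of the form 'SEQUENCE | sample1:1 sample2:1'.
    """
    calls = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        sequence, _, samples = line.partition("|")
        # Drop the trailing ':count' from each sample field.
        calls[sequence.strip()] = [
            field.rsplit(":", 1)[0] for field in samples.split()
        ]
    return calls
```

Usage with a file would be along the lines of `with open("calls_pyseer.txt") as fh: calls = parse_calls(fh)`.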

By default FM-indexes are saved in the same location as the assembly files so that they can be quickly loaded by subsequent runs. To turn this off use --no-save-idx.
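To illustrate the kind of string matching an FM-index supports, here is a toy backward search in pure Python. It is not the SeqAn3 implementation used by simple mode: the suffix array is built naively, nothing is compressed or saved to disk, and reverse complements are not considered. All names are hypothetical.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-index (first-column offsets and occurrence table)."""
    text += "$"  # sentinel that sorts before A, C, G and T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    # first[c]: number of characters in text that sort before c.
    counts = Counter(text)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return first, occ, len(bwt)


def count_matches(index, pattern):
    """Count exact occurrences of pattern using backward search."""
    first, occ, n = index
    lo, hi = 0, n
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("ACGTACGA")
present = count_matches(index, "ACG") > 0  # the query occurs in the text
```

Because the index depends only on the reference sequence, it can be built once and reused for many queries, which is the motivation for saving indexes between runs.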

Option reference

usage: unitig-caller [-h] (--build | --query | --simple) [--refs REFS]
                     [--reads READS] [--graph-prefix GRAPH_PREFIX]
                     [--unitigs UNITIGS] [--output OUTPUT] [--no_colour]
                     [--clean] [--ratiok RATIOK] [--inexact]
                     [--kmer_size KMER_SIZE] [--minimizer_size MINIMIZER_SIZE]
                     [--no-save-idx] [--threads THREADS] [--bifrost BIFROST]
                     [--version]

Call unitigs in a population dataset

optional arguments:
  -h, --help            show this help message and exit

Mode of operation:
  --build               Build coloured/uncoloured de Bruijn graph using
                        Bifrost
  --query               Query unitig presence/absence across input genomes
  --simple              Use FM-index to make calls

Unitig-caller input/output:
  --refs REFS           Ref file to use to --build bifrost graph (or with
                        --simple)
  --reads READS         Read file to use to --build bifrost graph
  --graph-prefix GRAPH_PREFIX
                        Prefix of bifrost graph to --query
  --unitigs UNITIGS     fasta file of unitigs to query (--query or --simple)
  --output OUTPUT       Prefix for output [default = 'unitig_caller']

Build Input/output:
  --no_colour           Specify for uncoloured de Bruijn Graph [default =
                        False]
  --clean               Clean DBG (clip tips and delete isolated contigs
                        shorter than k k-mers in length) [default = False]

Query Input/output:
  --ratiok RATIOK       ratio of k-mers from queries that must occur in the
                        graph to be considered as belonging to colour [default
                        = 1.0]
  --inexact             Graph is searched with exact and inexact k-mers (1
                        substitution or indel) from queries [default = False]

Bifrost options:
  --kmer_size KMER_SIZE
                        K-mer size for graph building/querying [default = 31]
  --minimizer_size MINIMIZER_SIZE
                        Minimizer size to be used for k-mer hashing [default =
                        23]

Simple mode options:
  --no-save-idx         Do not save FM-indexes for reuse

Other:
  --threads THREADS     Number of threads to use [default = 1]
  --bifrost BIFROST     Location of bifrost executable [default = Bifrost]
  --version             show program's version number and exit

Citation

If you use this, please cite the Bifrost paper:

Holley, G., Melsted, P. Bifrost – Highly parallel construction and indexing of colored and compacted de Bruijn graphs. bioRxiv 695338 (2019). doi: https://doi.org/10.1101/695338
