
Indexes genomics data (nucleotide variants, kmers, MLST) for fast querying of features.


Genomics data index


This project aims to design a system that can index large amounts of genomics data and enable rapid querying of that data.

Indexing breaks genomes up into individual features (nucleotide mutations, k-mers, or genes/MLST) and stores the index in a directory which can easily be shared with other people. Indexes can be generated directly from sequence data or loaded from existing intermediate files (e.g., VCF files, MLST results).

# Analyze sequence data (reads/assemblies, compressed/uncompressed)
gdi analysis --reference-file genome.gbk.gz *.fasta.gz *.fastq.gz

# (Alternatively) Index features in previously computed files (VCF files, or MLST results)
gdi load vcf --reference-file reference.gbk.gz vcf-files.txt
gdi load mlst-tseemann mlst.tsv # Load from https://github.com/tseemann/mlst
gdi load mlst-sistr sistr-profiles.csv # Load from https://github.com/phac-nml/sistr_cmd

Querying provides both a Python API and a command-line interface for selecting sets of samples using this index or attached external data (e.g., phylogenetic trees or DataFrames of metadata).

Python API:

# Select samples with a D614G mutation on gene S
r = s.hasa('hgvs:MN996528.1:S:p.D614G')

# Select samples with Allele 100 for Locus (gene) adk in MLST scheme ecoli
r = s.hasa('ecoli:adk:100')

Summaries of the features (mutations, kmers, MLST) can be exported from a set of samples alongside nucleotide alignments, distance matrices or trees constructed from subsets of features.

r.summary_features()
Mutation Count
10 G>T 1
20 C>T 3
30 A>G 5

Trees and sets of selected samples can be visualized using the Python API together with the visualization tools of the ETE Toolkit.

# Parentheses allow the chain to span lines (a comment cannot follow a backslash continuation)
(r.tree_styler()
  .highlight(set1)
  .highlight(set2)
  # ...
  .render())

tree-visualization.png

You can see more examples of this software in action in the provided Tutorials.

1. Overview

The software is divided into two main components: (1) Indexing and (2) Querying.

1.1. Indexing

figure-index.png

The indexing component provides a mechanism to break genomes up into individual features and store these features in a database. The types of features supported include: Nucleotide mutations, K-mers, and Genes/MLST.
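As a rough illustration of the idea (not the project's actual storage format), an index of features can be thought of as a mapping from each feature name to the set of samples that carry it. The sample names, feature values, and the function build_feature_index below are hypothetical and exist only for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_feature_index(sample_features: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each feature name to the set of samples that carry it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for sample, features in sample_features.items():
        for feature in features:
            index[feature].add(sample)
    return dict(index)


# Hypothetical samples and features, for illustration only.
samples = {
    "SampleA": ["ref:100:A:T", "ecoli:adk:100"],
    "SampleB": ["ref:100:A:T"],
    "SampleC": ["ecoli:adk:100", "ref:250:G:C"],
}

index = build_feature_index(samples)
print(sorted(index["ref:100:A:T"]))    # ['SampleA', 'SampleB']
print(sorted(index["ecoli:adk:100"]))  # ['SampleA', 'SampleC']
```

With this layout, asking which samples have a given feature is a single dictionary lookup, which is the kind of query the index is meant to make fast.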

1.1.1. Naming features

Indexing assigns names to the individual features, represented as strings inspired by the Sequence Position Deletion Insertion (SPDI) model.

  1. Nucleotide mutations: sequence:position:deletion:insertion (e.g., ref:100:A:T)
  2. Genes/MLST: scheme:locus:allele (e.g., ecoli:adk:100)
  3. Kmers: Not implemented yet

Alternatively, names for nucleotide mutations can be given using HGVS notation (as output by SnpEff).

  1. Nucleotide mutations: hgvs:sequence:gene:p.protein_change (e.g., hgvs:ref:geneX:p.P20H).
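The naming formats above can be split back into their parts with a few lines of Python. This is a minimal sketch, assuming names always use a colon as the separator and that the position is an integer. The function parse_feature_name and the dictionary keys it returns are hypothetical and are not part of the gdi API.

```python
from typing import Dict, Union


def parse_feature_name(name: str) -> Dict[str, Union[str, int]]:
    """Split a feature name into labeled parts, following the formats listed above."""
    parts = name.split(":")
    if len(parts) == 4 and parts[0] == "hgvs":
        # hgvs:sequence:gene:p.protein_change
        return {"type": "hgvs", "sequence": parts[1], "gene": parts[2], "change": parts[3]}
    if len(parts) == 4:
        # sequence:position:deletion:insertion
        return {
            "type": "mutation",
            "sequence": parts[0],
            "position": int(parts[1]),
            "deletion": parts[2],
            "insertion": parts[3],
        }
    if len(parts) == 3:
        # scheme:locus:allele
        return {"type": "mlst", "scheme": parts[0], "locus": parts[1], "allele": parts[2]}
    raise ValueError(f"Unrecognized feature name: {name!r}")


print(parse_feature_name("ref:100:A:T"))
print(parse_feature_name("ecoli:adk:100"))
print(parse_feature_name("hgvs:ref:geneX:p.P20H"))
```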

1.2. Querying

figure-queries.png

The querying component provides a Python API or command-line interface for executing queries on the genomics index. The primary type of query is a Samples query which returns sets of samples based on different criteria. These criteria are grouped into different Methods. Each method operates on a particular type of Data which could include features stored in the genomics index as well as trees or external metadata.

1.2.1. Python API

An example query on an existing set of samples s would be:

r = s.isa('B.1.1.7', isa_column='lineage') \
     .isin(['SampleA'], distance=1, units='substitutions') \
     .hasa('MN996528.1:26568:C:A')

This would be read as:

Select all samples in s which are a B.1.1.7 lineage as defined in some attached DataFrame (isa()) AND which are within 1 substitution of SampleA as defined on a phylogenetic tree (isin()) AND which have a MN996528.1:26568:C:A mutation (hasa()).
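The AND semantics of chained methods can be pictured as repeated set intersection. The sketch below is a toy model only: ToySamples, its keep() method, and the sample data are hypothetical stand-ins and do not reflect how gdi implements isa(), isin(), or hasa().

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class ToySamples:
    """A toy stand-in for a samples query (not the real gdi class)."""
    selected: FrozenSet[str]

    def keep(self, matching: Set[str]) -> "ToySamples":
        # Each chained criterion narrows the current selection (logical AND).
        return ToySamples(self.selected & frozenset(matching))


# Hypothetical data standing in for a DataFrame, a tree, and the index.
lineage = {"SampleA": "B.1.1.7", "SampleB": "B.1.1.7", "SampleC": "B.1.351"}
near_sample_a = {"SampleA", "SampleB", "SampleC"}  # within 1 substitution of SampleA
has_mutation = {"SampleB", "SampleC"}              # carry MN996528.1:26568:C:A

s = ToySamples(frozenset(lineage))
r = (
    s.keep({name for name, value in lineage.items() if value == "B.1.1.7"})
     .keep(near_sample_a)
     .keep(has_mutation)
)
print(sorted(r.selected))  # ['SampleB']
```

Because every step returns a new selection, the order of the criteria does not change the final result, only which intermediate sets are produced along the way.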

Note: I have left out some details in this query. Full examples for querying are available at Tutorial 1: Salmonella dataset.

2. Background

A paper on this project is in progress. A detailed description can be found in my thesis.

Additionally, a poster on this project can be found at immem2022.

3. Installation

3.1. Conda

Conda is a package and environment manager that makes it easy to install and maintain software dependencies without requiring administrator/root access. Conda packages are distributed through channels, and the bioconda channel contains a large collection of bioinformatics software that can be installed automatically. To use conda, you will first have to download and install it. Once installed, you can use the conda command to install software and manage conda environments.

To install this software, first create a conda environment with the necessary dependencies as follows (a full conda package is not yet available; see https://github.com/apetkau/genomics-data-index/issues/51 ).

conda create -c conda-forge -c bioconda -c defaults --name gdi python=3.8 pyqt bedtools iqtree 'bcftools>=1.13' 'htslib>=1.13'

# Activate environment. Needed to install additional Python dependencies below.
conda activate gdi

Now, you can install with:

pip install genomics-data-index

If everything is working you should be able to run:

gdi --version

You should see the version printed out (e.g., gdi, version 0.1.0; the exact number depends on the installed release).
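The same check can be done from Python using the standard library, which may be convenient inside a notebook. This is a small sketch, assuming the package was installed under the distribution name genomics-data-index as in the pip command above. The helper installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


gdi_version = installed_version("genomics-data-index")
if gdi_version is None:
    print("genomics-data-index is not installed in this environment")
else:
    print(f"genomics-data-index {gdi_version} is installed")
```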

Additional dependencies

For SnpEff to work on Ubuntu, you will need to install the mkisofs package (e.g., sudo apt install mkisofs). I do not know the exact package name on other systems.

3.2. PyPI/pip

To install just the Python component of this project from PyPI you can run the following:

pip install genomics-data-index

Note that you will have to install some additional dependencies separately. Please see the conda-env.yaml environment file for details.

3.3. From GitHub

To install the project from the source on GitHub please first clone the git repository:

git clone https://github.com/apetkau/genomics-data-index.git
cd genomics-data-index

Now install all the dependencies using conda and bioconda with:

conda env create -f conda-env.yaml
conda activate gdi

Once these are installed you can set up the Python package with:

pip install .

4. Usage

The main command is called gdi. A quick overview of the usage is as follows:

4.1. Indexing

# Create new index in `index/`
# cd to `index/` to make next commands easier to run
gdi init index
cd index

# Creates an index of mutations (VCF files) and kmer sketches (sourmash)
gdi analysis --use-conda --include-kmer --kmer-size 31 --reference-file genome.gbk.gz *.fastq.gz

# (Optional) build tree from mutations (against reference genome `genome`) for phylogenetic querying
gdi rebuild tree --align-type full genome

The produced index will be in the directory index/.

4.2. Querying

# List indexed samples
gdi list samples

# Query for genomes with mutation
gdi query mutation 'genome:10:A:T'
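If you want to drive these commands from a script, one option is to call gdi through Python's subprocess module. The sketch below only assumes the commands shown above and the --project-dir option from the main usage statement. The helper gdi_command and the project directory name index are illustrative, and the command is run only if gdi is found on the PATH.

```python
import shutil
import subprocess
from typing import List, Optional


def gdi_command(args: List[str], project_dir: Optional[str] = None) -> List[str]:
    """Build a gdi command line, optionally pointing at a project directory."""
    command = ["gdi"]
    if project_dir is not None:
        # Global options go before the subcommand.
        command += ["--project-dir", project_dir]
    return command + args


query = gdi_command(["query", "mutation", "genome:10:A:T"], project_dir="index")
print(query)

# Only run the command when gdi is actually available on the PATH.
if shutil.which("gdi") is not None:
    result = subprocess.run(query, capture_output=True, text=True, check=False)
    print(result.stdout)
```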

4.3. Main usage statement

Usage: gdi [OPTIONS] COMMAND [ARGS]...

Options:
  --project-dir TEXT              A project directory containing the data and
                                  connection information.

  --ncores INTEGER RANGE          Number of cores for any parallel processing
                                  [default: 8]

  --log-level [DEBUG|INFO|WARNING|ERROR|CRITICAL]
                                  Sets the log level  [default: INFO]
  --version                       Show the version and exit.
  --config FILE                   Read configuration from FILE.
  --help                          Show this message and exit.

Commands:
  analysis
  build
  db
  export
  init
  input
  list
  load
  query
  rebuild

5. Tutorial

Tutorials and demonstrations of the software are listed below (the code is in a separate repository). Select the [launch | binder] badge to launch a tutorial in an interactive Jupyter environment in the cloud using Binder.

  1. Tutorial 1: Querying (Salmonella) - Binder
    • In case GitHub link is not rendering try here
  2. Tutorial 2: Indexing assemblies (SARS-CoV-2) - Binder
    • In case GitHub link is not rendering try here
  3. Tutorial 3: Querying overview - Binder
    • In case GitHub link is not rendering try here

Alternatively, you can run these tutorials on your local machine. In order to run these tutorials you will first have to install the genomics-data-index software (see the Installation section for details). In addition, you will have to install Jupyter Lab. If you have already installed the genomics-data-index software with conda you can install Jupyter Lab as follows:

conda activate gdi
conda install jupyterlab

To run Jupyter you can run the following:

# Setting QT_QPA_PLATFORM="offscreen" avoids having to set the DISPLAY environment variable for Qt.
# You can skip this variable if your machine has an X server installed and configured.

QT_QPA_PLATFORM="offscreen" jupyter lab

Please see the instructions for Jupyter Lab for details.

6. Acknowledgements

I would like to acknowledge the Public Health Agency of Canada, the University of Manitoba, and the VADA Program for providing me with the opportunity, resources and training for working on this project.

Some icons used in this documentation are provided by Font Awesome and licensed under a Creative Commons Attribution 4.0 license.
