Differential abundance workflow for microbiome data

Qadabra

Quantitative Analysis of Differential Abundance Ranks

Qadabra is a Snakemake workflow for comparing the results of differential abundance tools. Importantly, Qadabra focuses on feature ranks rather than FDR-corrected p-values.

Installation

pip install qadabra

Qadabra requires the following dependencies:

snakemake
click
biom-format
pandas
numpy
cython
iow

Usage

Creating the workflow structure

Qadabra can be used on multiple datasets at once. First, we want to create the workflow structure to perform differential abundance analysis with all tools.

qadabra create-workflow --workflow-dest my_qadabra

This command initializes the workflow, but we still need to point it to our dataset(s) of interest.

Adding a dataset

We can add datasets one-by-one with the add-dataset command.

qadabra add-dataset \
    --workflow-dest my_qadabra \
    --table data/table.biom \
    --metadata data/metadata.tsv \
    --name my_dataset_1 \
    --factor-name case_control \
    --target-level case \
    --reference-level control \
    --verbose

Let's walk through the arguments provided here:

workflow-dest: The location of the workflow that we created earlier
table: Feature table (features by samples) in BIOM format
metadata: Sample metadata in TSV format
name: Name to give this dataset
factor-name: Metadata column to use for differential abundance
target-level: The value in the chosen factor to use as the target
reference-level: The reference level to which we want to compare our target
verbose: Flag to show all preprocessing performed by Qadabra

You can use qadabra add-dataset --help for more details. To add another dataset, just run this command again with the new dataset information.
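
When several datasets share the same factor and levels, the add-dataset calls can be scripted. The following is a minimal sketch that only assembles the command lines using the flags described above. The helper name build_add_dataset_cmd and the second dataset's paths and name are hypothetical, and nothing is executed unless you pass each command to subprocess.run yourself.

```python
import shlex


def build_add_dataset_cmd(workflow_dest, table, metadata, name,
                          factor_name, target_level, reference_level,
                          verbose=True):
    """Assemble a `qadabra add-dataset` command as an argument list.

    Only the flags documented in this README are used.
    """
    cmd = [
        "qadabra", "add-dataset",
        "--workflow-dest", workflow_dest,
        "--table", table,
        "--metadata", metadata,
        "--name", name,
        "--factor-name", factor_name,
        "--target-level", target_level,
        "--reference-level", reference_level,
    ]
    if verbose:
        cmd.append("--verbose")
    return cmd


# Hypothetical datasets; replace with your own paths and names.
datasets = [
    {"table": "data/table.biom", "metadata": "data/metadata.tsv", "name": "my_dataset_1"},
    {"table": "data2/table.biom", "metadata": "data2/metadata.tsv", "name": "my_dataset_2"},
]

commands = [
    build_add_dataset_cmd(
        workflow_dest="my_qadabra",
        factor_name="case_control",
        target_level="case",
        reference_level="control",
        **ds,
    )
    for ds in datasets
]

for cmd in commands:
    # Print a shell-ready command; use subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Building each command as a list rather than a single string avoids quoting problems if a path or level name contains spaces.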

Running the workflow

The previous commands create a subdirectory, my_qadabra, which contains the workflow structure. Navigate into this directory; you should see two folders: config and workflow. If you open the config/config.yaml file, you can see a number of options with which to run Qadabra. You can modify these as you like. For example, if you only want to run DESeq2, ANCOM-BC, and Songbird, you can delete the other entries under the tools heading.

From the command line, execute snakemake --use-conda <other options> to start the workflow. Please read the Snakemake documentation for how to run Snakemake best on your system.

When this process is completed, you should have directories figures, results, and log. Each of these directories will have a separate folder for each dataset you added.

Generating a report

You can also generate a report of the workflow with the following command:

snakemake --report report.zip

This will create a zipped directory containing the report. Unzip this file and open the report.html file to view the report in your browser.

Additional workflow options

Workflow subset

In some cases you may not want to run the full workflow and may only be interested in running the differential abundance tools. You can use snakemake all_differentials --use-conda <other options> to skip the machine learning and visualization parts of the workflow.

Phylogenetic visualization

Qadabra allows users to visualize the differentials on a phylogenetic tree using EMPress. With EMPress, you can annotate the tree with the differentials as barplots. This can be useful for determining phylogenetic signal in differential abundance.

Incorporating confounders

You can also specify additional confounders to incorporate into your DA model. When adding a dataset, use --confounder <column name> to add a confounder into your model. You can add multiple confounders by adding more --confounder <column name> arguments to add-dataset.
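
If the confounders are kept in a list, the repeated flags can be generated rather than typed by hand. This is a small sketch; the helper name confounder_args and the column names age and sex are hypothetical placeholders for columns in your own metadata.

```python
def confounder_args(columns):
    """Return one `--confounder <column>` pair per metadata column."""
    args = []
    for column in columns:
        args.extend(["--confounder", column])
    return args


# Hypothetical metadata columns to treat as confounders.
extra_args = confounder_args(["age", "sex"])

# These arguments would be appended to the rest of an add-dataset command.
print(" ".join(extra_args))
```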

Workflow Overview

Qadabra runs several differential abundance tools on the same dataset. The features are ranked according to their association with the given metadata covariate. The top and bottom features are then used to create log-ratios according to Morton 2019 and Fedarko 2020. These log-ratios are used as predictors in logistic regression models to predict the class given the log-ratio.
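
To make the log-ratio step concrete, here is a minimal sketch that uses only the Python standard library. It is not Qadabra's implementation: the function name log_ratio, the toy differentials and counts, and the pseudocount used to avoid taking the log of zero are all assumptions made for illustration.

```python
import math


def log_ratio(sample_counts, numerator, denominator, pseudocount=1.0):
    """Natural log of summed numerator counts over summed denominator counts.

    The pseudocount is an assumption of this sketch to avoid log(0); it does
    not describe how Qadabra itself handles zeros.
    """
    num = sum(sample_counts.get(f, 0) for f in numerator) + pseudocount
    den = sum(sample_counts.get(f, 0) for f in denominator) + pseudocount
    return math.log(num / den)


# Hypothetical differentials from one tool (higher = more associated with the target level).
differentials = {"f1": 2.1, "f2": 1.4, "f3": 0.2, "f4": -0.3, "f5": -1.8, "f6": -2.5}
ranked = sorted(differentials, key=differentials.get, reverse=True)
top, bottom = ranked[:2], ranked[-2:]

# Hypothetical feature counts for two samples.
table = {
    "case_1": {"f1": 40, "f2": 25, "f5": 3, "f6": 1},
    "control_1": {"f1": 2, "f2": 4, "f5": 30, "f6": 50},
}

ratios = {sample: log_ratio(counts, top, bottom) for sample, counts in table.items()}
for sample, value in ratios.items():
    print(f"{sample}\t{value:.3f}")
```

In this toy example the case sample gets a positive log-ratio and the control sample a negative one, which is the kind of separation the logistic regression step relies on.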

Output

Qadabra generates many result files, including intermediate files that can be explored further.

Results

Each tool's output is stored in a separate subdirectory. For the R tools, an RDS object with the tool's R data is saved. The raw outputs are processed and concatenated into a file called concatenated_differentials.tsv. A Qurro visualization of all the tool ranks is generated at results/<dataset>/qurro/index.html. An interactive table with all the tool outputs is at results/<dataset>/differentials_table.html.

For each tool, the ranked features are used for machine learning models. The config.yaml file lists the percentiles of features to use for log-ratios. For example, at 5%, the top 5% and the bottom 5% of features ranked by association with the covariate are used to compute a log-ratio for each sample. This log-ratio is then used in repeated K-fold cross-validation to determine how well it predicts class membership using logistic regression. The ml subdirectory of each tool contains the features used, sample log-ratios, and compressed model objects.
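
The percentile-based selection can be sketched as follows. This is an illustration rather than Qadabra's code: the function name select_features, the toy differentials, and the choice to round up so that each side has at least one feature are assumptions.

```python
import math


def select_features(differentials, percentile):
    """Return (top, bottom) feature lists at the given percentile.

    Rounding up to at least one feature per side is an assumption of this sketch.
    """
    ranked = sorted(differentials, key=differentials.get, reverse=True)
    n = max(1, math.ceil(len(ranked) * percentile / 100))
    return ranked[:n], ranked[-n:]


# 40 hypothetical features with differentials 19.5, 18.5, ..., -19.5.
differentials = {f"feature_{i}": 19.5 - i for i in range(40)}

top, bottom = select_features(differentials, percentile=5)
print(top)     # the 2 highest-ranked features
print(bottom)  # the 2 lowest-ranked features
```

With 40 features, a 5% threshold selects two features on each side; their summed abundances would form the numerator and denominator of the per-sample log-ratio.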

Figures

The differential rank plots of each tool are saved as <tool_name>_differentials.svg. A heatmap of the Kendall rank correlation between all pairs of tools is available as well. Qadabra also generates interactive plots to help compare the ranks of different features from the tools. figures/pca.svg is a PCA plot of all the features, showing the concordance and discordance of results as well as the contribution of the tools. You can use the figures/rank_comparisons.html webpage to dynamically explore the relationship between pairs of tools. The upset subdirectory contains UpSet plots comparing the features from each tool. Finally, the roc and pr subdirectories contain ROC and PR plots, respectively, of all tools at each percentile of features.
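
For readers unfamiliar with Kendall rank correlation, the following sketch computes it for every pair of tools using only the standard library. It implements the simple tau-a variant, in which ties count as neither concordant nor discordant. Which variant Qadabra uses is not stated here, and the tool names and scores are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical differentials for the same five features from three tools.
tool_scores = {
    "tool_a": [2.0, 1.0, 0.5, -1.0, -2.0],
    "tool_b": [1.5, 1.8, 0.1, -0.5, -3.0],
    "tool_c": [-2.0, -1.0, -0.5, 1.0, 2.0],
}

for a, b in combinations(tool_scores, 2):
    print(a, b, kendall_tau_a(tool_scores[a], tool_scores[b]))
```

Here tool_a and tool_b mostly agree (tau = 0.8), while tool_c reverses tool_a's ranking exactly (tau = -1.0).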
