Metagenomic binning with semi-supervised siamese neural network
SemiBin: Semi-supervised Metagenomic Binning Using Siamese Neural Networks
Command-line tool for metagenomic binning with semi-supervised deep learning using information from reference genomes.
NOTE: This tool is still in development. You are welcome to try it out, and feedback is appreciated, but expect some bugs and rapid changes until it stabilizes. Please use GitHub issues for bug reports and the SemiBin users mailing list for more open-ended discussions or questions.
If you use this software in a publication please cite:
SemiBin: Incorporating information from reference genomes with semi-supervised deep learning leads to better metagenomic assembled genomes (MAGs) Shaojun Pan, Chengkai Zhu, Xing-Ming Zhao, Luis Pedro Coelho bioRxiv 2021.08.16.456517; doi: https://doi.org/10.1101/2021.08.16.456517
Installation
SemiBin runs on Python 3.7-3.9.
Bioconda
NOTE: If you want to use SemiBin with a GPU, you need to install PyTorch with GPU support. Otherwise, conda install -c bioconda semibin just installs PyTorch with CPU support.
conda create -n SemiBin python==3.7
conda activate SemiBin
conda install -c conda-forge -c bioconda semibin
conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch-lts
Source
You can download the source code from GitHub and install it.
Install the dependencies using conda: MMseqs2, Bedtools, Hmmer, and Fraggenescan.
conda install -c conda-forge -c bioconda mmseqs2=13.45111
conda install -c bioconda bedtools hmmer fraggenescan==1.30
Once the dependencies are installed, you can install by running:
python setup.py install
Examples of binning
NOTE: The SemiBin API is a work-in-progress. The examples refer to version 0.5, but this may change in the near future (after the release of version 1.0, we expect to freeze the API for at least 5 years). We are very happy to hear any feedback on API design, though.
SemiBin supports single-sample, co-assembly and multi-sample binning.
The basic idea of using SemiBin with single-sample and co-assembly is:
(1) generating data.csv and data_split.csv (used in training) for every sample,
(2) training the model for every sample, and
(3) binning the contigs of every sample with the model trained from the same sample.
Multi-sample binning follows the same idea, but the inputs are the contigs combined from several samples and the BAM files from those samples. We again generate data.csv and data_split.csv, then train and bin for every sample. The only difference from single-sample binning is that data.csv and data_split.csv contain abundance information from several samples.
Because contig annotation and model training require significant computational time, and given the algorithm design of SemiBin, we propose SemiBin(pretrain) for single-sample binning, which lets you:
(1) train a model from one or several samples (or use one of our built-in pretrained models) and
(2) directly apply this model to other samples.
For the details and examples of every command to run SemiBin with these binning modes, please read the docs.
Easy single/co-assembly binning mode
You will need the following inputs:
- A contig file (contig.fna in the example below)
- BAM files from mapping short reads to the contigs
The single_easy_bin command can be used in single-sample and co-assembly binning modes (contig annotations using mmseqs with GTDB reference genome) to get results in a single step.
SemiBin single_easy_bin -i contig.fna -b *.bam -o output
In this example, SemiBin will download GTDB to $HOME/.cache/SemiBin/mmseqs2-GTDB/GTDB. You can change this default using the -r argument.
You can use --environment with human_gut, dog_gut, ocean, soil, cat_gut, human_oral, mouse_gut, pig_gut, built_environment, wastewater, or global to use one of our built-in models.
SemiBin single_easy_bin -i contig_S1.fna -b S1.bam -o output --environment human_gut
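When scripting runs over many samples, it can help to check the environment name before launching anything. The following is a minimal sketch, not part of SemiBin: the helper single_easy_bin_command is a hypothetical name, it only uses the -i, -b, -o and --environment options shown above, and it builds the command line without running it.

```python
import shlex

# Environment names as listed in the text above.
ENVIRONMENTS = {
    "human_gut", "dog_gut", "ocean", "soil", "cat_gut", "human_oral",
    "mouse_gut", "pig_gut", "built_environment", "wastewater", "global",
}


def single_easy_bin_command(contigs, bams, output, environment=None):
    """Build (but do not run) a single_easy_bin command line.

    Hypothetical helper: it only assembles the options shown in this README.
    """
    cmd = ["SemiBin", "single_easy_bin", "-i", contigs, "-b", *bams, "-o", output]
    if environment is not None:
        if environment not in ENVIRONMENTS:
            raise ValueError(f"unknown environment: {environment!r}")
        cmd += ["--environment", environment]
    return cmd


cmd = single_easy_bin_command("contig_S1.fna", ["S1.bam"], "output", "human_gut")
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list could be passed to subprocess.run once SemiBin is installed; a misspelled environment name then fails immediately instead of after the job has started.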
Easy multi-samples binning mode
The multi_easy_bin command can be used in multi-sample binning mode (contig annotations using mmseqs with GTDB reference genome).
You will need the following inputs:
- A combined contig file
- BAM files from mapping
Every contig name must have the format <sample_name>:<contig_name>, where : is the default separator (it can be changed with the --separator argument). NOTE: Make sure the sample names are unique and that the separator does not introduce confusion when splitting. For example:
>S1:Contig_1
AGATAATAAAGATAATAATA
>S1:Contig_2
CGAATTTATCTCAAGAACAAGAAAA
>S1:Contig_3
AAAAAGAGAAAATTCAGAATTAGCCAATAAAATA
>S2:Contig_1
AATGATATAATACTTAATA
>S2:Contig_2
AAAATATTAAAGAAATAATGAAAGAAA
>S3:Contig_1
ATAAAGACGATAAAATAATAAAAGCCAAATCCGACAAAGAAAGAACGG
>S3:Contig_2
AATATTTTAGAGAAAGACATAAACAATAAGAAAAGTATT
>S3:Contig_3
CAAATACGAATGATTCTTTATTAGATTATCTTAATAAGAATATC
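A combined file in this format could be produced from per-sample FASTA files with a short script. The sketch below is an illustration, not part of SemiBin: the function name combine_contigs and the demo file names are hypothetical, and it assumes that only the contig identifier (the header text before the first whitespace) should be kept.

```python
import os
import tempfile


def combine_contigs(sample_to_fasta, out_path, separator=":"):
    """Write one FASTA whose headers are <sample_name><separator><contig_name>."""
    with open(out_path, "w") as out:
        for sample, fasta_path in sample_to_fasta.items():
            if separator in sample:
                raise ValueError(f"sample name {sample!r} contains the separator")
            with open(fasta_path) as fasta:
                for line in fasta:
                    if line.startswith(">"):
                        fields = line[1:].split()
                        if not fields:
                            raise ValueError(f"empty header in {fasta_path}")
                        # Keep only the identifier, dropping any description.
                        out.write(f">{sample}{separator}{fields[0]}\n")
                    elif line.strip():
                        # Sequence line: copy it, guaranteeing a trailing newline.
                        out.write(line.rstrip("\n") + "\n")


# Demo with two tiny, made-up samples written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    inputs = {}
    for sample, seq in [("S1", "AGATAATAAAGATAATAATA"), ("S2", "AATGATATAATACTTAATA")]:
        path = os.path.join(tmp, sample + ".fna")
        with open(path, "w") as fh:
            fh.write(">Contig_1 some description\n" + seq + "\n")
        inputs[sample] = path
    combined = os.path.join(tmp, "contig_whole.fna")
    combine_contigs(inputs, combined)
    with open(combined) as fh:
        combined_text = fh.read()

print(combined_text)
```

The reads of every sample would then be mapped to this combined file, so that the contig names in the BAM files match the renamed headers.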
You can get the results with one line of code.
SemiBin multi_easy_bin -i contig_whole.fna -b *.bam -o output
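Before starting a long multi-sample run, it may be worth checking that the headers of the combined file follow the naming format described above. The sketch below is a hypothetical pre-flight check, not a SemiBin feature; it assumes, more strictly than SemiBin itself may require, that the separator occurs exactly once in every contig name.

```python
from collections import Counter


def contigs_per_sample(fasta_lines, separator=":"):
    """Validate <sample><separator><contig> headers and count contigs per sample."""
    counts = Counter()
    seen = set()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, nothing to check
        fields = line[1:].split()
        if not fields:
            raise ValueError("empty FASTA header")
        name = fields[0]
        if name.count(separator) != 1:
            raise ValueError(f"{name!r}: expected exactly one {separator!r}")
        sample, contig = name.split(separator)
        if not sample or not contig:
            raise ValueError(f"{name!r}: empty sample or contig name")
        if name in seen:
            raise ValueError(f"{name!r}: duplicated contig name")
        seen.add(name)
        counts[sample] += 1
    return dict(counts)


# Headers taken from the example earlier in this section.
headers = [
    ">S1:Contig_1", ">S1:Contig_2", ">S1:Contig_3",
    ">S2:Contig_1", ">S2:Contig_2",
    ">S3:Contig_1", ">S3:Contig_2", ">S3:Contig_3",
]
print(contigs_per_sample(headers))  # {'S1': 3, 'S2': 2, 'S3': 3}
```

An open file handle can be passed in place of the list, since the function only iterates over lines.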
Output
The output folder will contain:
- Datasets used for training and clustering.
- Saved semi-supervised deep learning model.
- Output bins.
- Some intermediate files.
For every sample, the reconstructed bins are in the output_bins directory. When reclustering is used, the reconstructed bins are in the output_recluster_bins directory.
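A quick way to inspect the result is to count the contigs and bases in each bin. The sketch below rests on assumptions that are not confirmed by this README: that every file in the bin directory is an uncompressed FASTA file holding one bin, and that the directory sits where you expect it inside the output folder. The function name summarize_bins and the demo file names are made up.

```python
import os
import tempfile


def summarize_bins(bin_dir):
    """Return {file name: (number of contigs, total bases)} for files in bin_dir.

    Assumes each regular file is an uncompressed FASTA file for one bin.
    """
    summary = {}
    for name in sorted(os.listdir(bin_dir)):
        path = os.path.join(bin_dir, name)
        if not os.path.isfile(path):
            continue
        n_contigs = 0
        n_bases = 0
        with open(path) as fh:
            for line in fh:
                if line.startswith(">"):
                    n_contigs += 1
                else:
                    n_bases += len(line.strip())
        summary[name] = (n_contigs, n_bases)
    return summary


# Demo on a temporary directory with one made-up bin file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "bin_a.fa"), "w") as fh:
        fh.write(">c1\nACGT\n>c2\nAC\n")
    demo_summary = summarize_bins(tmp)

print(demo_summary)  # {'bin_a.fa': (2, 6)}
```

On real output, the function would be pointed at the output_bins (or output_recluster_bins) directory of a sample.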
For more details about the output, read the docs.
Advanced workflows
You can run the individual steps yourself, which makes it possible to use compute clusters to speed up the binning process (especially in multi-sample binning mode). For example, single_easy_bin includes the following steps: generate_cannot_links, generate_data_single and bin; while multi_easy_bin includes the following steps: generate_cannot_links, generate_data_multi and bin.
In advanced mode, you can also use our built-in pre-trained models in single-sample binning mode. We provide pre-trained models for the human gut, dog gut, marine (ocean), soil, cat gut, human oral, mouse gut, pig gut, built environment, wastewater and global (trained on all the environments listed) environments.
A very easy way to run SemiBin with a built-in model (human_gut/dog_gut/ocean/soil/cat_gut/human_oral/mouse_gut/pig_gut/built_environment/wastewater/global) for single-sample binning:
SemiBin single_easy_bin -i contig_S1.fna -b S1.bam -o output --environment human_gut
Another option is to pre-train a model on part of your dataset. This offers a balance: it is faster than training a model for each sample, while achieving better results than a model pre-trained on another dataset.
(1) Generate data.csv/data_split.csv for every sample
Single-sample/co-assembly binning:
SemiBin generate_data_single -i contig_S1.fna -b S1.bam -o output
Multi-sample binning:
SemiBin generate_data_multi -i contig_combined.fna -b S1.bam S2.bam S3.bam S4.bam S5.bam -o output -s :
(2) Generate cannot-link for every sample
SemiBin generate_cannot_links -i contig_S1.fna -o output
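Because steps (1) and (2) are run once per sample, their command lines can be generated in a loop and handed to whatever job scheduler is available. The sketch below only builds the commands in the single-sample form shown above. The helper name per_sample_commands and the per-sample output directory naming are assumptions; the contig_<sample>.fna and <sample>.bam names follow the examples in this section.

```python
import shlex


def per_sample_commands(samples):
    """Return {sample: [command, ...]} for steps (1) and (2) in single-sample mode."""
    jobs = {}
    for sample in samples:
        contigs = f"contig_{sample}.fna"
        out_dir = f"{sample}_output"  # hypothetical per-sample output directory
        jobs[sample] = [
            ["SemiBin", "generate_data_single", "-i", contigs,
             "-b", f"{sample}.bam", "-o", out_dir],
            ["SemiBin", "generate_cannot_links", "-i", contigs, "-o", out_dir],
        ]
    return jobs


for sample, commands in per_sample_commands(["S1", "S2"]).items():
    for cmd in commands:
        # Print a shell-safe line, e.g. to write into a job script.
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each printed line is an independent job, so the samples can be processed in parallel.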
(3) Pre-train a model across several samples (for single-sample binning, make sure the input files for each option are listed in the same sample order).
SemiBin train -i S1.fna S2.fna S3.fna --data S1/train.csv S2/train.csv S3/train.csv --data-split S1/train_split.csv S2/train_split.csv S3/train_split.csv -c S1/cannot.txt S2/cannot.txt S3/cannot.txt -o output --mode several
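With many samples, a long command of this shape is easy to get wrong by hand, since every option must list its files in the same sample order. One way to reduce that risk is to derive all four file lists from a single list of sample names, as in the sketch below. The helper name train_several_command is hypothetical, and the file layout (<sample>.fna, <sample>/train.csv and so on) simply mirrors the command above.

```python
import shlex


def train_several_command(samples, output):
    """Build a train command in which every option lists samples in the same order."""
    cmd = ["SemiBin", "train", "-i"]
    cmd += [f"{s}.fna" for s in samples]
    cmd += ["--data"] + [f"{s}/train.csv" for s in samples]
    cmd += ["--data-split"] + [f"{s}/train_split.csv" for s in samples]
    cmd += ["-c"] + [f"{s}/cannot.txt" for s in samples]
    cmd += ["-o", output, "--mode", "several"]
    return cmd


cmd = train_several_command(["S1", "S2", "S3"], "output")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Adding or removing a sample then changes all four lists together, so they cannot drift out of step.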
Or just train a model from one sample. If you are using multi-sample binning, the contig file here is the per-sample contig file.
Single-sample/co-assembly binning:
SemiBin train -i contig.fna --data train.csv --data-split train_split.csv -c cannot.txt -o output --mode single
Multi-sample binning (similar for other samples):
SemiBin train -i contig_S1.fna --data S1/train.csv --data-split S1/train_split.csv -c S1/cannot.txt -o output --mode single
(4) Bin with the trained model. If you are using multi-sample binning, the contig file here is again the per-sample contig file, and the BAM files are still from several samples.
Single-sample/co-assembly binning:
SemiBin bin -i contig.fna --model model.h5 --data data.csv -o output
Multi-sample binning (similar for other samples):
SemiBin bin -i contig_S1.fna --model model.h5 --data S1/data.csv -o output
Or use one of our built-in models (human_gut, dog_gut or ocean; single-sample binning only):
SemiBin bin -i contig.fna --data data.csv -o output --environment human_gut
For more details on usage, including every command on how to run SemiBin with different binning modes, please read the docs.