Tool for ORF-calling and ORF-classification using ML approaches

Project description

TIdeS

Transcript Identification and Selection (TIdeS) is a method for identifying putative open reading frames (pORFs) in a given transcriptome. It can also aid in the bulk decontamination of sequences from "messy" transcriptomic data.

Overall, TIdeS couples sequence composition with ML approaches to discern pORFs in the correct reading frame with substantial improvement over other popular tools, while providing support for additional non-standard genetic codes. Additionally, TIdeS can be used to classify ORFs into several user-defined categories from highly contaminated datasets (e.g., parasite + host, kleptoplasts, big "dirty" protists).

Dependencies

Installation

Note that TIdeS is only supported on UNIX systems (Linux and macOS).

Python's pip can be used to install TIdeS together with the Python packages it requires.

pip install tides-ml

Then download the precompiled executables for the remaining (non-Python) dependencies.

Alternatively, you can do this through conda (note this will be updated):

# Create a new environment for TIdeS
conda create -n tides-ml
conda activate tides-ml

# Install the necessary packages (with minimum supported versions)
conda install -c bioconda -c conda-forge "diamond>=2.0.13" "cd-hit>=4.8.1" "barrnap>=0.9" "kraken2>=2.0"

Clone the repository.

git clone https://github.com/xxmalcala/TIdeS.git
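
Before running TIdeS, it can help to confirm that the external tools are visible on your PATH. The following is a minimal sketch, not part of TIdeS. The executable names are assumptions taken from the conda package names above; TIdeS may call differently named binaries (for example, cd-hit-est).

```python
import shutil

# Executable names are assumptions based on the conda packages listed above;
# TIdeS itself may call differently named binaries (e.g., cd-hit-est).
REQUIRED_TOOLS = ("diamond", "cd-hit", "barrnap", "kraken2")


def find_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its location on PATH, or None if it is missing."""
    return {tool: shutil.which(tool) for tool in tools}


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that could not be found on PATH."""
    return [tool for tool, path in find_tools(tools).items() if path is None]


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```

Run it inside the activated environment so that the same PATH is checked as the one TIdeS will see.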

Running TIdeS

The general syntax to run TIdeS is:

tides --fin <transcriptome-assembly> --taxon <taxon-name> --db <protein-database>
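
If you drive TIdeS from a Python script (e.g., to process many assemblies), the same command can be assembled as an argument list. This is only a sketch: `build_tides_command` is a hypothetical helper, the input file name is a placeholder, and only flags listed in the options table are used.

```python
import shlex


def build_tides_command(fin, taxon, db, threads=None):
    """Build the argument list for a basic TIdeS ORF-calling run."""
    cmd = ["tides", "--fin", str(fin), "--taxon", str(taxon), "--db", str(db)]
    if threads is not None:
        cmd += ["--threads", str(int(threads))]
    return cmd


# Placeholder inputs, for illustration only.
cmd = build_tides_command("assembly.fasta", "Op_me_Hsap", "tides_aa_db.dmnd", threads=8)
print(shlex.join(cmd))
# To execute the run (requires TIdeS to be installed):
#     import subprocess; subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell-quoting problems when paths contain spaces.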

Several example command lines and uses for TIdeS (i.e., ORF-calling and ORF classifying) are included in the examples folder. To run the examples, first change into the examples folder, then run a script from there (e.g., ./orf_call_and_decontam.sh).

List of all options

Command	Comment
`-h`, `--help`	Print the help message
`-i`, `--fin <STRING>`	FASTA formatted file.
`-o`, `--taxon <STRING>`	Name for your taxon, project, outputs.
`-t`, `--threads <INTEGER>`	Number of available threads to use. Default value is `4`.
`-d`, `--db <STRING>`	Path to FASTA or DIAMOND formatted proteome database.
`-k`, `--kraken <STRING>`	Kraken2 database to identify and filter non-eukaryotic sequences.
`--no-filter`	Skip all transcript pre-processing.
`-p`, `--partials`	Include partial ORFs for ORF calling.
`-id`, `--id <INTEGER>`	Minimum % identity to remove redundant transcripts. Default value is `97`.
`-l`, `--min-orf <INTEGER>`	Minimum transcript length (bp) for ORF calling. Default value is `300`.
`-ml`, `--max-orf <INTEGER>`	Maximum transcript length (bp) for ORF calling. Default value is `10000`.
`-e`, `--evalue <REAL>`	Maximum e-value to infer reference ORFs. Default value is `1e-30`.
`-gc`, `--gencode <STRING/INTEGER>`	Genetic code to use for ORF calling and translation. Default is `1`.
`-s`, `--strand <STRING>`	Strand(s) on which to call ORFs (both/minus/plus). Default value is `both`.
`-c`, `--contam <STRING>`	Path to annotated sequence table. If unset, TIdeS assumes a prior model is provided instead (see `--model`).
`-m`, `--model <STRING>`	Path to a prior TIdeS run's model (a ".pkl" file).
`-k`, `--kmer <INTEGER>`	kmer length to use. Default value is `3`.
`-ov`, `--overlap`	Permit overlapping kmers.
`--step <INTEGER>`	Step-size for overlapping kmers. Default value is `kmer-length/2`.
`--clean`	Remove intermediate filter-step files.
`-gz`, `--gzip`	Compress TIdeS outputs when finished.
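
To illustrate how the `--kmer`, `--overlap`, and `--step` options might interact, here is a rough sketch of k-mer extraction. It is an assumption about the behaviour, not TIdeS's actual feature code: without overlap, windows are taken to advance by the k-mer length; with overlap, they advance by the step size, defaulting to half the k-mer length (rounded down, minimum 1).

```python
from collections import Counter


def kmers(seq, k=3, overlap=False, step=None):
    """Yield k-mers from seq.

    Assumed behaviour: without overlap, windows advance by k; with overlap,
    they advance by `step`, which defaults to k // 2 (at least 1).
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    if overlap:
        stride = step if step is not None else max(1, k // 2)
    else:
        stride = k
    if stride < 1:
        raise ValueError("step must be >= 1")
    seq = seq.upper()
    for start in range(0, len(seq) - k + 1, stride):
        yield seq[start:start + k]


def kmer_frequencies(seq, **kwargs):
    """Return the relative frequency of each k-mer observed in seq."""
    counts = Counter(kmers(seq, **kwargs))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


print(list(kmers("ATGGCCTAA")))                   # non-overlapping 3-mers
print(list(kmers("ATGGC", k=3, overlap=True)))    # overlapping, step 1
```

With the default k-mer length of 3 and no overlap, this amounts to reading the sequence codon by codon in a single frame.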

ORF-Calling and Assessment

Reference protein database

Create a reference protein database for TIdeS (note you can use your own if you choose!). This will generate a database from six diverse eukaryotes, representing a broad yet compact database for subsequent ORF-calling.

Note that this database (tides_aa_db.dmnd) will be created in whichever directory you run this script from.

./TIdeS/util/prep_tides_db.sh

Inputs

FASTA formatted transcriptome assembly
Taxon name (e.g., Homo sapiens, Op_me_Hsap)
Protein database (can be prepared by "prep_tides_db.sh" in the util folder)

tides -i <transcriptome-assembly> -o <taxon-name> -d <protein-database>

Decontamination

Inputs

FASTA formatted transcriptome assembly
Taxon/project name (e.g., Durinskia baltica, Dinotoms)
Optional (need one of the following):
Table of annotated sequence names (see examples folder)
Path to Kraken2 database to identify putative non-eukaryotic sequences to remove

With a table of annotated sequences:

tides -i <transcriptome-assembly> -o <taxon-name> -c <annotated-seqs-table>

Using a Kraken2 database:

tides -i <transcriptome-assembly> -o <taxon-name> -k <kraken2-database>

Table of annotated sequences

The <annotated-seqs-table> should include sequence names and their category separated by tabs. Note that these sequences should be present within the input FASTA file as well. Please aim to include at least 25 sequences for each category, although more (up to ~200) is great!

seq1	human
seq2	lunch
seq3	lunch
seq4	human
seq5	lunch
...
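
A quick sanity check of the table before a run can catch the two requirements described above: every listed sequence should be present in the FASTA file, and each category should have enough members. The sketch below is illustrative only. The helper names are hypothetical, and it assumes sequence names correspond to the first word of each FASTA header, which may differ from how TIdeS matches names.

```python
from collections import Counter


def read_fasta_names(lines):
    """Collect sequence names (first word after '>') from FASTA lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].split()
    }


def read_annotation_table(lines):
    """Parse 'name<TAB>category' rows into a dict, skipping blank lines."""
    table = {}
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 2 or not fields[0].strip() or not fields[1].strip():
            raise ValueError(f"line {number}: expected 'name<TAB>category'")
        table[fields[0].strip()] = fields[1].strip()
    return table


def check_table(table, fasta_names, min_per_category=25):
    """Return (names missing from the FASTA, categories below the minimum)."""
    missing = sorted(name for name in table if name not in fasta_names)
    counts = Counter(table.values())
    small = {cat: n for cat, n in counts.items() if n < min_per_category}
    return missing, small


# Usage with real files (paths are placeholders):
#     with open("assembly.fasta") as fa, open("annotated_seqs.tsv") as tsv:
#         missing, small = check_table(read_annotation_table(tsv), read_fasta_names(fa))
```

An empty `missing` list and an empty `small` dict indicate that the table meets both recommendations.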

Additional uses/approaches

More on how to run TIdeS and its uses can be found in the examples folder, including:

ORF-Calling
Classification of ORFs
ORF-calling and sequence classifying with a previously trained model
Preparing a simple proteome database and ORF-calling
Naive approaches to inferring contamination
Example FASTA and <annotated-seqs-table> files

Release history

Version	Release date
1.2.0	Feb 1, 2024 (this version)
1.1.4	Jan 25, 2024
1.1.3	Jan 22, 2024
1.1.2	Nov 3, 2023
1.1.1	Oct 25, 2023
1.1.0	Oct 25, 2023
1.0.4	Oct 25, 2023
1.0.1	Sep 27, 2023
1.0.0	Sep 27, 2023

Download files

Source Distribution

TIdeS-ML-1.2.0.tar.gz (22.5 kB), uploaded Feb 1, 2024

Built Distribution

TIdeS_ML-1.2.0-py3-none-any.whl (27.3 kB, Python 3), uploaded Feb 1, 2024

Hashes for TIdeS-ML-1.2.0.tar.gz
Algorithm	Hash digest
SHA256	`8c31f4429303e5d34c031b243963fae3eec1f0d9c97e856949a7eac17f82452c`
MD5	`cc0def5c37d2810d4a2377eef33e7863`
BLAKE2b-256	`9f7a24516ab0e0815f11b8c0b6bb8587e2448892201fa9e031055bbf903b08c1`

Hashes for TIdeS_ML-1.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7cb04887f2147c9ebee064d5b2bb2b6fd8e9220dfc9c5b954d9d1bc002b90f43`
MD5	`830962d19b9b1eb510927d5bfa4672f1`
BLAKE2b-256	`195664f911a9ded513eb27d05cb9390dcedeaf2a6b7cf74c75ed529dfa61df72`
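
If you download a release file manually, its SHA256 digest can be compared against the values listed above. This is a minimal sketch using only Python's standard library; the helper names are hypothetical and the file path in the usage comment assumes the file sits in the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected value."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage (assumes the source distribution was downloaded to the current directory):
#     matches("TIdeS-ML-1.2.0.tar.gz",
#             "8c31f4429303e5d34c031b243963fae3eec1f0d9c97e856949a7eac17f82452c")
```

A result of False means the file differs from the published one and should not be installed.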