PROFILE methodology for the binarisation and normalisation of RNA-seq data

profile_binr

The PROFILE methodology for the binarisation and normalisation of RNA-seq data.

This is a Python interface to a set of normalisation and binarisation functions for RNA-seq data originally written in R.

This software package is based on the methodology developed by Jonas Beal, Arnau Montagud, Pauline Traynard, Emmanuel Barillot, and Laurence Calzone of the Computational Systems Biology of Cancer team at Institut Curie (contact-sysbio@curie.fr).

The original implementation, written as R Markdown notebooks, is available in the authors' repository.

Installation

This software has only been tested on Debian-based GNU/Linux distributions, but it should in principle work on any *nix system.

Prerequisites

System dependencies

- R, version 4.0.2 (2020-06-22) -- "Taking Off Again". A newer R version may work, but this has not been tested.
- Needed to install the R dependencies:
  - make
  - g++
  - gfortran

R dependencies

mclust
diptest
moments
magrittr
tidyr
dplyr
tibble
bigmemory
doSNOW
foreach
glue
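
Before installing the Python package, it can help to confirm that R is reachable and to prepare a single install command for the R packages listed above. The following is a minimal sketch, not part of profile_binr: the helper name `build_install_command` and the choice of CRAN mirror are assumptions made for illustration, and the script only prints the command rather than running it.

```python
import shlex
import shutil

# R packages listed above as dependencies of profile_binr.
R_PACKAGES = [
    "mclust", "diptest", "moments", "magrittr", "tidyr", "dplyr",
    "tibble", "bigmemory", "doSNOW", "foreach", "glue",
]


def build_install_command(packages, repo="https://cloud.r-project.org"):
    """Return an argv list that would install `packages` through Rscript.

    Hypothetical helper: it only builds the command, it never runs it.
    The default mirror is an assumption; any CRAN mirror should do.
    """
    quoted = ", ".join(f'"{name}"' for name in packages)
    expression = f'install.packages(c({quoted}), repos="{repo}")'
    return ["Rscript", "-e", expression]


command = build_install_command(R_PACKAGES)

# shutil.which returns None when the executable is not on PATH.
rscript_path = shutil.which("Rscript")
if rscript_path is None:
    print("Rscript not found; install R before the R packages.")
else:
    print("Rscript found at", rscript_path)

# shlex.join quotes the R expression so it can be pasted into a shell.
print("Suggested command:", shlex.join(command))
```

Printing the command instead of executing it leaves the decision of where the packages get installed (system library or user library) to you.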

Using pip

This is a barebones functional example. We recommend installing within a Python virtual environment.

pip install git+https://github.com/bnediction/profile_binr

Usage

Once again, this is a minimal example:

from profile_binr import ProfileBin
import pandas as pd

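# your data is assumed to contain observations (cells) as
# rows and genes as columns; index_col=0 assumes the cell
# identifiers are stored in the first column of the CSV
data = pd.read_csv("path/to/your/data.csv", index_col=0)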
data.head()

	Clec1b	Kdm3a	Coro2b	8430408G22Rik	Clec9a	Phf6	Usp14	Tmem167b
cell_id
HSPC_025	0.0	4.891604	1.426148	0.0	0.0	2.599758	2.954035	6.357369
HSPC_031	0.0	6.877725	0.000000	0.0	0.0	2.423483	1.804914	0.000000
HSPC_037	0.0	0.000000	6.913384	0.0	0.0	2.051659	8.265465	0.000000
LT-HSC_001	0.0	0.000000	8.178374	0.0	0.0	6.419817	3.453502	2.579528
HSPC_001	0.0	0.000000	9.475577	0.0	0.0	7.733370	1.478900	0.000000
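
Since the input is expected to have cells as rows and genes as columns, a quick layout check before fitting can catch a transposed or mislabelled table. The sketch below is an illustration only: `check_expression_frame` is a hypothetical helper, not a profile_binr function, and the small frame reuses a few values from the table above.

```python
import pandas as pd


def check_expression_frame(frame):
    """Return a list of layout problems found in an expression table.

    Hypothetical helper: it checks only what the usage notes state,
    namely cells as a unique index and numeric gene columns.
    """
    problems = []
    if not frame.index.is_unique:
        problems.append("cell identifiers are not unique")
    if not frame.columns.is_unique:
        problems.append("gene names are not unique")
    non_numeric = list(frame.select_dtypes(exclude="number").columns)
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    return problems


# Three cells and three genes copied from the table above.
example = pd.DataFrame(
    {
        "Clec1b": [0.0, 0.0, 0.0],
        "Kdm3a": [4.891604, 6.877725, 0.0],
        "Coro2b": [1.426148, 0.0, 6.913384],
    },
    index=pd.Index(["HSPC_025", "HSPC_031", "HSPC_037"], name="cell_id"),
)

print(check_expression_frame(example))

# A stray text column (for instance a cell-type label) is reported.
with_label = example.assign(label=["a", "b", "c"])
print(check_expression_frame(with_label))
```

If genes turn out to be stored as rows, `frame.T` gives the transposed layout.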

# create the binarisation instance using the dataframe
# with the index containing the cell identifier
# and the columns being the gene names
probin = ProfileBin(data)

# compute the criteria used to binarise/normalise the data:
# this method runs in parallel; pass an integer to set the
# number of workers
probin.fit(8) # fit using 8 parallel workers

# Look at the computed criteria
probin.criteria.head(8)

	Dip	BI	Kurtosis	DropOutRate	MeanNZ	DenPeak	Amplitude	Category
Clec1b	0.358107	1.635698	54.017736	0.876208	1.520978	-0.007249	8.852181	ZeroInf
Kdm3a	0.000000	2.407548	-0.784019	0.326087	3.847940	0.209239	10.126676	Bimodal
Coro2b	0.000000	2.320060	7.061604	0.658213	2.383819	0.004597	9.475577	ZeroInf
8430408G22Rik	0.684454	3.121069	21.729044	0.884058	2.983472	0.005663	9.067857	ZeroInf
Clec9a	1.000000	2.081717	140.089285	0.965580	2.280293	-0.009361	9.614233	Discarded
Phf6	0.000000	1.988667	-1.389024	0.035628	5.025501	2.017547	10.135226	Bimodal
Usp14	0.000000	2.208080	-1.224987	0.007850	6.109964	8.245570	11.088750	Bimodal
Tmem167b	0.000000	2.430813	0.093023	0.393720	3.448331	0.072982	9.486826	Bimodal
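
The criteria table is an ordinary pandas DataFrame, so the usual summaries apply. As a minimal sketch, the snippet below rebuilds two columns of the table above by hand (so that it runs without R or profile_binr installed) and counts genes per category; with a fitted instance you would presumably use `probin.criteria` directly.

```python
import pandas as pd

# Two columns copied from the criteria table shown above.
criteria = pd.DataFrame(
    {
        "DropOutRate": [
            0.876208, 0.326087, 0.658213, 0.884058,
            0.965580, 0.035628, 0.007850, 0.393720,
        ],
        "Category": [
            "ZeroInf", "Bimodal", "ZeroInf", "ZeroInf",
            "Discarded", "Bimodal", "Bimodal", "Bimodal",
        ],
    },
    index=[
        "Clec1b", "Kdm3a", "Coro2b", "8430408G22Rik",
        "Clec9a", "Phf6", "Usp14", "Tmem167b",
    ],
)

# Number of genes assigned to each category.
category_counts = criteria["Category"].value_counts()
print(category_counts)

# Mean drop-out rate within each category.
mean_dropout = criteria.groupby("Category")["DropOutRate"].mean()
print(mean_dropout)

# Names of the genes classified as bimodal.
bimodal_genes = criteria.index[criteria["Category"] == "Bimodal"].tolist()
print(bimodal_genes)
```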

# get binarised data (alternatively .binarise()):
my_bin = probin.binarize()
my_bin.head()

	Clec1b	Kdm3a	Coro2b	8430408G22Rik	Clec9a	Phf6	Usp14	Tmem167b
HSPC_025	NaN	1.0	NaN	NaN	NaN	0.0	0.0	1.0
HSPC_031	NaN	1.0	NaN	NaN	NaN	0.0	0.0	0.0
HSPC_037	NaN	0.0	1.0	NaN	NaN	0.0	1.0	0.0
LT-HSC_001	NaN	0.0	1.0	NaN	NaN	1.0	0.0	0.0
HSPC_001	NaN	0.0	1.0	NaN	NaN	1.0	0.0	0.0

# likewise for normalised data (alternatively .normalise()):
my_norm = probin.normalize()
my_norm.head()

	Kdm3a	Coro2b	Clec9a	Phf6	Usp14	Tmem167b
HSPC_025	9.786196e-01	0.184102	NaN	0.000801	8.318176e-05	9.999970e-01
HSPC_031	9.999981e-01	0.000000	NaN	0.000462	8.084114e-07	6.874397e-11
HSPC_037	4.408417e-09	0.892449	NaN	0.000145	9.999940e-01	6.874397e-11
LT-HSC_001	4.408417e-09	1.000000	NaN	0.991865	6.230178e-04	1.599753e-04
HSPC_001	4.408417e-09	1.000000	NaN	0.999865	2.171153e-07	6.874397e-11
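
The binarised frame shown above contains NaN wherever no 0/1 value was assigned, and some genes are NaN for every cell. A minimal sketch of how such a frame could be tidied with plain pandas follows; the frame is rebuilt by hand from the binarised output above so that the snippet runs on its own, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Values copied from the binarised output shown above.
my_bin = pd.DataFrame(
    {
        "Clec1b": [nan, nan, nan, nan, nan],
        "Kdm3a": [1.0, 1.0, 0.0, 0.0, 0.0],
        "Coro2b": [nan, nan, 1.0, 1.0, 1.0],
        "8430408G22Rik": [nan, nan, nan, nan, nan],
        "Clec9a": [nan, nan, nan, nan, nan],
        "Phf6": [0.0, 0.0, 0.0, 1.0, 1.0],
        "Usp14": [0.0, 0.0, 1.0, 0.0, 0.0],
        "Tmem167b": [1.0, 0.0, 0.0, 0.0, 0.0],
    },
    index=["HSPC_025", "HSPC_031", "HSPC_037", "LT-HSC_001", "HSPC_001"],
)

# Fraction of cells with an assigned value, per gene.
assigned_fraction = my_bin.notna().mean()
print(assigned_fraction)

# Drop genes that have no assigned value in any cell.
informative = my_bin.dropna(axis=1, how="all")

# Nullable integers keep the missing entries as <NA> instead of NaN.
as_integers = informative.astype("Int64")
print(as_integers)
```

Keeping the missing entries explicit, rather than filling them with 0, avoids treating an unassigned gene as inactive.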

References

Please cite the original authors' work using the following BibTeX:

@article{Beal2019a,
abstract = {Logical models of cancer pathways are typically built by mining the literature for relevant experimental observations. They are usually generic as they apply for large cohorts of individuals. As a consequence, they generally do not capture the heterogeneity of patient tumors and their therapeutic responses. We present here a novel framework, referred to as PROFILE, to tailor logical models to a particular biological sample such as a patient tumor. This methodology permits to compare the model simulations to individual clinical data, i.e., survival time. Our approach focuses on integrating mutation data, copy number alterations (CNA), and expression data (transcriptomics or proteomics) to logical models. These data need first to be either binarized or set between 0 and 1, and can then be incorporated in the logical model by modifying the activity of the node, the initial conditions or the state transition rates. The use of MaBoSS, a tool based on Monte-Carlo kinetic algorithm to perform stochastic simulations on logical models results in model state probabilities, and allows for a semi-quantitative study of the model phenotypes and perturbations. As a proof of concept, we use a published generic model of cancer signaling pathways and molecular data from METABRIC breast cancer patients. For this example, we test several combinations of data incorporation and discuss that, with these data, the most comprehensive patient-specific cancer models are obtained by modifying the nodes' activity of the model with mutations, in combination or not with CNA data, and altering the transition rates with RNA expression. We conclude that these model simulations show good correlation with clinical data such as patients' Nottingham prognostic index (NPI) subgrouping and survival time. 
We observe that two highly relevant cancer phenotypes derived from personalized models, Proliferation and Apoptosis, are biologically consistent prognostic factors: patients with both high proliferation and low apoptosis have the worst survival rate, and conversely. Our approach aims to combine the mechanistic insights of logical modeling with multi-omics data integration to provide patient-relevant models. This work leads to the use of logical modeling for precision medicine and will eventually facilitate the choice of patient-specific drug treatments by physicians.},
author = {Beal, Jonas and Montagud, Arnau and Traynard, Pauline and Barillot, Emmanuel and Calzone, Laurence},
doi = {10.3389/fphys.2018.01965},
issn = {1664042X},
journal = {Frontiers in Physiology},
keywords = {Breast cancer,Data discretization,Logical models,Personalized mechanistic models,Personalized medicine,Stochastic simulations},
number = {JAN},
pages = {1--23},
title = {{Personalization of logical models with multi-omics data allows clinical stratification of patients}},
volume = {10},
year = {2019}
}

Release history

- 0.1.2 (May 27, 2021)
- 0.1.1 (May 27, 2021)
- 0.1.0 (May 27, 2021), this version
