Skip to main content

PROFILE methodology for the binarisation and normalisation of RNA-seq data

Project description

profile_binr

The PROFILE methodology for the binarisation and normalisation of RNA-seq data.

This is a Python interface to a set of normalisation and binarisation functions for RNA-seq data originally written in R.

This software package is based on the methodology developed by Beal, Jonas; Montagud, Arnau; Traynard, Pauline; Barillot, Emmanuel; and Calzone, Laurence at Computational Systems Biology of Cancer team at Institut Curie (contact-sysbio@curie.fr).

This is the repository containing the original implementation in Rmarkdown notebooks.

Installation

This software has only been tested in Debian-based GNU/Linux distributions, it should in principle work on any *nix system.

Prerequisites

system dependencies

  • R, version 4.0.2 (2020-06-22) -- "Taking Off Again"
    • It could be a newer R version, but this has not been tested.
  • To install R dependencies :
    • make
    • g++
    • gfortran

R dependencies

  • mclust
  • diptest
  • moments
  • magrittr
  • tidyr
  • dplyr
  • tibble
  • bigmemory
  • doSNOW
  • foreach
  • glue

Using pip

This is a barebones functional example. We recommend installing within a Python virtual environment.

pip install git+https://github.com/bnediction/profile_binr

Usage

Once again this is a minimal example :

from profile_binr import ProfileBin
import pandas as pd

# your data is assumed to contain observations as
# rows and genes as columns
data = pd.read_csv("path/to/your/data.csv")
data.head()
Clec1b Kdm3a Coro2b 8430408G22Rik Clec9a Phf6 Usp14 Tmem167b
cell_id
HSPC_025 0.0 4.891604 1.426148 0.0 0.0 2.599758 2.954035 6.357369
HSPC_031 0.0 6.877725 0.000000 0.0 0.0 2.423483 1.804914 0.000000
HSPC_037 0.0 0.000000 6.913384 0.0 0.0 2.051659 8.265465 0.000000
LT-HSC_001 0.0 0.000000 8.178374 0.0 0.0 6.419817 3.453502 2.579528
HSPC_001 0.0 0.000000 9.475577 0.0 0.0 7.733370 1.478900 0.000000
# create the binarisation instance using the dataframe
# with the index containing the cell identifier
# and the columns being the gene names
probin = ProfileBin(data)

# compute the criteria used to binarise/normalise the data :
# This method uses a parallel implementation, you can specify the 
# number of workers with an integer
probin.fit(8) # train using 8 threads

# Look at the computed criteria
probin.criteria.head(8)
Dip BI Kurtosis DropOutRate MeanNZ DenPeak Amplitude Category
Clec1b 0.358107 1.635698 54.017736 0.876208 1.520978 -0.007249 8.852181 ZeroInf
Kdm3a 0.000000 2.407548 -0.784019 0.326087 3.847940 0.209239 10.126676 Bimodal
Coro2b 0.000000 2.320060 7.061604 0.658213 2.383819 0.004597 9.475577 ZeroInf
8430408G22Rik 0.684454 3.121069 21.729044 0.884058 2.983472 0.005663 9.067857 ZeroInf
Clec9a 1.000000 2.081717 140.089285 0.965580 2.280293 -0.009361 9.614233 Discarded
Phf6 0.000000 1.988667 -1.389024 0.035628 5.025501 2.017547 10.135226 Bimodal
Usp14 0.000000 2.208080 -1.224987 0.007850 6.109964 8.245570 11.088750 Bimodal
Tmem167b 0.000000 2.430813 0.093023 0.393720 3.448331 0.072982 9.486826 Bimodal
# get binarised data (alternatively .binarise()):
my_bin = probin.binarize()
my_bin.head()
Clec1b Kdm3a Coro2b 8430408G22Rik Clec9a Phf6 Usp14 Tmem167b
HSPC_025 NaN 1.0 NaN NaN NaN 0.0 0.0 1.0
HSPC_031 NaN 1.0 NaN NaN NaN 0.0 0.0 0.0
HSPC_037 NaN 0.0 1.0 NaN NaN 0.0 1.0 0.0
LT-HSC_001 NaN 0.0 1.0 NaN NaN 1.0 0.0 0.0
HSPC_001 NaN 0.0 1.0 NaN NaN 1.0 0.0 0.0
# idem for normalised data :
my_norm = probin.normalize()
my_norm.head()
Clec1b Kdm3a Coro2b 8430408G22Rik Clec9a Phf6 Usp14 Tmem167b
HSPC_025 0.0 9.786196e-01 0.184102 0.0 NaN 0.000801 8.318176e-05 9.999970e-01
HSPC_031 0.0 9.999981e-01 0.000000 0.0 NaN 0.000462 8.084114e-07 6.874397e-11
HSPC_037 0.0 4.408417e-09 0.892449 0.0 NaN 0.000145 9.999940e-01 6.874397e-11
LT-HSC_001 0.0 4.408417e-09 1.000000 0.0 NaN 0.991865 6.230178e-04 1.599753e-04
HSPC_001 0.0 4.408417e-09 1.000000 0.0 NaN 0.999865 2.171153e-07 6.874397e-11

References

Please use the following bibtex entries to cite the original author's work :

@article{Beal2019,
abstract = {Logical models of cancer pathways are typically built by mining the literature for relevant experimental observations. They are usually generic as they apply for large cohorts of individuals. As a consequence, they generally do not capture the heterogeneity of patient tumors and their therapeutic responses. We present here a novel framework, referred to as PROFILE, to tailor logical models to a particular biological sample such as a patient tumor. This methodology permits to compare the model simulations to individual clinical data, i.e., survival time. Our approach focuses on integrating mutation data, copy number alterations (CNA), and expression data (transcriptomics or proteomics) to logical models. These data need first to be either binarized or set between 0 and 1, and can then be incorporated in the logical model by modifying the activity of the node, the initial conditions or the state transition rates. The use of MaBoSS, a tool based on Monte-Carlo kinetic algorithm to perform stochastic simulations on logical models results in model state probabilities, and allows for a semi-quantitative study of the model phenotypes and perturbations. As a proof of concept, we use a published generic model of cancer signaling pathways and molecular data from METABRIC breast cancer patients. For this example, we test several combinations of data incorporation and discuss that, with these data, the most comprehensive patient-specific cancer models are obtained by modifying the nodes' activity of the model with mutations, in combination or not with CNA data, and altering the transition rates with RNA expression. We conclude that these model simulations show good correlation with clinical data such as patients' Nottingham prognostic index (NPI) subgrouping and survival time. We observe that two highly relevant cancer phenotypes derived from personalized models, Proliferation and Apoptosis, are biologically consistent prognostic factors: patients with both high proliferation and low apoptosis have the worst survival rate, and conversely. Our approach aims to combine the mechanistic insights of logical modeling with multi-omics data integration to provide patient-relevant models. This work leads to the use of logical modeling for precision medicine and will eventually facilitate the choice of patient-specific drug treatments by physicians.},
author = {Beal, Jonas and Montagud, Arnau and Traynard, Pauline and Barillot, Emmanuel and Calzone, Laurence},
doi = {10.3389/fphys.2018.01965},
issn = {1664042X},
journal = {Frontiers in Physiology},
keywords = {Breast cancer,Data discretization,Logical models,Personalized mechanistic models,Personalized medicine,Stochastic simulations},
number = {JAN},
title = {{Personalization of logical models with multi-omics data allows clinical stratification of patients}},
volume = {10},
year = {2019}
}
@article{Beal2019a,
abstract = {Logical models of cancer pathways are typically built by mining the literature for relevant experimental observations. They are usually generic as they apply for large cohorts of individuals. As a consequence, they generally do not capture the heterogeneity of patient tumors and their therapeutic responses. We present here a novel framework, referred to as PROFILE, to tailor logical models to a particular biological sample such as a patient tumor. This methodology permits to compare the model simulations to individual clinical data, i.e., survival time. Our approach focuses on integrating mutation data, copy number alterations (CNA), and expression data (transcriptomics or proteomics) to logical models. These data need first to be either binarized or set between 0 and 1, and can then be incorporated in the logical model by modifying the activity of the node, the initial conditions or the state transition rates. The use of MaBoSS, a tool based on Monte-Carlo kinetic algorithm to perform stochastic simulations on logical models results in model state probabilities, and allows for a semi-quantitative study of the model phenotypes and perturbations. As a proof of concept, we use a published generic model of cancer signaling pathways and molecular data from METABRIC breast cancer patients. For this example, we test several combinations of data incorporation and discuss that, with these data, the most comprehensive patient-specific cancer models are obtained by modifying the nodes' activity of the model with mutations, in combination or not with CNA data, and altering the transition rates with RNA expression. We conclude that these model simulations show good correlation with clinical data such as patients' Nottingham prognostic index (NPI) subgrouping and survival time. We observe that two highly relevant cancer phenotypes derived from personalized models, Proliferation and Apoptosis, are biologically consistent prognostic factors: patients with both high proliferation and low apoptosis have the worst survival rate, and conversely. Our approach aims to combine the mechanistic insights of logical modeling with multi-omics data integration to provide patient-relevant models. This work leads to the use of logical modeling for precision medicine and will eventually facilitate the choice of patient-specific drug treatments by physicians.},
author = {Beal, Jonas and Montagud, Arnau and Traynard, Pauline and Barillot, Emmanuel and Calzone, Laurence},
doi = {10.3389/fphys.2018.01965},
issn = {1664042X},
journal = {Frontiers in Physiology},
keywords = {Breast cancer,Data discretization,Logical models,Personalized mechanistic models,Personalized medicine,Stochastic simulations},
number = {JAN},
pages = {1--23},
title = {{Personalization of logical models with multi-omics data allows clinical stratification of patients}},
volume = {10},
year = {2019}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

profile_binr-0.1.0.tar.gz (8.6 kB view hashes)

Uploaded Source

Built Distribution

profile_binr-0.1.0-py3-none-any.whl (6.3 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page