Package which contains implementations of published collaborative filtering-based algorithms for drug repurposing.

BENCHmark for drug Screening with COllaborative FIltering (benchscofi) Python Package

This repository is part of the EU-funded RECeSS project (#101102016). It hosts implementations of, or wrappers around published implementations of, collaborative filtering-based algorithms for easy benchmarking.

Statement of need

As of 2022, drug development pipelines last around 10 years and cost $2 billion on average, while drug commercialization failure rates reach up to 90%. These issues can be mitigated by drug repurposing, where chemical compounds are systematically screened for new therapeutic indications. In prior works, this approach has been implemented through collaborative filtering, a semi-supervised learning framework that leverages known drug-disease matchings to recommend new ones.

There is no standard pipeline to train, validate and compare collaborative filtering-based repurposing methods, which considerably limits the impact of this research field. With benchscofi, the improvement of a method over the state of the art (implemented in the package) can be estimated with quantitative metrics tailored to drug repurposing, across a large set of publicly available drug repurposing datasets.

Install the latest release

The fastest way to get access to all functionalities of benchscofi is to run the following command:

## Using the Docker image: will open a container
docker run -it recessproject/benchscofi:1.0.1

Documentation about benchscofi (and a manual installation) can be found at this page. The complete list of dependencies for benchscofi can be found at requirements.txt (pip).

Licence

This repository is under an OSI-approved MIT license.

Citation

If you use benchscofi in academic research, please cite it as follows:

@article{reda2024stanscofi,
  title={stanscofi and benchscofi: a new standard for drug repurposing by collaborative filtering},
  author={R{\'e}da, Cl{\'e}mence and Vie, Jill-J{\^e}nn and Wolkenhauer, Olaf},
  journal={Journal of Open Source Software},
  volume={9},
  number={93},
  pages={5973},
  year={2024}
}

Community guidelines with respect to contributions, issue reporting, and support

You are more than welcome to add your own algorithm to the package!

1. Add a novel implementation / algorithm

Add a new Python file in src/benchscofi/ named <model>.py, where <model> is the name of the algorithm. The file must define a subclass of stanscofi.models.BasicModel with the same name as the file. At a minimum, implement the methods preprocessing, model_fit and model_predict_proba, and provide a default set of parameters (used for testing purposes). The placeholder file Constant.py, which implements a classification algorithm that labels all datapoints as positive, is a good starting point. Documenting the class and its methods is strongly recommended. Any algorithm pushed to benchscofi is tested automatically (tests/test_models.py and TemplateTest.py are run). To run this test locally, run the following command in the tests/ folder:

python3 -m test_models <model> <dataset:default=Synthetic>
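The structure described above can be sketched as follows. This is only an illustration under stated assumptions: the BasicModel class below is a local stand-in so that the example runs on its own, the class name AllPositive and the toy (drug, disease, label) triples are hypothetical, and the method signatures are guesses. The real stanscofi.models.BasicModel interface should be checked in the stanscofi documentation before writing an actual model.

```python
class BasicModel:
    """Local stand-in for stanscofi.models.BasicModel (assumption: the real
    base class has a richer interface and different signatures)."""

    def __init__(self, params):
        self.name = "Model"
        # Store every parameter as an attribute of the model
        for key, value in params.items():
            setattr(self, key, value)

    def fit(self, dataset):
        inputs = self.preprocessing(dataset, is_training=True)
        self.model_fit(*inputs)

    def predict_proba(self, dataset):
        inputs = self.preprocessing(dataset, is_training=False)
        return self.model_predict_proba(*inputs)


class AllPositive(BasicModel):
    """Hypothetical model (would live in AllPositive.py): scores every pair as positive."""

    def __init__(self, params=None):
        super().__init__(params if params is not None else self.default_parameters())
        self.name = "AllPositive"

    def default_parameters(self):
        # Default parameters, used when the model is tested
        return {"positive_score": 1.0}

    def preprocessing(self, dataset, is_training=True):
        # Hypothetical input format: a list of (drug, disease, label) triples
        pairs = [(drug, disease) for drug, disease, _ in dataset]
        if is_training:
            labels = [label for _, _, label in dataset]
            return [pairs, labels]
        return [pairs]

    def model_fit(self, pairs, labels):
        # Nothing to learn for a constant predictor
        pass

    def model_predict_proba(self, pairs):
        # One score per (drug, disease) pair
        return [self.positive_score for _ in pairs]


toy_dataset = [("drug1", "diseaseA", 1), ("drug2", "diseaseA", -1), ("drug3", "diseaseB", 0)]
model = AllPositive()
model.fit(toy_dataset)
scores = model.predict_proba(toy_dataset)
```

The subclass only fills in the three required methods and the default parameters. Everything else, including how fit and predict_proba call them, is left to the base class, which is why the stand-in above should not be mistaken for the actual stanscofi code.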

2. Rules for contributors

Pull requests and issue flagging are welcome, and can be made through the GitHub interface. Support can be provided by reaching out to recess-project[at]proton.me. However, please note that contributors and users must abide by the Code of Conduct.

Benchmark AUC and NDCG@items values (default parameters, single random training/testing set split) [updated 08/11/23]

These values (rounded to three decimal places) can be reproduced by running the following command in the tests/ folder:

python3 -m test_models <algorithm> <dataset:default=Synthetic> <batch_ratio:default=1>

:no_entry: marks a failure to train or to predict: [mem] denotes a memory crash and [err] an error. N/A means that the combination has not been tested yet. When present, the percentage in parentheses is the value of batch_ratio that was used (to avoid a memory crash on some of the datasets).

| Algorithm (global AUC) | Synthetic* | TRANSCRIPT [a] | Gottlieb [b] | Cdataset [c] | PREDICT [d] | LRSSL [e] |
|---|---|---|---|---|---|---|
| PMF | 0.922 | 0.579 | 0.598 | 0.604 | 0.656 | 0.611 |
| PulearnWrapper | 1.000 | :no_entry: | N/A | :no_entry: | :no_entry: | :no_entry: |
| ALSWR | 0.971 | 0.507 | 0.677 | 0.724 | 0.693 | 0.685 |
| FastaiCollabWrapper | 1.000 | 0.876 | 0.856 | 0.837 | 0.835 | 0.851 |
| SimplePULearning | 0.995 | 0.949 (40%) | :no_entry:[err] | :no_entry:[err] | 0.994 (4%) | :no_entry: |
| SimpleBinaryClassifier | 0.876 | :no_entry:[mem] | 0.855 | 0.938 (40%) | 0.998 (1%) | :no_entry: |
| NIMCGCN | 0.907 | 0.854 | 0.843 | 0.841 | 0.914 (60%) | 0.873 |
| FFMWrapper | 0.924 | :no_entry:[mem] | 1.000 (40%) | 1.000 (20%) | :no_entry:[mem] | :no_entry: |
| VariationalWrapper | :no_entry:[err] | :no_entry:[err] | 0.851 | 0.851 | :no_entry:[err] | :no_entry: |
| DRRS | :no_entry:[err] | 0.662 | 0.838 | 0.878 | :no_entry:[err] | 0.892 |
| SCPMF | 0.853 | 0.680 | 0.548 | 0.538 | :no_entry:[err] | 0.708 |
| BNNR | 1.000 | 0.922 | 0.949 | 0.959 | 0.990 (1%) | 0.972 |
| LRSSL | 0.127 | 0.581 (90%) | 0.159 | 0.846 | 0.764 (1%) | 0.665 |
| MBiRW | 1.000 | 0.913 | 0.954 | 0.965 | :no_entry:[err] | 0.975 |
| LibMFWrapper | 1.000 | 0.919 | 0.892 | 0.912 | 0.923 | 0.873 |
| LogisticMF | 1.000 | 0.910 | 0.941 | 0.955 | 0.953 | 0.933 |
| PSGCN | 0.767 | :no_entry:[err] | 0.802 | 0.888 | :no_entry: | 0.887 |
| DDA_SKF | 0.779 | 0.453 | 0.544 | 0.264 (20%) | 0.591 | 0.542 |
| HAN | 1.000 | 0.870 | 0.909 | 0.905 | 0.904 | 0.923 |
| PUextraTrees (n_estimators=10) | 0.045 (50%) | 0.325 (50%) | 0.246 (20%) | :no_entry:[mem] | 0.309 (5%) | |
| XGBoost (n_estimators=100) | 0.500 | 0.500 (20%) | 0.500 | 0.500 | 0.500 (1%) | 0.500 (60%) |

The NDCG score is computed across all diseases (global), at k=#items.

| Algorithm (global NDCG@k) | Synthetic@300* | TRANSCRIPT@613 [a] | Gottlieb@593 [b] | Cdataset@663 [c] | PREDICT@1577 [d] | LRSSL@763 [e] |
|---|---|---|---|---|---|---|
| PMF | 0.070 | 0.019 | 0.015 | 0.011 | 0.005 | 0.007 |
| PulearnWrapper | N/A | :no_entry: | N/A | :no_entry: | :no_entry: | :no_entry: |
| ALSWR | 0.000 | 0.177 | 0.236 | 0.406 | 0.193 | 0.424 |
| FastaiCollabWrapper | 1.000 | 0.035 | 0.012 | 0.003 | 0.001 | 0.000 |
| SimplePULearning | 1.000 | 0.059 (40%) | :no_entry:[err] | :no_entry:[err] | 0.025 (4%) | :no_entry:[err] |
| SimpleBinaryClassifier | 0.000 | :no_entry:[mem] | 0.002 | 0.005 (40%) | 0.070 (1%) | :no_entry:[err] |
| NIMCGCN | 0.568 | 0.022 | 0.006 | 0.005 | 0.007 (60%) | 0.014 |
| FFMWrapper | 1.000 | :no_entry:[mem] | 1.000 (40%) | 1.000 (20%) | :no_entry:[mem] | :no_entry: |
| VariationalWrapper | :no_entry:[err] | :no_entry:[err] | 0.011 | 0.010 | :no_entry:[err] | :no_entry: |
| DRRS | :no_entry:[err] | 0.484 | 0.301 | 0.426 | :no_entry:[err] | 0.182 |
| SCPMF | 0.528 | 0.102 | 0.025 | 0.011 | :no_entry:[err] | 0.008 |
| BNNR | 1.000 | 0.466 | 0.417 | 0.572 | 0.217 (1%) | 0.508 |
| LRSSL | 0.206 | 0.032 (90%) | 0.009 | 0.004 | 0.103 (1%) | 0.012 |
| MBiRW | 1.000 | 0.085 | 0.267 | 0.352 | :no_entry:[err] | 0.457 |
| LibMFWrapper | 1.000 | 0.419 | 0.431 | 0.605 | 0.502 | 0.430 |
| LogisticMF | 1.000 | 0.323 | 0.106 | 0.101 | 0.076 | 0.078 |
| PSGCN | 0.969 | :no_entry:[err] | 0.074 | 0.052 | :no_entry:[err] | 0.110 |
| DDA_SKF | 1.000 | 0.039 | 0.069 | 0.078 (20%) | 0.065 | 0.069 |
| HAN | 1.000 | 0.075 | 0.007 | 0.000 | 0.001 | 0.002 |
| PUextraTrees (n_estimators=10) | 0.000 (50%) | 0.198 (50%) | 0.162 (20%) | :no_entry:[mem] | 0.235 (5%) | |
| XGBoost (n_estimators=100) | 0.061 | 0.000 (20%) | 0.002 | 0.000 | 0.000 (1%) | 0.000 (60%) |

:no_entry: Note that results from LibMFWrapper are not reproducible: the resulting metrics might vary slightly from one run to the next.

:no_entry: XGBoost and SimpleBinaryClassifier do not take unlabeled points into account (they treat them as negative points).
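For readers unfamiliar with the two metrics reported above, here is a minimal, self-contained sketch of a global AUC and a global NDCG@k computed over all scored pairs at once. It relies on textbook definitions (pairwise-comparison AUC, binary-relevance DCG with a log2 discount), which are assumptions and may differ in detail from what the package computes. The function names and toy values are hypothetical, and any pair not labeled 1 is treated as negative for simplicity.

```python
import math


def global_auc(scores, labels):
    """Probability that a random positive pair is scored above a random
    negative pair, counting ties as one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y != 1]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative pair")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def global_ndcg_at_k(scores, labels, k):
    """DCG of the k highest-scored pairs, divided by the DCG of an ideal
    ranking that places all positives first (binary relevance)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)[:k]
    dcg = sum((y == 1) / math.log2(rank + 2) for rank, (_, y) in enumerate(ranked))
    n_ideal = min(k, sum(1 for y in labels if y == 1))
    ideal_dcg = sum(1.0 / math.log2(rank + 2) for rank in range(n_ideal))
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example: four scored pairs, two of them positive
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
toy_auc = global_auc(toy_scores, toy_labels)  # 0.75
toy_ndcg = global_ndcg_at_k(toy_scores, toy_labels, k=len(toy_scores))
```

With these definitions, a model that gives every pair the same score has an AUC of exactly 0.5, since every positive/negative comparison is a tie.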

Datasets

*Synthetic dataset created with function generate_dummy_dataset in stanscofi.datasets and the following arguments:

npositive=200 #number of positive pairs
nnegative=100 #number of negative pairs
nfeatures=50 #number of pair features
mean=0.5 #mean for the distribution of positive pairs, resp. -mean for the negative pairs
std=1 #standard deviation for the distribution of positive and negative pairs
random_seed=124565 #random seed
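To give an intuition of what these arguments control, the sketch below draws pair features in the way the comments above describe: positive pairs around +mean, negative pairs around -mean, both with the same standard deviation. It is not the code of generate_dummy_dataset. The Gaussian distribution, the function name sketch_dummy_pairs, the -1 label for negative pairs and the returned format are all assumptions made for illustration.

```python
import random


def sketch_dummy_pairs(npositive, nnegative, nfeatures, mean, std, random_seed):
    """Hypothetical stand-in for a synthetic pair generator.

    Returns one feature vector per pair and the matching labels
    (1 for positive pairs, -1 for negative pairs)."""
    rng = random.Random(random_seed)  # seeded generator, so results are repeatable
    positives = [[rng.gauss(mean, std) for _ in range(nfeatures)] for _ in range(npositive)]
    negatives = [[rng.gauss(-mean, std) for _ in range(nfeatures)] for _ in range(nnegative)]
    features = positives + negatives
    labels = [1] * npositive + [-1] * nnegative
    return features, labels


# Same argument values as listed above
features, labels = sketch_dummy_pairs(
    npositive=200, nnegative=100, nfeatures=50, mean=0.5, std=1, random_seed=124565
)
```

Fixing random_seed makes repeated calls return identical data, which is what allows benchmark values on a synthetic dataset to be compared across runs.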

[a] Réda, Clémence. (2023). TRANSCRIPT drug repurposing dataset (2.0.0) [Data set]. Zenodo. doi:10.5281/zenodo.7982976

[b] Gottlieb, A., Stein, G. Y., Ruppin, E., & Sharan, R. (2011). PREDICT: a method for inferring novel drug indications with application to personalized medicine. Molecular systems biology, 7(1), 496.

[c] Luo, H., Li, M., Wang, S., Liu, Q., Li, Y., & Wang, J. (2018). Computational drug repositioning using low-rank matrix approximation and randomized algorithms. Bioinformatics, 34(11), 1904-1912.

[d] Réda, Clémence. (2023). PREDICT drug repurposing dataset (2.0.1) [Data set]. Zenodo. doi:10.5281/zenodo.7983090

[e] Liang, X., Zhang, P., Yan, L., Fu, Y., Peng, F., Qu, L., … & Chen, Z. (2017). LRSSL: predict and interpret drug–disease associations based on data integration using sparse subspace learning. Bioinformatics, 33(8), 1187-1196.
