Package which contains implementations of published collaborative filtering-based algorithms for drug repurposing.

BENCHmark for drug Screening with COllaborative FIltering (benchscofi) Python Package

This repository is part of the EU-funded RECeSS project (#101102016). It hosts implementations of, and/or wrappers around published implementations of, collaborative filtering-based algorithms for easy benchmarking.

Statement of need

As of 2022, drug development pipelines last around 10 years and cost $2 billion on average, while drug commercialization failure rates reach up to 90%. These issues can be mitigated by drug repurposing, where chemical compounds are systematically screened for new therapeutic indications. In prior works, this approach has been implemented through collaborative filtering, a semi-supervised learning framework that leverages known drug-disease matchings in order to recommend new ones.

There is no standard pipeline to train, validate and compare collaborative filtering-based repurposing methods, which considerably limits the impact of this research field. With benchscofi, the improvement of a new method over the state of the art (implemented in the package) can be estimated using quantitative metrics tailored to drug repurposing, across a large set of publicly available drug repurposing datasets.

Install the latest release

The fastest way to get access to all functionalities of benchscofi is to run the following command:

## Using the Docker image: download it, then start a container from it (e.g., with docker run)
docker pull recessproject/benchscofi:1.0.1

Documentation about benchscofi (including manual installation instructions) can be found at this page. The complete list of dependencies for benchscofi can be found in requirements.txt (pip).

License

This repository is under an OSI-approved MIT license.

Citation

If you use benchscofi in academic research, please cite it as follows:

Réda, Clémence, Jill-Jênn Vie, and Olaf Wolkenhauer. "A new standard for drug repurposing by collaborative filtering: stanscofi and benchscofi." (2023).

Community guidelines with respect to contributions, issue reporting, and support

You are more than welcome to add your own algorithm to the package!

1. Add a novel implementation / algorithm

Add a new Python file in src/benchscofi/ named <model>.py (where <model> is the name of the algorithm). The file must contain a subclass of stanscofi.models.BasicModel with the same name as the file. Implement at least the methods preprocessing, model_fit and model_predict_proba, and provide a default set of parameters (used for testing purposes). The placeholder file Constant.py, which implements a classification algorithm that labels all datapoints as positive, is a useful starting point. Documenting the class and its methods is highly recommended.

When a new algorithm is pushed to benchscofi, it is tested automatically (see tests/test_models.py and TemplateTest.py). To run this test locally, run the following command in the tests/ folder:

python3 -m test_models <model> <dataset:default=Synthetic>
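
The structure described above can be sketched as follows. This is only an illustration: the BasicModel class below is a simplified stand-in written for this example (not the real stanscofi.models.BasicModel), and the method signatures, the default_parameters name, the MyModel name and the triplet-based dataset are all assumptions. Constant.py remains the reference for the actual interface.

```python
class BasicModel:
    """Simplified stand-in for stanscofi.models.BasicModel (hypothetical)."""

    def __init__(self, params):
        self.name = type(self).__name__
        # Expose every parameter as an attribute of the model.
        for key, value in params.items():
            setattr(self, key, value)


class MyModel(BasicModel):
    """Toy model that gives the same score to every drug-disease pair."""

    def __init__(self, params=None):
        super().__init__(params if params is not None else self.default_parameters())

    def default_parameters(self):
        # Default set of parameters, used when none are provided (e.g., in tests).
        return {"constant_score": 1.0}

    def preprocessing(self, dataset):
        # Assumption: the dataset is a list of (drug, disease, label) triplets.
        pairs = [(drug, disease) for drug, disease, _ in dataset]
        labels = [label for _, _, label in dataset]
        return pairs, labels

    def model_fit(self, pairs, labels):
        # Nothing to learn for a constant model; only record the training size.
        self.n_seen = len(pairs)

    def model_predict_proba(self, pairs):
        # One score per input pair.
        return [self.constant_score for _ in pairs]


model = MyModel()
train = [("drugA", "diseaseX", 1), ("drugB", "diseaseX", -1)]
pairs, labels = model.preprocessing(train)
model.model_fit(pairs, labels)
scores = model.model_predict_proba(pairs)
```

In a real contribution, the class would be named after its file and would inherit from stanscofi.models.BasicModel directly.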

2. Rules for contributors

Pull requests and issue flagging are welcome, and can be made through the GitHub interface. Support can be provided by reaching out to recess-project[at]proton.me. However, please note that contributors and users must abide by the Code of Conduct.

Benchmark AUC and NDCG@items values (default parameters, single random training/testing set split) [updated 08/11/23]

These values (rounded to three decimal places) can be reproduced by running the following command in the tests/ folder:

python3 -m test_models <algorithm> <dataset:default=Synthetic> <batch_ratio:default=1>

:no_entry: marks a failure to train or to predict; N/A marks a combination that has not been tested yet. A percentage in parentheses is the value of batch_ratio used for that run (lowered to avoid memory crashes on some of the datasets). [mem] denotes a memory crash and [err] an error.

Algorithm (global AUC)	Synthetic*	TRANSCRIPT [a]	Gottlieb [b]	Cdataset [c]	PREDICT [d]	LRSSL [e]
PMF	0.922	0.579	0.598	0.604	0.656	0.611
PulearnWrapper	1.000	:no_entry:	N/A	:no_entry:	:no_entry:	:no_entry:
ALSWR	0.971	0.507	0.677	0.724	0.693	0.685
FastaiCollabWrapper	1.000	0.876	0.856	0.837	0.835	0.851
SimplePULearning	0.995	0.949 (40%)	:no_entry:[err]	:no_entry:[err]	0.994 (4%)	:no_entry:
SimpleBinaryClassifier	0.876	:no_entry:[mem]	0.855	0.938 (40%)	0.998 (1%)	:no_entry:
NIMCGCN	0.907	0.854	0.843	0.841	0.914 (60%)	0.873
FFMWrapper	0.924	:no_entry:[mem]	1.000 (40%)	1.000 (20%)	:no_entry:[mem]	:no_entry:
VariationalWrapper	:no_entry:[err]	:no_entry:[err]	0.851	0.851	:no_entry:[err]	:no_entry:
DRRS	:no_entry:[err]	0.662	0.838	0.878	:no_entry:[err]	0.892
SCPMF	0.853	0.680	0.548	0.538	:no_entry:[err]	0.708
BNNR	1.000	0.922	0.949	0.959	0.990 (1%)	0.972
LRSSL	0.127	0.581 (90%)	0.159	0.846	0.764 (1%)	0.665
MBiRW	1.000	0.913	0.954	0.965	:no_entry:[err]	0.975
LibMFWrapper	1.000	0.919	0.892	0.912	0.923	0.873
LogisticMF	1.000	0.910	0.941	0.955	0.953	0.933
PSGCN	0.767	:no_entry:[err]	0.802	0.888	:no_entry:	0.887
DDA_SKF	0.779	0.453	0.544	0.264 (20%)	0.591	0.542
HAN	1.000	0.870	0.909	0.905	0.904	0.923
PUextraTrees (`n_estimators=10`)	0.045 (50%)	0.325 (50%)	0.246 (20%)	:no_entry:[mem]	0.309 (5%)
XGBoost (`n_estimators=100`)	0.500	0.500 (20%)	0.500	0.500	0.500 (1%)	0.500 (60%)
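
For readers unfamiliar with the metric, the following is a minimal sketch of an AUC computed over all drug-disease pairs pooled together. It assumes that "global" means pooling every pair rather than averaging per disease, and that positives are labeled 1 and everything else counts as negative. The function name global_auc is made up for this example; it is not the code used by benchscofi.

```python
def global_auc(labels, scores):
    """Probability that a random positive pair outscores a random negative pair.

    Ties between a positive and a negative score count for one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y != 1]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative pair")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Three of the four (positive, negative) comparisons are won by the positive.
example_auc = global_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(example_auc)  # 0.75
```

Under this definition, a model that assigns the same score to every pair obtains exactly 0.5, and a value below 0.5 means that negatives tend to be ranked above positives.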

The NDCG score is computed across all diseases (global), at k=#items.

Algorithm (global NDCG@k)	Synthetic@300*	TRANSCRIPT@613[a]	Gottlieb@593[b]	Cdataset@663[c]	PREDICT@1577[d]	LRSSL@763[e]
PMF	0.070	0.019	0.015	0.011	0.005	0.007
PulearnWrapper	N/A	:no_entry:	N/A	:no_entry:	:no_entry:	:no_entry:
ALSWR	0.000	0.177	0.236	0.406	0.193	0.424
FastaiCollabWrapper	1.000	0.035	0.012	0.003	0.001	0.000
SimplePULearning	1.000	0.059 (40%)	:no_entry:[err]	:no_entry:[err]	0.025 (4%)	:no_entry:[err]
SimpleBinaryClassifier	0.000	:no_entry:[mem]	0.002	0.005 (40%)	0.070 (1%)	:no_entry:[err]
NIMCGCN	0.568	0.022	0.006	0.005	0.007 (60%)	0.014
FFMWrapper	1.000	:no_entry:[mem]	1.000 (40%)	1.000 (20%)	:no_entry:[mem]	:no_entry:
VariationalWrapper	:no_entry:[err]	:no_entry:[err]	0.011	0.010	:no_entry:[err]	:no_entry:
DRRS	:no_entry:[err]	0.484	0.301	0.426	:no_entry:[err]	0.182
SCPMF	0.528	0.102	0.025	0.011	:no_entry:[err]	0.008
BNNR	1.000	0.466	0.417	0.572	0.217 (1%)	0.508
LRSSL	0.206	0.032 (90%)	0.009	0.004	0.103 (1%)	0.012
MBiRW	1.000	0.085	0.267	0.352	:no_entry:[err]	0.457
LibMFWrapper	1.000	0.419	0.431	0.605	0.502	0.430
LogisticMF	1.000	0.323	0.106	0.101	0.076	0.078
PSGCN	0.969	:no_entry:[err]	0.074	0.052	:no_entry:[err]	0.110
DDA_SKF	1.000	0.039	0.069	0.078 (20%)	0.065	0.069
HAN	1.000	0.075	0.007	0.000	0.001	0.002
PUextraTrees (`n_estimators=10`)	0.000 (50%)	0.198 (50%)	0.162 (20%)	:no_entry:[mem]	0.235 (5%)
XGBoost (`n_estimators=100`)	0.061	0.000 (20%)	0.002	0.000	0.000 (1%)	0.000 (60%)
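
As a reminder of how such a score is built, here is a minimal sketch of NDCG@k with binary relevance (1 for a positive pair, 0 otherwise) and the usual logarithmic discount. The function names and the binary relevance encoding are assumptions made for this example; the values in the table come from the package's own evaluation code, not from this sketch.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items of a ranked list."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(labels, scores, k):
    """DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    # Order the true labels by decreasing predicted score.
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)]
    # The ideal ranking puts all relevant items first.
    ideal_dcg = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# One positive is ranked first, the other only third.
example_ndcg = ndcg_at_k([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1], k=4)
print(round(example_ndcg, 3))  # 0.92
```

Because the discount decays with the rank, NDCG rewards placing positives at the very top of the list, which is why an algorithm can reach a high AUC and still obtain a low NDCG.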

:no_entry: Note that results from LibMFWrapper are not reproducible: the resulting metrics might vary slightly across runs.

:no_entry: XGBoost and SimpleBinaryClassifier do not take unlabeled points into account: they treat them as negative points.
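
To make the last remark concrete, the sketch below shows what treating unlabeled points as negatives does to a label vector. The encoding (1 for positive, -1 for negative, 0 for unlabeled) and the function name are assumptions made for this illustration only.

```python
def to_binary_labels(ratings):
    """Collapse {1, -1, 0} ratings into {1, 0}: unlabeled pairs join the negatives."""
    return [1 if rating == 1 else 0 for rating in ratings]


# Hypothetical ratings: 2 positives, 1 known negative, 3 unlabeled pairs.
ratings = [1, 0, -1, 0, 1, 0]
binary = to_binary_labels(ratings)

n_assumed_negative = binary.count(0)  # pairs the classifier sees as negative
n_truly_unlabeled = ratings.count(0)  # pairs whose status is actually unknown
print(binary, n_assumed_negative, n_truly_unlabeled)  # [1, 0, 0, 0, 1, 0] 4 3
```

In this toy example, three of the four "negative" training examples are in fact unlabeled, which is the situation that positive-unlabeled learning methods are designed to handle.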

Datasets

*Synthetic dataset created with function generate_dummy_dataset in stanscofi.datasets and the following arguments:

npositive=200 #number of positive pairs
nnegative=100 #number of negative pairs
nfeatures=50 #number of pair features
mean=0.5 #mean for the distribution of positive pairs, resp. -mean for the negative pairs
std=1 #standard deviation for the distribution of positive and negative pairs
random_seed=124565 #random seed
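
The comments on these arguments suggest the following generation scheme, sketched here with the standard library only. This is not the stanscofi implementation of generate_dummy_dataset: the function name, the independent Gaussian draw for each feature and the 1 / -1 label encoding are assumptions made for this example.

```python
import random


def sketch_dummy_pairs(npositive, nnegative, nfeatures, mean, std, random_seed):
    """Draw pair features from N(mean, std) for positives and N(-mean, std) for negatives."""
    rng = random.Random(random_seed)  # seeded generator, so the draw is repeatable
    features, labels = [], []
    for _ in range(npositive):
        features.append([rng.gauss(mean, std) for _ in range(nfeatures)])
        labels.append(1)
    for _ in range(nnegative):
        features.append([rng.gauss(-mean, std) for _ in range(nfeatures)])
        labels.append(-1)
    return features, labels


features, labels = sketch_dummy_pairs(
    npositive=200, nnegative=100, nfeatures=50, mean=0.5, std=1, random_seed=124565
)
print(len(features), len(features[0]))  # 300 50
```

With mean=0.5 and std=1, the two classes overlap substantially on any single feature, and they only become separable when the 50 features are combined.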

[a] Réda, Clémence. (2023). TRANSCRIPT drug repurposing dataset (2.0.0) [Data set]. Zenodo. doi:10.5281/zenodo.7982976

[b] Gottlieb, A., Stein, G. Y., Ruppin, E., & Sharan, R. (2011). PREDICT: a method for inferring novel drug indications with application to personalized medicine. Molecular systems biology, 7(1), 496.

[c] Luo, H., Li, M., Wang, S., Liu, Q., Li, Y., & Wang, J. (2018). Computational drug repositioning using low-rank matrix approximation and randomized algorithms. Bioinformatics, 34(11), 1904-1912.

[d] Réda, Clémence. (2023). PREDICT drug repurposing dataset (2.0.1) [Data set]. Zenodo. doi:10.5281/zenodo.7983090

[e] Liang, X., Zhang, P., Yan, L., Fu, Y., Peng, F., Qu, L., … & Chen, Z. (2017). LRSSL: predict and interpret drug–disease associations based on data integration using sparse subspace learning. Bioinformatics, 33(8), 1187-1196.

Release history

2.0.0 (this version): Jan 22, 2024
1.0.1: Sep 1, 2023
1.0.0: Aug 11, 2023
