Implementation of a general interface for clustering-based oversampling algorithms.
cluster-over-sampling
Introduction
The SMOTE algorithm, like any other oversampling method based on the SMOTE data generation mechanism, creates synthetic samples along line segments that join minority class instances. SMOTE addresses only between-class imbalance.
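The data generation mechanism described above can be illustrated with a minimal pure-Python sketch. The function names `interpolate` and `smote_like_samples` are hypothetical and not part of any library, and the sketch simplifies SMOTE by pairing each minority instance with any other minority instance rather than with one of its k nearest neighbours.

```python
import random


def interpolate(x_i, x_j, rng):
    """Return a random point on the line segment joining x_i and x_j."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x_i, x_j)]


def smote_like_samples(minority, n_samples, seed=0):
    """Generate n_samples synthetic points from pairs of minority instances.

    Simplification (assumption): the partner is any other minority instance,
    whereas SMOTE restricts it to one of the k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_samples):
        x_i, x_j = rng.sample(minority, 2)  # two distinct minority instances
        synthetic.append(interpolate(x_i, x_j, rng))
    return synthetic


minority = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
new_points = smote_like_samples(minority, n_samples=5)
```

Every generated point lies between two existing minority instances, which is why the mechanism cannot by itself account for how minority samples are distributed across different regions of the input space.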
Within-class imbalance can be addressed by clustering the input space and applying any oversampling algorithm to each resulting cluster with an appropriate resampling ratio. `cluster-over-sampling` provides a general interface for clustering-based oversampling algorithms. It is compatible with scikit-learn and imbalanced-learn. SOMO [^1], KMeans-SMOTE [^2] and G-SOMO [^3] are specific realizations of this approach, and they are provided in `cluster-over-sampling`. Additionally, any combination of a scikit-learn clusterer and an imbalanced-learn oversampler is supported.
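To make the idea of a per-cluster resampling ratio concrete, here is a small sketch of how a budget of synthetic samples might be split across clusters. It is not the allocation rule used by the package or by any of the cited algorithms: the function `allocate_per_cluster` is hypothetical, and the inverse-count weighting is an assumption chosen only for illustration.

```python
from collections import Counter


def allocate_per_cluster(cluster_labels, y, minority_label, n_new):
    """Split n_new synthetic samples across clusters containing minority points.

    Assumed weighting (illustrative only): a cluster's weight is the inverse
    of its minority count, so sparsely populated clusters receive more samples.
    """
    counts = Counter(
        c for c, label in zip(cluster_labels, y) if label == minority_label
    )
    weights = {c: 1.0 / n for c, n in counts.items()}
    total = sum(weights.values())
    # Ideal (fractional) share of the budget for each cluster.
    shares = {c: n_new * w / total for c, w in weights.items()}
    allocation = {c: int(s) for c, s in shares.items()}
    # Hand out any remaining samples by largest fractional remainder.
    leftover = n_new - sum(allocation.values())
    by_remainder = sorted(
        shares, key=lambda c: shares[c] - allocation[c], reverse=True
    )
    for c in by_remainder[:leftover]:
        allocation[c] += 1
    return allocation


# Cluster 0 holds four minority points, cluster 1 holds one, cluster 2 none.
cluster_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
allocation = allocate_per_cluster(cluster_labels, y, minority_label=1, n_new=10)
```

Under this assumed weighting, the cluster with a single minority point receives most of the budget, and the cluster without minority points receives none. An oversampler would then be applied inside each cluster with its own target count.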
Installation
`cluster-over-sampling` is currently available on PyPI, and you can install it via `pip`:
pip install cluster-over-sampling
The SOM clusterer requires optional dependencies:
pip install cluster-over-sampling[som]
Similarly for the Geometric SMOTE oversampler:
pip install cluster-over-sampling[gsmote]
You can also install both of them:
pip install cluster-over-sampling[all]
Usage
All the classes included in `cluster-over-sampling` follow the imbalanced-learn API using the functionality of the base oversampler. Following the scikit-learn convention, the data are represented as follows:
- Input data `X`: 2D array-like or sparse matrix.
- Targets `y`: 1D array-like.
The clustering-based oversamplers implement a `fit` method to learn from `X` and `y`:
clustering_based_oversampler.fit(X, y)
They also implement a `fit_resample` method to resample `X` and `y`:
X_resampled, y_resampled = clustering_based_oversampler.fit_resample(X, y)
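The `fit` / `fit_resample` contract shown above can be mimicked with a toy class. `ToyOverSampler` is a hypothetical stand-in written for this illustration. It is not a class from `cluster-over-sampling` or imbalanced-learn, and it simply duplicates existing samples instead of clustering or interpolating.

```python
import random
from collections import Counter


class ToyOverSampler:
    """Hypothetical stand-in that mimics the fit / fit_resample contract."""

    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit(self, X, y):
        # Learn how many samples each class needs to match the largest class.
        counts = Counter(y)
        n_max = max(counts.values())
        self.sampling_strategy_ = {
            label: n_max - n for label, n in counts.items()
        }
        return self

    def fit_resample(self, X, y):
        self.fit(X, y)
        rng = random.Random(self.random_state)
        # Original samples come first, new samples are appended after them.
        X_res, y_res = list(X), list(y)
        for label, n_missing in self.sampling_strategy_.items():
            pool = [x for x, t in zip(X, y) if t == label]
            for _ in range(n_missing):
                X_res.append(rng.choice(pool))  # duplicate an existing sample
                y_res.append(label)
        return X_res, y_res


X = [[0.0], [0.1], [0.2], [0.3], [5.0]]
y = [0, 0, 0, 0, 1]
X_resampled, y_resampled = ToyOverSampler().fit_resample(X, y)
```

In this toy example the resampled targets contain four samples of each class. The oversamplers in the package are called the same way, but generate the new samples per cluster.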
References
If you use `cluster-over-sampling` in a scientific publication, we would appreciate citations to any of the following papers:
[^1]: G. Douzas, F. Bacao, "Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning", Expert Systems with Applications, vol. 82, pp. 40-52, 2017.

[^2]: G. Douzas, F. Bacao, F. Last, "Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE", Information Sciences, vol. 465, pp. 1-20, 2018.

[^3]: G. Douzas, F. Bacao, F. Last, "G-SOMO: An oversampling approach based on self-organized maps and geometric SMOTE", Expert Systems with Applications, vol. 183, 115230, 2021.