Skip to main content

Genie: Fast and Robust Hierarchical Clustering with Noise Points Detection

Project description

genieclust Package for R and Python

Genie: Fast and Robust Hierarchical Clustering with Noise Point Detection

genieclust for Python genieclust for R codecov

Genie outputs meaningful clusters and is fast even on large data sets.

A comprehensive tutorial, benchmarks, and a reference manual is available at https://genieclust.gagolewski.com/.

When using genieclust in research publications, please cite (Gagolewski, 2021) and (Gagolewski, Bartoszuk, Cena, 2016) as specified below. Thank you.

About

A faster and more powerful version of Genie – a robust and outlier resistant clustering algorithm (see Gagolewski, Bartoszuk, Cena, 2016), originally included in the R package genie.

The idea behind Genie is beautifully simple. First, make each individual point the only member of its own cluster. Then, keep merging pairs of the closest clusters, one after another. However, to prevent the formation of clusters of highly imbalanced sizes a point group of the smallest size will sometimes be matched with its nearest neighbour.

Genie's appealing simplicity goes hand in hand with its usability; it often outperforms other clustering approaches such as K-means, BIRCH, or average, Ward, and complete linkage on benchmark data.

Genie is also very fast – determining the whole cluster hierarchy for datasets of millions of points can be completed within minutes. Therefore, it is nicely suited for solving of extreme clustering tasks (large datasets with any number of clusters to detect) for data (also sparse) that fit into memory. Thanks to the use of nmslib, sparse or string inputs are also supported.

It also allows clustering with respect to mutual reachability distances so that it can act as a noise point detector or a robustified version of HDBSCAN* (see Campello et al., 2015) that is able to detect a predefined number of clusters and hence it doesn't dependent on the DBSCAN's somewhat difficult-to-set eps parameter.

Author and Contributors

Author and Maintainer: Marek Gagolewski

Contributors: Maciej Bartoszuk, Anna Cena (R packages genie /genieclust's predecessor/ and CVI /some internal cluster validity measures/), Peter M. Larsen (an implementation of the shortest augmenting path algorithm for the rectangular assignment problem which we use for computing the normalised accuracy and pair sets index).

Python and R Package Features

The implemented algorithms include:

  • Genie++ – a reimplementation of the original Genie algorithm (Gagolewski et al., 2016); much faster than the original one; supports approximate disconnected MSTs;

  • Genie+HDBSCAN* – a robustified (Geniefied) retake on the HDBSCAN* (Campello et al., 2015) method that detects noise points in data and outputs clusters of predefined sizes;

  • (Python only, experimental preview) Genie+Ic (GIc) – Cena's (2018) algorithm to minimise the information theoretic criterion discussed by Mueller et al. (2012).

See classes genieclust.Genie and genieclust.GIc (in Python) or functions gclust() and genieclust() (in R).

Other features:

  • inequality measures: the normalised Gini, Bonferroni, and De Vergottini indices;

  • external cluster validity measures: adjusted asymmetric accuracy and partition similarity scores such as normalised accuracy, pair sets index (PSI), adjusted&unadjusted Rand, adjusted&unadjusted Fowlkes-Mallows (FM), adjusted&normalised&unadjusted mutual information (MI) indices;

  • internal cluster validity measures: the Caliński-Harabasz, Silhouette, Ball-Hall, Davies-Bouldin, generalised Dunn indices, etc.;

  • (Python only) Union-find (disjoint sets) data structures (with extensions);

  • (Python only) Some R-like plotting functions.

Examples, Tutorials, and Documentation

R's interface is compatible with stats::hclust(), but there is more.

X <- ... # some data
h <- gclust(X)
plot(h) # plot cluster dendrogram
cutree(h, k=2)
# or genie(X, k=2)

The Python language version of genieclust has a familiar scikit-learn-like look-and-feel:

import genieclust
X = ... # some data
g = genieclust.Genie(n_clusters=2)
labels = g.fit_predict(X)

The tutorials and the package documentation is available here.

To learn more about Python, check out Marek's recent open-access (free!) textbook Minimalist Data Wrangling in Python.

How to Install

Python Version

To install via pip (see PyPI):

pip3 install genieclust

The package requires Python 3.7+ together with cython as well as numpy, scipy, matplotlib, nmslib, and scikit-learn. Optional dependency: mlpack.

R Version

To install the most recent release, call:

install.packages("genieclust")

See the package entry on CRAN.

Other

The core functionality is implemented in the form of a header-only C++ library. It can thus be easily adapted for use in other environments.

Any contributions are welcome (e.g., Julia, Matlab, ...).

License

Copyright (C) 2018–2022 Marek Gagolewski https://www.gagolewski.com

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License Version 3, 19 November 2007, published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License Version 3 for more details. You should have received a copy of the License along with this program. If not, see (https://www.gnu.org/licenses/).


The file src/c_scipy_rectangular_lsap.h is adapted from the scipy project (https://scipy.org/scipylib/), source: /scipy/optimize/rectangular_lsap/rectangular_lsap.cpp. Author: Peter M. Larsen. Distributed under the BSD-3-Clause license.

The implementation of internal cluster validity measures were adapted from our previous project (Gagolewski, Bartoszuk, Cena, 2021); see optim_cvi. Originally distributed under the GNU Affero General Public License Version 3.

References

Gagolewski M., genieclust: Fast and robust hierarchical clustering, SoftwareX 15, 2021, 100722. DOI: 10.1016/j.softx.2021.100722.

Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, 8–23. DOI: 10.1016/j.ins.2016.05.003.

Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 2021, 620–636. DOI: 10.1016/j.ins.2021.10.004.

Gagolewski M., Adjusted asymmetric accuracy: A well-behaving external cluster validity measure, 2022, submitted for publication.

Gagolewski M., A Framework for Benchmarking Clustering Algorithms. 2022, https://clustering-benchmarks.gagolewski.com.

Cena A., Adaptive hierarchical clustering algorithms based on data aggregation methods, PhD Thesis, Systems Research Institute, Polish Academy of Sciences, 2018.

Campello R., Moulavi D., Zimek A., Sander J., Hierarchical density estimates for data clustering, visualization, and outlier detection, ACM Transactions on Knowledge Discovery from Data 10(1), 2015, 5:1–5:51. DOI: 10.1145/2733381.

Mueller A., Nowozin S., Lampert C.H., Information Theoretic Clustering using Minimum Spanning Trees, DAGM-OAGM, 2012.

See the package's homepage for more references.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

genieclust-1.1.0.tar.gz (90.0 kB view hashes)

Uploaded Source

Built Distributions

genieclust-1.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.9 MB view hashes)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

genieclust-1.1.0-cp310-cp310-macosx_10_9_x86_64.whl (829.4 kB view hashes)

Uploaded CPython 3.10 macOS 10.9+ x86-64

genieclust-1.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.9 MB view hashes)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

genieclust-1.1.0-cp39-cp39-manylinux_2_17_i686.manylinux2014_i686.whl (4.8 MB view hashes)

Uploaded CPython 3.9 manylinux: glibc 2.17+ i686

genieclust-1.1.0-cp39-cp39-macosx_10_9_x86_64.whl (824.6 kB view hashes)

Uploaded CPython 3.9 macOS 10.9+ x86-64

genieclust-1.1.0-cp38-cp38-win_amd64.whl (628.8 kB view hashes)

Uploaded CPython 3.8 Windows x86-64

genieclust-1.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.0 MB view hashes)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64

genieclust-1.1.0-cp38-cp38-manylinux_2_17_i686.manylinux2014_i686.whl (4.9 MB view hashes)

Uploaded CPython 3.8 manylinux: glibc 2.17+ i686

genieclust-1.1.0-cp38-cp38-macosx_10_9_x86_64.whl (804.5 kB view hashes)

Uploaded CPython 3.8 macOS 10.9+ x86-64

genieclust-1.1.0-cp37-cp37m-win_amd64.whl (615.4 kB view hashes)

Uploaded CPython 3.7m Windows x86-64

genieclust-1.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.7 MB view hashes)

Uploaded CPython 3.7m manylinux: glibc 2.17+ x86-64

genieclust-1.1.0-cp37-cp37m-manylinux_2_17_i686.manylinux2014_i686.whl (4.5 MB view hashes)

Uploaded CPython 3.7m manylinux: glibc 2.17+ i686

genieclust-1.1.0-cp37-cp37m-macosx_10_9_x86_64.whl (799.7 kB view hashes)

Uploaded CPython 3.7m macOS 10.9+ x86-64

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page