
Single-linkage connectivity clustering by cosine similarity

Project description

A fast algorithm for computing connectivity components (clusters) of elements based on the cosine similarity of their feature vectors. Two elements are connected by an edge if their cosine similarity is above a given threshold. Accepts either a 2D numpy array or a scipy compressed sparse row (CSR) matrix.

Install from PyPI with pip install silicon-clustering

Tomáš Gavenčiak gavento@ucw.cz

Usage example

import silicon, numpy
# 1000 rows, 10 features
data = numpy.random.rand(1000, 10)
# The ensemble normalizes rows of the data by default
# Choose high verbosity (via logging), cosine similarity >=0.97
ens = silicon.CosineClustering(data, sim_threshold=0.97, verbosity=2)
ens.run()
# (... progress reports)
ens.clusters[0]
# <Cluster no 0, size 13>
ens.clusters_by_size()[0]
# <Cluster no 36, size 22>
# With pyplot you can see the projected points
import matplotlib.pyplot as plt
ens.plot(); plt.show()
# or the individual clusters
ens.clusters_by_size()[0].plot(ens); plt.show()
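
A scipy CSR matrix is accepted in the same way as a dense array; a minimal sketch (the shape and density below are arbitrary):

import scipy.sparse
import silicon
# Sparse rows are normalized by the ensemble, like dense ones
sparse_data = scipy.sparse.random(1000, 10, density=0.3, format="csr")
ens = silicon.CosineClustering(sparse_data, sim_threshold=0.97)
ens.run()
print(len(ens.clusters))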

Details

The algorithm uses several tricks to speed up the computation compared to computing all pairwise scalar products via one big matrix multiplication:

  • The data is projected onto its first cell_dims principal components (via PCA). The projected space is divided into cells of side self.distance (the Euclidean distance corresponding to the similarity threshold), so only rows from the same or adjacent cells have a chance of being similar enough. The row vectors of adjacent cells are then multiplied together (see the first sketch below).

  • This would still leave some cells too big for matrix multiplication, so a second trick is applied before the cell multiplication: nibbling at circa 1000 random points of the dataset. For a random center row, the similarities to all other rows are computed, and all sufficiently similar rows are clustered together (possibly merging existing clusters). This pre-clusters the densest parts of the dataset (especially repeated values): dense clusters have a good chance of being hit by a center, eliminating most of the mass of the cluster (as well as the respective cells). The number of nibble iterations should be tuned to the data so that the cell sizes decrease reasonably (see the second sketch below).

  • To combine the two tricks, a portion of the already-clustered points (e.g. 10%) is kept, together with all the unclustered points, for the adjacent-cell multiplication. These retained points ensure that the nibbled clusters are linked with any points that were not hit by the nibble but lie close to, and in effect belong to, those clusters.
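
The following standalone sketch illustrates the cell trick in plain numpy; it is not the library's internal code, and cell_dims = 2 below is an arbitrary choice. For unit vectors, cosine similarity >= t is equivalent to Euclidean distance <= sqrt(2 - 2t), and an orthonormal projection can only shrink distances, so similar rows always land in the same or adjacent cells:

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 10))
data /= np.linalg.norm(data, axis=1, keepdims=True)  # normalize rows

sim_threshold = 0.97
distance = np.sqrt(2.0 - 2.0 * sim_threshold)  # distance matching the threshold

cell_dims = 2
# Cheap PCA: top principal directions from an SVD of the centered data
_, _, vt = np.linalg.svd(data - data.mean(axis=0), full_matrices=False)
proj = data @ vt[:cell_dims].T

cells = defaultdict(list)
for i, coords in enumerate(np.floor(proj / distance).astype(int)):
    cells[tuple(coords)].append(i)
# Only rows from the same or adjacent cells can reach the threshold,
# so only those cell pairs need their vectors multiplied.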

Since not all nibbled rows take part in the adjacent-cell scalar products, the algorithm may miss a few individual cluster connections at nibble ball boundaries, but we found this unlikely in practice.
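
A sketch of one nibble pass in plain numpy, with a simple union-find over row indices (again illustrative rather than the library's implementation; the count of 100 iterations is arbitrary):

import numpy as np

def find(parent, i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i

def union(parent, i, j):
    parent[find(parent, i)] = find(parent, j)

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 10))
data /= np.linalg.norm(data, axis=1, keepdims=True)

sim_threshold = 0.97
parent = np.arange(len(data))  # union-find over row indices
for _ in range(100):  # tune the nibble count to the data
    center = rng.integers(len(data))
    sims = data @ data[center]  # similarities of all rows to the center
    for i in np.flatnonzero(sims >= sim_threshold):
        union(parent, center, i)  # may merge existing clusters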

The algorithm clusters 10,000 Gaussian points in 30 dimensions in under a second.
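
The timing is easy to check against the public API; a small sketch (the threshold 0.97 is carried over from the usage example above):

import time
import numpy
import silicon

data = numpy.random.normal(size=(10000, 30))
ens = silicon.CosineClustering(data, sim_threshold=0.97)
start = time.time()
ens.run()
print(f"clustered {len(data)} points in {time.time() - start:.2f} s")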

Download files

Download the file for your platform.

Source Distribution

silicon-clustering-0.3.1.tar.gz (11.9 kB)

Uploaded Source

Built Distributions

silicon_clustering-0.3.1-py3-none-any.whl (16.6 kB)

Uploaded Python 3

silicon_clustering-0.3.1-py2-none-any.whl (16.6 kB)

Uploaded Python 2
