Skip to main content

A library for similarity search over high-dimensional data based on Locality-Sensitive Hashing (LSH)

Project description

FALCONN - FAst Lookups of Cosine and Other Nearest Neighbors

FALCONN is a library with algorithms for the nearest neighbor search problem. The algorithms in FALCONN are based on Locality-Sensitve Hashing (LSH), which is a popular class of methods for nearest neighbor search in high-dimensional spaces. The goal of FALCONN is to provide very efficient and well-tested implementations of LSH-based data structures.

Currently, FALCONN supports two LSH families for the cosine similarity: hyperplane LSH and cross polytope LSH. Both hash families are implemented with multi-probe LSH in order to minimize memory usage. Moreover, FALCONN is optimized for both dense and sparse data. Despite being designed for the cosine similarity, FALCONN can often be used for nearest neighbor search under the Euclidean distance or a maximum inner product search.

FALCONN is written in C++ and consists of several modular core classes with a convenient wrapper around them. Many mathematical operations in FALCONN are vectorized through the Eigen and FFHT libraries. The core classes of FALCONN rely on templates in order to avoid runtime overhead.

How to use FALCONN

We provide a C++ interface for FALCONN as well as a Python wrapper (that uses NumPy). In the future, we plan to support more programming languages such as Julia. For C++, FALCONN is a header-only library and has no dependencies besides Eigen (which is also header-only), so FALCONN is easy to set up. For further details, please see our documentation.

How fast is FALCONN?

On data sets with about 1 million points in around 100 dimensions, FALCONN typically requires a few milliseconds per query (running on a reasonably modern desktop CPU).

Below are the results of FALCONN being compared with other open source similarity search algorithms. The dataset, which consists of vector representations for words produced by GloVe, has 1.2M points in 100 dimensions. The results for other algorithms are taken from ann-benchmarks created by Erik Bernhardsson. Note that FALCONN is especially good in the regime of high accuracy (0.8 or more).

The plot axes are: accuracy retrieving 10 closest data points vs. the number of queries per second.

direct link

Questions

Maybe your question is already answered in our Frequently Asked Questions. If you have additional questions about using FALCONN, we would be happy to help. Please send an email to falconn.lib@gmail.com.

Authors

FALCONN is mainly developed by Ilya Razenshteyn and Ludwig Schmidt. FALCONN has grown out of a research project with our collaborators Alexandr Andoni, Piotr Indyk, and Thijs Laarhoven.

Many of the ideas used in FALCONN were proposed in research papers over the past 20 years (see the documentation).

If you want to cite FALCONN in a publication, here is the bibliographic information of our research paper (bibtex):

Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, Ludwig Schmidt
NIPS 2015

License

FALCONN is available under the MIT License (see LICENSE.txt). Note that the third-party libraries in the external/ folder are distributed under other open source licenses. The Eigen library is licensed under the MPL2. The googletest and googlemock libraries are licensed under the BSD 3-Clause License. Furthermore, the file src/python/module/numpy.i is part of the NumPy project and also licensed under the BSD 3-Clause License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

FALCONN-1.2.tar.gz (1.1 MB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page