wmd · PyPI

Accelerated functions to calculate Word Mover's Distance

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: Apache Software License
Operating System
- POSIX :: Linux
Programming Language
Topic
- Scientific/Engineering :: Information Analysis

Project description

Fast Word Mover's Distance [![Build Status](https://travis-ci.org/src-d/wmd-relax.svg?branch=master)](https://travis-ci.org/src-d/wmd-relax) [![PyPI](https://img.shields.io/pypi/v/wmd.svg)](https://pypi.python.org/pypi/wmd) [![codecov](https://codecov.io/github/src-d/wmd-relax/coverage.svg)](https://codecov.io/gh/src-d/wmd-relax)
==========================

Calculates Word Mover's Distance as described in
[From Word Embeddings To Document Distances](http://www.cs.cornell.edu/~kilian/papers/wmd_metric.pdf)
by Matt Kusner, Yu Sun, Nicholas Kolkin and Kilian Weinberger.

<img src="https://blog.sourced.tech/post/lapjv/wmd.png" alt="Word Mover's Distance" width="200"/>
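
For reference, the distance defined in the paper is the optimal value of a transportation problem. The notation below follows the paper rather than this package's code: `d` and `d'` are the normalized bag-of-words weights of the two documents, `x_i` is the embedding of word `i`, and `T_{ij}` is the amount of weight moved from word `i` of the first document to word `j` of the second.

```latex
\mathrm{WMD}(d, d') = \min_{T \ge 0} \sum_{i,j} T_{ij}\, c(i, j)
\quad \text{subject to} \quad
\sum_{j} T_{ij} = d_i \;\; \forall i, \qquad
\sum_{i} T_{ij} = d'_j \;\; \forall j,
\qquad c(i, j) = \lVert x_i - x_j \rVert_2
```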

The high-level logic is written in Python; the low-level linear programming
functions are offloaded to the bundled native extension. The native extension
can also be built as a generic shared library with no dependency on Python.
**Python 2.7 and older are not supported.** The heavy lifting is done by
[google/or-tools](https://github.com/google/or-tools).

### Installation

```
pip3 install wmd
```
Tested on Linux and macOS.

### Usage

You need an embeddings numpy array and an nbow (normalized bag-of-words) model:
every sample is a weighted set of items, and every item has an embedding.

```python
import numpy
from wmd import WMD

embeddings = numpy.array([[0.1, 1], [1, 0.1]], dtype=numpy.float32)
nbow = {"first": ("#1", [0, 1], numpy.array([1.5, 0.5], dtype=numpy.float32)),
        "second": ("#2", [0, 1], numpy.array([0.75, 0.15], dtype=numpy.float32))}
calc = WMD(embeddings, nbow, vocabulary_min=2)
print(calc.nearest_neighbors("first"))
```
```
[('second', 0.10606599599123001)]
```
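
The printed value can be checked by hand. The sketch below uses only numpy and does not call the package. It assumes that the weights are normalized to sum to 1 before transport and that the ground distance is Euclidean, as in the paper; both assumptions are inferred from the fact that the result agrees with the output above, not from the package's source. Because both samples contain the same two items, the transport problem has a closed form: weight that stays on an item costs nothing, and only the surplus on one item has to move to the other.

```python
import numpy

embeddings = numpy.array([[0.1, 1], [1, 0.1]], dtype=numpy.float32)
w_first = numpy.array([1.5, 0.5], dtype=numpy.float32)
w_second = numpy.array([0.75, 0.15], dtype=numpy.float32)

# Assumption: weights are normalized to sum to 1 before solving the transport problem.
a = w_first / w_first.sum()    # [0.75, 0.25]
b = w_second / w_second.sum()  # roughly [0.833, 0.167]

# Euclidean ground distance between the two item embeddings.
ground = numpy.linalg.norm(embeddings[0] - embeddings[1])

# Two shared items: only the surplus weight on item 0 has to travel to item 1.
moved = abs(a[0] - b[0])
distance = float(moved * ground)
print(distance)  # approximately 0.106
```

This shortcut only works for two samples over the same pair of items; the general case needs a linear programming solver, which is what the native extension provides.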

`embeddings` must support `__getitem__`, which returns an item's embedding by
its identifier; in particular, `numpy.ndarray` matches that interface.
`nbow` must be iterable, yielding sample identifiers, and must support
`__getitem__` by those identifiers, returning tuples of length 3.
The first element is the human-readable name of the sample, the
second is an iterable of item identifiers, and the third is a `numpy.ndarray`
with the corresponding weights. All numpy arrays must be float32. The return
value is a list of tuples, each holding a sample identifier and its distance
to the query (lower is better).
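
A minimal sketch of how an `nbow` mapping in this shape could be assembled from tokenized text. `build_nbow`, `vocabulary`, and `documents` are hypothetical names used for illustration and are not part of the package; the sketch assumes that row `i` of `embeddings` corresponds to `vocabulary[i]` and that raw token counts are acceptable as weights.

```python
import collections

import numpy


def build_nbow(documents, vocabulary):
    """Map each sample key to (name, item ids, float32 weights)."""
    index = {token: i for i, token in enumerate(vocabulary)}
    nbow = {}
    for key, (name, tokens) in documents.items():
        # Count only tokens that have an embedding row.
        counts = collections.Counter(index[t] for t in tokens if t in index)
        ids = sorted(counts)
        weights = numpy.array([counts[i] for i in ids], dtype=numpy.float32)
        nbow[key] = (name, ids, weights)
    return nbow


vocabulary = ["politician", "media", "president", "press"]
documents = {
    "first": ("#1", ["politician", "media", "media"]),
    "second": ("#2", ["president", "press", "unknown"]),
}
nbow = build_nbow(documents, vocabulary)
print(nbow["first"])
```

A plain `dict` satisfies the interface described above: iterating over it yields the sample identifiers, and indexing by an identifier returns the 3-tuple.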

It is possible to use this package with [spaCy](https://github.com/explosion/spaCy):

```python
import spacy
import wmd

nlp = spacy.load('en_core_web_md')
nlp.add_pipe(wmd.WMD.SpacySimilarityHook(nlp), last=True)
doc1 = nlp("Politician speaks to the media in Illinois.")
doc2 = nlp("The president greets the press in Chicago.")
print(doc1.similarity(doc2))
```

See also [spacy_example.py](spacy_example.py), which finds similar Wikipedia
pages.

### Building from source

Either build it as a Python package:

```
pip3 install git+https://github.com/src-d/wmd-relax
```

or use CMake:

```
git clone --recursive https://github.com/src-d/wmd-relax
cd wmd-relax
cmake -D CMAKE_BUILD_TYPE=Release .
make -j
```

Note the `--recursive` flag for `git clone`: this project uses source{d}'s
fork of [google/or-tools](https://github.com/google/or-tools) as a git submodule.

### Tests

Tests are in `test.py` and use the stock `unittest` package.

### Documentation

```
cd doc
make html
```

The generated files are written to the `doc/doxyhtml` and `doc/html` directories.

### Contributions

...are welcome! See [CONTRIBUTING](CONTRIBUTING.md) and [code of conduct](CODE_OF_CONDUCT.md).

### License
[Apache 2.0](LICENSE.md)

Release history

- 1.3.2 (Oct 21, 2019)
- 1.3.1 (Apr 23, 2019)
- 1.3.0 (Oct 29, 2018), this version
- 1.2.11 (Aug 31, 2018)
- 1.2.10 (Aug 21, 2018)
- 1.2.8 (Jan 28, 2018)
- 1.2.7 (Jan 23, 2018)
- 1.2.6 (Jul 19, 2017)
- 1.2.5 (Jul 13, 2017)
- 1.2.4 (May 8, 2017)
- 1.2.3 (May 8, 2017)
- 1.2.2 (May 6, 2017)
- 1.2.1 (May 5, 2017)
- 1.2.0 (Apr 27, 2017)
- 1.1.6 (Apr 27, 2017)

Download files

Source Distribution

- wmd-1.3.0.tar.gz (103.3 kB), uploaded Oct 29, 2018, Source

Built Distributions

- wmd-1.3.0-cp37-cp37m-manylinux1_x86_64.whl (635.6 kB), uploaded Oct 29, 2018, CPython 3.7m
- wmd-1.3.0-cp36-cp36m-macosx_10_13_x86_64.whl (146.7 kB), uploaded Oct 29, 2018, CPython 3.6m, macOS 10.13+ x86-64

Hashes for wmd-1.3.0.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `9797d585a6f148bbfb0a926deb04f4eae20f1806dcac3527622bfd3b78a144af` |
| MD5 | `f97d4db2818a4af6647908f9ff853437` |
| BLAKE2b-256 | `2f61686d4dd4f2e37fea15b3bd04a5b68a74aa2cb54be18a31f59d5703991f0b` |

Hashes for wmd-1.3.0-cp37-cp37m-manylinux1_x86_64.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `4213900907d2c14f23b92b7db0c6cbe49db04c6aece1584603bcd5a9d78edddb` |
| MD5 | `d85f748e42a6ded86dba4c3518ec5924` |
| BLAKE2b-256 | `f6577a276d3711cc189afe8d6f40e6ef91dd5a2e807bbf6d40b7cad3b72241c7` |

Hashes for wmd-1.3.0-cp36-cp36m-macosx_10_13_x86_64.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `13bc4b10359aa8fbea4a7d2af388e8aa0dd591e16137c3ff2548461eec217308` |
| MD5 | `030da46df3c65fb2406919fbf3cdbaf5` |
| BLAKE2b-256 | `ad0a5457dea17077965394481e9311058e352717dc5a5095e95dfc8e79370fda` |