disamby

Python package to carry out entity disambiguation based on string matching

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
- Science/Research
License
- OSI Approved :: MIT License
Natural Language
- English
Programming Language

Project description

disamby

Free software: MIT license
Documentation: https://disamby.readthedocs.io.

disamby is a Python package designed to carry out entity disambiguation based on fuzzy string matching.

It works best for entities whose strings are very similar when they refer to the same thing. The algorithm works fairly well, for example, on company names and addresses that contain typos, alternative spellings or composite names. Other use cases include identifying people in a database where names might be misspelled.

The algorithm exploits how informative a given word/token is, based on its observed frequency in the whole corpus of strings. For example, in firm names the word ‘inc’ is not very informative, whereas “Solomon” is, since the former appears repeatedly and the latter rarely.
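
The idea of frequency-based informativeness can be illustrated with a minimal sketch. The function name token_weights and the log-inverse-frequency formula are assumptions made for illustration only; this is not necessarily the weighting disamby uses internally.

```python
from collections import Counter
import math


def token_weights(names):
    """Weight each token by how rare it is across the corpus.

    Hypothetical IDF-style weighting: log(number of strings / number of
    strings containing the token). Not taken from disamby's source.
    """
    # Count in how many strings each token occurs (once per string).
    counts = Counter(tok for name in names for tok in set(name.lower().split()))
    n = len(names)
    return {tok: math.log(n / c) for tok, c in counts.items()}


names = ["Solomon Inc", "Acme Inc", "Globex Inc", "Initech Inc"]
weights = token_weights(names)

# 'inc' occurs in every name, so it carries no information: log(4/4) = 0.0
# 'solomon' occurs in one name only, so it is informative: log(4/1)
print(weights["inc"], weights["solomon"])
```

Under this weighting, two firm names that share only ‘inc’ gain nothing from the match, while two names that share “Solomon” gain a lot.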

With these frequencies the algorithm computes how similar a given pair of instances is; if the similarity is above a chosen threshold, the two are connected in an “alias graph” (i.e. a directed network in which an entity is connected to another if it is similar enough). After all records have been connected in this way, disamby returns sets of entities that are strongly connected [2]. Strongly connected means in this case that, within the component, there exists a directed path from every node to every other node.
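
The alias-graph step can be sketched in plain Python. Everything here is an illustrative assumption: the asymmetric, unweighted overlap_score, the helper names, and the use of Kosaraju's algorithm for the strongly connected components. disamby's own scoring and graph code may differ.

```python
def overlap_score(a, b):
    """Share of a's tokens that also occur in b (asymmetric, unweighted)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta)


def alias_graph(records, threshold):
    """Directed graph: key -> set of keys it is similar enough to."""
    return {
        i: {j for j in records
            if j != i and overlap_score(records[i], records[j]) >= threshold}
        for i in records
    }


def strongly_connected_components(graph):
    """Kosaraju's algorithm on a dict-of-sets directed graph."""
    seen = set()

    def visit(start, g, out):
        # Iterative depth-first search; appends nodes to `out` in post-order.
        seen.add(start)
        stack = [(start, iter(g[start]))]
        while stack:
            current, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(g[nxt])))
                    break
            else:
                stack.pop()
                out.append(current)

    # First pass: record finishing order on the original graph.
    order = []
    for node in graph:
        if node not in seen:
            visit(node, graph, order)

    # Second pass: explore the reversed graph in decreasing finishing order.
    reverse = {n: set() for n in graph}
    for src, targets in graph.items():
        for dst in targets:
            reverse[dst].add(src)

    seen.clear()
    found = []
    for node in reversed(order):
        if node not in seen:
            comp = []
            visit(node, reverse, comp)
            found.append(set(comp))
    return found


records = {
    "A": "Solomon Brothers Inc",
    "B": "Solomon Brothers",
    "C": "Brothers Grimm Publishing House Ltd",
}
graph = alias_graph(records, threshold=0.5)
components = strongly_connected_components(graph)
print(components)
```

In this toy data B points to C (half of B's tokens occur in C) but C does not point back, so C stays in its own set. Requiring strong connectivity means a one-way resemblance is not enough to merge two records.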

Example

To use disamby in a project:

import pandas as pd
import disamby.preprocessors as pre
from disamby import Disamby

# create a dataframe with the fields you intend to match on as columns
df = pd.DataFrame({
    'name':     ['Luca Georger',        'Luca Geroger',         'Adrian Sulzer'],
    'address':  ['Mira, 34, Augsburg',  'Miri, 34, Augsburg',   'Milano, 34']},
    index=      ['L1',                  'L2',                   'O1']
)

# define the pipeline to process the strings, note that the last step must return
# a tuple of strings
pipeline = [
    pre.normalize_whitespace,
    pre.remove_punctuation,
    lambda x: pre.trigram(x) + pre.split_words(x)  # any python function is allowed
]

# instantiate the Disamby object; it applies the given pre-processing pipeline
# to the strings and computes the token frequencies.
dis = Disamby(df, pipeline)

# let disamby compute the disambiguated sets. Note that a threshold must be given or it
# defaults to 0.
dis.disambiguated_sets(threshold=0.5)
[{'L2', 'L1'}, {'O1'}]  # output

# To check if the sets are accurate you can get the rows from the
# pandas dataframe like so:
df.loc[['L2', 'L1']]
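
To make the pipeline in the example above more concrete, the following sketch shows how such a chain of string functions could turn one field into a tuple of tokens. The four helper functions are hypothetical stand-ins written for this illustration; their behaviour is guessed from the names in disamby.preprocessors and may not match the real implementations. The run_pipeline helper is also an assumption.

```python
import string


def normalize_whitespace(text):
    """Hypothetical stand-in: collapse runs of whitespace to single spaces."""
    return " ".join(text.split())


def remove_punctuation(text):
    """Hypothetical stand-in: drop ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def trigram(text):
    """Hypothetical stand-in: all overlapping three-character substrings."""
    return tuple(text[i:i + 3] for i in range(len(text) - 2))


def split_words(text):
    """Hypothetical stand-in: whitespace-separated words."""
    return tuple(text.split())


def run_pipeline(text, steps):
    """Feed the output of each step into the next one."""
    for step in steps:
        text = step(text)
    return text


steps = [
    normalize_whitespace,
    remove_punctuation,
    lambda x: trigram(x) + split_words(x),  # last step returns a tuple of strings
]
tokens = run_pipeline("Mira,  34", steps)
print(tokens)
# ('Mir', 'ira', 'ra ', 'a 3', ' 34', 'Mira', '34')
```

Combining trigrams with whole words means that a typo such as “Miri” for “Mira” still shares the trigram “Mir”, even though the whole-word tokens differ.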

Installation

To install disamby, run this command in your terminal:

$ pip install disamby

This is the preferred method to install disamby, as it will always install the most recent stable release. If you don’t have pip installed, this Python installation guide can guide you through the process.

You can also install it from source. The sources for disamby can be downloaded from the GitHub repo. You can either clone the public repository:

$ git clone https://github.com/verginer/disamby

Or download the tarball:

$ curl  -OL https://github.com/verginer/disamby/tarball/master

Once you have a copy of the source, you can install it with:

$ python setup.py install

Credits

I got the inspiration for this package from the seminar “The SearchEngine - A Tool for Matching by Fuzzy Criteria” by Thorsten Doherr at the CISS [1] Summer School 2017.

History

0.2.3 (2017-07-01)

Fixes formatting breaking pypi display

0.2.2 (2017-06-30)

working release with minimal documentation
works with multiple field matching
carries out all steps autonomously from string pre-processing to identifying the strongly connected components

0.1.0 (2017-06-24)

First release on PyPI.

Release history

0.2.4 (Jul 1, 2017), this version
0.2.3 (Jul 1, 2017)
0.2.2 (Jun 30, 2017)
0.2.1 (Jun 30, 2017)

Download files

Source Distribution

disamby-0.2.4.tar.gz (639.2 kB)

Uploaded Jul 1, 2017 Source

Built Distribution

disamby-0.2.4-py2.py3-none-any.whl (13.3 kB)

Uploaded Jul 1, 2017 Python 2 Python 3

Hashes for disamby-0.2.4.tar.gz
Algorithm	Hash digest
SHA256	`301d502cf4f6ba909df06567059ef83e3e9bc9a23e17aa56c001f7e95ed1c461`
MD5	`9a5f75191b81cece775d66c89f2e9353`
BLAKE2b-256	`cbe068ff4bc46425df21fdeafa9734efd00859d51a977b4e3f5ff05f562c482c`

Hashes for disamby-0.2.4-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`8a96672d09e2253b9c2587e66cb07ec347d4771e64cae3aecf5844059b3d45b8`
MD5	`64ee029df741c169afc2cb3d9b0f44b1`
BLAKE2b-256	`efaeb5f4a7149c8a246341a6cb67f305215bc312d123e81e4f36b770238c4ef5`