flashtext

Extract/Replace keywords in sentences.

Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language

Project description

This code is not ready yet.
===========================
Please use https://github.com/vi3k6i5/synonym-extractor until this is finished.

This code is the successor to https://github.com/vi3k6i5/synonym-extractor.

flashtext
==============

Flash Text is a Python library that is loosely based on the `Aho-Corasick algorithm
<https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm>`_.

The idea is this: say we have a corpus of terms/keywords, and we want to extract every term from that corpus that appears in a sentence by making one pass over the sentence.

For example, say I have a vocabulary of 10K words and I want to find all the words from that set that are present in a sentence. A simple regex approach has to loop over all 10K words and try to match each one, which takes a lot of time.

Hence we use a simpler yet much faster algorithm to get the desired result.
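
The single-pass idea can be illustrated with a plain character trie. This is only a minimal sketch under stated assumptions, not the library's actual implementation: the names ``build_trie``, ``extract_keywords``, ``WORD_CHARS`` and the ``'_end_'`` marker are hypothetical, and a "word" is assumed to be a run of ASCII letters, digits and underscores.

::

    import string

    # Assumption: characters that count as part of a word.
    WORD_CHARS = set(string.ascii_letters + string.digits + '_')


    def build_trie(keywords):
        """Build a character trie; the '_end_' key stores the clean name."""
        trie = {}
        for keyword, clean_name in keywords.items():
            node = trie
            for char in keyword:
                node = node.setdefault(char, {})
            node['_end_'] = clean_name
        return trie


    def extract_keywords(sentence, trie):
        """Scan the sentence left to right and collect clean names."""
        found = []
        i, n = 0, len(sentence)
        while i < n:
            # Only start matching at the beginning of a word.
            if i > 0 and sentence[i - 1] in WORD_CHARS:
                i += 1
                continue
            node, j = trie, i
            match, match_end = None, i
            # Walk the trie as far as the sentence allows.
            while j < n and sentence[j] in node:
                node = node[sentence[j]]
                j += 1
                ends_word = j == n or sentence[j] not in WORD_CHARS
                if '_end_' in node and ends_word:
                    # Remember the longest keyword ending on a word boundary.
                    match, match_end = node['_end_'], j
            if match is not None:
                found.append(match)
                i = match_end
            else:
                i += 1
        return found


    keywords = {'NY': 'new york', 'new-york': 'new york', 'SF': 'san francisco'}
    trie = build_trie(keywords)
    print(extract_keywords('I love SF and NY. new-york is the best.', trie))
    # prints ['san francisco', 'new york', 'new york']

The cost of the scan depends on the length of the sentence rather than on how many keywords are in the trie, which is the property the regex loop lacks.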

Installation
------------
::

    pip install flashtext

Usage
------
::

    # import module
    from synonym.extractor import SynonymExtractor

    # Create an object of SynonymExtractor
    synonym_extractor = SynonymExtractor()

    # add synonyms
    synonym_names = ['NY', 'new-york', 'SF']
    clean_names = ['new york', 'new york', 'san francisco']

    for synonym_name, clean_name in zip(synonym_names, clean_names):
        synonym_extractor.add_to_synonym(synonym_name, clean_name)

    synonyms_found = synonym_extractor.get_synonyms_from_sentence('I love SF and NY. new-york is the best.')

    synonyms_found
    >> ['san francisco', 'new york', 'new york']

Algorithm
----------

synonym-extractor is based on the `Aho-Corasick algorithm
<https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm>`_.
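
For readers unfamiliar with it, the following is a minimal textbook-style sketch of the Aho-Corasick automaton. It is not taken from the synonym-extractor source, and the names ``build_automaton`` and ``find_all`` are hypothetical. Unlike the library, this sketch reports raw substring matches and does not check word boundaries.

::

    from collections import deque


    def build_automaton(keywords):
        """Build goto, failure and output tables for non-empty keywords."""
        goto, fail, out = [{}], [0], [[]]
        # Phase 1: insert every keyword into a trie (state 0 is the root).
        for keyword in keywords:
            state = 0
            for char in keyword:
                if char not in goto[state]:
                    goto.append({})
                    fail.append(0)
                    out.append([])
                    goto[state][char] = len(goto) - 1
                state = goto[state][char]
            out[state].append(keyword)
        # Phase 2: breadth-first pass that fills in the failure links.
        queue = deque(goto[0].values())
        while queue:
            state = queue.popleft()
            for char, nxt in goto[state].items():
                queue.append(nxt)
                f = fail[state]
                while f and char not in goto[f]:
                    f = fail[f]
                fail[nxt] = goto[f].get(char, 0)
                # Inherit keywords that end at the failure target.
                out[nxt] = out[nxt] + out[fail[nxt]]
        return goto, fail, out


    def find_all(text, automaton):
        """Return (start_index, keyword) pairs found in one pass over text."""
        goto, fail, out = automaton
        state, matches = 0, []
        for index, char in enumerate(text):
            # Follow failure links until a transition on char exists.
            while state and char not in goto[state]:
                state = fail[state]
            state = goto[state].get(char, 0)
            for keyword in out[state]:
                matches.append((index - len(keyword) + 1, keyword))
        return matches


    automaton = build_automaton(['he', 'she', 'his', 'hers'])
    print(find_all('ushers', automaton))
    # prints [(1, 'she'), (2, 'he'), (2, 'hers')]

The failure links are what let the scan continue after a mismatch without re-reading earlier characters of the text.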

Documentation
-------------

Documentation can be found at `Read the Docs
<http://synonym-extractor.readthedocs.org>`_.

Why
------

Say you have a corpus where similar words appear frequently, eg:

::

    Last weekend I was in NY.
    I am traveling to new york next weekend.

If you train a word2vec model on this corpus, or do any other sort of NLP, it will treat "NY" and "new york" as 2 different terms.

Instead, you can create a synonym dictionary, eg:

::

    NY=>new york
    new york=>new york

Then you can extract "NY" and "new york" as the same text.

Doing the same with regex takes a lot of time:

===========  ==========  ========  =================
Docs count   # Synonyms  Regex     synonym-extractor
===========  ==========  ========  =================
1.5 million  2K          16 hours  NA
2.5 million  10K         15 days   15 mins
===========  ==========  ========  =================

The idea for this library came from the following `StackOverflow question
<https://stackoverflow.com/questions/44178449/regex-replace-is-taking-time-for-millions-of-documents-how-to-make-it-faster>`_.
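
As a rough illustration of why the regex approach scales poorly, here is a minimal sketch of a per-synonym regex loop. The function name ``normalize_with_regex`` and the sample dictionary are hypothetical, and this is not the code from the StackOverflow question.

::

    import re

    synonyms = {'NY': 'new york', 'new-york': 'new york', 'SF': 'san francisco'}


    def normalize_with_regex(sentence, synonyms):
        """Replace each synonym with its clean name, one regex per synonym."""
        for synonym, clean_name in synonyms.items():
            # Every synonym triggers another full scan of the sentence.
            pattern = r'\b' + re.escape(synonym) + r'\b'
            sentence = re.sub(pattern, clean_name, sentence)
        return sentence


    print(normalize_with_regex('I love SF and NY. new-york is the best.', synonyms))
    # prints I love san francisco and new york. new york is the best.

Each document is scanned once per synonym here, so the work grows with both the number of documents and the number of synonyms.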

License
-------

The project is licensed under the MIT license.

Release history

- 2.7: Feb 16, 2018
- 2.6: Jan 26, 2018
- 2.5: Nov 21, 2017
- 2.4: Nov 19, 2017
- 2.3: Oct 27, 2017
- 2.2: Sep 25, 2017
- 2.1: Sep 25, 2017
- 2.0: Sep 12, 2017
- 1.9: Sep 12, 2017
- 1.8: Sep 8, 2017
- 1.7: Aug 22, 2017
- 1.6: Aug 22, 2017
- 1.5: Aug 20, 2017
- 1.4: Aug 20, 2017
- 1.3: Aug 20, 2017
- 1.2: Aug 20, 2017
- 1.1: Aug 16, 2017
- 1.0: Aug 16, 2017 (this version)

Download files

Source Distribution

flashtext-1.0.tar.gz (5.9 kB)

Uploaded Aug 16, 2017 Source

Hashes for flashtext-1.0.tar.gz
Algorithm	Hash digest
SHA256	`6d8560df1634cc4cd3d5263e0892197775343991f9f8ed85a604dc456e7dddeb`
MD5	`4d2242178df41241534ad49c69ddd7a3`
BLAKE2b-256	`39b6bfaca1baff932792cb402b6c63cb85d778ed385c47ca2490c4239e4a737e`