🚀 Extremely fast spelling checker and suggester in Python!

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

Spellwise

🚀 Extremely fast spelling checker and suggester in Python!

The following algorithms are supported currently,

Edit-distance, Hall and Dowling (1980) (based on Levenshtein distance algorithm)
Editex, Zobel and Dart (1996) (for suggesting phonetically similar words)
Caverphone 1.0 and Caverphone 2.0 David Hood (2002) (to identify English names which sound phonetically similar)
QWERTY Keyboard layout Typographic based correction algorithm (Typox), inspired from Ahmad, Indrayana, Wibisono, and Ijtihadie (2017). This implementation might not be the exact one specified in the paper since it is not available to read for free

All the above algorithms use an underlying Trie based dictionary for efficient storage and fast computation! The implementations of both Edit-distance and Editex algorithms are inspired from the amazing article Fast and Easy Levenshtein distance using a Trie, by Steve Hanov.

📦 Installation

To be updated soon!

🧑‍💻 Usage

Currently there are five algorithms available for use with the following classnames,

Levenshtein
Editex
CaverphoneOne
CaverphoneTwo
Typox

Please check the examples/ folder for specific usage of each algorithm. But in a general sense, each algorithm has three parts,

Initialization (initialize the class object for the algorithm to use)
Index correct words / names (add correct words or names to the dictionary)
Fetch suggestions (inference)

from spellwise import CaverphoneOne, CaverphoneTwo, Editex, Levenshtein, Typox

# (1) Initialize the desired algorithm
algorithm = Editex() # this can be CaverphoneOne, CaverphoneTwo, Levenshtein or Typox as well

# (2) Index the words / names to the algorithm
# Indexing can be done by adding words from a file
algorithm.add_from_path("<path-to-the-dictionary-file>")
# or by adding them manually
algorithm.add_words(["spell", "spelling", "check"])

# (3) Fetch the suggestions for the word
suggestions = algorithm.get_suggestions("spellin")
# The `suggestions` is a list of dict with fields `word` and `distance`
# [{"word": ..., "distance": ...}, ...]
print(suggestions)

💡 Analysis of each algorithm

There are many algorithms currently available in the package and each one of them are used for different purposes. We will discuss each algorithm in specific in the following sections.

(1) Levenshtein

The Levenshtein algorithm is the baseline and most popular method to identify the closest correct words given the mispelled word, based on the edit-distance (number of insertions, deletions and replacements) between the given word and correct word.

from spellwise import Levenshtein

levenshtein = Levenshtein()
levenshtein.add_from_path("examples/data/american-english")

suggestions = levenshtein.get_suggestions("run")
# Print the top 10 suggestions
print("Word \t Distance")
print("=================")
for suggestion in suggestions[0:10]:
    print("{} \t {}".format(suggestion.get("word"), suggestion.get("distance")))

Levenshtein provides the following,

Word 	 Distance
=================
run 	 0
bun 	 1
dun 	 1
fun 	 1
gun 	 1
hun 	 1
jun 	 1
jun 	 1
mun 	 1
nun 	 1

(2) Editex

The Editex algorithm provides suggestions of words which are phonetically closed to the given word. It also uses the edit-distance but has different replacement or deletion costs depending on whether the two letters belong to the same phonetic group or not.

from spellwise import Editex

editex = Editex()
editex.add_from_path("examples/data/american-english")

suggestions = editex.get_suggestions("run")
# Print the top 10 suggestions
print("Word \t Distance")
print("=================")
for suggestion in suggestions[0:10]:
    print("{} \t {}".format(suggestion.get("word"), suggestion.get("distance")))

Editex suggests the following,

Word 	 Distance
=================
run 	 0
ran 	 1
ron 	 1
ruin 	 1
rum 	 1
bun 	 2
dun 	 2
dunn 	 2
fun 	 2
gun 	 2

Notice that the Levenshtein algorithm computes the distance between run and bun to be 1 since there is only one replacement necessary. On the other hand, Editex algorithm computes this distance to be 2 since phonetically, the words are farther apart.

(3) Caverphone 1.0 and Caverphone 2.0

The Caverphone algorithm was developed as a part of the Caversham project to phonetically identify the names of different instances of the same person from different sources. In other words, it can be used for phonetically identifying duplicate entries of an entity or word. The difference between the v1 and v2 of the algorithm is in the pre-precessing of the words before comparing.

from spellwise import CaverphoneTwo # or CaverphoneOne

caverphone = CaverphoneTwo()
caverphone.add_from_path("examples/data/american-english")

suggestions = caverphone.get_suggestions("run")
# Print the top 10 suggestions
print("Word \t Distance")
print("=================")
for suggestion in suggestions[0:10]:
    print("{} \t {}".format(suggestion.get("word"), suggestion.get("distance")))

Caverphone v2 provides the following suggestions,

Word 	 Distance
=================
rain 	 0
ran 	 0
rein 	 0
rene 	 0
roan 	 0
ron 	 0
ruin 	 0
run 	 0
rune 	 0
wren 	 0

(4) Typox

The Typox is a Typographic based correction algorithm optimised for correcting typos in QWERTY keyboard. This is based on the philosophy of the Editex algorithm by grouping of letters is based on their locations on the keyboard, instead of matching them phonetically. This might not be the exact implementation of the algorithm since the original paper is not available to read for free.

from spellwise import Typox

typox = Typox()
typox.add_from_path("examples/data/american-english")

suggestions = typox.get_suggestions("ohomr")
# Print the top 10 suggestions
print("Word \t Distance")
print("=================")
for suggestion in suggestions[0:10]:
    print("{} \t {}".format(suggestion.get("word"), suggestion.get("distance")))

Typox provides the following words,

Word 	 Distance
=================
home 	 2
phone 	 2

Notice that Typox didn not suggest words like choke, come, chore, chose etc., (which Levenshtein would suggest) even though they are of edit-distance 2 with the word ohome. But it rather suggests closest wrods based on the QWERTY keyboard layout which are phone and home.

⚡️ Memory and Time profiling

The following are the usage statistics on MacBook Pro, 2.4 GHz Quad-Core Intel Core i5 with 16 GB RAM.

Algorithm	No. of words	Memory used	Time to index	Time to inference	Remarks
Levenshtein	119,095	~ 127 MB	~ 1160 milliseconds	~ 36 milliseconds	For word "hallo" With max distance 2
Editex	119,095	~ 127 MB	~ 1200 milliseconds	~ 90 milliseconds	For word "hallo" With max distance 2
Caverphone 1.0	119,095	~ 36.7 MB	~ 1700 milliseconds	~ 0.2 milliseconds (yes right!)	For word "hallo"
Caverphone 2.0	119,095	~ 99 MB	~ 2400 milliseconds	~ 0.4 milliseconds (yes right!)	For word "hallo"
Typox	119,095	~ 127 MB	~ 1360 milliseconds	~ 200 milliseconds	For word "hallo" With max distance 2

🙌 Contributing

Please feel free to raise PRs! 😃

There are so many algorithms to be added and improvements to be made to this package. This is still in an early version and would love to have you contribute!

📝 References

Project details

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

0.8.1

Apr 7, 2021

0.8.0

Apr 7, 2021

0.7.0

Feb 7, 2021

0.6.2

Feb 7, 2021

0.5.2

Dec 30, 2020

This version

0.5.1

Dec 30, 2020

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spellwise-0.5.1.tar.gz (10.3 kB view hashes)

Uploaded Dec 30, 2020 Source

Hashes for spellwise-0.5.1.tar.gz

Hashes for spellwise-0.5.1.tar.gz
Algorithm	Hash digest
SHA256	`832fcc4d986c18fa6b81592a159984a87e6fe0fca20f832a7d834e5573124260`
MD5	`8e31712edc3e544565a1d1878766356b`
BLAKE2b-256	`2feb1eea3eedda998a5917ce8028bbcb9b14e99e2a447f15e0dc36640c713f06`