simplemma

A simple multilingual lemmatizer for Python.

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word’s lemma, or dictionary form. Unlike stemming, lemmatization outputs word units that are still valid linguistic forms.

In modern natural language processing (NLP), this task is often tackled indirectly by more complex systems encompassing a whole processing pipeline. However, there appears to be no straightforward way to address lemmatization on its own in Python, although the task is useful in information retrieval and NLP.

Simplemma provides a simple and multilingual approach to look for base forms or lemmata. It may not be as powerful as full-fledged solutions but it is generic, easy to install and straightforward to use. By design it should be reasonably fast and work in a large majority of cases, without being perfect. Currently, 35 languages are partly or fully supported (see the table below).

With its comparatively small footprint it is especially useful when speed and simplicity matter, for educational purposes or as a baseline system for lemmatization and morphological analysis.

Installation

The current library is written in pure Python with no dependencies:

pip install simplemma (or pip3 where applicable)

Usage

Word-by-word

Simplemma is used by selecting a language of interest and then applying its data to a list of words.

>>> import simplemma
# get a word
>>> myword = 'masks'
# decide which language data to load
>>> langdata = simplemma.load_data('en')
# apply it on a word form
>>> simplemma.lemmatize(myword, langdata)
'mask'
# grab a list of tokens
>>> mytokens = ['Hier', 'sind', 'Vaccines']
>>> langdata = simplemma.load_data('de')
>>> for token in mytokens:
...     simplemma.lemmatize(token, langdata)
...
'hier'
'sein'
'Vaccines'
# list comprehensions can be faster
>>> [simplemma.lemmatize(t, langdata) for t in mytokens]
['hier', 'sein', 'Vaccines']

Chaining several languages can improve coverage:

>>> langdata = simplemma.load_data('de', 'en')
>>> simplemma.lemmatize('Vaccines', langdata)
'vaccine'
>>> langdata = simplemma.load_data('it')
>>> simplemma.lemmatize('spaghettis', langdata)
'spaghettis'
>>> langdata = simplemma.load_data('it', 'fr')
>>> simplemma.lemmatize('spaghettis', langdata)
'spaghetti'
>>> simplemma.lemmatize('spaghetti', langdata)
'spaghetto'
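
The fallback behaviour seen above can be pictured as an ordered lookup across several tables. The following is only a minimal sketch under that assumption, not simplemma's actual implementation: the names TOY_IT, TOY_FR and chained_lookup are hypothetical, and the tiny dictionaries stand in for real language data.

```python
# Toy lemma tables standing in for real language data (hypothetical content).
TOY_IT = {"spaghetti": "spaghetto"}
TOY_FR = {"spaghettis": "spaghetti"}


def chained_lookup(token, tables):
    """Return the lemma from the first table that contains the token."""
    for table in tables:
        if token in table:
            return table[token]
    # Unknown in every table: hand the token back unchanged.
    return token


print(chained_lookup("spaghettis", [TOY_IT]))          # spaghettis
print(chained_lookup("spaghettis", [TOY_IT, TOY_FR]))  # spaghetti
print(chained_lookup("spaghetti", [TOY_IT, TOY_FR]))   # spaghetto
```

In this sketch the order of the tables matters: the first table that knows the token wins, so the language of interest would be listed first.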

There are cases in which a greedier decomposition and lemmatization algorithm is better. It is deactivated by default:

# same example as before, comes to this result in one step
>>> simplemma.lemmatize('spaghettis', langdata, greedy=True)
'spaghetto'
# a German case
>>> langdata = simplemma.load_data('de')
>>> simplemma.lemmatize('angekündigten', langdata, greedy=True)
'ankündigen' # infinitive verb
>>> simplemma.lemmatize('angekündigten', langdata, greedy=False)
'angekündigt' # past participle
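
One way to picture the "one step" effect of the greedy mode is a lookup that is re-applied until the form stops changing. This is a simplified sketch under that assumption; the library's greedy strategy also involves decomposition and may work differently. TOY_TABLE, lookup and greedy_lookup are hypothetical names.

```python
# Toy table merging the two entries used in the spaghetti example (hypothetical).
TOY_TABLE = {"spaghettis": "spaghetti", "spaghetti": "spaghetto"}


def lookup(token, table):
    """Single dictionary lookup; unknown tokens are returned unchanged."""
    return table.get(token, token)


def greedy_lookup(token, table, max_steps=5):
    """Re-apply the lookup until the form no longer changes (or max_steps is hit)."""
    current = token
    for _ in range(max_steps):
        candidate = lookup(current, table)
        if candidate == current:
            break
        current = candidate
    return current


print(lookup("spaghettis", TOY_TABLE))         # spaghetti
print(greedy_lookup("spaghettis", TOY_TABLE))  # spaghetto
```

The max_steps bound is a design choice of this sketch: it guards against tables in which two forms point at each other.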

Tokenization

A simple tokenization is included for convenience:

>>> from simplemma import simple_tokenizer
>>> simple_tokenizer('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.')
['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', 'adipiscing', 'elit', ',', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', '.']

The function text_lemmatizer() chains tokenization and lemmatization. It can take greedy and silent as arguments:

>>> import simplemma
>>> from simplemma import text_lemmatizer
>>> langdata = simplemma.load_data('pt')
# caveat: desejo is also a noun, should be desejar here
>>> text_lemmatizer('Sou o intervalo entre o que desejo ser e os outros me fizeram.', langdata)
['ser', 'o', 'intervalo', 'entre', 'o', 'que', 'desejo', 'ser', 'e', 'o', 'outro', 'me', 'fazer', '.']
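
Chaining tokenization and lemmatization can be sketched in a few lines of plain Python. The regular expression, the toy table and the function names below are assumptions made for illustration; they are not the patterns or data simplemma uses.

```python
import re

# Words, or single punctuation marks (an assumed, deliberately simple pattern).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

# Toy Portuguese table (hypothetical content).
TOY_PT = {"os": "o", "outros": "outro", "fizeram": "fazer"}


def toy_tokenizer(text):
    """Split text into word tokens and punctuation marks."""
    return TOKEN_PATTERN.findall(text)


def toy_text_lemmatizer(text, table):
    """Tokenize, then look each token up; unknown tokens stay as they are."""
    return [table.get(token, token) for token in toy_tokenizer(text)]


print(toy_tokenizer("os outros me fizeram."))
# ['os', 'outros', 'me', 'fizeram', '.']
print(toy_text_lemmatizer("os outros me fizeram.", TOY_PT))
# ['o', 'outro', 'me', 'fazer', '.']
```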

Caveats

# don't expect too much though
>>> langdata = simplemma.load_data('it')
# this diminutive form isn't in the model data
>>> simplemma.lemmatize('spaghettini', langdata)
'spaghettini' # should read 'spaghettino'
# the algorithm cannot choose between valid alternatives yet
>>> langdata = simplemma.load_data('es')
>>> simplemma.lemmatize('son', langdata)
'son' # valid common noun, but what about the verb form?

As the focus lies on overall coverage, some short frequent words (typically pronouns) may need post-processing. This generally concerns 10-20 tokens per language.
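
Such post-processing could take the form of a small override table applied after lemmatization. The sketch below is an assumption about how a user might do this, not a feature of the library; OVERRIDES_ES and postprocess are hypothetical names, and forcing 'son' to 'ser' deliberately ignores the noun reading.

```python
# Hand-written corrections for a few short, frequent tokens (hypothetical content).
OVERRIDES_ES = {"son": "ser"}


def postprocess(token, lemma, overrides):
    """Prefer a hand-written override for the token, else keep the given lemma."""
    return overrides.get(token.lower(), lemma)


# The second argument is whatever the lemmatizer returned for the token.
print(postprocess("son", "son", OVERRIDES_ES))     # ser
print(postprocess("casas", "casa", OVERRIDES_ES))  # casa
```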

The greedy algorithm rarely produces forms that are not valid. Still, it is mainly useful on long words and neologisms, not for general approaches.

Bug reports via the issues page are welcome.

Supported languages

The following languages are available using their ISO 639-1 code:

Available languages (2021-02-02)
Code	Language	Word pairs	Scores	Comments
bg	Bulgarian	69,680		low coverage
ca	Catalan	583,969
cs	Czech	35,021		low coverage
cy	Welsh	349,638
da	Danish	555,559		alternative: lemmy
de	German	623,249	0.94	on UD DE-GSD. See also this list
en	English	136,226	0.93	on UD EN-GUM. Alternative: LemmInflect
es	Spanish	666,016	0.87	on UD ES-GSD.
et	Estonian	112,501		low coverage
fa	Persian	9,333		low coverage
fi	Finnish	2,096,328		alternative: voikko
fr	French	217,091	0.93	on UD FR-GSD.
ga	Irish	366,086
gd	Gaelic	49,080
gl	Galician	386,714
gv	Manx	63,667
hu	Hungarian	446,650
id	Indonesian	36,461
it	Italian	333,682
ka	Georgian	65,938
la	Latin	96,409		low coverage
lb	Luxembourgish	305,398
lt	Lithuanian	247,418
lv	Latvian	57,154
nl	Dutch	228,123
pt	Portuguese	933,730
ro	Romanian	313,181
ru	Russian	608,770		alternative: pymorphy2
sk	Slovak	847,383
sl	Slovene	97,460		low coverage
sv	Swedish	663,984		alternative: lemmy
tr	Turkish	1,333,970
uk	Ukrainian	190,725		alternative: pymorphy2
ur	Urdu	28,848

A "low coverage" mention means you’d probably be better off with a language-specific library, but simplemma will work to a limited extent. Open-source alternatives for Python are referenced if available.

The scores are calculated on Universal Dependencies treebanks on single word tokens (including some contractions but not merged prepositions). They describe to what extent simplemma can accurately map tokens to their lemma form.
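
A score of this kind can be read as the share of tokens whose predicted lemma equals the reference lemma. The function below is a minimal sketch under that assumption; the exact evaluation procedure (for example, case handling) is not specified here, and lemma_accuracy is a hypothetical name.

```python
def lemma_accuracy(predicted, gold):
    """Share of positions where the predicted lemma equals the gold lemma."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Made-up example: two of three lemmata match the reference.
predicted = ["hier", "sein", "Vaccines"]
gold = ["hier", "sein", "Vaccine"]
print(round(lemma_accuracy(predicted, gold), 2))  # 0.67
```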

Software under MIT license; for the linguistic information databases, see the licenses folder.
Documentation: https://github.com/adbar/simplemma

Roadmap

[-] Add further lemmatization lists
[ ] Grammatical categories as option
[ ] Function as a meta-package?
[ ] Integrate optional, more complex models?

Credits

The current version basically acts as a wrapper for lemmatization lists:

Lemmatization lists by Michal Měchura (Open Database License)
Wikinflection corpus by Eleni Metheniti (CC BY 4.0 License)
Unimorph Project
FreeLing project
spaCy lookups data

This rule-based approach, which relies on inflection and lemmatization dictionaries, is still used today in popular libraries such as spaCy.

Contributions

Feel free to contribute, notably by filing issues for feedback, bug reports, or links to further lemmatization lists, rules and tests.

You can also contribute to this lemmatization list repository.

Release history

0.9.1

Jan 20, 2023

0.9.0

Oct 18, 2022

0.8.2

Sep 5, 2022

0.8.1

Sep 1, 2022

0.8.0

Aug 2, 2022

0.7.0

Jun 16, 2022

0.6.0

Apr 6, 2022

0.5.0

Nov 19, 2021

0.4.0

Oct 19, 2021

0.3.0

Apr 8, 2021

0.2.2

Feb 24, 2021

This version

0.2.1

Feb 2, 2021

0.2.0

Jan 25, 2021

0.1.0

Jan 18, 2021

Download files

Source Distribution

simplemma-0.2.1.tar.gz (11.8 kB)

Uploaded Feb 2, 2021 Source

Built Distribution

simplemma-0.2.1-py3-none-any.whl (46.3 MB)

Uploaded Feb 2, 2021 Python 3

Hashes for simplemma-0.2.1.tar.gz

Algorithm	Hash digest
SHA256	`7e11c10fbdc8e06907eb133703e735313bd417fbc8f70417e96d3338fb8eeda6`
MD5	`03bb3178ede821ce0baefca8b0663ac9`
BLAKE2b-256	`28f0742daf7267ad95d71819904b2c29d8a545f8518635a59fa18f100c53d61a`

Hashes for simplemma-0.2.1-py3-none-any.whl

Algorithm	Hash digest
SHA256	`e60596fb4e29f6b1eab68a7da702ccced1978fcabc452d86702ef7e0d758ffcc`
MD5	`00af7b1fb3fe5b4b0d0504f802bdc4c1`
BLAKE2b-256	`2ac99a9ff5146591bcfe261bbd8bed05b8760132038d4ef3b4a6707b59f9bdad`