simplemma

A simple multilingual lemmatizer for Python.

Project description

Purpose

Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word’s lemma, or dictionary form. Unlike stemming, lemmatization outputs word units that are still valid linguistic forms.
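
The difference between the two approaches can be illustrated with a toy sketch. The suffix rules and the three-entry dictionary below are invented for illustration and are not part of simplemma; the point is only that blind suffix stripping may yield non-words, whereas a dictionary lookup returns attested forms.

```python
# Toy contrast between suffix stripping (stemming) and dictionary lookup
# (lemmatization). Both the rules and the dictionary are illustrative only.
TOY_LEMMAS = {"studies": "study", "studying": "study", "mice": "mouse"}


def toy_stem(word: str) -> str:
    """Strip a few common English suffixes without checking the result."""
    for suffix in ("ies", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lookup_lemma(word: str) -> str:
    """Look the word up in a form-to-lemma table; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)


for form in ("studies", "studying", "mice"):
    print(form, toy_stem(form), toy_lookup_lemma(form))
```

Here the stemmer turns "studies" into the non-word "stud" and cannot relate "mice" to "mouse", while the lookup maps both forms to dictionary entries.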

In modern natural language processing (NLP), this task is often tackled indirectly by more complex systems encompassing a whole processing pipeline. However, there appears to be no straightforward way to address lemmatization in Python, although this task can be crucial in fields such as information retrieval and NLP.

Simplemma provides a simple and multilingual approach to look for base forms or lemmata. It may not be as powerful as full-fledged solutions but it is generic, easy to install and straightforward to use. In particular, it does not need morphosyntactic information and can process a raw series of tokens or even a text with its built-in tokenizer. By design it should be reasonably fast and work in a large majority of cases, without being perfect.

With its comparatively small footprint it is especially useful when speed and simplicity matter, in low-resource contexts, for educational purposes, or as a baseline system for lemmatization and morphological analysis.

Currently, 49 languages are partly or fully supported (see table below).

Installation

The current library is written in pure Python with no dependencies:

pip install simplemma

pip3 where applicable
pip install -U simplemma for updates

Usage

Word-by-word

Simplemma is used by selecting a language of interest and then applying its data to a word or a list of words.

>>> import simplemma
# get a word
>>> myword = 'masks'
# decide which language to use and apply it on a word form
>>> simplemma.lemmatize(myword, lang='en')
'mask'
# grab a list of tokens
>>> mytokens = ['Hier', 'sind', 'Vaccines']
>>> for token in mytokens:
...     simplemma.lemmatize(token, lang='de')
...
'hier'
'sein'
'Vaccines'
# list comprehensions can be faster
>>> [simplemma.lemmatize(t, lang='de') for t in mytokens]
['hier', 'sein', 'Vaccines']

Chaining several languages can improve coverage. They are used in sequence:

>>> from simplemma import lemmatize
>>> lemmatize('Vaccines', lang=('de', 'en'))
'vaccine'
>>> lemmatize('spaghettis', lang='it')
'spaghettis'
>>> lemmatize('spaghettis', lang=('it', 'fr'))
'spaghetti'
>>> lemmatize('spaghetti', lang=('it', 'fr'))
'spaghetto'
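
The results above are consistent with a lookup that tries each language in turn and stops at the first match. The following is a minimal sketch of that idea, not simplemma's actual implementation: TOY_DATA and toy_chain_lemmatize are hypothetical names, and the tables hold only the two forms needed for the example.

```python
from typing import Dict, Sequence

# Hypothetical miniature form-to-lemma tables; real language data is far larger.
TOY_DATA: Dict[str, Dict[str, str]] = {
    "it": {"spaghetti": "spaghetto"},
    "fr": {"spaghettis": "spaghetti"},
}


def toy_chain_lemmatize(token: str, langs: Sequence[str]) -> str:
    """Return the lemma from the first language whose table contains the token."""
    for lang in langs:
        lemma = TOY_DATA.get(lang, {}).get(token)
        if lemma is not None:
            return lemma
    return token  # unknown in every language: leave the token unchanged


print(toy_chain_lemmatize("spaghettis", ("it",)))       # not in the Italian table
print(toy_chain_lemmatize("spaghettis", ("it", "fr")))  # found in the French table
print(toy_chain_lemmatize("spaghetti", ("it", "fr")))   # found in the Italian table first
```

Under this assumption the order of the languages matters: the first table containing the form decides the result.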

For certain languages a greedier decomposition is activated by default as it can be beneficial, mostly due to a certain capacity to address affixes in an unsupervised way. This can be triggered manually by setting the greedy parameter to True.

This option also triggers a stronger reduction through a further iteration of the search algorithm, e.g. “angekündigten” → “angekündigt” (standard) → “ankündigen” (greedy). In some cases it may be closer to stemming than to lemmatization.

# same example as before, comes to this result in one step
>>> simplemma.lemmatize('spaghettis', lang=('it', 'fr'), greedy=True)
'spaghetto'
# German case described above
# greedy, 2 steps: reduction to infinitive verb
>>> simplemma.lemmatize('angekündigten', lang='de', greedy=True)
'ankündigen'
# non-greedy, 1 step: reduction to past participle
>>> simplemma.lemmatize('angekündigten', lang='de', greedy=False)
'angekündigt'
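
The "further iteration" part of the greedy mode can be pictured as feeding the result of a lookup back into the lookup. The sketch below is an assumption-laden illustration: TOY_DE and toy_greedy_lemmatize are hypothetical, the loop bound max_steps is arbitrary, and the affix decomposition that simplemma also performs is not modelled.

```python
from typing import Dict

# Hypothetical two-entry table reproducing the German example above.
TOY_DE: Dict[str, str] = {
    "angekündigten": "angekündigt",
    "angekündigt": "ankündigen",
}


def toy_greedy_lemmatize(token: str, table: Dict[str, str],
                         greedy: bool = False, max_steps: int = 5) -> str:
    """Look the token up once; in greedy mode, repeat until the result is stable."""
    lemma = table.get(token, token)
    if not greedy:
        return lemma
    for _ in range(max_steps):  # bounded to guard against cyclic entries
        further = table.get(lemma, lemma)
        if further == lemma:
            break
        lemma = further
    return lemma


print(toy_greedy_lemmatize("angekündigten", TOY_DE, greedy=False))
print(toy_greedy_lemmatize("angekündigten", TOY_DE, greedy=True))
```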

The additional function is_known() checks if a given word is present in the language data:

>>> from simplemma import is_known
>>> is_known('spaghetti', lang='it')
True
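
A membership check of this kind is handy for spotting out-of-vocabulary tokens before lemmatizing. The helper below is a sketch: unknown_tokens is a hypothetical name, and the check is passed in as a callable so that the example runs without the library (a small set stands in for the language data).

```python
from typing import Callable, Iterable, List


def unknown_tokens(tokens: Iterable[str], known: Callable[[str], bool]) -> List[str]:
    """Return the tokens for which the supplied membership check fails."""
    return [token for token in tokens if not known(token)]


# Stand-in vocabulary for demonstration purposes only.
toy_vocabulary = {"spaghetti", "pasta"}
print(unknown_tokens(["spaghetti", "spaghettini", "pasta"], toy_vocabulary.__contains__))
```

With simplemma installed, the check could presumably be supplied as functools.partial(is_known, lang='it'), based on the signature shown above.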

Tokenization

A simple tokenization function is included for convenience:

>>> from simplemma import simple_tokenizer
>>> simple_tokenizer('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.')
['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', 'adipiscing', 'elit', ',', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', '.']
# use iterator instead
>>> simple_tokenizer('Lorem ipsum dolor sit amet', iterate=True)
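
For readers curious about what such a tokenizer involves, here is a rough regex-based approximation. It is not the pattern used by simplemma: TOKEN_PATTERN and toy_tokenizer are hypothetical, and the rule (runs of word characters, or single punctuation marks) is a deliberate simplification that merely mimics the list/iterator switch shown above.

```python
import re
from typing import Iterator, List, Union

# A run of word characters, or any single character that is neither
# a word character nor whitespace (i.e. punctuation and symbols).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def toy_tokenizer(text: str, iterate: bool = False) -> Union[List[str], Iterator[str]]:
    """Split text into tokens; return a lazy iterator if iterate is True, else a list."""
    matches = (match.group(0) for match in TOKEN_PATTERN.finditer(text))
    return matches if iterate else list(matches)


print(toy_tokenizer("dolor sit amet, consectetur."))
for token in toy_tokenizer("Lorem ipsum", iterate=True):
    print(token)
```

The iterator variant avoids building the whole token list in memory, which matters mostly for long texts.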

The functions text_lemmatizer() and lemma_iterator() chain tokenization and lemmatization. They can take greedy (affecting lemmatization) and silent (affecting errors and logging) as arguments:

>>> from simplemma import text_lemmatizer
>>> sentence = 'Sou o intervalo entre o que desejo ser e os outros me fizeram.'
# caveat: desejo is also a noun, should be desejar here
>>> text_lemmatizer(sentence, lang='pt')
['ser', 'o', 'intervalo', 'entre', 'o', 'que', 'desejo', 'ser', 'e', 'o', 'outro', 'me', 'fazer', '.']
# same principle, returns a generator and not a list
>>> from simplemma import lemma_iterator
>>> lemma_iterator(sentence, lang='pt')
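
The relationship between the two functions can be sketched as a generator and a thin list wrapper around it. This is an illustration under assumptions, not the library code: toy_lemma_iterator and toy_text_lemmatizer are hypothetical, and the tokenizer and lemmatizer are passed in as callables so that simple stand-ins suffice.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def toy_lemma_iterator(text: str,
                       tokenize: Callable[[str], Iterable[str]],
                       lemmatize: Callable[[str], str]) -> Iterator[str]:
    """Yield one lemma per token, without materializing a list."""
    for token in tokenize(text):
        yield lemmatize(token)


def toy_text_lemmatizer(text: str,
                        tokenize: Callable[[str], Iterable[str]],
                        lemmatize: Callable[[str], str]) -> List[str]:
    """Same pipeline, collected into a list."""
    return list(toy_lemma_iterator(text, tokenize, lemmatize))


# Stand-ins: whitespace splitting and a three-entry lookup table.
toy_table: Dict[str, str] = {"os": "o", "outros": "outro", "fizeram": "fazer"}
print(toy_text_lemmatizer("os outros me fizeram", str.split,
                          lambda token: toy_table.get(token, token)))
```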

Caveats

# don't expect too much though
# this diminutive form isn't in the model data, the result should read 'spaghettino'
>>> simplemma.lemmatize('spaghettini', lang='it')
'spaghettini'
# the algorithm cannot choose between valid alternatives yet
# 'son' is a valid common noun, but what about the verb form?
>>> simplemma.lemmatize('son', lang='es')
'son'

As the focus lies on overall coverage, some short frequent words (typically pronouns and conjunctions) may need post-processing. This generally concerns a few dozen tokens per language.
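
One possible form of such post-processing is a small hand-written override table consulted before the lemmatizer. The sketch below is only a suggestion: lemmatize_with_overrides and the table contents are hypothetical, and the lemmatizer is injected as a callable so that a stand-in can be used here.

```python
from typing import Callable, Dict


def lemmatize_with_overrides(token: str,
                             lemmatize: Callable[[str], str],
                             overrides: Dict[str, str]) -> str:
    """Prefer a hand-written mapping for the (lowercased) token, else call the lemmatizer."""
    key = token.lower()
    if key in overrides:
        return overrides[key]
    return lemmatize(token)


# Hypothetical override table for a handful of English pronouns.
toy_overrides = {"us": "we", "them": "they"}
# Stand-in lemmatizer that only strips a final "s".
toy_lemmatizer = lambda token: token[:-1] if token.endswith("s") else token

print(lemmatize_with_overrides("Us", toy_lemmatizer, toy_overrides))
print(lemmatize_with_overrides("masks", toy_lemmatizer, toy_overrides))
```

Keeping the overrides outside the lemmatizer makes them easy to review and to adapt per language.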

The current absence of morphosyntactic information is both an advantage in terms of simplicity and a hard limit on lemmatization accuracy, e.g. for the disambiguation between past participles and adjectives derived from verbs in Germanic and Romance languages. In most such cases, simplemma leaves the input word unchanged.

The greedy algorithm seldom produces invalid forms. It is designed to work best in the low-frequency range, notably for compound words and neologisms. Aggressive decomposition is only useful as a general approach in the case of morphologically-rich languages, where it can also act as a linguistically motivated stemmer.

Bug reports via the issues page are welcome.

Language detection

Language detection works by providing a text and a tuple lang consisting of a series of languages of interest. Scores between 0 and 1 are returned.

The lang_detector() function returns a list of language codes along with scores and adds “unk” for unknown or out-of-vocabulary words. The proportion of known words can also be calculated directly by using the function in_target_language(), which returns a ratio.

# import necessary functions
>>> from simplemma.langdetect import in_target_language, lang_detector
# language detection
>>> lang_detector('"Moderní studie narazily na několik tajemství." Extracted from Wikipedia.', lang=("cs", "sk"))
[('cs', 0.625), ('unk', 0.375), ('sk', 0.125)]
# proportion of known words
>>> in_target_language("opera post physica posita (τὰ μετὰ τὰ φυσικά)", lang="la")
0.5
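
The shape of these results (one score per language plus an "unk" entry, not summing to 1) is what one would get from counting, for each language, the share of tokens found in its vocabulary. The sketch below illustrates that reading with invented data; toy_lang_detector and the vocabularies are hypothetical, and simplemma's actual scoring may differ.

```python
from typing import Dict, List, Sequence, Set, Tuple


def toy_lang_detector(tokens: Sequence[str],
                      vocabularies: Dict[str, Set[str]],
                      langs: Sequence[str]) -> List[Tuple[str, float]]:
    """Score each language by its share of known tokens; 'unk' covers tokens known nowhere."""
    total = len(tokens)
    if total == 0:
        return [("unk", 1.0)]  # arbitrary convention for empty input
    scores = []
    for lang in langs:
        known = sum(1 for t in tokens if t.lower() in vocabularies[lang])
        scores.append((lang, known / total))
    unknown = sum(1 for t in tokens
                  if not any(t.lower() in vocabularies[lang] for lang in langs))
    scores.append(("unk", unknown / total))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Invented miniature vocabularies; "na" is shared by both languages.
toy_vocabularies = {"cs": {"moderní", "studie", "na"}, "sk": {"na"}}
toy_tokens = ["Moderní", "studie", "na", "Extracted", "Wikipedia"]
print(toy_lang_detector(toy_tokens, toy_vocabularies, ("cs", "sk")))
```

Because a token can belong to several vocabularies at once, the per-language scores overlap, which is why they need not add up to 1.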

Supported languages

The following languages are available using their BCP 47 language tag. This is usually the ISO 639-1 code; if no such code exists, an ISO 639-3 code is used instead:

Available languages (2022-01-20)
Code	Language	Forms (10³)	Acc.	Comments
ast	Asturian	124
bg	Bulgarian	204
ca	Catalan	579
cs	Czech	187	0.89	on UD CS-PDT
cy	Welsh	360
da	Danish	554	0.92	on UD DA-DDT, alternative: lemmy
de	German	675	0.95	on UD DE-GSD, see also German-NLP list
el	Greek	181	0.88	on UD EL-GDT
en	English	131	0.94	on UD EN-GUM, alternative: LemmInflect
enm	Middle English	38
es	Spanish	665	0.95	on UD ES-GSD
et	Estonian	119		low coverage
fa	Persian	12		experimental
fi	Finnish	3,199		see this benchmark
fr	French	217	0.94	on UD FR-GSD
ga	Irish	372
gd	Gaelic	48
gl	Galician	384
gv	Manx	62
hbs	Serbo-Croatian	656		Croatian and Serbian lists to be added later
hi	Hindi	58		experimental
hu	Hungarian	458
hy	Armenian	246
id	Indonesian	17	0.91	on UD ID-CSUI
is	Icelandic	174
it	Italian	333	0.93	on UD IT-ISDT
ka	Georgian	65
la	Latin	843
lb	Luxembourgish	305
lt	Lithuanian	247
lv	Latvian	164
mk	Macedonian	56
ms	Malay	14
nb	Norwegian (Bokmål)	617
nl	Dutch	250	0.92	on UD-NL-Alpino
nn	Norwegian (Nynorsk)	56
pl	Polish	3,211	0.91	on UD-PL-PDB
pt	Portuguese	924	0.92	on UD-PT-GSD
ro	Romanian	311
ru	Russian	595		alternative: pymorphy2
se	Northern Sámi	113
sk	Slovak	818	0.92	on UD SK-SNK
sl	Slovene	136
sq	Albanian	35
sv	Swedish	658		alternative: lemmy
sw	Swahili	10		experimental
tl	Tagalog	32		experimental
tr	Turkish	1,232	0.89	on UD-TR-Boun
uk	Ukrainian	370		alternative: pymorphy2

Low coverage mentions mean that one would probably be better off with a language-specific library, but simplemma will work to a limited extent. Open-source alternatives for Python are referenced if possible.

Experimental mentions indicate that the language remains untested or that there could be issues with the underlying data or lemmatization process.

The scores are calculated on Universal Dependencies treebanks on single word tokens (including some contractions but not merged prepositions). They describe to what extent simplemma can accurately map tokens to their lemma form. They can be reproduced by concatenating all available UD files and by using the script udscore.py in the tests/ folder.
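
In its simplest reading, such a score is the fraction of tokens whose predicted lemma equals the gold lemma from the treebank. The sketch below shows that computation only; it is not the udscore.py script, lemma_accuracy is a hypothetical name, and details such as case handling or token filtering are left out.

```python
from typing import Callable, Iterable, Tuple


def lemma_accuracy(pairs: Iterable[Tuple[str, str]],
                   lemmatize: Callable[[str], str]) -> float:
    """Share of (form, gold lemma) pairs for which the lemmatizer returns the gold lemma."""
    total = 0
    correct = 0
    for form, gold in pairs:
        total += 1
        if lemmatize(form) == gold:
            correct += 1
    return correct / total if total else 0.0


# Invented gold data and a stand-in lemmatizer that misses one irregular form.
toy_gold = [("masks", "mask"), ("cats", "cat"), ("mice", "mouse"), ("went", "go")]
toy_table = {"masks": "mask", "cats": "cat", "mice": "mouse"}
print(lemma_accuracy(toy_gold, lambda form: toy_table.get(form, form)))
```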

This library is particularly relevant as regards the lemmatization of less frequent words. Its performance in this case is only incidentally captured by the benchmark above. In some languages, a fixed number of words such as pronouns can be further mapped by hand to enhance performance.

Speed

Orders of magnitude given for reference only, measured on an old laptop to give a lower bound:

Tokenization: > 1 million tokens/sec
Lemmatization: > 250,000 words/sec

Installing the most recent Python version can improve speed.
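
Throughput figures of this kind can be approximated on one's own machine with a simple timing loop. The helper below is a sketch: words_per_second is a hypothetical name, the function under test is passed in as a callable, and taking the best of a few runs is just one common way to reduce noise.

```python
import time
from typing import Callable, Sequence


def words_per_second(words: Sequence[str],
                     func: Callable[[str], str],
                     repeats: int = 3) -> float:
    """Apply func to every word, repeat a few times, and report the best observed rate."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for word in words:
            func(word)
        best = min(best, time.perf_counter() - start)
    return len(words) / best if best > 0 else float("inf")


# Stand-in workload; with simplemma installed, func would wrap the lemmatizer instead.
rate = words_per_second(["masks"] * 10000, str.lower)
print(f"{rate:,.0f} words/sec")
```

Results depend heavily on hardware, Python version and the word list, so they are best compared only within one setup.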

Optional pre-compilation with mypyc

pip3 install mypy
clone or download the source code from the repository
python3 setup.py --use-mypyc bdist_wheel
pip3 install dist/*.whl (where * is the compiled wheel)

Roadmap

[-] Add further lemmatization lists
[ ] Grammatical categories as option
[ ] Function as a meta-package?
[ ] Integrate optional, more complex models?

Credits and licenses

The software is released under the MIT license. For the linguistic information databases, see the licenses folder.

The surface lookups (non-greedy mode) use lemmatization lists derived from various sources, ordered by relative importance:

Lemmatization lists by Michal Měchura (Open Database License)
Wiktionary entries packaged by the Kaikki project
FreeLing project
spaCy lookups data
Unimorph Project
Wikinflection corpus by Eleni Metheniti (CC BY 4.0 License)

Contributions

See this list of contributors to the repository.

Feel free to contribute, notably by filing issues for feedback, bug reports, or links to further lemmatization lists, rules and tests.

Contributions by pull requests should follow these conventions: code style with black, type hinting with mypy, included tests with pytest.

References

To cite this software:

Barbaresi A. (year). Simplemma: a simple multilingual lemmatizer for Python [Computer software] (Version version number). Berlin, Germany: Berlin-Brandenburg Academy of Sciences. Available from https://github.com/adbar/simplemma DOI: 10.5281/zenodo.4673264

This work draws from lexical analysis algorithms used in:

Barbaresi, A., & Hein, K. (2017). Data-driven identification of German phrasal compounds. In International Conference on Text, Speech, and Dialogue, Springer, pp. 192-200.
Barbaresi, A. (2016). An unsupervised morphological criterion for discriminating similar languages. In 3rd Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2016), Association for Computational Linguistics, pp. 212-220.
Barbaresi, A. (2016). Bootstrapped OCR error detection for a less-resourced language variant. In 13th Conference on Natural Language Processing (KONVENS 2016), pp. 21-26.

Release history

0.9.1 (Jan 20, 2023), this version
0.9.0 (Oct 18, 2022)
0.8.2 (Sep 5, 2022)
0.8.1 (Sep 1, 2022)
0.8.0 (Aug 2, 2022)
0.7.0 (Jun 16, 2022)
0.6.0 (Apr 6, 2022)
0.5.0 (Nov 19, 2021)
0.4.0 (Oct 19, 2021)
0.3.0 (Apr 8, 2021)
0.2.2 (Feb 24, 2021)
0.2.1 (Feb 2, 2021)
0.2.0 (Jan 25, 2021)
0.1.0 (Jan 18, 2021)

Download files

Source Distribution

simplemma-0.9.1.tar.gz (75.5 MB), uploaded Jan 20, 2023

Built Distribution

simplemma-0.9.1-py3-none-any.whl (75.5 MB), uploaded Jan 20, 2023, Python 3

Hashes for simplemma-0.9.1.tar.gz
Algorithm	Hash digest
SHA256	`98ebcaf659bd1e281d9d87716a95d5c148d318dc18d866d3549990d6b3334749`
MD5	`ccaf192f0ed8f332438617c706e5df0a`
BLAKE2b-256	`f10ff6cff760d641b0ff5d3a8978d791eafdcfeffd762c2b103b5e7a3a511f16`

Hashes for simplemma-0.9.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7f7371b325302ce522d90bae85b29839f87918fb1201d38359de9ea3b7467e65`
MD5	`11063927f9624a1261781cf8e412645d`
BLAKE2b-256	`748724e3ce8de234998171fe0161cbac221cbfa51b28043d0441a9980758b768`