A simple lemmatizer based on Unitex word lists
This is a simple lemmatization module based on the Unitex inflected word list. It therefore needs a Unitex vocabulary file in order to work.
So far, I’ve only used it with Portuguese, with the DELAF_PB file provided by NILC.
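To give an idea of what such a word list contains, here is a minimal sketch of reading one entry. It assumes the usual DELAF line layout, `inflected,lemma.POS:features`, and is not taken from this module's source. The names `DelafEntry` and `parse_delaf_line` are hypothetical, the sample line is illustrative, and escaped commas or dots inside a word are not handled.

```python
from typing import NamedTuple, Optional


class DelafEntry(NamedTuple):
    inflected: str
    lemma: str
    pos: str


def parse_delaf_line(line: str) -> Optional[DelafEntry]:
    """Parse one line assumed to look like 'inflected,lemma.POS:features'."""
    line = line.strip()
    if not line or "," not in line:
        return None  # blank or malformed line
    inflected, rest = line.split(",", 1)
    if "." not in rest:
        return None
    lemma, tags = rest.split(".", 1)
    # Keep only the grammatical code: drop the inflection features
    # after ':' and any '+' traits attached to the code.
    pos = tags.split(":", 1)[0].split("+", 1)[0]
    # Assumption: an empty lemma field means the lemma equals the inflected form.
    return DelafEntry(inflected, lemma or inflected, pos)


# Illustrative line, not copied from the DELAF_PB file.
print(parse_delaf_line("corpora,corpus.N:mp"))
```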
Installing
You can either clone the repository and install with:
$ python setup.py install
or install through pip:
$ pip install unitexlemmatizer
Usage
In order to use the Unitex Lemmatizer, you need to tell it where the word list is:
>>> import unitexlemmatizer as ul
>>> ul.load_unitex_dictionary('/path/to/delaf.dic')
Then call the get_lemma function, passing the inflected word and its part-of-speech tag (from the Universal Dependencies tagset):
>>> ul.get_lemma('corpora', 'noun')
'corpus'
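When lemmatizing a whole tagged sentence, a small wrapper can be convenient. The sketch below is only an illustration: `lemmatize_tagged` is a hypothetical helper, and `demo_get_lemma` is a stand-in with a single hard-coded entry so that the example runs without a dictionary file. It assumes that a lookup may return nothing for an unknown word (how `ul.get_lemma` behaves in that case is not documented here) and that tags should be lowercased, as in the `'noun'` example above.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def lemmatize_tagged(
    tokens: Iterable[Tuple[str, str]],
    get_lemma: Callable[[str, str], Optional[str]],
) -> List[str]:
    """Return one lemma per (word, tag) pair, keeping the word if no lemma is found."""
    lemmas = []
    for word, tag in tokens:
        lemma = get_lemma(word, tag.lower())
        # Fall back to the surface form when the lookup yields nothing.
        lemmas.append(lemma if lemma else word)
    return lemmas


# Hypothetical stand-in for ul.get_lemma, backed by a one-entry table.
_DEMO_TABLE = {("corpora", "noun"): "corpus"}


def demo_get_lemma(word: str, pos: str) -> Optional[str]:
    return _DEMO_TABLE.get((word, pos))


print(lemmatize_tagged([("corpora", "NOUN"), ("xyz", "X")], demo_get_lemma))
# prints ['corpus', 'xyz']
```

With the real module loaded, `ul.get_lemma` could be passed in place of `demo_get_lemma`, provided it accepts the same two arguments.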