Python version of the BulStem stemming algorithm

BulStem-py: A Python Re-implementation of BulStem - inflectional stemmer for Bulgarian

Introduction

This is the Python version of the BulStem stemming algorithm. It follows the algorithm presented in:

Nakov, P. BulStem: Design and evaluation of inflectional stemmer for Bulgarian. In Workshop on Balkan Language Resources and Tools (Balkan Conference in Informatics).

See http://people.ischool.berkeley.edu/~nakov/bulstem/ for the homepage of the algorithm. Also, check the original paper for more details and examples.

Implementation

Unlike other available implementations, this one uses a trie instead of a dictionary/hash table to find the longest rule that can be applied to a given token. The stemmer class is derived from NLTK's StemmerI interface, which makes it compatible with NLTK pipelines.

Basic algorithm steps:

1. Find the position of the first vowel in the token.
2. Find the longest applicable rule by traversing the token in reverse order until a suffix matches or the position of the first vowel (found in step 1) is reached.
3. Prepend the non-stemmed prefix to the stemmed suffix (from step 2).
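
The three steps above can be sketched in a few lines of Python. This is a minimal illustration and not the library's actual code: the names `build_suffix_trie` and `stem_token`, the `VOWELS` set, and the `_END` marker key are assumptions made for this example, and frequency filtering and the left context are left out.

```python
VOWELS = set("аъоуеиюя")  # assumed vowel set for this sketch

_END = "$stem"  # marker key (assumption) holding the replacement at a rule's end


def build_suffix_trie(rules):
    """Build a trie keyed by the characters of each suffix in reverse order.

    `rules` maps a suffix to its replacement, e.g. {"ой": "о"}.
    """
    root = {}
    for suffix, replacement in rules.items():
        node = root
        for ch in reversed(suffix):
            node = node.setdefault(ch, {})
        node[_END] = replacement
    return root


def stem_token(token, trie):
    # Step 1: position of the first vowel; tokens without one are returned as-is.
    first_vowel = next((i for i, ch in enumerate(token) if ch in VOWELS), None)
    if first_vowel is None:
        return token

    # Step 2: walk from the end of the token towards the first vowel,
    # remembering the longest suffix seen so far that has a rule.
    node, best = trie, None
    for pos in range(len(token) - 1, first_vowel, -1):
        node = node.get(token[pos])
        if node is None:
            break
        if _END in node:
            best = (pos, node[_END])

    if best is None:
        return token

    # Step 3: prepend the untouched prefix to the stemmed suffix.
    pos, replacement = best
    return token[:pos] + replacement
```

Because the trie is walked one character at a time, the longest matching rule is found in a single pass over the token's tail, without testing every candidate suffix against a hash table.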

Installation

This library is compatible with Python >= 3.6.

Clone the repository and run:

With pip

pip install -e .
pip install -r requirements.txt

Test

A set of tests is included in the project, under the tests folder. The test suite can be run as follows:

pip install -e ".[testing]"
pip install -r requirements-test.txt
python -m unittest

Usage

The library needs a set of rules to apply stemming properly. The rules can be supplied either as a list passed to the BulStemmer constructor or as a path to a file containing them.

For both options the rules need to be formatted as follows:

word ==> stem ==> freq
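
As an illustration of this format, a rule line could be split into its three fields as follows. The `Rule` tuple and the `parse_rule` helper are hypothetical and not part of the library's API. The sketch accepts the frequency separated either by `==>` or by plain whitespace, since both spellings appear in this document, and it assumes that the stem is never empty.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    word: str  # the suffix to match
    stem: str  # what the suffix is replaced with
    freq: int  # how often the rule was observed


def parse_rule(line):
    """Parse 'word ==> stem ==> freq' or 'word ==> stem freq' into a Rule."""
    word, sep, rest = line.partition("==>")
    if not sep:
        raise ValueError(f"missing '==>' in rule: {line!r}")
    # Treat a second '==>' like whitespace so both spellings parse the same way.
    parts = rest.replace("==>", " ").split()
    if len(parts) != 2:
        raise ValueError(f"expected a stem and a frequency in rule: {line!r}")
    stem, freq = parts
    return Rule(word.strip(), stem, int(freq))
```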

A pre-defined set of rules is included in the distribution and can be used directly (see the example under "Reading the rules from an external file").

Manually loading rules

from bulstem.stem import BulStemmer

stemmer = BulStemmer(["ой ==> о 10"], min_freq=0, left_context=2)
stemmer.stem('порой')  # Expected output: 'поро'

BulStemmer constructor params:

rules - Iterable of strings containing rules.
min_freq - The minimum frequency of a rule to be used when stemming.
left_context - Size of the prefix which will not be stemmed.

Reading the rules from an external file

from bulstem.stem import BulStemmer


# Pre-defined names of rule sets
PRE_DEFINED_RULES = ['stem-context-1', 
                     'stem-context-2',
                     'stem-context-3']

# Expected output:
# 1 втор
# 2 втори
# 3 вторият
for i, rules_name in enumerate(PRE_DEFINED_RULES, start=1):
    stemmer = BulStemmer.from_file(rules_name, min_freq=2, left_context=i)
    print(i, stemmer.stem('вторият'))

# Note: i is still 3 here, left over from the loop above
stemmer = BulStemmer.from_file('stem_rules_context_2_utf8.txt', min_freq=2, left_context=i)
stemmer.stem('вторият')  # Expected output: 'втори'
stemmer.stem('вероятен')  # Expected output: 'вероят'

BulStemmer.from_file params:

path - Path to (or pre-defined name of) the rules file, formatted as follows: word ==> stem ==> freq.
min_freq - The minimum frequency of a rule to be used when stemming.
left_context - Size of the prefix which will not be stemmed.
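
Since the stemmer works on single tokens, running text has to be split first. The sketch below shows one way to do that. The `stem_text` helper is hypothetical, and both the regex tokenization and the lowercasing are assumptions of this example, not behaviour of the library. It works with any object that exposes a `stem(token)` method.

```python
import re


def stem_text(stemmer, text):
    """Lowercase `text`, split it into word tokens, and stem each token.

    `stemmer` can be any object with a `stem(token)` method.
    """
    tokens = re.findall(r"\w+", text.lower())
    return [stemmer.stem(token) for token in tokens]
```

With a stemmer created as in the examples above, a call would look like `stem_text(stemmer, 'Вторият опит')`.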

Other implementations

Perl (Original), Java (JDK 1.4), Ruby, C#, Python2, GATE plugin (Java)

License

For license information, see LICENSE.
