Justice for African Languages

Toka (tokaafrika)


This package provides utility functions for African-language StopWords: it ships prebuilt StopWords, helps you generate new StopWords with ease, and helps you clean text datasets. We are working on improving it so that you get reliable StopWords and quality aligned datasets.

Currently we support the following languages.

  • Throughout the package, the language code and the language name can be used interchangeably. Refer to the table below.
Language Code / ISO Code    Language Name
eng                         English
sep                         Sepedi
afr                         Afrikaans
tsn                         Setswana
nbl                         isiNdebele
ssw                         Siswati
xho                         isiXhosa
ven                         Tshivenda
zul                         isiZulu
tso                         Xitsonga
sot                         Sesotho
nuu                         N|uu
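
Accepting either a code or a name could be handled with a small lookup like the one below. This is a minimal sketch, not the package's implementation: `LANGUAGES` and `resolve_language` are hypothetical names, and the case-insensitive matching is an assumption.

```python
# Hypothetical helper, not part of tokaafrika.
# Codes and names are copied from the table above.
LANGUAGES = {
    "eng": "English",
    "sep": "Sepedi",
    "afr": "Afrikaans",
    "tsn": "Setswana",
    "nbl": "isiNdebele",
    "ssw": "Siswati",
    "xho": "isiXhosa",
    "ven": "Tshivenda",
    "zul": "isiZulu",
    "tso": "Xitsonga",
    "sot": "Sesotho",
    "nuu": "N|uu",
}


def resolve_language(value: str) -> str:
    """Return the language code for either a code or a language name."""
    key = value.strip().lower()
    if key in LANGUAGES:  # already a code
        return key
    for code, name in LANGUAGES.items():
        if name.lower() == key:  # matched by name (assumed case-insensitive)
            return code
    raise ValueError(f"Unsupported language: {value!r}")
```

With this sketch, `resolve_language('tshivenda')` and `resolve_language('ven')` both return `'ven'`.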

Installation

Dependencies

This project uses type hints.

  • Toka-api requires
    python>=3.9.13

To get started, install the package by running the following command.

$ pip install tokaafrika==0.0.1

To start using the package, follow the steps below

Get StopWords (Prebuilt StopWords)

At the moment the StopWords cover South African languages, including N|uu.

>>> from toka.toka import TokaAPI
>>> api = TokaAPI()
>>> stopwords = api.get_stopwords('tshivenda')  # use the full name
>>> print(stopwords)
frozenset({'a', 'vha', 'u', 'na', 'tshi', 'nga', 'ya', 'ndi',
           'o', 'khou', 'ni', 'uri', 'hu', 'ha', 'kha', 'i',
           'zwi', 'tsha', 'ri', 'yo', 'wa', 'ho', 'vho', 'musi',
           'ḽa', 'zwa', 'ḓo', 'amba', 'nahone', 'no'})
>>> stopwords = api.get_stopwords('ven')  # use the short name/code
>>> print(stopwords)
frozenset({'a', 'vha', 'u', 'na', 'tshi', 'nga', 'ya', 'ndi',
           'o', 'khou', 'ni', 'uri', 'hu', 'ha', 'kha', 'i',
           'zwi', 'tsha', 'ri', 'yo', 'wa', 'ho', 'vho', 'musi',
           'ḽa', 'zwa', 'ḓo', 'amba', 'nahone', 'no'})
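
Once you have a stop word set, a typical next step is to filter it out of tokenized text. The sketch below is self-contained and does not call the package: `VEN_STOPWORDS` is copied from the output above, while `remove_stopwords` and its whitespace tokenization are assumptions made for illustration.

```python
# Stop words copied from the Tshivenda example output above.
VEN_STOPWORDS = frozenset({
    'a', 'vha', 'u', 'na', 'tshi', 'nga', 'ya', 'ndi',
    'o', 'khou', 'ni', 'uri', 'hu', 'ha', 'kha', 'i',
    'zwi', 'tsha', 'ri', 'yo', 'wa', 'ho', 'vho', 'musi',
    'ḽa', 'zwa', 'ḓo', 'amba', 'nahone', 'no',
})


def remove_stopwords(text: str, stopwords: frozenset) -> list[str]:
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in stopwords]


tokens = remove_stopwords("Ndi khou guda Tshivenda", VEN_STOPWORDS)
print(tokens)
```

Because a `frozenset` supports constant-time membership tests, the filter stays cheap even on large datasets.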

To Clean Symbols

This removes symbols from your text so that your data is clean. As the example shows, the output is also lowercased and the digits are removed.

>>> from toka.toka import TokaAPI
>>> toka_object = TokaAPI()
>>> clean_text = toka_object.clean_symbols(
...     'Hello! This is an example '
...     'text with numbers like 123 ')
>>> print(clean_text)
hello this is an example text with numbers like
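
To show what this kind of cleaning involves, here is a minimal standard-library sketch that reproduces the example output above. It is not the package's implementation: `clean_symbols_sketch` is a hypothetical name, and the exact rules (lowercase, replace punctuation, digits and underscores with spaces, collapse whitespace) are assumptions inferred from that one example.

```python
import re


def clean_symbols_sketch(text: str) -> str:
    """Lowercase text and strip punctuation, digits and underscores."""
    lowered = text.lower()
    # Replace anything that is not a word character or whitespace,
    # plus digits and underscores, with a space.
    no_symbols = re.sub(r"[^\w\s]|[\d_]", " ", lowered)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(no_symbols.split())


print(clean_symbols_sketch("Hello! This is an example text with numbers like 123 "))
```

In Python 3, `\w` matches Unicode letters, so characters such as `ḽ` and `ḓ` are kept.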

Get Frequent Words

This returns the frequent words in the given text and how many times each one appears.

>>> from toka.toka import TokaAPI
>>> toka_object = TokaAPI()
>>> english = toka_object.get_frequent_words('Hello test')
>>> print(english)
{'hello': 1, 'test': 1}
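
The same kind of word count can be sketched with `collections.Counter` from the standard library. This is an illustration only: `count_words` is a hypothetical name, and lowercasing plus whitespace splitting are assumptions inferred from the example output above.

```python
from collections import Counter


def count_words(text: str) -> dict[str, int]:
    """Count how often each lowercased, whitespace-separated word occurs."""
    return dict(Counter(text.lower().split()))


print(count_words("Hello test"))
```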

Compute StopWords

This computes the stop words from the given documents or text. It is more accurate with long texts or large documents.

>>> from toka.toka import TokaAPI
>>> api = TokaAPI()
>>> stopwords = api.compute_stopwords(
...    "the the are are the are on the on", 3)
>>> print(stopwords)
['the', 'are', 'on']
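
One simple way to approximate this is to treat the most frequent words as stop words. The sketch below takes that approach and is not the package's algorithm: `top_words` is a hypothetical name, and it assumes the second argument is the number of words to return.

```python
from collections import Counter


def top_words(text: str, k: int) -> list[str]:
    """Return the k most frequent whitespace-separated words, most frequent first."""
    counts = Counter(text.lower().split())
    return [word for word, _ in counts.most_common(k)]


print(top_words("the the are are the are on the on", 3))
```

Frequency alone favors function words only when the text is long enough, which matches the note above about long texts and large documents.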

Load model

This assumes you have a trained model and its vectorizer, both saved as pickle files.

>>> from toka.toka import TokaAPI
>>> api = TokaAPI()
>>> model = 'model.pkl'
>>> vector = 'vector.pkl'
>>> clf, vector = api.load_model_from_pickle(model,
...                     vector)
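
For reference, loading two pickle files with the standard library looks roughly like the sketch below. It is not the package's implementation: `load_pickles` is a hypothetical name, and the return order (model first, then vectorizer) is an assumption based on the example above. Only unpickle files you trust, because unpickling can execute arbitrary code.

```python
import pickle
from pathlib import Path


def load_pickles(model_path: str, vectorizer_path: str):
    """Load and return (model, vectorizer) from two pickle files."""
    with Path(model_path).open("rb") as model_file:
        model = pickle.load(model_file)
    with Path(vectorizer_path).open("rb") as vectorizer_file:
        vectorizer = pickle.load(vectorizer_file)
    return model, vectorizer
```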

Development

We welcome new contributions. If you have spotted a bug or have ideas for improvement, leave them in the issues, or fork the repo, develop the feature, and create a PR to merge the changes into the repository. Ensure tests are written and cover edge cases.

Citations

If you used the tokaafrika package and it helped you, we would appreciate a citation.

@misc{tokaafrika,
  title={tokaafrika: African Languages - Machine Learning Package},
  author={Ofentswe Lebogo and Shaun Damon},
  howpublished={\url{https://pypi.org/project/tokaafrika/}},
  year={2024}
}
