Justice for African Languages

Project description

Toka (`tokaafrika`)

This package provides utility functions for African-language StopWords: it ships prebuilt StopWords, makes it easy to generate new ones, and helps you clean text datasets. We are working on improving it so that you can get reliable StopWords and quality aligned datasets.

Currently we support the following languages

Throughout the package, a language can be referred to either by its code or by its name; the two are interchangeable. Refer to the table below.

Language Code / ISO CODE	Language Name
eng	English
sep	Sepedi
afr	Afrikaans
tsn	Setswana
nbl	isiNdebele
ssw	Siswati
xho	isiXhosa
ven	Tshivenda
zul	isiZulu
tso	Xitsonga
sot	Sesotho
nuu	N|uu
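
One way to treat codes and names interchangeably is to normalise every identifier to its code before a lookup. The sketch below is illustrative only: `LANGUAGES` and `resolve_language` are hypothetical names, not part of `tokaafrika`, and the mapping simply mirrors the table above.

```python
# Hypothetical helper, not part of tokaafrika: maps a language name or
# code from the table above to its three-letter code.
LANGUAGES: dict[str, str] = {
    "eng": "English",
    "sep": "Sepedi",
    "afr": "Afrikaans",
    "tsn": "Setswana",
    "nbl": "isiNdebele",
    "ssw": "Siswati",
    "xho": "isiXhosa",
    "ven": "Tshivenda",
    "zul": "isiZulu",
    "tso": "Xitsonga",
    "sot": "Sesotho",
    "nuu": "N|uu",
}

# Reverse index keyed by the lowercase language name.
_NAME_TO_CODE = {name.lower(): code for code, name in LANGUAGES.items()}


def resolve_language(identifier: str) -> str:
    """Return the language code for a code or a full name (case-insensitive)."""
    key = identifier.strip().lower()
    if key in LANGUAGES:
        return key
    if key in _NAME_TO_CODE:
        return _NAME_TO_CODE[key]
    raise ValueError(f"Unsupported language: {identifier!r}")


print(resolve_language("ven"))        # ven
print(resolve_language("Tshivenda"))  # ven
```

Resolving to a single canonical key first means the rest of the code only ever has to deal with one form of the identifier.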

Installation

Dependencies

We are using type hinting on this project.

Toka-api requires
python>=3.9.13

To get started, install the package from your shell:

$ pip install tokaafrika==0.0.1

To start using the package, follow the steps below

Get `StopWords` (prebuilt StopWords)

At the moment the StopWords cover South African languages, including N|uu.

>>> from toka.toka import TokaAPI
>>> api = TokaAPI()
>>> stopwords = api.get_stopwords('tshivenda')  # use the full name
>>> print(stopwords)
frozenset({'a', 'vha', 'u', 'na', 'tshi', 'nga', 'ya', 'ndi',
           'o', 'khou', 'ni', 'uri', 'hu', 'ha', 'kha', 'i',
           'zwi', 'tsha', 'ri', 'yo', 'wa', 'ho', 'vho', 'musi',
           'ḽa', 'zwa', 'ḓo', 'amba', 'nahone', 'no'})
>>> stopwords = api.get_stopwords('ven')  # use the short name/code
>>> print(stopwords)
frozenset({'a', 'vha', 'u', 'na', 'tshi', 'nga', 'ya', 'ndi',
           'o', 'khou', 'ni', 'uri', 'hu', 'ha', 'kha', 'i',
           'zwi', 'tsha', 'ri', 'yo', 'wa', 'ho', 'vho', 'musi',
           'ḽa', 'zwa', 'ḓo', 'amba', 'nahone', 'no'})
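
A typical next step is to drop stopwords from tokenised text. The following is a minimal sketch rather than a package feature: `remove_stopwords` is a hypothetical helper, and `STOPWORDS` is a hand-copied subset of the Tshivenda set printed above so that the example runs without the package installed.

```python
# Subset of the Tshivenda stopwords shown above, copied by hand for the example.
STOPWORDS = frozenset({"a", "vha", "u", "na", "tshi", "nga", "ya", "ndi"})


def remove_stopwords(text: str, stopwords: frozenset[str]) -> list[str]:
    """Lowercase, split on whitespace, and drop tokens found in `stopwords`."""
    return [token for token in text.lower().split() if token not in stopwords]


# A made-up token sequence, used only to exercise the filter.
print(remove_stopwords("ndi khou amba na vha", STOPWORDS))
# ['khou', 'amba']
```

Because the stopwords are held in a `frozenset`, each membership check is a constant-time lookup on average.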

To Clean Symbols

This removes symbols from your text so that your data is clean. As the example shows, punctuation and digits are stripped and the text is lowercased.

>>> from toka.toka import TokaAPI
>>> toka_object = TokaAPI()
>>> clean_text = toka_object.clean_symbols(
...     'Hello! This is an example text with numbers like 123 ')
>>> print(clean_text)
hello this is an example text with numbers like
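
For readers curious about what this kind of cleaning involves, here is a rough sketch built on the standard `re` module. It is an assumption about the behaviour suggested by the example output, not the package's actual implementation, and `clean_symbols_sketch` is a hypothetical name.

```python
import re


def clean_symbols_sketch(text: str) -> str:
    """Lowercase, keep only letters and whitespace, then tidy the spacing."""
    lowered = text.lower()
    # [^\W\d_] matches Unicode letters (precomposed forms such as 'ḽ' included);
    # \s keeps whitespace so that words stay separated.
    kept = "".join(re.findall(r"[^\W\d_]|\s", lowered))
    # Collapse the runs of whitespace left behind by removed characters.
    return re.sub(r"\s+", " ", kept).strip()


print(clean_symbols_sketch("Hello! This is an example text with numbers like 123 "))
# hello this is an example text with numbers like
```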

Get Frequent Words

This returns the words in a given text together with the number of times each one appears.

>>> from toka.toka import TokaAPI
>>> toka_object = TokaAPI()
>>> english = toka_object.get_frequent_words('Hello test')
>>> print(english)
{'hello': 1, 'test': 1}
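
The same kind of count can be reproduced with `collections.Counter` from the standard library. This is only a sketch of the idea, assuming lowercase whitespace tokenisation; `frequent_words_sketch` is a hypothetical name and the package may tokenise differently.

```python
from collections import Counter


def frequent_words_sketch(text: str) -> dict[str, int]:
    """Count lowercase, whitespace-separated tokens, most frequent first."""
    counts = Counter(text.lower().split())
    # most_common() sorts by count; ties keep their first-seen order.
    return dict(counts.most_common())


print(frequent_words_sketch("Hello test"))
# {'hello': 1, 'test': 1}
```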

Compute `StopWords`

This computes stop words from the documents or text you provide. It is more accurate on long texts or large documents.

>>> from toka.toka import TokaAPI
>>> api = TokaAPI()
>>> stopwords = api.compute_stopwords(
...    "the the are are the are on the on", 3)
>>> print(stopwords)
['the', 'are', 'on']
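
A frequency-based approach like this can be sketched in a few lines. The sketch assumes that the second argument is the number of top-ranked words to return, which the example suggests but the description does not state; `compute_stopwords_sketch` and `top_n` are hypothetical names, not the package's own.

```python
from collections import Counter


def compute_stopwords_sketch(text: str, top_n: int) -> list[str]:
    """Return the `top_n` most frequent lowercase tokens as candidate stopwords."""
    counts = Counter(text.lower().split())
    # most_common(n) returns (word, count) pairs, highest count first.
    return [word for word, _ in counts.most_common(top_n)]


print(compute_stopwords_sketch("the the are are the are on the on", 3))
# ['the', 'are', 'on']
```

On short inputs almost any word can land in the top ranks, which is why a frequency cut-off of this kind is more trustworthy on long texts.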

Load model

This assumes you already have a trained model and its vectorizer, each saved as a pickle file.

>>> from toka.toka import TokaAPI
>>> api = TokaAPI()
>>> model = 'model.pkl'
>>> vector = 'vector.pkl'
>>> clf, vector = api.load_model_from_pickle(model,
...                     vector)
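
Loading two pickle files can be done with the standard `pickle` module alone. The sketch below shows that general pattern and is not the package's implementation; `load_pickles_sketch` is a hypothetical name.

```python
import pickle
from pathlib import Path


def load_pickles_sketch(model_path: str, vectorizer_path: str):
    """Unpickle and return (model, vectorizer) from two file paths."""
    # Pickle files are binary, so both are opened in "rb" mode.
    with Path(model_path).open("rb") as model_file:
        model = pickle.load(model_file)
    with Path(vectorizer_path).open("rb") as vectorizer_file:
        vectorizer = pickle.load(vectorizer_file)
    return model, vectorizer
```

Only unpickle files you trust: loading a pickle can execute arbitrary code.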

Development

We welcome contributions. If you have spotted a bug or have ideas for improvement, open an issue, or fork the repo, develop the feature, and create a PR to merge your changes into the repository. Make sure tests are written and cover edge cases.

Citations

If you used the tokaafrika package and found it helpful, we would appreciate a citation.

@misc{tokaafrika,
  title={tokaafrika: African Languages - Machine Learning Package},
  author={Ofentswe Lebogo and Shaun Damon},
  howpublished={\url{https://pypi.org/project/tokaafrika/}},
  year={2024}
}

Release history

0.0.2 (this version): Mar 16, 2024
0.0.1: Feb 25, 2024

Download files

Source Distribution

tokaafrika-0.0.2.tar.gz (8.7 kB)

Uploaded Mar 16, 2024 Source

Hashes for tokaafrika-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`a8565ded0855d07dc2aaf6b214d317c6d6684f4b0bca41aa223764afef446702`
MD5	`dd1814e950e4c5ad40de3d35b1824adb`
BLAKE2b-256	`5c23f590ef48e460f8a20d228a93bb1ae637437299cb3bcd5a73781ff497ed4f`
