NLP Pipelines for Tagalog

calamanCy: NLP pipelines for Tagalog

calamanCy is a Tagalog natural language preprocessing framework made with spaCy. Its goal is to provide pipelines and datasets for downstream NLP tasks. This repository contains material for using calamanCy, reproduction of results, and guides on usage.

calamanCy takes inspiration from other language-specific spaCy Universe frameworks such as DaCy, huSpaCy, and graCy. The name is based on calamansi, a citrus fruit native to the Philippines and used in traditional Filipino cuisine.

🔧 Installation

To get started with calamanCy, simply install it using pip by running the following line in your terminal:

pip install calamanCy

Development

If you are developing calamanCy, first clone the repository:

git clone git@github.com:ljvmiranda921/calamanCy.git

Then, create a virtual environment and install the dependencies:

python -m venv venv
venv/bin/pip install -e .  # requires pip>=23.0
venv/bin/pip install .[dev]

# Activate the virtual environment
source venv/bin/activate

Alternatively, run make dev.

Running the tests

We use pytest as our test runner:

python -m pytest --pyargs calamancy

👩‍💻 Usage

To use calamanCy you first have to download either the medium, large, or transformer model. To see a list of all available models, run:

import calamancy
for model in calamancy.models():
    print(model)

# ..
# tl_calamancy_md-0.1.0
# tl_calamancy_lg-0.1.0
# tl_calamancy_trf-0.1.0
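
The identifiers printed above appear to follow a `<name>-<version>` pattern. As a minimal sketch under that assumption, the snippet below splits an identifier into its two parts. `split_model_id` is a hypothetical helper written for illustration; it is not part of the calamanCy API.

```python
from typing import Tuple


def split_model_id(model_id: str) -> Tuple[str, str]:
    """Split an identifier such as 'tl_calamancy_md-0.1.0' into (name, version).

    Hypothetical helper: assumes the last hyphen separates name and version.
    """
    name, sep, version = model_id.rpartition("-")
    if not sep or not name or not version:
        raise ValueError(f"expected '<name>-<version>', got {model_id!r}")
    return name, version


# Identifiers copied from the listing above.
model_ids = [
    "tl_calamancy_md-0.1.0",
    "tl_calamancy_lg-0.1.0",
    "tl_calamancy_trf-0.1.0",
]
parsed = [split_model_id(model_id) for model_id in model_ids]
for name, version in parsed:
    print(f"{name}: {version}")
```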

To download and load a model, run:

import calamancy

nlp = calamancy.load("tl_calamancy_md-0.1.0")
doc = nlp("Ako si Juan de la Cruz")

The nlp object is an instance of spaCy's Language class and you can use it as any other spaCy pipeline. You can also access these models on Hugging Face 🤗.

📦 Models and Datasets

calamanCy provides Tagalog models and datasets that you can use in your spaCy pipelines. You can download them directly or use the calamancy Python library to access them. The training procedure for each pipeline can be found in the models/ directory. They are further subdivided into versions. Each folder is an instance of a spaCy project.

Here are the models for the latest release:

Model	Pipelines	Description
tl_calamancy_md (73.7 MB)	tok2vec, tagger, morphologizer, parser, ner	CPU-optimized Tagalog NLP model. Pretrained using the TLUnified dataset. Using floret vectors (50k keys)
tl_calamancy_lg (431.9 MB)	tok2vec, tagger, morphologizer, parser, ner	CPU-optimized large Tagalog NLP model. Pretrained using the TLUnified dataset. Using fastText vectors (714k)
tl_calamancy_trf (775.6 MB)	transformer, tagger, parser, ner	GPU-optimized transformer Tagalog NLP model. Uses roberta-tagalog-base as context vectors.
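
One way to read the table is as a choice driven by hardware: the transformer model is GPU-optimized, while the medium and large models are CPU-optimized and differ mainly in size. The sketch below encodes that reading. `pick_model` and its parameters are hypothetical and only illustrate the trade-off; the returned strings are the model names from the table.

```python
def pick_model(has_gpu: bool, prefer_large: bool = False) -> str:
    """Return a model name from the table above (hypothetical selection rule)."""
    if has_gpu:
        # The table describes the transformer model as GPU-optimized.
        return "tl_calamancy_trf"
    # Both remaining models are CPU-optimized; the large one is a bigger download.
    return "tl_calamancy_lg" if prefer_large else "tl_calamancy_md"


print(pick_model(has_gpu=False))
print(pick_model(has_gpu=False, prefer_large=True))
print(pick_model(has_gpu=True))
```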

📓 API

The calamanCy library contains utility functions that help you load its models and infer on your text. You can think of these functions as "syntactic sugar" to the spaCy API. We highly recommend checking out the spaCy Doc object, as it provides the most flexibility.

Loaders

The loader functions provide an easier interface for downloading calamanCy models. These models are hosted on Hugging Face, so you can try them out before downloading.

`function` `get_latest_version`

Return the latest version of a calamanCy model.

Argument	Type	Description
`model`	`str`	The string indicating the model.
RETURNS	`str`	The latest version of the model.
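
To illustrate what "latest version" means for identifiers of the form `<name>-<version>`, here is a minimal sketch that picks the highest version by comparing numeric components. It is not the library's implementation. `latest_version` and `version_key` are hypothetical helpers, and the versions `0.2.0` and `0.10.0` in the sample list are invented for the example.

```python
from typing import Iterable, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn '0.10.0' into (0, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def latest_version(model: str, model_ids: Iterable[str]) -> str:
    """Return the highest version found for `model` (hypothetical helper)."""
    versions = []
    for model_id in model_ids:
        name, _, version = model_id.rpartition("-")
        if name == model:
            versions.append(version)
    if not versions:
        raise ValueError(f"no versions found for {model!r}")
    return max(versions, key=version_key)


# Sample identifiers; 0.2.0 and 0.10.0 are made up for illustration.
available = [
    "tl_calamancy_md-0.1.0",
    "tl_calamancy_md-0.2.0",
    "tl_calamancy_md-0.10.0",
    "tl_calamancy_lg-0.1.0",
]
print(latest_version("tl_calamancy_md", available))
```

Comparing integer tuples rather than raw strings matters here: as strings, "0.2.0" would sort after "0.10.0".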

`function` `models`

Get a list of valid calamanCy models.

Argument	Type	Description
RETURNS	`List[str]`	List of valid calamanCy models

`function` `load`

Load a calamanCy model as a spaCy language pipeline.

Argument	Type	Description
`model`	`str`	The model to download. See the available models at `calamancy.models()`.
`force`	`bool`	Force download the model. Defaults to `False`.
`**kwargs`	`dict`	Additional arguments to `spacy.load()`.
RETURNS	`Language`	A spaCy language pipeline.

Inference

Below are lightweight utility classes for users who are not familiar with spaCy's primitives. They are only useful for inference and not for training. If you wish to train on top of these calamanCy models (e.g., text categorization, task-specific NER, etc.), we advise you to follow the standard spaCy training workflow.

General usage: first, you need to instantiate a class with the name of a model. Then, you can use the __call__ method to perform the prediction. The output is of the type Iterable[Tuple[str, Any]] where the first part of the tuple is the token and the second part is its label.

`method` `EntityRecognizer.__call__`

Perform named entity recognition (NER). By default, it uses the v0.1.0 of TLUnified-NER with the following entity labels: PER (Person), ORG (Organization), LOC (Location).

Argument	Type	Description
`text`	`str`	The text to get the entities from.
YIELDS	`Iterable[Tuple[str, str]]`	the token and its entity in IOB format.
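
Because the output is one (token, tag) pair per token, a common follow-up step is to merge consecutive tags into whole entity spans. The sketch below does this for hand-written sample pairs. It assumes tags are spelled `B-PER`, `I-PER`, and `O`, which the text above does not confirm, and the sample tags are not real model output. `group_entities` is a hypothetical helper.

```python
from typing import Iterable, List, Optional, Tuple


def group_entities(tagged: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (entity text, label) spans."""
    spans: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label: Optional[str] = None

    for token, tag in tagged:
        prefix, _, tag_label = tag.partition("-")
        if prefix == "I" and tokens and tag_label == label:
            # Continuation of the entity that is currently open.
            tokens.append(token)
            continue
        # Any other tag closes the open entity, if there is one.
        if tokens:
            spans.append((" ".join(tokens), label))
            tokens, label = [], None
        if prefix in ("B", "I"):
            # Start a new entity; a stray I- tag is treated like B-.
            tokens, label = [token], tag_label

    if tokens:
        spans.append((" ".join(tokens), label))
    return spans


# Hand-written sample in the assumed tag spelling, not actual model output.
sample = [
    ("Ako", "O"),
    ("si", "O"),
    ("Juan", "B-PER"),
    ("de", "I-PER"),
    ("la", "I-PER"),
    ("Cruz", "I-PER"),
]
print(group_entities(sample))
```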

`method` `Tagger.__call__`

Perform part-of-speech tagging. It uses the annotations from the TRG and Ugnayan treebanks with the following tags: ADJ, ADP, ADV, AUX, DET, INTJ, NOUN, PART, PRON, PROPN, PUNCT, SCONJ, VERB.

Argument	Type	Description
`text`	`str`	The text to get the POS tags from.
YIELDS	`Iterable[Tuple[str, Tuple[str, str]]]`	the token and its coarse- and fine-grained POS tag.
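
Since each item pairs a token with a (coarse, fine) tuple, the nested tuple can be unpacked directly in a loop. As a small sketch, the snippet below counts coarse-grained tags. The sample pairs are hand-written, the fine-grained values are placeholders, and `coarse_tag_counts` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable, Tuple


def coarse_tag_counts(tagged: Iterable[Tuple[str, Tuple[str, str]]]) -> Counter:
    """Count how often each coarse-grained POS tag occurs."""
    return Counter(coarse for _token, (coarse, _fine) in tagged)


# Hand-written sample, not actual model output; fine-grained tags are placeholders.
sample = [
    ("Ako", ("PRON", "PRON")),
    ("si", ("ADP", "ADP")),
    ("Juan", ("PROPN", "PROPN")),
    ("Cruz", ("PROPN", "PROPN")),
]
counts = coarse_tag_counts(sample)
print(counts.most_common())
```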

`method` `Parser.__call__`

Perform syntactic dependency parsing. It uses the annotations from the TRG and Ugnayan treebanks.

Argument	Type	Description
`text`	`str`	The text to get the dependency relations from.
YIELDS	`Iterable[Tuple[str, str]]`	the token and its dependency relation.
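
The (token, relation) pairs can be regrouped by relation, for example to list every token that plays a given syntactic role. The sketch below shows one way to do that. The relation labels in the sample are hand-written guesses in Universal Dependencies style, not actual parser output, and `tokens_by_relation` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def tokens_by_relation(parsed: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group tokens under their dependency relation, preserving token order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for token, relation in parsed:
        groups[relation].append(token)
    return dict(groups)


# Hand-written sample with assumed labels, not actual model output.
sample = [
    ("Ako", "root"),
    ("si", "case"),
    ("Juan", "nsubj"),
    ("Cruz", "flat"),
]
grouped = tokens_by_relation(sample)
print(grouped.get("nsubj", []))
```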

📝 Reporting Issues

If you have questions regarding the usage of calamanCy, bug reports, or just want to give us feedback after giving it a spin, please use the Issue tracker. Thank you!

📜 Citation

If you are citing the open-source software, please use:

@misc{miranda2023calamancy,
    title={{calamanCy: A Tagalog Natural Language Processing Toolkit}}, 
    author={Lester James V. Miranda},
    year={2023},
    eprint={2311.07171},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

If you are citing the NER dataset, please use:

@misc{miranda2023developing,
    title={{Developing a Named Entity Recognition Dataset for Tagalog}}, 
    author={Lester James V. Miranda},
    year={2023},
    eprint={2311.07161},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Release history

0.1.2 (Dec 4, 2023), this version
0.1.1 (Jul 28, 2023)
0.1.0 (Jul 2, 2023)
0.0.1 (Feb 14, 2023)
