Project description

calamanCy: NLP pipelines for Tagalog [WIP]

calamanCy is a Tagalog natural language preprocessing framework made with spaCy. Its goal is to provide pipelines and datasets for downstream NLP tasks. This repository contains material for using calamanCy, reproducing its results, and usage guides.

calamanCy takes inspiration from other language-specific spaCy Universe frameworks such as DaCy, huSpaCy, and graCy. The name is derived from calamansi, a citrus fruit native to the Philippines and used in traditional Filipino cuisine.

🔧 Installation

To get started with calamanCy, simply install it using pip by running the following line in your terminal:

pip install calamancy

👩‍💻 Usage

To use calamanCy, you first have to download either the medium, large, or transformer model. To see a list of all available models, run:

import calamancy
for model in calamancy.models():
    print(model)

# ..
# tl_calamancy_md-0.1.0
# tl_calamancy_lg-0.1.0
# tl_calamancy_trf-0.1.0
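
The identifiers listed above appear to follow a `<name>-<version>` pattern. As a minimal sketch under that assumption, the helper below splits an identifier into its two parts, which can be handy when comparing installed versions. `ModelId` and `parse_model_id` are hypothetical names and not part of calamanCy.

```python
from typing import NamedTuple


class ModelId(NamedTuple):
    name: str     # e.g. "tl_calamancy_md"
    version: str  # e.g. "0.1.0"


def parse_model_id(model_id: str) -> ModelId:
    """Split an identifier such as 'tl_calamancy_md-0.1.0' at its last hyphen."""
    name, sep, version = model_id.rpartition("-")
    if not sep or not name or not version:
        raise ValueError(f"Expected '<name>-<version>', got {model_id!r}")
    return ModelId(name, version)


# Identifiers copied from the listing above.
for model_id in (
    "tl_calamancy_md-0.1.0",
    "tl_calamancy_lg-0.1.0",
    "tl_calamancy_trf-0.1.0",
):
    parsed = parse_model_id(model_id)
    print(parsed.name, parsed.version)
```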

To download and load a model, run:

nlp = calamancy.load("tl_calamancy_md-0.1.0")

This will download the model to the .calamancy directory in your home directory. You can also download a model to a specific directory:

calamancy.download_model("tl_calamancy_md-0.1.0", save_directory)
nlp = calamancy.load_model("tl_calamancy_md-0.1.0", save_directory)
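
The snippet above leaves `save_directory` undefined. One way to build it, sketched here with the standard library only, is with `pathlib`. The helper name `prepare_save_directory` and the folder name `calamancy_models` are hypothetical choices for illustration, and the demo writes to a temporary directory so it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def prepare_save_directory(base: Path, name: str = "calamancy_models") -> Path:
    """Create the directory (if needed) and return its path."""
    save_directory = base / name
    save_directory.mkdir(parents=True, exist_ok=True)
    return save_directory


# The default location described above: the .calamancy folder in the home directory.
default_directory = Path.home() / ".calamancy"
print(default_directory)

# Demo in a throwaway location; in practice `base` could be a project folder.
with tempfile.TemporaryDirectory() as tmp:
    save_directory = prepare_save_directory(Path(tmp))
    print(save_directory.is_dir())
```

The resulting path can then be passed as the second argument in the download and load calls shown above.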

The nlp object is an instance of spaCy's Language class, and you can use it like any other spaCy pipeline. Head over to the documentation for more tutorials.
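
As a rough illustration of what using it as a spaCy pipeline can look like, the sketch below collects token-level annotations and named entities from a processed document. It relies only on standard spaCy attributes (`token.text`, `token.pos_`, `token.lemma_`, `doc.ents`, `ent.label_`); the function name `summarize_doc` is hypothetical, and which annotations are actually filled in depends on the components of the loaded pipeline.

```python
from typing import Any, Dict, List, Tuple


def summarize_doc(doc: Any) -> Dict[str, List[Tuple[str, ...]]]:
    """Collect per-token annotations and named entities from a spaCy Doc."""
    # Each token carries its text, coarse part-of-speech tag, and lemma.
    tokens = [(token.text, token.pos_, token.lemma_) for token in doc]
    # Named entities are spans with a text and a label.
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return {"tokens": tokens, "entities": entities}


# Typical use, assuming `nlp` was loaded as shown above
# (the sentence is only an illustrative input):
#   doc = nlp("Ako si Juan de la Cruz")
#   print(summarize_doc(doc))
```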

📦 Models and Datasets

calamanCy provides Tagalog models and datasets that you can use in your spaCy pipelines. You can download them directly or use the calamancy Python library to access them.

Datasets

You can find structured evaluation results for each dataset in the datasets/ directory.

Name	Type	Task	Train	Dev	Test	Labels	Description
`tl_tlunified_gold`	Gold	NER	6252	782	782	PER, ORG, LOC	Annotated portion of the TLUnified corpus (Cruz and Cheng, 2021).
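
To make the split sizes in the table easier to read, this short calculation converts them into shares of the whole dataset. The counts are copied from the table; the variable names are illustrative.

```python
# Split sizes for tl_tlunified_gold, taken from the table above.
splits = {"train": 6252, "dev": 782, "test": 782}

total = sum(splits.values())
proportions = {name: count / total for name, count in splits.items()}

print(f"total examples: {total}")
for name, share in proportions.items():
    print(f"{name}: {share:.1%}")
```

The result is close to an 80/10/10 train/dev/test split.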

Pipelines

The training procedure for each pipeline can be found in the training/ directory, which is further subdivided by version. Each folder is a spaCy project.

Name	Components	Sources	Description
`tl_calamancy_md`	tok2vec, morphologizer, parser, trainable_lemmatizer, ner	TLUnified (Cruz and Cheng, 2021), UD Tagalog (2023)	The tok2vec component uses floret vectors (200k) trained on the bulk of TLUnified. Like the `lg` variant, it also uses character pretraining to initialize the weights.
`tl_calamancy_lg`	tok2vec, morphologizer, parser, trainable_lemmatizer, ner	TLUnified (Cruz and Cheng, 2021), UD Tagalog (2023)	The tok2vec component uses fastText vectors (714k) trained on CommonCrawl and Wikipedia. It also uses character pretraining to initialize the token-to-vector weights.
`tl_calamancy_trf`	transformer, morphologizer, parser, trainable_lemmatizer, ner	TLUnified (Cruz and Cheng, 2021), UD Tagalog (2023)	The transformer component uses roberta-tagalog-large.
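
The component lists in the table can double as a quick sanity check after loading a pipeline. The sketch below copies the component names from the table and reports any that are absent from a given list of pipe names. `EXPECTED_COMPONENTS` and `missing_components` are hypothetical names and not part of calamanCy; spaCy exposes the names of the loaded components as `nlp.pipe_names`.

```python
from typing import Dict, Iterable, List

# Component names copied from the table above.
_SHARED = ["morphologizer", "parser", "trainable_lemmatizer", "ner"]
EXPECTED_COMPONENTS: Dict[str, List[str]] = {
    "tl_calamancy_md": ["tok2vec"] + _SHARED,
    "tl_calamancy_lg": ["tok2vec"] + _SHARED,
    "tl_calamancy_trf": ["transformer"] + _SHARED,
}


def missing_components(model_name: str, pipe_names: Iterable[str]) -> List[str]:
    """Return the expected components that are not present in pipe_names."""
    present = set(pipe_names)
    return [name for name in EXPECTED_COMPONENTS[model_name] if name not in present]


# Typical use, assuming `nlp` is a loaded pipeline:
#   print(missing_components("tl_calamancy_md", nlp.pipe_names))
```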

Release history

0.1.2: Dec 4, 2023
0.1.1: Jul 28, 2023
0.1.0: Jul 2, 2023
0.0.1: Feb 14, 2023 (this version)

Download files

Source Distribution

calamanCy-0.0.1.tar.gz (4.1 kB)

Uploaded Feb 14, 2023 Source

Built Distribution

calamanCy-0.0.1-py3-none-any.whl (4.0 kB)

Uploaded Feb 14, 2023 Python 3

Hashes for calamanCy-0.0.1.tar.gz

Algorithm	Hash digest
SHA256	`5a866d54d803ed6b9f7f98b9b9ec196fa189fcc1e493af5664d37af4adce17ee`
MD5	`82b6e5cd3e1cc30d2ff4c2ff2be079d5`
BLAKE2b-256	`6306364007f971294849b5eb78b5eab982ec35478bb2bf7b912c34588acf4a75`

Hashes for calamanCy-0.0.1-py3-none-any.whl

Algorithm	Hash digest
SHA256	`ee7a135b36f98f954889b89f054c1c6a6497b8f40a407a03aeabe7dff7a4e3c5`
MD5	`3ffb0a3b22190a00443fa23b947a9a8c`
BLAKE2b-256	`0cdea7fd6c2b10dc4d8d9e550339412495021dae621e0b1eb566fd86a0e1df88`
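
The digests above can be used to check that a downloaded file is intact. A minimal sketch using Python's `hashlib` follows; the function names `sha256_of_file` and `matches_digest` are hypothetical, and the file path in the usage comment is assumed to be wherever the archive was saved.

```python
import hashlib

# SHA256 digest of calamanCy-0.0.1.tar.gz, copied from the table above.
EXPECTED_SDIST_SHA256 = (
    "5a866d54d803ed6b9f7f98b9b9ec196fa189fcc1e493af5664d37af4adce17ee"
)


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Typical use, assuming the archive sits in the current directory:
#   print(matches_digest("calamanCy-0.0.1.tar.gz", EXPECTED_SDIST_SHA256))
```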