
ThaiXtransformers


Use pretrained RoBERTa-based Thai language models from the VISTEC-depa AI Research Institute of Thailand.

Forked from vistec-AI/thai2transformers.

This project provides the tokenizer and data-preprocessing utilities for the RoBERTa models released by the VISTEC-depa AI Research Institute of Thailand.

Paper: WangchanBERTa: Pretraining transformer-based Thai Language Models

Install

pip install thaixtransformers

Usage

Tokenizer

from thaixtransformers import Tokenizer

To load a tokenizer, pass the name of the model you want to use:

Tokenizer(model_name) -> Tokenizer

Example

from thaixtransformers import Tokenizer
from transformers import pipeline
from transformers import AutoModelForMaskedLM

tokenizer = Tokenizer("airesearch/wangchanberta-base-wiki-newmm")
model = AutoModelForMaskedLM.from_pretrained("airesearch/wangchanberta-base-wiki-newmm")

classifier = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(classifier("ผมชอบ<mask>มาก ๆ"))
# output:
#    [{'score': 0.05261131376028061,
#  'token': 6052,
#  'token_str': 'อินเทอร์เน็ต',
#  'sequence': 'ผมชอบอินเทอร์เน็ตมากๆ'},
# {'score': 0.03980604186654091,
#  'token': 11893,
#  'token_str': 'อ่านหนังสือ',
#  'sequence': 'ผมชอบอ่านหนังสือมากๆ'},
#    ...]
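The pipeline returns a list of candidate fills sorted by score, so the first entry is the model's top prediction. A minimal sketch of post-processing that output (the dicts below are abbreviated copies of the sample printout above, not a live model call):

```python
# Sketch: selecting the best fill from fill-mask pipeline output.
# The list mirrors the abbreviated sample output shown above.
predictions = [
    {"score": 0.0526, "token_str": "อินเทอร์เน็ต", "sequence": "ผมชอบอินเทอร์เน็ตมากๆ"},
    {"score": 0.0398, "token_str": "อ่านหนังสือ", "sequence": "ผมชอบอ่านหนังสือมากๆ"},
]

# Pick the candidate with the highest score.
best = max(predictions, key=lambda p: p["score"])
print(best["sequence"])  # -> ผมชอบอินเทอร์เน็ตมากๆ
```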

Preprocess

If you want to preprocess data before training a model, use process_transformers.

from thaixtransformers.preprocess import process_transformers

process_transformers(str) -> str

Example

from thaixtransformers.preprocess import process_transformers

print(process_transformers("สวัสดี   :D"))
# output: 'สวัสดี<_>:d'
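From the example output, two effects are visible: Latin characters are lowercased (":D" becomes ":d") and runs of spaces are replaced with the `<_>` space token. A rough sketch of just those two rules, inferred from the example and not the library's actual implementation:

```python
import re

def sketch_preprocess(text: str) -> str:
    """Illustrative only: two rules inferred from the example output
    of process_transformers; the real function does more than this."""
    text = text.lower()                # lowercase Latin characters (":D" -> ":d")
    text = re.sub(r" +", "<_>", text)  # replace runs of spaces with the <_> token
    return text

print(sketch_preprocess("สวัสดี   :D"))  # -> 'สวัสดี<_>:d'
```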

BibTeX entry and citation info

@misc{lowphansirikul2021wangchanberta,
      title={WangchanBERTa: Pretraining transformer-based Thai Language Models}, 
      author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
      year={2021},
      eprint={2101.09635},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
