Thai Nested Named Entity Recognition

Thai-NNER (Thai Nested Named Entity Recognition Corpus)

Code associated with the paper "Thai Nested Named Entity Recognition Corpus", published in Findings of ACL 2022.

Abstract / Motivation

This work presents the first Thai Nested Named Entity Recognition (N-NER) dataset. Thai N-NER consists of 264,798 mentions and 104 classes, with a maximum nesting depth of 8 layers, drawn from a total of 4,894 documents (news articles and restaurant reviews). To the best of our knowledge, it is the largest non-English N-NER dataset and the first non-English one with fine-grained classes.

How to use?

Install

pip install thai_nner

Usage

You need to download a model checkpoint from "data/[checkpoints]": Download

Example: 0906_214036/checkpoint.pth

and convert it with the convert_model2use.py script:

python convert_model2use.py -i 0906_214036/checkpoint.pth -o model.pth

Usage Example

import os
# Expose GPU 0 to the process; for CPU-only use: os.environ['CUDA_VISIBLE_DEVICES'] = ""
os.environ['CUDA_VISIBLE_DEVICES'] = "0"
from thai_nner import NNER
nner = NNER("model.pth")  # the converted checkpoint from the step above
nner.get_tag("วันนี้วันที่ 5 เมษายน 2565 เป็นวันที่อากาศดีมาก")
# output (tokens, then entities):
# (['<s>', 'วันนี้', 'วันที่', '', '', '5', '', '', 'เมษายน', '', '', '25', '65', '', '', 'เป็น', 'วันที่', '', 'อากาศ', '', 'ดีมาก', '</s>'],
#  [{'text': ['วันนี้'], 'span': [1, 2], 'entity_type': 'rel'},
#   {'text': ['วันที่', '', '', '5'], 'span': [2, 6], 'entity_type': 'day'},
#   {'text': ['วันที่', '', '', '5', '', '', 'เมษายน', '', '', '25', '65'], 'span': [2, 13], 'entity_type': 'date'},
#   {'text': ['', '5'], 'span': [4, 6], 'entity_type': 'cardinal'},
#   {'text': ['', 'เมษายน'], 'span': [7, 9], 'entity_type': 'month'},
#   {'text': ['', '25', '65'], 'span': [10, 13], 'entity_type': 'year'}])
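The output above pairs a token list with a list of entity dictionaries. In this sample, each 'span' reads as a half-open range of token indices, so tokens[start:end] reproduces the entity's 'text'. The following is a minimal sketch of how such output could be post-processed to show the nesting. It runs on the sample data copied from above rather than calling the library, and the helper names (contains, nesting_depths, surface) are hypothetical, not part of thai_nner.

```python
# Sample (tokens, entities) output copied from the usage example above.
tokens = ['<s>', 'วันนี้', 'วันที่', '', '', '5', '', '', 'เมษายน', '', '',
          '25', '65', '', '', 'เป็น', 'วันที่', '', 'อากาศ', '', 'ดีมาก', '</s>']
entities = [
    {'text': ['วันนี้'], 'span': [1, 2], 'entity_type': 'rel'},
    {'text': ['วันที่', '', '', '5'], 'span': [2, 6], 'entity_type': 'day'},
    {'text': ['วันที่', '', '', '5', '', '', 'เมษายน', '', '', '25', '65'],
     'span': [2, 13], 'entity_type': 'date'},
    {'text': ['', '5'], 'span': [4, 6], 'entity_type': 'cardinal'},
    {'text': ['', 'เมษายน'], 'span': [7, 9], 'entity_type': 'month'},
    {'text': ['', '25', '65'], 'span': [10, 13], 'entity_type': 'year'},
]


def contains(outer, inner):
    """Return True if span `outer` encloses span `inner` and is not identical to it."""
    return outer[0] <= inner[0] and inner[1] <= outer[1] and outer != inner


def nesting_depths(entities):
    """For each entity, count how many other entities enclose its span."""
    return [sum(contains(other['span'], entity['span']) for other in entities)
            for entity in entities]


def surface(tokens, span):
    """Join the tokens covered by a half-open [start, end) span."""
    start, end = span
    return ''.join(tokens[start:end])


depths = nesting_depths(entities)
# Print each entity indented by its nesting depth.
for entity, depth in zip(entities, depths):
    print('  ' * depth + f"{entity['entity_type']}: {surface(tokens, entity['span'])}")
```

Here 'cardinal' sits inside 'day', which in turn sits inside 'date', so it is printed at the deepest indentation. Joining tokens with an empty string is only an approximation of the original text, since the sample tokens do not carry the whitespace of the input sentence.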

Example

Python library

Colabs

Test

Colabs

Dataset and Models

Model Checkpoints

Download the model checkpoints and save them at the following path "data/[checkpoints]": Download

Dataset

Download and save the dataset at the following path "data/[scb-nner-th-2022]": Download

Pre-trained Language Model

Download and save the pre-trained language model at the following path "data/[lm]": Download

Training/Testing

Train

python train.py --device 0,1 -c config.json

Test

python test_nne.py --resume [PATH]/checkpoint.pth

Tensorboard

tensorboard --logdir [PATH]/save/log/

Results

Experimental results

Citation

@inproceedings{Buaphet-etal-2022-thai-nner,
    title = "Thai Nested Named Entity Recognition Corpus",
    author = "Buaphet, Weerayut and
      Udomcharoenchaikit, Can and
      Limkonchotiwat, Peerat and
      Rutherford, Attapol and
      Nutanong, Sarana",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
    year = "2022",
    publisher = "Association for Computational Linguistics",
}

License

CC-BY-SA 3.0

Acknowledgements

  • Dataset information: The Thai N-NER corpus is supported in part by the Digital Economy Promotion Agency (depa) Digital Infrastructure Fund MP-62-003 and Siam Commercial Bank. This dataset is released as scb-nner-th-2022.
  • Training code: Tensorflow-Project-Template by Mahmoud Gemy
