
Transformer-based named entity recognition



T-NER

TNER is a Python tool for language model finetuning on named-entity recognition (NER), available via pip. It provides an easy interface to finetune models and to test them on cross-domain data, with 9 publicly available NER datasets as well as custom datasets. All models finetuned with TNER can be deployed on our web app for visualization. Finally, we release 46 XLM-RoBERTa models finetuned for NER on the transformers model hub.

Table of Contents

  1. Get Started
  2. Web App
  3. Model Finetuning
  4. Model Evaluation
  5. Model Inference
  6. Datasets

Get Started

Install pip package

pip install tner

or directly from the repository for the latest version.

pip install git+https://github.com/asahi417/tner

Web App

To start the web app, first clone the repository

git clone https://github.com/asahi417/tner
cd tner

then launch the server by

uvicorn app:app --reload --log-level debug --host 0.0.0.0 --port 8000

and open http://0.0.0.0:8000 in your browser once it's ready. You can specify the model to deploy via the environment variable NER_MODEL, which defaults to asahi417/tner-xlm-roberta-large-ontonotes5. NER_MODEL can be either a path to your local model checkpoint directory or a model name on the transformers model hub.
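
If you want to deploy a different model, set the environment variable before launching the server. The snippet below is a minimal sketch that does the same from Python; it assumes it is run from the cloned repository root so that app:app is importable.

import os
import uvicorn

# assumption: app.py reads NER_MODEL at startup, as described above
os.environ['NER_MODEL'] = 'asahi417/tner-xlm-roberta-large-ontonotes5'

# equivalent to the uvicorn command above
uvicorn.run('app:app', host='0.0.0.0', port=8000, log_level='debug')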

Acknowledgement: The app interface is heavily inspired by this repository.

Model Finetuning

Language model finetuning on NER can be done with a few lines:

import tner
trainer = tner.TrainTransformersNER(checkpoint_dir='./ckpt_tner', dataset="data-name", transformers_model="transformers-model")
trainer.train()

where transformers_model is a pre-trained model name from the transformers model hub and dataset is a dataset alias or a path to a custom dataset, as explained in the dataset section. Model files will be generated at checkpoint_dir, and they can be uploaded to the transformers model hub without any changes.

To show validation accuracy at the end of each epoch,

trainer.train(monitor_validation=True)

and to tune training parameters such as batch size, number of epochs, or learning rate, please take a look at the argument description.
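
For illustration, such hyperparameters could be passed when constructing the trainer. The keyword names below are assumptions inferred from the tner-train flags shown further down, so check the argument description for the actual signature.

import tner

# a minimal sketch; the keyword names are assumptions mirroring the tner-train flags below
trainer = tner.TrainTransformersNER(
    checkpoint_dir='./ckpt_tner',
    dataset='conll2003',
    transformers_model='xlm-roberta-base',
    batch_size=16,        # -b / BATCH_SIZE
    lr=1e-5,              # --lr
    total_step=5000,      # --total-step
    warmup_step=700,      # --warmup-step
    weight_decay=1e-7,    # --weight-decay
    max_seq_length=128    # --max-seq-length
)
trainer.train(monitor_validation=True)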

Train on multiple datasets: A model can be trained on a concatenation of multiple datasets by providing a list of dataset names.

trainer = tner.TrainTransformersNER(checkpoint_dir='./ckpt_merged', dataset=["ontonotes5", "conll2003"], transformers_model="xlm-roberta-base")

A custom dataset can also be added, e.g. dataset=["ontonotes5", "./examples/custom_data_sample"].

Command line tool: Model finetuning with the CL tool.

tner-train [-h] [-c CHECKPOINT_DIR] [-d DATA] [-t TRANSFORMER] [-b BATCH_SIZE] [--max-grad-norm MAX_GRAD_NORM] [--max-seq-length MAX_SEQ_LENGTH] [--random-seed RANDOM_SEED] [--lr LR] [--total-step TOTAL_STEP] [--warmup-step WARMUP_STEP] [--weight-decay WEIGHT_DECAY] [--fp16] [--monitor-validation] [--lower-case]

Model Evaluation

NER models can easily be evaluated in both in-domain and out-of-domain settings.

import tner
trainer = tner.TrainTransformersNER(checkpoint_dir='path-to-checkpoint', transformers_model="language-model-name")
trainer.test(test_dataset='data-name')

Entity span prediction: For a better understanding of out-of-domain accuracy, we provide an entity span prediction pipeline, which ignores the entity type and computes metrics only on the IOB entity positions.

trainer.test(test_dataset='data-name', entity_span_prediction=True)
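
A single finetuned checkpoint can, for instance, be evaluated across several datasets with the same call (a short sketch reusing the checkpoint directory from the finetuning example above):

import tner

trainer = tner.TrainTransformersNER(checkpoint_dir='./ckpt_tner', transformers_model='xlm-roberta-base')
for data in ['conll2003', 'wnut2017', 'fin']:
    trainer.test(test_dataset=data)                                # entity-type metrics
    trainer.test(test_dataset=data, entity_span_prediction=True)   # entity-span-only metrics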

Command line tool: Model evaluation with the CL tool.

tner-test [-h] -c CHECKPOINT_DIR [--lower-case] [--test-data TEST_DATA] [--test-lower-case] [--test-entity-span]

Model Inference

If you just want predictions from a finetuned NER model, this is the best option for you.

import tner
classifier = tner.TransformersNER('transformers-model')
test_sentences = [
    'I live in United States, but Microsoft asks me to move to Japan.',
    'I have an Apple computer.',
    'I like to eat an apple.'
]
classifier.predict(test_sentences)
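
The model name can also be one of the released checkpoints on the transformers model hub, for example the default web app model mentioned above (a minimal sketch):

import tner

# load one of the released XLM-RoBERTa NER checkpoints from the transformers model hub
classifier = tner.TransformersNER('asahi417/tner-xlm-roberta-large-ontonotes5')
print(classifier.predict(['Google has its headquarters in Mountain View.']))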

Command line tool: Model inference with the CL tool.

tner-predict [-h] [-c CHECKPOINT]

Datasets

Public datasets that can be fetched with TNER are summarized below.

Name (alias) | Genre | Language | Entity types | Data size (train/valid/test) | Note
OntoNotes 5 (ontonotes5) | News, Blog, Dialogue | English | 18 | 59,924/8,582/8,262 |
CoNLL 2003 (conll2003) | News | English | 4 | 14,041/3,250/3,453 |
WNUT 2017 (wnut2017) | SNS | English | 6 | 1,000/1,008/1,287 |
FIN (fin) | Finance | English | 4 | 1,164/-/303 |
BioNLP 2004 (bionlp2004) | Chemical | English | 5 | 18,546/-/3,856 |
BioCreative V CDR (bc5cdr) | Medical | English | 2 | 5,228/5,330/5,865 | split into sentences to reduce sequence length
WikiAnn (panx_dataset/en, panx_dataset/ja, etc.) | Wikipedia | 282 languages | 3 | 20,000/10,000/10,000 |
Japanese Wikipedia (wiki_ja) | Wikipedia | Japanese | 8 | -/-/500 | test set only
Japanese WikiNews (wiki_news_ja) | Wikipedia | Japanese | 10 | -/-/1,000 | test set only
MIT Restaurant (mit_restaurant) | Restaurant review | English | 8 | 7,660/-/1,521 | lower-cased
MIT Movie (mit_movie_trivia) | Movie review | English | 12 | 7,816/-/1,953 | lower-cased

To take a closer look at each dataset, one may want to use tner.get_dataset_ner as in

import tner
data, label_to_id, language, unseen_entity_set = tner.get_dataset_ner('data-name')

where data consists of the following structure.

{
    'train': {
        'data': [
            ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.'],
            ['From', 'Green', 'Newsfeed', ':', 'AHFA', 'extends', 'deadline', 'for', 'Sage', 'Award', 'to', 'Nov', '.', '5', 'http://tinyurl.com/24agj38'], ...
        ],
        'label': [
            [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ...
        ]
    },
    'valid': ...
}
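
As a quick sanity check, the integer labels can be mapped back to tag strings. The sketch below assumes, as the name suggests, that label_to_id maps tag strings to the integer ids used in data.

import tner

data, label_to_id, language, unseen_entity_set = tner.get_dataset_ner('wnut2017')

# assumption: label_to_id maps tag strings to the integer ids shown above
id_to_label = {i: l for l, i in label_to_id.items()}
tokens = data['train']['data'][0]
tags = [id_to_label[i] for i in data['train']['label'][0]]
print(list(zip(tokens, tags)))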

WikiAnn dataset
All datasets should be fetched automatically except the panx_dataset/* datasets, for which you need to manually download the data from here (note that it will download as AmazonPhotos.zip) into the cache folder. The cache folder is ~/.cache/tner by default, but it can be changed via the cache_dir argument of the training or inference instance.
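
Once the archive is in place, a WikiAnn split can be used like any other dataset alias. The snippet below is a sketch assuming cache_dir is the constructor argument referred to above.

import tner

# assumption: cache_dir is the argument mentioned above; the manually downloaded
# WikiAnn data must already be in this folder
trainer = tner.TrainTransformersNER(
    checkpoint_dir='./ckpt_panx_en',
    dataset='panx_dataset/en',
    transformers_model='xlm-roberta-base',
    cache_dir='./cache'
)
trainer.train()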

Custom Dataset

To go beyond the public datasets, users can bring their own datasets by formatting them in the IOB format described in the CoNLL 2003 NER shared task paper: each data file contains one word per line, with empty lines representing sentence boundaries. At the end of each line there is a tag which states whether the current word is inside a named entity or not; the tag also encodes the type of named entity. Here is an example sentence:

EU B-ORG
rejects O
German B-MISC
call O
to O
boycott O
British B-MISC
lamb O
. O

Words tagged with O are outside of named entities, and the I-XXX tag is used for words inside a named entity of type XXX. Whenever two entities of type XXX are immediately next to each other, the first word of the second entity is tagged B-XXX to show that it starts another entity. The custom dataset should have a train.txt and a valid.txt file in the same folder. Please take a look at the sample custom data.
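
As an illustration, a tiny dataset in this format can be written with plain Python; the folder name and sentences below are made up for the example.

import os

# two toy sentences as (token, tag) pairs in the IOB format described above
sentences = [
    [('EU', 'B-ORG'), ('rejects', 'O'), ('German', 'B-MISC'), ('call', 'O'), ('.', 'O')],
    [('John', 'B-PER'), ('lives', 'O'), ('in', 'O'), ('New', 'B-LOC'), ('York', 'I-LOC'), ('.', 'O')],
]

os.makedirs('./my_custom_data', exist_ok=True)
for split in ['train.txt', 'valid.txt']:
    with open(os.path.join('./my_custom_data', split), 'w') as f:
        for sentence in sentences:
            for token, tag in sentence:
                f.write(f'{token} {tag}\n')
            f.write('\n')  # empty line marks the sentence boundary

The folder can then be passed to the trainer as dataset='./my_custom_data', in the same way as a dataset alias.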
