Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing
Trankit is a light-weight Transformer-based Python Toolkit for multilingual Natural Language Processing (NLP). It provides a trainable pipeline for fundamental NLP tasks over 100 languages, and 90 downloadable pretrained pipelines for 56 languages. Trankit can process inputs that are either untokenized (raw) or pretokenized, at both sentence and document level. Currently, Trankit supports the following tasks:
- Sentence segmentation.
- Tokenization.
- Multi-word token expansion.
- Part-of-speech tagging.
- Morphological feature tagging.
- Dependency parsing.
- Named entity recognition.
Built on the state-of-the-art multilingual pretrained transformer XLM-Roberta, Trankit significantly outperforms prior multilingual NLP pipelines (e.g., UDPipe, Stanza) on many tasks over 90 Universal Dependencies v2.5 treebanks while remaining efficient in memory usage and speed, making it usable for general users. Below is a performance comparison between Trankit and other NLP toolkits on Arabic, Chinese, and English.
| Treebank | System | Tokens | Sents. | Words | UPOS | XPOS | UFeats | Lemmas | UAS | LAS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Overall (90 treebanks) | Trankit | 99.23 | 91.82 | 99.02 | 95.65 | 94.05 | 93.21 | 94.27 | 87.06 | 83.69 |
| | Stanza | 99.26 | 88.58 | 98.90 | 94.21 | 92.50 | 91.75 | 94.15 | 83.06 | 78.68 |
| Arabic-PADT | Trankit | 99.93 | 96.59 | 99.22 | 96.31 | 94.08 | 94.28 | 94.65 | 88.39 | 84.68 |
| | Stanza | 99.98 | 80.43 | 97.88 | 94.89 | 91.75 | 91.86 | 93.27 | 83.27 | 79.33 |
| | UDPipe | 99.98 | 82.09 | 94.58 | 90.36 | 84.00 | 84.16 | 88.46 | 72.67 | 68.14 |
| Chinese-GSD | Trankit | 97.01 | 99.70 | 97.01 | 94.21 | 94.02 | 96.59 | 97.01 | 85.19 | 82.54 |
| | Stanza | 92.83 | 98.80 | 92.83 | 89.12 | 88.93 | 92.11 | 92.83 | 72.88 | 69.82 |
| | UDPipe | 90.27 | 99.10 | 90.27 | 84.13 | 84.04 | 89.05 | 90.26 | 61.60 | 57.81 |
| English-EWT | Trankit | 98.48 | 88.35 | 98.48 | 95.95 | 95.71 | 96.26 | 96.84 | 90.14 | 87.96 |
| | Stanza | 99.01 | 81.13 | 99.01 | 95.40 | 95.12 | 96.11 | 97.21 | 86.22 | 83.59 |
| | UDPipe | 98.90 | 77.40 | 98.90 | 93.26 | 92.75 | 94.23 | 95.45 | 80.22 | 77.03 |
| | spaCy | 97.30 | 61.19 | 97.30 | 86.72 | 90.83 | - | 87.05 | - | - |
Performance comparison between Trankit and these toolkits on other languages can be found here on our documentation page.
We also created a Demo Website for Trankit, which is hosted at: http://nlp.uoregon.edu/trankit
Technical details about Trankit are presented in our following paper. Please cite the paper if you use Trankit in your software or research.
@article{unset,
  title={Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing},
  author={unset},
  journal={arXiv preprint arXiv:},
  year={2021}
}
Installation
Trankit can be easily installed via one of the following methods:
Using pip
pip install trankit
This command installs Trankit and all its dependencies automatically.
From source
git clone https://github.com/nlp-uoregon/trankit.git
cd trankit
pip install -e .
This clones our GitHub repo and installs Trankit in editable mode.
Quick Examples
Initialize a pretrained pipeline
The following code shows how to initialize a pretrained pipeline for English; it is instructed to run on a GPU, automatically download pretrained models, and store them in the specified cache directory. Trankit will not download pretrained models if they already exist.
from trankit import Pipeline
# initialize a pretrained English pipeline
p = Pipeline(lang='english', gpu=True, cache_dir='./cache')
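The download-if-missing caching behavior can be sketched as follows (a minimal illustration with a hypothetical `ensure_model` helper, not Trankit's actual internals):

```python
import os

def ensure_model(cache_dir, lang, download_fn):
    """Return the cached model path, invoking download_fn only on a cache miss."""
    path = os.path.join(cache_dir, lang + ".model")
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        download_fn(path)  # skipped entirely when the model is already cached
    return path
```

On a second call with the same `cache_dir` and language, the download step is a no-op.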
Basic functions
Trankit can process inputs that are untokenized (raw) or pretokenized strings, at both sentence and document level. A pretokenized input can be a list of strings (i.e., a tokenized sentence) or a list of lists of strings (i.e., a tokenized document with multiple tokenized sentences); both forms are automatically recognized by Trankit. If the input is a single sentence, the flag is_sent must be set to True.
from trankit import Pipeline
p = Pipeline(lang='english', gpu=True, cache_dir='./cache')
######## document-level processing ########
untokenized_doc = '''Hello! This is Trankit.'''
pretokenized_doc = [['Hello', '!'], ['This', 'is', 'Trankit', '.']]
# perform all tasks on the input
processed_doc1 = p(untokenized_doc)
processed_doc2 = p(pretokenized_doc)
######## sentence-level processing #######
untokenized_sent = '''This is Trankit.'''
pretokenized_sent = ['This', 'is', 'Trankit', '.']
# perform all tasks on the input
processed_sent1 = p(untokenized_sent, is_sent=True)
processed_sent2 = p(pretokenized_sent, is_sent=True)
# perform separate tasks on the input
sents = p.ssplit(untokenized_doc) # sentence segmentation
tokenized_doc = p.tokenize(untokenized_doc) # sentence segmentation and tokenization
tokenized_sent = p.tokenize(untokenized_sent, is_sent=True) # tokenization only
posdeps = p.posdep(untokenized_doc) # upos, xpos, ufeats, dep parsing
ners = p.ner(untokenized_doc) # ner tagging
lemmas = p.lemmatize(untokenized_doc) # lemmatization
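The automatic recognition of input shapes described above can be sketched like this (a simplified stand-in for illustration, not Trankit's actual implementation):

```python
def classify_input(inp):
    """Distinguish raw text from the two pretokenized shapes Trankit accepts."""
    if isinstance(inp, str):
        return "raw"                             # untokenized string
    if isinstance(inp, list) and inp and all(isinstance(t, str) for t in inp):
        return "pretokenized sentence"           # list of strings
    if isinstance(inp, list) and inp and all(
        isinstance(s, list) and all(isinstance(t, str) for t in s) for s in inp
    ):
        return "pretokenized document"           # list of lists of strings
    raise ValueError("unsupported input shape")
```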
Note that although pretokenized inputs can always be processed, using them for languages that require multi-word token expansion, such as Arabic or French, may produce incorrect results. Please check the Requires MWT expansion column of this table to see whether a particular language requires multi-word token expansion.
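To illustrate why this matters: multi-word token expansion splits a single surface token into several syntactic words, e.g. French "du" into "de" + "le". The toy lookup below shows the idea only; Trankit's expander is learned from treebank data, not a static table.

```python
# Toy French MWT expansion table (illustrative; Trankit learns this mapping).
MWT_TABLE = {"du": ["de", "le"], "des": ["de", "les"], "aux": ["à", "les"]}

def expand_tokens(tokens):
    """Replace each multi-word token with its component syntactic words."""
    words = []
    for tok in tokens:
        words.extend(MWT_TABLE.get(tok.lower(), [tok]))
    return words
```

A pipeline that receives pretokenized input skips this step, so downstream taggers and parsers see surface tokens like "du" instead of the syntactic words the treebanks annotate.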
For more detailed examples, please check out our documentation page.
Multilingual usage
To process inputs in different languages, we need to initialize a multilingual pipeline. The following code shows an example of initializing a multilingual pipeline for Arabic, Chinese, Dutch, and English.
from trankit import Pipeline
# initialize a multilingual pipeline
p = Pipeline(lang='english', gpu=True, cache_dir='./cache')
langs = ['arabic', 'chinese', 'dutch']
for lang in langs:
p.add(lang)
# tokenize English input
p.set_active('english')
en = p.tokenize('Rich was here before the scheduled time.')
# get ner tags for Arabic input
p.set_active('arabic')
ar = p.ner('وكان كنعان قبل ذلك رئيس جهاز الامن والاستطلاع للقوات السورية العاملة في لبنان.')
In this example, .set_active() is used to switch between languages.
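The switching pattern can be sketched as a registry of per-language components with a pointer to the active one (illustrative only; Trankit actually swaps lightweight adapters over a single shared transformer rather than keeping separate full models):

```python
class MultilingualRegistry:
    """Hold per-language components and track which language is active."""
    def __init__(self, lang):
        self.components = {lang: f"{lang}-components"}  # stand-in for real models
        self.active = lang

    def add(self, lang):
        self.components[lang] = f"{lang}-components"

    def set_active(self, lang):
        if lang not in self.components:
            raise ValueError(f"{lang} was never added to this pipeline")
        self.active = lang
```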
Training your own pipelines
Training customized pipelines is easy with Trankit via the class TPipeline. Below is sample code for training a token and sentence splitter with Trankit.
from trankit import TPipeline
tp = TPipeline(training_config={
    'task': 'tokenize',
    'save_dir': './saved_model',
    'train_txt_fpath': './train.txt',
    'train_conllu_fpath': './train.conllu',
    'dev_txt_fpath': './dev.txt',
    'dev_conllu_fpath': './dev.conllu'
})
tp.train()
Detailed guidelines for training customized pipelines can be found here.
Acknowledgements
We use the AdapterHub to implement our plug-and-play mechanism with Adapters. To speed up the development process, the implementations for the MWT expander and the lemmatizer are adapted from Stanza.