Trankit: A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

Trankit is a light-weight Transformer-based Python toolkit for multilingual Natural Language Processing (NLP). It provides a trainable pipeline for fundamental NLP tasks over 100 languages, and 90 downloadable pretrained pipelines for 56 languages. Trankit can process inputs that are untokenized (raw) or pretokenized strings, at both the sentence and document level. Currently, Trankit supports the following tasks:

  • Sentence segmentation.
  • Tokenization.
  • Multi-word token expansion.
  • Part-of-speech tagging.
  • Morphological feature tagging.
  • Dependency parsing.
  • Named entity recognition.

Built on the state-of-the-art multilingual pretrained transformer XLM-Roberta, Trankit significantly outperforms prior multilingual NLP pipelines (e.g., UDPipe, Stanza) on many tasks over 90 Universal Dependencies v2.5 treebanks while remaining efficient in memory usage and speed, making it usable by general users. Below is a performance comparison between Trankit and other NLP toolkits, overall and on Arabic, Chinese, and English.

Treebank                 System    Tokens   Sents.   Words    UPOS     XPOS     UFeats   Lemmas   UAS      LAS

Overall (90 treebanks)   Trankit   99.23    91.82    99.02    95.65    94.05    93.21    94.27    87.06    83.69
                         Stanza    99.26    88.58    98.90    94.21    92.50    91.75    94.15    83.06    78.68

Arabic-PADT              Trankit   99.93    96.59    99.22    96.31    94.08    94.28    94.65    88.39    84.68
                         Stanza    99.98    80.43    97.88    94.89    91.75    91.86    93.27    83.27    79.33
                         UDPipe    99.98    82.09    94.58    90.36    84.00    84.16    88.46    72.67    68.14

Chinese-GSD              Trankit   97.01    99.70    97.01    94.21    94.02    96.59    97.01    85.19    82.54
                         Stanza    92.83    98.80    92.83    89.12    88.93    92.11    92.83    72.88    69.82
                         UDPipe    90.27    99.10    90.27    84.13    84.04    89.05    90.26    61.60    57.81

English-EWT              Trankit   98.48    88.35    98.48    95.95    95.71    96.26    96.84    90.14    87.96
                         Stanza    99.01    81.13    99.01    95.40    95.12    96.11    97.21    86.22    83.59
                         UDPipe    98.90    77.40    98.90    93.26    92.75    94.23    95.45    80.22    77.03
                         spaCy     97.30    61.19    97.30    86.72    90.83    -        87.05    -        -

Performance comparisons between Trankit and these toolkits on other languages can be found on our documentation page.

We also created a Demo Website for Trankit, which is hosted at: http://nlp.uoregon.edu/trankit

Technical details about Trankit are presented in our following paper. Please cite the paper if you use Trankit in your software or research.

@article{unset,
  title={Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing},
  author={unset},
  journal={arXiv preprint arXiv:},
  year={2021}
}

Installation

Trankit can be easily installed via one of the following methods:

Using pip

pip install trankit

This command installs Trankit and all required dependencies automatically.

From source

git clone https://github.com/nlp-uoregon/trankit.git
cd trankit
pip install -e .

This first clones our GitHub repo and then installs Trankit from source in editable mode.

Quick Examples

Initialize a pretrained pipeline

The following code shows how to initialize a pretrained pipeline for English; it is instructed to run on a GPU, automatically download pretrained models, and store them in the specified cache directory. Trankit will not re-download pretrained models if they already exist.

from trankit import Pipeline

# initialize an English pipeline
p = Pipeline(lang='english', gpu=True, cache_dir='./cache')
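
If no GPU is available, the same pipeline can be created on CPU by switching the gpu flag; the snippet below is just a minimal variation of the call above.

from trankit import Pipeline

# same English pipeline, but running on CPU; cache_dir only controls where the
# pretrained models are stored
p = Pipeline(lang='english', gpu=False, cache_dir='./cache')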

Basic functions

Trankit can process inputs that are untokenized (raw) or pretokenized strings, at both the sentence and document level. A pretokenized input can be a list of strings (i.e., a tokenized sentence) or a list of lists of strings (i.e., a tokenized document with multiple tokenized sentences); both forms are recognized automatically by Trankit. If the input is a sentence, the flag is_sent must be set to True.

from trankit import Pipeline

p = Pipeline(lang='english', gpu=True, cache_dir='./cache')

######## document-level processing ########
untokenized_doc = '''Hello! This is Trankit.'''
pretokenized_doc = [['Hello', '!'], ['This', 'is', 'Trankit', '.']]

# perform all tasks on the input
processed_doc1 = p(untokenized_doc)
processed_doc2 = p(pretokenized_doc)

######## sentence-level processing ####### 
untokenized_sent = '''This is Trankit.'''
pretokenized_sent = ['This', 'is', 'Trankit', '.']

# perform all tasks on the input
processed_sent1 = p(untokenized_sent, is_sent=True)
processed_sent2 = p(pretokenized_sent, is_sent=True)

# perform separate tasks on the input
sents = p.ssplit(untokenized_doc) # sentence segmentation
tokenized_doc = p.tokenize(untokenized_doc) # sentence segmentation and tokenization
tokenized_sent = p.tokenize(untokenized_sent, is_sent=True) # tokenization only
posdeps = p.posdep(untokenized_doc) # upos, xpos, ufeats, dep parsing
ners = p.ner(untokenized_doc) # ner tagging
lemmas = p.lemmatize(untokenized_doc) # lemmatization

Note that, although pretokenized inputs can always be processed, using pretokenized inputs for languages that require multi-word token expansion, such as Arabic or French, may not produce correct results. Please check the column Requires MWT expansion of this table to see whether a particular language requires multi-word token expansion.
For more detailed examples, please check out our documentation page.
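
To give a feel for what the pipeline returns, the sketch below walks over a processed document. The output is a plain Python dictionary; the exact key names used here ('sentences', 'tokens', 'text', 'upos', 'deprel', 'lemma') are assumptions based on common usage and may differ slightly, so consult the documentation page for the authoritative format.

from trankit import Pipeline

p = Pipeline(lang='english', gpu=True, cache_dir='./cache')
doc = p('Hello! This is Trankit.')

# iterate over sentences and tokens in the returned dictionary
# (key names are assumptions; .get() keeps the sketch robust if they differ)
for sent in doc.get('sentences', []):
    for tok in sent.get('tokens', []):
        print(tok.get('text'), tok.get('upos'), tok.get('deprel'), tok.get('lemma'))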

Multilingual usage

To process inputs in different languages, we need to initialize a multilingual pipeline. The following code shows an example of initializing a multilingual pipeline for Arabic, Chinese, Dutch, and English.

from trankit import Pipeline

# initialize a multilingual pipeline
p = Pipeline(lang='english', gpu=True, cache_dir='./cache')

langs = ['arabic', 'chinese', 'dutch']
for lang in langs:
    p.add(lang)

# tokenize English input
p.set_active('english')
en = p.tokenize('Rich was here before the scheduled time.')

# get ner tags for Arabic input
p.set_active('arabic')
ar = p.ner('وكان كنعان قبل ذلك رئيس جهاز الامن والاستطلاع للقوات السورية العاملة في لبنان.')

In this example, .set_active() is used to switch between languages.
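
When one pipeline object has to serve inputs in several languages, it can help to wrap the switching in a small helper; the function below is a hypothetical convenience wrapper, not part of Trankit's API, and assumes the multilingual pipeline p built in the example above.

# hypothetical helper (not part of Trankit): activate the right language,
# then run the full pipeline on the input
def process_in(p, text, lang):
    p.set_active(lang)
    return p(text)

nl = process_in(p, 'Dit is Trankit.', 'dutch')
en = process_in(p, 'Back to English again.', 'english')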

Training your own pipelines

Training customized pipelines is easy with Trankit via the class TPipeline. Below is example code for training a token and sentence splitter with Trankit.

from trankit import TPipeline

tp = TPipeline(training_config={
    'task': 'tokenize',
    'save_dir': './saved_model',
    'train_txt_fpath': './train.txt',
    'train_conllu_fpath': './train.conllu',
    'dev_txt_fpath': './dev.txt',
    'dev_conllu_fpath': './dev.conllu'
    }
)

tp.train()
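
The same TPipeline interface can be pointed at the other components of the pipeline. The sketch below trains a tagger/parser instead of a tokenizer; the task name 'posdep' and the reduced set of config fields are assumptions based on the task list above, so check the training guidelines for the exact configuration.

from trankit import TPipeline

# sketch: training a part-of-speech / dependency parsing component
# ('posdep' and the required paths are assumptions; see the training docs)
tp = TPipeline(training_config={
    'task': 'posdep',
    'save_dir': './saved_model',
    'train_conllu_fpath': './train.conllu',
    'dev_conllu_fpath': './dev.conllu'
    }
)

tp.train()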

Detailed guidelines for training customized pipelines can be found on our documentation page.

Acknowledgements

We use AdapterHub to implement our plug-and-play mechanism with Adapters. To speed up the development process, the implementations of the MWT expander and the lemmatizer are adapted from Stanza.

