Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing
Trankit is a light-weight Transformer-based Python Toolkit for multilingual Natural Language Processing (NLP). It provides a trainable pipeline for fundamental NLP tasks over 100 languages, and 90 downloadable pretrained pipelines for 56 languages. Trankit can process inputs that are either untokenized (raw) or pretokenized, at both sentence and document level. Currently, Trankit supports the following tasks:
- Sentence segmentation.
- Tokenization.
- Multi-word token expansion.
- Part-of-speech tagging.
- Morphological feature tagging.
- Dependency parsing.
- Named entity recognition.
Built on the state-of-the-art multilingual pretrained transformer XLM-Roberta, Trankit significantly outperforms prior multilingual NLP pipelines (e.g., UDPipe, Stanza) on many tasks over 90 Universal Dependencies v2.5 treebanks while remaining efficient in memory usage and speed, making it usable for general users. Below is a performance comparison between Trankit and other NLP toolkits on Arabic, Chinese, and English.
| Treebank | System | Tokens | Sents. | Words | UPOS | XPOS | UFeats | Lemmas | UAS | LAS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Overall (90 treebanks) | Trankit | 99.23 | 91.82 | 99.02 | 95.65 | 94.05 | 93.21 | 94.27 | 87.06 | 83.69 |
| | Stanza | 99.26 | 88.58 | 98.90 | 94.21 | 92.50 | 91.75 | 94.15 | 83.06 | 78.68 |
| Arabic-PADT | Trankit | 99.93 | 96.59 | 99.22 | 96.31 | 94.08 | 94.28 | 94.65 | 88.39 | 84.68 |
| | Stanza | 99.98 | 80.43 | 97.88 | 94.89 | 91.75 | 91.86 | 93.27 | 83.27 | 79.33 |
| | UDPipe | 99.98 | 82.09 | 94.58 | 90.36 | 84.00 | 84.16 | 88.46 | 72.67 | 68.14 |
| Chinese-GSD | Trankit | 97.01 | 99.70 | 97.01 | 94.21 | 94.02 | 96.59 | 97.01 | 85.19 | 82.54 |
| | Stanza | 92.83 | 98.80 | 92.83 | 89.12 | 88.93 | 92.11 | 92.83 | 72.88 | 69.82 |
| | UDPipe | 90.27 | 99.10 | 90.27 | 84.13 | 84.04 | 89.05 | 90.26 | 61.60 | 57.81 |
| English-EWT | Trankit | 98.48 | 88.35 | 98.48 | 95.95 | 95.71 | 96.26 | 96.84 | 90.14 | 87.96 |
| | Stanza | 99.01 | 81.13 | 99.01 | 95.40 | 95.12 | 96.11 | 97.21 | 86.22 | 83.59 |
| | UDPipe | 98.90 | 77.40 | 98.90 | 93.26 | 92.75 | 94.23 | 95.45 | 80.22 | 77.03 |
| | spaCy | 97.30 | 61.19 | 97.30 | 86.72 | 90.83 | - | 87.05 | - | - |
Performance comparison between Trankit and these toolkits on other languages can be found here on our documentation page.
We also created a Demo Website for Trankit, which is hosted at: http://nlp.uoregon.edu/trankit
Technical details about Trankit are presented in our following paper. Please cite the paper if you use Trankit in your software or research.
@article{unset,
  title={Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing},
  author={unset},
  journal={arXiv preprint arXiv:},
  year={2021}
}
Installation
Trankit can be easily installed via one of the following methods:
Using pip
pip install trankit
This command installs Trankit and all its dependencies automatically.
From source
git clone https://github.com/nlp-uoregon/trankit.git
cd trankit
pip install -e .
This clones our GitHub repo and installs Trankit in editable mode.
Quick Examples
Initialize a pretrained pipeline
The following code shows how to initialize a pretrained pipeline for English; it is instructed to run on a GPU, automatically download pretrained models, and store them in the specified cache directory. Trankit will not download pretrained models if they already exist.
from trankit import Pipeline
# initialize a pretrained English pipeline
p = Pipeline(lang='english', gpu=True, cache_dir='./cache')
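The download-if-missing caching behavior can be sketched as follows (a minimal illustration with a hypothetical `ensure_model` helper, not Trankit's actual internals):

```python
import os

def ensure_model(cache_dir, lang, download_fn):
    """Return the cached model path, invoking download_fn only on a cache miss."""
    path = os.path.join(cache_dir, lang + ".model")
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        download_fn(path)  # skipped entirely when the model is already cached
    return path
```

On a second call with the same `cache_dir` and language, the download step is a no-op.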
Basic functions
Trankit can process inputs that are untokenized (raw) or pretokenized strings, at both sentence and document level. A pretokenized input can be a list of strings (i.e., a tokenized sentence) or a list of lists of strings (i.e., a tokenized document with multiple tokenized sentences); both forms are automatically recognized by Trankit. If the input is a single sentence, the flag is_sent must be set to True.
from trankit import Pipeline
p = Pipeline(lang='english', gpu=True, cache_dir='./cache')
######## document-level processing ########
untokenized_doc = '''Hello! This is Trankit.'''
pretokenized_doc = [['Hello', '!'], ['This', 'is', 'Trankit', '.']]
# perform all tasks on the input
processed_doc1 = p(untokenized_doc)
processed_doc2 = p(pretokenized_doc)
######## sentence-level processing #######
untokenized_sent = '''This is Trankit.'''
pretokenized_sent = ['This', 'is', 'Trankit', '.']
# perform all tasks on the input
processed_sent1 = p(untokenized_sent, is_sent=True)
processed_sent2 = p(pretokenized_sent, is_sent=True)
# perform separate tasks on the input
sents = p.ssplit(untokenized_doc) # sentence segmentation
tokenized_doc = p.tokenize(untokenized_doc) # sentence segmentation and tokenization
tokenized_sent = p.tokenize(untokenized_sent, is_sent=True) # tokenization only
posdeps = p.posdep(untokenized_doc) # upos, xpos, ufeats, dep parsing
ners = p.ner(untokenized_doc) # ner tagging
lemmas = p.lemmatize(untokenized_doc) # lemmatization
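The automatic recognition of input shapes described above can be sketched like this (a simplified stand-in for illustration, not Trankit's actual implementation):

```python
def classify_input(inp):
    """Distinguish raw text from the two pretokenized shapes Trankit accepts."""
    if isinstance(inp, str):
        return "raw"                             # untokenized string
    if isinstance(inp, list) and inp and all(isinstance(t, str) for t in inp):
        return "pretokenized sentence"           # list of strings
    if isinstance(inp, list) and inp and all(
        isinstance(s, list) and all(isinstance(t, str) for t in s) for s in inp
    ):
        return "pretokenized document"           # list of lists of strings
    raise ValueError("unsupported input shape")
```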
Note that although pretokenized inputs can always be processed, using them for languages that require multi-word token expansion, such as Arabic or French, may produce incorrect results. Please check the Requires MWT expansion column of this table to see whether a particular language requires multi-word token expansion.
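To illustrate why this matters: multi-word token expansion splits a single surface token into several syntactic words, e.g. French "du" into "de" + "le". The toy lookup below shows the idea only; Trankit's expander is learned from treebank data, not a static table.

```python
# Toy French MWT expansion table (illustrative; Trankit learns this mapping).
MWT_TABLE = {"du": ["de", "le"], "des": ["de", "les"], "aux": ["à", "les"]}

def expand_tokens(tokens):
    """Replace each multi-word token with its component syntactic words."""
    words = []
    for tok in tokens:
        words.extend(MWT_TABLE.get(tok.lower(), [tok]))
    return words
```

A pipeline that receives pretokenized input skips this step, so downstream taggers and parsers see surface tokens like "du" instead of the syntactic words the treebanks annotate.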
For more detailed examples, please check out our documentation page.
Multilingual usage
To process inputs in different languages, we need to initialize a multilingual pipeline. The following code shows an example of initializing a multilingual pipeline for Arabic, Chinese, Dutch, and English.
from trankit import Pipeline
# initialize a multilingual pipeline
p = Pipeline(lang='english', gpu=True, cache_dir='./cache')
langs = ['arabic', 'chinese', 'dutch']
for lang in langs:
p.add(lang)
# tokenize English input
p.set_active('english')
en = p.tokenize('Rich was here before the scheduled time.')
# get ner tags for Arabic input
p.set_active('arabic')
ar = p.ner('وكان كنعان قبل ذلك رئيس جهاز الامن والاستطلاع للقوات السورية العاملة في لبنان.')
In this example, .set_active() is used to switch between languages.
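The switching pattern can be sketched as a registry of per-language components with a pointer to the active one (illustrative only; Trankit actually swaps lightweight adapters over a single shared transformer rather than keeping separate full models):

```python
class MultilingualRegistry:
    """Hold per-language components and track which language is active."""
    def __init__(self, lang):
        self.components = {lang: f"{lang}-components"}  # stand-in for real models
        self.active = lang

    def add(self, lang):
        self.components[lang] = f"{lang}-components"

    def set_active(self, lang):
        if lang not in self.components:
            raise ValueError(f"{lang} was never added to this pipeline")
        self.active = lang
```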
Training your own pipelines
Training customized pipelines is easy with Trankit via the class TPipeline. Below is sample code for training a token and sentence splitter with Trankit.
from trankit import TPipeline
tp = TPipeline(training_config={
    'task': 'tokenize',
    'save_dir': './saved_model',
    'train_txt_fpath': './train.txt',
    'train_conllu_fpath': './train.conllu',
    'dev_txt_fpath': './dev.txt',
    'dev_conllu_fpath': './dev.conllu'
})
tp.train()
Detailed guidelines for training customized pipelines can be found here.
Acknowledgements
We use the AdapterHub to implement our plug-and-play mechanism with Adapters. To speed up the development process, the implementations for the MWT expander and the lemmatizer are adapted from Stanza.