c++ mosestokenizer
Project description
fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.
The main reason to use this package over the original Perl implementation is portability. With the C++ source code, the library can be used from practically any language.
The C++ script was adapted from the mosesdecoder repository (contrib/c++tokenizer).
Benchmark
fast-mosestokenizer is also fast.
On English, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.
See ./bench/README.md for more information.
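To check a speed-up figure like the ones above on your own data, a small throughput harness is enough. The sketch below is not the project's benchmark script (that lives under ./bench); the helper name `lines_per_second` and its parameters are assumptions made for illustration. It accepts any callable that maps a string to a list of tokens, so the same function can time fast-mosestokenizer, sacremoses, or a baseline.

```python
import time
from typing import Callable, List, Sequence


def lines_per_second(
    tokenize: Callable[[str], List[str]],
    lines: Sequence[str],
    repeats: int = 3,
) -> float:
    """Return the best observed throughput (lines/second) over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        # Keep the fastest run to reduce noise from other processes.
        best = min(best, time.perf_counter() - start)
    return len(lines) / best if best > 0 else float("inf")
```

Dividing the result for one tokenizer by the result for another on the same `lines` gives a relative speed-up for that corpus and machine.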
Installation
Python users on Linux and macOS >= 10.15 can install directly from PyPI.
pip install fast-mosestokenizer
See ./INSTALL.md for more information.
Usage (Command-line tool)
# Piping is the standard way to configure the input and output streams.
# mosestokenizer applies tokenization to each line of the input stream.
mosestokenizer en < infile > outfile
# For a full list of options, refer to the help message.
mosestokenizer -h
Usage (Python)
# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer
>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
'"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
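To reproduce the command-line tool's line-by-line behaviour from Python, something like the following may be enough. It is a minimal sketch that relies only on the `MosesTokenizer('en')` constructor and the `tokenize` method shown above. The helper names `tokenize_lines` and `default_tokenize`, and the whitespace fallback used when the package is not installed, are assumptions for illustration and not part of the library.

```python
from typing import Callable, Iterable, Iterator, List


def tokenize_lines(
    lines: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> Iterator[str]:
    """Yield one space-joined, tokenized line per input line."""
    for line in lines:
        # Drop the trailing newline so it is not passed to the tokenizer.
        yield " ".join(tokenize(line.rstrip("\n")))


def default_tokenize() -> Callable[[str], List[str]]:
    """Return MosesTokenizer('en').tokenize, or str.split if unavailable."""
    try:
        from mosestokenizer import MosesTokenizer
    except ImportError:
        # Assumption: whitespace splitting stands in for the real tokenizer
        # so the sketch still runs without the package installed.
        return str.split
    return MosesTokenizer("en").tokenize
```

A typical call would be `for out in tokenize_lines(open("infile"), default_tokenize()): print(out)`. Passing the tokenizer in as a callable keeps the loop independent of which tokenizer is actually installed.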
Download files
Built Distributions
Hashes for fast_mosestokenizer-0.0.7-cp38-cp38-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 92d76e3c3d2689e164537b363cf4920cf41973731eded9409bf16717674a8f13
MD5 | b7cf9233ef2250678f1030910ae85c0d
BLAKE2b-256 | 6c20520b3c6fb67f4166465c67bcf9817a552fec4bcbfcf958a663b180d06b8a
Hashes for fast_mosestokenizer-0.0.7-cp38-cp38-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | ea1960b6dccd158562a064865a980c93a40c1268e51260af37b6598ddb8c47b6
MD5 | 54493d1341741628ca179ba90aec5990
BLAKE2b-256 | 8e9d734daea9abd147fe8aed827e4af44102806c8d19a09704ba7e59b51e96fc
Hashes for fast_mosestokenizer-0.0.7-cp37-cp37m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 3d3d39f82db9de3ff7b1073872352b412d6cdaffd342a3be94c72d9ab9b557c1
MD5 | b672ac038dbbd1db9d8aa90660c466f4
BLAKE2b-256 | 8fbab48cf4048a3c17acca81c165c6e43303d68a765ce7b2a85e910c1208fb4b
Hashes for fast_mosestokenizer-0.0.7-cp37-cp37m-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 89641b22bf1fcd65cdcee1ebcb4dd5c3f7fa92303cae2178c9ecc624fbc429e1
MD5 | 886057533c335da43129dd402e86d271
BLAKE2b-256 | 067d18ec64eb70c4a04ea5dd7327c2ac679f3887d4e78638faee4664fa06c607
Hashes for fast_mosestokenizer-0.0.7-cp36-cp36m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 2c77e16dabd99b598c4e5fce2196d2db79b6e1f3e2e4ce34d4c01f1f985d3d4d
MD5 | ed7f812ca4b78be847df152ac7aaf12a
BLAKE2b-256 | 6aef75475b25c97b475943fc4fd5a4d1f6cd7e221057d9ad6aebf49858e6718f
Hashes for fast_mosestokenizer-0.0.7-cp36-cp36m-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | affe755e69ee5580259bf4a0b00fb3b7d06b03ee78e601c9fe0905e05b4785f2
MD5 | 4e7b8be0770fc25f63e1f9717cec41d0
BLAKE2b-256 | 430e4decb596c5fd70b3b1c7ce36e95837cb9be96fbaa6786228515325a737fb
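The digests listed above can be compared against a downloaded wheel using Python's standard `hashlib` module. The sketch below is one way to do that; the helper name `file_digests` is an assumption for illustration. It treats BLAKE2b-256 as BLAKE2b with a 32-byte digest.

```python
import hashlib
from typing import Dict


def file_digests(path: str, chunk_size: int = 1 << 20) -> Dict[str, str]:
    """Return hex digests of the file at `path`, keyed by algorithm name."""
    hashers = {
        "SHA256": hashlib.sha256(),
        # BLAKE2b-256 is BLAKE2b truncated to a 32-byte (256-bit) digest.
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as fh:
        # Read in chunks so large wheels are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

For example, `file_digests("fast_mosestokenizer-0.0.7-cp38-cp38-manylinux1_x86_64.whl")["SHA256"]` should equal the SHA256 value in the table for that file if the download is intact.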