c++ mosestokenizer
fast-mosestokenizer is a C++ implementation of the Moses tokenizer, a favourite among NLP researchers.
The main reason to use this package over the original Perl implementation is portability. Because the source is C++, the library can be used from almost any language.
The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.
Benchmark
fast-mosestokenizer is also fast.
On English, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.
See ./bench/README.md for more information.
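To check the speed-up on your own data, a small timing harness is enough. The sketch below is not the project's benchmark (that lives in ./bench/README.md). The helper name `lines_per_second` is hypothetical, and `str.split` is only a placeholder for the tokenizer under test.

```python
import time
from typing import Callable, List, Sequence


def lines_per_second(tokenize: Callable[[str], List[str]],
                     lines: Sequence[str],
                     repeats: int = 3) -> float:
    """Return the best observed throughput (lines/s) over `repeats` passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero-length measurement on very small inputs.
    return len(lines) / best if best > 0 else float("inf")


if __name__ == "__main__":
    sample = ["The quick brown fox jumps over the lazy dog."] * 10_000
    # Placeholder callable: swap in MosesTokenizer('en').tokenize (or another
    # tokenizer's equivalent method) to compare them on the same input.
    print(f"{lines_per_second(str.split, sample):,.0f} lines/s")
```

Passing the same `lines` to each tokenizer and comparing the returned rates gives a rough relative figure. Absolute numbers will depend on the machine and the text.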
Installation
Python users on Linux and macOS (>= 10.15) can install directly from PyPI.
pip install fast-mosestokenizer
See ./INSTALL.md for more information.
Usage (Command-line tool)
# Piping is the standard way to configure the input and output streams.
# mosestokenizer applies tokenization to each line of the input stream.
mosestokenizer en < infile > outfile
# For a full list of options, refer to the help message.
mosestokenizer -h
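The same pipe-based invocation can be driven from a script. The following is a minimal sketch, assuming the `mosestokenizer` executable is on PATH. The helper names `build_command` and `tokenize_text` are hypothetical, and only the `mosestokenizer <lang>` form shown above is used.

```python
import shutil
import subprocess
from typing import List, Optional


def build_command(lang: str) -> List[str]:
    """Return the argv for `mosestokenizer <lang>`, as in the example above."""
    return ["mosestokenizer", lang]


def tokenize_text(text: str, lang: str = "en") -> Optional[str]:
    """Pipe `text` through the CLI via stdin and return its stdout.

    Returns None when the executable cannot be found on PATH.
    """
    if shutil.which("mosestokenizer") is None:
        return None
    result = subprocess.run(
        build_command(lang),
        input=text,
        capture_output=True,
        encoding="utf-8",  # send and read text as UTF-8
        check=True,        # raise if the tool exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    print(tokenize_text("Hello, world!\n"))
```

Sending the text over stdin and reading stdout mirrors the `< infile > outfile` redirection, so no temporary files are needed.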
Usage (Python)
# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer
>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
'"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
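To process a file line by line from Python, as the command-line tool does, the `tokenize` call shown above can be wrapped in a small generator. This is a sketch under stated assumptions: `tokenize_lines` and `WhitespaceTokenizer` are hypothetical names, and the stand-in class exists only so the example runs without the package installed.

```python
from typing import Iterable, Iterator, List


def tokenize_lines(tokenizer, lines: Iterable[str]) -> Iterator[str]:
    """Yield one space-joined token string per input line.

    `tokenizer` is any object with a `tokenize(str) -> list[str]` method,
    such as a MosesTokenizer('en') instance. Blank lines are passed
    through as empty strings so line counts stay aligned.
    """
    for line in lines:
        line = line.strip()
        if not line:
            yield ""
            continue
        yield " ".join(tokenizer.tokenize(line))


class WhitespaceTokenizer:
    """Stand-in used only to keep this sketch runnable without the package."""

    def tokenize(self, text: str) -> List[str]:
        return text.split()


if __name__ == "__main__":
    try:
        from mosestokenizer import MosesTokenizer
        tok = MosesTokenizer("en")
    except ImportError:
        tok = WhitespaceTokenizer()
    for out in tokenize_lines(tok, ["Hello, world!", "", "lion city"]):
        print(out)
```

Creating the tokenizer once and reusing it for every line avoids repeating any setup cost per call.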