c++ mosestokenizer
Project description
fast-mosestokenizer is a C++ implementation of the Moses tokenizer, a favourite among NLP researchers.
The main reason to use this package over the original Perl implementation is portability: with the C++ source code, the library can be used from practically any language.
The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.
Benchmark
fast-mosestokenizer is also fast.
On English, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.
See ./bench/README.md for more information.
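To check a speedup figure like this on your own data, a small timing harness is enough. The sketch below is not the benchmark from ./bench; `measure_throughput` and `speedup` are hypothetical helper names, and `str.split` only stands in for a real tokenizer callable.

```python
import time


def measure_throughput(tokenize, lines, repeats=3):
    """Return the best lines-per-second rate observed over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very small inputs.
    return len(lines) / best if best > 0 else float("inf")


def speedup(rate_a, rate_b):
    """How many times faster rate_a is than rate_b."""
    return rate_a / rate_b


if __name__ == "__main__":
    sample = ["The quick brown fox jumps over the lazy dog."] * 10000
    # str.split is a placeholder; pass any callable that tokenizes one line.
    baseline = measure_throughput(str.split, sample)
    print(f"{baseline:.0f} lines/s")
```

Passing each tokenizer's per-line function to `measure_throughput` and dividing the two rates with `speedup` gives a figure comparable in spirit to the ones quoted above, though the exact numbers will depend on hardware and input.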
Installation
Python users on Linux and macOS >= 10.15 can install directly from PyPI.
pip install fast-mosestokenizer
See ./INSTALL.md for more information.
Usage (Command-line tool)
# Piping is the standard way to configure the input and output streams.
# mosestokenizer applies tokenization to each line of the input stream.
mosestokenizer en < infile > outfile
# For a full list of options, refer to the help message.
mosestokenizer -h
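Because the command-line tool reads standard input and writes standard output, it can also be driven from another program. A minimal sketch in Python, assuming the `mosestokenizer` executable is on `PATH`; `run_line_filter` is a hypothetical helper name, and the only invocation used is the `mosestokenizer en` form shown above.

```python
import shutil
import subprocess


def run_line_filter(command, text):
    """Send `text` to `command` on stdin and return what it writes to stdout."""
    result = subprocess.run(
        command,
        input=text,
        capture_output=True,
        encoding="utf-8",  # keep non-ASCII input such as 'Siṃhapura' intact
        check=True,        # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


if __name__ == "__main__":
    # Only call the tool if it is actually installed.
    if shutil.which("mosestokenizer"):
        print(run_line_filter(["mosestokenizer", "en"], "Hello, world!\n"))
```

The same pattern works from any language that can spawn a process, which is one practical way to use the tokenizer outside Python.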
Usage (Python)
# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer
>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
'"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
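For a whole file or stream, the same `tokenize` call can be applied line by line, mirroring what the command-line tool does. The sketch below uses only the `MosesTokenizer('en')` constructor and `tokenize` method shown above; `tokenize_lines` and `to_text` are hypothetical helpers, and skipping blank lines is a choice made for this example.

```python
def tokenize_lines(tokenizer, lines):
    """Yield one token list per non-empty input line.

    `tokenizer` is any object with a `tokenize(str) -> list[str]` method.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield tokenizer.tokenize(line)


def to_text(tokens):
    """Join tokens with single spaces, one tokenized line per string."""
    return " ".join(tokens)


if __name__ == "__main__":
    try:
        from mosestokenizer import MosesTokenizer
    except ImportError:
        MosesTokenizer = None  # package not installed; skip the demo

    if MosesTokenizer is not None:
        tok = MosesTokenizer("en")
        for tokens in tokenize_lines(tok, ["Hello, world!", "", "It works."]):
            print(to_text(tokens))
```

Because `tokenize_lines` is a generator, it can be fed an open file object directly and will not load the whole file into memory.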