c++ mosestokenizer
Project description
fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.
The main reason to use this package over the original Perl implementation is portability. Because the tokenizer is written in C++, the library can be used from virtually any language.
The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.
Benchmark
fast-mosestokenizer is also fast.
On English, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.
See ./bench/README.md for more information.
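To get a rough throughput figure on your own data, a timing harness along the following lines can help. This is a minimal sketch and not the project's benchmark script (that lives under ./bench): `lines_per_second` is a hypothetical helper, and `str.split` stands in for a real tokenizer so that the block runs without extra dependencies.

```python
import time


def lines_per_second(tokenize, lines, repeat=3):
    """Return the best observed throughput (lines/second) of `tokenize` over `lines`."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero timer reading on very small inputs.
    return len(lines) / max(best, 1e-9)


if __name__ == "__main__":
    sample = ["The quick brown fox jumps over the lazy dog."] * 10_000
    # str.split is only a placeholder; pass a real tokenizer's
    # tokenize method here to time it instead.
    print(f"{lines_per_second(str.split, sample):,.0f} lines/s")
```

Taking the best of several runs reduces noise from other processes; absolute numbers will still vary by machine and input.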
Installation
Python users on Linux and macOS >= 10.15 can install directly from PyPI.
pip install fast-mosestokenizer
See ./INSTALL.md for more information.
Usage (Command-line tool)
# Piping is the standard way to configure the input and output streams.
# mosestokenizer applies tokenization to each line of the input stream.
mosestokenizer en < infile > outfile
# For a full list of options, refer to the help message.
mosestokenizer -h
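Because the tool reads from stdin and writes to stdout, it can also be driven from another program. The sketch below does this from Python with `subprocess`, using only the `mosestokenizer <lang>` invocation shown above. `tokenize_with_cli` is a hypothetical helper name, and the code assumes the `mosestokenizer` binary is on PATH (it returns None otherwise).

```python
import shutil
import subprocess


def tokenize_with_cli(text, lang="en", binary="mosestokenizer"):
    """Pipe `text` through `<binary> <lang>` and return its stdout.

    Returns None when the binary cannot be found on PATH.
    """
    exe = shutil.which(binary)
    if exe is None:
        return None
    result = subprocess.run(
        [exe, lang],
        input=text,            # sent to the tool's stdin, like `< infile`
        capture_output=True,   # collect stdout, like `> outfile`
        encoding="utf-8",
        check=True,            # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


if __name__ == "__main__":
    out = tokenize_with_cli("Hello, world!\n")
    print(out if out is not None else "mosestokenizer not found on PATH")
```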
Usage (Python)
# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer
>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
'"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
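For corpus-style input, the same `tokenize` call can be applied line by line. The following is a minimal sketch under the assumption that only the `MosesTokenizer('en')` constructor and `tokenize` method shown above are available; `tokenize_lines` is a hypothetical helper, and any object with a compatible `tokenize` method will work.

```python
def tokenize_lines(tokenizer, lines):
    """Yield each non-empty line as a single space-joined string of tokens."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield " ".join(tokenizer.tokenize(line))


if __name__ == "__main__":
    try:
        from mosestokenizer import MosesTokenizer  # provided by fast-mosestokenizer
    except ImportError:
        print("fast-mosestokenizer is not installed; nothing to demonstrate")
    else:
        tokenizer = MosesTokenizer("en")
        for out in tokenize_lines(tokenizer, ["Hello, world!", "", "It works."]):
            print(out)
```

Writing the helper as a generator keeps memory use flat, so a file object can be passed directly as `lines`.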
Built Distributions
Hashes for fast_mosestokenizer-0.0.2-cp38-cp38-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 6d9b3d814f1aec5e4d4d4a2e3f7ddcebaefd11a3c11456b2abc6c02f5eb21af0
MD5 | ee6433685f5a9957ca785b24f0e76923
BLAKE2b-256 | fbb47e0380d050b410c45083602e9394480e581c7ae11a5ccbe24bbba03a396d
Hashes for fast_mosestokenizer-0.0.2-cp38-cp38-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 188d7023c3cf89928daa8b25b5ba31bcd42382c8e9b85551332ca1af06956d9d
MD5 | 29d58e4cbb4c783429b0de96d731c229
BLAKE2b-256 | 96f635f5361d87ed29c71761a5f5bc07bfc9bffa0e442bedcb9fa6335ea229e2
Hashes for fast_mosestokenizer-0.0.2-cp37-cp37m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | f22a29f84847dc4a8b343bbc187b7c1a621f9593487501269e84b7dacc0d6c08
MD5 | 1b1aa834254db9119ffa1853d8fa655c
BLAKE2b-256 | ca43ae53bbaab902ce91b094ab3b6a0ae0b31810cd87551b21d1c0f33ade5b5a
Hashes for fast_mosestokenizer-0.0.2-cp37-cp37m-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 4ff58388c9f62794b3a05faa0e5bd5115978e60d41d5c38ff9f89bd4cd0cd345
MD5 | 46f262aa3d3e74f0ea8c611e7f891b5e
BLAKE2b-256 | fa1cabe6e492ac9abdaa66b245a01edbbe11d25965495afee3bd7add67d6940c
Hashes for fast_mosestokenizer-0.0.2-cp36-cp36m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | f962733f76f06ef6199fe852d69d67564fe3d23cf79b718e82899ba5a9cefda9
MD5 | 32cb9214de65afa84811edeeb69410bf
BLAKE2b-256 | 7a68db7011219aaf99ca10e88e950ef0c95ed5a98ac20b9020a807439b754766
Hashes for fast_mosestokenizer-0.0.2-cp36-cp36m-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | f2fde82b3cfcd1858cfaa62679a73b620fb4931c4db90103805ca3870be263ed
MD5 | c3e3a4517aa654aac0d563325f2a6659
BLAKE2b-256 | 49248b0d93d227c6bcdfcf9382d59c35eb14a3ac5435501be14dd57779ca5e1f
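
The digests listed above can be checked against a downloaded wheel before installing it. The following is a minimal sketch using only the standard library; `sha256_of` and `matches_sha256` are hypothetical helper names, and the wheel path is whatever file you downloaded.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

To use it, call `matches_sha256` with the path of the downloaded wheel and the SHA256 value from the matching table; a result of `False` means the file differs from the published one.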