c++ mosestokenizer
opus-fast-mosestokenizer is a fork of fast-mosestokenizer created to ensure compatibility of the package with current Python environments.
fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.
The main reason to use this package over the original Perl implementation is portability. Because the core is written in C++, the library can be used from practically any language that can bind to C++.
The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.
Benchmark
fast-mosestokenizer is also fast.
On English text, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.
See ./bench/README.md for more information.
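To check the speed on your own data, a small timing helper is enough. The sketch below is not the project's benchmark script (see ./bench/README.md for that); `lines_per_second` is a hypothetical helper, and `str.split` is used only as a stand-in so the example runs without any package installed.

```python
import time


def lines_per_second(tokenize, lines, repeats=3):
    """Return the best observed throughput of `tokenize` over `lines`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse timers.
    return len(lines) / best if best > 0 else float("inf")


sample = ["The quick brown fox jumps over the lazy dog."] * 1000

# Stand-in tokenizer (plain whitespace split), NOT the Moses tokenizer.
baseline = lines_per_second(str.split, sample)
print(f"str.split: {baseline:,.0f} lines/s")

# With the package installed, the same helper could time the real tokenizer:
#   from mosestokenizer import MosesTokenizer
#   lines_per_second(MosesTokenizer('en').tokenize, sample)
```

Taking the best of several repeats reduces noise from other processes. Absolute numbers depend on the machine and the input text.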
Installation
Python users on Linux and macOS (>= 10.15) can install directly from PyPI.
pip install fast-mosestokenizer
See ./INSTALL.md for more information.
Usage (Command-line tool)
# Input and output are configured through standard stream redirection.
# mosestokenizer tokenizes each line of the input stream.
mosestokenizer en < infile > outfile
# For a full list of options, refer to the help message.
mosestokenizer -h
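The same redirection can be driven from a Python script. The following is a minimal sketch, assuming the `mosestokenizer` executable is on your PATH. `run_line_filter` is a hypothetical helper, and the demo uses a stand-in filter (a Python one-liner that upper-cases its input) so that it runs without the tool installed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_line_filter(command, infile, outfile):
    """Run `command` with `infile` on stdin and `outfile` on stdout."""
    with open(infile, "rb") as src, open(outfile, "wb") as dst:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(list(command), stdin=src, stdout=dst, check=True)


# Stand-in filter for the demo only; it is NOT a tokenizer.
stand_in = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

with tempfile.TemporaryDirectory() as tmp:
    infile = Path(tmp) / "infile"
    outfile = Path(tmp) / "outfile"
    infile.write_text("lion city\n", encoding="utf-8")
    run_line_filter(stand_in, infile, outfile)
    demo_output = outfile.read_text(encoding="utf-8")

print(demo_output.strip())

# Equivalent of `mosestokenizer en < infile > outfile`:
#   run_line_filter(["mosestokenizer", "en"], "infile", "outfile")
```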
Usage (Python)
# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer
>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
'"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
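To process a whole file from Python in the way the command-line tool does (one tokenized line out per line in), the tokenizer can be wrapped in a small generator. This is a sketch under assumptions: `tokenize_lines` and `naive_tokenize` are hypothetical names, and `naive_tokenize` is a crude regex stand-in used only so the example runs without the package. It does not reproduce Moses tokenization rules.

```python
import re


def naive_tokenize(text):
    """Stand-in tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_lines(lines, tokenize=naive_tokenize):
    """Yield each input line as space-joined tokens."""
    for line in lines:
        yield " ".join(tokenize(line.rstrip("\n")))


result = list(tokenize_lines(["Hello, world!\n", "lion city (Singapura)\n"]))
print(result)
# ['Hello , world !', 'lion city ( Singapura )']

# With the package installed, pass the real tokenizer instead:
#   from mosestokenizer import MosesTokenizer
#   tokenize_lines(open("infile", encoding="utf-8"), MosesTokenizer('en').tokenize)
```

Passing the tokenizer as a callable keeps the helper independent of which implementation is used, so the same loop works with sacremoses or this package.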
Download files
Hashes for opus-fast-mosestokenizer-0.0.8.2.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 5c63ff5e83c126f881c746058506929fe618ba5546d1f50f690c233d7c5bd4ab
MD5 | 548903f186e4bdd2ec5695f1d3a37ea9
BLAKE2b-256 | a75293ae6b3fa18f8d0e1b36bffaef0bee20b8ff3dc6042e00805de8fd1c49d5
Hashes for opus_fast_mosestokenizer-0.0.8.2-cp310-cp310-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 6805309d461a2da8bdae870ea4b065589426823c3d3cf8c5e6d9a17f742cee35
MD5 | 11a9f791a4bf824de9ced5ec1f54f442
BLAKE2b-256 | dd568326a08c9186ba51a5d469f7fdde9d31c39cfacd256b3e4b079b15a5a6ec
Hashes for opus_fast_mosestokenizer-0.0.8.2-cp310-cp310-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 49741eee7929c5021deebedd44e86ef1a598762bec9527e6b562f7f5e372c556
MD5 | 4b82809ec8516c3d4e8fd04fb9b43d3c
BLAKE2b-256 | 3265ab28f7b8bfad572cea26933829e75ed5c4c32c1f5e1669ed414a63defd8f
Hashes for opus_fast_mosestokenizer-0.0.8.2-cp39-cp39-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 27b4cd24127e6b96782f8818ce8df39255473e69a65916ee1075c0b2dcfacf17
MD5 | 202735927326db6edae1a16a6bdcd53e
BLAKE2b-256 | f877caca2118d277ecdce334046c2a436c590714e9cf6c8993504b4fd48489a3
Hashes for opus_fast_mosestokenizer-0.0.8.2-cp39-cp39-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 8c4ed9bfdb7e27e55acd553044d2d78d0163141fa965e6b42d84025692cd7da6
MD5 | c475cb944c946883cd0ab984a1f0cb74
BLAKE2b-256 | a339898c559cab9b198ea405f777e26f68db0838c616a85437862b41683d1b18
Hashes for opus_fast_mosestokenizer-0.0.8.2-cp38-cp38-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 4d754c8b88a82eba4099fa84d37835ca2165f5d7d4db02739c8e55df6d80f602
MD5 | 1b0b783d2db650053be32b115e4c3f63
BLAKE2b-256 | f8f1d30a324fe1d02c054afd7647331d0c97479858328cdfa727cb6bd8ee3167
Hashes for opus_fast_mosestokenizer-0.0.8.2-cp38-cp38-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 1d3c73063741f25c4063b78a76171537d2f9f5f4f4bf6a79d91cb4be165f37f0
MD5 | 247a43d226d2c7cd3f2b8cb9b6a497b3
BLAKE2b-256 | 3c17ed950b92a0fe607f82e640aeb79651072a619a0aa708b38aa8f0c171fce9
Hashes for opus_fast_mosestokenizer-0.0.8.2-cp37-cp37m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 924485c0014849c8841b650eb96841466ce5081d475c75b65926e320c3474273
MD5 | 337bd0c37b8e21e2c9ca85b8f5945663
BLAKE2b-256 | 564ceec3b7088c0b0a1b9759fa3ffb464ea728aa096b9421016a43e17704e2a0
Hashes for opus_fast_mosestokenizer-0.0.8.2-cp37-cp37m-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | c309c1a508ad3fb00e23152c1b2b7cfab4c5ae8df4c61a74ef26ad40f5441c7a
MD5 | 17523b82b1fa4cf3b5f0600a735b03e9
BLAKE2b-256 | d30848c10bfcc20c6b9f2bd3d972fd00ea575c40f3f5e258d5b0b24db18b3425
Hashes for opus_fast_mosestokenizer-0.0.8.2-cp36-cp36m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 7eb78db84a406b8d9070ed7ccb8abc2ee260290378945c1d2c8da0ecf9862a57
MD5 | b8ff380e2b60eb1cce5f0d3a397fcd60
BLAKE2b-256 | 5be463dd302fa8c12d94ca4cce063049ff1c5e4c28a4f2489ca5ea881ce105c3
Hashes for opus_fast_mosestokenizer-0.0.8.2-cp36-cp36m-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 19fca77e9457c34d5410ee805eb94ee9b05c2f3cc661ec0af50d7a004380492e
MD5 | 6cdb38fc988d6b014412158f66891117
BLAKE2b-256 | 0db08b6a01da7e0210dec417a12b83c5f43a48c5027546decfb3ecd809232898