c++ mosestokenizer
Project description
opus-fast-mosestokenizer is a fork of fast-mosestokenizer created to ensure compatibility of the package with current Python environments.
fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.
The main reason to use this package over the original Perl implementation is portability: because the tokenizer is written in C++, the library can be used from almost any language.
The C++ script was adapted from the mosesdecoder repository (contrib/c++tokenizer).
Benchmark
fast-mosestokenizer is also fast.
On English, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.
See ./bench/README.md for more information.
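As a rough way to check such speed ratios on your own data, the sketch below times any tokenizer callable over a list of lines. It is not the project's benchmark (that lives in ./bench/README.md); `best_time` and `collect_tokenizers` are hypothetical helper names, and `str.split` is included only as a baseline so the script runs even when neither tokenizer package is installed.

```python
import time
from typing import Callable, Dict, List


def best_time(tokenize: Callable[[str], List[str]], lines: List[str], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` full passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    return best


def collect_tokenizers() -> Dict[str, Callable[[str], List[str]]]:
    """Gather whichever tokenizers are importable; str.split is only a baseline."""
    candidates: Dict[str, Callable[[str], List[str]]] = {"str.split (baseline)": str.split}
    try:
        # Import path and constructor as shown in this README's Python usage.
        from mosestokenizer import MosesTokenizer
        candidates["fast-mosestokenizer"] = MosesTokenizer("en").tokenize
    except ImportError:
        pass
    try:
        from sacremoses import MosesTokenizer as SacreMosesTokenizer
        candidates["sacremoses"] = SacreMosesTokenizer(lang="en").tokenize
    except ImportError:
        pass
    return candidates


if __name__ == "__main__":
    # Synthetic sample; real measurements should use a representative corpus.
    sample = ['Hello, world! This is a "test" sentence (with punctuation).'] * 10_000
    for name, fn in collect_tokenizers().items():
        seconds = best_time(fn, sample)
        print(f"{name}: {seconds:.3f} s for {len(sample)} lines")
```

Results depend heavily on the corpus, the hardware, and start-up costs, so treat any numbers from this sketch as indicative only.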
Installation
Python users on Linux and macOS (>= 10.15) can install directly from PyPI.
pip install fast-mosestokenizer
See ./INSTALL.md for more information.
Usage (Command-line tool)
# Input and output are configured by piping the standard streams.
# mosestokenizer tokenizes each line of the input stream.
mosestokenizer en < infile > outfile
# For a full list of options, refer to the help message.
mosestokenizer -h
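Because the command-line tool reads lines from standard input and writes to standard output, it can also be driven from a script. The sketch below pipes lines through any line-oriented filter with `subprocess.run`; `pipe_lines` is a hypothetical helper, and the only invocation assumed for the tool is the `mosestokenizer en` form shown above.

```python
import shutil
import subprocess
import sys
from typing import List, Sequence


def pipe_lines(command: Sequence[str], lines: List[str]) -> List[str]:
    """Send `lines` to a line-oriented filter on stdin and return its stdout lines."""
    payload = "".join(line.rstrip("\n") + "\n" for line in lines)
    result = subprocess.run(
        list(command),
        input=payload,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    if shutil.which("mosestokenizer") is None:
        print("mosestokenizer not found on PATH", file=sys.stderr)
    else:
        # Same invocation as `mosestokenizer en < infile > outfile`.
        for out in pipe_lines(["mosestokenizer", "en"], ["Hello, world!", 'A "quoted" phrase.']):
            print(out)
```

Starting one process per batch of lines, rather than one per line, keeps the start-up cost from dominating.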
Usage (Python)
# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer
>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
'"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
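To process many lines from Python in the same line-by-line fashion as the command-line tool, a small wrapper around `tokenize` is enough. This is a minimal sketch: `tokenize_lines` is a hypothetical helper, joining tokens with single spaces is an assumption about the desired output format, and the `str.split` fallback is only a stand-in so the example runs without the package installed.

```python
from typing import Callable, Iterable, Iterator, List


def tokenize_lines(tokenize: Callable[[str], List[str]], lines: Iterable[str]) -> Iterator[str]:
    """Yield one space-joined token string per input line."""
    for line in lines:
        yield " ".join(tokenize(line.rstrip("\n")))


if __name__ == "__main__":
    try:
        # Import and constructor as shown in the usage above.
        from mosestokenizer import MosesTokenizer
        tokenize = MosesTokenizer("en").tokenize
    except ImportError:
        # Stand-in only; whitespace splitting is not Moses tokenization.
        tokenize = str.split
    for out in tokenize_lines(tokenize, ["Hello, world!\n", "Lion city (Singapura).\n"]):
        print(out)
```

Because the helper is a generator, it can be fed an open file object directly and never holds the whole corpus in memory.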
Download files
Hashes for opus-fast-mosestokenizer-0.0.8.4.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 1234a2392e634b7f20fc143c98c85a6912cc2fd0013bb2395d50fd85ec084dfd
MD5 | be2115efb0fafffd06db5b05fcc3343e
BLAKE2b-256 | 5145bc1fbb0a65350526b601822bcd52b5a81f9ab0d806c59f737b9665789f01
Hashes for opus_fast_mosestokenizer-0.0.8.4-cp310-cp310-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | eb5e384a14169a38f1bec43050870477a06f888e4facefe6f0d1d30f541d8716
MD5 | 41d615da7b3c16a791a975de5afacb8c
BLAKE2b-256 | e2d8a1b50854298fbc93d93900d77e4a9864585d66e10c8210175691d8fb009b
Hashes for opus_fast_mosestokenizer-0.0.8.4-cp310-cp310-macosx_12_0_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 434abc8a393f1ba933173d9b975586f77839466cb2d7fe83cc4f0d76068c53e1
MD5 | acd3e25e2665031af74e1bd7a2aa5da5
BLAKE2b-256 | 773a9e91dbb4190b4f3a69b5b7de8b9bac4f2539cdd90da0ff784871e4faa005
Hashes for opus_fast_mosestokenizer-0.0.8.4-cp39-cp39-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 59a31a99bc8bd0956cb49835ee6a652cf74b2696e275039b28c68289ae7f9134
MD5 | 86eea7383a5eac6e73c4634f284be8d9
BLAKE2b-256 | d3799de0916afb3fc60c091201f71af45c20056dbd620d9612131fb3a6542bd5
Hashes for opus_fast_mosestokenizer-0.0.8.4-cp39-cp39-macosx_12_0_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | ab8aa62eda05a1ddd6abde21956832641f10c53bf5af74b94ce55898b8976a7e
MD5 | 5388551633fe83c0989dd4f67774c35a
BLAKE2b-256 | 5d4cd07917cffa93bad4d9edcbcf41c8a8e0a6e93fb0005548df3ca758010506
Hashes for opus_fast_mosestokenizer-0.0.8.4-cp38-cp38-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | a3f38f7e0ea12ef9f6e47f4a749c8935a8cb6f421f09f18a9a6a11b1ad10dd6b
MD5 | 03c3a89f6d3622779e643214f2b615fc
BLAKE2b-256 | 2914a0bf74b674d9714aeb0b1648a379dd702c898234f2295f5c4bbb7045f292
Hashes for opus_fast_mosestokenizer-0.0.8.4-cp38-cp38-macosx_12_0_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | cbe2aa5cecb15e7cfe21d0fd08273a797a16908214fab17f2f247b4cf29c94bb
MD5 | eb1512c34eeb2353d4c73d5b7c97f495
BLAKE2b-256 | a73f0b91a3696d1082657f4584f3aba7eff68e8306a6e3b0091f4d476ab6b7bd
Hashes for opus_fast_mosestokenizer-0.0.8.4-cp37-cp37m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | bb2a8b93577a9a6dd4f9a10694ba84194fca3882f549a28fc9df59b4c0e0c8a1
MD5 | 6588e43099f0c5ac2000c74668353ec0
BLAKE2b-256 | 1b91421de8bb3745c15f1021e7df1c909976047727351954c275c4eb4a6d1ae2
Hashes for opus_fast_mosestokenizer-0.0.8.4-cp37-cp37m-macosx_12_0_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 3d7aff3e1cf0d1b8f8bbebd841d93be2eccf9ae52361531ac351095be2c10051
MD5 | 60d7319ee9edf4cbdf8d916d9d669b30
BLAKE2b-256 | 09de915b83c9560de472ccf163d941c6727ee0332c21895bbe2f268113b47c9c
Hashes for opus_fast_mosestokenizer-0.0.8.4-cp36-cp36m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | af32f6a888130ab661a6a83d861ac2652237964954b1d6827136301f2fe84eee
MD5 | 26eacb6cc06a91196880889e58a47cd7
BLAKE2b-256 | c5b6fe89065b3dbdc92a546700793927ce7081f76a29ed592021e59a7dcdb714
Hashes for opus_fast_mosestokenizer-0.0.8.4-cp36-cp36m-macosx_12_0_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | e9e62ab6698971a16c87a563c9f8b7e70240f2a5f00a7293eab909501f169403
MD5 | e8f517ec958684e92b2deee89f0d833a
BLAKE2b-256 | 5c9c0a36dec6a1e3d4aa615b6774bb42dcb69df9f433f21b0e37404e2fa1e5f0