c++ mosestokenizer
fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.
The main reason to use this package over the original Perl implementation is portability. With the C++ source code, the library can be used from nearly any language.
The C++ code was adapted from the mosesdecoder repository (contrib/c++tokenizer).
Benchmark
fast-mosestokenizer is also fast.
On English, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.
See ./bench/README.md for more information.
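To get a rough feel for throughput on your own data, a small timing harness can help. The sketch below is not the project's benchmark script (see ./bench/README.md for that); the helper name `lines_per_second` is hypothetical, and `str.split` is only a stand-in baseline. Pass `MosesTokenizer('en').tokenize` as the callable to time this package.

```python
import time
from typing import Callable, Iterable, List


def lines_per_second(tokenize: Callable[[str], List[str]],
                     lines: Iterable[str],
                     repeats: int = 3) -> float:
    """Return the best observed throughput (lines/second) over `repeats` runs."""
    corpus = list(lines)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in corpus:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero-length timing on very small inputs.
    return len(corpus) / max(best, 1e-9)


if __name__ == "__main__":
    sample = ["The quick brown fox jumps over the lazy dog."] * 10_000
    # str.split is a whitespace baseline, not a Moses-compatible tokenizer.
    print(f"str.split: {lines_per_second(str.split, sample):,.0f} lines/s")
```

Taking the best of several runs reduces noise from other processes; absolute numbers will vary by machine and corpus.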
Installation
Python users on Linux and macOS >= 10.15 can install directly from PyPI.
pip install fast-mosestokenizer
See ./INSTALL.md for more information.
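After installing, a quick way to confirm that the package is visible to your interpreter is to look up its import name. This is a minimal sketch using only the standard library; the helper `is_importable` is hypothetical, and `mosestokenizer` is the import name shown in the Python usage section.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be found on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


# Report whether the package installed by `pip install fast-mosestokenizer` is visible.
if is_importable("mosestokenizer"):
    print("mosestokenizer is available")
else:
    print("mosestokenizer not found; try: pip install fast-mosestokenizer")
```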
Usage (Command-line tool)
# Piping is the standard way to set the input and output streams.
# mosestokenizer applies tokenization to each line of the input stream.
mosestokenizer en < infile > outfile
# For a full list of options, refer to the help message.
mosestokenizer -h
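If you would rather drive the command-line tool from a script, the same redirection can be done with Python's `subprocess` module. This is a sketch that assumes the `mosestokenizer` executable is on your PATH; the helper names `build_command` and `tokenize_file` are hypothetical, and only the `mosestokenizer <lang>` invocation shown above is used.

```python
import shutil
import subprocess
from typing import List


def build_command(lang: str, executable: str = "mosestokenizer") -> List[str]:
    """Build the argument list for `mosestokenizer <lang>`."""
    return [executable, lang]


def tokenize_file(infile: str, outfile: str, lang: str = "en") -> None:
    """Equivalent of `mosestokenizer <lang> < infile > outfile`."""
    if shutil.which("mosestokenizer") is None:
        raise FileNotFoundError("mosestokenizer executable not found on PATH")
    # Open in binary mode so the text passes through without re-encoding.
    with open(infile, "rb") as src, open(outfile, "wb") as dst:
        subprocess.run(build_command(lang), stdin=src, stdout=dst, check=True)
```

For example, `tokenize_file("infile", "outfile")` mirrors the first command above.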
Usage (Python)
# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer
>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
'"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
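To mimic the command-line behaviour (one tokenized line out per input line) from Python, you can wrap the `tokenize` call in a small generator. This is a sketch rather than part of the package: `tokenize_lines` is a hypothetical helper, it only relies on the `MosesTokenizer('en').tokenize(text)` call shown above, and it falls back to `str.split` when the package is not installed.

```python
from typing import Callable, Iterable, Iterator, List


def tokenize_lines(lines: Iterable[str],
                   tokenize: Callable[[str], List[str]]) -> Iterator[str]:
    """Yield each non-empty line as a string of space-joined tokens."""
    for line in lines:
        line = line.strip()
        if line:
            yield " ".join(tokenize(line))


if __name__ == "__main__":
    try:
        from mosestokenizer import MosesTokenizer
        tokenize = MosesTokenizer("en").tokenize
    except ImportError:
        # Whitespace stand-in so the sketch still runs without the package.
        tokenize = str.split
    for out in tokenize_lines(["Hello, world!", "", "lion city"], tokenize):
        print(out)
```

Passing the tokenizer in as a callable keeps the helper independent of any one implementation, so sacremoses or another tokenizer can be swapped in for comparison.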
Download files
Built Distributions
Hashes for fast_mosestokenizer-0.0.1-cp38-cp38-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | bcdc2977adc23ed7533e96afd295367ec495db30fb2e4f8b0d81494d4d25e779
MD5 | b5dce405eec86c6bc7787c7f7f7e5ea4
BLAKE2b-256 | c78917caa73b521ef6c73aab80b0edff20f58a4ba58af48568eecfe77c37000d
Hashes for fast_mosestokenizer-0.0.1-cp38-cp38-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 96c956ba719620dde36f3c8b39a6a90e090f4c50d89108c7bad20323523aa27b
MD5 | 6dd625b51505360832c86fa1abe6a330
BLAKE2b-256 | ca56a942fe6d141ac97437d0c23f5ca7901edaf91096b26266b34296cddf6b43
Hashes for fast_mosestokenizer-0.0.1-cp37-cp37m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 41a2579bb8ce8f3fa344ae74b225441d77e8dbb3267e103596657018c53b297d
MD5 | 1dafcb7826c675666ac0bf9b1513bb01
BLAKE2b-256 | 0e41484254d7736ccb5e71c022071171dba983df46ab408f809ccd6198d89d4c
Hashes for fast_mosestokenizer-0.0.1-cp37-cp37m-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 2eb3619378673be41b3c37c93b5137addffaf03bf421b12923a80b8fa6173ef7
MD5 | ba4fcee22039c72f7cb632ac713cb0ab
BLAKE2b-256 | 51beccd7290cd1267cbd0b90bdd7edbe621ef54456e53d84fc84faacd2f8c8d6
Hashes for fast_mosestokenizer-0.0.1-cp36-cp36m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | bc45f0f04957cf972457fbc3a1c994dabcfe829ced96751e416b88b4915cb8a1
MD5 | c5699ed4a5ce901820f1449a333baaaf
BLAKE2b-256 | f6eeda71eee17941944c662344b475b309fcdd5c2722f26e725f4a418247f760
Hashes for fast_mosestokenizer-0.0.1-cp36-cp36m-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 8bd928fe435ae86fb5e68fe81341ca9f8924764ae212e8f6b88c544ad5bc1fb3
MD5 | cf014a51b751c0eb3ca9d33ad7e5e688
BLAKE2b-256 | 3c5fcd98c431d783590a8e94dacaaed5d7c6f24cdafded1fd3c5e311c7645f9a