c++ mosestokenizer
Project description
fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is a favourite among NLP researchers.
The main reason to use this package over the original Perl implementation is portability. With the C++ source code, you can use this library from basically any language.
The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.
Benchmark
fast-mosestokenizer is also fast.
On English text, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.
See ./bench/README.md for more information.
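To get a rough feel for relative speed on your own data, a small timing harness along the following lines may help. This is only a sketch and is not the project's benchmark in ./bench: the helper name `best_time` is hypothetical, and it accepts any callable that maps a string to a list of tokens.

```python
import time


def best_time(tokenize, lines, repeats=3):
    """Return the fastest wall-clock time (in seconds) needed to tokenize all lines.

    `tokenize` is any callable taking a string and returning tokens.
    Taking the minimum over several runs reduces noise from other processes.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in corpus and tokenizer so the sketch runs without extra packages.
sample_lines = ["The quick brown fox jumps over the lazy dog."] * 1000
baseline_seconds = best_time(str.split, sample_lines)

# With the packages installed, the same helper could be pointed at real
# tokenizers (assumption: both expose a tokenize(text) method), e.g.
#   best_time(MosesTokenizer('en').tokenize, sample_lines)
```

Dividing the time of one tokenizer by the time of another gives a speed-up factor; the numbers will vary with the machine, the language, and the corpus.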
Installation
Python users on Linux and macOS >= 10.15 can install directly from PyPI.
pip install fast-mosestokenizer
See ./INSTALL.md for more information.
Usage (Command-line tool)
# Use shell redirection (or pipes) to set the input and output streams.
# mosestokenizer tokenizes each line of the input stream.
mosestokenizer en < infile > outfile
# For a full list of options, refer to the help message.
mosestokenizer -h
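Because the command-line tool reads from standard input and writes to standard output, it can also be driven from another program. The sketch below shows one way to do that from Python with the standard `subprocess` module. The helper name `run_line_filter` is hypothetical, and the sketch assumes the `mosestokenizer` executable is on your PATH when you call it with that command.

```python
import subprocess


def run_line_filter(command, text):
    """Send `text` to the standard input of `command` and return its standard output.

    `command` is a list such as ["mosestokenizer", "en"].
    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    result = subprocess.run(
        command,
        input=text,
        capture_output=True,
        encoding="utf-8",  # the streams are treated as UTF-8 text
        check=True,
    )
    return result.stdout


# Example call (requires the mosestokenizer executable to be installed):
#   tokenized = run_line_filter(["mosestokenizer", "en"], "Hello, world!\n")
```

For large files, plain shell redirection as shown above is simpler, since this helper holds the whole input and output in memory.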
Usage (Python)
# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer
>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
'"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
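To process many lines from Python in the same line-by-line fashion as the command-line tool, a small generator like the one below may be convenient. It is a sketch that relies only on the `tokenize` method shown above. The helper `tokenize_lines` and the stand-in class `WhitespaceTokenizer` are hypothetical names, and joining tokens with single spaces is an assumption about the desired output format.

```python
def tokenize_lines(tokenizer, lines):
    """Yield one space-joined token string for each input line.

    `tokenizer` is any object with a tokenize(text) method that returns
    a list of strings, such as MosesTokenizer('en') from this package.
    """
    for line in lines:
        tokens = tokenizer.tokenize(line.rstrip("\n"))
        yield " ".join(tokens)


class WhitespaceTokenizer:
    """Stand-in tokenizer so the sketch runs without the package installed."""

    def tokenize(self, text):
        return text.split()


demo = list(tokenize_lines(WhitespaceTokenizer(), ["lion city\n", "pura means city\n"]))

# With fast-mosestokenizer installed, swap in the real tokenizer:
#   from mosestokenizer import MosesTokenizer
#   for out_line in tokenize_lines(MosesTokenizer('en'), open("infile", encoding="utf-8")):
#       print(out_line)
```

Passing the tokenizer in as an argument keeps the helper independent of any one implementation, so the same loop also works with sacremoses or a test double.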
Hashes for fast_mosestokenizer-0.0.7.2-cp38-cp38-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 3c15886aac652d0e915536f0f2a734f8df95853dabb30eae264b24ba6b64ac8b
MD5 | 83c6e0ff774e17c196f1a0ba5d5dc8c3
BLAKE2b-256 | f7fa6cdc3f784090de3e9b426c87ad9313d7976c6aa3cedd96a93e2102f5f4a6
Hashes for fast_mosestokenizer-0.0.7.2-cp38-cp38-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 05f1bd59bb14b412ee4130efa0140993282133a1b5ae3453a612694809f46c40
MD5 | 9e35bcf143122858180ca01f6fbcf0bd
BLAKE2b-256 | f543b6040ba2e8cc4a9b07bcc338d7c89c7e1662b3f174f5441d9789b8d6dc75
Hashes for fast_mosestokenizer-0.0.7.2-cp37-cp37m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 1d37a51cc7719a06a9ebd36f4a0673c5f6e40fa8a9c53b38ebbeedd76ac33044
MD5 | 433b18dd592a9f432e9b9341ad62b769
BLAKE2b-256 | ecfc8362b24cbc765951c324db5e44974717c506d43d372a2cdf85286d4da25c
Hashes for fast_mosestokenizer-0.0.7.2-cp37-cp37m-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | cc4efe49306008c47b79b0ff097e14d8a372607eb3e60829f4b95e697aab1807
MD5 | 6462ad14dc1857cfdd9045aed21bf78e
BLAKE2b-256 | f8ae679cad9ef39933b2eacebbc1930d5aca57f73ccd52ffcc29727a1fe336ab
Hashes for fast_mosestokenizer-0.0.7.2-cp36-cp36m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 5a26b444cc69ea35273c11a50803d11ca1547cd62daf8b17bb6a3fa662557412
MD5 | 41ff5aca5b0c43d012739d040ffdbec7
BLAKE2b-256 | 901ee5f54d370567e5629873e48606a756eda9ac4fa1373ee5ba6d0b62b3ea1a
Hashes for fast_mosestokenizer-0.0.7.2-cp36-cp36m-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 9feeaaaaf8de7d6fd8237e8d346ae2d2b81de370ed673149dc7d8d050342702c
MD5 | 30d8cc9c95659edcc5188646722b2a51
BLAKE2b-256 | ef0231f053fa8a5866c046a7102709c45dfb46a08e58b66ef301dc7c38ca5d2c