c++ mosestokenizer
Project description
fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.
The main reason to use this package over the original Perl implementation is portability. Because the tokenizer is plain C++ source code, the library can be used from practically any language.
The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.
Benchmark
fast-mosestokenizer is also fast. On English text, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.
See ./bench/README.md for more information.
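For a rough sanity check of the speed-up on your own data, a small timing harness such as the one below may help. This is only a sketch and not the benchmark used in ./bench: the helper names `time_tokenizer` and `compare` are hypothetical, and it only measures calls made through Python, so it says nothing about tokenizer.perl.

```python
import time
from typing import Callable, Dict, Iterable, List


def time_tokenizer(
    tokenize: Callable[[str], List[str]],
    lines: Iterable[str],
    repeats: int = 3,
) -> float:
    """Return the best wall-clock time in seconds over `repeats` full passes."""
    lines = list(lines)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    return best


def compare(
    tokenizers: Dict[str, Callable[[str], List[str]]],
    lines: Iterable[str],
    repeats: int = 3,
) -> Dict[str, float]:
    """Time every tokenizer callable on the same lines (hypothetical helper)."""
    lines = list(lines)
    return {
        name: time_tokenizer(fn, lines, repeats)
        for name, fn in tokenizers.items()
    }


# Possible usage, assuming both packages are installed:
#   import mosestokenizer, sacremoses
#   compare(
#       {
#           "fast-mosestokenizer": mosestokenizer.MosesTokenizer("en").tokenize,
#           "sacremoses": sacremoses.MosesTokenizer(lang="en").tokenize,
#       },
#       open("infile", encoding="utf-8"),
#   )
```

Taking the best of several passes reduces noise from other processes; the absolute numbers will still depend on the machine and the input text.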
Installation
Python users on Linux and macOS >= 10.15 can install directly from PyPI.
pip install fast-mosestokenizer
See ./INSTALL.md for more information.
Usage (Command-line tool)
# Piping is the standard way to configure the input and output streams.
# mosestokenizer applies tokenization to each line of the input stream.
mosestokenizer en < infile > outfile
# For a full list of options, refer to the help message.
mosestokenizer -h
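Because the command-line tool reads from standard input and writes to standard output, it can also be driven from another program. The sketch below does this from Python with `subprocess`; the helper name `run_line_filter` is hypothetical, and the exact format of each output line is whatever the tool prints, which is not specified here.

```python
import subprocess
from typing import List, Sequence


def run_line_filter(command: Sequence[str], text: str) -> List[str]:
    """Pipe `text` into `command` and return its stdout split into lines.

    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    completed = subprocess.run(
        list(command),
        input=text,
        capture_output=True,
        encoding="utf-8",  # send and decode the streams as UTF-8 text
        check=True,
    )
    return completed.stdout.splitlines()


# Possible usage, assuming the mosestokenizer executable is on PATH:
#   lines = run_line_filter(["mosestokenizer", "en"], "Hello, world!\n")
```

Starting a process per call is expensive, so when using this approach it is better to pass many lines in one call than to invoke the tool once per sentence.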
Usage (Python)
# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer
>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
'"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
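To tokenize a whole file rather than a single string, the same `tokenize` call can be applied line by line. The following is a minimal sketch under stated assumptions: `tokenize_lines` and `load_moses_tokenize` are hypothetical helper names, blank lines are skipped by choice, and only `MosesTokenizer('en')` and `.tokenize(...)` come from the example above.

```python
from typing import Callable, Iterable, Iterator, List


def tokenize_lines(
    lines: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> Iterator[List[str]]:
    """Yield one token list per non-blank input line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines (a choice made for this sketch)
            yield tokenize(line)


def load_moses_tokenize(lang: str = "en") -> Callable[[str], List[str]]:
    """Return the tokenize method of a MosesTokenizer for `lang`.

    The import is deferred so that tokenize_lines can be used and tested
    without fast-mosestokenizer installed.
    """
    from mosestokenizer import MosesTokenizer

    return MosesTokenizer(lang).tokenize


# Possible usage, assuming fast-mosestokenizer is installed:
#   tokenize = load_moses_tokenize("en")
#   with open("infile", encoding="utf-8") as f:
#       for tokens in tokenize_lines(f, tokenize):
#           print(" ".join(tokens))
```

Passing the tokenizer in as a callable keeps the loop independent of the library, so the same function works with sacremoses or any other tokenizer that maps a string to a list of tokens.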