c++ mosestokenizer
Project description
fast-mosestokenizer is a C++ implementation of the Moses tokenizer, a favourite among NLP researchers.
The main reason to use this package over the original Perl implementation is portability: because the source is C++, the library can be used from almost any language.
The C++ source was adapted from contrib/c++tokenizer in the mosesdecoder repository.
Benchmark
fast-mosestokenizer is also fast.
On English text, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.
See ./bench/README.md for more information.
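To get a rough feel for such a comparison on your own data, a timing harness along the following lines may help. It is a minimal sketch and not the project's benchmark script in ./bench: `time_tokenizer` and `speedup` are hypothetical helper names, and the harness assumes only that a tokenizer is a callable mapping a string to a list of tokens.

```python
import time
from typing import Callable, Iterable, List


def time_tokenizer(tokenize: Callable[[str], List[str]],
                   lines: Iterable[str],
                   repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` passes.

    Hypothetical helper, not part of fast-mosestokenizer.
    """
    lines = list(lines)  # materialise once so every pass sees the same input
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

With both packages installed, one could pass the `tokenize` method of each library's tokenizer as `tokenize` and compare the two timings with `speedup`. Results will depend on the corpus and machine, so they need not match the figures quoted above.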
Installation
Python users on Linux and macOS >= 10.15 can install directly from PyPI.
pip install fast-mosestokenizer
See ./INSTALL.md for more information.
Usage (Command-line tool)
# Use shell redirection or pipes to set the input and output streams.
# mosestokenizer tokenizes each line of the input stream.
mosestokenizer en < infile > outfile
# For a full list of options, refer to the help message.
mosestokenizer -h
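The same invocation can be driven from a script. The sketch below wraps the command shown above with Python's standard `subprocess` module; `build_command` and `tokenize_file` are hypothetical helper names, and the code assumes only that a `mosestokenizer` executable is on the PATH and accepts a language code as its single argument, as in the example above.

```python
import shutil
import subprocess
from typing import List


def build_command(lang: str) -> List[str]:
    """Build the argv for `mosestokenizer <lang>` (hypothetical helper)."""
    return ["mosestokenizer", lang]


def tokenize_file(lang: str, infile: str, outfile: str) -> None:
    """Run the equivalent of `mosestokenizer <lang> < infile > outfile`."""
    if shutil.which("mosestokenizer") is None:
        raise FileNotFoundError("mosestokenizer executable not found on PATH")
    # Open in binary mode so the bytes pass through to the tool untouched.
    with open(infile, "rb") as src, open(outfile, "wb") as dst:
        subprocess.run(build_command(lang), stdin=src, stdout=dst, check=True)
```

Passing `check=True` makes a non-zero exit status raise `subprocess.CalledProcessError` instead of silently leaving a partial output file.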
Usage (Python)
# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer
>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
'"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
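For whole files, it may be convenient to tokenize line by line, much as the command-line tool does. The following is a minimal sketch under that assumption: `tokenize_lines` and the `Tokenizer` protocol are hypothetical names introduced here, and the helper relies only on the `tokenize(text)` method shown above returning a list of tokens.

```python
from typing import Iterable, Iterator, List, Protocol


class Tokenizer(Protocol):
    """Anything exposing tokenize(text) -> list of tokens."""

    def tokenize(self, text: str) -> List[str]: ...


def tokenize_lines(tokenizer: Tokenizer, lines: Iterable[str]) -> Iterator[str]:
    """Yield each input line as space-joined tokens, one output per input line."""
    for line in lines:
        # Strip the trailing newline so it never reaches the tokenizer.
        yield " ".join(tokenizer.tokenize(line.rstrip("\n")))
```

With the package installed, passing `MosesTokenizer('en')` as `tokenizer` and an open text file as `lines` should produce one tokenized line per input line. Because the helper is a generator, large files are processed without loading them fully into memory.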