c++ mosestokenizer
Project description
fast-mosestokenizer is a C++ implementation of the Moses tokenizer, a favourite among NLP researchers.
The main reason to use this package over the original Perl implementation is portability: because the source is C++, the library can be used from most other languages.
The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.
Benchmark
fast-mosestokenizer is also fast. On English text, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.
See ./bench/README.md for more information.
Installation
Python users on Linux and macOS 10.15 or later can install directly from PyPI.
pip install fast-mosestokenizer
See ./INSTALL.md for more information.
Usage (Command-line tool)
# Piping is the standard way to set the input and output streams.
# mosestokenizer tokenizes each line of the input stream.
mosestokenizer en < infile > outfile
# For a full list of options, refer to the help message.
mosestokenizer -h
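When the command-line tool has to be driven from a script, the same piping can be reproduced with Python's standard `subprocess` module. The following is a minimal sketch, not part of the package: `tokenize_file` and its `command` parameter are hypothetical names, and it assumes the `mosestokenizer` executable is on `PATH`.

```python
import subprocess


def tokenize_file(infile, outfile, lang="en", command=("mosestokenizer",)):
    """Run `<command> <lang> < infile > outfile`, like the shell example above.

    `tokenize_file` and `command` are illustrative names, not part of
    fast-mosestokenizer; `command` only exists so that a different
    executable path can be supplied.
    """
    with open(infile, "rb") as src, open(outfile, "wb") as dst:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run([*command, lang], stdin=src, stdout=dst, check=True)
```

Under those assumptions, `tokenize_file("infile", "outfile")` corresponds to `mosestokenizer en < infile > outfile`.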
Usage (Python)
# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer
>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
'"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
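The command-line tool tokenizes each input line separately, and the same pattern can be sketched in Python. `tokenize_lines` below is a hypothetical helper, not part of the package. It relies only on what the example above shows: the tokenizer object has a `tokenize(text)` method that returns a list of tokens.

```python
def tokenize_lines(tokenizer, lines):
    """Yield one space-joined tokenized string per input line.

    `tokenize_lines` is an illustrative helper, not part of
    fast-mosestokenizer. `tokenizer` is any object with a `tokenize(text)`
    method returning a list of tokens, as `MosesTokenizer('en')` does in
    the example above.
    """
    for line in lines:
        text = line.strip()
        # Emit an empty string for blank lines so that output lines stay
        # aligned one-to-one with input lines.
        yield " ".join(tokenizer.tokenize(text)) if text else ""


# Possible usage, assuming the package is installed:
#
#   from mosestokenizer import MosesTokenizer
#   with open("infile", encoding="utf-8") as f:
#       for tokenized in tokenize_lines(MosesTokenizer('en'), f):
#           print(tokenized)
```

Writing the helper as a generator keeps memory use flat on large files, since only one line is held at a time.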
Download files
Built Distributions
Hashes for fast_mosestokenizer-0.0.7.1-cp38-cp38-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 2231912b048e05f8378ea2ffaa2c5a038c8d7c362cf91cd366b370d939b04f7f
MD5 | 032f0c5fc0262f29375b46f4b7f5fb02
BLAKE2b-256 | 7b91c22ff352602a4e80c1944d93ce55e2b0e7001dd60630b654b50af4aac609
Hashes for fast_mosestokenizer-0.0.7.1-cp38-cp38-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 31e118c8d22e9f26696606ce6347e0ae0b9daf06bc28a45ed6ec26218b94781b
MD5 | 4b80859675f4ad0e0c674a403b9a06bf
BLAKE2b-256 | 9065d678282204a89328a3cd9713ddb014950ac5b3545b0dca75852c4486dfc5
Hashes for fast_mosestokenizer-0.0.7.1-cp37-cp37m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 927260e73d7a94a9f7892332854b884606c07af49908cae8a1557f732f264804
MD5 | a56f8cfe02c66ccfd109e997333d5675
BLAKE2b-256 | cab08a51d79b6ce0445b890fde95d642bcd58810e144ad59e8657f2ccaa185f7
Hashes for fast_mosestokenizer-0.0.7.1-cp37-cp37m-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 2a194230fee1ab03c3beed68e215826097fb3a97c7c7e7df9688a19de894c51a
MD5 | de76f7b938295928b39265d55a55e3a9
BLAKE2b-256 | 5b2e6a23de913638883bbe3ca8e83667dd6b7b1ce71bcf053048c3420e092cdd
Hashes for fast_mosestokenizer-0.0.7.1-cp36-cp36m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 536b615f28ed1f1629e02f6a32629dc8c6b857d9c87be9b10b5b3999a7367513
MD5 | ecaf5e63fceadd9e30d9d74e103a8fab
BLAKE2b-256 | 021b5bcb76603d9241e2d5f44413b099af28c3046c3d8fb453e55f7fb571e468
Hashes for fast_mosestokenizer-0.0.7.1-cp36-cp36m-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | e6c16036eb695a457bba60f2a8c3a8061734dde23a7bdbed6183cbeb8e12ce6e
MD5 | 7f2b89660b84bf06f6f0f9dbcdecf09a
BLAKE2b-256 | 03e7d236d793bd17c6159fa064fb2c0026e9e9152558fac3b53d358cf8c7ed4c
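A downloaded wheel can be checked against the digests listed above with Python's standard `hashlib` module. The sketch below is one way to do it, not an official PyPI or pip tool; `file_digests` is a hypothetical helper name. BLAKE2b-256 is computed here as BLAKE2b with a 32-byte digest.

```python
import hashlib


def file_digests(path, chunk_size=1 << 20):
    """Return the SHA256, MD5 and BLAKE2b-256 hex digests of a file.

    `file_digests` is an illustrative helper, not part of PyPI or pip.
    """
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        # BLAKE2b-256 is BLAKE2b with a 32-byte (256-bit) digest.
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as f:
        # Read in chunks so a large file is not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

For example, `file_digests("fast_mosestokenizer-0.0.7.1-cp38-cp38-manylinux1_x86_64.whl")["SHA256"]` should equal the SHA256 value listed for that wheel if the download is intact.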