c++ mosestokenizer
fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.
The main reason to use this package over the original Perl implementation is portability: because the source is C++, the library can be used from virtually any language.
The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.
Benchmark
fast-mosestokenizer is also fast.
On English text, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.
See ./bench/README.md for more information.
Installation
Python users on Linux and macOS >= 10.15 can install directly from PyPI.
pip install fast-mosestokenizer
See ./INSTALL.md for more information.
Usage (Command-line tool)
# Piping is the standard way to configure the input and output streams.
# mosestokenizer applies tokenization to each line of the input stream.
mosestokenizer en < infile > outfile
# For a full list of options, refer to the help message.
mosestokenizer -h
Usage (Python)
# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer
>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
'"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
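To mirror the command-line tool's line-by-line behaviour from Python, something like the following sketch may help. It relies only on the `tokenize(text)` call shown above returning a list of strings. The `tokenize_lines` helper and the `WhitespaceTokenizer` stand-in are hypothetical names introduced for illustration and are not part of this package; the stand-in only exists so the sketch runs when fast-mosestokenizer is not installed.

```python
from typing import Iterable, Iterator, List


class WhitespaceTokenizer:
    """Hypothetical stand-in with the same tokenize(text) -> list[str] shape.

    It only splits on whitespace and is NOT equivalent to Moses tokenization.
    """

    def tokenize(self, text: str) -> List[str]:
        return text.split()


def tokenize_lines(tokenizer, lines: Iterable[str]) -> Iterator[str]:
    """Yield one space-joined token string per input line (hypothetical helper)."""
    for line in lines:
        # Drop the trailing newline so it does not affect tokenization.
        tokens = tokenizer.tokenize(line.rstrip("\n"))
        yield " ".join(tokens)


if __name__ == "__main__":
    try:
        # Real tokenizer, constructed as in the usage example above.
        from mosestokenizer import MosesTokenizer
        tokenizer = MosesTokenizer("en")
    except ImportError:
        # Fall back to the stand-in so the sketch still runs.
        tokenizer = WhitespaceTokenizer()

    for tokenized in tokenize_lines(tokenizer, ["Hello, world!", "lion city"]):
        print(tokenized)
```

Passing an open file object as `lines` works as well, since iterating over a file yields one line at a time.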