c++ mosestokenizer
fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.
The main reason to use this package over the original Perl implementation is portability. Because the tokenizer is written in C++, it can be wrapped and called from most other languages.
The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.
Benchmark
fast-mosestokenizer is also fast.
On English, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.
See ./bench/README.md for more information.
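To get a rough feel for relative speed on your own data, a small timing harness can help. The sketch below is not the project's benchmark script (that lives under ./bench). The names `time_tokenizer` and `speedup` are hypothetical, and the harness accepts any callable that maps a string to a list of tokens.

```python
import time


def time_tokenizer(tokenize, lines, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` passes.

    `tokenize` is any callable that takes one string and returns tokens.
    Hypothetical helper for illustration; not part of the package.
    """
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        # Keep the fastest pass to reduce noise from other processes.
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

Timing two tokenizers over the same list of lines and passing both results to `speedup` gives a ratio that can be compared against the figures quoted above. Results will vary with hardware, input text, and language.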
Installation
Python users on Linux and macOS >= 10.15 can install directly from PyPI.
pip install fast-mosestokenizer
See ./INSTALL.md for more information.
Usage (Command-line tool)
# Piping is the standard way to set the input and output streams.
# mosestokenizer tokenizes each line of the input stream.
mosestokenizer en < infile > outfile
# For a full list of options, refer to the help message.
mosestokenizer -h
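The same pipe can be driven from a script. The following is a minimal Python sketch, assuming the `mosestokenizer` executable is on your PATH and takes the language code as its first argument, as in the command above. The helper names `build_command` and `pipe_through` are hypothetical and not part of the package.

```python
import subprocess


def build_command(lang="en", executable="mosestokenizer"):
    """Build the argument list for `mosestokenizer <lang>`.

    Assumes the executable name and argument order shown in the
    command-line usage; adjust if your installation differs.
    """
    return [executable, lang]


def pipe_through(command, text):
    """Send `text` to the command's stdin and return its stdout."""
    result = subprocess.run(
        command,
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Calling `pipe_through(build_command("en"), "Hello, world!\n")` would then be the scripted counterpart of `mosestokenizer en < infile > outfile` for a single line.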
Usage (Python)
# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer
>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
'"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
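For corpus-style input it is often convenient to tokenize line by line, as the command-line tool does. The sketch below relies only on the `tokenize` method shown above; `tokenize_lines` and `main` are hypothetical helper names, and joining tokens with a single space is an assumption about the desired output format.

```python
def tokenize_lines(tokenizer, lines):
    """Yield one space-joined tokenized string per input line.

    `tokenizer` is any object with a `tokenize(str) -> list[str]` method.
    Blank lines are passed through as empty strings so that output lines
    stay aligned with input lines (useful for parallel corpora).
    """
    for line in lines:
        line = line.strip()
        if not line:
            yield ""
            continue
        yield " ".join(tokenizer.tokenize(line))


def main():
    # Imported here so tokenize_lines can be reused with other tokenizers.
    from mosestokenizer import MosesTokenizer

    tokenizer = MosesTokenizer("en")
    for tokenized in tokenize_lines(tokenizer, ["Hello, world!", "", "It works."]):
        print(tokenized)
```

Because `tokenize_lines` is a generator, it can be fed an open file object directly and will process large files without loading them into memory.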