c++ mosestokenizer
Project description
opus-fast-mosestokenizer is a fork of fast-mosestokenizer created to ensure compatibility of the package with current Python environments.
fast-mosestokenizer is a C++ implementation of the Moses tokenizer, a favourite among NLP researchers.
The main reason to use this package over the original Perl implementation is portability: with the C++ source code, you can use this library from basically any language.
The C++ script was adapted from contrib/c++tokenizer in the mosesdecoder repository.
Benchmark
fast-mosestokenizer is also fast.
On English, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.
See ./bench/README.md for more information.
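To check throughput on your own data, a small timing harness is enough. The following is a minimal sketch, not the benchmark used in ./bench: the function name `lines_per_second` is hypothetical, and `str.split` stands in for a real tokenizer so the snippet runs without any package installed.

```python
import time
from typing import Callable, Iterable, List


def lines_per_second(tokenize: Callable[[str], List[str]],
                     lines: Iterable[str],
                     repeats: int = 3) -> float:
    """Return the best observed throughput (lines/second) over `repeats` runs."""
    corpus = list(lines)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in corpus:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from a very coarse clock.
    return len(corpus) / max(best, 1e-9)


# Whitespace splitting stands in for a real tokenizer here (an assumption
# made only so that the sketch is self-contained).
sample = ["Hello , world !"] * 1000
baseline = lines_per_second(str.split, sample)
print(f"baseline: {baseline:.0f} lines/s")
```

Passing the `tokenize` method of each tokenizer under comparison, with the same input lines, should give numbers that can be compared directly; absolute values will depend on the machine.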
Installation
Python users on Linux and macOS >= 10.15 can install directly from PyPI.
pip install fast-mosestokenizer
See ./INSTALL.md for more information.
Usage (Command-line tool)
# Piping is the standard way to configure the input and output streams.
# mosestokenizer applies tokenization to each line of the input stream.
mosestokenizer en < infile > outfile
# For a full list of options, refer to the help message.
mosestokenizer -h
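The same piping pattern can be driven from a script. The sketch below assumes only what the commands above show, namely that `mosestokenizer en` reads from stdin and writes to stdout; the helper name `tokenize_text` is hypothetical. A Python one-liner stands in for the binary in the demo so that the snippet runs even where mosestokenizer is not installed.

```python
import subprocess
import sys
from typing import Sequence


def tokenize_text(text: str,
                  command: Sequence[str] = ("mosestokenizer", "en")) -> str:
    """Pipe `text` through `command` and return what it writes to stdout."""
    result = subprocess.run(
        list(command),
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Stand-in child process (an assumption for demonstration only):
# it echoes stdin back in upper case instead of tokenizing it.
echo_upper = (sys.executable, "-c",
              "import sys; sys.stdout.write(sys.stdin.read().upper())")
print(tokenize_text("hello world", command=echo_upper))
```

With the command-line tool on the PATH, calling `tokenize_text(text)` with the default `command` should be equivalent to the shell redirection shown above.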
Usage (Python)
# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer
>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
'"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
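For whole files, it is often convenient to tokenize line by line, mirroring what the command-line tool does. This is a minimal sketch that relies only on the `tokenize` method shown above; `tokenize_lines` and `WhitespaceTokenizer` are hypothetical names, and the whitespace tokenizer is a stand-in so the snippet runs without the package installed.

```python
from typing import Iterable, Iterator, List, Protocol


class Tokenizer(Protocol):
    """Anything with a tokenize(text) -> list of tokens method."""

    def tokenize(self, text: str) -> List[str]: ...


def tokenize_lines(tokenizer: Tokenizer,
                   lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one token list per non-empty input line."""
    for line in lines:
        line = line.strip()
        if line:
            yield tokenizer.tokenize(line)


class WhitespaceTokenizer:
    """Hypothetical stand-in; it splits on whitespace only."""

    def tokenize(self, text: str) -> List[str]:
        return text.split()


for tokens in tokenize_lines(WhitespaceTokenizer(),
                             ["Hello world\n", "\n", "lion city\n"]):
    print(tokens)
```

With the package installed, passing `MosesTokenizer('en')` and an open text file in place of the stand-ins should behave analogously, assuming `tokenize` returns a list of strings as in the session above.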
Download files
Hashes for opus-fast-mosestokenizer-0.0.8.3.tar.gz
Algorithm | Hash digest
---|---
SHA256 | a786d2a774281ac8b0fad7ec3f0a2a9859700c4e7a532cf2fdb5d5bf460dc4b3
MD5 | 35bf64673421132835da719c4e283239
BLAKE2b-256 | 69f397283a6600c575555770741f1a49824a1dd20bae133c631cd729090689d3
Hashes for opus_fast_mosestokenizer-0.0.8.3-cp310-cp310-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 229103a940465aec701b9902c3942595a2d1c8435a0864a1666dbc08d78a2811
MD5 | 2f09fae0f78fd12ccec590dd4c9d1d51
BLAKE2b-256 | 939301afef179da7908d4b2c6b10430e222db84d09d2e4ddcd760b8cc2a4d53e
Hashes for opus_fast_mosestokenizer-0.0.8.3-cp310-cp310-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 4ed4ab3bf7045cf2b880874d3ab104fd92851670c64e4dce1be72deee66eced4
MD5 | 4b9a0f6c09c969c345efad957c8c2ac5
BLAKE2b-256 | ad2124abc8b972099a66227b450f247c69f17e73cb5cd0fde638cf33db7080de
Hashes for opus_fast_mosestokenizer-0.0.8.3-cp39-cp39-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 92a1eb9e6a9afc27909636da6988b93321f62baab3ca1f7057a2b845c21244d2
MD5 | 8cb2e9b76436e0e41fb11aefdae10698
BLAKE2b-256 | d19c6915c7ec41e6ad9f2af0deff3853db64256b9feca051be019a010f01a5d2
Hashes for opus_fast_mosestokenizer-0.0.8.3-cp39-cp39-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 83052c1de1b80c63839da3bba3e2bd8ee76d68f00c6097dd51869d81f5e314f2
MD5 | 55dc631f447244e6fb96b152fd3b7305
BLAKE2b-256 | fd554a5df95ef571116a34b946eb780324a2a9350026733e7eb8e8e48b4e4e91
Hashes for opus_fast_mosestokenizer-0.0.8.3-cp38-cp38-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | b6d84dcb2d239766aaa3921f0d72a633edec63789b06d9f2c0f0ea7d6d366501
MD5 | b04c7b7071b19a7fa75f7c219ac66916
BLAKE2b-256 | 613da96882dc1e83de2ee97d5af3a32c49c57a757e6d4c36fd96f12f77896227
Hashes for opus_fast_mosestokenizer-0.0.8.3-cp38-cp38-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | ac64d5123097cd2c3b0921031b7ae41b5dc1566dca4cded066771c3829070d4d
MD5 | 4573fce1462150a77a1332433c4fb8d3
BLAKE2b-256 | eb54d94c406d91c6eab602f9717b4736e59ded617f5e501322e0e967e9aa2002
Hashes for opus_fast_mosestokenizer-0.0.8.3-cp37-cp37m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | a1fa0eabed3a84c72f4231a2bd0c8b20535989ed40c037794eaeda4c1efba5b8
MD5 | 0bdc3e2bac99bd4cb795fd974ac1378f
BLAKE2b-256 | 5e45e78ef60a270727ac978a986d4d3b43988cba00d91aefce7ea01987322326
Hashes for opus_fast_mosestokenizer-0.0.8.3-cp37-cp37m-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 2292995fb1edf932ef03603f36d6d72d846f59c0eb3fae513a9726c954d0f94d
MD5 | bedaf59c99a9afa75cf15e83dcf64d1b
BLAKE2b-256 | 3ab6b203e06d666218cabc60b29693f293b068e4ad5ed88aedd5a9118ed3aaa3
Hashes for opus_fast_mosestokenizer-0.0.8.3-cp36-cp36m-manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | d36024797f8f1e3439fa41b409aeae60572ec2c9034a16690c4d9bbe9cffa2f9
MD5 | 7902f0321f536517c30cabfa120c23e0
BLAKE2b-256 | 694804da89992a5c3e0eefb78dd9b137359afdfadf3ce74b3ad626994366e3bd
Hashes for opus_fast_mosestokenizer-0.0.8.3-cp36-cp36m-macosx_10_15_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | f6f8d709afe6a80bd42c672b4bd283e2a6a3c0860af59644ea4477fc79d138bf
MD5 | 566848b37f5f93cadd42e4a101ecc8d6
BLAKE2b-256 | 346078dfd736f8ecb2c931ae1d157b56320b5685f08b5a511e60145977905d82
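The digests listed above can be compared against a downloaded file with the standard library alone. This is a minimal sketch: `file_digests` and `matches` are hypothetical helper names, and BLAKE2b-256 is taken to mean BLAKE2b with a 32-byte digest. The demo hashes a small temporary file, not one of the distributions.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict


def file_digests(path: Path, chunk_size: int = 1 << 20) -> Dict[str, str]:
    """Return the hex digests of `path` under the three listed algorithms."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as handle:
        # Read in chunks so that large wheels are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def matches(path: Path, algorithm: str, expected: str) -> bool:
    """Compare one digest of `path` with an expected hex string."""
    return file_digests(path)[algorithm] == expected.strip().lower()


# Demonstration on a throwaway file containing the bytes b"abc".
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"abc")
    digests = file_digests(sample)
    sample_ok = matches(sample, "SHA256", digests["SHA256"].upper())
print(digests["SHA256"])
```

To check a real download, pass the path of the file and the matching value from its table to `matches`; a `False` result suggests a corrupted or different file.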