c++ mosestokenizer
fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.
The main reason to use this package over the original Perl implementation is portability. Because the source is C++, the library can be used from almost any language that can bind to C++ code.
The C++ script was adapted from the mosesdecoder repository (contrib/c++tokenizer).
Benchmark
fast-mosestokenizer is also fast.
On English text, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.
See ./bench/README.md for more information.
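The figures above are the project's own measurements. To get a rough comparison on your own data, a throughput check could be sketched as follows. The helper name lines_per_second is hypothetical and is not part of this package or its bench scripts; it accepts any callable that maps a string to a list of tokens.

```python
import time
from typing import Callable, List, Sequence


def lines_per_second(
    tokenize: Callable[[str], List[str]],
    lines: Sequence[str],
    repeats: int = 3,
) -> float:
    """Return the best observed throughput (lines/second) over `repeats` runs.

    Hypothetical helper for illustration; not the project's benchmark code.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very small inputs.
    return len(lines) / best if best > 0 else float("inf")


# Stand-in tokenizer so the sketch runs without any package installed.
# With fast-mosestokenizer installed, pass MosesTokenizer('en').tokenize instead.
sample = ["The quick brown fox jumps over the lazy dog."] * 1000
rate = lines_per_second(str.split, sample)
```

Taking the best of several runs reduces noise from other processes; absolute numbers will still vary by machine and input.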
Installation
Python users on Linux and macOS (>= 10.15) can install directly from PyPI.
pip install fast-mosestokenizer
See ./INSTALL.md for more information.
Usage (Command-line tool)
# Piping is the standard way to configure the input and output streams.
# mosestokenizer applies tokenization to each line of the input stream.
mosestokenizer en < infile > outfile
# For a full list of options, refer to the help message.
mosestokenizer -h
Usage (Python)
# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer
>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
'"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
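The command-line tool tokenizes its input one line at a time, and the same pattern can be sketched in Python. This is a minimal sketch that assumes only the MosesTokenizer('en') constructor and the tokenize() method shown above; tokenize_lines is a hypothetical helper, not part of the package, and the demonstration uses str.split as a stand-in so it runs without the package installed.

```python
from typing import Callable, Iterable, Iterator, List


def tokenize_lines(
    lines: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> Iterator[str]:
    """Yield each input line as a string of space-joined tokens.

    Hypothetical helper for illustration. `tokenize` is any callable that
    maps a string to a list of tokens.
    """
    for line in lines:
        # Strip the trailing newline before tokenizing.
        yield " ".join(tokenize(line.rstrip("\n")))


# With fast-mosestokenizer installed (assumption: the API shown above):
#   from mosestokenizer import MosesTokenizer
#   tokenize = MosesTokenizer('en').tokenize
# Whitespace splitting is used here only as a stand-in; it is NOT
# equivalent to Moses tokenization.
result = list(tokenize_lines(["hello world\n", "lion  city\n"], str.split))
```

Because tokenize_lines is a generator, it can be fed an open file object directly and will process large inputs without loading them fully into memory.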