c++ mosestokenizer

Project description

opus-fast-mosestokenizer is a fork of fast-mosestokenizer created to ensure compatibility of the package with current Python environments.

fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is widely used in NLP research.

The main reason to use this package over the original Perl implementation is portability. With the C++ source code, the library can be used from almost any language.

The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.

Benchmark

fast-mosestokenizer is also fast. On English text, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.

See ./bench/README.md for more information.
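
To check the speed on your own data, a small timing harness is enough. The sketch below is not the project's benchmark from ./bench; `lines_per_second` and `speedup` are hypothetical helper names, and the harness only assumes that each tokenizer is a callable taking one string.

```python
import time


def lines_per_second(tokenize, lines, repeats=3):
    """Return the best observed throughput (lines per second) of `tokenize`.

    `tokenize` is any callable that accepts one string; `lines` must be a list.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse clocks.
    return len(lines) / max(best, 1e-9)


def speedup(candidate, baseline, lines, repeats=3):
    """Return how many times faster `candidate` is than `baseline` on `lines`."""
    return lines_per_second(candidate, lines, repeats) / lines_per_second(
        baseline, lines, repeats
    )
```

With both packages installed, `candidate` could be the `tokenize` method of this package's `MosesTokenizer('en')` and `baseline` the tokenizer you are comparing against. Results depend on the machine and the corpus, so they will not necessarily match the figures quoted above.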

Installation

Python users on Linux and macOS >= 10.15 can install directly from PyPI.

pip install fast-mosestokenizer

See ./INSTALL.md for more information.

Usage (Command-line tool)

# Piping is the standard way to configure input and output stream.
# mosestokenizer would apply tokenization to each line of the input stream.
mosestokenizer en < infile > outfile

# For a full list of options, refer to the help message.
mosestokenizer -h
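
The same piping can be driven from a script. The following is a minimal sketch that assumes the `mosestokenizer` executable is on PATH and is invoked exactly as above (`mosestokenizer en < infile > outfile`); `run_line_filter` and `tokenize_file` are hypothetical helper names, not part of the package.

```python
import subprocess


def run_line_filter(command, infile, outfile):
    """Run `command < infile > outfile` and raise if the command fails."""
    with open(infile, "rb") as src, open(outfile, "wb") as dst:
        subprocess.run(list(command), stdin=src, stdout=dst, check=True)


def tokenize_file(infile, outfile, lang="en"):
    """Tokenize `infile` into `outfile` with the command-line tool.

    Assumption: `mosestokenizer` is installed and reachable on PATH.
    """
    run_line_filter(["mosestokenizer", lang], infile, outfile)
```

Keeping the command as a parameter of `run_line_filter` makes it possible to pass extra options once you have checked them in `mosestokenizer -h`.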

Usage (Python)

# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer

>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
  'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
  'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
  'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
  'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
  'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
  'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
  '"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
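
To reproduce the line-by-line behaviour of the command-line tool from Python, the `tokenize` call can be applied to each line of a file. This is a sketch only: `tokenize_lines` is a hypothetical helper, and it takes the tokenizer as a plain callable so that it does not depend on any API beyond the `tokenize` method shown above.

```python
def tokenize_lines(lines, tokenize):
    """Yield one space-joined token string per input line.

    `tokenize` is any callable mapping a string to a list of tokens.
    Blank lines yield an empty string so output stays aligned with input.
    """
    for line in lines:
        text = line.strip()
        if not text:
            yield ""
            continue
        yield " ".join(tokenize(text))
```

With the package installed, passing an open text file as `lines` and `MosesTokenizer('en').tokenize` as `tokenize` gives one tokenized line per input line.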

Download files

Source Distribution

opus-fast-mosestokenizer-0.0.8.5.tar.gz (86.8 kB), source

Built Distributions

opus_fast_mosestokenizer-0.0.8.5-cp311-cp311-macosx_12_0_x86_64.whl (727.2 kB), CPython 3.11, macOS 12.0+ x86-64
opus_fast_mosestokenizer-0.0.8.5-cp310-cp310-macosx_12_0_x86_64.whl (727.2 kB), CPython 3.10, macOS 12.0+ x86-64
opus_fast_mosestokenizer-0.0.8.5-cp39-cp39-macosx_12_0_x86_64.whl (727.3 kB), CPython 3.9, macOS 12.0+ x86-64
opus_fast_mosestokenizer-0.0.8.5-cp38-cp38-macosx_12_0_x86_64.whl (727.2 kB), CPython 3.8, macOS 12.0+ x86-64
opus_fast_mosestokenizer-0.0.8.5-cp37-cp37m-macosx_12_0_x86_64.whl (726.5 kB), CPython 3.7m, macOS 12.0+ x86-64
opus_fast_mosestokenizer-0.0.8.5-cp36-cp36m-macosx_12_0_x86_64.whl (726.6 kB), CPython 3.6m, macOS 12.0+ x86-64
