Python binding for nlpO3 Thai language processing library in Rust

SPDX-FileCopyrightText: 2024 PyThaiNLP Project
SPDX-License-Identifier: Apache-2.0

nlpO3 Python binding

Python binding for nlpO3, a Thai natural language processing library in Rust.

To install:

pip install nlpo3

Features

  • Thai word tokenizer
    • segment() - uses a maximal-matching, dictionary-based tokenization algorithm and honors Thai Character Cluster boundaries
      • 2.5x faster than a similar pure-Python implementation (PyThaiNLP's newmm)
    • load_dict() - loads a dictionary from a plain text file (one word per line)
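
To give a feel for what dictionary-based maximal matching means, here is a simplified pure-Python sketch. It is not nlpO3's implementation: it ignores Thai Character Cluster boundaries, treats any character not covered by the dictionary as a one-character token, and picks a segmentation with the fewest tokens. The function name maximal_matching is hypothetical.

```python
def maximal_matching(text, words):
    """Split text into the fewest tokens, using dictionary words where possible.

    Simplified illustration only: no Thai Character Cluster handling.
    Characters not covered by any dictionary word become single-character tokens.
    """
    n = len(text)
    max_len = max((len(w) for w in words), default=1)
    best = [float("inf")] * (n + 1)  # best[i]: fewest tokens covering text[:i]
    prev = [0] * (n + 1)             # prev[i]: start index of the last token
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            # A candidate token is a dictionary word or a single character.
            if piece in words or end - start == 1:
                if best[start] + 1 < best[end]:
                    best[end] = best[start] + 1
                    prev[end] = start
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[prev[i]:i])
        i = prev[i]
    return tokens[::-1]


print(maximal_matching("สวัสดีครับ", {"สวัสดี", "ครับ"}))  # ['สวัสดี', 'ครับ']
```

The real tokenizer also resolves ties and respects character clusters, so its output can differ from this sketch on ambiguous input.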

Use

Load the file path/to/dict.file into memory and assign it the name dict_name.

Then tokenize a text with the dict_name dictionary:

from nlpo3 import load_dict, segment

load_dict("path/to/dict.file", "dict_name")
segment("สวัสดีครับ", "dict_name")

It returns a list of strings:

['สวัสดี', 'ครับ']

(result depends on words included in the dictionary)
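
Since the result depends on the dictionary, it can help to try a very small one first. The sketch below writes a one-word-per-line file and, if nlpo3 is installed, loads it and tokenizes the same text. The helper write_dict_file, the name demo_dict, and the choice of UTF-8 encoding are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def write_dict_file(words, path):
    """Write unique, non-empty words to path, one per line (UTF-8 assumed)."""
    cleaned = sorted({w.strip() for w in words if w.strip()})
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return str(path)


# A tiny two-word dictionary; the duplicate with stray spaces is dropped.
dict_path = write_dict_file(
    ["สวัสดี", "ครับ", " ครับ "],
    Path(tempfile.mkdtemp()) / "words.txt",
)

try:
    from nlpo3 import load_dict, segment
except ImportError:
    load_dict = segment = None  # nlpo3 not installed; the file is still written

if load_dict is not None:
    load_dict(dict_path, "demo_dict")          # hypothetical dictionary name
    print(segment("สวัสดีครับ", "demo_dict"))
```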

Use multithreaded mode with the same dict_name dictionary:

segment("สวัสดีครับ", dict_name="dict_name", parallel=True)

Use safe mode to avoid long wait times in edge cases where the text contains many ambiguous word boundaries:

segment("สวัสดีครับ", dict_name="dict_name", safe=True)
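
When the same options are applied to many texts, they can be bound once and reused. The following is a minimal sketch: tokenize_all is a hypothetical helper, and it assumes load_dict(..., "dict_name") has already been called before the bound tokenizer is used. If nlpo3 is not installed, str.split serves as a stand-in so the sketch still runs.

```python
from functools import partial


def tokenize_all(texts, tokenize):
    """Apply a tokenizer callable to each text, skipping blank entries."""
    return [tokenize(t) for t in texts if t.strip()]


try:
    from nlpo3 import segment
    # Bind the dictionary name and safe mode once (keyword names as shown above).
    tokenize = partial(segment, dict_name="dict_name", safe=True)
except ImportError:
    tokenize = str.split  # stand-in tokenizer when nlpo3 is unavailable

# Usage, once the dictionary is loaded:
# results = tokenize_all(["สวัสดีครับ", ""], tokenize)
```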

Dictionary

  • In the interest of keeping the library small, nlpO3 does not bundle a dictionary and makes no assumption about which one the user wants.
  • A dictionary is needed for the dictionary-based word tokenizer.
  • For tokenization dictionary, try

Build

Requirements

  • Rust 2018 Edition
  • Python 3.7 or newer (PyO3's minimum supported version)
  • Python Development Headers
    • Ubuntu: sudo apt-get install python3-dev
    • macOS: No action needed
  • PyO3 - already included in Cargo.toml
  • setuptools-rust

Steps

python -m pip install --upgrade build
python -m build

This should generate a wheel file in the dist/ directory, which can be installed with pip.

To install a wheel from a local directory:

pip install dist/nlpo3-1.3.1-cp311-cp311-macosx_12_0_x86_64.whl 
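
The wheel file name will differ depending on the version, Python interpreter, and platform used for the build. As a rough aid for reading it, the sketch below splits a name into the fields of the standard wheel naming scheme (distribution, version, optional build tag, Python tag, ABI tag, platform tag). The function parse_wheel_name is a hypothetical helper, not part of nlpo3 or pip.

```python
def parse_wheel_name(filename):
    """Split a wheel file name into its dash-separated naming fields."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file name: " + filename)
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        # The optional build tag sits between the version and the Python tag.
        name, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError("unexpected wheel file name: " + filename)
    return {
        "distribution": name,
        "version": version,
        "build": build,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


info = parse_wheel_name("nlpo3-1.3.1-cp311-cp311-macosx_12_0_x86_64.whl")
print(info["python"], info["platform"])  # cp311 macosx_12_0_x86_64
```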

Test

To run a Python unit test:

cd tests
python -m unittest

Issues

Please report issues at https://github.com/PyThaiNLP/nlpo3/issues

License

The nlpO3 Python binding is copyrighted by its authors and licensed under the terms of the Apache License 2.0 (Apache-2.0). See the LICENSE file for details.

Binary wheels

A pre-built binary package is available from PyPI for these platforms:

Python      OS          Architecture   Has binary wheel?
3.13        Windows     x86
3.13        Windows     AMD64
3.13        macOS       x86_64
3.13        macOS       arm64
3.13        manylinux   x86_64
3.13        manylinux   i686
3.13        musllinux   x86_64
3.12        Windows     x86
3.12        Windows     AMD64
3.12        macOS       x86_64
3.12        macOS       arm64
3.12        manylinux   x86_64
3.12        manylinux   i686
3.12        musllinux   x86_64
3.11        Windows     x86
3.11        Windows     AMD64
3.11        macOS       x86_64
3.11        macOS       arm64
3.11        manylinux   x86_64
3.11        manylinux   i686
3.11        musllinux   x86_64
3.10        Windows     x86
3.10        Windows     AMD64
3.10        macOS       x86_64
3.10        macOS       arm64
3.10        manylinux   x86_64
3.10        manylinux   i686
3.10        musllinux   x86_64
3.9         Windows     x86
3.9         Windows     AMD64
3.9         macOS       x86_64
3.9         macOS       arm64
3.9         manylinux   x86_64
3.9         manylinux   i686
3.9         musllinux   x86_64
3.8         Windows     x86
3.8         Windows     AMD64
3.8         macOS       x86_64
3.8         macOS       arm64
3.8         manylinux   x86_64
3.8         manylinux   i686
3.8         musllinux   x86_64
3.7         Windows     x86
3.7         Windows     AMD64
3.7         macOS       x86_64
3.7         macOS       arm64
3.7         manylinux   x86_64
3.7         manylinux   i686
3.7         musllinux   x86_64
PyPy 3.10   Windows     x86
PyPy 3.10   Windows     AMD64
PyPy 3.10   macOS       x86_64
PyPy 3.10   macOS       arm64
PyPy 3.10   manylinux   x86_64
PyPy 3.10   manylinux   i686
PyPy 3.9    Windows     x86
PyPy 3.9    Windows     AMD64
PyPy 3.9    macOS       x86_64
PyPy 3.9    macOS       arm64
PyPy 3.9    manylinux   x86_64
PyPy 3.9    manylinux   i686
PyPy 3.8    Windows     x86
PyPy 3.8    Windows     AMD64
PyPy 3.8    macOS       x86_64
PyPy 3.8    macOS       arm64
PyPy 3.8    manylinux   x86_64
PyPy 3.8    manylinux   i686
PyPy 3.7    Windows     x86
PyPy 3.7    Windows     AMD64
PyPy 3.7    macOS       x86_64
PyPy 3.7    macOS       arm64
PyPy 3.7    manylinux   x86_64
PyPy 3.7    manylinux   i686
