Python binding for nlpO3 Thai language processing library in Rust
nlpO3 Python binding
Python binding for nlpO3, a Thai natural language processing library in Rust.
Features
- Thai word tokenizer
segment()
- uses a maximal-matching, dictionary-based tokenization algorithm and honors Thai Character Cluster boundaries
- 2.5x faster than a similar pure Python implementation (PyThaiNLP's newmm)
load_dict()
- loads a dictionary from a plain text file (one word per line)
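To give a feel for dictionary-based segmentation, here is a minimal pure Python sketch that splits a text into the fewest tokens using dynamic programming. It is a simplified illustration only, not nlpO3's actual implementation: it ignores Thai Character Cluster boundaries, and the function name `segment_fewest_words` and the cost values are assumptions made for this example.

```python
def segment_fewest_words(text, dictionary):
    """Split text into the fewest tokens, preferring dictionary words.

    Toy illustration only: characters not covered by any dictionary
    word are emitted as single-character tokens at a higher cost.
    """
    n = len(text)
    max_len = max((len(w) for w in dictionary), default=1)
    inf = float("inf")
    # best[i] = (cost, start index of the last token) for text[:i]
    best = [(inf, -1)] * (n + 1)
    best[0] = (0, -1)
    for i in range(n):
        cost_i = best[i][0]
        # Fallback: a single unknown character costs 2
        candidates = [(i + 1, cost_i + 2)]
        # Every dictionary word starting at position i costs 1
        for j in range(i + 1, min(n, i + max_len) + 1):
            if text[i:j] in dictionary:
                candidates.append((j, cost_i + 1))
        for j, cost in candidates:
            if cost < best[j][0]:
                best[j] = (cost, i)
    # Walk back from the end to recover the tokens
    tokens = []
    i = n
    while i > 0:
        start = best[i][1]
        tokens.append(text[start:i])
        i = start
    return tokens[::-1]


words = {"สวัสดี", "ครับ"}
print(segment_fewest_words("สวัสดีครับ", words))  # ['สวัสดี', 'ครับ']
```

Unlike a greedy longest-match scan, this approach can pick a shorter first word when that leads to fewer tokens overall.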
Dictionary file
- To keep the library small, nlpO3 does not come with a dictionary and does not assume which one the developer would like to use. The dictionary-based word tokenizer needs one, so you must supply it yourself.
- For a tokenization dictionary, try:
- words_th.txt from PyThaiNLP - around 62,000 words (CC0)
- word break dictionary from libthai - consists of dictionaries in different categories, with a make script (LGPL-2.1)
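If you want to start from your own word list, the following sketch writes a dictionary in the one-word-per-line format described above. The helper name `write_dict_file` is hypothetical, and UTF-8 encoding is an assumption here; check what your dictionary source and nlpO3 expect.

```python
from pathlib import Path


def write_dict_file(words, path):
    """Write unique, non-empty words to `path`, one word per line.

    Returns the number of words written. UTF-8 is assumed.
    """
    unique = sorted({w.strip() for w in words if w.strip()})
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return len(unique)


if __name__ == "__main__":
    count = write_dict_file(["สวัสดี", "ครับ", "ครับ"], "my_dict.txt")
    print(count)  # 2
```

The resulting file path can then be passed to load_dict() as described under Usage.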
Install
pip install nlpo3
Usage
Load the file path/to/dict.file into memory and assign the name dict_name to it. Then tokenize a text with the dict_name dictionary:
from nlpo3 import load_dict, segment
load_dict("path/to/dict.file", "dict_name")
segment("สวัสดีครับ", "dict_name")
It returns a list of strings:
['สวัสดี', 'ครับ']
(result depends on words included in the dictionary)
Use multithread mode with the same dict_name dictionary:
segment("สวัสดีครับ", dict_name="dict_name", parallel=True)
Use safe mode to avoid long running times in edge cases where the text has many ambiguous word boundaries:
segment("สวัสดีครับ", dict_name="dict_name", safe=True)
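As a rough sketch of how these options might be combined, the helper below tokenizes several lines and switches on safe mode only for long inputs. The name `tokenize_lines` and the `safe_over` threshold of 1000 characters are hypothetical choices for this example, not part of nlpO3. The tokenizer is passed in as `segment_fn` so that the sketch runs without nlpO3 installed; in real use you would pass `nlpo3.segment` after calling load_dict().

```python
def tokenize_lines(lines, dict_name, segment_fn, safe_over=1000):
    """Tokenize each non-empty line with `segment_fn`.

    `segment_fn` is expected to accept (text, dict_name=..., safe=...),
    matching the segment() calls shown above. Safe mode is enabled only
    for lines longer than `safe_over` characters (an arbitrary cutoff).
    """
    results = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        use_safe = len(line) > safe_over
        results.append(segment_fn(line, dict_name=dict_name, safe=use_safe))
    return results


# Example with nlpO3 (assumes the package is installed and the file exists):
# from nlpo3 import load_dict, segment
# load_dict("path/to/dict.file", "dict_name")
# tokens = tokenize_lines(["สวัสดีครับ"], "dict_name", segment)
```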
Build
Requirements
- Rust 2018 Edition
- Python 3.6 or newer
- Python Development Headers
- Ubuntu: sudo apt-get install python3-dev
- macOS: No action needed
- PyO3 - already included in Cargo.toml
- setuptools-rust
Steps
python -m pip install --upgrade build
python -m build
This should generate a wheel file in the dist/ directory, which can be installed with pip.
Issues
Please report issues at https://github.com/PyThaiNLP/nlpo3/issues