Python binding for nlpO3 Thai language processing library in Rust
nlpO3 Python binding
Python binding for nlpO3, a Thai natural language processing library in Rust.
Features
- Thai word tokenizer
  - segment(): uses a maximal-matching, dictionary-based tokenization algorithm and honors Thai Character Cluster boundaries
    - 2.5x faster than a similar pure-Python implementation (PyThaiNLP's newmm)
  - load_dict(): loads a dictionary from a plain text file (one word per line)
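To give a feel for dictionary-based tokenization, here is a minimal sketch of greedy longest-match segmentation in pure Python. It is a simplification for illustration only, not nlpO3's implementation: it ignores Thai Character Cluster boundaries and does no ambiguity resolution. The function name longest_match_segment and the toy dictionary are hypothetical.

```python
def longest_match_segment(text, words):
    """Greedy longest-match segmentation over a set of dictionary words.

    Illustrative only: nlpO3's tokenizer also honors Thai Character
    Cluster boundaries, which this sketch does not attempt.
    """
    max_len = max((len(w) for w in words), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary word matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in words:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit a single character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


toy_dict = {"สวัสดี", "ครับ"}  # hypothetical two-word dictionary
print(longest_match_segment("สวัสดีครับ", toy_dict))
```

With this toy dictionary the sketch splits the text into the two dictionary words. Text that matches no dictionary entry falls back to single characters, which is one reason the choice of dictionary matters.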
Dictionary file
- To keep the library small, nlpO3 does not bundle a dictionary and makes no assumption about which one the developer would like to use. The dictionary-based word tokenizer requires a dictionary, so you must supply one.
- For a tokenization dictionary, try:
  - words_th.txt from PyThaiNLP - around 62,000 words (CC0)
  - word break dictionary from libthai - consists of dictionaries in different categories, with a make script (LGPL-2.1)
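If you would rather start from your own word list, the one-word-per-line format is easy to produce. The sketch below writes such a file with the standard library. The helper name write_dict_file and the file name my_dict.txt are hypothetical, and UTF-8 encoding is an assumption, since the expected encoding is not stated here.

```python
import tempfile
from pathlib import Path


def write_dict_file(words, path):
    """Write unique, non-empty words to `path`, one word per line.

    UTF-8 is assumed here; the README does not state the expected encoding.
    """
    cleaned = sorted({w.strip() for w in words if w.strip()})
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)


# Write a tiny dictionary into a temporary directory.
dict_path = Path(tempfile.mkdtemp()) / "my_dict.txt"
count = write_dict_file(["สวัสดี", "ครับ", "ครับ", "  "], dict_path)
print(count, dict_path)
```

The resulting path can then be passed to load_dict() as described under Usage.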
Install
pip install nlpo3
Usage
Load the file path/to/dict.file into memory and assign the name dict_name to it. Then tokenize a text with the dict_name dictionary:
from nlpo3 import load_dict, segment
load_dict("path/to/dict.file", "dict_name")
segment("สวัสดีครับ", "dict_name")
It returns a list of strings:
['สวัสดี', 'ครับ']
(result depends on words included in the dictionary)
Use multithread mode, again with the dict_name dictionary:
segment("สวัสดีครับ", dict_name="dict_name", parallel=True)
Use safe mode to avoid long waits in edge cases where the text has many ambiguous word boundaries:
segment("สวัสดีครับ", dict_name="dict_name", safe=True)
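When tokenizing many lines, it can be convenient to set the mode flags in one place. The sketch below forwards the dict_name, safe, and parallel keywords shown above to whatever segment function it is given. The helper tokenize_lines and the stand-in fake_segment are hypothetical; passing safe and parallel together in one call is an assumption, and in real use you would pass nlpo3.segment after calling load_dict().

```python
def tokenize_lines(lines, segment_fn, dict_name, safe=False, parallel=False):
    """Tokenize each non-empty line with `segment_fn`, forwarding the mode flags."""
    results = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines rather than tokenizing empty strings
        results.append(
            segment_fn(line, dict_name=dict_name, safe=safe, parallel=parallel)
        )
    return results


# Stand-in for nlpo3.segment so the sketch runs without a loaded dictionary.
# It only splits on whitespace and ignores the mode flags.
def fake_segment(text, dict_name, safe=False, parallel=False):
    return text.split()


print(tokenize_lines(["a b", "", "c"], fake_segment, "dict_name", safe=True))
```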
Build
Requirements
- Rust 2018 Edition
- Python 3.6 or newer
- Python Development Headers
  - Ubuntu: sudo apt-get install python3-dev
  - macOS: No action needed
- PyO3 - already included in Cargo.toml
- setuptools-rust
Steps
python -m pip install --upgrade build
python -m build
This should generate a wheel file in the dist/ directory, which can be installed with pip.
Issues
Please report issues at https://github.com/PyThaiNLP/nlpo3/issues