Japanese text normalizer for mecab-neologd
neologdn
neologdn is a Japanese text normalizer for mecab-neologd.
The normalization is based on neologd's rules: https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja
Contributions are welcome!
NOTE: Installing this module requires a C++11 compiler.
Installation
$ pip install neologdn
Usage
import neologdn
neologdn.normalize("ﾊﾝｶｸｶﾅ")
# => 'ハンカクカナ'
neologdn.normalize("全角記号！？＠＃")
# => '全角記号!?@#'
neologdn.normalize("全角記号例外「・」")
# => '全角記号例外「・」'
neologdn.normalize("長音短縮ウェーーーーイ")
# => '長音短縮ウェーイ'
neologdn.normalize("チルダ削除ウェ~∼∾〜〰~イ")
# => 'チルダ削除ウェイ'
neologdn.normalize("いろんなハイフン˗֊‐‑‒–⁃⁻₋−")
# => 'いろんなハイフン-'
neologdn.normalize(" PRML 副 読 本 ")
# => 'PRML副読本'
neologdn.normalize(" Natural Language Processing ")
# => 'Natural Language Processing'
neologdn.normalize("かわいいいいいいいいい", repeat=6)
# => 'かわいいいいいい'
neologdn.normalize("無駄無駄無駄無駄ァ", repeat=1)
# => '無駄ァ'
neologdn.normalize("1995〜2001年", tilde="normalize")
# => '1995~2001年'
neologdn.normalize("1995~2001年", tilde="normalize_zenkaku")
# => '1995〜2001年'
neologdn.normalize("1995〜2001年", tilde="ignore") # Don't convert tilde
# => '1995〜2001年'
neologdn.normalize("1995〜2001年", tilde="remove")
# => '19952001年'
neologdn.normalize("1995〜2001年") # Default parameter
# => '19952001年'
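The `repeat` and `tilde` options in the examples above can be approximated in pure Python, which may help when reasoning about their behavior without building the extension. This is a minimal sketch, not neologdn's actual implementation: the names `shorten_repeat_sketch`, `handle_tilde_sketch`, and `TILDES` are hypothetical, the tilde set is taken from the characters in the examples, and edge cases may differ from the library.

```python
import re

# Tilde-like characters collected from the usage examples (assumption:
# the library's real set may be larger or smaller).
TILDES = "~∼∾〜〰～"


def shorten_repeat_sketch(text: str, repeat: int) -> str:
    """Cap any contiguous repeated substring at `repeat` occurrences."""
    if repeat < 1:
        return text  # assumption: non-positive values mean "no limit"
    # (.+?) finds the shortest repeating unit; \1{repeat,} only matches
    # runs with more than `repeat` occurrences in total.
    pattern = re.compile(r"(.+?)\1{%d,}" % repeat)
    return pattern.sub(lambda m: m.group(1) * repeat, text)


def handle_tilde_sketch(text: str, tilde: str = "remove") -> str:
    """Mimic the four tilde modes shown in the examples."""
    if tilde == "ignore":
        return text
    # The zenkaku replacement follows the character used in the example output.
    replacement = {"remove": "", "normalize": "~", "normalize_zenkaku": "〜"}[tilde]
    return text.translate({ord(ch): replacement for ch in TILDES})


print(shorten_repeat_sketch("かわ" + "い" * 9, repeat=6))  # かわいいいいいい
print(shorten_repeat_sketch("無駄無駄無駄無駄ァ", repeat=1))  # 無駄ァ
print(handle_tilde_sketch("1995〜2001年", tilde="normalize"))  # 1995~2001年
print(handle_tilde_sketch("1995〜2001年"))  # 19952001年
```

The regular expression is chosen for readability rather than speed; it backtracks heavily on long inputs, which is one reason a compiled implementation is attractive for this kind of normalization.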
Benchmark
# Sample code from
# https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#python-written-by-hideaki-t--overlast
import normalize_neologd
%timeit normalize(normalize_neologd.normalize_neologd)
# => 9.55 s ± 29.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
import neologdn
%timeit normalize(neologdn.normalize)
# => 6.66 s ± 35.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
neologdn is about 1.43 times faster than the sample code.
Details are described in the following notebook: https://github.com/ikegami-yukino/neologdn/blob/master/benchmark/benchmark.ipynb
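The `%timeit` lines in this section are IPython magics, and the `normalize` helper they call is presumably defined in the linked notebook. Outside a notebook, a comparable measurement can be sketched with the standard `timeit` module. In this sketch, `CORPUS`, `run_over_corpus`, and the stand-in normalizer `nfkc` are hypothetical placeholders rather than the notebook's code; pass `neologdn.normalize` instead of `nfkc` to measure the real library. The last lines only reproduce the arithmetic behind the reported speedup.

```python
import timeit
import unicodedata

# Hypothetical stand-in corpus; the notebook's actual data is not reproduced here.
CORPUS = ["ＰＲＭＬ　副読本", "ﾊﾝｶｸｶﾅ", "ウェーーーーイ"] * 1000


def nfkc(text: str) -> str:
    """Stand-in normalizer (plain NFKC) so the sketch runs without neologdn."""
    return unicodedata.normalize("NFKC", text)


def run_over_corpus(normalizer) -> None:
    """Apply `normalizer` to every line of the corpus."""
    for line in CORPUS:
        normalizer(line)


# Minimum of three single passes; %timeit instead reports the mean and
# standard deviation over its runs.
elapsed = min(timeit.repeat(lambda: run_over_corpus(nfkc), number=1, repeat=3))
print(f"{elapsed:.4f} s per loop")

# Arithmetic behind the reported speedup: 9.55 s / 6.66 s.
speedup = 9.55 / 6.66
print(f"x{speedup:.2f}")  # x1.43
```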
License
Apache Software License.
Contribution
Contributions are welcome! See: https://github.com/ikegami-yukino/neologdn/blob/master/.github/CONTRIBUTING.md
CHANGES
0.5.2 (2023-08-03)
Support Python 3.10 and 3.11 (Many thanks @polm)
0.5.1 (2021-05-02)
Improve performance of shorten_repeat function (Many thanks @yskn67)
Add tilde option to normalize function
0.4 (2018-12-06)
Add shorten_repeat function, which shortens contiguous repeated substrings. For example: neologdn.normalize("無駄無駄無駄無駄ァ", repeat=1) -> 無駄ァ
0.3.2 (2018-05-17)
Add option to suppress removal of spaces between Japanese characters
0.2.2 (2018-03-10)
Fix bug (daku-ten & handaku-ten)
Support macOS 10.13 (Many thanks @r9y9)
0.2.1 (2017-01-23)
Fix bug (Check whether the character preceding a daku-ten character is in maps) (Many thanks @unnonouno)
0.2 (2016-04-12)
Add lengthened expression (repeating character) threshold
0.1.2 (2016-03-29)
Fix installation bug
0.1.1.1 (2016-03-19)
Support Windows
Explicitly specify -std=c++11 in build (Many thanks @id774)
0.1.1 (2015-10-10)
Initial release.