
pyuegc

An implementation of the Unicode algorithm for breaking strings of text (i.e., code point sequences) into extended grapheme clusters (“user-perceived characters”) as specified in UAX #29, “Unicode Text Segmentation”. This package supports version 14.0 of the Unicode Standard (released September 14, 2021). It has been successfully tested against the official Unicode grapheme break test file (GraphemeBreakTest.txt).

Installation

pip install pyuegc

UCD version

To get the version of the Unicode character database currently used:

>>> from pyuegc import UCD_VERSION
>>> UCD_VERSION
'14.0.0'
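
The UCD version matters when pyuegc is used alongside Python's own `unicodedata` module, which may be built against a different Unicode release. The sketch below compares the two at the major.minor level. The helper `major_minor` and the constant `PYUEGC_UCD_VERSION` are hypothetical names introduced for illustration; the constant simply hard-codes the value shown above so the snippet runs without pyuegc installed.

```python
import unicodedata


def major_minor(version):
    """Return (major, minor) from a dotted version string such as '14.0.0'."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


# Assumption: hard-coded copy of the value shown above.
# In real code, use: from pyuegc import UCD_VERSION
PYUEGC_UCD_VERSION = "14.0.0"

if major_minor(PYUEGC_UCD_VERSION) == major_minor(unicodedata.unidata_version):
    print("pyuegc and unicodedata use the same Unicode release")
else:
    print(
        f"pyuegc: {PYUEGC_UCD_VERSION}, "
        f"unicodedata: {unicodedata.unidata_version}"
    )
```

A mismatch is not necessarily a problem, but characters added in the newer release may be classified differently by the two libraries.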

Example usage

from pyuegc import EGC

for s in ["e\u0301le\u0300ve", "Z̷̳̎a̸̛ͅl̷̻̇g̵͉̉o̸̰͒", "기운찰만하다"]:
    egc = EGC(s)
    print(f"{len(s):>2}, {len(egc)}: {egc}")

#  7, 5: ['é', 'l', 'è', 'v', 'e']
# 20, 5: ['Z̷̳̎', 'a̸̛ͅ', 'l̷̻̇', 'g̵͉̉', 'o̸̰͒']
# 15, 6: ['기', '운', '찰', '만', '하', '다']


s = "ai\u0302ne\u0301e"  # aînée
print("".join(reversed(s)))
print("".join(reversed(EGC(s))))

# éen̂ia -> wrong (diacritics are messed up)
# eénîa -> right (regardless of the Unicode normalization form)
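
Normalizing to NFC before a naive reversal is sometimes suggested as a workaround. The following sketch, which uses only the standard library `unicodedata` module (not pyuegc), illustrates why that is not a general solution: it works for "aînée" because precomposed î and é exist, but fails for a sequence such as n + U+0302, which has no precomposed form.

```python
import unicodedata

# "aînée" written with combining marks: 7 code points.
s = "ai\u0302ne\u0301e"
nfc = unicodedata.normalize("NFC", s)

# NFC composes i + U+0302 into U+00EE and e + U+0301 into U+00E9,
# so a code point reversal happens to keep the accents in place.
naive_reversed = "".join(reversed(nfc))
print(len(s), len(nfc), naive_reversed)

# "n" + COMBINING CIRCUMFLEX ACCENT has no precomposed code point,
# so NFC leaves it as two code points and naive reversal detaches the mark.
t = "n\u0302"
t_nfc = unicodedata.normalize("NFC", t)
broken = "".join(reversed(t_nfc))
print(len(t_nfc), broken == "\u0302n")
```

Segmenting into extended grapheme clusters first, as in the example above, avoids depending on whether a precomposed form happens to exist.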

Licenses

The pyuegc library is released under an MIT license.

Usage of Unicode data files is governed by the Unicode Terms of Use, a copy of which is included as UNICODE-LICENSE.
