unidump

hexdump for your unicode data

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

A Unicode codepoint dump.

Think of it as hexdump(1) for Unicode. The command analyses the input and
then prints three columns: the raw byte count at which the first codepoint
of the row starts, the codepoints in hex notation, and finally the raw input
characters with control and whitespace characters replaced by a dot.
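
The three columns can be illustrated with a short Python sketch. This is not unidump's actual implementation: the helper names `printable` and `dump_rows` are hypothetical, the input is assumed to be valid UTF-8, and the rule "replace every character whose Unicode category starts with C or Z" is an assumption about what counts as control and whitespace.

```python
import unicodedata


def printable(ch: str) -> str:
    """Return ch unchanged, or '.' for control and whitespace characters."""
    # Assumption: general categories starting with 'C' (control, format,
    # unassigned, ...) or 'Z' (separators) are rendered as a dot.
    return "." if unicodedata.category(ch)[0] in "CZ" else ch


def dump_rows(data: bytes, per_row: int = 16):
    """Yield (byte_offset, hex_codepoints, rendered_text) for each row."""
    text = data.decode("utf-8")  # raises UnicodeDecodeError on invalid input
    offset = 0
    for start in range(0, len(text), per_row):
        row = text[start:start + per_row]
        hex_codepoints = [f"{ord(ch):04X}" for ch in row]
        rendered = "".join(printable(ch) for ch in row)
        yield offset, hex_codepoints, rendered
        # Advance by the number of raw bytes this row occupied.
        offset += len(row.encode("utf-8"))


for offset, codepoints, rendered in dump_rows(b"ABCDEFGHIJKLMNOP", per_row=4):
    print(offset, " ".join(codepoints), rendered)
```

For multi-byte characters the first column advances by bytes, not by characters, which is why it can jump by more than the number of codepoints in a row.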

Invalid byte sequences are represented with an “X” and with the hex value
enclosed in question marks, e.g., “?F5?”.
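
One way to reproduce this notation in Python is sketched below. It relies on the standard `surrogateescape` error handler, which maps each undecodable byte to a lone surrogate in the range U+DC80 to U+DCFF. The function name `render_cells` is hypothetical, and whether unidump itself works this way is not stated here.

```python
def render_cells(data: bytes):
    """Return a list of (hex_column, data_column) pairs for UTF-8 input."""
    # surrogateescape turns every invalid byte 0xNN into U+DC00 + 0xNN
    # instead of raising UnicodeDecodeError.
    text = data.decode("utf-8", errors="surrogateescape")
    cells = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Recover the original byte and mark it as invalid.
            cells.append((f"?{cp - 0xDC00:02X}?", "X"))
        else:
            cells.append((f"{cp:04X}", ch))
    return cells


print(render_cells(b"A\xf5B"))
```

The byte 0xF5 can never start a valid UTF-8 sequence, so it is a convenient test value.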

You can pipe in data from stdin, select several files at once, or even mix
all those input methods together.

Examples:

* Basic usage with stdin:

    echo -n 'ABCDEFGHIJKLMNOP' | unidump -n 4
    0 0041 0042 0043 0044 ABCD
    4 0045 0046 0047 0048 EFGH
    8 0049 004A 004B 004C IJKL
    12 004D 004E 004F 0050 MNOP

* Dump the code points translated from another encoding:

    unidump -c latin-1 some-legacy-file

* Dump many files at the same time:

    unidump foo-*.txt

* Control characters and whitespace are safely rendered:

    echo -n -e '\x01' | unidump -n 1
    0 0001 .

* Finally learn what your favorite Emoji is composed of:

    ( echo -n -e '\xf0\x9f\xa7\x9d\xf0\x9f\x8f\xbd\xe2' ; \
      echo -n -e '\x80\x8d\xe2\x99\x82\xef\xb8\x8f' ; ) | \
    unidump -n 5
    0 1F9DD 1F3FD 200D 2642 FE0F .🏽.♂️

See <http://emojipedia.org/man-elf-medium-skin-tone/> for images. The “elf”
emoji (the first character) is replaced with a dot here, because the current
version of Python’s unicodedata doesn’t know of this character yet.

* Use it like strings(1):

    unidump -e '{data}' some-file.bin

This will replace every unknown byte from the input file with “X” and every
control and whitespace character with “.”.

* Only print the code points of the input:

    unidump -e '{repr}'$'\n' -n 1 some-file.txt

This results in a stream of codepoints in hex notation, each on a new line,
without the byte counter or the rendering of the actual data. You can use
this to count the total number of characters (as opposed to raw bytes) in a
file by piping the output through `wc -l`.
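
The difference between counting characters and counting raw bytes, which the last example relies on, can be checked with a few lines of Python. This is only an illustration; `count_bytes_and_chars` is a hypothetical helper, and "characters" here means codepoints, so an emoji sequence built from five codepoints counts as five.

```python
def count_bytes_and_chars(data: bytes, encoding: str = "utf-8"):
    """Return (raw byte count, codepoint count) for the given input."""
    return len(data), len(data.decode(encoding))


sample = "h€llo".encode("utf-8")  # the euro sign takes three bytes in UTF-8
n_bytes, n_chars = count_bytes_and_chars(sample)
print(n_bytes, n_chars)
```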

This is version 1.1.2 of unidump, using Unicode 8.0.0 data.
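
The Unicode data version matters for the emoji example above: a character is only recognised if the Unicode database in use already contains it. The sketch below uses Python's standard `unicodedata` module to print the database version of the running interpreter and to name each codepoint of the sequence from that example. The `describe` helper and the `<unknown>` placeholder are assumptions made for illustration.

```python
import unicodedata

# The byte sequence from the emoji example, as one bytes literal.
ELF_SEQUENCE = (
    b"\xf0\x9f\xa7\x9d\xf0\x9f\x8f\xbd\xe2"
    b"\x80\x8d\xe2\x99\x82\xef\xb8\x8f"
)


def describe(data: bytes):
    """Return (hex codepoint, character name) pairs for UTF-8 input."""
    return [
        # Characters missing from the local database get a placeholder name.
        (f"{ord(ch):04X}", unicodedata.name(ch, "<unknown>"))
        for ch in data.decode("utf-8")
    ]


print("Unicode data version:", unicodedata.unidata_version)
for codepoint, name in describe(ELF_SEQUENCE):
    print(codepoint, name)
```

On an interpreter whose Unicode data predates U+1F9DD, the first line of the listing shows the placeholder instead of a name.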

Project details

Release history

* 1.1.2 (this version): Mar 19, 2017
* 1.1.1: Mar 18, 2017
* 1.1.0: Mar 17, 2017

Download files

Source Distribution

unidump-1.1.2.tar.gz (4.8 kB), uploaded Mar 19, 2017

Built Distribution

unidump-1.1.2-py3-none-any.whl (9.0 kB), uploaded Mar 19, 2017, Python 3

Hashes for unidump-1.1.2.tar.gz
Algorithm	Hash digest
SHA256	`0a1fdc21b2a192575df027ffa4831486e91cba17fa8f8a33db2dde9667daa1e4`
MD5	`2154603637ddb72ef1b6c4e5d2f143b2`
BLAKE2b-256	`92f52bee49dcfa63a0bffb1733e76134b1a1f6de14bf238608c8645d5310a9a6`

Hashes for unidump-1.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`120e0bc9e33ff1ba316c98fcf76244915bac7eb17ce91c33a8d295c49a151b3c`
MD5	`107ce9300e9cfa6858cff9578e161270`
BLAKE2b-256	`51ff250738722804b02620ad8d952bc01f00639d30e95f7f3aaa186ca0a39f38`