skip to navigation
skip to content

unihan-tabular 0.8.1

Export UNIHAN to Python, Data Package, CSV, JSON and YAML

unihan-tabular - tool to build UNIHAN into tabular-friendly formats like python, JSON, CSV and YAML. Part of the cihai project.

UNIHAN’s data is dispersed across multiple files in the format of:

U+3400      kCantonese      jau1
U+3400      kDefinition     (same as U+4E18 丘) hillock or mound
U+3400      kMandarin       qiū
U+3401      kCantonese      tim2
U+3401      kDefinition     to lick; to taste, a mat, bamboo bark
U+3401      kHanyuPinyin    10019.020:tiàn
U+3401      kMandarin       tiàn

$ unihan-tabular will download Unihan.zip and build all files into a single tabular friendly format.

CSV (default), $ unihan-tabular:

char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn

JSON, $ unihan-tabular -F json:

[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kCantonese": "jau1",
    "kDefinition": "(same as U+4E18 丘) hillock or mound",
    "kHanyuPinyin": null,
    "kMandarin": "qiū"
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kCantonese": "tim2",
    "kDefinition": "to lick; to taste, a mat, bamboo bark",
    "kHanyuPinyin": "10019.020:tiàn",
    "kMandarin": "tiàn"
  }
]

YAML $ unihan-tabular -F yaml:

- char: 
  kCantonese: jau1
  kDefinition: (same as U+4E18 丘) hillock or mound
  kHanyuPinyin: null
  kMandarin: qiū
  ucn: U+3400
- char: 
  kCantonese: tim2
  kDefinition: to lick; to taste, a mat, bamboo bark
  kHanyuPinyin: 10019.020:tiàn
  kMandarin: tiàn
  ucn: U+3401

Features

  • automatically downloads UNIHAN from the internet
  • export to JSON, CSV and YAML (requires pyyaml) via -F
  • configurable to export specific fields via -f
  • accounts for encoding conflicts due to the Unicode-heavy content
  • designed as a technical proof for future CJK (Chinese, Japanese, Korean) datasets
  • core component and dependency of cihai, a CJK library
  • data package support
  • supports python 2.7, >= 3.5 and pypy

If you encounter a problem or have a question, please create an issue.

Usage

unihan-tabular supports command line arguments. See unihan-tabular CLI arguments for information on how you can specify custom columns, files, download URL’s and output destinations.

To download and build your own UNIHAN export:

$ pip install unihan-tabular

To output CSV, the default format:

$ unihan-tabular

To output JSON:

$ unihan-tabular -F json

To output YAML:

$ pip install pyyaml
$ unihan-tabular -F yaml

To only output the kDefinition field in a csv:

$ unihan-tabular -f kDefinition

To output multiple fields, separate with spaces:

$ unihan-tabular -f kCantonese kDefinition

To output to a custom file:

$ unihan-tabular --destination ./exported.csv

To output to a custom file (templated file extension):

$ unihan-tabular --destination ./exported.{ext}

See unihan-tabular CLI arguments for advanced usage examples.

Structure

# output w/ JSON
{XDG data dir}/unihan_tabular/unihan.json

# output w/ CSV
{XDG data dir}/unihan_tabular/unihan.csv

# output w/ yaml (requires pyyaml)
{XDG data dir}/unihan_tabular/unihan.yaml

# script to download + build a SDF csv of unihan.
unihan_tabular/process.py

# unit tests to verify behavior / consistency of builder
tests/*

# python 2/3 compatibility module
unihan_tabular/_compat.py

# utility / helper functions
unihan_tabular/util.py
 
File Type Py Version Uploaded on Size
unihan-tabular-0.8.1.tar.gz (md5) Source 2017-05-21 15KB