
An open-source Chinese NLP Dataset Reader library, built on allennlp & pytorch.


chreader

A toolkit for Chinese natural language processing datasets

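Features

  • Easy to use
    • Automatic download and caching: a single line is enough to obtain a given dataset
    • Lists the available datasets and their detailed descriptions from the command line
    • Integrates seamlessly with common NLP frameworks such as allennlp, catalyst, pytorch_lightning, and FARM
  • Rich: supports multiple dataset types, including classification, generation, and tagging; 2 in total
  • Flexible
    • Custom datasets can be added freely by subclassing ChDatasetReader
    • Through allennlp, components such as tokenizer, token_indexer, and vocab can be used and configured in depth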
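
The extension point described above (subclassing ChDatasetReader) can be pictured roughly as follows. The real ChDatasetReader interface is not shown in this README, so this sketch uses a hypothetical stand-in base class, BaseReaderSketch, with an assumed read method. The class names, the method name, and the file format (one `label<TAB>text` pair per line) are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterator, List


class BaseReaderSketch:
    """Hypothetical stand-in for ChDatasetReader; the real base class may differ."""

    def read(self, file_path: str) -> Iterator[Dict[str, str]]:
        raise NotImplementedError


class MyTsvClassificationReader(BaseReaderSketch):
    """Reads a custom classification dataset stored as `label<TAB>text` lines."""

    def read(self, file_path: str) -> Iterator[Dict[str, str]]:
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    continue  # skip blank lines
                label, text = line.split("\t", 1)
                yield {"label": label, "text": text}


def demo() -> List[Dict[str, str]]:
    """Write a two-line sample file and read it back with the custom reader."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.tsv"
        path.write_text("sports\t球队赢得了比赛\nfinance\t股市今日上涨\n", encoding="utf-8")
        # Materialize the generator before the temporary directory is removed
        return list(MyTsvClassificationReader().read(str(path)))


if __name__ == "__main__":
    for instance in demo():
        print(instance)
```

In the actual library, the subclass would inherit from ChDatasetReader instead of the stand-in, and the method to override should be taken from that class's definition.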

Installation

  • git
    git clone https://github.com/wangyuxinwhy/chreader.git
    cd chreader
    pip install -e .
    
  • pip
    pip install -U chreader
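
After either installation route, a quick way to confirm that the package is visible to the current interpreter is to query its installed version with the standard library. The helper name installed_version is hypothetical and not part of chreader; it relies only on importlib.metadata (Python 3.8+).

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("chreader")
    if version is None:
        print("chreader is not installed in this environment")
    else:
        print(f"chreader version: {version}")
```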
    

Usage

Building a Dataset & DataLoader

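from chreader import load_dataset, DataLoader

# Load the train and dev splits of the tnews dataset (downloaded and cached automatically)
train_dataset = load_dataset("tnews", "train")
dev_dataset = load_dataset("tnews", "dev")

# Wrap each split in a DataLoader that yields batches of 32 instances
train_dataloader = DataLoader(train_dataset, batch_size=32)
dev_dataloader = DataLoader(dev_dataset, batch_size=32)

for data in train_dataloader:
    ...  # training step goes here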
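
To make the role of batch_size concrete, here is a minimal pure-Python sketch of what a data loader does conceptually: it groups consecutive instances into lists of at most batch_size items. This is not chreader's DataLoader, which may also handle shuffling, padding, and tensor conversion; the function batched_sketch is a hypothetical name used only for illustration.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched_sketch(instances: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to `batch_size` instances."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for instance in instances:
        batch.append(instance)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


if __name__ == "__main__":
    # 70 instances with batch_size=32 give batches of 32, 32, and 6 items
    print([len(b) for b in batched_sketch(range(70), 32)])
```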

Command line

# List all available datasets
chreader list


# Show detailed information about a dataset
chreader show tnews


TODO

  • Add more datasets
  • Add a dataset_type field; currently only classification exists
    • classification
      • sentiment
    • generation
      • summarization
    • tagging
      • ner
      • dependency_parsing
  • Support external configuration
  • Prettify the command-line output
  • Record a GIF
  • Add docs
  • Add a tutorial
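
The planned dataset_type hierarchy in the list above could be represented as a simple nested mapping. This is only a sketch of one possible shape, since the field does not exist yet; the names DATASET_TYPES and parent_type are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical taxonomy mirroring the TODO list: top-level type -> subtypes
DATASET_TYPES: Dict[str, List[str]] = {
    "classification": ["sentiment"],
    "generation": ["summarization"],
    "tagging": ["ner", "dependency_parsing"],
}


def parent_type(name: str) -> Optional[str]:
    """Return the top-level dataset type for a type or subtype name, or None if unknown."""
    for top_level, subtypes in DATASET_TYPES.items():
        if name == top_level or name in subtypes:
            return top_level
    return None


if __name__ == "__main__":
    print(parent_type("ner"))
```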


Download files


Source Distribution

chreader-0.2.1.tar.gz (10.9 kB)

Built Distribution

chreader-0.2.1-py3-none-any.whl (12.2 kB, Python 3)
