
Package Placeholder

Reason this release was yanked:

testing

Project description

Dataset API

Dataset API structure

dataset_api
├── conda
│   └── recipes
│       ├── py38_recipe
│       └── py39_recipe
├── src
│   └── dataset_librarian
│       ├── dataset_api
│       ├── scripts
│       ├── __init__.py
│       ├── dataset.py
│       └── datasets_urls.json
├── MANIFEST.in
├── README.md
├── pyproject.toml
└── requirements.txt

Environment setup

Clone the Model Zoo for Intel® Architecture repository and navigate to the dataset_api directory.

# Step 1 (recommended): Create and activate a virtual environment
## Option 1: Using virtualenv
virtualenv -p python3 venv
. venv/bin/activate
## Option 2: Using conda
conda create -n venv python=<3.8 or 3.9> -c conda-forge
conda activate venv

# Step 2: Install the package
## Option 1: Install from source
cd models/datasets/dataset_api
python -m pip install --upgrade pip build setuptools wheel
python -m pip install .
## Option 2: Install from PyPI
python -m pip install dataset-librarian

The package is published on PyPI under the name dataset-librarian.
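After either install option, it can help to confirm that the package is visible to the active environment. The following is a minimal sketch using only the standard library; it assumes the distribution name `dataset-librarian` from the pip command above, and `installed_version` is a hypothetical helper name, not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("dataset-librarian")
    print(found if found else "dataset-librarian is not installed in this environment")
```

If the helper returns None, the most likely cause is that the virtual environment from Step 1 is not activated.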

Datasets

| Dataset name | Description | Download | Preprocessing | Command |
| --- | --- | --- | --- | --- |
| brca | Breast Cancer dataset that contains categorized contrast enhanced mammography data and radiologists’ notes. | supported | A prerequisite: Use a browser, download the Low Energy and Subtracted images, then provide the path to the directory that contains the downloaded images using --directory argument. | python -m dataset_librarian.dataset -n brca --download --preprocess -d <path to the dataset directory> |
| tabformer | Credit card data for TabFormer | supported | not supported | python -m dataset_librarian.dataset -n tabformer --download |
| dureader-vis | DuReader-vis for document automation. Chinese Open-domain Document Visual Question Answering (Open-Domain DocVQA) dataset, containing about 15K question-answering pairs and 158K document images from the Baidu search engine. | supported | not supported | python -m dataset_librarian.dataset -n dureader-vis --download |
| msmarco | MS MARCO is a collection of datasets focused on deep learning in search | supported | not supported | python -m dataset_librarian.dataset -n msmarco --download |
| mvtec-ad | MVTEC Anomaly Detection DATASET for industrial inspection. It contains over 5000 high-resolution images divided into fifteen different object and texture categories. | supported | supported | python -m dataset_librarian.dataset -n mvtec-ad --download --preprocess -d <path to the dataset directory> |
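The commands in the table follow one pattern: every dataset takes `--download`, and only some commands add `--preprocess` and `-d`. The sketch below assembles the argument list for a given dataset name so it can be passed to `subprocess.run`. The helper `build_command` and the two sets are hypothetical names introduced for illustration; their contents are copied from the table, not read from the package.

```python
import sys
from typing import List, Optional

# Dataset names listed in the table above.
ALL_DATASETS = {"brca", "tabformer", "dureader-vis", "msmarco", "mvtec-ad"}
# Datasets whose documented command includes --preprocess.
PREPROCESS_DATASETS = {"brca", "mvtec-ad"}


def build_command(name: str, directory: Optional[str] = None) -> List[str]:
    """Build the documented CLI invocation for one dataset as an argv list."""
    if name not in ALL_DATASETS:
        raise ValueError(f"unknown dataset: {name!r}")
    cmd = [sys.executable, "-m", "dataset_librarian.dataset", "-n", name, "--download"]
    if name in PREPROCESS_DATASETS:
        cmd.append("--preprocess")
    if directory is not None:
        cmd += ["-d", directory]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so nothing is downloaded here.
    print(" ".join(build_command("tabformer")))
```

To actually run the command, pass the list to `subprocess.run(cmd, check=True)`. For brca, the directory should already contain the images downloaded through the browser.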

Command-line Interface

| Input Arguments | Description |
| --- | --- |
| --list (-l) | list the supported datasets. |
| --name (-n) | dataset name |
| --directory (-d) | directory location where the raw dataset will be saved on your system. It's also where the preprocessed dataset files will be written. If not set, a directory with the dataset name will be created. |
| --download | download the dataset specified. |
| --preprocess | preprocess the dataset if supported. |
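To make the flag semantics concrete, here is a small `argparse` sketch that mirrors the documented arguments. It is an illustration of the interface described above, not the package's actual parser; `make_parser` and `resolve_directory` are hypothetical names, and the source does not state where the default directory is created, so the sketch only returns its name.

```python
import argparse
from typing import Optional


def make_parser() -> argparse.ArgumentParser:
    """Build a parser with the same flags as the documented CLI."""
    parser = argparse.ArgumentParser(prog="dataset_librarian.dataset")
    parser.add_argument("-l", "--list", action="store_true",
                        help="list the supported datasets")
    parser.add_argument("-n", "--name", help="dataset name")
    parser.add_argument("-d", "--directory", default=None,
                        help="where raw and preprocessed files are written")
    parser.add_argument("--download", action="store_true",
                        help="download the dataset specified")
    parser.add_argument("--preprocess", action="store_true",
                        help="preprocess the dataset if supported")
    return parser


def resolve_directory(name: str, directory: Optional[str]) -> str:
    """Apply the documented default: fall back to a directory named after the dataset."""
    return directory if directory else name


if __name__ == "__main__":
    args = make_parser().parse_args(["-n", "mvtec-ad", "--download", "--preprocess"])
    print(resolve_directory(args.name, args.directory))
```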

Python API

from dataset_librarian.dataset_api.download import download_dataset
from dataset_librarian.dataset_api.preprocess import preprocess_dataset

# Replace with the path to the raw dataset directory
dataset_dir = "<path to the raw dataset directory>"

# Download the dataset
download_dataset("brca", dataset_dir)

# Preprocess the dataset
preprocess_dataset("brca", dataset_dir)
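The two calls can be wrapped so that the target directory exists before anything is written to it. The sketch below is a hypothetical helper, not part of the package: it takes the download and preprocess functions as parameters and assumes only that each accepts a dataset name and a directory path, as in the example above.

```python
from pathlib import Path
from typing import Callable


def download_and_preprocess(
    name: str,
    directory: str,
    download: Callable[[str, str], object],
    preprocess: Callable[[str, str], object],
) -> Path:
    """Create the directory if needed, then download and preprocess into it."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    download(name, str(target))
    preprocess(name, str(target))
    return target


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without the package installed.
    def report(step: str) -> Callable[[str, str], None]:
        return lambda n, d: print(f"{step}: {n} -> {d}")

    download_and_preprocess("mvtec-ad", "mvtec-ad", report("download"), report("preprocess"))
```

With the package installed, pass `download_dataset` and `preprocess_dataset` as the last two arguments, and use the wrapper only for datasets whose preprocessing is supported.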

Download files

Source distribution: dataset_librarian-0.0.0.dev1.tar.gz (3.2 kB)

Built distribution: dataset_librarian-0.0.0.dev1-py3-none-any.whl (3.5 kB, Python 3)
