
datadings is a collection of tools to prepare datasets for machine learning. It's easy to use, space-efficient, and blazingly fast.


datadings is a collection of tools to prepare datasets for machine learning, based on two simple principles:

  • Datasets are collections of individual data samples.

  • Each sample is a dictionary with descriptive keys.

For supervised training with images, samples are dictionaries like this:

{"key": unique_key, "image": imagedata, "label": label}
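As a rough illustration of the two principles, the sketch below builds such dictionaries in plain Python. The helper `make_sample`, the example keys, and the placeholder byte strings are hypothetical and not part of datadings. Storing the image as encoded bytes is an assumption.

```python
def make_sample(key, image_bytes, label):
    """Build one sample dictionary (hypothetical helper, not a datadings API)."""
    # Assumption: image data is kept as encoded bytes, e.g. the contents of a JPEG file.
    if not isinstance(image_bytes, bytes):
        raise TypeError("image data is expected as bytes")
    return {"key": key, "image": image_bytes, "label": label}


# A dataset is then a collection of such samples, each with a unique key.
samples = [
    make_sample("cat_001.jpg", b"<placeholder jpeg bytes>", 0),
    make_sample("dog_001.jpg", b"<placeholder jpeg bytes>", 1),
]
keys = [s["key"] for s in samples]
assert len(keys) == len(set(keys)), "keys must be unique within a dataset"
```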

msgpack is used as an efficient storage format for most supported datasets.

Check out the documentation for more details.

Supported datasets

Dataset         Short Description
ADE20k          Scene Parsing, Segmentation
ANP460          own Eye-Tracking dataset (Jalpa)
CAMVID          Motion-based Segmentation
CAT2000         MIT Saliency
CIFAR           32x32 color image classification with 10/100 classes
Cityscapes      Segmentation, Semantic understanding of urban street scenes
Coutrot1        Eye-Tracking, Saliency
FIGRIMFixation  Eye-Tracking, Saliency
ILSVRC2012      ImageNet Large Scale Visual Recognition Challenge
ImageNet21k     A superset of ILSVRC2012 with 11 M images for 10450 classes
InriaBuildings  Inria Aerial Image Labeling Dataset (Buildings), Segmentation, Remote Sensing
MIT1003         Eye-Tracking, Saliency, Learning to predict where humans look
MIT300          Eye-Tracking, Saliency
Places2017      MIT Places, Scene Recognition
Places365       MIT Places365, Scene Recognition
RIT18           High-Res Multispectral Semantic Segmentation, Remote Sensing
SALICON2015     Saliency in Context, Eye-Tracking
SALICON2017     Saliency in Context, Eye-Tracking
VOC2012         Pascal Visual Object Classes Challenge
Vaihingen       Remote Sensing, Semantic Object Classification, Segmentation
YFCC100m        Yahoo Flickr Creative Commons 100 M pics

Command line tools

  • datadings-write creates new dataset files.

  • datadings-cat prints the (abbreviated) contents of a dataset file.

  • datadings-shuffle shuffles an existing dataset file.

  • datadings-merge merges two or more dataset files.

  • datadings-split splits a dataset file into two or more subsets.

  • datadings-bench runs some basic read performance benchmarks.

Basic usage

Each dataset defines modules to read and write in the datadings.sets package. For most datasets the reading module only contains additional metadata like class labels and distributions.

Let’s consider the MIT1003 dataset as an example.

MIT1003_write is an executable that creates dataset files. It can be called directly or through datadings-write. Three files will be written:

  • MIT1003.msgpack contains sample data

  • MIT1003.msgpack.index contains the index for random access

  • MIT1003.msgpack.md5 contains MD5 hashes of both files
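The MD5 file makes it possible to check that the other two files were written or copied intact. The sketch below uses only the standard library. It assumes the `.md5` file follows the common `md5sum` layout of one `<hash>  <filename>` pair per line. This layout is an assumption, so inspect the file before relying on it. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5_file(md5_path):
    """Check each '<hash>  <filename>' line of md5_path (assumed md5sum layout).

    Returns a dict mapping every listed filename to True (match) or False.
    Filenames are resolved relative to the directory of md5_path.
    """
    md5_path = Path(md5_path)
    results = {}
    for line in md5_path.read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # md5sum marks binary mode with '*'
        results[name] = file_md5(md5_path.parent / name) == expected.lower()
    return results
```

Calling `verify_md5_file('MIT1003.msgpack.md5')` would then report, per listed file, whether its current contents still match the recorded hash.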

Reading all samples sequentially, using a MsgpackReader as a context manager:

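from datadings.reader import MsgpackReader

with MsgpackReader('MIT1003.msgpack') as reader:
    for sample in reader:
        # do dataset things, e.g. inspect the key of each sample
        print(sample['key'])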

This standard iterator returns dictionaries. Use the rawiter() method to get samples as msgpack-encoded bytes instead.

Reading specific samples:

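# reader is an open MsgpackReader, as in the example above
# jump to the sample with the given key, then read it
reader.seek_key('i14020903.jpeg')
print(reader.next()['key'])
# jump to the sample at position 100, then read it
reader.seek_index(100)
print(reader.next()['key'])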
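To see why a separate index file enables this kind of lookup, consider the toy sketch below. It is not the datadings file format. It stores JSON lines in an in-memory buffer and records the byte offset of each record, so a sample can be fetched by position or by key without scanning the records before it. All names here are hypothetical.

```python
import io
import json


def write_records(stream, samples):
    """Write samples back to back; return (byte offsets by position, key -> position)."""
    offsets = []
    key_to_position = {}
    for position, sample in enumerate(samples):
        offsets.append(stream.tell())  # where this record starts
        key_to_position[sample["key"]] = position
        stream.write(json.dumps(sample).encode("utf-8") + b"\n")
    return offsets, key_to_position


def read_at(stream, offset):
    """Jump straight to a byte offset and decode the single record stored there."""
    stream.seek(offset)
    return json.loads(stream.readline())


samples = [{"key": f"img{i}.jpeg", "label": i % 3} for i in range(5)]
buffer = io.BytesIO()
offsets, key_to_position = write_records(buffer, samples)

# Lookup by position (compare seek_index) and by key (compare seek_key).
by_index = read_at(buffer, offsets[3])
by_key = read_at(buffer, offsets[key_to_position["img1.jpeg"]])
```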

Reading samples as raw bytes:

# read the next sample as msgpack-encoded bytes
raw = reader.rawnext()
# iterate over the remaining samples as raw bytes
for raw in reader.rawiter():
    print(type(raw), len(raw))

Number of samples:

print(len(reader))

You can also change the order and selection of iterated samples with augments. For example, to randomize the order of samples, wrap the reader in a Shuffler:

from datadings.reader import MsgpackReader, Shuffler

with Shuffler(MsgpackReader('MIT1003.msgpack')) as reader:
    for sample in reader:
        # do dataset things, but in random order!
        print(sample['key'])
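Conceptually, randomizing the order means visiting every sample exactly once along a random permutation of the positions. The sketch below shows that idea with the standard library only. It is not how Shuffler is implemented, and `shuffled_indices` is a hypothetical helper.

```python
import random


def shuffled_indices(n, seed=None):
    """Return the indices 0..n-1 in random order (reproducible if seed is set)."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order


# A plain list stands in for the dataset here.
samples = [{"key": f"img{i}.jpeg"} for i in range(5)]
order = shuffled_indices(len(samples), seed=42)
shuffled = [samples[i] for i in order]  # every sample appears exactly once
```

Fixing the seed makes a run repeatable, which helps when debugging a training pipeline.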

A common use case is to iterate over the whole dataset multiple times. This can be done with the Cycler:

from datadings.reader import Cycler, MsgpackReader

with Cycler(MsgpackReader('MIT1003.msgpack')) as reader:
    for sample in reader:
        # do dataset things, but FOREVER!
        print(sample['key'])
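Since such a loop never ends on its own, the calling code has to decide when to stop. The sketch below shows one way to cap an endless iterator, using only the standard library. A plain list and `itertools.cycle` stand in for the reader and the Cycler. Whether `islice` composes with Cycler in exactly this way is an assumption not confirmed here.

```python
from itertools import cycle, islice

# Stand-in for a reader: any iterable of sample dictionaries.
samples = [{"key": f"img{i}.jpeg"} for i in range(3)]

# itertools.cycle repeats the iterable without end, much like the loop above.
endless = cycle(samples)

# islice caps the number of samples drawn: here 2 passes over 3 samples.
epochs = 2
seen = [sample["key"] for sample in islice(endless, epochs * len(samples))]
```

A `break` inside the loop after a fixed number of steps achieves the same effect.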


