vdtk: Visual Description Data Evaluation Tools

This tool is designed to allow for a deep investigation of diversity in visual description datasets, and to help users understand their data at a token, n-gram, description, and dataset level.

Installation

To install this tool, run `pip install .` from this directory. Note: Some metrics (METEOR) require a working installation of Java. If you do not already have access to a JRE, please follow the directions (here) to install the Java runtime.
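Before running the Java-dependent metrics, it can be useful to confirm that a Java runtime is visible on your `PATH`. The snippet below is a minimal sketch of such a check; the helper name `java_available` is hypothetical and is not part of vdtk.

```python
import shutil


def java_available() -> bool:
    """Return True if a `java` executable can be found on the PATH."""
    # shutil.which returns the full path to the executable, or None if absent.
    return shutil.which("java") is not None


if __name__ == "__main__":
    if java_available():
        print("Java runtime found on PATH.")
    else:
        print("No Java runtime found on PATH; install a JRE before using METEOR.")
```

This only checks that an executable named `java` exists; it does not verify the Java version.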

Data format

To work with this tool, datasets must be formatted as JSON files with the following schema (the `//` comments are annotations for illustration only; standard JSON does not allow comments):

// List of samples in the dataset
[
    // JSON object for each sample
    {
        "_id": "string", // A string ID for each sample. This can help keep track of samples during use.
        "split": "string", // A string corresponding to the split of the data. Default splits are "train", "validate" and "test"
        "references": [
            // List of string references
            "reference 1...",
            "reference 2..."
        ],
        "metadata": {} // Any JSON object. This field is not used by the toolkit at this time.
    }
]
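As an illustration of the schema above, the following sketch builds a small dataset in Python, checks each sample, and serializes it to JSON. The helper names (`validate_sample`, `dataset_to_json`) and the example captions are hypothetical, and treating `metadata` as optional is an assumption rather than documented behavior of the toolkit.

```python
import json

# Keys assumed to be required; "metadata" is treated as optional here (an assumption).
REQUIRED_KEYS = {"_id", "split", "references"}


def validate_sample(sample: dict) -> None:
    """Raise ValueError if a sample does not match the schema described above."""
    missing = REQUIRED_KEYS - sample.keys()
    if missing:
        raise ValueError(f"sample is missing keys: {sorted(missing)}")
    if not isinstance(sample["_id"], str) or not isinstance(sample["split"], str):
        raise ValueError("'_id' and 'split' must be strings")
    refs = sample["references"]
    if not isinstance(refs, list) or not all(isinstance(r, str) for r in refs):
        raise ValueError("'references' must be a list of strings")


def dataset_to_json(samples: list) -> str:
    """Validate every sample, then serialize the list as plain JSON (no comments)."""
    for sample in samples:
        validate_sample(sample)
    return json.dumps(samples, indent=4)


# Made-up example data, for illustration only.
samples = [
    {
        "_id": "sample-0",
        "split": "train",
        "references": ["A dog runs on the beach.", "A brown dog running along the shore."],
        "metadata": {},
    },
    {
        "_id": "sample-1",
        "split": "validate",
        "references": ["Two people ride bicycles down a street."],
        "metadata": {},
    },
]

if __name__ == "__main__":
    print(dataset_to_json(samples))
```

The resulting string can be written to a file and passed to the commands below as `DATASET_JSON_PATH`.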

Usage

After installation, the basic menu of commands can be accessed with `vdtk-cli --help`. We make several experiments/tools available for use:

Command	Details
vocab-stats	Run with `vdtk-cli vocab-stats DATASET_JSON_PATH`. Compute basic token-level vocab statistics
ngram-stats	Run with `vdtk-cli ngram-stats DATASET_JSON_PATH`. Compute n-gram statistics, EVS@N and ED@N
caption-stats	Run with `vdtk-cli caption-stats DATASET_JSON_PATH`. Compute caption-level dataset statistics
semantic-variance	Run with `vdtk-cli semantic-variance DATASET_JSON_PATH`. Compute within-sample BERT embedding semantic variance
coreset	Run with `vdtk-cli coreset DATASET_JSON_PATH`. Compute the caption coreset from the training split needed to solve the validation split
concept-overlap	Run with `vdtk-cli concept-overlap DATASET_JSON_PATH`. Compute the concept overlap between popular feature extractors and the dataset
concept-leave-one-out	Run with `vdtk-cli concept-leave-one-out DATASET_JSON_PATH`. Compute the performance with a coreset of concept captions
leave-one-out	Run with `vdtk-cli leave-one-out DATASET_JSON_PATH`. Compute leave-one-out ground truth performance on a dataset with multiple ground truths
[BETA] balanced-split	Run with `vdtk-cli balanced-split DATASET_JSON_PATH`. Compute a set of splits of the data which best balance the data diversity

For more details and options, pass `--help` to any of the commands above. Note that some tools are relatively compute-intensive. This toolkit will make use of a GPU if available and necessary, as well as a large number of CPU cores and RAM depending on the task.
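To run several of the analyses above on the same dataset, the commands can be scripted. The sketch below uses only the `vdtk-cli SUBCOMMAND DATASET_JSON_PATH` form shown in the table; the helper names (`build_command`, `run_analyses`), the `dry_run` flag, and the file name `my_dataset.json` are hypothetical and exist only in this example.

```python
import subprocess

# Subcommand names taken from the table above.
ANALYSES = ["vocab-stats", "ngram-stats", "caption-stats"]


def build_command(subcommand: str, dataset_path: str) -> list:
    """Build the argument list for one vdtk-cli invocation."""
    return ["vdtk-cli", subcommand, dataset_path]


def run_analyses(dataset_path: str, subcommands=ANALYSES, dry_run: bool = False) -> list:
    """Run each subcommand in turn; with dry_run=True, only print the commands."""
    commands = [build_command(name, dataset_path) for name in subcommands]
    for cmd in commands:
        print("Running:", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a command exits with an error.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: prints the commands without executing them.
    run_analyses("my_dataset.json", dry_run=True)
```

Calling `run_analyses("my_dataset.json")` without `dry_run=True` executes the commands, which requires `vdtk-cli` to be installed and the dataset file to exist.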

[BETA] See the API Docs for usage as a library.

Release history

0.3.0 (Dec 13, 2022)

0.1.0 (Aug 1, 2022), this version

Download files


Source Distribution

vdtk-0.1.0.tar.gz (11.8 MB)

Uploaded Aug 1, 2022 Source

Built Distribution

vdtk-0.1.0-py3-none-any.whl (11.8 MB)

Uploaded Aug 1, 2022 Python 3

Hashes for vdtk-0.1.0.tar.gz

Algorithm	Hash digest
SHA256	`cc5764d298ab3d24eb5d43464d9f97a95b227fee20089c2cead3c5fadfd61d8b`
MD5	`6605dfbe1ec8650a646dee9204f5d675`
BLAKE2b-256	`d37bf162ac3a26b5b4379098bd94bd6684ae85113f08d4dfded5fb169492ad4b`

Hashes for vdtk-0.1.0-py3-none-any.whl

Algorithm	Hash digest
SHA256	`3184d471ea815f2ca8b72494d3e76c03bd96181073c23a8051fc68c8f9bcf008`
MD5	`56b25c975c51ca5cb15257c3ee8165f2`
BLAKE2b-256	`077440409ee9c58d767899bbf0ff77768a528804300d189df10997260c9ca445`