visual-graph-datasets

Datasets for the training of graph neural networks (GNNs) and subsequent visualization of attributional explanations of XAI methods

Visual Graph Datasets

This package provides management tools and utilities for graph datasets used to train graph neural networks (GNNs), aimed specifically at explainable AI (XAI) methods.

With respect to the structure and management of these datasets, this package employs a different philosophy. Instead of the usual minimal packaging into CSV files, a visual graph dataset (VGD) represents each dataset as a folder in which each element is represented by two files:

A json file containing metadata information, including the full graph representation.
A png file containing a canonical visualization of the graph.

We believe that providing a canonical graph representation as well as a canonical visualization will help to make AI methods trained on such datasets more comparable. In particular, the canonical visualization and the standard utilities for visualizing attributional XAI explanations are intended to improve the comparability and reproducibility of XAI methods in the future.

Installation

First clone this repository:

git clone https://github/username/visual_graph_datasets.git

Then install it like this:

cd visual_graph_datasets
pip3 install -e .

Command Line Interface

Download datasets

NOTE: We strongly recommend storing datasets on an SSD instead of an HDD, as this can make a difference of multiple hours(!) when loading especially large datasets.

Datasets can be downloaded by name using the download command:

# Example for the dataset 'rb_dual_motifs'
python3 -m visual_graph_datasets.cli download "rb_dual_motifs"

By default, the dataset will be downloaded into the folder $HOME/.visual_graph_datasets/datasets, where HOME is the current user's home directory.
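For reference, the default location described above can be computed from Python using only the standard library. This is a minimal sketch: the helper name `default_datasets_path` is hypothetical and not part of the package, and it deliberately ignores any override set in the config file.

```python
import os


def default_datasets_path() -> str:
    """Return the default dataset folder described in this README.

    Hypothetical helper for illustration only; it does not read
    config.yaml and therefore ignores any custom destination.
    """
    home = os.path.expanduser('~')
    return os.path.join(home, '.visual_graph_datasets', 'datasets')


print(default_datasets_path())
```

In real code, prefer the path reported by the package's own config object, since that one respects the config file.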

The dataset download destination can be changed in a config file by using the config command:

python3 -m visual_graph_datasets.cli config

This command will open the config file at $HOME/.visual_graph_datasets/config.yaml using the system's default text editor.

List available datasets

The list command displays all datasets currently available from the configured remote file share provider, along with some metadata about each of them:

python3 -m visual_graph_datasets.cli list

Quickstart

The datasets are mainly intended to be used in combination with other packages, but this package provides some basic utilities to load and explore the datasets themselves within Python programs.

import os
import typing as t
import matplotlib.pyplot as plt

from visual_graph_datasets.config import Config
from visual_graph_datasets.web import ensure_dataset
from visual_graph_datasets.data import load_visual_graph_dataset
from visual_graph_datasets.visualization.base import draw_image
from visual_graph_datasets.visualization.importances import plot_node_importances_border
from visual_graph_datasets.visualization.importances import plot_edge_importances_border

# This object will load the settings from the main config file. This config file contains options
# such as changing the default datasets folder and defining custom alternative file share providers
config = Config()
config.load()

# First of all we need to make sure that the dataset exists locally. This function will download it from
# the default file share provider if it does not exist.
ensure_dataset('rb_dual_motifs', config)

# Afterwards we can be sure that the dataset exists and can now load it from the default datasets path.
# The data will be loaded as a dictionary whose int keys are the indices of the corresponding elements
# and whose values are dictionaries which contain all the relevant data about the dataset element.
# (The dataset format is explained below.)
dataset_path = os.path.join(config.get_datasets_path(), 'rb_dual_motifs')
data_index_map: t.Dict[int, dict] = {}
_, data_index_map = load_visual_graph_dataset(dataset_path)

# Using this information we can visualize the ground truth importance explanation annotations for one
# element of the dataset like this.
index = 0
data = data_index_map[index]
# This is the dictionary which represents the graph structure of the dataset element. It has
# descriptive string keys and numpy array values.
g = data['metadata']['graph']
fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(10, 10))
draw_image(ax, image_path=data['image_path'])
plot_node_importances_border(
    ax=ax,
    g=g,
    node_positions=g['image_node_positions'],
    node_importances=g['node_importances_2'][:, 0],
)
plot_edge_importances_border(
    ax=ax,
    g=g,
    node_positions=g['image_node_positions'],
    edge_importances=g['edge_importances_2'][:, 0],
)
fig_path = os.path.join(os.getcwd(), 'importances.pdf')
fig.savefig(fig_path)

Dataset Format

Visual Graph Datasets are represented as folders containing multiple files. The primary content of these dataset folders is made up of two files per element in the dataset:

A PNG file. This is the canonical visualization of the graph which can subsequently be used to create explanation visualizations as well. The pixel position of each node in the graph is attached as metadata of the graph representation.
A JSON file. Primarily contains the full graph representation consisting of node attributes, edge attributes, an edge list etc. May also contain custom metadata for each graph depending on the dataset.

Additionally, a dataset folder may also contain a .meta.yml file which contains additional metadata about the dataset as a whole.
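The folder layout described above can be explored with the standard library alone. The following is a minimal sketch, not the package's own loader: the function names are hypothetical, and it assumes that the two files of one element share the same file stem (for example `0.json` and `0.png`), which this README does not state explicitly.

```python
import json
from pathlib import Path
from typing import Dict, Tuple


def pair_element_files(dataset_dir: str) -> Dict[str, Tuple[Path, Path]]:
    """Map each element name to its (json_path, png_path) pair.

    Assumption: both files of an element share the same file stem.
    Elements that lack a PNG file are skipped. Other files in the
    folder (such as .meta.yml) are ignored because only '*.json'
    files are used as the starting point.
    """
    folder = Path(dataset_dir)
    pairs: Dict[str, Tuple[Path, Path]] = {}
    for json_path in sorted(folder.glob('*.json')):
        png_path = json_path.with_suffix('.png')
        if png_path.exists():
            pairs[json_path.stem] = (json_path, png_path)
    return pairs


def load_element_metadata(json_path: Path) -> dict:
    """Read the metadata JSON of a single element into a dictionary."""
    with open(json_path, 'r') as file:
        return json.load(file)
```

For regular use, the `load_visual_graph_dataset` function shown in the Quickstart remains the intended entry point; a sketch like this is mainly useful for inspecting a dataset folder from a script without importing the package.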

In addition, a dataset folder contains a Python module process.py, which holds the standalone implementation of the preprocessing procedure that turns a domain-specific graph representation (think of SMILES codes for molecular graphs) into a valid graph representation for that specific dataset. This module can be imported and used directly from Python code. Alternatively, it can be used as a standalone command line application for programming-language-agnostic preprocessing of elements.

Element Metadata JSON

The metadata file of a single dataset element may have the following nested structure:

target: A 1D array containing the target values for the element. For classification this is usually a one-hot encoded vector of classes. For multi-task regression this vector may have an arbitrary number of continuous regression targets. For single-task regression this is still a vector, albeit with the shape (1, ).
index: The canonical index of this element within the dataset
(train_split optional) A list of int indices, where each index represents a different split. If the number “1” is for example part of this list, that means that the corresponding element is considered to be part of the training set of split “1”. What each particular split is may be described in the documentation of the dataset.
(test_split optional) A list of int indices, where each index represents a different split. If the number “1” is for example part of this list, that means that the corresponding element is considered to be part of the test set of split “1”.
graph: A dictionary which contains the entire graph representation of this element.
- node_indices: array of shape (V, 1) with the integer node indices.
- node_attributes: array of shape (V, N)
- edge_indices: array of shape (E, 2) which are the tuples of integer node indices that determine edges
- edge_attributes: array of shape (E, M)
- node_positions: array of shape (V, 2) which are the xy positions of each node in pixel values within the corresponding image visualization of the element. This is the crucial information which is required to use the existing image representations to visualize attributional explanations!
- (node_importances_{K}_{suffix} optional) array of shape (V, K) containing ground truth node importance explanations, which assign an importance value of 0 to 1 to each node of the graph across K channels. One dataset element may have none or multiple such annotations with different suffixes determining the number of explanation channels and origin.
- (edge_importances_{K}_{suffix} optional) array of shape (E, K) containing ground truth edge importance explanations, which assign an importance value of 0 to 1 to each edge of the graph across K channels. One dataset element may have none or multiple such annotations with different suffixes determining the number of explanation channels and origin.
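The optional train_split and test_split fields described above can be used to select the elements of one particular split. The following is a minimal sketch under stated assumptions: the function name `indices_for_split` is hypothetical, the input is assumed to be a plain mapping from element index to the content of that element's metadata JSON, and the toy values are made up for illustration.

```python
from typing import Dict, List


def indices_for_split(
    metadata_map: Dict[int, dict],
    split: int,
    key: str = 'train_split',
) -> List[int]:
    """Return the sorted indices of all elements that belong to a split.

    Elements that do not define the optional `key` field are treated
    as belonging to no split at all.
    """
    return sorted(
        index
        for index, metadata in metadata_map.items()
        if split in metadata.get(key, [])
    )


# Toy metadata for three elements (illustrative values only).
metadata_map = {
    0: {'index': 0, 'train_split': [1, 2], 'test_split': []},
    1: {'index': 1, 'train_split': [2], 'test_split': [1]},
    2: {'index': 2},  # no split annotations
}
train_indices = indices_for_split(metadata_map, split=1, key='train_split')
test_indices = indices_for_split(metadata_map, split=1, key='test_split')
```

Because one element can list several split numbers, the same dataset can carry multiple predefined train/test partitions without duplicating any files.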

The array shapes above use the following definitions:

V - the number of nodes in a graph
E - the number of edges in a graph
N - the number of node attributes / features associated with each node
M - the number of edge attributes / features associated with each edge
K - the number of importance channels
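The shape conventions above can be checked mechanically. The following is a minimal sketch that works on plain nested lists, which is how arrays appear after reading a JSON file with the standard library. The helper names `shape_2d` and `check_graph_shapes` are hypothetical, and the toy triangle graph is made up for illustration.

```python
from typing import List, Tuple


def shape_2d(array: List[list]) -> Tuple[int, int]:
    """Return (rows, cols) of a rectangular nested list."""
    rows = len(array)
    cols = len(array[0]) if rows else 0
    if any(len(row) != cols for row in array):
        raise ValueError('nested list is not rectangular')
    return rows, cols


def check_graph_shapes(g: dict) -> Tuple[int, int, int, int]:
    """Validate the documented shapes and return (V, E, N, M)."""
    v, n = shape_2d(g['node_attributes'])
    e, m = shape_2d(g['edge_attributes'])
    expected = {
        'node_indices': (v, 1),
        'edge_indices': (e, 2),
        'node_positions': (v, 2),
    }
    for key, (rows, cols) in expected.items():
        actual = shape_2d(g[key])
        # An empty array has no columns to inspect, so compare rows only.
        if actual[0] != rows or (rows > 0 and actual[1] != cols):
            raise ValueError(f'{key}: expected {(rows, cols)}, got {actual}')
    return v, e, n, m


# Toy graph: a triangle with V=3 nodes, E=3 edges, N=2 and M=1.
g = {
    'node_indices': [[0], [1], [2]],
    'node_attributes': [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    'edge_indices': [[0, 1], [1, 2], [2, 0]],
    'edge_attributes': [[1.0], [1.0], [1.0]],
    'node_positions': [[10.0, 20.0], [30.0, 20.0], [20.0, 40.0]],
}
shapes = check_graph_shapes(g)
```

The optional importance arrays could be checked the same way, with (V, K) for nodes and (E, K) for edges.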

Dataset Metadata YML

One such metadata file may have the following nested structure. Additionally, it may also contain custom additional fields depending on each dataset.

version: A string determining the current version of the dataset
description: Short string description of what the dataset is about (for example where the data came from, what types of graphs it consists of, what the prediction target is etc.)
visualization_description: String description of what can be seen in the visualization of the graph. There are many different types of graphs out there which may have very domain specific visualizations. This string should provide a short description of how the visualizations may be interpreted.
references: A list of strings, where each string is a short description of online resources which are relevant to the dataset, usually including a URL. This could for example include references to scientific publications where a dataset was first introduced.
file_size: The integer accumulated size of all the files that make up the dataset in bytes.
num_elements: The integer number of elements in the dataset
num_targets: The size of the prediction target vector
num_node_attributes: The size of the node attribute vector
num_edge_attributes: The size of the edge attribute vector
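Two of the fields above, file_size and num_elements, can be derived directly from the dataset folder. The following is a minimal sketch of how such values might be computed, not the package's actual implementation. The function name `summarize_dataset_folder` is hypothetical, and it assumes that every regular file directly inside the folder counts toward the size and that each element contributes exactly one `.json` file.

```python
import os


def summarize_dataset_folder(dataset_dir: str) -> dict:
    """Compute `file_size` (in bytes) and `num_elements` for a folder.

    Assumptions: only regular files directly inside the folder are
    counted, and the number of elements equals the number of files
    whose name ends in '.json'.
    """
    file_size = 0
    num_elements = 0
    for entry in os.scandir(dataset_dir):
        if not entry.is_file():
            continue
        file_size += entry.stat().st_size
        if entry.name.endswith('.json'):
            num_elements += 1
    return {'file_size': file_size, 'num_elements': num_elements}
```

Comparing these computed values against the ones stored in the metadata file is a quick way to spot an incomplete download.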

Datasets

Here is a list of the datasets currently uploaded on the main file share provider.

For more information about the individual datasets use the list command in the CLI (see above).

TO BE DONE
