
Datasets for the training of graph neural networks (GNNs) and subsequent visualization of attributional explanations of XAI methods

Project description


Visual Graph Datasets

This package provides the means to manage a collection of datasets, primarily for the training of graph neural networks. Each dataset is represented by one folder. Inside this folder, each element of the dataset is represented by two files: (1) a metadata JSON file which contains the full graph representation as well as additional metadata such as the canonical index, the target value to be predicted, etc.; (2) a PNG image file which shows a domain-specific illustration of the graph (for example, molecular graphs for chemical datasets). These per-element visualizations make it easy to visualize the output of attributional graph XAI methods, which assign importance values to each node and edge of the original input graph.
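
For illustration, a dataset folder might look roughly like this (the file names below are hypothetical; the exact naming scheme can differ between datasets):

rb_dual_motifs/
├── 0.json      # metadata + full graph representation of element 0
├── 0.png       # canonical visualization of element 0
├── 1.json
├── 1.png
└── ...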

Motivation

Usually datasets are packaged as compactly as possible. For example, chemical graph datasets are typically distributed as CSV files which only contain an index, a SMILES representation of the molecule, and the target value, looking something like this:

index, smiles, value
0, ccc, 0.24
1, ccc, 0.52
2, ccc, 1.77

This has the major advantage that even large datasets have file sizes of only a few MB. Such files are easy to download and easy to store. The disadvantage, however, is that they need to be processed before they can be used to train graph neural networks (GNNs): the SMILES representation first has to be transformed into a graph representation, where node and edge features have to be generated by some kind of chemical pre-processor. Instead of putting the major storage and bandwidth requirements on the user, this puts the major processing requirements on the user. Additionally, this method places a greater burden on the visualization step of generated explanations.
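
To make this concrete, here is a minimal sketch of the kind of pre-processing such a CSV format pushes onto the user, assuming RDKit as the chemical pre-processor and a deliberately simplistic feature set; the actual pre-processing behind the datasets in this package may differ:

from rdkit import Chem

def smiles_to_graph(smiles: str) -> dict:
    # Parse the SMILES string into an RDKit molecule object.
    mol = Chem.MolFromSmiles(smiles)
    # Node features: here just the atomic number of each atom (real feature sets are much richer).
    node_attributes = [[float(atom.GetAtomicNum())] for atom in mol.GetAtoms()]
    # Edge list as pairs of node indices, one entry per direction for an undirected graph.
    edge_indices = []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        edge_indices += [[i, j], [j, i]]
    return {'node_attributes': node_attributes, 'edge_indices': edge_indices}

graph = smiles_to_graph('CCO')  # ethanol: 3 heavy atoms, 2 bonds -> 4 directed edges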

Ultimately, we decided to put the burden of downloading larger amounts of data on the user a single time, in exchange for simplifying and reducing the burden of pre-processing and data visualization for each training process.

Additionally, by distributing both canonical indexing and canonical visualizations we aim to make explanation results more comparable in the future.

Installation

First clone this repository:

git clone https://github.com/username/visual_graph_datasets.git

Then install it like this:

cd visual_graph_datasets
pip3 install -e .
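
To quickly check that the installation worked, you can try importing the package (this only verifies the import; it does not download any datasets):

python3 -c "import visual_graph_datasets"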

Download datasets

NOTE: We strongly encourage storing datasets on an SSD instead of an HDD, as this can make a difference of multiple hours(!) when loading especially large datasets.

Datasets can be downloaded by name using the download command:

# Example for the dataset 'rb_dual_motifs'
python3 -m visual_graph_datasets.cli download "rb_dual_motifs"

By default, datasets will be downloaded into the folder $HOME/.visual_graph_datasets/datasets, where HOME is the current user's home directory.

The dataset download destination can be changed in a config file by using the config command:

python3 -m visual_graph_datasets.cli config

This command will open the config file at $HOME/.visual_graph_datasets/config.yaml using the system's default text editor.

List available datasets

You can display a list of all datasets currently available from the configured remote file share provider, along with some metadata about them, by using the list command:

python3 -m visual_graph_datasets.cli list

Running the unit tests

After installation, you can optionally run the unit tests to confirm that all datasets have been downloaded correctly and that everything works properly:

cd visual_graph_datasets
pytest ./tests/*

Usage

The datasets are mainly intended to be used in combination with other packages, but this package provides some basic utilities to load and explore the datasets themselves within Python programs.

import os

from visual_graph_datasets.config import Config
from visual_graph_datasets.data import load_visual_graph_dataset

# The function only needs the absolute path to the dataset folder and will load the entire dataset
# from all the files within that folder.
# The function returns two dictionaries: the first maps the string names of the elements to their
# content dictionaries and the second maps the integer indices of the elements to the very same
# content dictionaries. Two separate dictionaries are returned to provide different ways of accessing
# the data of the elements, which are needed in different situations.
dataset_path = os.path.join(Config().get_datasets_path(), 'rb_dual_motifs')
data_name_map, data_index_map = load_visual_graph_dataset(dataset_path)

Each such content dictionary, i.e. each value of the two dicts returned by the function, has the following nested structure:

  • image_path: The absolute path to the image file that visualizes this element

  • metadata_path: the absolute path to the metadata file

  • metadata: A dict which contains all the metadata for that element
    • value: The target value for the element, which can be a single value (usually with regression) or a one-hot vector for classification.

    • index: The canonical index of this element within the dataset

  • (split): If defined, either “train” or “test”, indicating the element's assignment in the canonical train/test split

    • graph: A dictionary which contains the entire graph representation of this element.
      • node_attributes: tensor of shape (V, N)

      • edge_attributes: tensor of shape (E, M)

      • edge_indices: tensor of shape (E, 2) which are the tuples of integer node indices that determine edges

      • node_coordinates: tensor of shape (V, 2) which contains the xy position of each node in pixel values within the corresponding image visualization of the element. This is the crucial information required to use the existing image representations to visualize attributional explanations (see the sketch after the variable definitions below)!

With the following variable definitions:

  • V - the number of nodes in a graph

  • E - the number of edges in a graph

  • N - the number of node attributes / features associated with each node

  • M - the number of edge attributes / features associated with each edge
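
As a rough sketch of how node_coordinates ties the graph back to its image, the following example overlays per-node importance values onto the canonical visualization with matplotlib. It reuses data_index_map from the loading snippet above; the importance values here are random stand-ins for what an attributional XAI method would actually produce:

import random
import matplotlib.pyplot as plt

# Pick one element of the previously loaded dataset by its canonical index.
data = data_index_map[0]
graph = data['metadata']['graph']
node_coordinates = graph['node_coordinates']  # pixel xy positions, shape (V, 2)

# Stand-in node importances in [0, 1]; in practice these come from an XAI method.
node_importances = [random.random() for _ in graph['node_attributes']]

# Draw the canonical image and overlay one circle per node, scaled by its importance.
fig, ax = plt.subplots()
ax.imshow(plt.imread(data['image_path']))
for (x, y), importance in zip(node_coordinates, node_importances):
    ax.scatter(x, y, s=500 * importance, color='red', alpha=0.5)
ax.axis('off')
plt.show()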

Datasets

Here is a list of the datasets currently included.

For more information about the individual datasets use the list command in the CLI (see above).

  • rb_dual_motifs

  • tadf
