Griddify high-dimensional tabular data for easy visualization and deep learning

Project description

Griddify

Redistribute tabular data into a grid for easy visualization and image-based deep learning. This library is greatly inspired by the excellent MolMap library.

Installation

git clone https://github.com/ersilia-os/griddify.git
cd griddify
pip install -e .

Step by step

Get a multidimensional dataset and preprocess it

In this example, we will use a dataset of 200 physicochemical descriptors calculated for about 10k compounds. You can get these data with the following command.

from griddify import datasets

data = datasets.get_compound_descriptors()

It is important that you preprocess your data (impute missing values, normalize, etc.). We provide functionality to do so.

from griddify import Preprocessing

pp = Preprocessing()
pp.fit(data)
data = pp.transform(data)
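
To make the preprocessing step concrete, here is a minimal, dependency-free sketch of what imputation followed by normalization can look like. It is not griddify's implementation: the function name `impute_and_scale`, the choice of median imputation and the choice of z-score scaling are all assumptions made for illustration.

```python
from statistics import mean, median, pstdev


def impute_and_scale(rows):
    """Median-impute missing values (None), then z-score each column.

    Hypothetical helper for illustration only; it assumes every column
    has at least one observed value.
    """
    n_cols = len(rows[0])
    columns = []
    for j in range(n_cols):
        col = [row[j] for row in rows]
        observed = [v for v in col if v is not None]
        fill = median(observed)
        filled = [fill if v is None else v for v in col]
        mu = mean(filled)
        sigma = pstdev(filled) or 1.0  # constant columns: avoid division by zero
        columns.append([(v - mu) / sigma for v in filled])
    # Transpose the processed columns back into rows (one row per sample).
    return [list(r) for r in zip(*columns)]


raw = [
    [1.0, 10.0],
    [2.0, None],  # missing value in the second feature
    [3.0, 30.0],
]
scaled = impute_and_scale(raw)
```

After this step every column has zero mean and unit variance, so no single feature dominates the distance calculations that follow.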

Create a 2D cloud of data features

Start by calculating distances between features.

from griddify import FeatureDistances

fd = FeatureDistances(metric="cosine").calculate(data)
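
Note that distances are computed between features (columns), not between samples (rows). The sketch below shows the idea for the cosine metric in plain Python. The function name `feature_cosine_distances` is hypothetical and the code is not taken from griddify; it also assumes that no column is all zeros.

```python
from math import sqrt


def feature_cosine_distances(rows):
    """Return an n_features x n_features matrix of cosine distances.

    Hypothetical helper for illustration; assumes no all-zero column.
    """
    columns = [list(col) for col in zip(*rows)]  # one vector per feature
    norms = [sqrt(sum(v * v for v in col)) for col in columns]
    n = len(columns)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(columns[i], columns[j]))
            d = 1.0 - dot / (norms[i] * norms[j])
            dist[i][j] = dist[j][i] = d  # the matrix is symmetric
    return dist


rows = [
    [1.0, 2.0, -1.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 0.0],
]
dist = feature_cosine_distances(rows)
```

In this toy table the second feature is exactly twice the first, so their cosine distance is zero (up to floating-point error), while the third feature lies far from both.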

You can now obtain a 2D cloud of your data features. By default, UMAP is used.

from griddify import Tabular2Cloud

tc = Tabular2Cloud()
tc.fit(fd)
Xc = tc.transform(fd)
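
The essence of this step is "distance matrix in, 2D coordinates out". UMAP itself is too involved to reproduce here, so the sketch below swaps in a much simpler technique, classical multidimensional scaling (MDS), solved with power iteration in plain Python. It is only meant to illustrate the input and output shapes; the function names are hypothetical and the result will differ from what UMAP produces.

```python
from math import dist as euclidean, sqrt


def _top_eigenpair(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of a
    symmetric positive semi-definite matrix."""
    n = len(matrix)
    vec = [1.0 + 0.1 * i for i in range(n)]  # deterministic starting vector
    norm = sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    # Rayleigh quotient gives the eigenvalue for the converged unit vector.
    value = sum(
        vec[i] * sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)
    )
    return value, vec


def classical_mds_2d(dist):
    """Embed n items in 2D from an n x n distance matrix (classical MDS)."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centering turns squared distances into inner products.
    b = [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean) for j in range(n)]
        for i in range(n)
    ]
    axes = []
    for _ in range(2):
        value, vec = _top_eigenpair(b)
        scale = sqrt(max(value, 0.0))
        axes.append([scale * v for v in vec])
        # Deflate: remove the component just found before the next pass.
        b = [[b[i][j] - value * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    return [[axes[0][i], axes[1][i]] for i in range(n)]


# Toy check: four "features" whose distances come from a 2 x 1 rectangle.
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0), (2.0, 1.0)]
dist = [[euclidean(p, q) for q in points] for p in points]
cloud = classical_mds_2d(dist)
```

Because the toy distances are exactly Euclidean in two dimensions, the recovered cloud reproduces them up to rotation and reflection. Real feature distances rarely embed perfectly in 2D, which is one reason a nonlinear method such as UMAP is used by default.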

It is always good to inspect the resulting projection. The cloud contains one point per feature in your dataset.

from griddify.plots import cloud_plot

cloud_plot(Xc)

Rearrange the 2D cloud onto a grid

Distribute cloud points on a grid using a linear assignment algorithm.

from griddify import Cloud2Grid

cg = Cloud2Grid()
cg.fit(Xc)
Xg = cg.transform(Xc)
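
A linear assignment problem asks for the one-to-one matching of points to grid cells that minimizes the total displacement. The sketch below solves a tiny instance by brute force over all possible matchings, which is only feasible for a handful of points. It is an illustration under assumptions, not griddify's code: the function name `assign_to_grid`, the squared-distance cost and the square grid of side ceil(sqrt(n)) are all choices made here. For realistic sizes, a dedicated solver such as SciPy's `linear_sum_assignment` would be the practical option.

```python
from itertools import permutations, product
from math import ceil, sqrt


def assign_to_grid(cloud):
    """Assign each 2D point to a distinct cell of a square grid so that the
    total squared distance is minimal. Brute force: tiny inputs only."""
    n = len(cloud)
    side = ceil(sqrt(n))  # assumption: smallest square grid that fits n points
    # Rescale the cloud to [0, side - 1] on both axes so that point
    # coordinates and cell coordinates are comparable.
    lo = [min(p[k] for p in cloud) for k in range(2)]
    hi = [max(p[k] for p in cloud) for k in range(2)]
    span = [(hi[k] - lo[k]) or 1.0 for k in range(2)]
    scaled = [
        [(p[k] - lo[k]) / span[k] * (side - 1) for k in range(2)] for p in cloud
    ]
    cells = list(product(range(side), repeat=2))  # integer (x, y) cells
    best_cost, best_cells = float("inf"), None
    for chosen in permutations(cells, n):  # every one-to-one matching
        cost = sum(
            (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for p, c in zip(scaled, chosen)
        )
        if cost < best_cost:
            best_cost, best_cells = cost, chosen
    return list(best_cells), side


cloud = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (1.0, 1.0)]
cells, side = assign_to_grid(cloud)
# cells == [(0, 0), (1, 0), (0, 1), (1, 1)] and side == 2
```

Each point lands on the cell closest to it, and no two points share a cell, which is the property that makes the grid usable as an image layout.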

You can check the rearrangement with an arrows plot.

from griddify.plots import arrows_plot

arrows_plot(Xc, Xg)

To continue with the next steps, it is actually more convenient to get mappings as integers. The following method gives you the size of the grid as well.

mappings, side = cg.get_mappings(Xc)

Rearrange your flat data points into grids

Let's go back to the original tabular data. We want to transform the input data, where each sample is represented as a one-dimensional array, into output data where each sample is represented as an image (i.e. a two-dimensional grid). Please ensure that the data are normalized or scaled.

from griddify import Flat2Grid

fg = Flat2Grid(mappings, side)
Xi = fg.transform(data)
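
Conceptually, this transformation writes each feature value into the grid cell assigned to that feature. The sketch below shows the idea for a single sample. The function name `flat_to_grid` and the mapping format (a list in which entry i is the (row, col) cell of feature i) are assumptions made for illustration; the mappings returned by griddify may be structured differently.

```python
def flat_to_grid(sample, mappings, side, fill=0.0):
    """Place feature i of a flat sample at cell mappings[i] of a
    side x side grid. Cells with no assigned feature keep `fill`."""
    if len(sample) != len(mappings):
        raise ValueError("sample and mappings must have the same length")
    grid = [[fill] * side for _ in range(side)]
    for value, (row, col) in zip(sample, mappings):
        grid[row][col] = value
    return grid


mappings = [(0, 0), (0, 1), (1, 0)]  # hypothetical: feature i -> (row, col)
sample = [0.5, -1.2, 2.0]
image = flat_to_grid(sample, mappings, side=2)
# image == [[0.5, -1.2], [2.0, 0.0]]
```

When the number of features is not a perfect square, some cells stay empty; here the unused cell is filled with zero, which is a neutral value once the data have been normalized.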

Explore one sample.

from griddify.plots import grid_plot

grid_plot(Xi[0])

Full pipeline

You can run the full pipeline described above in only a few lines of code.

from griddify import datasets
from griddify import Griddify

data = datasets.get_compound_descriptors()

gf = Griddify(preprocess=True)
gf.fit(data)
Xi = gf.transform(data)

You can find more examples as Jupyter Notebooks in the notebooks folder.

Learn more

The Ersilia Open Source Initiative is on a mission to strengthen research capacity in low-income countries. Please reach out to us if you want to contribute: hello@ersilia.io

Project details

Release history

This version

0.0.1

Aug 12, 2022

Download files


Source Distribution

griddify-0.0.1.tar.gz (9.7 MB)

Uploaded Aug 12, 2022 Source

Hashes for griddify-0.0.1.tar.gz

Algorithm	Hash digest
SHA256	`43ccd859767df88d8f236a3e7589fdb643c72ebffb7b0abd752fffacf0e6b179`
MD5	`a3987095f42c1017ba70ef8f88712a44`
BLAKE2b-256	`a6dbcc942b27aa84f8cf776612148fc2036356b835508e65dfe88882f6f0978c`