Automated Generative Modeling and Sampling

These details have been verified by PyPI

Maintainers

amontanez24 fealho francesh kveerama lajohn mit_dai_lab npatki pvkdeveloper rwedge-datacebo

These details have not been verified by PyPI

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Programming Language

Project description

An open source project from the Data to AI Lab at MIT.

SDV - Synthetic Data Vault

Free Software: MIT License
Documentation: https://HDI-Project.github.io/SDV
Homepage: https://github.com/HDI-Project/SDV

Overview

SDV is an automated generative modeling and sampling tool that lets users build generative models of multi-table, relational datasets and then sample synthetic data from them.

Install

Requirements

SDV has been developed and tested on Python 3.5, 3.6 and 3.7.

Although it is not strictly required, using a virtualenv is highly recommended to avoid interfering with other software installed on the system where SDV is run.

These are the minimum commands needed to create a virtualenv using python3.6 for SDV:

pip install virtualenv
virtualenv -p $(which python3.6) sdv-venv

Afterwards, run this command to activate the virtualenv:

source sdv-venv/bin/activate

Remember to run it every time you open a new console to work on SDV!
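If you are unsure whether the virtualenv is active in the current console, a small check like the one below can help. It is only a sketch: the helper name in_virtualenv is hypothetical and not part of SDV, and it relies on the sys.real_prefix attribute set by older virtualenv releases and on sys.base_prefix for newer ones.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # Newer virtualenv releases and the standard venv module make
    # sys.prefix differ from sys.base_prefix.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print("Python version:", ".".join(str(part) for part in sys.version_info[:3]))
print("Inside a virtualenv:", in_virtualenv())
```

If the second line prints False, activate the virtualenv again before installing or running SDV.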

Install with pip

After creating and activating the virtualenv, we recommend using pip to install SDV:

pip install sdv

This will pull and install the latest stable release from PyPI.

Install from source

With your virtualenv activated, you can clone the repository and install it from source by running make install on the stable branch:

git clone git@github.com:HDI-Project/SDV.git
cd SDV
git checkout stable
make install

Install for Development

If you want to contribute to the project, a few more steps are required to make the project ready for development.

Please head to the Contributing Guide for more details about this process.

Data Requirements

SDV can work with both single-table and multi-table, relational datasets, as long as they comply with the following data requirements:

All the data columns must be either numerical, categorical, boolean or datetimes. Mixed value types are not supported, but columns can have null values.
All the tables in the dataset can have at most one primary key, which can either be numerical or categorical. Datetime indexes might be supported in future versions.
All the tables can have at most one foreign key to a parent table, meaning that each table can have at most one parent.
Tables are either loaded as pandas.DataFrame objects or stored as CSV files.
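The first requirement can be illustrated with a standalone check written in plain Python. This is a sketch under stated assumptions, not SDV's own validation code: the function column_kind is hypothetical, None stands for a null value, and strings are treated as categorical values.

```python
import datetime
import numbers


def column_kind(values):
    """Classify a column as boolean, numerical, datetime or categorical.

    None entries are treated as null values and ignored.
    Raises ValueError when the remaining values mix several kinds.
    """
    kinds = set()
    for value in values:
        if value is None:
            continue
        # bool must be tested before numbers because bool subclasses int.
        if isinstance(value, bool):
            kinds.add("boolean")
        elif isinstance(value, numbers.Number):
            kinds.add("numerical")
        elif isinstance(value, (datetime.datetime, datetime.date)):
            kinds.add("datetime")
        elif isinstance(value, str):
            # Assumption: plain strings stand for categorical values.
            kinds.add("categorical")
        else:
            raise ValueError("unsupported value: {!r}".format(value))
    if len(kinds) > 1:
        raise ValueError("mixed value types: {}".format(sorted(kinds)))
    return kinds.pop() if kinds else None


print(column_kind([34, 23, 44]))         # numerical
print(column_kind(["M", "F", None]))     # categorical
print(column_kind([True, False, True]))  # boolean
```

A column such as [1, "a"] would raise a ValueError, which mirrors the rule that mixed value types are not supported.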

Metadata

Alongside the actual tables, SDV needs some metadata about the dataset, which can be provided either as a Python dict object or as a JSON file.

For more details about the Metadata format, please refer to the corresponding section of the documentation.
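Since both forms carry the same information, a metadata dict can be written to a JSON file and read back with the standard json module. The sketch below uses a hypothetical, minimal metadata dict that follows the layout shown in the Quickstart; the file name metadata.json is just an example.

```python
import json
import os
import tempfile

# Hypothetical minimal metadata, following the layout shown in the Quickstart.
metadata = {
    "tables": [
        {
            "name": "users",
            "primary_key": "user_id",
            "fields": [
                {"name": "user_id", "type": "id"},
                {"name": "age", "type": "numerical", "subtype": "integer"},
            ],
        }
    ]
}

# Write the dict to a JSON file inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "metadata.json")
with open(path, "w") as json_file:
    json.dump(metadata, json_file, indent=4)

# Read the file back into a dict.
with open(path) as json_file:
    loaded = json.load(json_file)

print(loaded == metadata)  # True
```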

Quickstart

In this short tutorial we will guide you through a series of steps that will help you get started with SDV.

1. Load some data

SDV comes with a toy dataset to play with, which can be loaded using the sdv.load_demo function:

from sdv import load_demo

metadata, tables = load_demo()

This will return two objects:

A metadata dictionary with all the information that SDV needs to know about the dataset:

{
    "tables": [
        {
            "fields": [
                {"name": "user_id", "type": "id"},
                {"name": "country", "type": "categorical"},
                {"name": "gender", "type": "categorical"},
                {"name": "age", "type": "numerical", "subtype": "integer"}
            ],
            "name": "users",
            "primary_key": "user_id"
        },
        {
            "fields": [
                {"name": "session_id", "type": "id"},
                {"name": "user_id", "type": "id", "ref": {
                    "field": "user_id", "table": "users"}
                },
                {"name": "device", "type": "categorical"},
                {"name": "os", "type": "categorical"}
            ],
            "name": "sessions",
            "primary_key": "session_id"
        },
        {
            "fields": [
                {"name": "transaction_id", "type": "id"},
                {"name": "session_id", "type": "id", "ref": {
                    "field": "session_id", "table": "sessions"}
                },
                {"name": "timestamp", "format": "%Y-%m-%d", "type": "datetime"},
                {"name": "amount", "type": "numerical", "subtype": "float"},
                {"name": "approved", "type": "boolean"}
            ],
            "name": "transactions",
            "primary_key": "transaction_id"
        }
    ]
}

A dictionary containing three pandas.DataFrames with the tables described in the metadata dictionary.

The returned objects contain the following information:

{
    'users':
            user_id country gender  age
          0        0     USA      M   34
          1        1      UK      F   23
          2        2      ES   None   44
          3        3      UK      M   22
          4        4     USA      F   54
          5        5      DE      M   57
          6        6      BG      F   45
          7        7      ES   None   41
          8        8      FR      F   23
          9        9      UK   None   30,
  'sessions':
          session_id  user_id  device       os
          0           0        0  mobile  android
          1           1        1  tablet      ios
          2           2        1  tablet  android
          3           3        2  mobile  android
          4           4        4  mobile      ios
          5           5        5  mobile  android
          6           6        6  mobile      ios
          7           7        6  tablet      ios
          8           8        6  mobile      ios
          9           9        8  tablet      ios,
  'transactions':
          transaction_id  session_id           timestamp  amount  approved
          0               0           0 2019-01-01 12:34:32   100.0      True
          1               1           0 2019-01-01 12:42:21    55.3      True
          2               2           1 2019-01-07 17:23:11    79.5      True
          3               3           3 2019-01-10 11:08:57   112.1     False
          4               4           5 2019-01-10 21:54:08   110.0     False
          5               5           5 2019-01-11 11:21:20    76.3      True
          6               6           7 2019-01-22 14:44:10    89.5      True
          7               7           8 2019-01-23 10:14:09   132.1     False
          8               8           9 2019-01-27 16:09:17    68.0      True
          9               9           9 2019-01-29 12:10:48    99.9      True
}
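The "ref" entries in the metadata are what link each child table to its parent. As an illustration only, the sketch below walks a trimmed copy of the demo metadata (non-key fields omitted) and checks that no table has more than one foreign key, as stated in the data requirements. The helper find_parents is hypothetical and not part of the SDV API.

```python
# Trimmed copy of the demo metadata: only the key fields are kept.
metadata = {
    "tables": [
        {
            "name": "users",
            "primary_key": "user_id",
            "fields": [{"name": "user_id", "type": "id"}],
        },
        {
            "name": "sessions",
            "primary_key": "session_id",
            "fields": [
                {"name": "session_id", "type": "id"},
                {"name": "user_id", "type": "id",
                 "ref": {"field": "user_id", "table": "users"}},
            ],
        },
        {
            "name": "transactions",
            "primary_key": "transaction_id",
            "fields": [
                {"name": "transaction_id", "type": "id"},
                {"name": "session_id", "type": "id",
                 "ref": {"field": "session_id", "table": "sessions"}},
            ],
        },
    ]
}


def find_parents(metadata):
    """Map each table name to the name of its parent table (or None)."""
    table_names = {table["name"] for table in metadata["tables"]}
    parents = {}
    for table in metadata["tables"]:
        # Every field carrying a "ref" entry is a foreign key.
        refs = [field["ref"] for field in table["fields"] if "ref" in field]
        if len(refs) > 1:
            raise ValueError(
                "{} has more than one foreign key".format(table["name"]))
        if refs and refs[0]["table"] not in table_names:
            raise ValueError(
                "{} references an unknown table".format(table["name"]))
        parents[table["name"]] = refs[0]["table"] if refs else None
    return parents


parents = find_parents(metadata)
for child, parent in sorted(parents.items()):
    print(child, "->", parent)
```

Here users has no parent, sessions points to users, and transactions points to sessions.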

2. Create and fit an SDV instance

Before sampling, SDV needs to learn about your data in a process called Database Modeling.

During this process, SDV will walk across all the tables in your dataset learning about the table relationships and the probability distributions of their values.

To do this, we create an instance of the sdv.SDV class and call its fit method passing it both the metadata and tables that we obtained before:

from sdv import SDV

sdv = SDV()
sdv.fit(metadata, tables)
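To give an intuition of what "learning the probability distributions" means, here is a deliberately naive sketch that uses only the standard library. It is not how SDV models data internally (the changelog below mentions GaussianMultivariate and VineCopulas models); it only summarizes two columns of the demo users table independently and draws new values from those summaries. Note that random.choices needs Python 3.6 or later.

```python
import random
import statistics
from collections import Counter

# Columns copied from the demo "users" table shown in step 1.
ages = [34, 23, 44, 22, 54, 57, 45, 41, 23, 30]
countries = ["USA", "UK", "ES", "UK", "USA", "DE", "BG", "ES", "FR", "UK"]

# "Learn" a very simple description of each column.
age_mean = statistics.mean(ages)
age_std = statistics.stdev(ages)
country_counts = Counter(countries)

# Draw new values from the learned descriptions, with a fixed seed.
rng = random.Random(0)
synthetic_ages = [round(rng.gauss(age_mean, age_std)) for _ in range(5)]
synthetic_countries = rng.choices(
    list(country_counts), weights=list(country_counts.values()), k=5
)

print(age_mean)              # 37.3
print(country_counts["UK"])  # 3
print(synthetic_ages)
print(synthetic_countries)
```

Unlike this toy version, a multivariate model also captures how columns relate to each other, which is why independent per-column sampling is only an illustration.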

3. Sample data

Once the modeling has finished, we can sample new data using our fitted SDV instance.

To do this, we call its sample_all method, passing the number of rows that we want to sample:

samples = sdv.sample_all(5)

The output will be a dictionary with the same structure as the original tables dict, but filled with synthetic data instead of the real data.

Note that only the parent tables of your dataset will have the specified number of rows: the number of child rows for each parent row is itself sampled, following the distribution observed in your original dataset.
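The idea of sampling the number of child rows can be pictured with the demo data from step 1. The sketch below is an assumption-laden illustration, not SDV's implementation: it counts how many sessions each user has, builds the empirical distribution of those counts, and draws a child count for each of 5 sampled parents. Note that random.choices needs Python 3.6 or later.

```python
import random
from collections import Counter

# Key columns copied from the demo tables shown in step 1.
user_ids = list(range(10))
session_user_ids = [0, 1, 1, 2, 4, 5, 6, 6, 6, 8]

# Number of sessions owned by each user, including users with none.
sessions_per_user = Counter(session_user_ids)
child_counts = [sessions_per_user[user_id] for user_id in user_ids]

# Empirical distribution: how many users have 0, 1, 2, ... sessions.
distribution = Counter(child_counts)
print(sorted(distribution.items()))  # [(0, 3), (1, 5), (2, 1), (3, 1)]

# For 5 sampled parent rows, draw a number of child rows for each one.
rng = random.Random(0)
sampled_counts = rng.choices(
    list(distribution), weights=list(distribution.values()), k=5
)
print(sampled_counts, "->", sum(sampled_counts), "child rows in total")
```

This is why asking for 5 rows yields exactly 5 users but a variable number of sessions and transactions.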

What's next?

If you would like to see more usage examples, please have a look at the examples folder.

Also do not forget to check the project documentation site!

History

0.1.2 - 2019-09-18

New Features

Add option to model the number of child rows - Issue 93 by @ManuelAlvarezC

General Improvements

Add Evaluation Metrics - Issue 52 by @ManuelAlvarezC
Ensure uniqueness of primary keys across different calls - Issue 63 by @ManuelAlvarezC

Bugs fixed

Error when executing the README: "not supported between instances of 'int' and 'NoneType'" - Issue 104 by @csala

0.1.1 - Anonymization of data

Add warnings when trying to model an unsupported dataset structure. GH#73
Add option to anonymize data. GH#51
Add support for modeling data with different distributions, when using GaussianMultivariate model. GH#68
Add support for VineCopulas as a model. GH#71
Improve GaussianMultivariate parameter sampling, avoiding warnings and invalid parameters. GH#58
Fix issue that sometimes caused sampled categorical values to be mixed with numerical values. GH#81
Improve the validation of extensions. GH#69
Update examples. GH#61
Replaced Table class with a NamedTuple. GH#92
Fix inconsistent dependencies and add upper bound to dependencies. GH#96
Fix error when merging extension in Modeler.CPA when running examples. GH#86

0.1.0 - First Release

First release on PyPI.

Release history

1.12.1 - Apr 19, 2024
1.12.1.dev1 (pre-release) - Apr 19, 2024
1.12.1.dev0 (pre-release) - Apr 19, 2024
1.12.0 - Apr 16, 2024
1.12.0.dev0 (pre-release) - Apr 12, 2024
1.11.0 - Mar 21, 2024
1.11.0.dev0 (pre-release) - Mar 21, 2024
1.10.0 - Feb 15, 2024
1.10.0.dev0 (pre-release) - Feb 15, 2024
1.9.0 - Jan 11, 2024
1.9.0.dev0 (pre-release) - Jan 11, 2024
1.8.0 - Dec 5, 2023
1.8.0.dev0 (pre-release) - Dec 4, 2023
1.7.0 - Nov 16, 2023
1.7.0.dev0 (pre-release) - Nov 15, 2023
1.6.0 - Nov 7, 2023
1.6.0.dev1 (pre-release) - Nov 7, 2023
1.6.0.dev0 (pre-release) - Nov 6, 2023
1.5.0 - Oct 13, 2023
1.5.0.dev0 (pre-release) - Oct 11, 2023
1.4.0 - Aug 23, 2023
1.4.0.dev1 (pre-release) - Aug 23, 2023
1.4.0.dev0 (pre-release) - Aug 22, 2023
1.3.0 - Aug 14, 2023
1.3.0.dev1 (pre-release) - Aug 14, 2023
1.3.0.dev0 (pre-release) - Aug 13, 2023
1.2.2.dev1 (pre-release) - Aug 2, 2023
1.2.2.dev0 (pre-release) - Jul 21, 2023
1.2.1 - Jul 13, 2023
1.2.1.dev0 (pre-release) - Jul 10, 2023
1.2.0 - Jun 7, 2023
1.2.0.dev1 (pre-release) - Jun 7, 2023
1.2.0.dev0 (pre-release) - Jun 6, 2023
1.1.0 - May 10, 2023
1.1.0.dev0 (pre-release) - May 10, 2023
1.0.1 - Apr 20, 2023
1.0.1.dev0 (pre-release) - Apr 19, 2023
1.0.0 - Mar 28, 2023
1.0.0rc0 (pre-release) - Mar 28, 2023
1.0.0b1 (pre-release) - Mar 20, 2023
1.0.0b0 (pre-release) - Feb 24, 2023
0.18.0 - Jan 24, 2023
0.18.0.dev0 (pre-release) - Jan 23, 2023
0.17.2 - Dec 8, 2022
0.17.2.dev0 (pre-release) - Dec 8, 2022
0.17.1 - Sep 29, 2022
0.17.1.dev0 (pre-release) - Sep 29, 2022
0.17.0 - Sep 9, 2022
0.17.0.dev2 (pre-release) - Sep 8, 2022
0.17.0.dev1 (pre-release) - Aug 19, 2022
0.17.0.dev0 (pre-release) - Aug 16, 2022
0.16.0 - Jul 22, 2022
0.16.0.dev5 (pre-release) - Jul 22, 2022
0.16.0.dev4 (pre-release) - Jul 21, 2022
0.16.0.dev3 (pre-release) - Jul 19, 2022
0.16.0.dev2 (pre-release) - Jul 15, 2022
0.16.0.dev1 (pre-release) - Jul 8, 2022
0.16.0.dev0 (pre-release) - Jul 1, 2022
0.15.0 - May 25, 2022
0.15.0.dev1 (pre-release) - May 25, 2022
0.15.0.dev0 (pre-release) - May 24, 2022
0.14.1 - May 3, 2022
0.14.1.dev0 (pre-release) - May 3, 2022
0.14.0 - Mar 21, 2022
0.14.0.dev2 (pre-release) - Mar 14, 2022
0.14.0.dev1 (pre-release) - Mar 9, 2022
0.14.0.dev0 (pre-release) - Mar 4, 2022
0.13.1 - Dec 22, 2021
0.13.1.dev0 (pre-release) - Dec 22, 2021
0.13.0 - Nov 22, 2021
0.13.0.dev0 (pre-release) - Nov 20, 2021
0.12.1 - Oct 12, 2021
0.12.1.dev0 (pre-release) - Oct 12, 2021
0.12.0 - Aug 19, 2021
0.12.0.dev1 (pre-release) - Aug 17, 2021
0.12.0.dev0 (pre-release) - Aug 13, 2021
0.11.0 - Jul 12, 2021
0.11.0.dev0 (pre-release) - Jul 7, 2021
0.10.1 - Jun 11, 2021
0.10.1.dev0 (pre-release) - Jun 10, 2021
0.10.0 - May 21, 2021
0.10.0.dev0 (pre-release) - May 21, 2021
0.9.1 - Apr 29, 2021
0.9.1.dev1 (pre-release) - Apr 29, 2021
0.9.1.dev0 (pre-release) - Apr 28, 2021
0.9.0 - Apr 1, 2021
0.9.0.dev0 (pre-release) - Mar 31, 2021
0.8.0 - Feb 24, 2021
0.8.0.dev0 (pre-release) - Feb 24, 2021
0.7.0 - Jan 28, 2021
0.7.0.dev1 (pre-release) - Jan 27, 2021
0.7.0.dev0 (pre-release) - Jan 27, 2021
0.6.2.dev2 (pre-release) - Jan 27, 2021
0.6.2.dev1 (pre-release) - Jan 25, 2021
0.6.2.dev0 (pre-release) - Jan 20, 2021
0.6.1 - Dec 31, 2020
0.6.0 - Dec 22, 2020
0.6.0.dev0 (pre-release) - Dec 22, 2020
0.5.0 - Nov 25, 2020
0.5.0.dev0 (pre-release) - Nov 25, 2020
0.4.6.dev2 (pre-release) - Nov 16, 2020
0.4.6.dev1 (pre-release) - Nov 9, 2020
0.4.6.dev0 (pre-release) - Nov 4, 2020
0.4.5 - Oct 17, 2020
0.4.4 - Oct 6, 2020
0.4.4.dev0 (pre-release) - Oct 6, 2020
0.4.3 - Sep 28, 2020
0.4.2 - Sep 19, 2020
0.4.1 - Sep 7, 2020
0.4.1.dev0 (pre-release) - Sep 7, 2020
0.4.0 - Aug 8, 2020
0.4.0.dev0 (pre-release) - Aug 8, 2020
0.3.6 - Jul 23, 2020
0.3.6.dev0 (pre-release) - Jul 23, 2020
0.3.5 - Jul 9, 2020
0.3.4 - Jul 4, 2020
0.3.4.dev0 (pre-release) - Jul 4, 2020
0.3.3 - Jun 26, 2020
0.3.2 - Feb 3, 2020
0.3.1 - Jan 22, 2020
0.3.0 - Dec 23, 2019
0.2.2 - Dec 10, 2019
0.2.1 - Nov 25, 2019
0.2.0 - Nov 11, 2019
0.2.0.dev0 (pre-release, this version) - Nov 6, 2019
0.1.2 - Sep 18, 2019
0.1.1 - Apr 2, 2019
0.1.0 - Sep 27, 2018
0.0.0 - Jun 28, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sdv-0.2.0.dev0.tar.gz (78.3 kB)

Uploaded Nov 6, 2019 (Source)

Built Distribution

sdv-0.2.0.dev0-py2.py3-none-any.whl (22.5 kB)

Uploaded Nov 6, 2019 (Python 2, Python 3)

Hashes for sdv-0.2.0.dev0.tar.gz
Algorithm	Hash digest
SHA256	`7eb6587c34b2d7a71235ceef487c2e5555c7ad6f612c3c7d0621018fc5f7c4e8`
MD5	`6a9a708e4ef785f360966554e4ad6259`
BLAKE2b-256	`8d839927889891a01a6f7851574d40660cc5845453ce6fc3cb773ae8f15dbc9b`

Hashes for sdv-0.2.0.dev0-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`5f55785cca7c4afb211b708a0c9bbce1652f3aa35c3c9f5861fb8bc238dcd94d`
MD5	`964f901123e613036fe7275be0869af4`
BLAKE2b-256	`e4de4cce7257fc8c24786916f193300a877cc31ec9faa5ffa2d5699e090580b1`
