Data Lineage Tracing Library

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Programming Language

Project description

An open source project from the Data to AI Lab at MIT.

DataTracer

Data Lineage Tracing Library

License: MIT
Development Status: Pre-Alpha
Homepage: https://github.com/data-dev/DataTracer

Overview

DataTracer is a Python library for solving Data Lineage problems using statistical methods, machine learning techniques, and hand-crafted heuristics.

Currently, the DataTracer library implements discovery of the following properties:

Primary Key: Identify which column is the primary key in each table.
Foreign Key: Find which relationships exist between the tables.
Column Mapping: Given a field in a table, deduce which other fields, from the same table or from other tables, are most closely related to it or contributed the most to generating it.
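
To make the first two properties concrete, here is a minimal sketch of the kind of hand-crafted heuristic that can flag candidates. It is not DataTracer's implementation: the helper names `primary_key_candidates` and `is_foreign_key_candidate` and the toy tables are assumptions made for illustration, and plain lists of dictionaries stand in for pandas.DataFrames.

```python
def primary_key_candidates(rows):
    """Return the columns whose values are all present and unique."""
    if not rows:
        return []
    candidates = []
    for column in rows[0]:
        values = [row[column] for row in rows]
        if None not in values and len(set(values)) == len(values):
            candidates.append(column)
    return candidates


def is_foreign_key_candidate(child_rows, field, parent_rows, ref_field):
    """Check that every child value also appears in the parent column."""
    parent_values = {row[ref_field] for row in parent_rows}
    return all(row[field] in parent_values for row in child_rows)


# Hypothetical toy tables, not taken from any real dataset.
customers = [
    {"customerNumber": 1, "country": "USA"},
    {"customerNumber": 2, "country": "USA"},
    {"customerNumber": 3, "country": "France"},
]
orders = [
    {"orderNumber": 10, "customerNumber": 1},
    {"orderNumber": 11, "customerNumber": 1},
    {"orderNumber": 12, "customerNumber": 3},
]

print(primary_key_candidates(customers))
print(primary_key_candidates(orders))
print(is_foreign_key_candidate(orders, "customerNumber", customers, "customerNumber"))
```

Checks like these only produce candidates: a column can be unique by coincidence, and value inclusion alone does not prove a relationship, which is one reason to combine heuristics with statistical and learned methods.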

Install

Requirements

DataTracer has been developed and tested on Python 3.5, 3.6 and 3.7.

Also, although it is not strictly required, using a virtualenv is highly recommended in order to avoid interfering with other software installed on the system where DataTracer is run.

Install with pip

The easiest and recommended way to install DataTracer is using pip:

pip install datatracer

This will pull and install the latest stable release from PyPI.

If you want to install from source or contribute to the project please read the Contributing Guide.

Data Format: Datasets and Metadata

The DataTracer library works with datasets, each of which is a collection of tables loaded as pandas.DataFrames together with a MetaData JSON that describes the dataset structure.

You can find more information about the MetaData format in the MetaData repository.

DataTracer also includes a few demo datasets, which you can easily download to your computer using the datatracer.get_demo_data function:

from datatracer import get_demo_data

# Download the demo datasets into ./datatracer_demo
get_demo_data()

This will create a folder called datatracer_demo in your working directory with a few datasets ready to use inside it.

Quickstart

In this short tutorial we will guide you through a series of steps that will help you get started with DataTracer.

Load data

The first step will be to load the data in the format expected by DataTracer.

For this, we can use the datatracer.load_datasets function, passing it the path to the folder that contains our datasets.

For example, if we are using the demo datasets, we can load them using:

from datatracer import load_datasets

datasets = load_datasets('datatracer_demo')

This will return a dictionary that maps each dataset name to a tuple containing:

A MetaData instance with details about the dataset.
A dict with all the tables of the dataset, each loaded as a pandas.DataFrame.

For the rest of the tutorial, we will use the dataset called classicmodels for our testing, and use the rest of the datasets to train the DataTracer.

metadata, tables = datasets.pop('classicmodels')
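
Note that `dict.pop` removes the entry from `datasets`, so the held-out dataset is no longer part of whatever is later used for training. A minimal sketch of that behaviour follows; placeholder strings stand in for the real MetaData instances and DataFrames, and the dataset names other than classicmodels are made up for illustration.

```python
# Placeholder values stand in for the (MetaData, tables) tuples that
# load_datasets returns; only the dictionary mechanics are shown here.
datasets = {
    "classicmodels": ("metadata_a", {"orders": "orders_df"}),
    "example_one": ("metadata_b", {"users": "users_df"}),
    "example_two": ("metadata_c", {"items": "items_df"}),
}

# pop() returns the tuple and removes the key in a single step.
metadata, tables = datasets.pop("classicmodels")

print(sorted(tables))    # table names of the held-out dataset
print(sorted(datasets))  # datasets still available for training
```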

Select a Pipeline

In the DataTracer project, the different Data Lineage problems are solved using what we call pipelines.

We can see the list of available pipelines using the get_pipelines function:

from datatracer import get_pipelines

get_pipelines()

This will return a list with the names of the available pipelines:

['datatracer.column_map',
 'datatracer.detection.primary',
 'datatracer.foreign_key.basic',
 'datatracer.foreign_key.standard',
 'datatracer.primary_key.basic']
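
Most of these names appear to follow a `datatracer.<problem>.<variant>` pattern. Assuming that pattern holds, the list can be narrowed down with plain string operations. The sketch below hard-codes the list shown above instead of calling get_pipelines, and `pipelines_for` is a hypothetical helper, not part of the library.

```python
pipelines = [
    "datatracer.column_map",
    "datatracer.detection.primary",
    "datatracer.foreign_key.basic",
    "datatracer.foreign_key.standard",
    "datatracer.primary_key.basic",
]


def pipelines_for(problem, names):
    """Return the names whose second dotted component equals `problem`."""
    # Slicing with [1:2] avoids an IndexError on names without a dot.
    return [name for name in names if name.split(".")[1:2] == [problem]]


print(pipelines_for("foreign_key", pipelines))
```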

Use a DataTracer instance to find table relationships

In order to use a pipeline, we need to create a DataTracer instance, passing the name of the pipeline that we want to use.

In this example, we will try to figure out the relationships between the tables in our dataset by using the pipeline datatracer.foreign_key.standard.

from datatracer import DataTracer

# Create the DataTracer instance
dtr = DataTracer('datatracer.foreign_key.standard')

# Fit it to our training datasets
dtr.fit(datasets)

# Solve the Data Lineage problem
foreign_keys = dtr.solve(tables)

The result will be a list of dictionaries, each one describing a foreign key candidate:

[{'table': 'products',
  'field': 'productLine',
  'ref_table': 'productlines',
  'ref_field': 'productLine'},
 {'table': 'payments',
  'field': 'customerNumber',
  'ref_table': 'customers',
  'ref_field': 'customerNumber'},
 {'table': 'orders',
  'field': 'customerNumber',
  'ref_table': 'customers',
  'ref_field': 'customerNumber'},
 {'table': 'orderdetails',
  'field': 'productCode',
  'ref_table': 'products',
  'ref_field': 'productCode'},
 {'table': 'orderdetails',
  'field': 'orderNumber',
  'ref_table': 'orders',
  'ref_field': 'orderNumber'},
 {'table': 'employees',
  'field': 'officeCode',
  'ref_table': 'offices',
  'ref_field': 'officeCode'}]
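
Because each candidate is a plain dictionary, the result is easy to post-process with standard Python. As a sketch, the snippet below groups a hard-coded copy of the first three candidates above by the table they reference; `references_by_table` is a hypothetical helper written for this example.

```python
from collections import defaultdict

foreign_keys = [
    {"table": "products", "field": "productLine",
     "ref_table": "productlines", "ref_field": "productLine"},
    {"table": "payments", "field": "customerNumber",
     "ref_table": "customers", "ref_field": "customerNumber"},
    {"table": "orders", "field": "customerNumber",
     "ref_table": "customers", "ref_field": "customerNumber"},
]


def references_by_table(candidates):
    """Map each referenced table to the (table, field) pairs pointing at it."""
    grouped = defaultdict(list)
    for fk in candidates:
        grouped[fk["ref_table"]].append((fk["table"], fk["field"]))
    return dict(grouped)


for ref_table, sources in references_by_table(foreign_keys).items():
    print(ref_table, "<-", sources)
```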

History

0.0.1 - 2020-05-22

First release.

Features:

Primary Key Detection
Foreign Key Detection
Column Mapping

Release history

0.0.6

Jun 19, 2020

0.0.6.dev0 pre-release

Jun 19, 2020

0.0.5

Jun 12, 2020

0.0.5.dev1 pre-release

Jun 12, 2020

0.0.5.dev0 pre-release

Jun 12, 2020

0.0.4

Jun 5, 2020

0.0.4.dev0 pre-release

Jun 5, 2020

0.0.3

May 28, 2020

0.0.3.dev0 pre-release

May 28, 2020

0.0.2

May 26, 2020

This version

0.0.2.dev0 pre-release

May 26, 2020

0.0.1

May 23, 2020

Download files

Source Distribution

datatracer-0.0.2.dev0.tar.gz (420.9 kB)

Uploaded May 26, 2020 Source

Built Distribution

datatracer-0.0.2.dev0-py2.py3-none-any.whl (207.4 kB)

Uploaded May 26, 2020 Python 2 Python 3

Hashes for datatracer-0.0.2.dev0.tar.gz
Algorithm	Hash digest
SHA256	`cf18437bc75c13890d56502da8b5867ca4e8492518b9b098b55d35a23d3d3076`
MD5	`1c40f52c2f9c11c9ccd9e3a438ef6071`
BLAKE2b-256	`458d887b9399e4f941b52619a0c174f837e35fceaff793ab64d756aba277aa61`

Hashes for datatracer-0.0.2.dev0-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`a02b5476c08d6b1d17675b64f6531a195b795b1143bac008f3bdf197abf57044`
MD5	`45828890069a9c2af797a13bae8e702c`
BLAKE2b-256	`2fc847d6b80115e4ed4f4da2e9a70ae2eef9c04b7461ea82f7160e4b6bc329ec`