AFD measures

A collection of measures for Approximate Functional Dependencies in relational data. Additionally, this repository contains all artifacts for the paper "Approximately Measuring Functional Dependencies: a Comparative Study".

Short description

In real-world research projects, we often encounter unknown relational (tabular) datasets. Functional dependencies (FDs) help us process such datasets efficiently: they give structural insight into relational data by describing strong relationships between columns. Errors in real-world data, however, cause traditional FD detection techniques to fail. Hence we consider approximate FDs (AFDs): FDs that hold approximately in relational data.

This repository contains the implemented measures as well as all artifacts for the paper "Approximately Measuring Functional Dependencies: a Comparative Study".
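
As a rough illustration of what "approximately holds" can mean, the sketch below computes the classic g3 error: the minimum fraction of rows that must be removed so that an FD holds exactly. The function name g3_error and the toy zip/city table are illustrative assumptions. This is not necessarily how afd-measures implements any of its measures, and it assumes the columns contain no missing values.

```python
import pandas as pd


def g3_error(df: pd.DataFrame, lhs: str, rhs: str) -> float:
    """Minimum fraction of rows to delete so that lhs -> rhs holds exactly.

    Illustrative sketch only; assumes no missing values in lhs or rhs.
    """
    if df.empty:
        return 0.0
    # Count how often each (lhs, rhs) value combination occurs.
    pair_counts = df.groupby([lhs, rhs]).size()
    # For every lhs value, keep only the rows carrying its most frequent rhs value.
    rows_kept = pair_counts.groupby(level=0).max().sum()
    return 1.0 - rows_kept / len(df)


# Toy relation: one misspelled city violates the FD zip -> city.
table = pd.DataFrame(
    {
        "zip": ["1000", "1000", "1000", "2000", "2000"],
        "city": ["Brussels", "Brussels", "Brusels", "Antwerp", "Antwerp"],
    }
)
error = g3_error(table, lhs="zip", rhs="city")
print(f"g3 error of zip -> city: {error:.2f}")
```

In the toy table, removing the single misspelled row makes zip -> city hold exactly, so the error is one row out of five.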

Overview

code: this directory holds the code used to generate the results in the paper
- afd_measures: all Python source code relating to the implemented AFD measures
- experiments: Jupyter notebooks containing the processing steps to generate the results, figures or tables in the paper
- synthetic_data: all Python source code relating to the synthetic data generation process
data: the datasets used in the paper
- rwd: manually annotated dataset of files found on the web (see data/ground_truth.csv)
- rwd_e: datasets from rwd with errors introduced into them. Generated by the notebook code/experiments/create_rwd_e_dataset.ipynb.
- syn_e: synthetic dataset generated focussing on errors. Generated by the notebook code/experiments/create_syn_e.ipynb
- syn_u: synthetic dataset generated focussing on left-hand side uniqueness. Generated by the notebook code/experiments/create_syn_u.ipynb
- syn_s: synthetic dataset generated focussing on right-hand side skewness. Generated by the notebook code/experiments/create_syn_s.ipynb
paper: a full version of the paper including all proofs.
results: results of applying the AFD measures to the datasets.

Installation (measure library)

This library can be found on PyPI: afd-measures. Install it using pip like this:

pip install afd-measures

Usage (measure library)

To apply one of the measures to your data, you need a pandas DataFrame containing your relation. pandas is installed automatically as a dependency of afd-measures. You can start with this Python snippet to analyse your own data (a CSV file in this example):

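import afd_measures
import pandas as pd

# Load the relation into a pandas DataFrame.
my_data = pd.read_csv("my_amazing_table.csv")
# "X" and "Y" stand for the left-hand side and right-hand side columns of the candidate FD.
print(afd_measures.mu_plus(my_data, lhs="X", rhs="Y"))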
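
When the interesting column pairs are not known in advance, one option is to score every ordered pair of columns. The sketch below assumes only that a measure can be called as measure(df, lhs=..., rhs=...), as in the snippet above. The helper score_all_pairs and the stand-in measure exact_fd_indicator are hypothetical names that are not part of afd-measures; the stand-in keeps the example runnable without the library installed.

```python
from itertools import permutations
from typing import Callable

import pandas as pd


def score_all_pairs(df: pd.DataFrame, measure: Callable[..., float]) -> pd.DataFrame:
    """Apply `measure` to every ordered pair of distinct columns (hypothetical helper)."""
    rows = [
        {"lhs": lhs, "rhs": rhs, "score": measure(df, lhs=lhs, rhs=rhs)}
        for lhs, rhs in permutations(df.columns, 2)
    ]
    return pd.DataFrame(rows)


def exact_fd_indicator(df: pd.DataFrame, lhs: str, rhs: str) -> float:
    """Stand-in measure, not from afd-measures: 1.0 if lhs -> rhs holds exactly, else 0.0."""
    # The FD holds exactly when every lhs value maps to at most one distinct rhs value.
    return float(df.groupby(lhs)[rhs].nunique().le(1).all())


my_data = pd.DataFrame(
    {"X": [1, 1, 2, 2], "Y": ["a", "a", "b", "b"], "Z": [1, 2, 3, 4]}
)
scores = score_all_pairs(my_data, exact_fd_indicator)
print(scores)
```

With the library installed, a measure such as afd_measures.mu_plus could be passed in place of the stand-in, assuming it keeps the call signature shown in the usage snippet.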

Installation (experiments)

To reproduce our experiments, clone this repository and install all requirements with Poetry (preferred) or Conda.

Poetry

Install the requirements using Poetry. Use the extras flag with "experiments" to install all additional requirements needed for the experiments. These include (amongst others) Jupyter Lab.

$ poetry install -E experiments
$ jupyter lab

Conda

Create a new environment from the conda_environment.yaml file, activate it (conda activate followed by the environment name defined in that file), and run Jupyter Lab to investigate the code.

$ conda env create -f conda_environment.yaml
$ jupyter lab

Dataset References

In addition to this repository, our benchmark is also available on Zenodo.

adult.csv: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
claims.csv: TSA Claims Data 2002 to 2006, published by the U.S. Department of Homeland Security.
dblp10k.csv: Dustin Lange and Felix Naumann. 2011. Frequency-aware Similarity Measures. 243–248. Made available as DBLP Dataset 2.
hospital.csv: Hospital dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection for that paper.
t_biocase_... files: t_bioc_... files used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection for that paper.
tax.csv: Tax dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection for that paper.

Release history

1.0.0 (this version): Oct 18, 2023
0.9.2: Oct 18, 2023
0.9.1: Oct 18, 2023
0.9.0: Oct 18, 2023
