DataCockpit is a package to analyze the usage and quality metrics of your database.


DataCockpit


DataCockpit is a Python toolkit that leverages the data, usage logs, and associated meta-data within a data lake and provisions two kinds of auxiliary information:

  1. quality – the validity and appropriateness of data required to perform certain analytic tasks.
  2. usage – the historical utilization characteristics of data across multiple users.

Architecture

[DataCockpit architecture diagram]

Install

DataCockpit is written in Python 3. Please ensure you have a Python 3 environment already installed.

Installing DataCockpit is as simple as making microwave popcorn! Just pip install datacockpit and then sit back and let it do the work while you put your feet up.


How to Use

from datacockpit import DataCockpit

# Setup database connection with SQL engine and log-file CSV
dcp_obj = DataCockpit(engine=your_sqlalchemy_engine,
        logs_path="your/logs/path.csv")

# Compute and persist quality & usage metrics
dcp_obj.compute_quality(levels=None, metrics=None)
dcp_obj.compute_usage(levels=None, metrics=None)

  • First, we initialize a SQLAlchemy engine that is used to connect to the database.
  • Then we create an object, dcp_obj, of the DataCockpit class by passing the engine and logs_path as arguments. logs_path points to the location where the historical usage logs (SQL queries plus metadata such as the user who ran each query and the timestamp) are saved as a CSV file.
  • The next two lines call methods to compute and persist quality and usage metrics, with parameters to support different levels (e.g., ['attribute', 'record', 'all']) and metrics (e.g., ['correctness', 'completeness', 'all']).
  • The compute_ methods persist the computed metrics to database tables.
  • You can retrieve the computed metrics for use in downstream applications through the get_ methods below.
# Retrieve computed information for use in downstream applications
quality_info = dcp_obj.get_quality()
usage_info = dcp_obj.get_usage()
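
Putting it together, here is a minimal end-to-end sketch. The SQLite connection URL and the logs path below are hypothetical placeholders; substitute your own database and log file.

from sqlalchemy import create_engine
from datacockpit import DataCockpit

# Hypothetical connection URL and logs path; replace with your own
engine = create_engine("sqlite:///warehouse.db")
dcp_obj = DataCockpit(engine=engine, logs_path="logs/query_logs.csv")

# Compute attribute- and record-level quality, and usage with default parameters
dcp_obj.compute_quality(levels=["attribute", "record"],
                        metrics=["completeness", "correctness"])
dcp_obj.compute_usage(levels=None, metrics=None)

# Retrieve the persisted metrics
quality_info = dcp_obj.get_quality()
usage_info = dcp_obj.get_usage()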

Depending on your data and query patterns, the get_quality() and get_usage() functions will return the following quality and usage information.


Data Quality Information


Attribute metrics table (attribute_metrics):

  • Completeness is the percentage of non-missing values for an attribute.
  • Correctness is the percentage of correct values for an attribute based on pre-defined constraints.
  • Objectivity is the amount of distortion in the data distribution.
  • Uniqueness is the percentage of unique values for an attribute.
[attribute metrics table]

Record metrics table (record_metrics):

  • Completeness is the percentage of non-missing values in each dataset record.
  • Correctness is the percentage of correct values in each dataset record.
  • Uniqueness is the percentage of unique values in each dataset record.
[record metrics table]
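
To make these definitions concrete, here is an illustrative pandas sketch that computes similar attribute-level percentages by hand on a toy column. This is not DataCockpit's implementation; the exact denominators and constraint handling inside DataCockpit may differ.

import pandas as pd

# Toy attribute with missing and out-of-range values
age = pd.Series([25, None, 31, 25, None, 140])

# Completeness: % of non-missing values (4 of 6 -> ~66.7)
completeness = 100 * age.notna().mean()

# Correctness: % of non-missing values satisfying a pre-defined constraint,
# here 0 <= age <= 120 (3 of 4 -> 75.0)
correctness = 100 * age.between(0, 120).sum() / age.notna().sum()

# Uniqueness: % of distinct values among the non-missing ones (3 of 4 -> 75.0)
uniqueness = 100 * age.dropna().nunique() / age.notna().sum()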

Data Usage Information

The usage metrics are fairly self-explanatory. The SQL queries are parsed to build the Metadata Table, which shows usage statistics for every attribute in the datasets (analogous to tables). The Aggregate Table and the Dataset Usage Table are rolled up from the Metadata Table. Other analyses, such as time-series analyses, are shown in the Jupyter notebooks in the assets/notebooks directory; these run on a sample usage file available in assets/data/query_logs.csv.
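
As a rough illustration of this roll-up idea (not DataCockpit's actual SQL parsing, and assuming hypothetical query and user columns in the log CSV):

import pandas as pd

# One row per executed SQL query; column names are assumed for illustration
logs = pd.read_csv("assets/data/query_logs.csv")

# Crude roll-up: queries and distinct users per table mentioned after FROM.
# DataCockpit parses the SQL properly; this regex is only a stand-in.
logs["table"] = logs["query"].str.extract(r"(?i)\bfrom\s+(\w+)", expand=False)
usage_by_table = logs.groupby("table").agg(
    num_queries=("query", "count"),
    num_users=("user", "nunique"),
)
print(usage_by_table)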


Metadata table (dcp_metadata):

[usage metadata table]

Aggregate table (dcp_aggr):

[usage aggregate table]

Data usage (dcp_dataset_usage)

[usage dataset table]

Build

DataCockpit can be installed as a Python package and imported in your own awesome applications!

  • DataCockpit is written in Python 3. Please ensure you have a Python 3 environment already installed.
  • Clone this repository (review branch) and enter (cd) into it.
  • Create a new virtual environment: virtualenv --python=python3 venv
  • Activate it: source venv/bin/activate (macOS/Linux) or venv\Scripts\activate.bat (Windows)
  • Install dependencies: python -m pip install -r requirements.txt
  • <make your changes>
  • Bump up the version in setup.py and create a Python distributable: python setup.py sdist
  • This will create a new file, datacockpit-*.*.*.tar.gz, inside the dist directory.
  • Install the above file in your Python environment: python -m pip install <PATH-TO-datacockpit-*.*.*.tar.gz>
  • Verify by opening your Python console and importing it:
>>> from datacockpit import DataCockpit
  • Enjoy! DataCockpit is now available for use as a Python package.

Analysis

Look at assets for examples of how to use the metrics obtained, such as visualizing the temporal trends in data or finding the most critical attributes.


Credits

DataCockpit was created by Arpit Narechania, Fan Du, Atanu R. Sinha, Ryan A. Rossi, Jane Hoffswell, Shunan Guo, Eunyee Koh, Surya Chakraborty, Shivam Agarwal, Shamkant B. Navathe, and Alex Endert.


License

The software is available under the MIT License.


Contact

If you have any questions, feel free to open an issue or contact Arpit Narechania, Surya Chakraborty (chakraborty [at] gatech.edu), or Shivam Agarwal (s.agarwal [at] gatech.edu).
