pandera-report: row-based data quality reporting, powered by pandera.

Development Status
- 4 - Beta
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
Topic
- Scientific/Engineering

Project description

Pandera Extension for row-based reporting

🚀 Description

pandera provides a flexible and expressive API for performing data validation on dataframe-like objects, making data processing pipelines more readable and robust.

If you have to report potential quality issues found while validating a dataframe with pandera, then pandera-report is your friend. It takes the validation failures that pandera reports and appends them to your original dataframe on a row-by-row basis.

With pandera-report, you can:

- Integrate seamlessly with the pandera library and gain enhanced data validation capabilities without interfering with pandera's own functionality.
- Enrich your data with information about why specific rows failed validation.

⚡ Setup

Using pip:

pip install pandera-report

Using poetry:

poetry add pandera-report

Quick start

The following example is taken from the pandera documentation. It defines a DataFrameSchema against which the provided dataframe validates successfully.

import pandas as pd
import pandera as pa


# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"]
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a Series as input and
        # output a boolean or a boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

validated_df = schema(df)
print(validated_df)

#     column1  column2  column3
#  0        1     -1.3  value_1
#  1        4     -1.4  value_2
#  2        0     -2.9  value_3
#  3       10    -10.1  value_2
#  4        9    -20.4  value_1

To use pandera-report with the same schema and dataframe, do this:

from pandera_report import DataFrameValidator  # import path assumed from the package name

validator = DataFrameValidator()  # defaults: quality_report=True, lazy=True
print(validator.validate(schema, df))

#     column1  column2  column3 quality_issues quality_status
#  0        1     -1.3  value_1           None          Valid
#  1        4     -1.4  value_2           None          Valid
#  2        0     -2.9  value_3           None          Valid
#  3       10    -10.1  value_2           None          Valid
#  4        9    -20.4  value_1           None          Valid

As you can see, the result is the same, but extended with two columns stating that every row passed validation. These extra columns can also be switched off for the case where everything is 100% valid.

But what if the dataframe contains data quality issues? pandera raises SchemaErrors (with lazy validation) or SchemaError (without it). Let's see what pandera-report does when we change the dataframe so that it violates the schema definition:

# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value1"]
})

validator = DataFrameValidator()
print(validator.validate(schema, df))

#     column1  column2  column3                              quality_issues quality_status
#  0        1     -1.3  value_1                                        None          Valid
#  1        4     -1.4  value_2                                        None          Valid
#  2        0     -2.9  value_3                                        None          Valid
#  3       10    -10.1  value_2                                        None          Valid
#  4        9    -20.4   value1  Column <column3>: str_startswith('value_')        Invalid

Why is this useful? It becomes particularly valuable when someone else is responsible for preparing the input file: the row-level report tells them exactly which rows to fix before the file can be processed into a valid DataFrame.
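As a sketch of that hand-off, the report can be split into rows to send back and rows that are ready for processing. The dataframe below is typed in from the example output above rather than produced by pandera-report, and the variable names are only illustrative; the column names `quality_issues` and `quality_status` are the ones shown in that output.

```python
import pandas as pd

# Report as shown in the example output above, rebuilt by hand for illustration.
report = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value1"],
    "quality_issues": [None, None, None, None,
                       "Column <column3>: str_startswith('value_')"],
    "quality_status": ["Valid", "Valid", "Valid", "Valid", "Invalid"],
})

# Rows to send back to whoever prepared the file, with the reason attached.
invalid_rows = report[report["quality_status"] == "Invalid"]

# Rows that can continue through the pipeline, without the report columns.
valid_rows = report[report["quality_status"] == "Valid"].drop(
    columns=["quality_issues", "quality_status"]
)

# Serialize the feedback, keeping the original row index for reference.
feedback_csv = invalid_rows.to_csv(index=True)
print(feedback_csv)
```

Keeping the original index in the feedback lets the provider locate each failing row in their source file.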

Release history

- 0.1.1 (this version): Oct 21, 2023
- 0.1.0: Sep 22, 2023

Download files

Source Distribution

pandera_report-0.1.1.tar.gz (9.5 kB), uploaded Oct 21, 2023

Built Distribution

pandera_report-0.1.1-py3-none-any.whl (8.3 kB), uploaded Oct 21, 2023, Python 3

Hashes for pandera_report-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`94f4911985c3cb041f7a4772294ae7b4731809526a6bc4ec0c5271ea8936b90e`
MD5	`ad56d80792f9a4e28a53021a1e131a30`
BLAKE2b-256	`4960aae789274e67c0e33f5797f456096fa3afdbd6078501e663cb38a1eae73c`

Hashes for pandera_report-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d4a6b0b55097333bbe49eaefc1710b4383c462458fa20d9f4c8f9be984f93457`
MD5	`e8f64125d7c488247b4ebf7b73f3a21e`
BLAKE2b-256	`172bb5fb4ae82b99eab3e95936664cce02b3dde922a776bbe6bc5d16ba335765`