Clean your data using a scikit-learn transformer in a single line of code


Project description

pandas_dq

Analyze and clean your data in a single line of code with a Scikit-Learn compatible Transformer.

What is pandas_dq
How to use pandas_dq
How to install pandas_dq
Usage
API
Maintainers
Contributing
License

Introduction

What is pandas_dq?

pandas_dq is a new Python library for automatically cleaning your dirty dataset using pandas and scikit-learn functions. You can analyze your dataset and fix its issues, all in a single line of code!

Uses

pandas_dq has two important components: the find_dq function and the Fix_DQ class.

1. find_dq function

`find_dq` is probably the most popular way to use pandas_dq. It performs the following data quality analysis steps:

It detects missing values and suggests imputing them with the mean, median, mode, or a constant value.
It identifies rare categories and suggests grouping them into a single category or dropping them.
It finds infinite values and suggests replacing them with NaN or a large value.
It detects mixed data types and suggests converting them to a single type or splitting them into multiple columns.
It detects outliers and suggests removing them or using robust statistics.
It detects high cardinality features and suggests reducing them using encoding techniques or feature selection methods.
It detects highly correlated features and suggests dropping one of them or using dimensionality reduction techniques.
It detects duplicate rows and columns and suggests dropping them so that only one copy is kept.
It detects skewed distributions and suggests applying transformations or scaling techniques.
It detects imbalanced classes and suggests using resampling techniques or class weights.
It detects feature leakage and suggests avoiding features that are not available at prediction time.
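
To make a few of these checks concrete, here is a minimal sketch in plain pandas and numpy. It is not the implementation used by `find_dq`; the function name `basic_quality_report` and its `rare_threshold` argument are hypothetical, and only four of the checks listed above (missing values, infinite values, rare categories, duplicate rows) are covered.

```python
import numpy as np
import pandas as pd


def basic_quality_report(df: pd.DataFrame, rare_threshold: float = 0.05) -> dict:
    """Collect a few simple data quality findings (hypothetical helper, not part of pandas_dq)."""
    report = {}

    # Columns with missing values and how many values are missing
    missing = df.isna().sum()
    report["missing"] = missing[missing > 0].to_dict()

    # Numeric columns containing +inf or -inf
    numeric = df.select_dtypes(include="number")
    inf_counts = np.isinf(numeric).sum()
    report["infinite"] = inf_counts[inf_counts > 0].to_dict()

    # Values in non-numeric columns whose relative frequency is below the threshold
    rare = {}
    for col in df.select_dtypes(exclude="number").columns:
        freq = df[col].value_counts(normalize=True)
        rare_values = freq[freq < rare_threshold].index.tolist()
        if rare_values:
            rare[col] = rare_values
    report["rare_categories"] = rare

    # Number of fully duplicated rows, not counting the first copy
    report["duplicate_rows"] = int(df.duplicated().sum())
    return report


df = pd.DataFrame(
    {
        "age": [25, 32, np.nan, 41, 25],
        "income": [50_000.0, np.inf, 62_000.0, 58_000.0, 50_000.0],
        "city": ["NY", "NY", "NY", "LA", "NY"],
    }
)
report = basic_quality_report(df, rare_threshold=0.25)
print(report)
```

In this toy frame the last row repeats the first, `age` has one missing value, `income` has one infinite value, and "LA" falls below the 25% frequency threshold chosen for the example.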

2. Fix_DQ class: a scikit-learn transformer that can detect data quality issues and clean them, all in one line of code

`Fix_DQ` is a great way to clean an entire training dataset and then apply the same steps to a test dataset in an MLOps pipeline. `Fix_DQ` detects most issues in your data (similar to `find_dq`, but without the target-related steps) in a single step, during the `fit` method. The fitted transformer can then be saved (or "pickled") so that the same steps can be applied to test data, either at the same time or later.
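
The save-and-reuse workflow can be sketched with Python's standard `pickle` module. To keep the example self-contained, scikit-learn's `SimpleImputer` is used as a stand-in for `Fix_DQ`; the assumption is that a fitted `Fix_DQ` object can be pickled the same way as other scikit-learn transformers.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X_train = pd.DataFrame({"age": [25.0, np.nan, 41.0], "income": [50.0, 60.0, np.nan]})
X_test = pd.DataFrame({"age": [np.nan, 30.0], "income": [55.0, np.nan]})

# Stand-in for Fix_DQ: fit a cleaning transformer on the training data only
cleaner = SimpleImputer(strategy="median")
cleaner.fit(X_train)

# Serialize the fitted transformer (pickle.dump with an open file would persist it to disk)
payload = pickle.dumps(cleaner)

# Later, or in another process: restore it and apply the steps learned from X_train
restored = pickle.loads(payload)
X_test_clean = restored.transform(X_test)
print(X_test_clean)
```

The restored object fills the gaps in `X_test` with the medians learned from `X_train`, which is the point of fitting once and reusing the same transformer on new data.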

How can we use Fix_DQ in GridSearchCV to find the best model pipeline?

This is another way to find the best data cleaning steps for your training data and then use the cleaned data in hyperparameter tuning with GridSearchCV or RandomizedSearchCV, along with a LightGBM, XGBoost, or scikit-learn model.
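
A minimal sketch of the idea, using only scikit-learn: a cleaning step and a model sit in one `Pipeline`, and `GridSearchCV` searches over the parameters of both. `SimpleImputer` is a stand-in for `Fix_DQ` here. Whether `Fix_DQ` exposes its arguments to `GridSearchCV` in the same way is an assumption that should be checked against the library.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data: the target depends on columns "a" and "b"
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(60, 3)), columns=["a", "b", "c"])
y = (X["a"] + X["b"] > 0).astype(int)
# Knock out some values so the cleaning step has work to do
X.iloc[::7, 2] = np.nan

pipeline = Pipeline(
    steps=[
        ("clean", SimpleImputer()),  # stand-in for Fix_DQ
        ("model", LogisticRegression(max_iter=1000)),
    ]
)

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "clean__strategy": ["mean", "median"],
    "model__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because the cleaning step is inside the pipeline, it is refit on each training fold, so statistics learned during cleaning never leak from the validation fold.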

Install

Prerequisites:

pandas_dq is built using pandas, numpy and scikit-learn - that's all. It should run on almost all Python3 Anaconda installations without additional installs. You won't have to import any special libraries.

The best method to install pandas_dq is to use pip:

pip install pandas_dq

To install from source:

cd <pandas_dq_Destination>
git clone git@github.com:AutoViML/pandas_dq.git

or download and unzip https://github.com/AutoViML/pandas_dq/archive/master.zip

conda create -n <your_env_name> python=3.7 anaconda
conda activate <your_env_name> # on older conda versions: `source activate <your_env_name>` (Linux/macOS) or `activate <your_env_name>` (Windows)
cd pandas_dq
pip install -r requirements.txt

Usage

You can invoke `Fix_DQ` as a scikit-learn compatible fit and transform object. See syntax below.

from pandas_dq import Fix_DQ

# Call the transformer to print data quality issues 
# as well as clean your data - all in one step
# Create an instance of the fix_data_quality transformer with default parameters
fdq = Fix_DQ()

# Fit the transformer on X_train and transform it
X_train_transformed = fdq.fit_transform(X_train)

# Transform X_test using the fitted transformer
X_test_transformed = fdq.transform(X_test)

If you are not using the transformer, you can simply call the find_dq function:

from pandas_dq import find_dq
find_dq(df, target=target, verbose=0)

API

pandas_dq has a very simple API. To use it, create a scikit-learn compatible transformer object by importing Fix_DQ from the pandas_dq library.

Once you have imported it, you can define the object with several options, described below.

Arguments

Caution: X_train and y_train must be pandas DataFrames or pandas Series. I have not tested it on numpy arrays. You can try your luck.

find_dq has only 3 arguments:

df: a pandas DataFrame.
target: default: None. Otherwise, it should be a string naming a column in df. You can leave it as None if you don't want any target-related issues reported.
verbose: This has 2 possible states:
0: limited output. Prints only high-level data quality issues.
1: more verbose output. Great for seeing the details behind each issue and what the suggestions are.

Fix_DQ has slightly more arguments:

- quantile: float (0.75): Define a threshold for IQR for outlier detection. Could be any float between 0 and 1.
- cat_fill_value: string ("missing"): Define a fill value for missing categories in your object or categorical variables. This is a global default for your entire dataset. I will try to change it to a dictionary so that you can specify different values for different columns.
- num_fill_value: integer or float (999): Define a fill value for missing numbers in your integer or float variables. This is a global default for your entire dataset. I will try to change it to a dictionary so that you can specify different values for different columns.
- rare_threshold: float (0.05): Define a threshold for rare categories. If a certain category in a column makes up less than 5% (say) of samples, it will be considered rare. All rare categories will be merged into a single category value called "Rare".
- correlation_threshold: float (0.8): Define a correlation limit. A variable whose correlation with another variable is above this limit will be dropped.
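
To illustrate what `rare_threshold` and `correlation_threshold` mean in practice, here is a small sketch in plain pandas. The helper names `merge_rare_categories` and `drop_correlated` are hypothetical and this is not the code inside `Fix_DQ`. In particular, which column of a correlated pair the library drops is not stated above, so dropping the later column is an assumption made for the example.

```python
import numpy as np
import pandas as pd


def merge_rare_categories(s: pd.Series, rare_threshold: float = 0.05) -> pd.Series:
    """Replace categories rarer than the threshold with the label "Rare"."""
    freq = s.value_counts(normalize=True)
    rare_values = freq[freq < rare_threshold].index
    return s.where(~s.isin(rare_values), "Rare")


def drop_correlated(df: pd.DataFrame, correlation_threshold: float = 0.8) -> pd.DataFrame:
    """Drop the later column of each pair whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > correlation_threshold).any()]
    return df.drop(columns=to_drop)


# "teal" makes up 2% of the samples, below the 5% threshold
colors = pd.Series(["red"] * 48 + ["blue"] * 50 + ["teal"] * 2)
merged = merge_rare_categories(colors, rare_threshold=0.05)
print(merged.value_counts().to_dict())

# "x_doubled" is perfectly correlated with "x"; "noise" is only weakly correlated
numbers = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0, 5.0],
        "x_doubled": [2.0, 4.0, 6.0, 8.0, 10.0],
        "noise": [3.0, -1.0, 4.0, -1.0, 2.0],
    }
)
reduced = drop_correlated(numbers, correlation_threshold=0.8)
print(reduced.columns.tolist())
```

With these inputs, the two "teal" rows are relabeled "Rare" and the `x_doubled` column is dropped, while `x` and `noise` are kept.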

Maintainers

@AutoViML

Contributing

See the contributing file!

PRs accepted.

License

Note of Gratitude

This library would not have been possible without the help of ChatGPT and Bard. This library is dedicated to the thousands of people who worked to create LLMs.

DISCLAIMER

This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.

Release history

1.29: Dec 13, 2023
1.28: Jun 4, 2023
1.27: Jun 4, 2023
1.22: May 3, 2023
1.21: May 2, 2023
1.20: May 2, 2023
1.12: Apr 26, 2023
1.11: Apr 26, 2023
1.10: Apr 22, 2023
1.9: Apr 14, 2023
1.8: Apr 7, 2023
1.7: Apr 7, 2023
1.6: Apr 7, 2023
1.5: Apr 7, 2023
1.4: Apr 7, 2023
1.3: Apr 6, 2023
1.2: Apr 3, 2023
1.1: Apr 3, 2023
1.0: Apr 3, 2023 (this version)

Download files

Source Distribution

pandas_dq-1.0.tar.gz (14.0 kB)

Uploaded Apr 3, 2023 Source

Built Distribution

pandas_dq-1.0-py3-none-any.whl (13.8 kB)

Uploaded Apr 3, 2023 Python 3

Hashes for pandas_dq-1.0.tar.gz
Algorithm	Hash digest
SHA256	`bb38c2cf48ae73a815f460809a57aaf7fa3908c907a9f0653381c5a7307f33d7`
MD5	`0734bd624e11902428fe415012932bb9`
BLAKE2b-256	`9e8d43dee06ea0b5a5bb3ebc21b06364dfed20cc79cc8b2b04237c96027fd269`

Hashes for pandas_dq-1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`731f3c0ee24c7092cd0a9be3c6f6cfa435fada543e6e4dc4d4daadeb1201ed76`
MD5	`7b172822335b346821598230b9eb8503`
BLAKE2b-256	`6c12bedabbf27a642eecd3089106560515031b3ffb1bb512d4a5f9a0134c7165`