A pandas extension for cleaning datasets.
Pandas-cleaner
Pandas-cleaner is a Python package, built on top of pandas, that provides methods to detect, analyze and clean errors in datasets with different types of data (numerical, categorical, text, datetimes...).
Features
Pandas-cleaner offers functionalities to automatically:
:arrow_right: detect different kinds of potential errors in datasets, such as outliers, inconsistencies, typos, wrongly typed values, ..., given predefined rules or statistical estimations, via an easy-to-use API extending pandas,
:arrow_right: analyze these errors, via reports and plots, to check the validity of the dataset and/or decide whether any correction is needed,
:arrow_right: clean the datasets, either by dropping the rows with errors or by emptying, correcting or replacing bad values,
:arrow_right: reapply the same rules to any other incoming fresh data.
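The last point, reapplying rules to fresh data, can be pictured with plain pandas. The sketch below does not use pandas-cleaner's API: `BoundedRule` is a hypothetical helper, introduced here only for illustration, that stores its bounds once and flags out-of-range values in whatever series it receives.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class BoundedRule:
    """Hypothetical rule object (not part of pandas-cleaner): fixed bounds."""

    lower: float
    upper: float

    def errors(self, series: pd.Series) -> pd.Series:
        """Return the values lying outside the closed interval [lower, upper]."""
        in_range = series.between(self.lower, self.upper, inclusive="both")
        # Missing values are not reported: NaN is "unknown", not "out of range".
        return series[~in_range & series.notna()]


# The rule is defined once...
rule = BoundedRule(lower=0, upper=10)

# ...and applied unchanged to the original data and to a later batch.
first_batch = pd.Series([1, 5, -6, 100, 10])
fresh_batch = pd.Series([3, 42, 7])

print(rule.errors(first_batch))  # flags -6 and 100
print(rule.errors(fresh_batch))  # flags 42
```

Keeping the parameters in an immutable object is one simple way to guarantee that both batches are checked against exactly the same rule.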
Usage
Import the package
import pandas as pd
import pdcleaner
Create an example data series
series = pd.Series([1, 5, -6, 100, 10])
Detect the errors in the series with a given method (such as bounded, iqr, zscore, and many more depending on the type of data...)
detector = series.cleaner.detect('bounded', lower=0, upper=10)
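For readers unfamiliar with the method names, here is a rough pandas-only sketch of what bounded, IQR and z-score detection usually mean. It is not pandas-cleaner's implementation: the function names, the `k=1.5` fence factor and the `threshold` default are assumptions chosen for illustration, and the package's own defaults may differ.

```python
import pandas as pd

series = pd.Series([1, 5, -6, 100, 10])


def bounded_errors(s: pd.Series, lower: float, upper: float) -> pd.Series:
    """Flag values outside the closed interval [lower, upper].

    Note: NaN would also be flagged here, since it is not "between" anything.
    """
    return ~s.between(lower, upper, inclusive="both")


def iqr_errors(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values farther than k * IQR beyond the quartiles (Tukey's fences)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    spread = q3 - q1
    return (s < q1 - k * spread) | (s > q3 + k * spread)


def zscore_errors(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag values whose absolute z-score exceeds the threshold."""
    z = (s - s.mean()) / s.std()  # pandas uses the sample std (ddof=1)
    return z.abs() > threshold


print(series[bounded_errors(series, lower=0, upper=10)])  # -6 and 100
print(series[iqr_errors(series)])                         # 100 only
print(series[zscore_errors(series, threshold=1.5)])       # 100 only
```

With only five samples no z-score can reach 3, so the call above lowers the threshold to 1.5. The three methods also disagree on -6: it breaks the explicit bounds but is not a statistical outlier for this series.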
Inspect the result:
detector.report()
                              Detection report
==============================================================================
Method:     bounded            Nb samples:         5
Date:       January 24,2022    Nb errors:          2
Time:       16:06:08           Nb rows with NaN:   0
------------------------------------------------------------------------------
lower       0                  upper               10
inclusive   both               sided               both
==============================================================================
Check the potential errors that have been detected
detector.detected()
2 -6
3 100
dtype: int64
Clean the detected errors from the series using a method chosen among drop, to_na, clip, replace, ...
series.cleaner.clean("drop", detector, inplace=True)
series
0 1
1 5
4 10
dtype: int64
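To compare the cleaning strategies named above, the following sketch reproduces their general idea with plain pandas operations. It is only an approximation under stated assumptions: the variable names are illustrative, the replacement value (the median of the valid values) is an arbitrary choice, and pandas-cleaner's own methods may behave differently.

```python
import pandas as pd

series = pd.Series([1, 5, -6, 100, 10])
lower, upper = 0, 10

# Boolean mask of the values outside [lower, upper], i.e. rows 2 and 3.
is_error = ~series.between(lower, upper, inclusive="both")

# "drop": remove the rows holding errors; the original index labels are kept.
dropped = series[~is_error]

# "to_na": blank out the errors; the dtype becomes float to hold NaN.
emptied = series.mask(is_error)

# "clip": pull each error back to the nearest bound.
clipped = series.clip(lower=lower, upper=upper)

# "replace": substitute a chosen value, here the median of the valid values.
replaced = series.mask(is_error, series[~is_error].median())

print(dropped.tolist())   # [1, 5, 10]
print(clipped.tolist())   # [1, 5, 0, 10, 10]
print(replaced.tolist())  # values at rows 2 and 3 become 5
```

Dropping shortens the series, whereas the other three strategies preserve its length, which matters when the series is a column of a larger DataFrame.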
Contributing to pandas-cleaner
All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.
Issues and bugs can be reported at https://github.com/eurodecision/pandas-cleaner/issues
Hashes for pandas_cleaner-0.0.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | ee02ba0130ed521a070256c1fadb1589ce587f2a63b26724d8e5c7db3e35a890
MD5 | a9d1a49788ff9959510cc4f5229f26a2
BLAKE2b-256 | 406b6c3395f65aeb2123f21d088d13c29103af92c7dd34458a22d90037fce05b