Find PII data in databases

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

PII Catcher for Databases and Data Warehouses

Overview

PIICatcher is a scanner for PII and PHI. It finds PII in your databases and file systems and tracks critical data. PIICatcher uses two techniques to detect PII:

Match regular expressions against column names.
Match sample data in columns using regular expressions and NLP libraries.

Read more in the blog post on both these strategies.
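
The two techniques can be illustrated with a minimal sketch. The patterns and function names below are hypothetical and chosen only for illustration; they are not the regular expressions or functions that ship with PIICatcher, and the NLP step is left out.

```python
import re

# Hypothetical patterns for illustration only; these are NOT the
# regular expressions bundled with PIICatcher.
COLUMN_NAME_PATTERNS = {
    "EMAIL": re.compile(r"e?mail", re.IGNORECASE),
    "PERSON": re.compile(r"(f|l|first|last|maiden)_?name", re.IGNORECASE),
    "PHONE": re.compile(r"phone|mobile", re.IGNORECASE),
}

DATA_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}


def match_column_name(column_name):
    """Technique 1: return the PII types whose pattern matches the column name."""
    return sorted(
        pii_type
        for pii_type, pattern in COLUMN_NAME_PATTERNS.items()
        if pattern.search(column_name)
    )


def match_sample_data(values):
    """Technique 2 (regex part only): return the PII types whose pattern
    fully matches at least one sampled value."""
    found = set()
    for value in values:
        for pii_type, pattern in DATA_PATTERNS.items():
            if pattern.fullmatch(str(value)):
                found.add(pii_type)
    return sorted(found)


print(match_column_name("maiden_name"))
print(match_sample_data(["alice@example.com", "n/a"]))
```

Column-name matching is cheap because it never reads row data, while data matching can catch PII stored in columns with uninformative names such as a or b.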

PIICatcher is batteries-included, with a growing set of regular expressions for scanning column names as well as data. It also includes spaCy.

PIICatcher supports incremental scans and will only scan new or not-yet-scanned columns. Incremental scans make it easy to schedule scans. It also provides options to include or exclude schemas and tables to manage compute resources.
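
The idea of incremental scanning combined with include and exclude filters can be sketched as follows. This is an illustration under stated assumptions, not PIICatcher's implementation: select_columns is a hypothetical helper, exclude_table_regex is an assumed parameter name, and using re.search to apply the patterns is also an assumption.

```python
import re


def select_columns(columns, already_scanned,
                   include_table_regex=(), exclude_table_regex=()):
    """Pick the columns a scan should visit (hypothetical helper).

    columns: iterable of (schema, table, column) tuples.
    already_scanned: set of tuples recorded by earlier scans.
    """
    selected = []
    for schema, table, column in columns:
        # Keep only tables matching an include pattern, when any are given.
        if include_table_regex and not any(
            re.search(pattern, table) for pattern in include_table_regex
        ):
            continue
        # Drop tables matching any exclude pattern.
        if any(re.search(pattern, table) for pattern in exclude_table_regex):
            continue
        # Incremental step: skip columns that a previous scan already covered.
        if (schema, table, column) in already_scanned:
            continue
        selected.append((schema, table, column))
    return selected


columns = [
    ("public", "sample", "email"),
    ("public", "sample", "city"),
    ("public", "audit_log", "ip"),
]
already_scanned = {("public", "sample", "email")}
print(select_columns(columns, already_scanned, include_table_regex=["sample"]))
```

Because columns seen earlier are skipped, a scheduled run only pays for columns added since the previous run.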

There are ingestion functions for both Datahub and Amundsen that tag columns and tables as PII and with the type of PII.

Resources

AWS Glue & Lake Formation Privilege Analyzer, an example of how PIICatcher is used in production.
Two strategies to scan data warehouses

Quick Start

PIICatcher is available as a Docker image or a command-line application.

Docker (preferred)

alias piicatcher='docker run -v ${HOME}/.config/tokern:/config -u $(id -u ${USER}):$(id -g ${USER}) -it --add-host=host.docker.internal:host-gateway tokern/piicatcher:latest'
piicatcher --help
piicatcher scan sqlite --name sqldb --path '/db/sqldb'

Command-line

To install, use pip:

python3 -m venv .env
source .env/bin/activate
pip install piicatcher

# Install the spaCy English model
python -m spacy download en_core_web_sm

# run piicatcher on a sqlite db and print report to console
piicatcher scan sqlite --name sqldb --path '/db/sqldb'
╭─────────────┬─────────────┬─────────────┬─────────────╮
│   schema    │    table    │   column    │   has_pii   │
├─────────────┼─────────────┼─────────────┼─────────────┤
│        main │    full_pii │           a │           1 │
│        main │    full_pii │           b │           1 │
│        main │      no_pii │           a │           0 │
│        main │      no_pii │           b │           0 │
│        main │ partial_pii │           a │           1 │
│        main │ partial_pii │           b │           0 │
╰─────────────┴─────────────┴─────────────┴─────────────╯

API

from piicatcher.api import scan_postgresql

# PIICatcher uses a catalog to store its state.
# The easiest option is to use an in-memory SQLite database.
# For production usage, see https://tokern.io/docs/data-catalog
catalog_params = {'catalog_path': ':memory:'}
output = scan_postgresql(catalog_params=catalog_params, name="pg_db", uri="127.0.0.1",
                         username="piiuser", password="p11secret", database="piidb",
                         include_table_regex=["sample"])
print(output)

# Example Output
[['public', 'sample', 'gender', 'PiiTypes.GENDER'], 
 ['public', 'sample', 'maiden_name', 'PiiTypes.PERSON'], 
 ['public', 'sample', 'lname', 'PiiTypes.PERSON'], 
 ['public', 'sample', 'fname', 'PiiTypes.PERSON'], 
 ['public', 'sample', 'address', 'PiiTypes.ADDRESS'], 
 ['public', 'sample', 'city', 'PiiTypes.ADDRESS'], 
 ['public', 'sample', 'state', 'PiiTypes.ADDRESS'], 
 ['public', 'sample', 'email', 'PiiTypes.EMAIL']]
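
As a usage sketch, the rows returned by the scan can be grouped by PII type with plain Python. The rows below are copied from the example output above so that the snippet runs without a database; group_by_pii_type is a hypothetical helper and not part of the PIICatcher API.

```python
from collections import defaultdict

# Rows copied from the example output: [schema, table, column, pii_type].
output = [
    ["public", "sample", "gender", "PiiTypes.GENDER"],
    ["public", "sample", "maiden_name", "PiiTypes.PERSON"],
    ["public", "sample", "lname", "PiiTypes.PERSON"],
    ["public", "sample", "fname", "PiiTypes.PERSON"],
    ["public", "sample", "address", "PiiTypes.ADDRESS"],
    ["public", "sample", "city", "PiiTypes.ADDRESS"],
    ["public", "sample", "state", "PiiTypes.ADDRESS"],
    ["public", "sample", "email", "PiiTypes.EMAIL"],
]


def group_by_pii_type(rows):
    """Map each PII type to the fully qualified columns tagged with it."""
    grouped = defaultdict(list)
    for schema, table, column, pii_type in rows:
        grouped[pii_type].append(f"{schema}.{table}.{column}")
    return dict(grouped)


for pii_type, columns in group_by_pii_type(output).items():
    print(pii_type, columns)
```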

Supported Databases

PIICatcher supports the following databases:

SQLite 3.24.0 or greater
MySQL 5.6 or greater
PostgreSQL 9.4 or greater
AWS Redshift
AWS Athena
Snowflake

Documentation

For advanced usage, refer to the PIICatcher Documentation.

Survey

Please take this survey if you are a user or considering using PIICatcher. The responses will help to prioritize improvements to the project.

Contributing

For contribution guidelines, see the PIICatcher Developer documentation.

Project details

Release history

0.21.2

Jul 6, 2023

0.21.1

Jun 26, 2023

0.21.0

Jun 23, 2023

0.20.2

Dec 21, 2022

0.20.1

Nov 30, 2022

0.20.0

Nov 29, 2022

0.19.2

Jan 28, 2022

0.19.1

Dec 19, 2021

This version

0.19.0

Dec 17, 2021

0.18.2

Dec 3, 2021

0.18.1

Nov 29, 2021

0.18.0

Nov 29, 2021

0.17.5

Nov 26, 2021

0.17.4

Nov 25, 2021

0.17.3

Nov 24, 2021

0.17.2

Nov 23, 2021

0.17.1

Nov 18, 2021

0.17.0

Nov 17, 2021

0.16.0

Nov 9, 2021

0.15.0

Aug 17, 2021

0.14.0

Aug 13, 2021

0.13.0

Dec 31, 2020

0.12.3

Dec 22, 2020

0.12.2

Nov 20, 2020

0.12.1

Sep 8, 2020

0.12.0

Jul 24, 2020

0.10.3

Jul 16, 2020

0.10.2

Jul 9, 2020

0.10.1

Jun 9, 2020

0.10.0

Apr 30, 2020

0.9.6

Apr 6, 2020

0.9.5

Mar 23, 2020

0.9.4

Mar 3, 2020

0.9.3

Feb 18, 2020

0.8.1

Feb 13, 2020

0.8.0

Feb 11, 2020

0.7.2

Jan 21, 2020

0.7.1

Jan 14, 2020

0.7.0

Jan 14, 2020

0.6.5

Jan 1, 2020

0.6.4

Dec 20, 2019

0.6.3

Dec 19, 2019

0.6.2

Dec 19, 2019

0.6.1

Dec 18, 2019

0.6.0

Dec 10, 2019

0.5.5

Nov 18, 2019

0.5.3

Nov 13, 2019

0.5.2

Nov 2, 2019

0.5.1

Nov 1, 2019

0.5.0

Nov 1, 2019

0.4.2

Aug 15, 2019

0.3.0

Mar 29, 2019

0.2.0

Mar 26, 2019

0.1.0

Mar 23, 2019

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

piicatcher-0.19.0.tar.gz (17.9 kB)

Uploaded Dec 17, 2021 Source

Built Distribution

piicatcher-0.19.0-py3-none-any.whl (18.0 kB)

Uploaded Dec 17, 2021 Python 3

Hashes for piicatcher-0.19.0.tar.gz

Algorithm	Hash digest
SHA256	`2fd5ddeab7cd05758fbfea1d8961d06bb4b96d2ff89bcf68b3cce21504b00404`
MD5	`88bd7929635037432fda857c3facc959`
BLAKE2b-256	`9e3e7d76ef0021ee2057ae1b48e7099b3a7d453629fe53eaa1a9b0bbad3c2443`

Hashes for piicatcher-0.19.0-py3-none-any.whl

Algorithm	Hash digest
SHA256	`028e5c6067ed1912fc69051c22b17a19473219efd85128f10dcdb6c2b4d07fd3`
MD5	`adcc8ffa25617951c9397d7938b74185`
BLAKE2b-256	`8250691f701f1128a5ef7464df1c191790297fedf5730949c7ef287e3833a58c`
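
A downloaded file can be checked against the SHA256 digests listed above with the standard library. This is a minimal sketch: the helper names are hypothetical, and the commented call assumes the wheel has been saved to the current directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, published_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == published_hex.strip().lower()


# Assumed local file name; uncomment after downloading the wheel.
# print(matches_published_digest(
#     "piicatcher-0.19.0-py3-none-any.whl",
#     "028e5c6067ed1912fc69051c22b17a19473219efd85128f10dcdb6c2b4d07fd3",
# ))
```

pip can enforce the same check at install time when a requirements file pins hashes and is installed with --require-hashes.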