Find PII data in databases
PII Catcher for Databases and Data Warehouses
Overview
PIICatcher is a scanner for PII and PHI. It finds PII in your databases and file systems and tracks critical data. PIICatcher uses two techniques to detect PII:
- Match regular expressions against column names
- Match regular expressions against sample data in columns and run NLP libraries over that data
Read more in the blog post on both these strategies.
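The two strategies can be illustrated with a minimal sketch. It is not PIICatcher's implementation: the patterns, labels, and function names below are hypothetical and chosen only for illustration, and the NLP half (named-entity recognition with spaCy) is left out to keep the example dependency-free.

```python
import re

# Hypothetical patterns for illustration only; PIICatcher ships its own set.
COLUMN_NAME_PATTERNS = {
    "EMAIL": re.compile(r"mail", re.IGNORECASE),
    "PHONE": re.compile(r"phone|mobile", re.IGNORECASE),
    "PERSON": re.compile(r"(first|last|full|maiden)_?name|^[fl]name$", re.IGNORECASE),
}

# Hypothetical pattern applied to sampled values rather than column names.
DATA_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"),
}


def match_column_name(column_name):
    """Strategy 1: return the first label whose pattern matches the column name."""
    for label, pattern in COLUMN_NAME_PATTERNS.items():
        if pattern.search(column_name):
            return label
    return None


def match_sample_data(values):
    """Strategy 2 (regex half): return the first label matched by any sampled value."""
    for label, pattern in DATA_PATTERNS.items():
        if any(pattern.search(str(value)) for value in values):
            return label
    return None


print(match_column_name("email_address"))
print(match_column_name("created_at"))
print(match_sample_data(["n/a", "jane@example.com"]))
```

Column-name matching is cheap because it never reads row data, while sample-data matching can catch PII stored in columns with uninformative names such as `a` or `b`.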
PIICatcher is batteries-included, with a growing set of regular expressions for scanning column names as well as data. It also includes spaCy.
PIICatcher supports incremental scans and will only scan new or not-yet-scanned columns, which makes scans easy to schedule. It also provides options to include or exclude schemas and tables to manage compute resources.
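The idea behind include/exclude filters and incremental scans can be sketched as follows. This is an assumption-laden illustration, not PIICatcher's code: `select_tables` and `pending_columns` are hypothetical helpers, and whether PIICatcher matches patterns with `search` or a full match is not stated here.

```python
import re


def select_tables(tables, include=None, exclude=None):
    """Keep tables matching any include pattern (all tables if none is given),
    then drop tables matching any exclude pattern."""
    include_patterns = [re.compile(p) for p in (include or [])]
    exclude_patterns = [re.compile(p) for p in (exclude or [])]
    selected = []
    for name in tables:
        if include_patterns and not any(p.search(name) for p in include_patterns):
            continue
        if any(p.search(name) for p in exclude_patterns):
            continue
        selected.append(name)
    return selected


def pending_columns(columns, already_scanned):
    """Incremental scan: return only the columns that have not been scanned yet."""
    return [column for column in columns if column not in already_scanned]


tables = ["sample", "sample_archive", "orders"]
print(select_tables(tables, include=["sample"], exclude=["archive"]))

columns = [("sample", "email"), ("sample", "city")]
print(pending_columns(columns, already_scanned={("sample", "email")}))
```

Filtering before sampling keeps the scanner from reading tables that are known to be irrelevant, and tracking scanned columns lets a scheduled job do work proportional to what changed.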
There are ingestion functions for both Datahub and Amundsen, which tag columns and tables as containing PII and record the type of PII found.
Resources
- AWS Glue & Lake Formation Privilege Analyzer for an example of how piicatcher is used in production.
- Two strategies to scan data warehouses
Quick Start
PIICatcher is available as a Docker image or as a command-line application.
Docker (preferred)
alias piicatcher='docker run -v ${HOME}/.config/tokern:/config -u $(id -u ${USER}):$(id -g ${USER}) -it --add-host=host.docker.internal:host-gateway tokern/piicatcher:latest'
piicatcher --help
piicatcher scan sqlite --name sqldb --path '/db/sqldb'
Command-line
To install, use pip:
python3 -m venv .env
source .env/bin/activate
pip install piicatcher
# Install Spacy English package
python -m spacy download en_core_web_sm
# run piicatcher on a sqlite db and print report to console
piicatcher scan sqlite --name sqldb --path '/db/sqldb'
╭─────────────┬─────────────┬─────────────┬─────────────╮
│ schema │ table │ column │ has_pii │
├─────────────┼─────────────┼─────────────┼─────────────┤
│ main │ full_pii │ a │ 1 │
│ main │ full_pii │ b │ 1 │
│ main │ no_pii │ a │ 0 │
│ main │ no_pii │ b │ 0 │
│ main │ partial_pii │ a │ 1 │
│ main │ partial_pii │ b │ 0 │
╰─────────────┴─────────────┴─────────────┴─────────────╯
API
from piicatcher.api import scan_postgresql
# PIICatcher uses a catalog to store its state.
# The easiest option is to use a sqlite memory database.
# For production usage, see https://tokern.io/docs/data-catalog
catalog_params = {'catalog_path': ':memory:'}
output = scan_postgresql(catalog_params=catalog_params, name="pg_db", uri="127.0.0.1",
                         username="piiuser", password="p11secret", database="piidb",
                         include_table_regex=["sample"])
print(output)
# Example Output
[['public', 'sample', 'gender', 'PiiTypes.GENDER'],
['public', 'sample', 'maiden_name', 'PiiTypes.PERSON'],
['public', 'sample', 'lname', 'PiiTypes.PERSON'],
['public', 'sample', 'fname', 'PiiTypes.PERSON'],
['public', 'sample', 'address', 'PiiTypes.ADDRESS'],
['public', 'sample', 'city', 'PiiTypes.ADDRESS'],
['public', 'sample', 'state', 'PiiTypes.ADDRESS'],
['public', 'sample', 'email', 'PiiTypes.EMAIL']]
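The scan result can be post-processed with plain Python. The sketch below assumes the output has the shape shown in the example above, a list of `[schema, table, column, pii_type]` rows; `group_by_table` is a hypothetical helper and not part of the PIICatcher API, and the rows are a subset copied from the example output.

```python
from collections import defaultdict

# Rows copied from the example output above; in real use this is the value
# returned by the scan function.
output = [
    ['public', 'sample', 'gender', 'PiiTypes.GENDER'],
    ['public', 'sample', 'fname', 'PiiTypes.PERSON'],
    ['public', 'sample', 'address', 'PiiTypes.ADDRESS'],
    ['public', 'sample', 'email', 'PiiTypes.EMAIL'],
]


def group_by_table(rows):
    """Map 'schema.table' to a dict of column name -> PII type (as a string)."""
    grouped = defaultdict(dict)
    for schema, table, column, pii_type in rows:
        grouped[f"{schema}.{table}"][column] = str(pii_type)
    return dict(grouped)


report = group_by_table(output)
for table, columns in report.items():
    print(table)
    for column, pii_type in sorted(columns.items()):
        print(f"  {column}: {pii_type}")
```

Grouping by table makes it easier to hand the findings to whoever owns each table, or to feed them into a tagging step.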
Supported Databases
PIICatcher supports the following databases:
- Sqlite3 v3.24.0 or greater
- MySQL 5.6 or greater
- PostgreSQL 9.4 or greater
- AWS Redshift
- AWS Athena
- Snowflake
Documentation
For advanced usage, refer to the PIICatcher Documentation.
Survey
Please take this survey if you are a user or considering using PIICatcher. The responses will help to prioritize improvements to the project.
Contributing
For contribution guidelines, see the PIICatcher Developer documentation.
Hashes for piicatcher-0.19.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 028e5c6067ed1912fc69051c22b17a19473219efd85128f10dcdb6c2b4d07fd3
MD5 | adcc8ffa25617951c9397d7938b74185
BLAKE2b-256 | 8250691f701f1128a5ef7464df1c191790297fedf5730949c7ef287e3833a58c