Pseudonymization extensions for Dapla

Development Status
- 4 - Beta
License
- OSI Approved :: MIT License

Project description

Dapla Toolbelt Pseudo

Pseudonymize, repseudonymize and depseudonymize data on Dapla.

Features

Pseudonymize

from dapla_pseudo import PseudoData
import pandas as pd

file_path="data/personer.json"

df = pd.read_json(file_path) # Create DataFrame from file

# Example: Single field default encryption (DAEAD)
result_df = (
    PseudoData.from_pandas(df)                     # Specify what dataframe to use
    .on_field("fornavn")                           # Select the field to pseudonymize
    .pseudonymize()                                # Apply pseudonymization to the selected field
    .to_polars()                                   # Get the result as a polars dataframe
)

# Example: Multiple fields default encryption (DAEAD)
result_df = (
    PseudoData.from_pandas(df)                     # Specify what dataframe to use
    .on_fields("fornavn", "etternavn")             # Select multiple fields to pseudonymize
    .pseudonymize()                                # Apply pseudonymization to the selected fields
    .to_polars()                                   # Get the result as a polars dataframe
)

# Example: Single field sid mapping and pseudonymization (FPE)
result_df = (
    PseudoData.from_pandas(df)                     # Specify what dataframe to use
    .on_field("fnr")                               # Select the field to pseudonymize
    .map_to_stable_id()                            # Map the selected field to stable id
    .pseudonymize()                                # Apply pseudonymization to the selected field
    .to_polars()                                   # Get the result as a polars dataframe
)

The default encryption algorithm is DAEAD (Deterministic Authenticated Encryption with Associated Data). However, if the field is a valid Norwegian personal identification number (fnr, dnr), the recommended way to pseudonymize is to use the function map_to_stable_id() to convert the identification number to a stable ID (SID) prior to pseudonymization. In that case, the pseudonymization algorithm is FPE (Format Preserving Encryption).
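
The choice described above can be sketched as a small routing step. The helpers below are hypothetical and not part of dapla_pseudo. They assume an identification number is written as exactly 11 digits, which is only a loose format heuristic: check digits and dates are not verified.

```python
def looks_like_national_id(value: str) -> bool:
    """Loose format check: exactly 11 ASCII digits (no check-digit validation)."""
    return len(value) == 11 and value.isascii() and value.isdigit()


def choose_strategy(sample_values: list[str]) -> str:
    """Return 'sid+fpe' if every sample looks like an fnr/dnr, else 'daead'."""
    if sample_values and all(looks_like_national_id(v) for v in sample_values):
        return "sid+fpe"  # i.e. map_to_stable_id() followed by pseudonymize()
    return "daead"        # i.e. pseudonymize() with the default algorithm


# Made-up sample values; the digit strings are placeholders, not real numbers.
columns = {
    "fnr": ["12345678901", "10987654321"],
    "fornavn": ["Donald", "Dolly"],
}
plan = {name: choose_strategy(values) for name, values in columns.items()}
print(plan)
```

A plan like this only suggests which fields to pass through map_to_stable_id(); whether a value actually has a SID is a separate question, covered by the validation step below.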

Validate SID mapping

from dapla_pseudo import Validator
import pandas as pd

file_path="data/personer.json"

df = pd.read_json(file_path)

result = (
    Validator.from_pandas(df)                   # Specify what dataframe to use
    .on_field("fnr")                            # Select the field to validate
    .validate_map_to_stable_id()                # Validate that all the field values can be mapped to a SID
)
# The resulting dataframe contains the field values that didn't have a corresponding SID
result.to_pandas()

A sid_snapshot_date can also be specified to validate that the field values can be mapped to a SID at a specific date:

from dapla_pseudo import Validator
from dapla_pseudo.utils import convert_to_date
import pandas as pd

file_path="data/personer.json"

df = pd.read_json(file_path)

result = (
    Validator.from_pandas(df)
    .on_field("fnr")
    .validate_map_to_stable_id(
        sid_snapshot_date=convert_to_date("2023-08-29")
    )
)
# Show metadata about the validation (e.g. which version of the SID catalog was used)
result.metadata
# Show the field values that didn't have a corresponding SID
result.to_pandas()
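
To make the validation result easier to picture, here is a minimal local sketch of the idea. It is not how the service is implemented: the catalog contents, the SID values, and the rule "use the newest catalog version dated on or before the snapshot date" are all assumptions made for illustration.

```python
from datetime import date
from typing import Optional

# Hypothetical SID catalog snapshots: version date -> {identifier: stable ID}.
CATALOG_VERSIONS = {
    date(2023, 1, 1): {"11111111111": "sid-0001"},
    date(2023, 8, 1): {"11111111111": "sid-0001", "22222222222": "sid-0002"},
}


def pick_version(snapshot_date: Optional[date]) -> date:
    """Pick the newest catalog version dated on or before snapshot_date."""
    candidates = [
        v for v in CATALOG_VERSIONS if snapshot_date is None or v <= snapshot_date
    ]
    if not candidates:
        raise ValueError("no catalog version available for that date")
    return max(candidates)


def validate_mapping(values: list[str], snapshot_date: Optional[date] = None) -> dict:
    """Report which values lack a SID in the selected catalog version."""
    version = pick_version(snapshot_date)
    catalog = CATALOG_VERSIONS[version]
    missing = [v for v in values if v not in catalog]
    return {"catalog_version": version.isoformat(), "missing": missing}


values = ["11111111111", "22222222222", "33333333333"]
late = validate_mapping(values, date(2023, 8, 29))
early = validate_mapping(values, date(2023, 3, 1))
print(late)
print(early)
```

The returned dictionary plays the role of both result.metadata (which catalog version was used) and result.to_pandas() (the values without a SID) in the examples above.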

Advanced usage

Pseudonymize

Read from file systems

from dapla_pseudo import PseudoData
from dapla import AuthClient


file_path="data/personer.json"

options = {
    # Specify data types of columns in the dataset
    "dtype" : { "fnr": "string","fornavn": "string","etternavn": "string","kjonn": "category","fodselsdato": "string"}
}

# Example: Read dataframe from file
result_df = (
    PseudoData.from_file(file_path, **options)     # Read the DataFrame from file
    .on_fields("fornavn", "etternavn")             # Select multiple fields to pseudonymize
    .pseudonymize()                                # Apply pseudonymization to the selected fields
    .to_polars()                                   # Get the result as a polars dataframe
)

# Example: Read dataframe from GCS bucket
options = {
    # Specify data types of columns in the dataset
    "dtype" : { "fnr": "string","fornavn": "string","etternavn": "string","kjonn": "category","fodselsdato": "string"},
    # Specify storage options for Google Cloud Storage (GCS)
    "storage_options" : {"token": AuthClient.fetch_google_credentials()}
}

gcs_file_path = "gs://ssb-staging-dapla-felles-data-delt/felles/pseudo-examples/andeby_personer.csv"

result_df = (
    PseudoData.from_file(gcs_file_path, **options) # Read DataFrame from GCS
    .on_fields("fornavn", "etternavn")             # Select multiple fields to pseudonymize
    .pseudonymize()                                # Apply pseudonymization to the selected fields
    .to_polars()                                   # Get the result as a polars dataframe
)

Pseudonymize using a custom keyset

from dapla_pseudo import pseudonymize

# Pseudonymize fields in a local file using the default key:
pseudonymize(file_path="./data/personer.json", fields=["fnr", "fornavn"])

# Pseudonymize fields in a local file, explicitly denoting the key to use:
pseudonymize(file_path="./data/personer.json", fields=["fnr", "fornavn"], key="ssb-common-key-1")

# Pseudonymize a local file using a custom key:
import json
custom_keyset = json.dumps({
    "encryptedKeyset": "CiQAp91NBhLdknX3j9jF6vwhdyURaqcT9/M/iczV7fLn...8XYFKwxiwMtCzDT6QGzCCCM=",
    "keysetInfo": {
        "primaryKeyId": 1234567890,
        "keyInfo": [
            {
                "typeUrl": "type.googleapis.com/google.crypto.tink.AesSivKey",
                "status": "ENABLED",
                "keyId": 1234567890,
                "outputPrefixType": "TINK",
            }
        ],
    },
    "kekUri": "gcp-kms://projects/some-project-id/locations/europe-north1/keyRings/some-keyring/cryptoKeys/some-kek-1",
})
pseudonymize(file_path="./data/personer.json", fields=["fnr", "fornavn"], key=custom_keyset)

# Operate on data in a streaming manner:
import shutil
with pseudonymize("./data/personer.json", fields=["fnr", "fornavn", "etternavn"], stream=True) as res:
    with open("./data/personer_deid.json", 'wb') as f:
        res.raw.decode_content = True
        shutil.copyfileobj(res.raw, f)

# Map certain fields to stable ID
pseudonymize(file_path="./data/personer.json", fields=["fornavn"], sid_fields=["fnr"])
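
Before sending a custom keyset, it can help to sanity-check its shape locally. The sketch below is a hypothetical helper, not part of dapla_pseudo. The list of required fields and the rule that primaryKeyId should match an ENABLED key are inferred from the example keyset above, not from documented requirements of the service.

```python
import json

# Field names taken from the example keyset; treating them as required is an assumption.
REQUIRED_TOP_LEVEL = ("encryptedKeyset", "keysetInfo", "kekUri")


def check_keyset(keyset_json: str) -> list[str]:
    """Return a list of problems found; an empty list means the shape looks right."""
    try:
        keyset = json.loads(keyset_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(keyset, dict):
        return ["keyset must be a JSON object"]

    problems = [f"missing field: {name}" for name in REQUIRED_TOP_LEVEL if name not in keyset]

    info = keyset.get("keysetInfo", {})
    enabled_ids = [
        key.get("keyId") for key in info.get("keyInfo", []) if key.get("status") == "ENABLED"
    ]
    if info.get("primaryKeyId") not in enabled_ids:
        problems.append("primaryKeyId does not match an ENABLED key")
    return problems


example = json.dumps({
    "encryptedKeyset": "<encrypted keyset omitted>",  # placeholder, not a real keyset
    "keysetInfo": {
        "primaryKeyId": 1234567890,
        "keyInfo": [{"status": "ENABLED", "keyId": 1234567890}],
    },
    "kekUri": "gcp-kms://projects/some-project-id/locations/europe-north1/keyRings/some-keyring/cryptoKeys/some-kek-1",
})
print(check_keyset(example))
```

A check like this only catches structural mistakes such as a missing field or a mismatched key ID. It says nothing about whether the encrypted keyset itself is usable.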

Repseudonymize

from dapla_pseudo import repseudonymize

# Repseudonymize fields in a local file, denoting source and target keys to use:
repseudonymize(file_path="./data/personer_deid.json", fields=["fnr", "fornavn"], source_key="ssb-common-key-1", target_key="ssb-common-key-2")

Depseudonymize

from dapla_pseudo import depseudonymize

# Depseudonymize fields in a local file using the default key:
depseudonymize(file_path="./data/personer_deid.json", fields=["fnr", "fornavn"])

# Depseudonymize fields in a local file, explicitly denoting the key to use:
depseudonymize(file_path="./data/personer_deid.json", fields=["fnr", "fornavn"], key="ssb-common-key-1")

Note that depseudonymization requires elevated access privileges.
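
The relationship between the three operations can be illustrated with a toy model. The cipher below is a stand-in written for this example only: it is not DAEAD or FPE, it is not secure, and it says nothing about how the Dapla service works internally. It only demonstrates two properties: the same value and key always give the same pseudonym, and repseudonymizing is conceptually the same as depseudonymizing with the source key and pseudonymizing again with the target key.

```python
import base64
import hashlib


def _keystream(key: str, length: int) -> bytes:
    """Derive a repeatable byte stream from the key (toy construction, insecure)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(f"{key}:{counter}".encode()).digest()
        counter += 1
    return out[:length]


def toy_pseudonymize(value: str, key: str) -> str:
    data = value.encode("utf-8")
    masked = bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))
    return base64.urlsafe_b64encode(masked).decode("ascii")


def toy_depseudonymize(pseudonym: str, key: str) -> str:
    masked = base64.urlsafe_b64decode(pseudonym.encode("ascii"))
    data = bytes(a ^ b for a, b in zip(masked, _keystream(key, len(masked))))
    return data.decode("utf-8")


def toy_repseudonymize(pseudonym: str, source_key: str, target_key: str) -> str:
    # Reverse with the source key, then apply the target key.
    return toy_pseudonymize(toy_depseudonymize(pseudonym, source_key), target_key)


first = toy_pseudonymize("Donald", "key-1")
second = toy_pseudonymize("Donald", "key-1")
moved = toy_repseudonymize(first, "key-1", "key-2")
print(first == second)                             # deterministic for a given key
print(moved == toy_pseudonymize("Donald", "key-2"))
print(toy_depseudonymize(moved, "key-2"))
```

Determinism is what makes it possible to join datasets on a pseudonymized field, provided both were pseudonymized with the same key.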

Requirements

Python >= 3.10
Dependencies can be found in pyproject.toml

Installation

You can install Dapla Toolbelt Pseudo via pip from PyPI:

pip install dapla-toolbelt-pseudo

Usage

Please see the Reference Guide for details.

Contributing

Contributions are very welcome. To learn more, see the Contributor Guide.

License

Distributed under the terms of the MIT license, Dapla Toolbelt Pseudo is free and open source software.

Issues

If you encounter any problems, please file an issue along with a detailed description.

Credits

This project was generated from Statistics Norway's SSB PyPI Template.

Release history

Version	Released
2.0.3	May 23, 2024
2.0.2	May 22, 2024
2.0.1	May 22, 2024
2.0.0	May 16, 2024
1.8.6	Apr 16, 2024
1.8.5	Apr 15, 2024
1.8.4	Apr 11, 2024
1.8.3	Apr 11, 2024
1.8.2	Apr 11, 2024
1.8.1	Apr 11, 2024
1.8.0	Apr 10, 2024
1.7.0	Apr 9, 2024
1.6.0	Apr 4, 2024
1.5.0	Apr 4, 2024
1.4.0	Mar 22, 2024
1.3.0	Mar 7, 2024
1.2.2	Mar 4, 2024
1.2.1	Feb 29, 2024
1.2.0	Feb 28, 2024
1.1.0	Feb 13, 2024
1.0.4	Feb 8, 2024
1.0.3	Jan 24, 2024
1.0.2 (this version)	Jan 18, 2024
1.0.0	Jan 8, 2024
0.6.3	Dec 6, 2023
0.6.2	Dec 1, 2023
0.6.1	Dec 1, 2023
0.6.0	Nov 30, 2023
0.5.9	Oct 17, 2023
0.5.7	Oct 9, 2023
0.5.6	Oct 9, 2023
0.5.5	Oct 4, 2023
0.5.4	Sep 29, 2023
0.5.3	Sep 28, 2023
0.5.2	Sep 27, 2023
0.4.0	Aug 17, 2023
0.3.0	Jun 29, 2023
0.2.9	Jun 7, 2023
0.2.8	Jun 6, 2023
0.2.7	May 30, 2023
0.2.6	May 30, 2023
0.2.4	May 24, 2023
0.2.3	Mar 22, 2023
0.2.2	Mar 17, 2023
0.2.1	Mar 17, 2023
0.2.0	Jan 5, 2023
0.1.0	Nov 23, 2022

Download files

Source Distribution

dapla_toolbelt_pseudo-1.0.2.tar.gz (25.6 kB)

Uploaded Jan 18, 2024 Source

Built Distribution

dapla_toolbelt_pseudo-1.0.2-py3-none-any.whl (28.6 kB)

Uploaded Jan 18, 2024 Python 3

Hashes for dapla_toolbelt_pseudo-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`42e74d48a4a5f892bd1c05add4af5c6db56a33c952a4c7fe4533cf02751d9b80`
MD5	`94355455f813846386c70588a60467fb`
BLAKE2b-256	`b5ed08a5536f9aeaa94869a245b5c335ac101b8f8cbba3584728802e8ecaadec`

Hashes for dapla_toolbelt_pseudo-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1bdb8e201c83460583e5ab48c26e8a0418283798459e6b5a815966166a7c34a1`
MD5	`03d0a6faf3752ae56e6eda8b14f07c6e`
BLAKE2b-256	`d40d5b463d4c2a4eb678750c2bc18aeef65cc2986b2dd6dbf52712c946c8a3ab`