Pseudonymization extensions for Dapla Toolbelt

Pseudonymize, repseudonymize and depseudonymize data on Dapla.

Usage

See the command-line reference for details.

Pseudonymize

from dapla_pseudo import pseudonymize

# Pseudonymize fields in a local file using the default key:
pseudonymize(file_path="./data/personer.json", fields=["fnr", "fornavn"])

# Pseudonymize fields in a local file, explicitly denoting the key to use:
pseudonymize(file_path="./data/personer.json", fields=["fnr", "fornavn"], key="ssb-common-key-1")

# Pseudonymize a local file using a custom key:
import json
custom_keyset = json.dumps({
    "encryptedKeyset": "CiQAp91NBhLdknX3j9jF6vwhdyURaqcT9/M/iczV7fLn...8XYFKwxiwMtCzDT6QGzCCCM=",
    "keysetInfo": {
        "primaryKeyId": 1234567890,
        "keyInfo": [
            {
                "typeUrl": "type.googleapis.com/google.crypto.tink.AesSivKey",
                "status": "ENABLED",
                "keyId": 1234567890,
                "outputPrefixType": "TINK",
            }
        ],
    },
    "kekUri": "gcp-kms://projects/some-project-id/locations/europe-north1/keyRings/some-keyring/cryptoKeys/some-kek-1",
})
pseudonymize(file_path="./data/personer.json", fields=["fnr", "fornavn"], key=custom_keyset)

# Operate on data in a streaming manner:
import shutil
with pseudonymize("./data/personer.json", fields=["fnr", "fornavn", "etternavn"], stream=True) as res:
    with open("./data/personer_deid.json", 'wb') as f:
        res.raw.decode_content = True
        shutil.copyfileobj(res.raw, f)

# Map certain fields to stable ID:
pseudonymize(file_path="./data/personer.json", fields=["fornavn"], sid_fields=["fnr"])
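
The custom key shown above is a JSON string that wraps an encrypted keyset together with its metadata. Before passing such a string as `key`, it can be useful to confirm that it parses and has the same overall shape as the example. The following is a minimal sketch, not part of `dapla_pseudo`: the helper name `check_keyset` is hypothetical, and the checks are based only on the fields visible in the example above, not on any documented validation rules.

```python
import json

# Top-level fields seen in the custom keyset example (an assumption, not a spec).
REQUIRED_FIELDS = ("encryptedKeyset", "keysetInfo", "kekUri")


def check_keyset(keyset_json: str) -> dict:
    """Parse a keyset JSON string and run basic structural checks on it."""
    keyset = json.loads(keyset_json)
    missing = [name for name in REQUIRED_FIELDS if name not in keyset]
    if missing:
        raise ValueError(f"keyset is missing fields: {missing}")

    info = keyset["keysetInfo"]
    key_ids = [key["keyId"] for key in info.get("keyInfo", [])]
    # The primary key id should refer to one of the listed keys.
    if info.get("primaryKeyId") not in key_ids:
        raise ValueError("primaryKeyId does not match any keyId in keyInfo")
    return keyset


# Placeholder values only; the ciphertext here is not a real keyset.
example = json.dumps({
    "encryptedKeyset": "<base64 ciphertext>",
    "keysetInfo": {
        "primaryKeyId": 1234567890,
        "keyInfo": [{"keyId": 1234567890, "status": "ENABLED"}],
    },
    "kekUri": "gcp-kms://projects/some-project-id/locations/europe-north1/keyRings/some-keyring/cryptoKeys/some-kek-1",
})
print(check_keyset(example)["keysetInfo"]["primaryKeyId"])  # prints 1234567890
```

A check like this only catches malformed input early; whether the keyset can actually be used is decided by the pseudonymization service.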

Builder pattern pseudonymization examples

# Import necessary modules
from dapla_pseudo import PseudoData
from dapla import AuthClient
import pandas as pd


file_path = "data/personer.json"

options = {
    # Specify data types of columns in the dataset
    "dtype": {"fnr": "string", "fornavn": "string", "etternavn": "string", "kjonn": "category", "fodselsdato": "string"}
}

# Example: Single field default encryption (DAEAD)
df = pd.read_json(file_path, **options)  # Create DataFrame from file

result_df = (
    PseudoData.from_pandas(df)                     # Specify what dataframe to use
    .on_field("fornavn")                           # Select the field to pseudonymize
    .pseudonymize()                                # Apply pseudonymization to the selected field
    .to_polars()                                   # Get the result as a polars dataframe
)

# Example: Multiple fields default encryption (DAEAD)
result_df = (
    PseudoData.from_file(file_path, **options)     # Read the DataFrame from file
    .on_fields("fornavn", "etternavn")             # Select multiple fields to pseudonymize
    .pseudonymize()                                # Apply pseudonymization to the selected fields
    .to_polars()                                   # Get the result as a polars dataframe
)

# Example: Single field sid mapping (FPE)
options = {
    # Specify data types of columns in the dataset
    "dtype": {"fnr": "string", "fornavn": "string", "etternavn": "string", "kjonn": "category", "fodselsdato": "string"},
    # Specify storage options for Google Cloud Storage (GCS)
    "storage_options": {"token": AuthClient.fetch_google_credentials()}
}

gcs_file_path = "gs://ssb-staging-dapla-felles-data-delt/felles/pseudo-examples/andeby_personer.csv"

result_df = (
    PseudoData.from_file(gcs_file_path, **options) # Read DataFrame from GCS
    .on_field("fnr")                               # Select the field to map and pseudonymize
    .map_to_stable_id()                            # Map the selected field to stable ID
    .pseudonymize()                                # Apply pseudonymization to the selected field
    .to_polars()                                   # Get the result as a polars dataframe
)
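
The local examples above read `data/personer.json` and expect the columns named in `dtype`. For experimenting locally, a small input file can be generated with the standard library. This is a sketch under assumptions: the file is assumed to be a JSON array of records, and every value below is invented placeholder data, not a real identifier. The name `write_sample` is hypothetical.

```python
import json
from pathlib import Path

# Invented placeholder records using the column names from the dtype options.
# All values are strings so that leading zeros in identifiers are preserved.
SAMPLE_RECORDS = [
    {"fnr": "00000000001", "fornavn": "Donald", "etternavn": "Duck",
     "kjonn": "M", "fodselsdato": "010101"},
    {"fnr": "00000000002", "fornavn": "Dolly", "etternavn": "Duck",
     "kjonn": "F", "fodselsdato": "020202"},
]


def write_sample(path="data/personer.json"):
    """Write the sample records to `path`, creating parent folders if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(
        json.dumps(SAMPLE_RECORDS, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return target
```

Calling `write_sample()` with no arguments creates `data/personer.json` relative to the current working directory, which matches the `file_path` used in the examples.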

Repseudonymize

from dapla_pseudo import repseudonymize

# Repseudonymize fields in a local file, denoting source and target keys to use:
repseudonymize(file_path="./data/personer_deid.json", fields=["fnr", "fornavn"], source_key="ssb-common-key-1", target_key="ssb-common-key-2")

Depseudonymize

from dapla_pseudo import depseudonymize

# Depseudonymize fields in a local file using the default key:
depseudonymize(file_path="./data/personer_deid.json", fields=["fnr", "fornavn"])

# Depseudonymize fields in a local file, explicitly denoting the key to use:
depseudonymize(file_path="./data/personer_deid.json", fields=["fnr", "fornavn"], key="ssb-common-key-1")

Note that depseudonymization requires elevated access privileges.

Requirements

Dapla Toolbelt

Installation

You can install dapla-toolbelt-pseudo via pip from PyPI:

pip install dapla-toolbelt-pseudo
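
Because the API differs between releases, it can help to confirm which version ended up in the environment after installing. A minimal sketch using only the standard library follows; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="dapla-toolbelt-pseudo"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version string, or None when the package is not installed.
print(installed_version())
```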

Contributing

Contributions are very welcome. To learn more, see the Contributor Guide.

License

Distributed under the terms of the MIT license, Pseudonymization extensions for Dapla Toolbelt is free and open source software.

Issues

If you encounter any problems, please file an issue along with a detailed description.

Credits

This project was generated from @cjolowicz's Hypermodern Python Cookiecutter template.

Release history

Version	Release date
2.0.3	May 23, 2024
2.0.2	May 22, 2024
2.0.1	May 22, 2024
2.0.0	May 16, 2024
1.8.6	Apr 16, 2024
1.8.5	Apr 15, 2024
1.8.4	Apr 11, 2024
1.8.3	Apr 11, 2024
1.8.2	Apr 11, 2024
1.8.1	Apr 11, 2024
1.8.0	Apr 10, 2024
1.7.0	Apr 9, 2024
1.6.0	Apr 4, 2024
1.5.0	Apr 4, 2024
1.4.0	Mar 22, 2024
1.3.0	Mar 7, 2024
1.2.2	Mar 4, 2024
1.2.1	Feb 29, 2024
1.2.0	Feb 28, 2024
1.1.0	Feb 13, 2024
1.0.4	Feb 8, 2024
1.0.3	Jan 24, 2024
1.0.2	Jan 18, 2024
1.0.0	Jan 8, 2024
0.6.3	Dec 6, 2023
0.6.2	Dec 1, 2023
0.6.1 (this version)	Dec 1, 2023
0.6.0	Nov 30, 2023
0.5.9	Oct 17, 2023
0.5.7	Oct 9, 2023
0.5.6	Oct 9, 2023
0.5.5	Oct 4, 2023
0.5.4	Sep 29, 2023
0.5.3	Sep 28, 2023
0.5.2	Sep 27, 2023
0.4.0	Aug 17, 2023
0.3.0	Jun 29, 2023
0.2.9	Jun 7, 2023
0.2.8	Jun 6, 2023
0.2.7	May 30, 2023
0.2.6	May 30, 2023
0.2.4	May 24, 2023
0.2.3	Mar 22, 2023
0.2.2	Mar 17, 2023
0.2.1	Mar 17, 2023
0.2.0	Jan 5, 2023
0.1.0	Nov 23, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dapla_toolbelt_pseudo-0.6.1.tar.gz (20.7 kB)

Uploaded Dec 1, 2023 Source

Built Distribution

dapla_toolbelt_pseudo-0.6.1-py3-none-any.whl (24.4 kB)

Uploaded Dec 1, 2023 Python 3

Hashes for dapla_toolbelt_pseudo-0.6.1.tar.gz

Algorithm	Hash digest
SHA256	`51c57e224eb2735c63186769b5e3880225394cb19ea76bbf12f5fea53ca81dc0`
MD5	`39147c391fdcaa34ac135d9870f56b13`
BLAKE2b-256	`971a955b26918cab0b387079237c7769a81bfb3373d752c535db9ac88850c23d`

Hashes for dapla_toolbelt_pseudo-0.6.1-py3-none-any.whl

Algorithm	Hash digest
SHA256	`4c869c097c4d1085f83e8de082291a07b16387fe8decd6f9e2f2e00179e6e29e`
MD5	`cbff72d41dde319c436793f5b4822fd8`
BLAKE2b-256	`beeb320d1fcfe98b62b63924da65036a4d682c86b4a2eef885f6de510d6ef469`
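
The published digests can be used to check that a downloaded file is intact. The sketch below computes a file's SHA256 with the standard library and compares it with the value listed above for the source distribution. The helper names `sha256_of` and `verify` are hypothetical, and the file is assumed to have been downloaded separately.

```python
import hashlib

# SHA256 digest listed above for dapla_toolbelt_pseudo-0.6.1.tar.gz
EXPECTED_SHA256 = "51c57e224eb2735c63186769b5e3880225394cb19ea76bbf12f5fea53ca81dc0"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest matches the expected value."""
    return sha256_of(path) == expected.lower()
```

For example, `verify("dapla_toolbelt_pseudo-0.6.1.tar.gz")` should return `True` for an unmodified download in the current directory. To check the wheel instead, pass its SHA256 digest as `expected`.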