
Data Engineering framework based on Polars.rs


Datasaurus is a Data Engineering framework written in Python 3.7+.

It is based on Polars and heavily influenced by Django.

The intention is to offer an opinionated, feature-rich and powerful framework to help you write data pipelines, ETLs or data manipulation programs.

Documentation (TODO)

Support status legend:

  • ✅ Fully supported (read/write operations).
  • ⭕ Not yet supported, but planned.
  • 💀 Won't be implemented in the near future.

Storages:

  • SQLite ✅
  • PostgreSQL ✅
  • MySQL/MariaDB ⭕
  • Local storage ✅
  • Azure Blob Storage ⭕
  • AWS S3 ⭕

Formats:

  • CSV ✅
  • JSON ✅
  • Parquet ✅
  • Excel ✅
  • Avro ✅
  • TSV ⭕
  • SQL ⭕ (e.g. SQL INSERT statements)

Features:

  • Delta Tables ⭕
  • Field validations ⭕

Simple example

# settings.py
import os

from datasaurus.core.storage import PostgresStorage, StorageGroup, SqliteStorage

# Select the environment whose storage will be used (here 'dev', i.e. the SqliteStorage below).
os.environ['DATASAURUS_ENVIRONMENT'] = 'dev'


class ProfilesData(StorageGroup):
    dev = SqliteStorage(path='/data/data.sqlite')
    live = PostgresStorage(username='user', password='user', host='localhost', database='postgres')

    
# models.py
from datasaurus.core.models import Model, StringField, IntegerField

from settings import ProfilesData  # the StorageGroup defined in settings.py above


class ProfileModel(Model):
    id = IntegerField()
    username = StringField()
    mail = StringField()
    sex = StringField()

    class Meta:
        storage = ProfilesData
        table_name = 'PROFILE'

We can access the raw Polars dataframe with 'Model.df'. It is lazy, meaning the data is only loaded when the attribute is accessed:

>>> ProfileModel.df
shape: (100, 4)
┌─────┬────────────────────┬──────────────────────────┬─────┐
│ id  ┆ username           ┆ mail                     ┆ sex │
│ --- ┆ ---                ┆ ---                      ┆ --- │
│ i64 ┆ str                ┆ str                      ┆ str │
╞═════╪════════════════════╪══════════════════════════╪═════╡
│ 1   ┆ ehayes             ┆ colleen63@hotmail.com    ┆ F   │
│ 2   ┆ thompsondeborah    ┆ judyortega@hotmail.com   ┆ F   │
│ 3   ┆ orivera            ┆ iperkins@hotmail.com     ┆ F   │
│ 4   ┆ ychase             ┆ sophia92@hotmail.com     ┆ F   │
│ …   ┆ …                  ┆ …                        ┆ …   │
│ 97  ┆ mary38             ┆ sylvia80@yahoo.com       ┆ F   │
│ 98  ┆ charlessteven      ┆ usmith@gmail.com         ┆ F   │
│ 99  ┆ plee               ┆ powens@hotmail.com       ┆ F   │
│ 100 ┆ elliottchristopher ┆ wilsonbenjamin@yahoo.com ┆ M   │
└─────┴────────────────────┴──────────────────────────┴─────┘
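
Since 'Model.df' is a plain Polars dataframe, any Polars operation can be chained onto it. A minimal sketch, assuming only standard Polars calls and the columns shown above:

import polars as pl

# Keep only the female profiles and count them with ordinary Polars operations.
female_profiles = ProfileModel.df.filter(pl.col('sex') == 'F')
print(female_profiles.height)  # number of rows left after the filter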

We can now create a new model whose data is derived from ProfileModel:

import polars as pl


class FemaleProfiles(Model):
    id = IntegerField()
    profile_id = IntegerField()
    mail = StringField()

    def calculate_data(self):
        return (
            ProfileModel.df
            .filter(ProfileModel.sex == 'F')       # keep only female profiles
            .with_row_count('new_id')              # add a fresh sequential id column
            .with_columns(
                pl.col('id').alias('profile_id')   # keep a reference to the original profile id
            )
        )

    class Meta:
        auto_select = True
        recalculate = True
        storage = ProfilesData
        table_name = 'PROFILE_FEMALES'

Et voilà! We can now create new dataframes from other dataframes.

If we now call:

FemaleProfiles.ensure_exists()

In this example, simply calling ensure_exists will:

  1. Check whether the table exists in 'dev' (SQLite).
  2. Read ProfileModel from 'dev' (SQLite).
  3. Calculate the new data (calculate_data).
  4. Validate that the columns of the resulting dataframe match the model's (in this case auto_select selects the model's columns automatically).
  5. Write the table to 'dev' (SQLite), creating it if it does not exist.
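
As a rough illustration only (not the framework's actual implementation; the helpers 'from_environment', 'table_exists' and 'write_table' are hypothetical names), the flow looks roughly like this:

def ensure_exists(model):
    # Hypothetical helper: resolve the storage for the current DATASAURUS_ENVIRONMENT.
    storage = model.Meta.storage.from_environment()
    if not storage.table_exists(model.Meta.table_name):    # hypothetical helper
        df = model.calculate_data()                         # user-defined on the model
        df = df.select(['id', 'profile_id', 'mail'])        # conceptually what auto_select does
        storage.write_table(model.Meta.table_name, df)      # hypothetical helper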

You can even move data to different environments or storages, making it easy to change formats or move data around.

You could for example call:

FemaleProfiles.save(to=ProfilesData.live)

This effectively moves the data from SQLite (dev) to PostgreSQL (live).

# Can also change formats
FemaleProfiles.save(to=ProfilesData.otherenvironment, format=LocalFormat.JSON)
FemaleProfiles.save(to=ProfilesData.otherenvironment, format=LocalFormat.CSV)
FemaleProfiles.save(to=ProfilesData.otherenvironment, format=LocalFormat.PARQUET)
