
Data Engineering framework based on Polars.rs


Datasaurus is a Data Engineering framework written in Python 3.7+.

It is based on Polars and heavily influenced by Django.

The intention is to offer an opinionated, feature-rich and powerful framework to help you write data pipelines, ETLs or data manipulation programs.

Documentation (TODO)

Support status legend:

  • ✅ Fully supported (read/write operations).
  • ⭕ Not yet supported, but planned.
  • 💀 Won't be implemented in the near future.

Storages:

  • SQLite ✅
  • PostgreSQL ✅
  • MySQL/MariaDB ⭕
  • Local storage ✅
  • Azure Blob Storage ⭕
  • AWS S3 ⭕

Formats:

  • CSV ✅
  • JSON ✅
  • Parquet ✅
  • Excel ✅
  • Avro ✅
  • TSV ⭕
  • SQL ⭕ (e.g. SQL INSERT statements)

Features:

  • Delta Tables ⭕
  • Field validations ⭕

Simple example

# settings.py
import os

from datasaurus.core.storage import PostgresStorage, StorageGroup, SqliteStorage

# Select the environment whose storage will be used (here 'dev', i.e. the SqliteStorage below).
os.environ['DATASAURUS_ENVIRONMENT'] = 'dev'


class ProfilesData(StorageGroup):
    dev = SqliteStorage(path='/data/data.sqlite')
    live = PostgresStorage(username='user', password='user', host='localhost', database='postgres')

    
# models.py
from datasaurus.core.models import Model, StringField, IntegerField

from settings import ProfilesData  # the StorageGroup defined in settings.py above


class ProfileModel(Model):
    id = IntegerField()
    username = StringField()
    mail = StringField()
    sex = StringField()

    class Meta:
        storage = ProfilesData
        table_name = 'PROFILE'

We can access the raw Polars dataframe with 'Model.df'. It is lazy, meaning the data is only loaded when the attribute is accessed:

>>> ProfileModel.df
shape: (100, 4)
┌─────┬────────────────────┬──────────────────────────┬─────┐
│ id  ┆ username           ┆ mail                     ┆ sex │
│ --- ┆ ---                ┆ ---                      ┆ --- │
│ i64 ┆ str                ┆ str                      ┆ str │
╞═════╪════════════════════╪══════════════════════════╪═════╡
│ 1   ┆ ehayes             ┆ colleen63@hotmail.com    ┆ F   │
│ 2   ┆ thompsondeborah    ┆ judyortega@hotmail.com   ┆ F   │
│ 3   ┆ orivera            ┆ iperkins@hotmail.com     ┆ F   │
│ 4   ┆ ychase             ┆ sophia92@hotmail.com     ┆ F   │
│ …   ┆ …                  ┆ …                        ┆ …   │
│ 97  ┆ mary38             ┆ sylvia80@yahoo.com       ┆ F   │
│ 98  ┆ charlessteven      ┆ usmith@gmail.com         ┆ F   │
│ 99  ┆ plee               ┆ powens@hotmail.com       ┆ F   │
│ 100 ┆ elliottchristopher ┆ wilsonbenjamin@yahoo.com ┆ M   │
└─────┴────────────────────┴──────────────────────────┴─────┘
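
Since 'Model.df' is a plain Polars dataframe, any Polars operation can be chained onto it. A minimal sketch, assuming only standard Polars calls and the columns shown above:

import polars as pl

# Keep only the female profiles and count them with ordinary Polars operations.
female_profiles = ProfileModel.df.filter(pl.col('sex') == 'F')
print(female_profiles.height)  # number of rows left after the filter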

We can now create a new model whose data is derived from ProfileModel:

import polars as pl


class FemaleProfiles(Model):
    id = IntegerField()
    profile_id = IntegerField()
    mail = StringField()

    def calculate_data(self):
        return (
            ProfileModel.df
            .filter(ProfileModel.sex == 'F')       # keep only female profiles
            .with_row_count('new_id')              # add a fresh sequential id column
            .with_columns(
                pl.col('id').alias('profile_id')   # keep a reference to the original profile id
            )
        )

    class Meta:
        auto_select = True
        recalculate = True
        storage = ProfilesData
        table_name = 'PROFILE_FEMALES'

Et voilà! We can now create new dataframes from other dataframes.

If we now call:

FemaleProfiles.ensure_exists()

In this example, simply calling ensure_exists will:

  1. Check whether the table exists in 'dev' (SQLite).
  2. Read ProfileModel from 'dev' (SQLite).
  3. Calculate the new data (calculate_data).
  4. Validate that the columns of the resulting dataframe match the model's (in this case auto_select selects the model's columns automatically).
  5. Write the table to 'dev' (SQLite), creating it if it does not exist.
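
As a rough illustration only (not the framework's actual implementation; the helpers 'from_environment', 'table_exists' and 'write_table' are hypothetical names), the flow looks roughly like this:

def ensure_exists(model):
    # Hypothetical helper: resolve the storage for the current DATASAURUS_ENVIRONMENT.
    storage = model.Meta.storage.from_environment()
    if not storage.table_exists(model.Meta.table_name):    # hypothetical helper
        df = model.calculate_data()                         # user-defined on the model
        df = df.select(['id', 'profile_id', 'mail'])        # conceptually what auto_select does
        storage.write_table(model.Meta.table_name, df)      # hypothetical helper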

You can even move data to different environments or storages, making it easy to change formats or move data around.

You could for example call:

FemaleProfiles.save(to=ProfilesData.live)

This effectively moves the data from SQLite (dev) to PostgreSQL (live).

# Can also change formats
FemaleProfiles.save(to=ProfilesData.otherenvironment, format=LocalFormat.JSON)
FemaleProfiles.save(to=ProfilesData.otherenvironment, format=LocalFormat.CSV)
FemaleProfiles.save(to=ProfilesData.otherenvironment, format=LocalFormat.PARQUET)
