Skip to main content

ETL programming in Python

Project description

https://github.com/chrthomsen/pygrametl/actions/workflows/python-unittest-on-pr-and-push.yml/badge.svg

pygrametl (pronounced py-gram-e-t-l) is a Python framework that provides functionality commonly used when developing Extract-Transform-Load (ETL) programs. It is fully open-source and released under a 2-clause BSD license. As shown in the figure below, an ETL program that uses pygrametl is a standard Python program that imports pygrametl and uses the abstractions it provides. To provide developers with complete control over the data warehouse’s schema, pygrametl assumes that all of the dimension tables and fact tables used in the ETL program have already been created using SQL.

http://chrthomsen.github.io/pygrametl/assets/etl-with-pygrametl.svg

Defining the data warehouse’s schema using SQL and implementing the ETL program itself using standard Python turns out to be very efficient and effective, even when compared to drawing the program in a graphical user interface like Apache Hop or Pentaho Data Integration. pygrametl supports CPython and Jython so both existing Python code that uses native extensions models and PEP 249 connectors and JVM-based code that uses JDBC drivers can be used in the ETL program.

When using pygrametl, the developer creates an object for each data source, dimension and fact table and operate on rows in the form of standard Python dicts. Thus, (s)he can easily read rows from a data source using a loop like for row in datasource:, transform the rows using arbitrary Python code like row["price"] *= 1.25, and then add new dimension members to a dimension and facts to a fact table using dimension.insert(row) and facttable.insert(row), respectively. This is a very simple example, but pygrametl also supports much more complicated scenarios. For example, it is possible to create a single object for an entire snowflaked dimension. It is then possible to add a new dimension member with a single method call by using snowflake.insert(row). This will automatically perform all of the necessary lookups and insertions in the tables participating in the snowflaked dimension. pygrametl also supports multiple types of slowly changing dimensions. Again, the programmer only has to invoke a single method: slowlychanging.scdensure(row). This will perform the needed updates of both type 1 (i.e., overwrites) and type 2 (i.e., adding new versions).

pygrametl was first made publicly available in 2009. Since then, we have continuously made improvements and added new features. Version 2.8 was released in September 2023. Today, pygrametl is used in production systems in different sectors such as healthcare, finance, and transport.

Installation

pygrametl can be installed from PyPI with the following command:

$ pip install pygrametl

The current development version of pygrametl is available on GitHub:

$ git clone https://github.com/chrthomsen/pygrametl.git

For more information about installation see the Install Guide.

Documentation

The documentation is available in HTML and as a PDF. There are also installation and beginner guides available.

In addition to the documentation, multiple papers have been published about pygrametl. The papers are listed here and provide a more detailed description of the foundational ideas behind pygrametl but is obviously not keep up to date with changes and improvements implemented in the framework, for such see the documentation. If you use pygrametl in academia, please cite the relevant paper(s).

Community

To keep the development of pygrametl open for external participation, we have public mailing lists and use Github. Feel free to ask questions and provide all kinds of feedback:

  • pygrametl-user - For any questions about how to deploy and utilize pygrametl for ETL.

  • pygrametl-dev - For - questions and discussion about the development of pygrametl.

  • Github - Bugs and patches should be submitted to Github as issues and pull requests.

When asking a question or reporting a possible bug in pygrametl, please first verify that the problem still occurs with the latest version of pygrametl. If the problem persists after updating please include the following information, preferably with detailed version information, when reporting the problem:

  • Operating System.

  • Python Implementation.

  • Relational Database Management System.

  • Python Database Connector.

  • A short description of the problem with a minimal code example that reproduces the problem.

We encourage the use of Github and the mailing lists. For discussions not suitable for a public mailing list, you can, however, send us a private email.

Maintainers

pygrametl is maintained at Aalborg University by Christian Thomsen and Søren Kejser Jensen.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pygrametl-2.8.tar.gz (325.1 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page