
ETL from third-party APIs into Snowflake


Mr. Plow

Mr. Plow is Poppin’s ETL system to persist data from third-party APIs into a Snowflake database for later business analysis.

We use Python to:

  1. Call the source APIs and store the data in AWS S3 (“Extract”)

  2. Issue Snowflake commands to import the data from AWS S3 (“Stage”)

  3. Issue Snowflake commands to transform the new data from its original unstructured form to the tabular form used for analysis (“Transform”)

  4. Issue Snowflake commands to load the new tabular data to our main store, eliminating any duplicates (“Load”)

Mr. Plow can be run from the command line. In the future we will add support for running its functions as AWS Lambda functions.
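For a concrete picture, an end-to-end run from the command line might look like the sketch below. Only the extract subcommand appears verbatim later in this document; the stage, transform, and load subcommand names are assumptions mirroring the step names above, so run plow to see the real command list.

$ # hypothetical end-to-end run; only "extract" is confirmed elsewhere in this README
$ plow -c mr-plow.ini extract [options]...
$ plow -c mr-plow.ini stage [options]...
$ plow -c mr-plow.ini transform [options]...
$ plow -c mr-plow.ini load [options]...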

Why that ridiculous name?

Two reasons.

First, this is primarily a Snowflake client.

Second: https://www.youtube.com/watch?v=uYXEt7xOh1M

Installation

Currently we support only Python 3.6, the highest released version at the time of writing. There is no reason to suspect that 3.7 and up will not run Mr. Plow.

Lower Python 3.x versions may work, but they are not tested. You are free to try them. If you succeed in running Mr. Plow in a 3.3, 3.4, or 3.5 environment, please tell us on the GitHub issue tracker and we will add testing support.

First run python3 --version to ensure you have the required Python version. It may also be installed as python3.6.

We recommend installing into a virtual Python environment to manage library dependencies. Virtualenv is a great tool for this.

Suggested usage:

  1. Install virtualenv if not already present

  2. Create a virtual environment inside your project directory: virtualenv -p python3.6 venv

    • This creates a directory venv/, which contains an entire Python environment.

  3. Activate your virtual environment: source venv/bin/activate

    • This adds the virtual environment’s python and pip executables to your PATH.

  4. Install dependencies: pip install mr-plow

Now you should have the executable plow in your PATH. Simply run plow at the command line to see available commands and an example workflow.

All commands accept the --help option and print a helpful usage guide.

Configuration

Since we access both Snowflake and AWS S3, credentials are required. These may be provided via a config file, whose path is given to the plow CLI tool either via the -c option or by exporting the environment variable MR_PLOW_CONFIG.

The supported formats are JSON, YAML, and INI. INI files must have the appropriate settings in the [mr-plow] section. Run plow generate-config to have a minimal example generated for you to fill in.
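For illustration, a hand-written INI config using option names described later in this document might look like the following. All values are placeholders; plow generate-config remains the authoritative template.

[mr-plow]
# Direct AWS credentials for writing extracts to S3 (or use aws_profile)
aws_access_key_id = AKIA...
aws_secret_access_key = ...
# Credentials Snowflake uses to read the staged files back out of S3
staging_aws_access_key_id = AKIA...
staging_aws_secret_access_key = ...
# Vendor API credentials (only for the integrations you use)
livechat_login = you@example.com
livechat_api_key = ...
snapfulfil_login = ...
snapfulfil_password = ...
snapfulfil_api_domain = https://treltestapi.snapfulfil.net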

Mr. Plow fills in any configuration it does not find in the file by searching environment variables.

To see more, including a full list of config options, install Mr. Plow and run python -m pydoc plow.config.Config.

Snowflake

Commands that require Snowflake access draw their credentials from the execution environment. Mr. Plow uses the following environment variables to create a Snowflake connection; see the example after this list. If both SNOWFLAKE_FOO and SNOWSQL_FOO are present as environment variables, Mr. Plow uses SNOWFLAKE_FOO.

  • SNOWFLAKE_ACCOUNT or SNOWSQL_ACCOUNT

    • Visible in the URL you use to log in to Snowflake. For the author and his peers, this is “poppin”.

  • SNOWFLAKE_USER or SNOWSQL_USER

    • Same as your login on the Snowflake website.

  • SNOWFLAKE_PASSWORD or SNOWSQL_PASSWORD

    • This is the password for that login. On the command line it is optional, as it can be entered at a console prompt.

  • SNOWFLAKE_ROLE or SNOWSQL_ROLE

  • SNOWFLAKE_DATABASE or SNOWSQL_DATABASE

  • SNOWFLAKE_SCHEMA or SNOWSQL_SCHEMA

    • E.g. “public”. However, keep in mind that Mr. Plow creates its own schemas and never references tables without explicitly specifying a schema.

  • SNOWFLAKE_WAREHOUSE or SNOWSQL_WAREHOUSE
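For example, a shell session might export these variables as follows (all values are placeholders):

$ export SNOWFLAKE_ACCOUNT=poppin
$ export SNOWFLAKE_USER=jdoe
$ export SNOWFLAKE_PASSWORD=...   # optional on the CLI; prompted if absent
$ export SNOWFLAKE_ROLE=sysadmin
$ export SNOWFLAKE_DATABASE=analytics
$ export SNOWFLAKE_SCHEMA=public
$ export SNOWFLAKE_WAREHOUSE=etl_wh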

AWS

We use boto3 to connect to AWS S3. It has a rich system for specifying credentials, which can be used in its entirety by omitting AWS settings from your config file. This is appropriate in AWS Lambda, where credentials are taken care of in the background, or if you are an AWS CLI user and have existing configs. See boto3 documentation for more detail.

To specify credentials directly, set the aws_access_key_id and aws_secret_access_key options; alternatively, if you have an existing AWS CLI configuration, you may simply specify aws_profile.

When Snowflake reads your API data from S3, it requires you to provide AWS credentials with the appropriate S3 read permissions. Mr. Plow picks these up as distinct config options: staging_aws_access_key_id and staging_aws_secret_access_key. As with the direct S3 credentials, staging_aws_profile may also be provided.

Third-party APIs

We currently provide integrations for Livechat and Snapfulfil.

The Livechat integration requires an API login, which is specified as config options livechat_login and livechat_api_key.

The Snapfulfil integration requires an API login (config snapfulfil_login and snapfulfil_password) as well as an explicit designation of which Snap domain you want to make requests to (config snapfulfil_api_domain). Typically the latter is either https://treltestapi.snapfulfil.net or https://trelliveapi.snapfulfil.net.

Extending Mr. Plow

Mr. Plow is designed to be broadly useful and easily extensible. More integrations will be added over time, and you can easily add your own.

You must create your own implementation of plow.op.extract.Extractor to define how to fetch data. In a pinch you can use a RestExtractor; to allow automatic fetching of subsequent pages, you’ll have to subclass it and implement postprocess_response(). See the documentation of plow.op.extract.Extractor, and see plow.vendors.* for examples.
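For illustration only, a paginating extractor might be sketched as below. The signature and contract of postprocess_response() are assumptions on our part; consult python -m pydoc plow.op.extract.Extractor for the real interface.

# hypothetical sketch; see plow.op.extract.Extractor for the actual contract
from plow.op.extract import RestExtractor

class MyPagingExtractor(RestExtractor):
    def postprocess_response(self, response):
        # Assumed hook: inspect the page just fetched and arrange for the
        # next page to be requested (e.g. by following a "next" link in
        # the payload). The real signature may differ.
        ...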

You must furthermore create your own instances of plow.queries.Table to specify how to translate data from the documents you fetch using the Extractor into Snowflake DB tables. See documentation of plow.queries.Table, and see plow.queries.livechat and plow.queries.snapfulfil for tested examples.

Finally, you must create a plow.cli.Source pointing to all of these.

If you write your own adapter, we’d love to include it. Please send a pull request!

Example

If you have the following files in your Python project and mr-plow installed with pip:

# mymodule/plow/extract.py
from plow.op.extract import RestExtractor

class Extractor(RestExtractor):
    ...


# mymodule/plow/tables.py
from plow.queries import Table

class Table1(Table):
    select = "..."
    # etc...

class Table2(Table):
    ...

class Table3(Table):
    ...


# mymodule/plow/cli.py
from plow.cli import Source
from mymodule.plow.extract import Extractor
from mymodule.plow.tables import Table1, Table2, Table3

extractor = Extractor()
tables = {t.name: t for t in (Table1(), Table2(), Table3())}
source = Source(extractor=extractor, tables=tables)

Then you can invoke the plow CLI tool, using the --source option to point to your code:

$ plow -c mr-plow.ini extract --source mymodule.plow.cli:source [options]...

Development

Developer installation

To develop on Mr. Plow, clone this repository, set up and activate a virtualenv (see Installation) in the new working copy, and run pip install -e .[dev]. This installs the plow executable as well as development dependencies like Flake8 (the linter we use) and pytest.
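Concretely, assuming a hypothetical repository URL and directory name:

$ git clone <repository-url> mr-plow
$ cd mr-plow
$ virtualenv -p python3.6 venv
$ source venv/bin/activate
$ pip install -e .[dev]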

Testing

Mr. Plow is tested using pytest. If you clone the source and install using pip install -e .[dev], it is installed automatically along with several other test dependencies. Run pytest to run the unit tests; add --cov-report=term-missing or --cov-report=html to see detailed coverage information.

Testing is separated into two sections, unit tests and integration tests. Integration tests are disabled by default: specify pytest --integration to run integration tests as well, or pytest --no-unit to disable unit tests and run only integration tests.
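For example, using the flags described above:

$ pytest                              # unit tests only
$ pytest --cov-report=term-missing    # unit tests with per-line coverage
$ pytest --integration                # unit and integration tests
$ pytest --no-unit                    # integration tests only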

The unit tests use mocking for all external functionality, including Snowflake, S3, and third-party APIs, and so may be run without an internet connection or any of the service-specific configuration specified above. However, at this time, with very few exceptions these tests do not verify specific SQL queries or vendor-specific logic.

The integration tests run all Snowflake setup operations and a full run of ETL operations through S3 and Snowflake, so you must do some setup in order to run them. We mock access to the third party API, so that we can simulate the processing of a constant dataset and verify the result with precision. To run integration tests, you must supply a configuration file by exporting its path to PLOW_TEST_CONFIG.
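For example, with a hypothetical config path:

$ export PLOW_TEST_CONFIG=/path/to/plow-test.ini
$ pytest --integration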

Git hooks

This project adheres to several standards, including a style guide and unit tests. To aid developers in complying, we include Git hooks that can run on each commit. Install them as follows:

  • Include whatever hooks you wish in your own .git/hooks/pre-commit:

    $ echo '#!/usr/bin/env bash' >> .git/hooks/pre-commit
    $ echo 'flake8-hook.py || exit $?' >> .git/hooks/pre-commit
    $ echo 'unittest-hook.sh || exit $?' >> .git/hooks/pre-commit
    $ chmod u+x .git/hooks/pre-commit
  • Add the appropriate options to your git config:

    $ git config flake8.strict true
    $ git config plow.unit.strict true

Now the linter and the unit tests will run every time you commit, and you will be prompted to fix any deficiencies before committing. These checks can be disabled temporarily using environment variables. To skip the linting check:

$ git commit
Style errors found!
$ FLAKE8_STRICT=false git commit
[...] Success!

To skip unit tests:

$ PLOW_UNIT_STRICT=false git commit
