Scheduled task execution on top of AWS Data Pipeline

# Pipewelder

![A worker welding a pipe](welder.jpg)

Pipewelder is a framework that provides a command-line tool and Python API
to manage [AWS Data Pipeline](http://aws.amazon.com/datapipeline/) jobs from flat files.
Simple uses it as a cron-like job scheduler.

## Overview

Pipewelder aims to ease the task of scheduling jobs by defining very simple
pipelines which are little more than an execution schedule, offloading
most of the execution logic to files in S3.
Pipewelder uses Data Pipeline's concept of [data staging](http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-concepts-staging.html) to pull input files from S3 at the beginning of execution and to upload output files back to S3 at the end of execution.

If you follow Pipewelder's directory structure, all of your pipeline logic
can live in version-controlled flat files. The included command-line interface
gives you simple commands to validate your pipeline definitions, upload
task definitions to S3, and activate your pipelines.

## Installation

Pipewelder is available from [PyPI](https://pypi.python.org/pypi) via `pip`:
```
pip install pipewelder
```

The easiest way to get started is to clone the project from GitHub, copy
the example project from Pipewelder's tests, and then modify it to suit:
```bash
git clone https://github.com/SimpleFinance/pipewelder.git
cp -r pipewelder/tests/test_data my-pipewelder-project
```

If you're setting up Pipewelder and need help, feel free to email the author.

## Directory Structure

To use Pipewelder, you provide a template pipeline definition along with
one or more directories that correspond to particular pipeline instances.
The directory structure looks like this
(see [test_data](tests/test_data) for a working example):
```
pipeline_definition.json
pipewelder.json <- optional configuration file
my_first_pipeline/
    run
    values.json
    tasks/
        task1.sh
        task2.sh
my_second_pipeline/
    ...
```
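
Before uploading anything, it can help to confirm that a project follows this layout. The following is a minimal sketch, not part of Pipewelder: `check_layout` is a hypothetical helper, and it assumes that every non-hidden subdirectory of the project root is a pipeline instance and that `run` must be marked executable.

```python
import os


def check_layout(root):
    """Return a list of human-readable problems found under `root`."""
    problems = []
    if not os.path.isfile(os.path.join(root, "pipeline_definition.json")):
        problems.append("missing pipeline_definition.json")
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        # Assumption: every non-hidden subdirectory is a pipeline instance.
        if name.startswith(".") or not os.path.isdir(path):
            continue
        for required in ("run", "values.json"):
            if not os.path.isfile(os.path.join(path, required)):
                problems.append("%s: missing %s" % (name, required))
        run_path = os.path.join(path, "run")
        # Assumption: `run` needs the executable bit to be usable.
        if os.path.isfile(run_path) and not os.access(run_path, os.X_OK):
            problems.append("%s: run is not executable" % name)
    return problems
```

Calling `check_layout(".")` from the top-level directory returns an empty list when nothing is missing.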

The `values.json` file in each pipeline directory specifies parameter values
that are used to modify the template definition,
including the S3 paths for inputs, outputs, and logs.
Pipewelder also uses some of these values directly.
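
Because `values.json` carries the S3 paths, it can be useful to list them before uploading. The sketch below is illustrative only: `s3_paths` is a hypothetical helper, and it assumes the values are either wrapped in a top-level `"values"` object or stored as a flat JSON object. Check the example project for the layout Pipewelder actually expects.

```python
import json


def s3_paths(values_path):
    """Return {key: value} for every string value that looks like an S3 URI."""
    with open(values_path) as f:
        data = json.load(f)
    # Assumption: values may sit under a top-level "values" key;
    # otherwise fall back to treating the whole object as the values.
    values = data.get("values", data)
    return {
        key: value
        for key, value in values.items()
        if isinstance(value, str) and value.startswith("s3://")
    }
```

For a file containing a hypothetical key such as `"myS3InputDir": "s3://bucket/in"`, the function returns that single key/value pair and ignores non-S3 values.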

A [`ShellCommandActivity`](http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-shellcommandactivity.html) in the template definition simply looks for an executable file named `run` and executes it.
`run` is the entry point for whatever work you want your pipeline to do.

Often, your `run` executable will be a wrapper script to execute a variety of similar tasks.
When that's the case, use the `tasks` subdirectory to hold these definitions.
These tasks could be text files, shell scripts, SQL code, or whatever else
your `run` file expects.
Pipewelder gives the `tasks` folder special treatment: when uploading files,
the CLI makes sure to remove existing task definitions.
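
As an illustration of the wrapper pattern described above, here is a minimal sketch of the logic a Python `run` file might contain. It is not taken from Pipewelder: `run_tasks` is a hypothetical name, and the sketch assumes that every file in `tasks` is a shell script, that tasks should run in sorted filename order, and that the first failure should stop the run.

```python
import os
import subprocess


def run_tasks(tasks_dir="tasks"):
    """Run each file in `tasks_dir` with sh, in sorted name order.

    Returns the first nonzero exit code, or 0 if every task succeeds.
    """
    for name in sorted(os.listdir(tasks_dir)):
        path = os.path.join(tasks_dir, name)
        if not os.path.isfile(path):
            continue
        # Assumption: each task is a shell script runnable with `sh`.
        code = subprocess.call(["sh", path])
        if code != 0:
            return code
    return 0
```

A real `run` file would end by calling `sys.exit(run_tasks())` so that a failing task is reported as a failed activity. Stopping at the first failure is a design choice; a wrapper could equally continue and report all failures at the end.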

## Using the Command-Line Interface

The Pipewelder CLI should always be invoked from the top-level directory
of your definitions (the directory where `pipeline_definition.json` lives).
If your directory structure matches Pipewelder's expectations, it should
work without further configuration.

As you make changes to your template definition or `values.json` files,
it can be useful to check whether AWS considers your definitions valid:
```
$ pipewelder validate
```

Once you've defined your pipelines, you'll need to upload the files to S3:
```
$ pipewelder upload
```

Finally, activate your pipelines:
```
$ pipewelder activate
```

Any time you change the `values.json` or `pipeline_definition.json`, you'll
need to run the `activate` subcommand again. Because active pipelines can't
be modified, the `activate` command will delete the existing pipeline and
create a new one in its place. The run history for the previous pipeline will
be discarded.
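
The three subcommands can be chained from a script so that a failed validation never reaches `upload` or `activate`. This is a minimal sketch under two assumptions: that `pipewelder` is on your `PATH`, and that it exits with a nonzero status on failure, which this document does not confirm. `deploy` is a hypothetical helper name.

```python
import subprocess

# Subcommands in the order described above.
STEPS = ("validate", "upload", "activate")


def deploy(runner=subprocess.call):
    """Run each pipewelder subcommand in order.

    Returns (failed_step, exit_code), or (None, 0) if every step succeeds.
    Assumption: a nonzero exit code from the CLI signals failure.
    """
    for step in STEPS:
        code = runner(["pipewelder", step])
        if code != 0:
            return step, code
    return None, 0
```

As with the CLI itself, call `deploy()` from the top-level directory of your definitions. Since `activate` replaces the existing pipeline, you may prefer to drop it from `STEPS` and run it by hand. The `runner` parameter exists only so the function can be exercised without invoking the real CLI.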

## Acknowledgments

Pipewelder's package structure is based on [python-project-template](https://github.com/seanfisk/python-project-template).
