
pacer 0.3.0

pacer is a lightweight Python package for implementing distributed data processing workflows.


About

Instead of defining a DAG which models the data flow from sources to a final result, pacer uses a pull model which is very similar to nesting function calls. Running such a workflow starts at the result node and recursively delegates work to its inputs.
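The pull model can be sketched in a few lines of plain Python (this is an illustration of the idea, not pacer's actual API): each node knows only its inputs, and requesting the result pulls values through the whole chain.

```python
# Minimal pull-model sketch (plain Python, not pacer's API): evaluation
# starts at the result node and recursively pulls values from its inputs.

class Node:
    def __init__(self, func, *inputs):
        self.func = func
        self.inputs = inputs

    def pull(self):
        # Recursively evaluate input nodes first, then apply this node's function:
        values = [i.pull() if isinstance(i, Node) else i for i in self.inputs]
        return self.func(*values)

length = Node(len, "abc")
doubled = Node(lambda n: 2 * n, length)

# Asking the result node for its value triggers the whole chain:
assert doubled.pull() == 6
```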

Originally we developed pacer for running analysis pipelines in emzed, a framework for analyzing LCMS data.

How does pacer work?

Under the hood pacer has two core components:

  • one for managing the distributed execution of chained computation steps:

    Processing steps in pacer are plain Python functions with some additional annotations. pacer tries to compute as many processing steps as possible in parallel, either because a function has to be applied to different data sets, or because it has more than one input and those inputs can be computed concurrently.

  • a distributed cache which is persisted on the file system

    In case of partial modifications of the inputs, a pacer workflow does not determine the needed update computations itself but uses a distributed cache which maps the input values of each processing step to its result. So a repeated run of the workflow with unchanged inputs runs the full workflow, with all processing steps returning their already known results immediately. Running the workflow with unknown or modified inputs executes only the needed computations and updates the cache.

These two components are independent and can be used separately.
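The idea behind the second component can be sketched in plain Python (again an illustration only, not pacer's implementation or API): a decorator that maps the input values of a step to a result stored on the file system, so a repeated call with known inputs returns immediately instead of recomputing.

```python
# Sketch of a file-system cache for processing steps (plain Python,
# not pacer's API): the cache key is derived from the step's name and
# its input values, the cached result is stored as a file.

import hashlib
import os
import pickle
import tempfile

def disk_cached(folder):
    os.makedirs(folder, exist_ok=True)
    def decorator(func):
        def wrapper(*args):
            # Derive a cache key from the function name and its inputs:
            key = hashlib.sha256(pickle.dumps((func.__name__, args))).hexdigest()
            path = os.path.join(folder, key)
            if os.path.exists(path):          # cache hit: skip recomputation
                with open(path, "rb") as fp:
                    return pickle.load(fp)
            result = func(*args)              # cache miss: compute and store
            with open(path, "wb") as fp:
                pickle.dump(result, fp)
            return result
        return wrapper
    return decorator

folder = tempfile.mkdtemp()

@disk_cached(folder)
def length(what):
    return len(what)

assert length("abc") == 3   # computed and stored
assert length("abc") == 3   # served from the cache
```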

Examples

We provide some simple examples which show how easy it is to use pacer. You can find these examples, extended to print more logging information, in the examples/ folder of the git repository.

In a real world LCMS workflow we would not use functions as simple as those below, but longer-running computation steps such as an LCMS peak picker and a subsequent peak aligner.

How to declare a pipeline

In this case our input sources are a list of Python strings ["a", "bc", "def"] and a tuple of numbers (1, 2). The example workflow computes the length of each string and multiplies it with every number from the tuple. In pure Python this could be implemented as follows:

import itertools

def length(what):
    return len(what)

def multiply(a, b):
    return a * b

words = ["a", "bc", "def"]
multipliers = (1, 2)

result = [multiply(length(w), v) for (w, v) in itertools.product(words, multipliers)]
assert result == [1, 2, 2, 4, 3, 6]

In order to transform this computation into a smart parallel processing pipeline, we use the apply and join function decorators from pacer and declare the dependencies among the single steps using function calls.

from pacer import apply, join, Engine

@apply
def length(what):
    return len(what)

@join
def multiply(a, b):
    return a * b

words = ["a", "bc", "def"]
multipliers = (1, 2)

# now we DECLARE the workflow (no execution at that time):
workflow = multiply(length(words), multipliers)

Running this workflow on three CPU cores is now easy. The computation steps run in parallel:

Engine.set_number_of_processes(3)
workflow.start_computations()
result = workflow.get_all_in_order()

assert result == [1, 2, 2, 4, 3, 6]

pacer's approach to computing needed updates for modified input data

As already stated above, pacer does not determine the needed update computations in case of modified input data, but uses a distributed cache instead. So running a workflow a second time will fetch the already known results of computations not affected by the changes, and only start computations with unknown input arguments.

We use decorators again. Adapting the example above needs only a few adjustments:

from pacer import apply, join, Engine, CacheBuilder

cache = CacheBuilder("/tmp/cache_000")

@apply
@cache
def length(what):
    return len(what)

@join
@cache
def multiply(a, b):
    return a * b

# inputs to workflow
words = ["a", "bc", "def"]
multipliers = (1, 2)

workflow = multiply(length(words), multipliers)

# run workflow
Engine.set_number_of_processes(3)
workflow.start_computations()
result = workflow.get_all_in_order()

assert result == [1, 2, 2, 4, 3, 6]

If you run these examples from the command line, you will see log output showing the parallel execution of the single steps as well as cache hits avoiding recomputations.

 
File Type Py Version Uploaded on Size
pacer-0.3.0.tar.gz (md5) Source 2014-09-04 7KB