# Cluster-func
Cluster-func is a command line tool that lets you scale embarrassingly parallel
solutions written in Python with zero additional code.
## Install
With pip
```bash
pip install cluster-func
```
From source
```bash
git clone git://github.com/enewel101/cluster-func
cd cluster-func
python setup.py develop
```
## Why?
Too often I find myself writing multiprocessing and job-division / scheduling
boilerplate code, which, despite being conceptually straightforward, is tedious
and error-prone.
Sure, `xargs` is nice, assuming that you can conveniently get the shell to
spit out your argument values. But that's often not the case, and
you may want your arguments to be arbitrary Python types instead of strings.
And, sure, Hadoop is nice too, assuming you've got a lot of time to burn
configuring it, and that your mappers and reducers don't use too much
memory, and that you've loaded all your data into HDFS,
and that the results from your maps don't
yield too much data and overwhelm the network. Oh, and assuming
you enjoy writing boilerplate mapper and reducer code... wait, maybe Hadoop
isn't so nice... (OK, OK, it does have its place!)
## Basic usage
Cluster-func is designed for situations where you need to run a single function
on many different arguments. This kind of embarrassingly parallel problem
comes up pretty often. At its minimum, a solution involves defining
**A)** the function to be repeatedly called, and **B)** all of the different
arguments on which it should be called.
Cluster-func assumes that you have defined
these in the form of a *callable* and an *iterable* within a Python script, and
then it handles the business of spreading the work across cores and machines.
The nice thing about this approach is that you unavoidably define these two
things when you write your code for a single process anyway. So you'll get
multiprocessing and cluster processing basically for free!
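
For concreteness, here is a minimal sketch of what such a script might look like. The names `my_func` and `my_iterable` match the examples below, but the file paths and the work being done are made up:

```python
# my_script.py -- a hypothetical script to be driven by cluf.

def my_func(path, n_copies):
    """The callable: does one independent unit of work."""
    with open(path) as f:
        text = f.read()
    with open(path + '.out', 'w') as f:
        f.write(text * n_copies)

# The iterable: each yielded tuple becomes the positional arguments
# for one invocation of my_func.
my_iterable = [('data/part-%d.txt' % i, 2) for i in range(100)]
```

With a file like that in place, the direct-mode command shown below would call `my_func` once per tuple.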
The tool has two modes. In **direct mode**, it runs your function on the CPUs of
a single machine. In **dispatch mode**, it breaks the work into subjobs that can
be run on separate machines, and optionally submits them to a job scheduler
using `qsub`.
### Direct mode
This command:
```bash
cluf my_script.py --target=my_func --args=my_iterable --processes=12 # short options -t, -a, -p
```
will look in `my_script.py` for a callable named `my_func` and an iterable named
`my_iterable`, and call `my_func` once for each value yielded by `my_iterable`,
spreading the work over 12 worker processes.

If `args` yields a tuple, its contents will be unpacked and interpreted as the
positional arguments for one invocation of the target function. If you need
greater control, for example, to provide keyword arguments, then see
**<a href="#arguments-iterable">Arguments iterable</a>**.
`args` can also be a callable that *returns* an iterable (including a generator),
which is often more convenient.
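
For example, sticking with the hypothetical script sketched above, a generator function can stand in for the list:

```python
# Also in my_script.py (hypothetical): a callable that returns an iterable.
def my_args():
    for i in range(100):
        yield ('data/part-%d.txt' % i, 2)
```

Pointing `cluf` at `my_args` instead of `my_iterable` yields the same 100 invocations without building the whole list in memory.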
So, using `cluf` in direct mode lets you multiprocess any script without pools
or queues or even importing multiprocessing. But, if you really need to scale
up, then you'll want to use the dispatch mode.
### Dispatch mode
The main use of dispatch mode is to spread the work across a cluster that
uses the `qsub` command for job submission. But you can still use `cluf` to
spread work between machines that don't use `qsub`.
Dispatch mode is implicitly invoked when you specify a number of compute nodes
to use, e.g. `--nodes=12` (you can force the running mode using `--mode`, see **<a
href="#reference">Reference</a>**).

`cluf` divides the work by assigning iterations to *bins*, and each subjob only
executes the iterations in its assigned bin(s). You can also run individual bins
manually. For example, this command:
```bash
cluf my_script --bins=0/3 # short option: -b
```
would run `cluf` in direct mode, but only execute iterations falling into bin 0
out of 3, i.e., iterations 0, 3, 6, 9, etc. (Bins are zero-indexed.)
You can use this to start subjobs manually if you like.
You can assign multiple bins to one subjob. For example, the option
`--bins=0-2,5/10` will assign bins 0, 1, 2, and 5 (out of a total of 10 bins).
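
Conceptually, this default binning is just a round-robin over the iteration order. Here is a rough sketch of the idea (an illustration only, not `cluf`'s actual code):

```python
def runs_in_this_subjob(iteration_index, assigned_bins, num_bins):
    """An iteration belongs to bin (iteration_index mod num_bins)."""
    return iteration_index % num_bins in assigned_bins

# --bins=0/3 executes iterations 0, 3, 6, 9, ...
print([i for i in range(12) if runs_in_this_subjob(i, {0}, 3)])          # [0, 3, 6, 9]
# --bins=0-2,5/10 executes the iterations falling in bins 0, 1, 2, and 5 out of 10.
print([i for i in range(12) if runs_in_this_subjob(i, {0, 1, 2, 5}, 10)])  # [0, 1, 2, 5, 10, 11]
```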
### If your iterable is not stable
The default approach to binning assumes that the arguments iterable will
yield the same arguments in the same order during execution of each subjob.
If you can't ensure that, then binning can be based on the arguments themselves,
instead of their order.
There are two alternative ways to handle binning: using *argument hashing* and
*direct assignment*.
### Argument hashing
By specifying the `--hash` option, you can instruct `cluf` to hash one or more
of the arguments to determine each iteration's bin.
For example, doing:
```bash
cluf example --nodes=12 --hash=0-2,5,my_kwarg # short options: -n and -x
```
will hash the arguments in positions 0, 1, 2, and 5, along with the keyword argument
`my_kwarg`. If any hashed arguments are missing in an
iteration (because, recall, invocations may use different numbers of arguments),
they are simply omitted when calculating the hash.
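
The idea is roughly as follows (again just a sketch, not `cluf`'s actual code). Note that a deterministic hash is needed so that subjobs running in different processes, or on different machines, agree on the binning:

```python
import hashlib

def hash_bin(hashed_args, num_bins):
    """Map the selected arguments' string representations to a bin."""
    key = '|'.join(str(arg) for arg in hashed_args)
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_bins

# The same arguments always land in the same bin, regardless of the
# order in which the iterable happens to yield them.
print(hash_bin(('data/part-7.txt', 2), num_bins=12))
```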
### Direct assignment
The final method for dividing work is to include an argument that explicitly
specifies the bin for each iteration. To activate direct assignment, and to
specify which argument should be interpreted as the bin, use the `--key` option:
```bash
cluf my_script --nodes=12 --key=2 # short option: -k
```
Here, the argument in position 2 of each iteration is interpreted as that
iteration's bin. That argument should always be an integer between 0 and the
total number of bins minus 1.
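
For instance, a hypothetical iterable that assigns bins itself, placing the bin in position 2, might look like this:

```python
# In my_script.py (hypothetical): the argument in position 2 is the bin,
# always an integer between 0 and NUM_BINS - 1.
NUM_BINS = 12

my_iterable = [
    ('data/part-%d.txt' % i, 2, i % NUM_BINS)
    for i in range(100)
]
```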

## Reference

The `-e` option provides variable definitions for subjobs: for example,
`cluf my_script -n 12 -e 'FOO=bar BAZ="fizz bang"'` will set FOO equal to "bar"
and BAZ equal to "fizz bang" when the subjobs run. This option only takes
effect in dispatch mode. The remaining options are summarized below:

<pre>
  -P PREPEND_SCRIPT, --prepend-script PREPEND_SCRIPT
                        Path to a script whose contents should be included at
                        the beginning of subjob scripts, being executed before
                        running the subjob. You can include multiple comma-
                        separated paths. This option only takes effect in
                        dispatch mode.
  -A APPEND_SCRIPT, --append-script APPEND_SCRIPT
                        Path to a script whose contents should be included at
                        the end of subjob scripts, being executed after the
                        subjob completes. You can include multiple comma-
                        separated paths. This option only takes effect in
                        dispatch mode.
  -m {dispatch,direct}, --mode {dispatch,direct}
                        Explicitly set the mode of operation. Can be set to
                        "direct" or "dispatch". In direct mode the job is run,
                        whereas in dispatch mode a script for the job(s) is
                        created and optionally enqueued. Setting either -n or
                        -i implicitly sets the mode of operation to
                        "dispatch", unless specified otherwise.
  -x HASH, --hash HASH  Specify an argument or set of arguments to be used to
                        determine which bin an iteration belongs in. These
                        arguments should have a stable string representation
                        (i.e. no unordered containers or memory addresses) and
                        should be unique over the arguments iterable. This
                        should only be set if automatic binning won't work,
                        i.e. if your argument iterable is not stable.
  -k KEY, --key KEY     Integer specifying the positional argument to use as
                        the bin for each iteration. That key argument should
                        always take on a value that is an integer between 0
                        and num_bins-1. This should only be used if you really
                        need to control binning. Prefer to rely on automatic
                        binning (if your iterable is stable), or use the
                        -x option, which is more flexible and less error-prone.
  -n NODES, --nodes NODES
                        Number of compute nodes. This option causes the
                        command to operate in dispatch mode, unless the mode
                        is explicitly set.
  -i ITERATIONS, --iterations ITERATIONS
                        Approximate number of iterations per compute node.
                        This option causes the command to operate in dispatch
                        mode, unless the mode is explicitly set. Note that
                        using this instead of --nodes (-n) can lead to delay
                        because the total number of iterations has to be
                        counted to determine the number of compute nodes
                        needed.
</pre>