sciluigi

Helper library for writing dynamic, flexible workflows in luigi

Development Status
- 4 - Beta
Environment
- Console
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Natural Language
- English
Operating System
- POSIX :: Linux
Programming Language
Topic

Project description

Note: this library is still a work in progress, but it has already been put into production, where it has been executing real-world workloads successfully, and no serious flaws are known. Early adopters are very welcome to test it out and file issues for anything that does not work properly, for missing features, or with suggestions for improvements!

Scientific Luigi (SciLuigi for short) is a light-weight wrapper library around Spotify’s Luigi workflow system that aims to make writing scientific workflows (consisting of numerous interdependent commandline applications) more fluent, flexible and modular.

Luigi is a great, flexible, and very fun-to-use library. It has turned out, though, that its default way of defining dependencies, by hard-coding them in each task’s requires() function, is not optimal for some types of workflows common in scientific fields such as bioinformatics. There, multiple inputs and outputs, complex dependencies, and the need to quickly try different workflow connectivity (e.g. plugging in extra filtering steps) in an explorative fashion are central to the way of working.

SciLuigi was designed to solve some of the problems we were facing when trying to use luigi for defining complex workflows for data preprocessing, machine learning and cross-validation.

To achieve that, SciLuigi provides the following “features” over vanilla Luigi:

Separation of dependency definitions from the tasks themselves, for improved modularity and composability.
Inputs and outputs implemented as separate fields, a.k.a. “ports”, to allow specifying dependencies between specific input and output-targets rather than just between tasks. This is again to let such details of the network definition reside outside the tasks.
The fact that inputs and outputs are object fields also allows auto-completion support, which eases the network connection work (this works great with e.g. jedi-vim).
Inputs and outputs are connected with an intuitive “single-assignment syntax”.
Good default high-level logging of workflow tasks and execution times.
Produces an easy to read audit-report with high level information per task.
Integration with some HPC workload managers. (So far only SLURM though).

Because of Luigi’s great, easy-to-use API, these changes have been implemented as a very thin layer on top of luigi’s own API, and no changes to the luigi core are needed at all. You can therefore continue leveraging the work already being put into maintaining and further developing luigi by the team at Spotify and others.

Workflow code quick demo

Just to give a quick feel for what a workflow definition might look like in SciLuigi, check this code example (the implementation of the tasks is hidden here for brevity; see the Usage section further below for more details):

import sciluigi as sl

class MyWorkflow(sl.WorkflowTask):
    def workflow(self):
        # Initialize tasks:
        foowrt = self.new_task('foowriter', MyFooWriter)
        foorpl = self.new_task('fooreplacer', MyFooReplacer,
            replacement='bar')

        # Here we do the *magic*: Connecting outputs to inputs:
        foorpl.in_foo = foowrt.out_foo

        # Return the last task(s) in the workflow chain.
        return foorpl

That’s it! And again, see the “usage” section just below for a more detailed description of getting to this!

Prerequisites

Python 2.7 - 2.x (No Python 3.x support)
Luigi 1.3.x

Install

Install luigi, preferably through PyPI:
```
pip install luigi
```

Clone the sciluigi library

cd <your-code-directory>
git clone https://github.com/samuell/sciluigi.git
cd sciluigi # Enter the cloned repository before checking out the tag
git checkout tags/v0.9 # Check out the 0.9 version

Now you can use the library by just importing it in your python script, like so:
```
import sciluigi
```
Note that you can alias it to a shorter name, for brevity, and to save keystrokes:
```
import sciluigi as sl
```

Usage

Creating workflows in SciLuigi differs slightly from how it is done in vanilla Luigi. Very briefly, it is done in these main steps:

Create a workflow task class
Create task classes
Add the workflow definition in the workflow class’s workflow() method.
Add a run method at the end of the script
Run the script

Create a Workflow task

The first thing to do when creating a workflow, is to define a workflow task.

You do this by:

Creating a subclass of sciluigi.WorkflowTask
Implementing the workflow() method.

Example:

import sciluigi

class MyWorkflow(sciluigi.WorkflowTask):
    def workflow(self):
        pass # TODO: Implement workflow here later!

Create tasks

Then, you need to define some tasks that can be done in this workflow.

This is done by:

Creating a subclass of sciluigi.Task (or sciluigi.SlurmTask if you want Slurm support)
Adding fields named in_<yournamehere> for each input, in the new task class
Defining methods named out_<yournamehere>() for each output, that return sciluigi.TargetInfo objects. (sciluigi.TargetInfo is initialized with a reference to the task object itself, typically self, and a path name, which may be built from the paths of upstream tasks' targets.)
Defining luigi parameters for the task.
Implementing the run() method of the task.

Example:

Let’s define a simple task that just writes “foo” to a file named foo.txt:

class MyFooWriter(sciluigi.Task):
    # We have no inputs here
    # Define outputs:
    def out_foo(self):
        return sciluigi.TargetInfo(self, 'foo.txt')
    def run(self):
        with self.out_foo().open('w') as foofile:
            foofile.write('foo\n')

Then, let’s create a task that replaces “foo” with “bar”:

import luigi
import sciluigi

class MyFooReplacer(sciluigi.Task):
    replacement = luigi.Parameter() # Here, we take as a parameter
                                    # what to replace foo with.
    # Here we have one input, a "foo file":
    in_foo = None
    # ... and an output, a "bar file":
    def out_replaced(self):
        # As the path to the returned target(info), we
        # use the path of the foo file, plus a new suffix:
        return sciluigi.TargetInfo(self, self.in_foo().path + '.bar.txt')
    def run(self):
        with self.in_foo().open() as in_f:
            with self.out_replaced().open('w') as out_f:
                # Here we see that we use the parameter self.replacement:
                out_f.write(in_f.read().replace('foo', self.replacement))

Instead of the last lines, we could have used the command-line sed utility, available on Linux, by calling it through the built-in ex() method:

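def run(self):
    # Here, we use the built-in self.ex() method, to execute commands.
    # Since "in" is a reserved word in Python, the placeholders
    # are named infile and outfile:
    self.ex("sed 's/foo/{repl}/g' {infile} > {outfile}".format(
        repl=self.replacement,
        infile=self.in_foo().path,
        outfile=self.out_replaced().path))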
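
The command string handed to self.ex() can be checked in isolation before it is wired into a task. The following is a minimal sketch in plain Python, with no sciluigi dependency. The helper name build_sed_command and the example paths are hypothetical and used only for illustration. The placeholders are called infile and outfile because "in" is a reserved word in Python and cannot be used as a keyword argument name.

```python
def build_sed_command(replacement, in_path, out_path):
    """Hypothetical helper: format a sed command line as a string.

    The trailing /g makes sed replace every occurrence on a line,
    which matches the behaviour of Python's str.replace().
    """
    return "sed 's/foo/{repl}/g' {infile} > {outfile}".format(
        repl=replacement,
        infile=in_path,
        outfile=out_path)


# Example paths, chosen only for illustration.
cmd = build_sed_command('bar', 'foo.txt', 'foo.txt.bar.txt')
print(cmd)
```

Note that this sketch does no shell quoting, so it assumes that the replacement text and the paths contain no spaces or shell metacharacters.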

Write the workflow definition

Now, we can use these two tasks we created, to create a simple workflow, in our workflow class, that we also created above.

We do this by:

Instantiating the tasks, using the self.new_task(<unique_taskname>, <task_class>, *args, **kwargs) method, of the workflow task.
Connecting the tasks together, by pointing the right out_* method to the right in_* field.
Returning the last task in the chain, from the workflow method.

Example:

import sciluigi
class MyWorkflow(sciluigi.WorkflowTask):
    def workflow(self):
        foowriter = self.new_task('foowriter', MyFooWriter)
        fooreplacer = self.new_task('fooreplacer', MyFooReplacer,
            replacement='bar')

        # Here we do the *magic*: Connecting outputs to inputs:
        fooreplacer.in_foo = foowriter.out_foo

        # Return the last task(s) in the workflow chain.
        return fooreplacer
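
To see why a single assignment is enough to connect two tasks, it can help to look at the idea in plain Python. The sketch below is a simplified illustration and not SciLuigi's actual implementation: MiniTarget, FooWriter and FooReplacer are hypothetical stand-ins that use no luigi or sciluigi code, and the tasks are run by hand rather than by a scheduler. The point it shows is that the in_foo field receives the upstream out_foo method itself (no parentheses), so the downstream task can call it later to obtain the target.

```python
import os
import tempfile


class MiniTarget(object):
    """Hypothetical stand-in for a file target: it only wraps a path."""

    def __init__(self, path):
        self.path = path

    def open(self, mode='r'):
        return open(self.path, mode)


class FooWriter(object):
    def __init__(self, workdir):
        self.workdir = workdir

    # Output port: a method that returns a target.
    def out_foo(self):
        return MiniTarget(os.path.join(self.workdir, 'foo.txt'))

    def run(self):
        with self.out_foo().open('w') as foofile:
            foofile.write('foo\n')


class FooReplacer(object):
    # Input port: an empty field until the workflow assigns to it.
    in_foo = None

    def __init__(self, replacement):
        self.replacement = replacement

    def out_replaced(self):
        # The output path is derived from the upstream target's path.
        return MiniTarget(self.in_foo().path + '.bar.txt')

    def run(self):
        with self.in_foo().open() as in_f:
            with self.out_replaced().open('w') as out_f:
                out_f.write(in_f.read().replace('foo', self.replacement))


workdir = tempfile.mkdtemp()
writer = FooWriter(workdir)
replacer = FooReplacer(replacement='bar')

# The connection: assign the method itself, without calling it.
replacer.in_foo = writer.out_foo

# Run upstream first, then downstream (a scheduler would decide this order).
writer.run()
replacer.run()

with replacer.out_replaced().open() as result_file:
    result = result_file.read()
print(result)
```

Because the dependency lives in the assignment and not inside either class, another task with a compatible output method could be plugged into in_foo without editing FooReplacer.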

Add a run method to the end of the script

Now, the only thing that remains is to add a run method to the end of the script.

You can use luigi’s own luigi.run(), or our own two methods:

sciluigi.run()
sciluigi.run_local()

run_local() is handy if you don’t want to run a central scheduler daemon, but just want to run the workflow as a script.

Both of the above take the same options as luigi.run(), so you can for example set the main class to use (our workflow task):

# End of script ....
if __name__ == '__main__':
    sciluigi.run_local(main_task_cls=MyWorkflow)

Run the workflow

Now, you should be able to run the workflow as simply as:

python myworkflow.py

… provided of course, that the workflow is saved in a file named myworkflow.py.

More Examples

See the examples folder for more detailed examples!

Acknowledgements

This work is funded by:
- Faculty grants of the dept. of Pharmaceutical Biosciences, Uppsala University
- Bioinformatics Infrastructure for Life Sciences, BILS

Many ideas and much inspiration for the API are taken from:
- John Paul Morrison’s invention of and works on Flow-Based Programming

Release history

0.10.1

Jan 7, 2023

0.10.0

Jan 7, 2023

0.9.7

May 27, 2020

0.9.6b7 pre-release

Sep 21, 2017

0.9.5b6 pre-release

Apr 4, 2017

0.9.4b5 pre-release

Oct 26, 2015

0.9.3b4 pre-release

Oct 22, 2015

This version

0.9.2b3 pre-release

Oct 5, 2015

0.1.0

Jan 7, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sciluigi-0.9.2b3.tar.gz (16.5 kB view hashes)

Uploaded Oct 5, 2015 Source

Built Distribution

sciluigi-0.9.2b3-py2-none-any.whl (20.2 kB view hashes)

Uploaded Oct 5, 2015 Python 2

Hashes for sciluigi-0.9.2b3.tar.gz
Algorithm	Hash digest
SHA256	`e69e950c32670b19894a53957d1fc3ed2a167c6c3bcacd3c4042121aa6d84673`
MD5	`d5be46e7d38e9070ed1f09153ff65275`
BLAKE2b-256	`ed82f2dcbf1c6b26f92001fae1a1bc0becea7e1712fd22345f20cd671dfe934c`

Hashes for sciluigi-0.9.2b3-py2-none-any.whl
Algorithm	Hash digest
SHA256	`c90c917c69b2d27749f9f7986942005291fad58449d23742862ed6e3fe5a1f7d`
MD5	`2f19671023c7d48444237f3a89a173ba`
BLAKE2b-256	`5faf154ae9b35f2136dd243fab5ec5de1ad63b8c0efd874d370f04418f523903`