gridtk

SGE Grid and Local Submission and Monitoring Tools for Idiap


Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Natural Language
- English
Programming Language
- Python
- Python :: 3
Topic
- System :: Clustering

Project description

The Job Manager is a Python wrapper around SGE utilities like qsub, qstat and qdel. It interacts with these tools to submit and manage grid jobs, making up a complete workflow ecosystem. Currently, it is set up to work with the SGE grid at Idiap, but it can be modified for use with other SGE grids.

Version 1.0 also introduced a local submission system. Instead of sending jobs to the SGE grid, it executes them in parallel processes on the local machine, using a simple scheduling system.

This package uses the Buildout system for installation. Please call:

$ python bootstrap.py
$ bin/buildout
$ bin/sphinx-build docs sphinx
$ firefox sphinx/index.html

to build and open the documentation, which contains more information than this README.

Submitting jobs to the SGE grid

Every time you interact with the Job Manager, a local database file (normally named submitted.sql3) is read or written so it preserves its state during decoupled calls. The database contains all information about jobs that is required for the Job Manager to:

- submit jobs of any kind
- probe for submitted jobs
- query SGE for submitted jobs
- identify problems with submitted jobs
- clean up logs from submitted jobs
- easily re-submit jobs if problems occur
- support parametric (array) jobs
- submit jobs with dependencies, which automatically get killed on failures

Many of these features are also achievable using the stock SGE utilities; the Job Manager simply makes them much easier to use.
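The idea of keeping state in a local database file between decoupled calls can be illustrated with a minimal sketch using Python's standard sqlite3 module. This is not the schema used by gridtk; the table layout and the function names (open_db, add_job, set_status, list_jobs) are hypothetical and chosen only for illustration.

```python
import sqlite3


def open_db(path="submitted.sql3"):
    """Open (or create) a job database. The table layout is hypothetical."""
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY,"
        " command TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'submitted',"
        " result INTEGER)"
    )
    return con


def add_job(con, job_id, command):
    """Record a newly submitted job."""
    with con:  # commits on success, rolls back on error
        con.execute(
            "INSERT INTO jobs (id, command) VALUES (?, ?)", (job_id, command)
        )


def set_status(con, job_id, status, result=None):
    """Update the status (and optionally the integer result) of a job."""
    with con:
        con.execute(
            "UPDATE jobs SET status = ?, result = ? WHERE id = ?",
            (status, result, job_id),
        )


def list_jobs(con):
    """Return (id, status, result, command) rows for all known jobs."""
    return con.execute(
        "SELECT id, status, result, command FROM jobs ORDER BY id"
    ).fetchall()
```

Because every call re-opens the same file, one invocation can add a job and a later, independent invocation can update or list it.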

If you really want to use the stock SGE utilities, gridtk provides wrapper scripts that let you use qsub, qstat and qdel without the SETSHELL command. For example, you can use qstat.py to query the list of your jobs running in the SGE grid.

Submitting a simple job

To interact with the Job Manager, use the jman utility. Make sure your shell environment is set up so that you can call it without typing the full path. The first task you will probably want to perform is submitting a job. Here is how:

$ jman -vv submit myscript.py --help
... Added job '<Job: 1> : submitted -- /usr/bin/python myscript.py --help' to the database
... Submitted job '<Job: 6151645> : queued -- /usr/bin/python myscript.py --help' to the SGE grid.

Submitting a parametric job

Parametric or array jobs are jobs that execute the same way, except for the environment variable SGE_TASK_ID, which changes for every job. This way, your program controls which part of the full job is executed in each (parallel) instance. This is great for launching thousands of jobs in the grid.
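As a minimal sketch of how a script might use SGE_TASK_ID to pick its share of the work: the helper names (task_index, my_share) are hypothetical, and the sketch assumes task IDs that start at 1 and that SGE sets the variable to the string "undefined" for non-array jobs.

```python
import os


def task_index(environ=os.environ):
    """Return the zero-based index of this array task.

    Assumes task IDs start at 1; falls back to index 0 when the variable
    is missing or set to "undefined" (non-array job).
    """
    raw = environ.get("SGE_TASK_ID", "1")
    return 0 if raw == "undefined" else int(raw) - 1


def my_share(items, index, n_tasks):
    """Pick every n_tasks-th item, starting at position `index`."""
    return items[index::n_tasks]
```

With 10 tasks, the task with SGE_TASK_ID=3 would process items 2, 12, 22, ... of the full list, so the instances never overlap.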

The next example sends 10 copies of the myscript.py job to the grid with the same parameters. Only the variable SGE_TASK_ID changes between them:

$ jman -vv submit -t 10 myscript.py --help
... Added job '<Job: 2> : submitted -- /usr/bin/python myscript.py --help' to the database
... Submitted job '<Job: 6151646> : queued -- /usr/bin/python myscript.py --help' to the SGE grid.

The -t option in jman accepts different kinds of job array descriptions. Run jman --help for details.
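To make the notion of an array description concrete, here is a small sketch that expands SGE-style descriptions of the form first-last:step into task IDs. The function name expand_tasks is hypothetical, and reading a bare number n as tasks 1 to n is an assumption based on the -t 10 example above; the forms actually accepted by jman are listed in its help output.

```python
def expand_tasks(description):
    """Expand 'n', 'first-last' or 'first-last:step' into a list of task IDs."""
    rng, _, step = description.partition(":")
    first, sep, last = rng.partition("-")
    if not sep:
        # A bare number n is read as tasks 1..n (assumption, see lead-in).
        first, last = "1", first
    first, last, step = int(first), int(last), int(step or 1)
    if first < 1 or last < first or step < 1:
        raise ValueError("invalid array description: %r" % description)
    return list(range(first, last + 1, step))
```

For instance, "2-10:4" expands to the task IDs 2, 6 and 10.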

Probing for jobs

Once the job has been submitted, you will notice that a database file (by default called submitted.sql3) has been created in the current working directory. It contains the information for the job you just submitted:

$ jman list

       job-id           queue        status            job-name                 dependencies                      submitted command line
====================  =========  ==============  ====================  ==============================  ===========================================
      6151645           all.q        queued             None                        []                 /usr/bin/python myscript.py --help
  6151646 [1-10:1]      all.q        queued             None                        []                 /usr/bin/python myscript.py --help

From this dump you can see the SGE job identifier including the number of array jobs, the queue the job has been submitted to, the current status of the job in the SGE grid, the dependencies of the job and the command that was executed in the SGE grid. The list command from jman will show the current status of the job, which is updated automatically as soon as the grid job finishes. Several calls to list might end up in

Submitting dependent jobs

Sometimes, the execution of one job depends on the execution of another job. The Job Manager can take care of this if you pass the ID of the job that has to finish first:

$ jman -vv submit --dependencies 6151645 -- /usr/bin/python myscript.py --help
... Added job '<Job: 3> : submitted -- /usr/bin/python myscript.py --help' to the database
... Submitted job '<Job: 6151647> : queued -- /usr/bin/python myscript.py --help' to the SGE grid.

Now, the new job will only run after the first one has finished.
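For readers who want to relate this to the stock SGE tools: qsub offers the -hold_jid option for job dependencies and -t for array jobs. The following sketch only builds such a command line; it is not how gridtk is implemented, and the function name qsub_command is hypothetical.

```python
def qsub_command(command, dependencies=(), array=None):
    """Build a qsub argument list for `command` (a list of strings).

    -cwd runs the job in the current directory, -b y treats the command
    as a binary rather than a job script.
    """
    argv = ["qsub", "-cwd", "-b", "y"]
    if dependencies:
        # Hold the job until all listed job IDs have finished.
        argv += ["-hold_jid", ",".join(str(job) for job in dependencies)]
    if array:
        # Array description such as "1-10:1".
        argv += ["-t", array]
    return argv + list(command)
```

The returned list could be passed to subprocess.call on a machine where qsub is available.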

Inspecting log files

When a job finishes, its result is shown in the list. If the result is non-zero, you might want to inspect the log files as follows:

$ jman report --errors-only
...
<Job: 6151646  - 'jman'> : failure (2) -- /usr/bin/python myscript.py --help
/usr/bin/python: can't open file 'myscript.py': [Errno 2] No such file or directory

Hopefully, that helps in debugging the problem!
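Since the reported result is an integer, it helps if the job script itself returns a meaningful exit status. A minimal sketch of such a script follows; the file-counting task and the particular codes 1 and 2 are hypothetical choices made for illustration.

```python
import sys


def main(argv):
    """Return 0 on success and a non-zero code on failure."""
    if len(argv) < 2:
        print("usage: myscript.py INPUT", file=sys.stderr)
        return 2
    try:
        with open(argv[1]) as handle:
            n_lines = sum(1 for _ in handle)
    except OSError as error:
        # The message ends up in the error log of the job.
        print("cannot read %s: %s" % (argv[1], error), file=sys.stderr)
        return 1
    print("%s has %d lines" % (argv[1], n_lines))
    return 0

# In a real script, finish with: sys.exit(main(sys.argv))
```

The message written to the standard error stream is what you would then find in the error log, while the return value becomes the result of the job.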

Re-submitting the job

If you are convinced the job did not work because of external conditions (e.g. temporary network outage), you may re-submit it, exactly like it was submitted the first time:

$ jman -vv resubmit --job-id 6151645
... Deleting job '6151645'
... Submitted job '<Job: 6151673> : queued -- /usr/bin/python myscript.py --help' to the SGE grid.

By default, the log files of the old job are deleted during re-submission. If for any reason you want to keep the old log files, use the --keep-logs option. Notice the new job identifier has changed as expected.

Stopping a grid job

In case you found an error in the code of a grid job that is currently executing, you might want to kill the job in the grid. For this purpose, you can use the command:

$ jman stop

The job is removed from the grid, but all log files are still available. A common use case is to stop the grid job, fix the bugs, and re-submit it.

Cleaning-up

If the job in question will not work no matter how many times we re-submit it, you may just want to clean it up and do something else. The Job Manager is here for you again:

$ jman -vvv delete
... Deleting job '8258327' from the database.

If jobs are still running or queued in the grid, they are stopped before they are removed from the database. By default, all logs are deleted together with the job. Inspecting the current directory will now show you that everything concerning the jobs is gone.

New from version 1.0

If you know gridtk from versions below 1.0, you might notice some differences. The main advantages of the new version are:

- When run in the grid, jobs now register themselves in the database, so there is no need to refresh the database by hand any more. As a consequence, the result (an integer value) of the job execution is available as soon as the job has finished, and there is no need to rely on the output of the error log any more.

  Note: If the job died in the grid, e.g., because of a timeout, this mechanism unfortunately still does not work. Use jman -vv communicate to check whether errors of this kind happened.

- Jobs are now stored in a proper .sql3 database. In addition to the jobs, each array job now has its own SQL model, which allows the status and result of each array job to be stored. To list the array jobs as well, use the --print-array-jobs option.
- If you have submitted a long list of commands with inter-dependencies, the Job Manager can now kill waiting jobs when a job they depend on has failed. Simply use the --stop-on-failure option when submitting the jobs.
- The verbosity of gridtk can now be selected in finer steps. Use the -v option several times to get 0: ERROR, 1: WARNING, 2: INFO, 3: DEBUG outputs. A good choice is probably -vv, which enables INFO output. Note that this setting is not propagated to the jobs that run in the grid.

  Note: The -v options must directly follow the jman command, before the action (like submit or list) is chosen. The --database option is now also a general option, which has to be given at the same position.

- One important improvement is that you can now execute jobs in parallel on the local machine. Please see the next section for details.
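The verbosity levels and the rule that general options precede the action can be mimicked with Python's argparse and logging modules. This is only a sketch under the assumption that a counting -v flag and sub-commands are used; the parser name jman-sketch and the function names are hypothetical, and gridtk's actual parser may differ.

```python
import argparse
import logging

# 0: ERROR, 1: WARNING, 2: INFO, 3: DEBUG
LEVELS = [logging.ERROR, logging.WARNING, logging.INFO, logging.DEBUG]


def build_parser():
    """Build a parser with general options followed by an action."""
    parser = argparse.ArgumentParser(prog="jman-sketch")
    parser.add_argument("-v", dest="verbosity", action="count", default=0)
    parser.add_argument("--database", default="submitted.sql3")
    actions = parser.add_subparsers(dest="action")
    actions.add_parser("submit")
    actions.add_parser("list")
    return parser


def log_level(verbosity):
    """Map the number of -v flags to a logging level, capped at DEBUG."""
    return LEVELS[min(verbosity, len(LEVELS) - 1)]
```

In this sketch, parsing ["-vv", "list"] yields verbosity 2 (INFO), whereas ["list", "-v"] is rejected because the sub-command parser does not know the -v option.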

Running jobs on the local machine

The Job Manager is designed to support largely the same infrastructure whether jobs are submitted locally or to the SGE grid. To submit jobs locally, just add the --local option to the jman command:

$ jman --local -vv submit /usr/bin/python myscript.py --help

One important difference to the grid submission is that the jobs that are submitted to the local machine do not run immediately, but are only collected in the submitted.sql3 database. To run the collected jobs using 4 parallel processes, simply use:

$ jman --local -vv run-scheduler --parallel 4

and all jobs that have not run yet are executed, keeping an eye on the dependencies.
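A simple scheduler of this kind can be sketched with the standard library. The code below is an illustration, not gridtk's scheduler: the function name run_scheduler and the job dictionary layout are hypothetical, and skipping jobs whose dependencies failed is an assumption made for the sketch.

```python
import subprocess
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_scheduler(jobs, parallel=4):
    """Run jobs given as {job_id: (argv, [dependency ids])}.

    Returns {job_id: exit code}; jobs that were skipped because a
    dependency failed are reported with the result None.
    """
    results, running, pending = {}, {}, dict(jobs)
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        while pending or running:
            progressed = False
            for job_id, (argv, deps) in list(pending.items()):
                if any(results.get(dep, 0) != 0 for dep in deps):
                    results[job_id] = None  # a dependency failed: skip
                elif all(dep in results for dep in deps):
                    # Each worker thread waits for one child process.
                    running[pool.submit(subprocess.call, argv)] = job_id
                else:
                    continue  # dependencies are still running or waiting
                del pending[job_id]
                progressed = True
            if running:
                done, _ = wait(running, return_when=FIRST_COMPLETED)
                for future in done:
                    results[running.pop(future)] = future.result()
            elif pending and not progressed:
                raise ValueError("unresolvable dependencies: %s" % sorted(pending))
    return results
```

At most `parallel` child processes run at the same time; the remaining jobs wait in the pool's queue until a worker becomes free.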

Another difference is that, by default, local jobs write their output to the console and not to log files. If you want the log file behavior back, specify the log directory during the submission:

$ jman --local -vv submit --log-dir logs myscript.py --help

Of course, you can choose a different log directory (also for the SGE submission).

Furthermore, the job identifiers during local submission usually start from 1 and increase. Also, during local re-submission, the job ID does not change.

Using the local machine for debugging

One possible use case for local job submission is debugging a job that failed in the grid. In this case, you can re-submit the grid job to the local machine:

$ jman --local -vv resubmit --job-id 6151646 --keep-logs

(as mentioned above, no new ID is assigned) and run the local scheduler:

$ jman --local -vv run-scheduler --no-log-files --job-ids 6151646

to print the output and the error to console instead of to log files.


Release history

Version	Release date
2.0.1	Mar 3, 2023
2.0.0	Mar 3, 2023
1.8.4	Jun 22, 2022
1.8.3	Nov 3, 2021
1.8.2	Apr 13, 2021
1.8.1	Oct 1, 2020
1.8.0	Jul 7, 2020
1.7.0	Jul 6, 2020
1.6.5	Feb 14, 2020
1.6.4	Oct 29, 2019
1.6.3	Jun 20, 2019
1.6.2	Jul 18, 2018
1.6.1	Apr 12, 2018
1.5.0	Nov 20, 2017
1.4.4	Jul 6, 2017
1.4.3	Jun 1, 2017
1.4.2	Dec 22, 2016
1.4.1	Oct 10, 2016
1.4.0	Oct 4, 2016
1.3.0	Apr 12, 2016
1.2.4	Nov 11, 2015
1.2.3	Nov 6, 2015
1.2.2	May 5, 2015
1.2.1	May 4, 2015
1.2.0	Jan 14, 2015
1.1.6	Sep 25, 2014
1.1.5	Jun 5, 2014
1.1.4	Apr 7, 2014
1.1.3	Mar 28, 2014
1.1.2	Jan 17, 2014
1.1.1	Nov 8, 2013
1.1.0	Oct 31, 2013 (this version)
1.0.3	Sep 13, 2013
1.0.2	Sep 3, 2013
1.0.1	Sep 3, 2013
1.0.0	Aug 30, 2013
0.3.7	Mar 1, 2013
0.3.6	Feb 13, 2013
0.3.5	Feb 4, 2013
0.3.4	Jan 7, 2013
0.3.3	Dec 4, 2012
0.3.2	Nov 28, 2012
0.3.1	Nov 16, 2012
0.3.0	Sep 26, 2012
0.2.1	Jul 19, 2012
0.2	Jul 13, 2012
0.1	Jul 6, 2012

Download files

Source Distribution

gridtk-1.1.0.zip (53.1 kB)

Uploaded Oct 31, 2013 Source

Hashes for gridtk-1.1.0.zip
Algorithm	Hash digest
SHA256	`0883a54faf5347669dd53706c6e5a2dc953ce42cabc2a49719aeb1d12a64274e`
MD5	`4d1fde77e50293f8beb5743d8768cf97`
BLAKE2b-256	`e4a7363e756395000871d42662218134a99443fefc5f94272bcbdac0724ec0d7`
