Machine Learning Experiment Job Scheduler

Lightweight Cluster/Cloud VM Job Management 🚀

Are you looking for a tool to manage your training runs locally, on Slurm/Open Grid Engine clusters, SSH servers or Google Cloud Platform VMs? mle-scheduler provides a lightweight API to launch and monitor job queues. It orchestrates simultaneous runs for different configurations and/or random seeds, and it is meant to reduce boilerplate and make job resource specification intuitive. It is built around two core classes:

MLEJob: Launches and monitors a single job on a resource (Slurm, Open Grid Engine, GCP, SSH, etc.).
MLEQueue: Launches and monitors a queue of jobs with different training configurations and/or seeds.

For a quickstart, check out the notebook blog or the example scripts 📖

Supported resources: Local, Slurm, Grid Engine, SSH, GCP

Installation ⏳

pip install mle-scheduler

Managing a Single Job with `MLEJob` Locally 🚀

from mle_scheduler import MLEJob

# python train.py -config base_config_1.yaml -exp_dir logs_single -seed_id 1
job = MLEJob(
    resource_to_run="local",
    job_filename="train.py",
    config_filename="base_config_1.yaml",
    experiment_dir="logs_single",
    seed_id=1
)

_ = job.run()
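
The comment above shows the command that the job executes, so `train.py` has to accept those flags. Below is a minimal sketch of such an entry point, assuming the flag names `-config`, `-exp_dir` and `-seed_id` exactly as they appear in that comment. The function name `parse_train_args` and the default values are hypothetical and not part of mle-scheduler; check the flag names against the version you have installed.

```python
import argparse


def parse_train_args(argv=None):
    """Parse the flags shown in the command comment (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    # Flag names are copied from the command comment; defaults are assumptions.
    parser.add_argument("-config", type=str, required=True,
                        help="Path to the YAML configuration file")
    parser.add_argument("-exp_dir", type=str, default="experiments",
                        help="Directory in which results are stored")
    parser.add_argument("-seed_id", type=int, default=0,
                        help="Random seed for this run")
    return parser.parse_args(argv)


# Simulate the command line from the comment above.
args = parse_train_args(
    ["-config", "base_config_1.yaml", "-exp_dir", "logs_single", "-seed_id", "1"]
)
print(args.config, args.exp_dir, args.seed_id)
```

In a real script you would call `parse_train_args()` without arguments so that the flags are read from the command line.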

Managing a Queue of Jobs with `MLEQueue` Locally 🚀...🚀

from mle_scheduler import MLEQueue

# python train.py -config base_config_1.yaml -seed 0 -exp_dir logs_queue/<date>_base_config_1
# python train.py -config base_config_1.yaml -seed 1 -exp_dir logs_queue/<date>_base_config_1
# python train.py -config base_config_2.yaml -seed 0 -exp_dir logs_queue/<date>_base_config_2
# python train.py -config base_config_2.yaml -seed 1 -exp_dir logs_queue/<date>_base_config_2
queue = MLEQueue(
    resource_to_run="local",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_queue"
)

queue.run()
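
To make the fan-out explicit, the sketch below reproduces the four commands listed in the comments above as the Cartesian product of configurations and seeds. It is only an illustration of the expansion, not the library's internal code; `expand_queue` is a hypothetical helper and the `<date>` placeholder is kept as-is.

```python
from itertools import product
from pathlib import Path


def expand_queue(job_filename, config_filenames, random_seeds,
                 experiment_dir, date="<date>"):
    """Return one command string per (config, seed) pair (hypothetical helper)."""
    commands = []
    for config, seed in product(config_filenames, random_seeds):
        # Each configuration gets its own sub-directory named after the file stem.
        run_dir = f"{experiment_dir}/{date}_{Path(config).stem}"
        commands.append(
            f"python {job_filename} -config {config} -seed {seed} -exp_dir {run_dir}"
        )
    return commands


commands = expand_queue(
    "train.py",
    ["base_config_1.yaml", "base_config_2.yaml"],
    [0, 1],
    "logs_queue",
)
for command in commands:
    print(command)
```

Two configurations and two seeds yield four jobs, in the same order as the comments.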

Launching Slurm Cluster-Based Jobs 🐒

# Each job requests 5 CPU cores & 1 V100S GPU & loads CUDA 10.0
job_args = {
    "partition": "<SLURM_PARTITION>",  # Partition to schedule jobs on
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True,  # Whether to use anaconda venv
    "num_logical_cores": 5,  # Number of requested CPU cores per job
    "num_gpus": 1,  # Number of requested GPUs per job
    "gpu_type": "V100S",  # GPU model requested for each job
    "modules_to_load": "nvidia/cuda/10.0"  # Modules to load at start-up
}

queue = MLEQueue(
    resource_to_run="slurm-cluster",
    job_filename="train.py",
    job_arguments=job_args,
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    experiment_dir="logs_slurm",
    random_seeds=[0, 1]
)
queue.run()
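
For orientation, here is a rough sketch of the kind of batch-script header that such job arguments correspond to in Slurm terms. It uses the standard `sbatch` options `--partition`, `--cpus-per-task` and `--gres`; the function `slurm_header` is hypothetical, and the script that mle-scheduler actually generates may look different.

```python
def slurm_header(job_args):
    """Build an illustrative sbatch header from a job_args dict (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --partition={job_args['partition']}",
        f"#SBATCH --cpus-per-task={job_args['num_logical_cores']}",
    ]
    num_gpus = job_args.get("num_gpus", 0)
    if num_gpus > 0:
        gpu_type = job_args.get("gpu_type")
        # Generic resource request: gpu[:type]:count
        gres = f"gpu:{gpu_type}:{num_gpus}" if gpu_type else f"gpu:{num_gpus}"
        lines.append(f"#SBATCH --gres={gres}")
    modules = job_args.get("modules_to_load")
    if modules:
        # Accept a single module name or a list of names.
        if isinstance(modules, str):
            modules = [modules]
        lines.extend(f"module load {module}" for module in modules)
    return "\n".join(lines)


example_args = {
    "partition": "<SLURM_PARTITION>",
    "num_logical_cores": 5,
    "num_gpus": 1,
    "gpu_type": "V100S",
    "modules_to_load": "nvidia/cuda/10.0",
}
header = slurm_header(example_args)
print(header)
```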

Launching GridEngine Cluster-Based Jobs 🐘

# Each job requests 5 CPU cores & 1 V100S GPU w. CUDA 10.0 loaded
job_args = {
    "queue": "<GRID_ENGINE_QUEUE>",  # Queue to schedule jobs on
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True,  # Whether to use anaconda venv
    "num_logical_cores": 5,  # Number of requested CPU cores per job
    "num_gpus": 1,  # Number of requested GPUs per job
    "gpu_type": "V100S",  # GPU model requested for each job
    "gpu_prefix": "cuda"  # GPU resource name, used as: #$ -l {gpu_prefix}="{num_gpus}"
}

queue = MLEQueue(
    resource_to_run="sge-cluster",
    job_filename="train.py",
    job_arguments=job_args,
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    experiment_dir="logs_grid_engine",
    random_seeds=[0, 1]
)
queue.run()
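
The comment next to `gpu_prefix` gives the template of the Grid Engine resource request. A small sketch of how that template resolves for the values used above, assuming plain string formatting (the helper name `sge_gpu_directive` is hypothetical):

```python
def sge_gpu_directive(job_args):
    """Fill the template from the gpu_prefix comment (hypothetical helper)."""
    template = '#$ -l {gpu_prefix}="{num_gpus}"'
    return template.format(
        gpu_prefix=job_args["gpu_prefix"],
        num_gpus=job_args["num_gpus"],
    )


directive = sge_gpu_directive({"gpu_prefix": "cuda", "num_gpus": 1})
print(directive)  # #$ -l cuda="1"
```

The right value for `gpu_prefix` depends on how GPU resources are named on your cluster.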

Launching SSH Server-Based Jobs 🦊

ssh_settings = {
    "user_name": "<SSH_USER_NAME>",  # SSH server user name
    "pkey_path": "<PKEY_PATH>",  # Private key path (e.g. ~/.ssh/id_rsa)
    "main_server": "<SSH_SERVER>",  # SSH Server address
    "jump_server": '',  # Jump host address
    "ssh_port": 22,  # SSH port
    "remote_dir": "mle-code-dir",  # Dir to sync code to on server
    "start_up_copy_dir": True,  # Whether to copy code to server
    "clean_up_remote_dir": True  # Whether to delete remote_dir on exit
}

job_args = {
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True  # Whether to use anaconda venv
}

queue = MLEQueue(
    resource_to_run="ssh-node",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_ssh_queue",
    job_arguments=job_args,
    ssh_settings=ssh_settings
)

queue.run()

Launching GCP VM-Based Jobs 🦄

cloud_settings = {
    "project_name": "<GCP_PROJECT_NAME>",  # Name of your GCP project
    "bucket_name": "<GCS_BUCKET_NAME>", # Name of your GCS bucket
    "remote_dir": "<GCS_CODE_DIR_NAME>",  # Name of code dir in bucket
    "start_up_copy_dir": True,  # Whether to copy code to bucket
    "clean_up_remote_dir": True  # Whether to delete remote_dir on exit
}

job_args = {
    "num_gpus": 0,  # Number of requested GPUs per job
    "gpu_type": None,  # GPU requested e.g. "nvidia-tesla-v100"
    "num_logical_cores": 1,  # Number of requested CPU cores per job
}

queue = MLEQueue(
    resource_to_run="gcp-cloud",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_gcp_queue",
    job_arguments=job_args,
    cloud_settings=cloud_settings,
)
queue.run()

Development & Milestones for Next Release

You can run the test suite via `python -m pytest -vv tests/`. If you find a bug or are missing your favourite feature, feel free to contact me @RobertTLange or create an issue 🤗. In future releases I plan on implementing the following:

Add configuration details to examples (time of job, memory, etc.)
Clean up TPU GCP VM & JAX dependencies case
Add local launching of cluster jobs via SSH to headnode
Add Docker/Singularity container setup support
Add Azure support
Add AWS support

Release history

0.0.6: Mar 8, 2023
0.0.5: Jan 5, 2022
0.0.4: Dec 7, 2021 (this version)
0.0.3: Nov 12, 2021
0.0.2: Nov 12, 2021
0.0.1: Nov 12, 2021

Download files

Source Distribution: mle_scheduler-0.0.4.tar.gz (25.4 kB), uploaded Dec 7, 2021
Built Distribution: mle_scheduler-0.0.4-py3-none-any.whl (32.5 kB), uploaded Dec 7, 2021, Python 3

Hashes for mle_scheduler-0.0.4.tar.gz

Algorithm	Hash digest
SHA256	`0ae516158c0aa60d03804990eb050ec566f60d3d6a09af2cf120c5da36499e71`
MD5	`6244a5e7e0b5491bce530f5df2019a65`
BLAKE2b-256	`7ce338231a2f3ba2c249498eaa6d56c6d13b2d9fd1a4cfd3dda0e2eea97bc733`

Hashes for mle_scheduler-0.0.4-py3-none-any.whl

Algorithm	Hash digest
SHA256	`be6d8e62f26f09452d1a160f76b7b5a8f89aaea6696a4e169d2d99e61c511b39`
MD5	`2795f34b904277edd1dcd7b3cb3118b7`
BLAKE2b-256	`427d7381cf43eca6d8f93961739b947a8707668d54d771feb1fe7a0fa8961050`
