Lightning + Bagua

Deep Learning Training Acceleration with Bagua and Lightning AI
Bagua is a deep learning training acceleration framework that supports multiple advanced distributed training algorithms, including:
- Gradient AllReduce for centralized synchronous communication, where gradients are averaged among all workers.
- Decentralized SGD for decentralized synchronous communication, where each worker exchanges data with one or a few specific workers.
- ByteGrad and QAdam for low precision communication, where data is compressed into low precision before communication.
- Asynchronous Model Average for asynchronous communication, where workers are not required to be synchronized in the same iteration in a lock-step style.
By default, Bagua uses the Gradient AllReduce algorithm, which is also the algorithm implemented in DDP, but Bagua can usually produce higher training throughput thanks to its backend written in Rust.
Installation
pip install -U lightning lightning-bagua
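To verify the installation, you can check that the strategy class is importable (a minimal sanity check; BaguaStrategy is the entry point used throughout the examples below):

python -c "from lightning_bagua import BaguaStrategy; print(BaguaStrategy.__name__)"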
Usage
Simply set the strategy argument in the Trainer:
from lightning import Trainer
# train on 4 GPUs (using Bagua mode)
trainer = Trainer(strategy="bagua", accelerator="gpu", devices=4)
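For context, a minimal end-to-end script might look like the following sketch. The toy dataset and module here are illustrative stand-ins, not part of lightning-bagua; only the strategy="bagua" argument is specific to this integration:

import torch
from torch.utils.data import DataLoader, Dataset
import lightning.pytorch as pl
from lightning import Trainer


class RandomDataset(Dataset):
    # toy dataset of random vectors, for illustration only
    def __init__(self, size=64, length=256):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(64, 2)

    def training_step(self, batch, batch_idx):
        # a dummy loss so the example runs end to end
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    model = ToyModel()
    trainer = Trainer(strategy="bagua", accelerator="gpu", devices=4, max_epochs=1)
    trainer.fit(model, DataLoader(RandomDataset(), batch_size=32))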
By specifying the algorithm in the BaguaStrategy, you can select more advanced training algorithms featured by Bagua:
from lightning import Trainer
from lightning_bagua import BaguaStrategy
# train on 4 GPUs, using Bagua Gradient AllReduce algorithm
trainer = Trainer(
strategy=BaguaStrategy(algorithm="gradient_allreduce"),
accelerator="gpu",
devices=4,
)
# train on 4 GPUs, using Bagua ByteGrad algorithm
trainer = Trainer(
strategy=BaguaStrategy(algorithm="bytegrad"),
accelerator="gpu",
devices=4,
)
# train on 4 GPUs, using Bagua Decentralized SGD
trainer = Trainer(
strategy=BaguaStrategy(algorithm="decentralized"),
accelerator="gpu",
devices=4,
)
# train on 4 GPUs, using Bagua Low Precision Decentralized SGD
trainer = Trainer(
strategy=BaguaStrategy(algorithm="low_precision_decentralized"),
accelerator="gpu",
devices=4,
)
# train on 4 GPUs, using Asynchronous Model Average algorithm, with a synchronization interval of 100ms
trainer = Trainer(
strategy=BaguaStrategy(algorithm="async", sync_interval_ms=100),
accelerator="gpu",
devices=4,
)
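Extra keyword arguments on BaguaStrategy are forwarded to the underlying Bagua algorithm, which is how sync_interval_ms reaches the async algorithm above. As a sketch, assuming the same forwarding applies to the decentralized algorithm's peer_selection_mode and communication_interval parameters (names taken from the Bagua API, not verified against this wrapper):

from lightning import Trainer
from lightning_bagua import BaguaStrategy

# assumption: Bagua's DecentralizedAlgorithm options pass through the strategy
trainer = Trainer(
    strategy=BaguaStrategy(
        algorithm="decentralized",
        peer_selection_mode="shift_one",  # exchange with one shifting peer per step
        communication_interval=1,         # communicate every iteration
    ),
    accelerator="gpu",
    devices=4,
)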
To use QAdam, we need to initialize QAdamOptimizer first:
from lightning import Trainer
import lightning.pytorch as pl
from lightning_bagua import BaguaStrategy
from bagua.torch_api.algorithms.q_adam import QAdamOptimizer
class MyModel(pl.LightningModule):
    ...

    def configure_optimizers(self):
        # initialize the QAdam optimizer
        return QAdamOptimizer(self.parameters(), lr=0.05, warmup_steps=100)
model = MyModel()
trainer = Trainer(
accelerator="gpu",
devices=4,
strategy=BaguaStrategy(algorithm="qadam"),
)
trainer.fit(model)
Bagua relies on its own launcher to schedule jobs. Below, find examples using bagua.distributed.launch, which follows the torch.distributed.launch API:
# start training with 8 GPUs on a single node
python -m bagua.distributed.launch --nproc_per_node=8 train.py
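Since the launcher follows the torch.distributed.launch API, a multi-node job should accept the familiar rendezvous flags. A sketch, assuming the usual --nnodes, --node_rank, --master_addr, and --master_port semantics carry over unchanged:

# on node 0 of 2, 8 GPUs per node
python -m bagua.distributed.launch --nnodes=2 --node_rank=0 --nproc_per_node=8 \
    --master_addr="10.0.0.1" --master_port=29500 train.py

# on node 1 of 2
python -m bagua.distributed.launch --nnodes=2 --node_rank=1 --nproc_per_node=8 \
    --master_addr="10.0.0.1" --master_port=29500 train.py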
If the SSH service is available with passwordless login on each node, you can launch the distributed job from a single node with baguarun, which has a similar syntax to mpirun. When starting the job, baguarun will automatically spawn new processes on each of the training nodes provided by the --host_list option, where each node is described as an IP address followed by an SSH port.
# Run on node1 (or node2) to start training on two nodes (node1 and node2), 8 GPUs per node
baguarun --host_list hostname1:ssh_port1,hostname2:ssh_port2 --nproc_per_node=8 --master_port=port1 train.py
Note

You can also start training in the same way as with Distributed Data Parallel. However, system optimizations like Bagua-Net and performance autotuning can only be enabled through the bagua launcher. It is worth noting that with Bagua-Net, Distributed Data Parallel can also achieve better performance without modifying the training script.
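As an illustration, the Bagua tutorials describe launcher flags for enabling these optimizations; the exact flag names below are taken from those docs as an assumption and should be double-checked against your installed Bagua version:

# assumed flag: enable Bagua-Net via the launcher
python -m bagua.distributed.launch --enable_bagua_net --nproc_per_node=8 train.py

# assumed flag: enable performance autotuning
python -m bagua.distributed.launch --autotune_level 1 --nproc_per_node=8 train.py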
See Bagua Tutorials for more details on installation and advanced features.