Habana's lightning-specific optimized plugins

Habana Lightning Plugins

Habana Lightning Plugins is a suite of plugins that accelerate model training on HPU with the Lightning framework. The plugins extend the Lightning framework to support HPU-specific features.

Currently, the following plugins are available:

  • HPUDataModule
  • HPUProfiler

Installation

To install Habana Lightning Plugins, run the following command:

python -um pip install habana-lightning-plugins

HPUDataModule

HPUDataModule extends the LightningDataModule class and uses Habana's dataloader to load and preprocess the input data. It offloads data preprocessing to the HPU, which in turn improves training performance. The wrapper also helps switch between the hardware and software preprocessor depending on the Gaudi device in use.

For more information, see the Habana Dataloader documentation.

Usage

The following shows an example of how to use the HPUDataModule:

  1. Import HPUDataModule, along with the other modules this example uses:
    import pytorch_lightning as pl
    from torchvision import transforms

    from habana_lightning_plugins.datamodule import HPUDataModule
  2. Create and initialize an HPUDataModule object with the dataset and the configuration required to preprocess the data:
    train_dir = "./path/to/train/data"
    val_dir = "./path/to/val/data"

    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

    train_transforms = [
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        normalize,
    ]
    val_transforms = [
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        normalize,
    ]

    data_module = HPUDataModule(
        train_dir,
        val_dir,
        train_transforms=train_transforms,
        val_transforms=val_transforms,
        num_workers=8,
        batch_size=32,
        shuffle=False,
        pin_memory=True,
        drop_last=True,
    )
  3. Create the Lightning trainer and the model:
    trainer = pl.Trainer(devices=1, accelerator="hpu", max_epochs=1, max_steps=2)
    model = RN50Module()  # Or any other model to be defined by user
  4. Pass the datamodule object to the trainer to run the train/val/test loops:
    trainer.fit(model, datamodule=data_module)
    trainer.validate(model, datamodule=data_module)

Examples

  • A sample script can be found at examples/hpu_datamodule_sample.py.
python examples/hpu_datamodule_sample.py --data-path <path to Imagenet dataset - ILSVRC2012>

A reference model using HPUDataModule can be found in the ResNet50 Model Reference.

Limitations

  • HPUDataModule supports the Imagenet dataset only.
  • HPUDataModule supports only 8 parallel data loader workers.

HPUProfiler

HPUProfiler is a Lightning implementation of the PyTorch profiler for HPU devices. It subclasses PyTorch Lightning's PyTorchProfiler and provides a profiling summary of PyTorch functions.

Default Profiling

For automatic profiling, create an HPUProfiler instance and pass it to the trainer. At the end of trainer.fit(), it generates a JSON trace for the run. If HPUProfiler is used without accelerator="hpu", it dumps only CPU traces, similar to PyTorchProfiler.

# Import the trainer and the profiler
from pytorch_lightning import Trainer

from habana_lightning_plugins.profiler import HPUProfiler

# Create profiler object
profiler = HPUProfiler()
accelerator = "hpu"

# Pass profiler to the trainer
trainer = Trainer(
    profiler=profiler,
    accelerator=accelerator,
)

Distributed Profiling

To profile a distributed model, pass the filename argument to HPUProfiler. A separate report is saved for each rank:

from pytorch_lightning import Trainer

from habana_lightning_plugins.profiler import HPUProfiler

profiler = HPUProfiler(filename="perf-logs")
trainer = Trainer(profiler=profiler, accelerator="hpu")

Custom Profiling

To profile custom actions of interest, reference a profiler in the LightningModule:

from pytorch_lightning import LightningModule, Trainer

from habana_lightning_plugins.profiler import HPUProfiler


# Reference profiler in LightningModule
class MyModel(LightningModule):
    def __init__(self, profiler=None):
        super().__init__()
        self.profiler = profiler

    # To profile in any part of your code, use the self.profiler.profile() function
    def custom_processing_step_basic(self, data):
        with self.profiler.profile("my_custom_action"):
            ...
        return data

    # Alternatively, use self.profiler.start("my_custom_action")
    # and self.profiler.stop("my_custom_action") functions
    # to enclose the part of code to be profiled.
    def custom_processing_step_granular(self, data):
        self.profiler.start("my_custom_action")
        ...
        self.profiler.stop("my_custom_action")
        return data


# Pass profiler instance to LightningModule
profiler = HPUProfiler()
model = MyModel(profiler)
trainer = Trainer(profiler=profiler, accelerator="hpu")

For more details on the profiler, refer to PyTorchProfiler.

Visualize Profiled Operations

The profiler dumps traces in JSON format. The traces can be visualized in two ways:

Using PyTorch TensorBoard Profiler

For further instructions, see https://github.com/pytorch/kineto/tree/master/tb_plugin.

# Install tensorboard
python -um pip install tensorboard torch-tb-profiler

# Start the TensorBoard server (default at port 6006):
tensorboard --logdir ./tensorboard --port 6006

# Now open the following url on your browser
http://localhost:6006/#profile

Using Chrome

1. Open Chrome and copy/paste this URL: `chrome://tracing/`.
2. Once tracing opens, click on `Load` at the top-right and load one of the generated traces.
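
The generated traces can also be inspected without a viewer. The following is a minimal sketch, not part of the plugins, that assumes the trace follows the Chrome trace event format (a bare list of events, or an object with a "traceEvents" list, where complete events carry "ph": "X" and a "dur" in microseconds). The function name summarize_trace and the file path in the usage comment are hypothetical.

```python
import json
from collections import defaultdict


def summarize_trace(path, top=10):
    """Return (name, total_duration_us) pairs for the longest-running event names."""
    with open(path) as f:
        trace = json.load(f)
    # Chrome trace files are either a bare list of events or an object
    # that holds the events under "traceEvents".
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for event in events:
        # "X" marks a complete event; its "dur" field is in microseconds.
        if event.get("ph") == "X":
            totals[event.get("name", "<unnamed>")] += event.get("dur", 0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]


# Usage (hypothetical path; substitute one of the generated traces):
# for name, total_us in summarize_trace("./path/to/trace.json"):
#     print(f"{name}: {total_us / 1000:.3f} ms")
```

Summing by event name gives a quick view of where time goes. Because of the synchronous measurement noted under Limitations, the totals are best read as relative costs.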

Limitations

  • When using HPUProfiler, the reported wall clock time is not representative of the true wall clock time. Profiled operations are forced to run synchronously, whereas many HPU ops normally run asynchronously. Use this profiler to find bottlenecks and breakdowns; for end-to-end wall clock time, use the SimpleProfiler.

  • HPUProfiler.summary() is not supported.

  • Passing profiler name as string "hpu" to the trainer is not supported.
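
As a lightweight alternative to SimpleProfiler for the end-to-end case in the first limitation, a run can be timed with the standard library alone. This is a sketch under that assumption, not something the plugins provide. The helper name wall_clock is hypothetical, and time.sleep stands in for the training call.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_clock(label, results):
    """Store the elapsed wall clock seconds of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with wall_clock("fit", timings):
    # Stand-in for trainer.fit(model, datamodule=data_module),
    # run without HPUProfiler attached.
    time.sleep(0.01)

print(f"fit took {timings['fit']:.3f} s")
```

Timing the run without HPUProfiler attached avoids the synchronization overhead described above.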

Supported Configurations

Validated on    SynapseAI Version    PyTorch Version    PyTorch Lightning Version
Gaudi           1.9.0                1.13.1             1.9.4
Gaudi2          1.9.0                1.13.1             1.9.4
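
To compare an environment against the validated versions above, the installed package versions can be read with importlib.metadata. This is a minimal sketch, not part of the plugins. It assumes the PyPI distribution names torch and pytorch-lightning, and uses a loose prefix match so that builds with a suffix after the validated version are accepted. The helper name find_mismatches is hypothetical.

```python
from importlib import metadata

# Validated versions from the table above, keyed by assumed PyPI distribution name.
VALIDATED = {"torch": "1.13.1", "pytorch-lightning": "1.9.4"}


def find_mismatches(validated=VALIDATED):
    """Map each package that differs from its validated version to the installed version.

    A package that is not installed maps to None.
    """
    mismatches = {}
    for package, expected in validated.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None
        # Loose prefix match: accepts build suffixes after the validated version.
        if installed is None or not installed.startswith(expected):
            mismatches[package] = installed
    return mismatches


print(find_mismatches())
```

An empty result means both packages match the validated versions. The SynapseAI version is not covered by this check.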

Changelog

  • habana-lightning-plugins introduced with support for datamodule and profiler plugins

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.
