optimizer & lr scheduler collections in PyTorch

License: Apache 2.0

pytorch-optimizer is a collection of optimizers and lr schedulers for PyTorch.
I re-implemented each algorithm based on the original paper, with speed & memory tweaks and plug-ins. It also includes useful and practical optimization ideas.
Currently, 55 optimizers and 6 lr schedulers are supported!

Highly inspired by pytorch-optimizer.

Getting Started

For more, see the documentation.

Most optimizers are under the MIT or Apache 2.0 license, but a few, such as Fromage and Nero, are under the CC BY-NC-SA 4.0 license, which is non-commercial. Please double-check the license before using them at work.

Installation

$ pip3 install -U pytorch-optimizer

If there’s a version issue when installing the package, try the --no-deps option.

$ pip3 install -U --no-deps pytorch-optimizer

Simple Usage

from pytorch_optimizer import AdamP

model = YourModel()
optimizer = AdamP(model.parameters())

# or you can use optimizer loader, simply passing a name of the optimizer.

from pytorch_optimizer import load_optimizer

model = YourModel()
opt = load_optimizer(optimizer='adamp')
optimizer = opt(model.parameters())

You can also load the optimizer via torch.hub.

import torch

model = YourModel()
opt = torch.hub.load('kozistr/pytorch_optimizer', 'adamp')
optimizer = opt(model.parameters())

If you want to build the optimizer with parameters & configs, use the create_optimizer() API.

from pytorch_optimizer import create_optimizer

optimizer = create_optimizer(
    model,
    'adamp',
    lr=1e-3,
    weight_decay=1e-3,
    use_gc=True,
    use_lookahead=True,
)

Supported Optimizers

You can check the supported optimizers with the code below.

from pytorch_optimizer import get_supported_optimizers

supported_optimizers = get_supported_optimizers()

| Optimizer | Description | Official Code | Paper | Citation |
|---|---|---|---|---|
| AdaBelief | Adapting Step-sizes by the Belief in Observed Gradients | github | https://arxiv.org/abs/2010.07468 | cite |
| AdaBound | Adaptive Gradient Methods with Dynamic Bound of Learning Rate | github | https://openreview.net/forum?id=Bkg3g2R9FX | cite |
| AdaHessian | An Adaptive Second Order Optimizer for Machine Learning | github | https://arxiv.org/abs/2006.00719 | cite |
| AdamD | Improved bias-correction in Adam | | https://arxiv.org/abs/2110.10828 | cite |
| AdamP | Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights | github | https://arxiv.org/abs/2006.08217 | cite |
| diffGrad | An Optimization Method for Convolutional Neural Networks | github | https://arxiv.org/abs/1909.11015v3 | cite |
| MADGRAD | A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization | github | https://arxiv.org/abs/2101.11075 | cite |
| RAdam | On the Variance of the Adaptive Learning Rate and Beyond | github | https://arxiv.org/abs/1908.03265 | cite |
| Ranger | a synergistic optimizer combining RAdam and LookAhead, and now GC in one optimizer | github | https://bit.ly/3zyspC3 | cite |
| Ranger21 | a synergistic deep learning optimizer | github | https://arxiv.org/abs/2106.13731 | cite |
| Lamb | Large Batch Optimization for Deep Learning | github | https://arxiv.org/abs/1904.00962 | cite |
| Shampoo | Preconditioned Stochastic Tensor Optimization | github | https://arxiv.org/abs/1802.09568 | cite |
| Nero | Learning by Turning: Neural Architecture Aware Optimisation | github | https://arxiv.org/abs/2102.07227 | cite |
| Adan | Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models | github | https://arxiv.org/abs/2208.06677 | cite |
| Adai | Disentangling the Effects of Adaptive Learning Rate and Momentum | github | https://arxiv.org/abs/2006.15815 | cite |
| SAM | Sharpness-Aware Minimization | github | https://arxiv.org/abs/2010.01412 | cite |
| ASAM | Adaptive Sharpness-Aware Minimization | github | https://arxiv.org/abs/2102.11600 | cite |
| GSAM | Surrogate Gap Guided Sharpness-Aware Minimization | github | https://openreview.net/pdf?id=edONMAnhLu- | cite |
| D-Adaptation | Learning-Rate-Free Learning by D-Adaptation | github | https://arxiv.org/abs/2301.07733 | cite |
| AdaFactor | Adaptive Learning Rates with Sublinear Memory Cost | github | https://arxiv.org/abs/1804.04235 | cite |
| Apollo | An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization | github | https://arxiv.org/abs/2009.13586 | cite |
| NovoGrad | Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks | github | https://arxiv.org/abs/1905.11286 | cite |
| Lion | Symbolic Discovery of Optimization Algorithms | github | https://arxiv.org/abs/2302.06675 | cite |
| Ali-G | Adaptive Learning Rates for Interpolation with Gradients | github | https://arxiv.org/abs/1906.05661 | cite |
| SM3 | Memory-Efficient Adaptive Optimization | github | https://arxiv.org/abs/1901.11150 | cite |
| AdaNorm | Adaptive Gradient Norm Correction based Optimizer for CNNs | github | https://arxiv.org/abs/2210.06364 | cite |
| RotoGrad | Gradient Homogenization in Multitask Learning | github | https://openreview.net/pdf?id=T8wHz4rnuGL | cite |
| A2Grad | Optimal Adaptive and Accelerated Stochastic Gradient Descent | github | https://arxiv.org/abs/1810.00553 | cite |
| AccSGD | Accelerating Stochastic Gradient Descent For Least Squares Regression | github | https://arxiv.org/abs/1704.08227 | cite |
| SGDW | Decoupled Weight Decay Regularization | github | https://arxiv.org/abs/1711.05101 | cite |
| ASGD | Adaptive Gradient Descent without Descent | github | https://arxiv.org/abs/1910.09529 | cite |
| Yogi | Adaptive Methods for Nonconvex Optimization | | NIPS 2018 | cite |
| SWATS | Improving Generalization Performance by Switching from Adam to SGD | | https://arxiv.org/abs/1712.07628 | cite |
| Fromage | On the distance between two neural networks and the stability of learning | github | https://arxiv.org/abs/2002.03432 | cite |
| MSVAG | Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients | github | https://arxiv.org/abs/1705.07774 | cite |
| AdaMod | An Adaptive and Momental Bound Method for Stochastic Learning | github | https://arxiv.org/abs/1910.12249 | cite |
| AggMo | Aggregated Momentum: Stability Through Passive Damping | github | https://arxiv.org/abs/1804.00325 | cite |
| QHAdam | Quasi-hyperbolic momentum and Adam for deep learning | github | https://arxiv.org/abs/1810.06801 | cite |
| PID | A PID Controller Approach for Stochastic Optimization of Deep Networks | github | CVPR 18 | cite |
| Gravity | a Kinematic Approach on Optimization in Deep Learning | github | https://arxiv.org/abs/2101.09192 | cite |
| AdaSmooth | An Adaptive Learning Rate Method based on Effective Ratio | | https://arxiv.org/abs/2204.00825v1 | cite |
| SRMM | Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates | github | https://arxiv.org/abs/2201.01652 | cite |
| AvaGrad | Domain-independent Dominance of Adaptive Methods | github | https://arxiv.org/abs/1912.01823 | cite |
| PCGrad | Gradient Surgery for Multi-Task Learning | github | https://arxiv.org/abs/2001.06782 | cite |
| AMSGrad | On the Convergence of Adam and Beyond | | https://openreview.net/pdf?id=ryQu7f-RZ | cite |
| Lookahead | k steps forward, 1 step back | github | https://arxiv.org/abs/1907.08610 | cite |
| PNM | Manipulating Stochastic Gradient Noise to Improve Generalization | github | https://arxiv.org/abs/2103.17182 | cite |
| GC | Gradient Centralization | github | https://arxiv.org/abs/2004.01461 | cite |
| AGC | Adaptive Gradient Clipping | github | https://arxiv.org/abs/2102.06171 | cite |
| Stable WD | Understanding and Scheduling Weight Decay | github | https://arxiv.org/abs/2011.11152 | cite |
| Softplus T | Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM | | https://arxiv.org/abs/1908.00700 | cite |
| Un-tuned w/u | On the adequacy of untuned warmup for adaptive optimization | | https://arxiv.org/abs/1910.04209 | cite |
| Norm Loss | An efficient yet effective regularization method for deep neural networks | | https://arxiv.org/abs/2103.06583 | cite |
| AdaShift | Decorrelation and Convergence of Adaptive Learning Rate Methods | github | https://arxiv.org/abs/1810.00143v4 | cite |
| AdaDelta | An Adaptive Learning Rate Method | | https://arxiv.org/abs/1212.5701v1 | cite |
| Amos | An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale | github | https://arxiv.org/abs/2210.11693 | cite |
| SignSGD | Compressed Optimisation for Non-Convex Problems | github | https://arxiv.org/abs/1802.04434 | cite |
| Sophia | A Scalable Stochastic Second-order Optimizer for Language Model Pre-training | github | https://arxiv.org/abs/2305.14342 | cite |
| Prodigy | An Expeditiously Adaptive Parameter-Free Learner | github | https://arxiv.org/abs/2306.06101 | cite |

Supported LR Scheduler

You can check the supported learning rate schedulers with the code below.

from pytorch_optimizer import get_supported_lr_schedulers

supported_lr_schedulers = get_supported_lr_schedulers()

| LR Scheduler | Description | Official Code | Paper | Citation |
|---|---|---|---|---|
| Explore-Exploit | Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule | | https://arxiv.org/abs/2003.03977 | cite |
| Chebyshev | Acceleration via Fractal Learning Rate Schedules | | https://arxiv.org/abs/2103.01338 | cite |

Useful Resources

Several optimization ideas to regularize & stabilize the training. Most of the ideas are applied in the Ranger21 optimizer.

Also, most of the captures are taken from the Ranger21 paper.

Adaptive Gradient Clipping

Gradient Centralization

Softplus Transformation

Gradient Normalization

Norm Loss

Positive-Negative Momentum

Linear learning rate warmup

Stable weight decay

Explore-exploit learning rate schedule

Lookahead

Chebyshev learning rate schedule

(Adaptive) Sharpness-Aware Minimization

On the Convergence of Adam and Beyond

Improved bias-correction in Adam

Adaptive Gradient Norm Correction

Adaptive Gradient Clipping

This idea was originally proposed in the NFNet (Normalizer-Free Networks) paper.
AGC (Adaptive Gradient Clipping) clips gradients based on the unit-wise ratio of gradient norms to parameter norms.
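
As a rough illustration of the unit-wise ratio described above, here is a minimal sketch in plain Python (no PyTorch), treating each row of a weight matrix as one unit. The function names and the clip and eps values are assumptions chosen for the example, not this library's API or defaults.

```python
import math


def unitwise_norm(row):
    """L2 norm of one unit (here: one row of a 2-D weight or gradient)."""
    return math.sqrt(sum(x * x for x in row))


def adaptive_gradient_clipping(weights, grads, clip=0.01, eps=1e-3):
    """Rescale each gradient row whose norm exceeds clip * (parameter row norm)."""
    clipped = []
    for w_row, g_row in zip(weights, grads):
        w_norm = max(unitwise_norm(w_row), eps)  # eps keeps zero-initialized units trainable
        g_norm = unitwise_norm(g_row)
        max_norm = clip * w_norm
        if g_norm > max_norm:
            scale = max_norm / g_norm
            g_row = [g * scale for g in g_row]
        clipped.append(list(g_row))
    return clipped


weights = [[3.0, 4.0], [0.6, 0.8]]    # row norms: 5.0 and 1.0
grads = [[0.3, 0.4], [0.003, 0.004]]  # row norms: 0.5 and 0.005
clipped = adaptive_gradient_clipping(weights, grads)
```

In this toy input the first row is scaled down to norm 0.05 (0.01 times its parameter norm), while the second row is already below its threshold and passes through unchanged.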

Gradient Centralization

https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/gradient_centralization.png

Gradient Centralization (GC) operates directly on gradients by centralizing the gradient to have zero mean.
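
The centralization step can be sketched in a few lines of plain Python. This assumes a 2-D gradient where each row belongs to one output unit and the mean is taken over that row; the function name is hypothetical and this is not the library's implementation.

```python
def centralize_gradient(grad):
    """Subtract each row's mean so every row of the gradient has zero mean."""
    centralized = []
    for row in grad:
        mean = sum(row) / len(row)
        centralized.append([g - mean for g in row])
    return centralized


grad = [[1.0, 2.0, 3.0], [-1.0, 0.0, 4.0]]
centered = centralize_gradient(grad)
# Row means were 2.0 and 1.0, so the rows become [-1, 0, 1] and [-2, -1, 3].
```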

Softplus Transformation

Running the final variance denominator through the softplus function lifts extremely tiny values to keep them viable.
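
A small sketch of the effect, in plain Python. It assumes the denominator is the square root of the second-moment estimate and uses softplus(x) = log(1 + exp(beta * x)) / beta; beta = 50 and the function names are assumptions made for illustration.

```python
import math


def softplus(x, beta=50.0, threshold=20.0):
    """softplus(x) = log(1 + exp(beta * x)) / beta, with a shortcut for large inputs."""
    if beta * x > threshold:
        return x  # softplus is numerically indistinguishable from x here
    return math.log1p(math.exp(beta * x)) / beta


def softplus_denominator(second_moment, beta=50.0):
    """Smooth the Adam-style denominator sqrt(v) instead of adding a small eps."""
    return softplus(math.sqrt(second_moment), beta)


tiny = softplus_denominator(1e-16)  # sqrt(v) = 1e-8 is lifted to about log(2) / 50
large = softplus_denominator(1.0)   # large values are left essentially unchanged
```

With these settings a near-zero denominator is lifted to roughly 0.0139, which bounds the step size, while a denominator of 1.0 comes back as 1.0.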

Gradient Normalization

Norm Loss

https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/norm_loss.png

Positive-Negative Momentum

https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/positive_negative_momentum.png

Linear learning rate warmup

https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/linear_lr_warmup.png

Stable weight decay

https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/stable_weight_decay.png

Explore-exploit learning rate schedule

https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/explore_exploit_lr_schedule.png

Lookahead

k steps forward, 1 step back. Lookahead keeps an exponential moving average of the weights that is
updated and substituted for the current weights every k_{lookahead} steps (5 by default).
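
The update can be sketched on a scalar problem with plain SGD as the inner optimizer. The function name, learning rate, and alpha = 0.5 are assumptions for the example; only k = 5 comes from the text above.

```python
def lookahead_sgd(grad_fn, w0, lr=0.1, k=5, alpha=0.5, steps=20):
    """Run SGD on the fast weights and sync with the slow weights every k steps."""
    slow = w0
    fast = w0
    for step in range(1, steps + 1):
        fast = fast - lr * grad_fn(fast)         # k steps forward (inner optimizer)
        if step % k == 0:
            slow = slow + alpha * (fast - slow)  # 1 step back: move slow weights toward fast
            fast = slow                          # restart the fast weights from the slow ones
    return fast


# Minimize f(w) = w ** 2, whose gradient is 2 * w, starting from w = 1.0.
result = lookahead_sgd(lambda w: 2.0 * w, w0=1.0)
```

Each block of five SGD steps shrinks the fast weight by a factor of 0.8 ** 5, and the interpolation with alpha = 0.5 keeps the slow weight halfway between its old value and that result.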

Chebyshev learning rate schedule

Acceleration via Fractal Learning Rate Schedules.

(Adaptive) Sharpness-Aware Minimization

Sharpness-Aware Minimization (SAM) simultaneously minimizes loss value and loss sharpness.
In particular, it seeks parameters that lie in neighborhoods having uniformly low loss.
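
One way to picture the two-pass update is the sketch below, written in plain Python on a toy quadratic. It first moves to a nearby point in the direction of steepest ascent, then applies the gradient computed there to the original weights. The names, rho, lr, and the toy loss are assumptions for illustration, not the library's SAM class.

```python
import math


def sam_step(grad_fn, w, lr=0.05, rho=0.05):
    """One SAM-style update: ascend by rho, then descend with the perturbed gradient."""
    g = grad_fn(w)
    g_norm = math.sqrt(sum(x * x for x in g)) + 1e-12  # avoid division by zero
    w_adv = [wi + rho * gi / g_norm for wi, gi in zip(w, g)]  # approximate worst-case neighbor
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def grad_fn(w):
    """Gradient of the toy loss f(w) = 0.5 * (w[0] ** 2 + 10 * w[1] ** 2)."""
    return [w[0], 10.0 * w[1]]


w = [1.0, 1.0]
for _ in range(200):
    w = sam_step(grad_fn, w)
```

With a fixed rho the iterates settle into a small neighborhood of the minimum rather than landing on it exactly, because every step evaluates the gradient at a perturbed point.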

On the Convergence of Adam and Beyond

Convergence issues can be fixed by endowing such algorithms with ‘long-term memory’ of past gradients.
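
The "long-term memory" can be read as keeping the running maximum of the second-moment estimate, so that one large past gradient keeps damping later steps. A scalar sketch under that reading follows; it omits bias correction, and the function name and hyperparameters are assumptions.

```python
import math


def amsgrad_steps(grads, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply AMSGrad-style updates to a scalar weight and record v_max after each step."""
    m = v = v_max = 0.0
    w = 0.0
    history = []
    for g in grads:
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        v_max = max(v_max, v)  # never let the denominator shrink
        w -= lr * m / (math.sqrt(v_max) + eps)
        history.append(v_max)
    return w, history


# One large gradient followed by small ones: v decays, but v_max remembers the spike.
w, history = amsgrad_steps([10.0, 0.1, 0.1, 0.1])
```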

Improved bias-correction in Adam

With the default bias-correction, Adam may actually make larger than requested gradient updates early in training.
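
To see what changes, the sketch below compares the step sizes of standard Adam with a variant that keeps the bias correction on the second moment only, which is my understanding of the AdamD proposal. It uses a constant gradient of 1.0, so it shows only the warmup-like shrinking of early steps and not the overshoot case itself. All names and values are assumptions.

```python
import math


def early_step_sizes(lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=5, grad=1.0):
    """Return per-step update sizes for Adam and for the second-moment-only correction."""
    m = v = 0.0
    adam, adamd = [], []
    for t in range(1, steps + 1):
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad * grad
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment (kept in both variants)
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment (standard Adam only)
        adam.append(lr * m_hat / (math.sqrt(v_hat) + eps))
        adamd.append(lr * m / (math.sqrt(v_hat) + eps))
    return adam, adamd


adam, adamd = early_step_sizes()
# Adam takes a full lr-sized step from t = 1; the variant ramps up as lr * (1 - beta1 ** t).
```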

Adaptive Gradient Norm Correction

Correcting the norm of gradient in each iteration based on the adaptive training history of gradient norm.
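
A hedged sketch of one reading of this correction: keep an exponential moving average of past gradient norms and, when the current gradient's norm falls below that average, scale the gradient up to match it. The function name and gamma = 0.95 are assumptions for the example.

```python
import math


def adanorm_correct(grads, gamma=0.95):
    """Rescale each gradient whose norm is below the running average of past norms."""
    ema_norm = 0.0
    corrected = []
    for g in grads:
        g_norm = math.sqrt(sum(x * x for x in g))
        ema_norm = gamma * ema_norm + (1 - gamma) * g_norm
        if g_norm > 0 and ema_norm > g_norm:
            scale = ema_norm / g_norm  # lift the gradient norm up to the running average
            g = [x * scale for x in g]
        corrected.append(list(g))
    return corrected


grads = [[3.0, 4.0], [0.03, 0.04]]  # norms: 5.0, then a sudden drop to 0.05
out = adanorm_correct(grads)
```

The first gradient passes through unchanged because the running average starts at zero. The second is scaled from norm 0.05 up to the running average of 0.24.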

Citation

Please cite the original authors of the optimization algorithms. If you use this software, please cite it as below, or use the “cite this repository” button.

@software{Kim_pytorch_optimizer_Optimizer_and_2022,
    author = {Kim, Hyeongchan},
    month = {1},
    title = {{pytorch_optimizer: optimizer and lr scheduler collections in PyTorch}},
    version = {1.0.0},
    year = {2022}
}

Author

Hyeongchan Kim / @kozistr
