pytorch_optimizer

optimizer & lr scheduler collections in PyTorch


pytorch-optimizer is a collection of optimizers and LR schedulers for PyTorch.
I re-implemented the algorithms based on the original papers, with speed and memory tweaks and plug-ins. It also includes useful and practical optimization ideas.
Currently, 55 optimizers and 6 LR schedulers are supported!

Highly inspired by pytorch-optimizer.

Getting Started

For more, see the documentation.

Most optimizers are under the MIT or Apache 2.0 license, but a few, such as Fromage and Nero, are under the BY-NC-SA 4.0 license, which is non-commercial. Please double-check the license before using them at work.

Installation

$ pip3 install -U pytorch-optimizer

If there's a version issue when installing the package, try the --no-deps option.

$ pip3 install -U --no-deps pytorch-optimizer

Simple Usage

from pytorch_optimizer import AdamP

model = YourModel()
optimizer = AdamP(model.parameters())

# Alternatively, use the optimizer loader and pass the name of the optimizer.

from pytorch_optimizer import load_optimizer

model = YourModel()
opt = load_optimizer(optimizer='adamp')
optimizer = opt(model.parameters())

You can also load the optimizer via torch.hub.

import torch

model = YourModel()
opt = torch.hub.load('kozistr/pytorch_optimizer', 'adamp')
optimizer = opt(model.parameters())

To build an optimizer with parameters and configs, use the create_optimizer() API.

from pytorch_optimizer import create_optimizer

optimizer = create_optimizer(
    model,
    'adamp',
    lr=1e-3,
    weight_decay=1e-3,
    use_gc=True,
    use_lookahead=True,
)

Supported Optimizers

You can check the supported optimizers with the code below.

from pytorch_optimizer import get_supported_optimizers

supported_optimizers = get_supported_optimizers()

Optimizer	Description	Official Code	Paper	Citation
AdaBelief	Adapting Step-sizes by the Belief in Observed Gradients	github	https://arxiv.org/abs/2010.07468	cite
AdaBound	Adaptive Gradient Methods with Dynamic Bound of Learning Rate	github	https://openreview.net/forum?id=Bkg3g2R9FX	cite
AdaHessian	An Adaptive Second Order Optimizer for Machine Learning	github	https://arxiv.org/abs/2006.00719	cite
AdamD	Improved bias-correction in Adam		https://arxiv.org/abs/2110.10828	cite
AdamP	Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights	github	https://arxiv.org/abs/2006.08217	cite
diffGrad	An Optimization Method for Convolutional Neural Networks	github	https://arxiv.org/abs/1909.11015v3	cite
MADGRAD	A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization	github	https://arxiv.org/abs/2101.11075	cite
RAdam	On the Variance of the Adaptive Learning Rate and Beyond	github	https://arxiv.org/abs/1908.03265	cite
Ranger	a synergistic optimizer combining RAdam and LookAhead, and now GC in one optimizer	github	https://bit.ly/3zyspC3	cite
Ranger21	a synergistic deep learning optimizer	github	https://arxiv.org/abs/2106.13731	cite
Lamb	Large Batch Optimization for Deep Learning	github	https://arxiv.org/abs/1904.00962	cite
Shampoo	Preconditioned Stochastic Tensor Optimization	github	https://arxiv.org/abs/1802.09568	cite
Nero	Learning by Turning: Neural Architecture Aware Optimisation	github	https://arxiv.org/abs/2102.07227	cite
Adan	Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models	github	https://arxiv.org/abs/2208.06677	cite
Adai	Disentangling the Effects of Adaptive Learning Rate and Momentum	github	https://arxiv.org/abs/2006.15815	cite
SAM	Sharpness-Aware Minimization	github	https://arxiv.org/abs/2010.01412	cite
ASAM	Adaptive Sharpness-Aware Minimization	github	https://arxiv.org/abs/2102.11600	cite
GSAM	Surrogate Gap Guided Sharpness-Aware Minimization	github	https://openreview.net/pdf?id=edONMAnhLu-	cite
D-Adaptation	Learning-Rate-Free Learning by D-Adaptation	github	https://arxiv.org/abs/2301.07733	cite
AdaFactor	Adaptive Learning Rates with Sublinear Memory Cost	github	https://arxiv.org/abs/1804.04235	cite
Apollo	An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization	github	https://arxiv.org/abs/2009.13586	cite
NovoGrad	Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks	github	https://arxiv.org/abs/1905.11286	cite
Lion	Symbolic Discovery of Optimization Algorithms	github	https://arxiv.org/abs/2302.06675	cite
Ali-G	Adaptive Learning Rates for Interpolation with Gradients	github	https://arxiv.org/abs/1906.05661	cite
SM3	Memory-Efficient Adaptive Optimization	github	https://arxiv.org/abs/1901.11150	cite
AdaNorm	Adaptive Gradient Norm Correction based Optimizer for CNNs	github	https://arxiv.org/abs/2210.06364	cite
RotoGrad	Gradient Homogenization in Multitask Learning	github	https://openreview.net/pdf?id=T8wHz4rnuGL	cite
A2Grad	Optimal Adaptive and Accelerated Stochastic Gradient Descent	github	https://arxiv.org/abs/1810.00553	cite
AccSGD	Accelerating Stochastic Gradient Descent For Least Squares Regression	github	https://arxiv.org/abs/1704.08227	cite
SGDW	Decoupled Weight Decay Regularization	github	https://arxiv.org/abs/1711.05101	cite
ASGD	Adaptive Gradient Descent without Descent	github	https://arxiv.org/abs/1910.09529	cite
Yogi	Adaptive Methods for Nonconvex Optimization		NIPS 2018	cite
SWATS	Improving Generalization Performance by Switching from Adam to SGD		https://arxiv.org/abs/1712.07628	cite
Fromage	On the distance between two neural networks and the stability of learning	github	https://arxiv.org/abs/2002.03432	cite
MSVAG	Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients	github	https://arxiv.org/abs/1705.07774	cite
AdaMod	An Adaptive and Momental Bound Method for Stochastic Learning	github	https://arxiv.org/abs/1910.12249	cite
AggMo	Aggregated Momentum: Stability Through Passive Damping	github	https://arxiv.org/abs/1804.00325	cite
QHAdam	Quasi-hyperbolic momentum and Adam for deep learning	github	https://arxiv.org/abs/1810.06801	cite
PID	A PID Controller Approach for Stochastic Optimization of Deep Networks	github	CVPR 18	cite
Gravity	a Kinematic Approach on Optimization in Deep Learning	github	https://arxiv.org/abs/2101.09192	cite
AdaSmooth	An Adaptive Learning Rate Method based on Effective Ratio		https://arxiv.org/abs/2204.00825v1	cite
SRMM	Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates	github	https://arxiv.org/abs/2201.01652	cite
AvaGrad	Domain-independent Dominance of Adaptive Methods	github	https://arxiv.org/abs/1912.01823	cite
PCGrad	Gradient Surgery for Multi-Task Learning	github	https://arxiv.org/abs/2001.06782	cite
AMSGrad	On the Convergence of Adam and Beyond		https://openreview.net/pdf?id=ryQu7f-RZ	cite
Lookahead	k steps forward, 1 step back	github	https://arxiv.org/abs/1907.08610	cite
PNM	Manipulating Stochastic Gradient Noise to Improve Generalization	github	https://arxiv.org/abs/2103.17182	cite
GC	Gradient Centralization	github	https://arxiv.org/abs/2004.01461	cite
AGC	Adaptive Gradient Clipping	github	https://arxiv.org/abs/2102.06171	cite
Stable WD	Understanding and Scheduling Weight Decay	github	https://arxiv.org/abs/2011.11152	cite
Softplus T	Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM		https://arxiv.org/abs/1908.00700	cite
Un-tuned w/u	On the adequacy of untuned warmup for adaptive optimization		https://arxiv.org/abs/1910.04209	cite
Norm Loss	An efficient yet effective regularization method for deep neural networks		https://arxiv.org/abs/2103.06583	cite
AdaShift	Decorrelation and Convergence of Adaptive Learning Rate Methods	github	https://arxiv.org/abs/1810.00143v4	cite
AdaDelta	An Adaptive Learning Rate Method		https://arxiv.org/abs/1212.5701v1	cite
Amos	An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale	github	https://arxiv.org/abs/2210.11693	cite
SignSGD	Compressed Optimisation for Non-Convex Problems	github	https://arxiv.org/abs/1802.04434	cite
Sophia	A Scalable Stochastic Second-order Optimizer for Language Model Pre-training	github	https://arxiv.org/abs/2305.14342	cite
Prodigy	An Expeditiously Adaptive Parameter-Free Learner	github	https://arxiv.org/abs/2306.06101	cite

Supported LR Scheduler

You can check the supported learning rate schedulers with the code below.

from pytorch_optimizer import get_supported_lr_schedulers

supported_lr_schedulers = get_supported_lr_schedulers()

LR Scheduler	Description	Official Code	Paper	Citation
Explore-Exploit	Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule		https://arxiv.org/abs/2003.03977	cite
Chebyshev	Acceleration via Fractal Learning Rate Schedules		https://arxiv.org/abs/2103.01338	cite

Useful Resources

Below are several optimization ideas for regularizing and stabilizing training. Most of them are applied in the Ranger21 optimizer.

Most of the captures are also taken from the Ranger21 paper.

Adaptive Gradient Clipping	Gradient Centralization	Softplus Transformation
Gradient Normalization	Norm Loss	Positive-Negative Momentum
Linear learning rate warmup	Stable weight decay	Explore-exploit learning rate schedule
Lookahead	Chebyshev learning rate schedule	(Adaptive) Sharpness-Aware Minimization
On the Convergence of Adam and Beyond	Improved bias-correction in Adam	Adaptive Gradient Norm Correction

Adaptive Gradient Clipping

This idea was originally proposed in the NFNet (Normalizer-Free Networks) paper.

AGC (Adaptive Gradient Clipping) clips gradients based on the unit-wise ratio of gradient norms to parameter norms.

code : github
paper : arXiv
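
The unit-wise clipping rule described above can be sketched in a few lines. This is a dependency-free illustration on nested Python lists rather than the library's implementation: it assumes each row of a weight matrix is one "unit", and the names `adaptive_gradient_clip`, `clipping` and `eps` are chosen here for illustration only.

```python
import math


def unit_norm(row):
    """Euclidean norm of one unit (here: one row of a weight matrix)."""
    return math.sqrt(sum(x * x for x in row))


def adaptive_gradient_clip(weights, grads, clipping=0.01, eps=1e-3):
    """Rescale each gradient row whose norm exceeds clipping * max(||w||, eps)."""
    clipped = []
    for w_row, g_row in zip(weights, grads):
        max_norm = clipping * max(unit_norm(w_row), eps)
        g_norm = unit_norm(g_row)
        if g_norm > max_norm:
            # Shrink the gradient so its norm equals the allowed maximum.
            scale = max_norm / g_norm
            g_row = [g * scale for g in g_row]
        clipped.append(list(g_row))
    return clipped


weights = [[3.0, 4.0], [0.6, 0.8]]  # row norms: 5.0 and 1.0
grads = [[0.03, 0.04], [0.3, 0.4]]  # row norms: 0.05 and 0.5
clipped = adaptive_gradient_clip(weights, grads, clipping=0.1)
```

With these inputs the first row is left untouched (its gradient-to-weight norm ratio is below the threshold), while the second row is scaled down so that its norm is 0.1 times the weight norm.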

Gradient Centralization

Gradient Centralization (GC) operates directly on gradients by centralizing the gradient to have zero mean.

code : github
paper : arXiv
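
As a minimal sketch of the centralization step, the snippet below subtracts the mean from each row of a gradient matrix. It assumes each row holds the gradient of one output unit; the function name `centralize_gradient` is hypothetical, and plain lists stand in for tensors.

```python
def centralize_gradient(grad):
    """Subtract each row's mean so every row of the gradient sums to zero."""
    centralized = []
    for row in grad:
        mean = sum(row) / len(row)
        centralized.append([g - mean for g in row])
    return centralized


grad = [[1.0, 2.0, 3.0], [4.0, 4.0, 10.0]]
gc_grad = centralize_gradient(grad)
```

After the call, every row of `gc_grad` has zero mean, which is the property GC enforces before the optimizer update.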

Softplus Transformation

Running the final variance denominator through the softplus function lifts extremely tiny values to keep them viable.

paper : arXiv
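
To illustrate the effect, the sketch below compares the usual Adam-style denominator `sqrt(v) + eps` with a softplus-transformed one. The smoothing parameter `beta=50` and the helper names are assumptions made for this example, not values taken from the library.

```python
import math


def softplus(x, beta=50.0):
    """softplus(x) = log(1 + exp(beta * x)) / beta, computed in a stable form."""
    z = beta * x
    return (max(z, 0.0) + math.log1p(math.exp(-abs(z)))) / beta


def adam_denominators(v, eps=1e-8, beta=50.0):
    """Return (sqrt(v) + eps, softplus(sqrt(v))) for a second-moment value v."""
    root = math.sqrt(v)
    return root + eps, softplus(root, beta)


tiny_plain, tiny_soft = adam_denominators(1e-16)
large_plain, large_soft = adam_denominators(1.0)
```

For a tiny second moment, the plain denominator is on the order of 1e-8, whereas the softplus version is lifted to roughly log(2) / beta. For a large second moment the two are nearly identical.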

Gradient Normalization

Norm Loss

paper : arXiv

Positive-Negative Momentum

code : github
paper : arXiv

Linear learning rate warmup

https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/linear_lr_warmup.png

paper : arXiv

Stable weight decay

code : github
paper : arXiv

Explore-exploit learning rate schedule

https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/explore_exploit_lr_schedule.png

code : github
paper : arXiv

Lookahead

k steps forward, 1 step back. Lookahead keeps an exponential moving average of the weights that is updated and substituted for the current weights every k_{lookahead} steps (5 by default).
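
The loop below is a small sketch of that mechanism, wrapping plain SGD on a toy quadratic. The interpolation factor `alpha=0.5`, the learning rate and the function name `lookahead_sgd` are illustrative assumptions; only the default of k = 5 comes from the text above.

```python
def lookahead_sgd(grad_fn, w0, lr=0.1, k=5, alpha=0.5, steps=20):
    """Run SGD on fast weights and sync them with slow weights every k steps."""
    fast = list(w0)
    slow = list(w0)
    for step in range(1, steps + 1):
        grads = grad_fn(fast)
        fast = [w - lr * g for w, g in zip(fast, grads)]
        if step % k == 0:
            # Move the slow weights part of the way toward the fast weights,
            # then restart the fast weights from the slow ones.
            slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]
            fast = list(slow)
    return fast, slow


def quadratic_grad(w):
    """Gradient of the toy objective f(w) = sum(w_i ** 2)."""
    return [2.0 * x for x in w]


fast, slow = lookahead_sgd(quadratic_grad, [1.0, -2.0])
```

Because 20 is a multiple of k, the fast and slow weights coincide when the loop ends.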

Chebyshev learning rate schedule

Acceleration via Fractal Learning Rate Schedules.

(Adaptive) Sharpness-Aware Minimization

Sharpness-Aware Minimization (SAM) simultaneously minimizes loss value and loss sharpness.

In particular, it seeks parameters that lie in neighborhoods having uniformly low loss.
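
One SAM update can be sketched as two gradient evaluations: one at the current weights to find a nearby higher-loss point, and one at that point to compute the actual update. The code below is an illustration on a toy quadratic; the names `sam_step` and `quadratic_grad` and the values `rho=0.05` and `lr=0.1` are assumptions for this example.

```python
import math


def sam_step(grad_fn, w, lr=0.1, rho=0.05):
    """One SAM update: ascend to a nearby point, then descend with its gradient."""
    g = grad_fn(w)
    g_norm = math.sqrt(sum(x * x for x in g)) + 1e-12
    # First pass: perturb the weights toward higher loss within radius rho.
    w_adv = [x + rho * gx / g_norm for x, gx in zip(w, g)]
    # Second pass: take the gradient at the perturbed point and apply it
    # to the original (unperturbed) weights.
    g_adv = grad_fn(w_adv)
    return [x - lr * gx for x, gx in zip(w, g_adv)]


def quadratic_grad(w):
    """Gradient of the toy objective f(w) = sum(w_i ** 2)."""
    return [2.0 * x for x in w]


w_next = sam_step(quadratic_grad, [3.0, 4.0])
```

The need for two forward-backward passes per step is the main practical cost of this approach compared with a plain optimizer step.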

On the Convergence of Adam and Beyond

Convergence issues can be fixed by endowing such algorithms with ‘long-term memory’ of past gradients.
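
A common way to realize this long-term memory is to keep the running maximum of the second-moment estimate, as AMSGrad does. The sketch below tracks both denominators for scalar gradients; bias correction is omitted for brevity, and `beta2=0.9` is chosen only to make the effect visible in a few steps.

```python
import math


def amsgrad_denominators(grads, beta2=0.9, eps=1e-8):
    """Track Adam's sqrt(v_t) and AMSGrad's sqrt(max v_t) for scalar gradients."""
    v = 0.0
    v_max = 0.0
    history = []
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g
        v_max = max(v_max, v)  # the long-term memory of past gradients
        history.append((math.sqrt(v) + eps, math.sqrt(v_max) + eps))
    return history


# One large gradient followed by small ones.
history = amsgrad_denominators([10.0, 0.1, 0.1, 0.1])
adam_denoms = [a for a, _ in history]
amsgrad_denoms = [m for _, m in history]
```

Here the Adam denominator decays after the large gradient, so the effective step size grows again, while the AMSGrad denominator never decreases.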

Improved bias-correction in Adam

With the default bias-correction, Adam may actually make larger than requested gradient updates early in training.
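
The sketch below compares the size of the very first update under two bias-correction choices. It assumes, as one reading of the AdamD idea that should be checked against the paper, that bias correction is kept for the second moment only. The function name and hyperparameter values are illustrative.

```python
import math


def first_step_sizes(g, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Magnitude of the first update with full vs. second-moment-only correction."""
    m = (1.0 - beta1) * g      # first moment after one step
    v = (1.0 - beta2) * g * g  # second moment after one step
    v_hat = v / (1.0 - beta2)  # bias-corrected second moment
    m_hat = m / (1.0 - beta1)  # bias-corrected first moment (standard Adam)
    adam_step = lr * abs(m_hat) / (math.sqrt(v_hat) + eps)
    v_only_step = lr * abs(m) / (math.sqrt(v_hat) + eps)
    return adam_step, v_only_step


adam_step, v_only_step = first_step_sizes(g=0.5)
```

With full correction the first step has magnitude close to `lr` whatever the gradient scale, whereas correcting only the second moment yields a step of about `(1 - beta1) * lr`, which acts like a built-in warmup.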

Adaptive Gradient Norm Correction

Correcting the norm of gradient in each iteration based on the adaptive training history of gradient norm.
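
A minimal sketch of this correction keeps an exponential moving average of past gradient norms and scales up any gradient whose norm falls below that average. The function name `correct_gradient_norms` and the smoothing factor `gamma` are assumptions for illustration; a real optimizer would feed the corrected gradient into its moment estimates.

```python
import math


def correct_gradient_norms(grads, gamma=0.95):
    """Scale up gradients whose norm falls below the running average of norms."""
    ema_norm = 0.0
    corrected = []
    for g in grads:
        g_norm = math.sqrt(sum(x * x for x in g))
        ema_norm = gamma * ema_norm + (1.0 - gamma) * g_norm
        if 0.0 < g_norm < ema_norm:
            # Rescale so the gradient norm matches the historical average.
            scale = ema_norm / g_norm
            g = [x * scale for x in g]
        corrected.append(list(g))
    return corrected


grads = [[3.0, 4.0], [0.03, 0.04]]  # norms: 5.0, then 0.05
corrected = correct_gradient_norms(grads, gamma=0.5)
```

The first gradient is left unchanged because its norm is above the running average, while the second is rescaled to the running average of 1.275.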

Citation

Please cite the original authors of the optimization algorithms. If you use this software, please cite it as below, or get the citation from the “cite this repository” button.

@software{Kim_pytorch_optimizer_Optimizer_and_2022,
    author = {Kim, Hyeongchan},
    month = {1},
    title = {{pytorch_optimizer: optimizer and lr scheduler collections in PyTorch}},
    version = {1.0.0},
    year = {2022}
}

Author

Hyeongchan Kim / @kozistr



