pytorch-optimizer
pytorch-optimizer is a collection of optimizers, learning rate schedulers, and loss functions for PyTorch.
Each algorithm is re-implemented from the original paper, with speed & memory tweaks and optional plug-ins, along with other useful and practical optimization ideas.
Currently, 75 optimizers (plus bitsandbytes and q-galore-torch integrations), 16 lr schedulers, and 13 loss functions are supported!
Highly inspired by the original pytorch-optimizer project (jettify/pytorch-optimizer).
Getting Started
For more, see the documentation.
Most optimizers are under the MIT or Apache 2.0 license, but a few, such as Fromage and Nero, are under the CC BY-NC-SA 4.0 license, which is non-commercial.
So, please double-check the license before using them in your work.
Installation
$ pip3 install pytorch-optimizer
From v2.12.0 and v3.1.0, you can use the bitsandbytes and q-galore-torch optimizers, respectively!
Please check the bnb requirements and the q-galore-torch installation guide before installing them.
From v3.0.0, Python 3.7 support is dropped. However, you can still use this package with Python 3.7 by installing it with the --ignore-requires-python option.
$ pip install "pytorch-optimizer[bitsandbytes]"
Simple Usage
from pytorch_optimizer import AdamP
model = YourModel()
optimizer = AdamP(model.parameters())
# or you can use the optimizer loader, simply passing the name of the optimizer.
from pytorch_optimizer import load_optimizer
optimizer = load_optimizer(optimizer='adamp')(model.parameters())
# if you install the `bitsandbytes` package, you can use its `8-bit` optimizers from `pytorch-optimizer`.
from pytorch_optimizer import load_optimizer
opt = load_optimizer(optimizer='bnb_adamw8bit')
optimizer = opt(model.parameters())
Also, you can load the optimizer via torch.hub.
import torch
model = YourModel()
opt = torch.hub.load('kozistr/pytorch_optimizer', 'adamp')
optimizer = opt(model.parameters())
If you want to build an optimizer with parameters & configs, there is the create_optimizer() API.
from pytorch_optimizer import create_optimizer

optimizer = create_optimizer(
    model,
    'adamp',
    lr=1e-3,
    weight_decay=1e-3,
    use_gc=True,
    use_lookahead=True,
)
Supported Optimizers
You can check the supported optimizers with the code below.
from pytorch_optimizer import get_supported_optimizers
supported_optimizers = get_supported_optimizers()

| Optimizer | Description | Official Code | Paper | Citation |
|---|---|---|---|---|
| AdaBelief | Adapting Step-sizes by the Belief in Observed Gradients | github | https://arxiv.org/abs/2010.07468 | cite |
| AdaBound | Adaptive Gradient Methods with Dynamic Bound of Learning Rate | github | https://openreview.net/forum?id=Bkg3g2R9FX | cite |
| AdaHessian | An Adaptive Second Order Optimizer for Machine Learning | github | https://arxiv.org/abs/2006.00719 | cite |
| AdamD | Improved bias-correction in Adam | | https://arxiv.org/abs/2110.10828 | cite |
| AdamP | Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights | github | https://arxiv.org/abs/2006.08217 | cite |
| diffGrad | An Optimization Method for Convolutional Neural Networks | github | https://arxiv.org/abs/1909.11015v3 | cite |
| MADGRAD | A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization | github | https://arxiv.org/abs/2101.11075 | cite |
| RAdam | On the Variance of the Adaptive Learning Rate and Beyond | github | https://arxiv.org/abs/1908.03265 | cite |
| Ranger | a synergistic optimizer combining RAdam and LookAhead, and now GC in one optimizer | github | https://bit.ly/3zyspC3 | cite |
| Ranger21 | a synergistic deep learning optimizer | github | https://arxiv.org/abs/2106.13731 | cite |
| Lamb | Large Batch Optimization for Deep Learning | github | https://arxiv.org/abs/1904.00962 | cite |
| Shampoo | Preconditioned Stochastic Tensor Optimization | github | https://arxiv.org/abs/1802.09568 | cite |
| Nero | Learning by Turning: Neural Architecture Aware Optimisation | github | https://arxiv.org/abs/2102.07227 | cite |
| Adan | Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models | github | https://arxiv.org/abs/2208.06677 | cite |
| Adai | Disentangling the Effects of Adaptive Learning Rate and Momentum | github | https://arxiv.org/abs/2006.15815 | cite |
| SAM | Sharpness-Aware Minimization | github | https://arxiv.org/abs/2010.01412 | cite |
| ASAM | Adaptive Sharpness-Aware Minimization | github | https://arxiv.org/abs/2102.11600 | cite |
| GSAM | Surrogate Gap Guided Sharpness-Aware Minimization | github | https://openreview.net/pdf?id=edONMAnhLu- | cite |
| D-Adaptation | Learning-Rate-Free Learning by D-Adaptation | github | https://arxiv.org/abs/2301.07733 | cite |
| AdaFactor | Adaptive Learning Rates with Sublinear Memory Cost | github | https://arxiv.org/abs/1804.04235 | cite |
| Apollo | An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization | github | https://arxiv.org/abs/2009.13586 | cite |
| NovoGrad | Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks | github | https://arxiv.org/abs/1905.11286 | cite |
| Lion | Symbolic Discovery of Optimization Algorithms | github | https://arxiv.org/abs/2302.06675 | cite |
| Ali-G | Adaptive Learning Rates for Interpolation with Gradients | github | https://arxiv.org/abs/1906.05661 | cite |
| SM3 | Memory-Efficient Adaptive Optimization | github | https://arxiv.org/abs/1901.11150 | cite |
| AdaNorm | Adaptive Gradient Norm Correction based Optimizer for CNNs | github | https://arxiv.org/abs/2210.06364 | cite |
| RotoGrad | Gradient Homogenization in Multitask Learning | github | https://openreview.net/pdf?id=T8wHz4rnuGL | cite |
| A2Grad | Optimal Adaptive and Accelerated Stochastic Gradient Descent | github | https://arxiv.org/abs/1810.00553 | cite |
| AccSGD | Accelerating Stochastic Gradient Descent For Least Squares Regression | github | https://arxiv.org/abs/1704.08227 | cite |
| SGDW | Decoupled Weight Decay Regularization | github | https://arxiv.org/abs/1711.05101 | cite |
| ASGD | Adaptive Gradient Descent without Descent | github | https://arxiv.org/abs/1910.09529 | cite |
| Yogi | Adaptive Methods for Nonconvex Optimization | | NIPS 2018 | cite |
| SWATS | Improving Generalization Performance by Switching from Adam to SGD | | https://arxiv.org/abs/1712.07628 | cite |
| Fromage | On the distance between two neural networks and the stability of learning | github | https://arxiv.org/abs/2002.03432 | cite |
| MSVAG | Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients | github | https://arxiv.org/abs/1705.07774 | cite |
| AdaMod | An Adaptive and Momental Bound Method for Stochastic Learning | github | https://arxiv.org/abs/1910.12249 | cite |
| AggMo | Aggregated Momentum: Stability Through Passive Damping | github | https://arxiv.org/abs/1804.00325 | cite |
| QHAdam | Quasi-hyperbolic momentum and Adam for deep learning | github | https://arxiv.org/abs/1810.06801 | cite |
| PID | A PID Controller Approach for Stochastic Optimization of Deep Networks | github | CVPR 18 | cite |
| Gravity | a Kinematic Approach on Optimization in Deep Learning | github | https://arxiv.org/abs/2101.09192 | cite |
| AdaSmooth | An Adaptive Learning Rate Method based on Effective Ratio | | https://arxiv.org/abs/2204.00825v1 | cite |
| SRMM | Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates | github | https://arxiv.org/abs/2201.01652 | cite |
| AvaGrad | Domain-independent Dominance of Adaptive Methods | github | https://arxiv.org/abs/1912.01823 | cite |
| PCGrad | Gradient Surgery for Multi-Task Learning | github | https://arxiv.org/abs/2001.06782 | cite |
| AMSGrad | On the Convergence of Adam and Beyond | | https://openreview.net/pdf?id=ryQu7f-RZ | cite |
| Lookahead | k steps forward, 1 step back | github | https://arxiv.org/abs/1907.08610 | cite |
| PNM | Manipulating Stochastic Gradient Noise to Improve Generalization | github | https://arxiv.org/abs/2103.17182 | cite |
| GC | Gradient Centralization | github | https://arxiv.org/abs/2004.01461 | cite |
| AGC | Adaptive Gradient Clipping | github | https://arxiv.org/abs/2102.06171 | cite |
| Stable WD | Understanding and Scheduling Weight Decay | github | https://arxiv.org/abs/2011.11152 | cite |
| Softplus T | Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM | | https://arxiv.org/abs/1908.00700 | cite |
| Un-tuned w/u | On the adequacy of untuned warmup for adaptive optimization | | https://arxiv.org/abs/1910.04209 | cite |
| Norm Loss | An efficient yet effective regularization method for deep neural networks | | https://arxiv.org/abs/2103.06583 | cite |
| AdaShift | Decorrelation and Convergence of Adaptive Learning Rate Methods | github | https://arxiv.org/abs/1810.00143v4 | cite |
| AdaDelta | An Adaptive Learning Rate Method | | https://arxiv.org/abs/1212.5701v1 | cite |
| Amos | An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale | github | https://arxiv.org/abs/2210.11693 | cite |
| SignSGD | Compressed Optimisation for Non-Convex Problems | github | https://arxiv.org/abs/1802.04434 | cite |
| Sophia | A Scalable Stochastic Second-order Optimizer for Language Model Pre-training | github | https://arxiv.org/abs/2305.14342 | cite |
| Prodigy | An Expeditiously Adaptive Parameter-Free Learner | github | https://arxiv.org/abs/2306.06101 | cite |
| PAdam | Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks | github | https://arxiv.org/abs/1806.06763 | cite |
| LOMO | Full Parameter Fine-tuning for Large Language Models with Limited Resources | github | https://arxiv.org/abs/2306.09782 | cite |
| AdaLOMO | Low-memory Optimization with Adaptive Learning Rate | github | https://arxiv.org/abs/2310.10195 | cite |
| Tiger | A Tight-fisted Optimizer, an optimizer that is extremely budget-conscious | github | | cite |
| CAME | Confidence-guided Adaptive Memory Efficient Optimization | github | https://aclanthology.org/2023.acl-long.243/ | cite |
| WSAM | Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term | github | https://arxiv.org/abs/2305.15817 | cite |
| Aida | A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range | github | https://arxiv.org/abs/2203.13273 | cite |
| GaLore | Memory-Efficient LLM Training by Gradient Low-Rank Projection | github | https://arxiv.org/abs/2403.03507 | cite |
| Adalite | Adalite optimizer | github | https://github.com/VatsaDev/adalite | cite |
| bSAM | SAM as an Optimal Relaxation of Bayes | github | https://arxiv.org/abs/2210.01620 | cite |
| Schedule-Free | Schedule-Free Optimizers | github | https://github.com/facebookresearch/schedule_free | cite |
| FAdam | Adam is a natural gradient optimizer using diagonal empirical Fisher information | github | https://arxiv.org/abs/2405.12807 | cite |
| Grokfast | Accelerated Grokking by Amplifying Slow Gradients | github | https://arxiv.org/abs/2405.20233 | cite |
| Kate | Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad | github | https://arxiv.org/abs/2403.02648 | cite |
| StableAdamW | Stable and low-precision training for large-scale vision-language models | | https://arxiv.org/abs/2304.13013 | cite |
| AdamMini | Use Fewer Learning Rates To Gain More | github | https://arxiv.org/abs/2406.16793 | cite |
| TRAC | Adaptive Parameter-free Optimization | github | https://arxiv.org/abs/2405.16642 | cite |
| AdamG | Towards Stability of Parameter-free Optimization | | https://arxiv.org/abs/2405.04376 | cite |
Supported LR Schedulers
You can check the supported learning rate schedulers with the code below.
from pytorch_optimizer import get_supported_lr_schedulers
supported_lr_schedulers = get_supported_lr_schedulers()

| LR Scheduler | Description | Official Code | Paper | Citation |
|---|---|---|---|---|
| Explore-Exploit | Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule | | https://arxiv.org/abs/2003.03977 | cite |
| Chebyshev | Acceleration via Fractal Learning Rate Schedules | | https://arxiv.org/abs/2103.01338 | cite |
| REX | Revisiting Budgeted Training with an Improved Schedule | github | https://arxiv.org/abs/2107.04197 | cite |
| WSD | Warmup-Stable-Decay learning rate scheduler | github | https://arxiv.org/abs/2404.06395 | cite |
Supported Loss Functions
You can check the supported loss functions with the code below.
from pytorch_optimizer import get_supported_loss_functions
supported_loss_functions = get_supported_loss_functions()

| Loss Functions | Description | Official Code | Paper | Citation |
|---|---|---|---|---|
| Label Smoothing | Rethinking the Inception Architecture for Computer Vision | | https://arxiv.org/abs/1512.00567 | cite |
| Focal | Focal Loss for Dense Object Detection | | https://arxiv.org/abs/1708.02002 | cite |
| Focal Cosine | Data-Efficient Deep Learning Method for Image Classification Using Data Augmentation, Focal Cosine Loss, and Ensemble | | https://arxiv.org/abs/2007.07805 | cite |
| LDAM | Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss | github | https://arxiv.org/abs/1906.07413 | cite |
| Jaccard (IOU) | IoU Loss for 2D/3D Object Detection | | https://arxiv.org/abs/1908.03851 | cite |
| Bi-Tempered | Robust Bi-Tempered Logistic Loss Based on Bregman Divergences | | https://arxiv.org/abs/1906.03361 | cite |
| Tversky | Tversky loss function for image segmentation using 3D fully convolutional deep networks | | https://arxiv.org/abs/1706.05721 | cite |
| Lovasz Hinge | A tractable surrogate for the optimization of the intersection-over-union measure in neural networks | github | https://arxiv.org/abs/1705.08790 | cite |
Useful Resources
Several optimization ideas to regularize & stabilize training. Most of these ideas are applied in the Ranger21 optimizer.
Also, most of the figures are taken from the Ranger21 paper.
Adaptive Gradient Clipping
Gradient Centralization
Softplus Transformation
Gradient Normalization
Norm Loss
Positive-Negative Momentum
Linear learning rate warmup
Stable weight decay
Explore-exploit learning rate schedule
Lookahead
Chebyshev learning rate schedule
(Adaptive) Sharpness-Aware Minimization
On the Convergence of Adam and Beyond
Improved bias-correction in Adam
Adaptive Gradient Norm Correction
Adaptive Gradient Clipping
This idea was originally proposed in the NFNet (Normalizer-Free Network) paper. AGC (Adaptive Gradient Clipping) clips gradients based on the unit-wise ratio of gradient norms to parameter norms.
code : github
paper : arXiv
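
As a rough illustration (not the library's exact implementation), unit-wise AGC can be sketched in plain PyTorch as follows; clip_factor and eps follow the conventions of the NFNet paper, and the function name is purely illustrative.

import torch

def agc(param: torch.Tensor, grad: torch.Tensor, clip_factor: float = 1e-2, eps: float = 1e-3) -> torch.Tensor:
    """Clip `grad` unit-wise so its norm never exceeds clip_factor * parameter norm."""
    if param.ndim > 1:
        # one norm per output unit (row / filter)
        p_norm = param.flatten(1).norm(dim=1, keepdim=True).clamp_min(eps)
        g_norm = grad.flatten(1).norm(dim=1, keepdim=True).clamp_min(1e-6)
    else:  # biases / 1-d parameters: treat each element as a unit
        p_norm = param.abs().clamp_min(eps)
        g_norm = grad.abs().clamp_min(1e-6)

    scale = (clip_factor * p_norm / g_norm).clamp(max=1.0)
    if param.ndim > 1:
        scale = scale.view(-1, *([1] * (grad.ndim - 1)))  # broadcast back over the unit's weights
    return grad * scale

Applied to p.grad for each parameter just before optimizer.step(), this leaves small gradients untouched and rescales only those whose norm exceeds clip_factor times the parameter norm.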
Gradient Centralization
Gradient Centralization (GC) operates directly on gradients by centralizing the gradient to have zero mean.
code : github
paper : arXiv
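
A minimal sketch of the idea (illustrative, not the library's code): for every parameter with more than one dimension, subtract the mean over all non-output dimensions from its gradient.

import torch

def centralize_gradient(grad: torch.Tensor) -> torch.Tensor:
    # subtract the per-unit mean so each output unit's gradient has zero mean
    if grad.ndim > 1:
        return grad - grad.mean(dim=tuple(range(1, grad.ndim)), keepdim=True)
    return grad  # 1-d parameters such as biases are left as-is

In this package, GC is typically enabled as a plug-in, as in the create_optimizer example above (use_gc=True).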
Softplus Transformation
By running the final variance denominator through the softplus function, extremely tiny values are lifted so they remain viable.
paper : arXiv
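
A small illustration of the transformation (the values are toy numbers, and beta=50 is a typical choice rather than a prescribed one):

import torch
import torch.nn.functional as F

exp_avg_sq = torch.tensor([1e-12, 1e-4, 1.0])  # toy second-moment estimates
eps = 1e-8

plain_denom = exp_avg_sq.sqrt() + eps                       # standard Adam-style denominator
softplus_denom = F.softplus(exp_avg_sq.sqrt(), beta=50.0)   # softplus-transformed denominator

# tiny values are smoothly lifted toward log(2)/beta (~0.014) so the update stays bounded,
# while larger values pass through the softplus almost unchanged
print(plain_denom, softplus_denom)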
Gradient Normalization
Norm Loss
paper : arXiv
Positive-Negative Momentum
code : github
paper : arXiv
Linear learning rate warmup
paper : arXiv
Stable weight decay
code : github
paper : arXiv
Explore-exploit learning rate schedule
code : github
paper : arXiv
Lookahead
k steps forward, 1 step back. Lookahead keeps an exponential moving average of the weights that is updated and substituted for the current weights every k lookahead steps (5 by default).
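
A minimal, self-contained sketch of the update rule (illustrative only; in this package Lookahead is available as a plug-in, e.g. use_lookahead=True in create_optimizer):

import torch

k, alpha = 5, 0.5
fast = torch.randn(10, requires_grad=True)   # weights updated by the inner ("fast") optimizer
slow = fast.detach().clone()                 # slow weights: an EMA-like copy
inner = torch.optim.SGD([fast], lr=0.1)

for step in range(1, 101):
    loss = (fast ** 2).sum()                 # toy objective
    inner.zero_grad()
    loss.backward()
    inner.step()                             # k fast steps forward ...
    if step % k == 0:                        # ... then 1 slow step back
        slow += alpha * (fast.detach() - slow)
        with torch.no_grad():
            fast.copy_(slow)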
Chebyshev learning rate schedule
Acceleration via Fractal Learning Rate Schedules.
(Adaptive) Sharpness-Aware Minimization
Sharpness-Aware Minimization (SAM) simultaneously minimizes loss value and loss sharpness.
In particular, it seeks parameters that lie in neighborhoods having uniformly low loss.
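
Because SAM needs two forward/backward passes per update, it cannot be used as a drop-in optimizer.step(). Below is a sketch of the usual training loop; the constructor arguments and the first_step/second_step names follow the common SAM interface and should be checked against this package's documentation.

import torch
from pytorch_optimizer import SAM  # interface assumed from common SAM implementations; see the docs

model = YourModel()
criterion = torch.nn.CrossEntropyLoss()
# rho controls the size of the neighborhood whose worst-case loss is minimized
optimizer = SAM(model.parameters(), base_optimizer=torch.optim.SGD, lr=0.1, momentum=0.9, rho=0.05)

for x, y in loader:
    # 1st pass: gradients at the current weights, then climb to the worst-case point nearby
    criterion(model(x), y).backward()
    optimizer.first_step(zero_grad=True)

    # 2nd pass: gradients at the perturbed weights, then take the actual update
    criterion(model(x), y).backward()
    optimizer.second_step(zero_grad=True)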
On the Convergence of Adam and Beyond
Convergence issues can be fixed by endowing such algorithms with 'long-term memory' of past gradients.
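
The 'long-term memory' amounts to one extra buffer: the element-wise running maximum of the second-moment estimate is used in the denominator, so the effective step size can only shrink. A toy illustration (PyTorch's built-in Adam exposes the same fix via amsgrad=True):

import torch

exp_avg_sq = torch.tensor([0.04, 0.01])        # current second-moment estimate v_t
max_exp_avg_sq = torch.tensor([0.09, 0.005])   # running max of all previous v_t
max_exp_avg_sq = torch.maximum(max_exp_avg_sq, exp_avg_sq)
denom = max_exp_avg_sq.sqrt() + 1e-8           # AMSGrad divides by the max, not by v_t itself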
Improved bias-correction in Adam
With the default bias-correction, Adam may actually make larger than requested gradient updates early in training.
Adaptive Gradient Norm Correction
Correcting the norm of a gradient in each iteration based on the adaptive training history of gradient norm.
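
Very roughly, and leaving the exact update rule to the paper and to the AdaNorm implementation in this package: keep an exponential moving average of gradient norms and lift gradients whose norm falls below that history.

import torch

# illustrative only: an EMA of gradient norms drives the norm correction
gamma, eps = 0.95, 1e-8
ema_norm = 0.0
for grad in [torch.randn(10) * s for s in (1.0, 0.1, 0.01)]:  # toy gradient stream
    g_norm = grad.norm().item()
    ema_norm = gamma * ema_norm + (1.0 - gamma) * g_norm
    if g_norm < ema_norm:
        grad = grad * (ema_norm / (g_norm + eps))  # norm-corrected gradient used by the update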
Frequently asked questions
here
Visualization
here
Citation
Please cite the original authors of the optimization algorithms; you can easily find them in the tables above!
If you use this software, please cite it as below, or use the "Cite this repository" button.
@software{Kim_pytorch_optimizer_optimizer_2021,
    author = {Kim, Hyeongchan},
    month = jan,
    title = {{pytorch_optimizer: optimizer & lr scheduler & loss function collections in PyTorch}},
    url = {https://github.com/kozistr/pytorch_optimizer},
    version = {3.1.0},
    year = {2021}
}
Maintainer
Hyeongchan Kim / @kozistr