Most LLMs Still Lean on AdamW
Today I got curious about which optimizers the popular LLMs actually use.
It seems the go-to choice is still AdamW, the Adam variant that cleanly decouples weight decay from the gradient-based momentum update (a quick sketch follows the list):
- BERT (original TensorFlow code) - `optimization.py` defines `AdamWeightDecayOptimizer`
- GPT-2 (Karpathy's minimal recreation) - `train_gpt2.py` creates `torch.optim.AdamW`
- T5 (fine-tuning helper) - the `hf_model.py` docstring shows passing `transformers.AdamW` via `optimizer=`
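
To make the "decoupled" part concrete, here's a minimal hand-rolled sketch of a single AdamW step. This is my own illustration, not code from any of the repos above, and the hyperparameters are just placeholder defaults:

```python
import torch

def adamw_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # The Adam moment estimates see only the raw gradient...
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # ...while weight decay shrinks the parameter directly, outside the
    # moment updates. That separation is the "decoupled" part of AdamW.
    p = p - lr * (m_hat / (v_hat.sqrt() + eps)) - lr * weight_decay * p
    return p, m, v

# Toy usage on a random parameter vector.
p, m, v = torch.randn(4), torch.zeros(4), torch.zeros(4)
grad = torch.randn(4)
p, m, v = adamw_step(p, grad, m, v, t=1)
```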
A small asterisk on T5
Google's T5 actually pre-trains with Adafactor, a memory-lean cousin of Adam designed for huge models. Fine-tuning scripts (including Google's own helper above) usually flip to AdamW once the model fits on a single GPU.
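
For the curious, here's a rough sketch of that split using the Adafactor implementation that ships with Hugging Face transformers. The model name and hyperparameters are illustrative picks of mine, not what Google's scripts actually use:

```python
import torch
from transformers import Adafactor, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Pre-training style: Adafactor with a fixed learning rate. Its factored
# second-moment estimate keeps optimizer state small for huge models.
pretrain_opt = Adafactor(
    model.parameters(),
    lr=1e-3,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)

# Fine-tuning style: plain AdamW once memory is no longer the bottleneck.
finetune_opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
```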
Other contenders
Some others I came across, though none has displaced AdamW as the safe default yet:
- LAMB - layer-wise scaling so you can crank batch sizes into the tens of thousands.
- Lion - sign-based update (sketched after this list); promising speedups on vision and small-scale language tasks.
- Shampoo / Adafactor variants - second-order tricks aimed at even larger models.
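
Since Lion's whole pitch is the sign-based step, here's a tiny sketch of its update rule as I understand it from the paper. The hyperparameters are illustrative and this is not a production implementation:

```python
import torch

def lion_step(p, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.01):
    # Interpolate the gradient with the momentum buffer, then keep only the sign,
    # so every coordinate moves by the same magnitude lr.
    update = torch.sign(beta1 * m + (1 - beta1) * grad)
    # Weight decay is decoupled, as in AdamW.
    p = p - lr * (update + weight_decay * p)
    # The momentum buffer is updated with a separate beta.
    m = beta2 * m + (1 - beta2) * grad
    return p, m

# Toy usage on a random parameter vector.
p, m = torch.randn(4), torch.zeros(4)
grad = torch.randn(4)
p, m = lion_step(p, grad, m)
```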