Most LLMs Still Lean on AdamW
Today I got curious about which optimizers the popular LLMs actually use.
The go-to choice still seems to be AdamW, the Adam variant that decouples weight decay from the adaptive gradient update (there's a quick setup sketch after the list):
- BERT (original TensorFlow code) - `optimization.py` defines `AdamWeightDecayOptimizer`
- GPT-2 (Karpathy's minimal recreation) - `train_gpt2.py` creates `torch.optim.AdamW`
- T5 (fine-tuning helper) - the `hf_model.py` docstring shows passing `transformers.AdamW` via `optimizer=`
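To make the "decoupled weight decay" point concrete, here is a minimal sketch of the PyTorch pattern those repos follow. The toy model and hyperparameters are purely illustrative, not copied from any of them.

```python
import torch
from torch import nn

# Toy stand-in for a transformer block; any nn.Module works the same way.
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))

# Decoupled weight decay: AdamW shrinks each parameter by lr * weight_decay
# directly, instead of folding an L2 term into the gradient like classic Adam.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,              # illustrative values, not taken from any repo
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

# Standard training step.
loss = model(torch.randn(8, 768)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In practice many of these codebases also split parameters into decay and no-decay groups (biases and LayerNorm weights usually get `weight_decay=0`), but the basic call looks like this.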
A small asterisk on T5
Google's T5 actually pre-trains with Adafactor, a memory-lean cousin of Adam designed for huge models. Fine-tuning scripts (including Google's own helper above) usually flip to AdamW once the model fits on a single GPU.
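For comparison, here is a hedged sketch of the pre-training side using the `Adafactor` class that ships with Hugging Face `transformers`. The flag values follow the settings commonly suggested for T5-style training, but treat them as an assumption rather than Google's exact recipe.

```python
import torch
from torch import nn
from transformers import Adafactor

# Same kind of toy stand-in as before; any nn.Module works.
model = nn.Linear(768, 768)

# Adafactor keeps factored (row/column) second-moment statistics instead of a
# full per-parameter tensor, which is what makes it memory-lean at T5 scale.
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,   # scale updates by the parameter's RMS
    relative_step=True,     # use Adafactor's built-in step-size schedule
    warmup_init=True,       # only valid together with relative_step=True
    lr=None,                # lr must stay None when relative_step=True
)

loss = model(torch.randn(4, 768)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```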
Other contenders
A few others came up in my reading, though none has displaced AdamW as the safe default yet:
- LAMB - layer-wise scaling so you can crank batch sizes into the tens of thousands.
- Lion - sign-based update; promising speedups on vision and small-scale language tasks (rough sketch of the update rule after this list).
- Shampoo / Adafactor variants - second-order-style preconditioning and factored-statistics tricks aimed at even larger models.
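The Lion update is short enough to write out, so here is a rough sketch of the published rule. The function name and hyperparameters are mine, and a real implementation would live inside a `torch.optim.Optimizer` subclass rather than a free function.

```python
import torch

def lion_update(param, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.1):
    """One Lion step on a single tensor, applied in place."""
    # Interpolate momentum and gradient, then keep only the sign of the result.
    update = (beta1 * momentum + (1 - beta1) * grad).sign()
    # Decoupled weight decay, same spirit as AdamW.
    param.mul_(1 - lr * wd).sub_(lr * update)
    # Momentum is refreshed with a different interpolation coefficient (beta2).
    momentum.mul_(beta2).add_(grad, alpha=1 - beta2)
    return param, momentum
```

Part of Lion's appeal at scale is that it tracks a single momentum buffer per parameter, versus the two state tensors AdamW keeps.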