Today I got curious about which optimizers the popular LLMs use.
It seems the go-to choice is still AdamW, the Adam variant that decouples weight decay from the gradient-based update (a minimal PyTorch sketch follows the list):

  • BERT (original TensorFlow code) - optimization.py defines AdamWeightDecayOptimizer
  • GPT-2 (Karpathy's minimal recreation) - train_gpt2.py creates torch.optim.AdamW
  • T5 (fine-tuning helper) - hf_model.py docstring shows passing transformers.AdamW via optimizer=
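
Here is roughly what that looks like in PyTorch. This is a minimal sketch, not code lifted from any of the repos above, and the hyperparameters (lr, betas, the 0.1 decay) are illustrative placeholders:

```python
import torch
import torch.nn as nn

# Stand-in model; the real repos build full GPT/BERT stacks.
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))

# Common pattern (train_gpt2.py does something similar): apply weight decay
# to the weight matrices, but not to biases or other 1-D parameters.
decay_params = [p for p in model.parameters() if p.requires_grad and p.dim() >= 2]
no_decay_params = [p for p in model.parameters() if p.requires_grad and p.dim() < 2]

optimizer = torch.optim.AdamW(
    [
        {"params": decay_params, "weight_decay": 0.1},
        {"params": no_decay_params, "weight_decay": 0.0},
    ],
    lr=3e-4,
    betas=(0.9, 0.95),  # illustrative values, not copied from any of the repos above
)

# Usual training-step shape:
# loss.backward(); optimizer.step(); optimizer.zero_grad(set_to_none=True)
```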

A small asterisk on T5

Google's T5 actually pre-trains with Adafactor, a memory-lean relative of Adam designed for huge models: it factors the second-moment statistics into row and column averages instead of storing one per parameter. Fine-tuning scripts (including Google's own helper above) usually flip to AdamW once the model fits on a single GPU.
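
For completeness, here is a hedged sketch of what swapping in Adafactor looks like with the implementation that ships in Hugging Face transformers. The relative-step settings below are the ones commonly suggested for T5-style training, but treat them as placeholders rather than Google's exact recipe:

```python
import torch.nn as nn
from transformers.optimization import Adafactor

# Stand-in for a T5 model; in practice this would be something like
# T5ForConditionalGeneration.from_pretrained("t5-small").
model = nn.Linear(512, 512)

optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,  # scale updates by the parameter's RMS
    relative_step=True,    # built-in 1/sqrt(step) learning-rate schedule
    warmup_init=True,      # slow ramp-up of that schedule
    lr=None,               # lr comes from the relative-step schedule, so leave it unset
)
```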

Other contenders

A few others I came across, though none has displaced AdamW as the safe default yet:

  • LAMB - layer-wise scaling so you can crank batch sizes into the tens of thousands.
  • Lion - sign-based update (sketched below); promising speedups on vision and small-scale language tasks.
  • Shampoo / Adafactor variants - preconditioning and second-moment factorization tricks aimed at even larger models.
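
Since the Lion update rule is so compact, here is a rough single-tensor sketch of it, based on my reading of the Lion paper (Chen et al., 2023) rather than any particular library's implementation; the hyperparameter defaults are placeholders:

```python
import torch

def lion_step(param, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update for a single tensor (in-place).

    The step direction is just the sign of an interpolated momentum, so Lion keeps
    one state tensor per parameter instead of Adam's two.
    """
    update = (beta1 * momentum + (1 - beta1) * grad).sign()
    param.add_(update + weight_decay * param, alpha=-lr)  # decoupled weight decay, AdamW-style
    momentum.mul_(beta2).add_(grad, alpha=1 - beta2)      # momentum refreshed after the step

# Example: one step on a toy parameter
w = torch.randn(4, 4)
g = torch.randn(4, 4)
m = torch.zeros_like(w)
lion_step(w, g, m)
```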