> [!info] Course code
> Use these repo paths together with this note:
> - [picollm/accelerated/pretrain/train.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/pretrain/train.py)
> - [picollm/accelerated/optim.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/optim.py)

## What This Concept Is

When a run becomes unstable, slow to improve, or strangely sensitive to small changes, the optimizer settings are often part of the story. This note explains those settings as control over optimization behavior rather than as a bag of ritual knobs. In short, it is about why the update rule behaves the way it does.

## Foundation Terms You Need First

An **optimizer** is the algorithm that turns gradients into parameter updates. The **learning rate** controls step size. **Weight decay** is a regularization term that discourages overly large weights. **[[Glossary#Warmup|Warmup]]** is the gradual ramp into the target learning rate at the beginning of training. These terms are worth keeping separate because they answer different questions: how fast to move, how strongly to regularize, and how gently to enter training.

## Why Adam became standard

[[Glossary#Adam|Adam]] became widely used because it adapts updates per parameter using moving averages of gradients and squared gradients. This makes it robust across parameters with different scales and often easier to tune than plain SGD in deep settings.[^2] For large Transformer models, that practical robustness helped make Adam-style optimization a default choice.

## Why AdamW matters

[[Glossary#AdamW|AdamW]] matters because it decouples weight decay from the adaptive gradient update. That distinction is not cosmetic. In the older formulation, the decay term was folded into the gradient, so it passed through Adam's per-parameter adaptive scaling and did not behave like clean L2 regularization.
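To make the decoupling concrete, here is a minimal sketch (assuming PyTorch; the model and hyperparameters are placeholders, not `picollm`'s actual settings):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)  # stand-in for a real Transformer block

# Adam with weight_decay folds the decay term into the gradient,
# so it passes through the adaptive second-moment scaling.
coupled = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=0.1)

# AdamW applies decay directly to the weights, separate from the
# adaptive step, so regularization strength no longer depends on
# per-parameter gradient statistics.
decoupled = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
```

The two constructors take the same arguments; only where the decay enters the update rule differs.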
Decoupled weight decay made the regularization effect more interpretable and more stable, which is why AdamW became a common standard in Transformer training.[^3]

## Warmup and why large models need it

Warmup exists because the earliest phase of training is fragile. Parameters are near initialization, gradient scales can be poorly behaved, and large updates at the beginning can destabilize the run. Instead of jumping immediately to the full learning rate, warmup ramps it up gradually. This gives the optimizer time to enter a more stable regime before the aggressive part of training begins.[^4] Do not memorize warmup as a ritual; understand it as protection against early instability.

## Why cosine decay is common

After warmup, many training recipes use cosine decay because it reduces the learning rate smoothly over time. The intuition is that early training benefits from larger exploratory steps, while later training often benefits from smaller refinements. Cosine schedules are not the only choice, but they are popular because they give a smooth transition from aggressive to conservative updates without abrupt jumps.[^5]

## Gradient noise scale and large-batch intuition

The gradient computed from a finite batch is only an estimate of the full-data gradient, and that estimate contains noise. The relationship between [[Glossary#Batch size|batch size]] and gradient noise helps explain why very small batches can be unstable and why very large batches change optimization behavior. You do not need a full derivation on a first pass, but you should understand that batch size affects not only memory and [[Glossary#Throughput|throughput]] but also the character of optimization itself.[^6] This is why "just increase batch size" is not only a hardware question: it changes the stochastic regime the optimizer experiences.
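The warmup-then-cosine schedule described above can be sketched directly. This is an illustrative version (assuming PyTorch's `LambdaLR`; the step counts are invented for the example, not taken from `picollm`):

```python
import math
import torch

def lr_lambda(step, warmup_steps=100, total_steps=1000):
    """Linear warmup to the base LR, then cosine decay toward zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # ramp: 0 -> 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine: 1 -> 0

params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.AdamW(params, lr=3e-4)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
```

The multiplier rises linearly during warmup, hits 1.0 at the end of warmup, and then falls smoothly to zero: no abrupt jumps at either transition.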
## Gradient clipping and precision stability

Large language model training often uses gradient clipping because rare gradient spikes can destabilize updates. This interacts with mixed precision, [[Glossary#Loss|loss]] scaling, and normalization behavior. In other words, clipping is not only protection against bad luck; it is part of the numerical-stability stack. You should learn to see:

- learning rate
- clipping
- precision mode
- normalization placement

as a coupled system rather than four unrelated toggles.

## Instability in deep Transformers

Transformers are powerful but not automatically easy to optimize at scale. Instability can come from large learning rates, poor normalization placement, residual-path amplification, numerical precision issues, and unbalanced initialization. This is one reason modern architectures and training recipes pay so much attention to normalization, residual scaling, and carefully tuned schedules.[^7]

## Normalization and residual scaling

[[Glossary#Layer normalization|Layer normalization]] helps stabilize deep networks by controlling activation statistics, but it is only part of the story. Residual pathways also matter because they allow information to pass through many layers. If those residual contributions are badly scaled, the network can become unstable or hard to optimize. This is why deep-Transformer design involves more than just choosing an optimizer: architecture and optimization are coupled.[^7]

## Optimizer-state memory as a systems constraint

At the advanced level, optimizer theory also becomes systems theory. Adam-style optimizers maintain extra state, and that state consumes memory. This matters because optimizer choice affects not only convergence behavior but also whether the run fits on the hardware at all. You should understand that optimizer selection changes both the training dynamics and the memory budget.
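A back-of-the-envelope sketch of that memory budget (assuming fp32 optimizer state, which is the common default; the 1B parameter count is a made-up example, not `picollm`'s size):

```python
def adamw_state_bytes(n_params: int, state_dtype_bytes: int = 4) -> int:
    """AdamW keeps two extra tensors per parameter: the first-moment
    (exp_avg) and second-moment (exp_avg_sq) running averages."""
    return 2 * state_dtype_bytes * n_params

n = 1_000_000_000                    # a hypothetical 1B-parameter model
weights_bf16 = 2 * n                 # 2 bytes/param for bf16 weights
optimizer = adamw_state_bytes(n)     # 8 bytes/param of fp32 Adam state

print(f"weights:   {weights_bf16 / 1e9:.1f} GB")
print(f"optimizer: {optimizer / 1e9:.1f} GB")
```

With bf16 weights and fp32 Adam state, the optimizer alone costs four times the weight memory, which is why optimizer choice can decide whether a run fits on the hardware at all.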
## Why this matters for `picollm`

The `picollm` scripts expose practical optimizer knobs, but you should learn to see them as theory-backed decisions. Learning rate, warmup, weight decay, and scheduler shape are not just config lines. They are the local expression of a much deeper optimization story.

<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
<div><a class="internal-link" data-href="Scaling Laws and Compute-Optimal Training" href="Scaling%20Laws%20and%20Compute-Optimal%20Training">Scaling Laws and Compute-Optimal Training</a></div>
</div>
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
<div><a class="internal-link" data-href="Advanced Distributed Training Systems" href="Advanced%20Distributed%20Training%20Systems">Advanced Distributed Training Systems</a></div>
</div>
</div>

## References

[^1]: Sebastian Raschka, [LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch)
[^2]: Diederik Kingma and Jimmy Ba, [Adam: A Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980)
[^3]: Ilya Loshchilov and Frank Hutter, [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101)
[^4]: Hugging Face, [Trainer documentation](https://huggingface.co/docs/transformers/main_classes/trainer)
[^5]: PyTorch, [Learning rate scheduler documentation](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate)
[^6]: Samuel Smith and Quoc Le, [A Bayesian Perspective on Generalization and Stochastic Gradient Descent](https://arxiv.org/abs/1710.06451)
[^7]: Hongyu Wang et al., [DeepNet: Scaling Transformers to 1,000 Layers](https://arxiv.org/abs/2203.00555)