> [!info] Course code
> Use the companion repository for runnable notebooks, figures, and implementation references for this lecture:
> - Theory notebook: [notebooks/layer_normalization/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/layer_normalization/lecture_walkthrough.ipynb)
> - Serious model anchor: [picollm/accelerated/gpt.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/gpt.py)

## What This Concept Is

Deep models can become numerically messy very quickly. Activations can drift in scale, some features can dominate others, and stacked residual blocks can become harder to optimize. Layer normalization is one of the main tools that keeps those internal values in a healthier range.

So this note is about stability, not meaning. LayerNorm does not teach the model a concept by itself. It helps the rest of the model train and stack more reliably.

## Foundation Terms You Need First

Take one token representation and look across its feature dimension. LayerNorm computes simple statistics over that feature vector, recenters it, rescales it, and then applies learned scale and shift parameters so the model still has flexibility. The important objects are the **mean**, the **variance**, and the token's **feature dimension**.

As you read, keep the scope clear: LayerNorm usually works across features inside one token representation, not across different tokens in the sequence. It is a local normalization step applied repeatedly throughout the stack.

<video src="https://assets.montek.dev/lectures/media/llm/concepts/Layer%20Normalization/01_layernorm_forward_geometry.mp4" controls></video>

## How this lecture maps to picoLLM

This note introduces canonical LayerNorm because you need the standard normalization story first. But the serious picoLLM stack is intentionally not a literal LayerNorm-only replica. In `picollm/accelerated/gpt.py`, you will encounter RMSNorm-style choices in the actual runtime stack.

That is the intended reading ladder:

- learn the standard normalization mechanics here
- inspect the normalization choice that picoLLM actually uses
- use `rasbt` as the clean concept-first external reference
- use `nanochat` as the systems-first comparison reference

This note covers three connected ideas:

- the mathematics of layer normalization: the forward pass, the role of epsilon, the geometric meaning of mean subtraction and variance scaling, and the structure of the backward pass[^1]
- why normalization placement matters so much in Transformers: Post-LN versus Pre-LN, [[Glossary#Warmup|warmup]] sensitivity, residual-gradient flow, and why modern large models often prefer pre-normalization[^2]
- the architecture and systems choices around normalization: RMSNorm and other variants, mixed-precision stability, fused kernels, residual scaling methods, and practical debugging patterns[^3]

## What layer normalization is

### The basic forward pass

For a token representation $x \in \mathbb{R}^d$, layer normalization computes:

- the feature mean $\mu$
- the feature variance $\sigma^2$
- the normalized vector $(x - \mu) / \sqrt{\sigma^2 + \epsilon}$
- then applies learned gain and bias

The canonical formula is:

$$
\operatorname{LN}(x) = \gamma \left(\frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}\right) + \beta
$$

where $\mu$ and $\sigma^2$ are computed across the feature dimension of that token.[^1] This is the original definition introduced by Ba, Kiros, and Hinton.
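To make this concrete, here is a minimal sketch of the same computation in PyTorch, checking that a hand-written per-token normalization matches `torch.nn.LayerNorm`. The shapes, seed, and tolerance below are illustrative choices, not taken from the companion notebook:

```python
import torch

torch.manual_seed(0)

# Two sequences of 4 tokens each, hidden size 8: shape (batch, seq, features).
x = torch.randn(2, 4, 8)
eps = 1e-5

# Per-token statistics over the feature (last) dimension.
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)  # biased variance, as LayerNorm uses

# Learned affine parameters; LayerNorm initializes them to gamma = 1, beta = 0.
gamma = torch.ones(8)
beta = torch.zeros(8)

manual = gamma * (x - mu) / torch.sqrt(var + eps) + beta

# Compare against the built-in module.
reference = torch.nn.LayerNorm(8, eps=eps)(x)
print(torch.allclose(manual, reference, atol=1e-6))  # True
```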
In Transformer usage, the normalized axis is typically the hidden dimension, so each token is normalized independently from every other token.

### Why LN is natural for Transformers

LayerNorm fits Transformers well for two simple reasons:

- it is batch-size independent
- it behaves identically at train and test time

That makes it much easier to use in variable-length sequence settings and autoregressive decoding than BatchNorm, which depends on batch statistics and running averages.[^1]

This is a good place to make the contrast explicit:

> BatchNorm normalizes across examples. LayerNorm normalizes within each example.

> [!question] Quick check
> Why is BatchNorm awkward for sequence models with variable lengths and tiny decode-time batches?
>> [!answer] It depends on batch statistics, while autoregressive decoding often has small or irregular batches and needs consistent behavior at train and test time.

## The geometry of LN

### Mean subtraction removes the all-ones direction

The first geometric effect of LN is mean subtraction. When you subtract the feature mean, you remove the component along the all-ones direction. That means LN projects away a global offset shared across all features of a token.

This is not merely a numerical trick. It changes what directions in representation space survive into the next layer.

### Variance scaling removes radial freedom

The second effect is normalization by standard deviation. That rescales the vector to a controlled magnitude, reducing sensitivity to overall feature scale.

So LN does two important things at once:

- removes a mean direction
- controls vector scale

That is why LN is better understood as a geometric transformation of the [[Glossary#Residual stream|residual stream]], not merely as a training hack.

### Gain and bias restore flexibility

A common worry is that normalization might destroy too much information. The learned affine parameters $\gamma$ and $\beta$ are the answer. They let the model reintroduce scale and shift per feature after normalization.[^1]

This is one of the most important lines to say clearly:

LN constrains representation geometry, but it does not freeze it. The model still learns how much each feature should ultimately matter.

> [!example] Code for this section
> - Notebook: [notebooks/layer_normalization/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/layer_normalization/lecture_walkthrough.ipynb)

> [!tip] TensorTonic follow-up
> - [TensorTonic: Transformers Layer Normalization](https://www.tensortonic.com/research/transformer/transformers-layer-normalization)
> Use it here to practice the per-token normalization mechanics immediately after this section.

## Backward-pass intuition

### LN couples features

Unlike an elementwise activation, LN has a dense Jacobian along the normalized dimension because every feature contributes to the token's mean and variance. That means the gradient of one output coordinate depends on many input coordinates.

This is why LN is not "featurewise independent" even though it is applied per token.
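To see the coupling directly, here is a small sketch (PyTorch assumed; the feature dimension and input are arbitrary) that computes the Jacobian of LayerNorm for a single token and checks that it is dense rather than diagonal:

```python
import torch

d = 6
ln = torch.nn.LayerNorm(d)
x = torch.randn(d)

# Jacobian of the LN output with respect to the input of one token.
jac = torch.autograd.functional.jacobian(ln, x)

print(jac.shape)  # torch.Size([6, 6])
# Off-diagonal entries are generally nonzero: every input feature
# influences every output feature through the shared mean and variance.
print((jac.abs() > 1e-8).float().mean().item())  # typically 1.0
```

An elementwise activation like GELU would give a diagonal Jacobian here; this dense structure is what the next subsection summarizes in vector form.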
### The useful compact gradient story

For lecture purposes, the most important points are:

- gradients are corrected by subtracting their mean
- gradients are corrected again by subtracting the component aligned with the normalized activations
- the whole result is rescaled by the inverse standard deviation

In vector form, writing $\hat{x}$ for the normalized activations and $g = \gamma \odot \frac{\partial L}{\partial y}$ for the gradient arriving at the normalization, this reads:

$$
\frac{\partial L}{\partial x} = \frac{1}{\sqrt{\sigma^2 + \epsilon}}\left(g - \operatorname{mean}(g) - \hat{x}\,\operatorname{mean}(g \odot \hat{x})\right)
$$

with the means taken over the feature dimension.

That reveals the same geometry as the forward pass:

- remove a uniform component
- remove an over-dominant radial component
- stabilize the resulting scale

You do not need to memorize every derivative term. You do need to understand that LN redistributes gradients across features in a structured way.

> [!question] Quick check
> Why is the compact vector-form gradient for LayerNorm usually the right first formula to study?
>> [!answer] It shows the coupled gradient structure clearly without burying the main idea under a full Jacobian tensor.

## Why epsilon matters

### Epsilon is not decorative

The $\epsilon$ term prevents division by zero when variance becomes very small. But it also does more than that: it bounds how strongly LN can amplify noise when the variance of a token collapses.

This means epsilon is both:

- a numerical safeguard
- a gradient-scale safeguard

### Epsilon becomes more important in low precision

In fp16, tiny eps values can become unreliable, and unstable variance estimates can cause overflow or collapse. This is one reason many production implementations:

- accumulate statistics in fp32
- use carefully chosen $\epsilon$ defaults
- rely on fused kernels rather than naïve custom code[^4]

You should come away understanding that epsilon is a real hyperparameter, not a ceremonial constant.

<video src="https://assets.montek.dev/lectures/media/llm/concepts/Layer%20Normalization/02_epsilon_and_precision.mp4" controls></video>

## LayerNorm placement in Transformers

### Post-LN in the original Transformer

The original Transformer uses:

$$
\operatorname{LayerNorm}(x + \operatorname{Sublayer}(x))
$$

This is the classic Post-LN layout.[^5] It is historically important because it is what the original architecture did, but it also became famous for being harder to train stably in very deep settings without learning-rate warmup.

### Pre-LN as the stability-favoring alternative

Pre-LN changes the order:

$$
x + \operatorname{Sublayer}(\operatorname{LayerNorm}(x))
$$

This creates a cleaner identity path through the [[Glossary#Residual connection|residual connection]], because the skip branch bypasses normalization entirely.[^2]

That is the core intuition behind why Pre-LN often stabilizes training:

- the residual stream has a direct gradient path
- normalization still regularizes the sublayer input
- deep stacks become easier to optimize

### Why warmup often matters for Post-LN

Mean-field analyses of Transformers argue that Post-LN produces larger expected gradients near the upper layers at initialization, which makes large early learning rates dangerous. That helps explain why Post-LN often needs warmup for stable training.[^2]

This is one of the cleanest examples in the Transformer literature of architecture directly shaping optimizer behavior.

> [!question] Quick check
> Which layout gives a cleaner residual gradient highway, Pre-LN or Post-LN?
>> [!answer] Pre-LN, because the skip path bypasses normalization and stays closer to an identity map.
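To keep the two layouts straight, here is a minimal sketch of one residual block written both ways (PyTorch assumed; the `sublayer` here is a placeholder MLP, not the attention or MLP blocks from picoLLM):

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Original Transformer order: add the residual, then normalize."""
    def __init__(self, d):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.sublayer = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    """Pre-LN order: normalize the sublayer input; the skip path stays untouched."""
    def __init__(self, d):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.sublayer = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

x = torch.randn(2, 4, 32)
print(PostLNBlock(32)(x).shape, PreLNBlock(32)(x).shape)  # both torch.Size([2, 4, 32])
```

The only difference is where `self.norm` sits relative to the residual addition, and that is exactly what determines whether the skip path carries gradients through a normalization or around it.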
### Pre-LN is not the final word

Even though Pre-LN became the dominant practical choice for many large language models, normalization layout is still an active design space. Newer proposals like DeepNorm, LayerScale-style residual scaling, and Peri-LN-like variants show that people are still trying to recover the best of both stability and performance.[^6][^7][^8]

It is worth stating this explicitly:

Normalization placement is still a live architecture question, not a fully solved one.

## Alternatives to standard LayerNorm

### RMSNorm

RMSNorm removes mean subtraction and normalizes only by the root mean square of the features.[^3] That means it keeps scale normalization while discarding explicit recentering.

The empirical message from modern LLM practice is that this often works very well and can be simpler and faster.

This is a good discussion point:

If RMSNorm works well, perhaps scale control is doing more of the practical work than mean subtraction in many models.

### ScaleNorm

ScaleNorm is another simplified alternative that normalizes by vector norm and uses a single learned scale parameter.[^9] It is less standard than RMSNorm in modern LLMs, but it is still useful here because it isolates the idea that scale control alone can stabilize training in some settings.

### LayerScale and residual scaling ideas

LayerScale is not a normalization method in the same sense as LN or RMSNorm. Instead, it learns small residual-branch scaling factors to stabilize deep networks.[^7]

This is useful because it widens the perspective:

There are multiple ways to control activation and gradient scale propagation. Normalization is one family; residual scaling is another.

## Numerical stability and mixed precision

### Stable variance computation matters

A naïve one-pass variance formula can suffer from catastrophic cancellation when values are large and variance is small, especially in low precision. A more stable computation uses either:

- a two-pass mean-then-variance calculation
- or a numerically stable online reduction strategy

This matters much more than many people initially expect, because LN sits in every block and any small error can accumulate over depth.

### fp16 versus bf16

In practice:

- fp16 has a narrower exponent range and overflows more easily
- [[Glossary#BF16|bf16]] has a wider exponent range and is often more robust for reductions

That is why production systems commonly compute LN statistics in fp32 even when the surrounding activations are bf16 or fp16.[^4]

This is an excellent systems lesson:

Numerically small operations are not necessarily numerically easy operations.

### Fused kernels

LN is memory-bandwidth bound and reduction-heavy, so fused kernels matter. Apex, Megatron-Core, and Transformer Engine all provide fused LayerNorm or LayerNorm-plus-adjacent-op kernels to reduce memory traffic and kernel-launch overhead.[^10][^11]

This is a good architecture-to-systems bridge:

The math is simple. The optimal implementation is not.

<video src="https://assets.montek.dev/lectures/media/llm/concepts/Layer%20Normalization/03_preln_vs_postln.mp4" controls></video>

## Practical debugging patterns

### Common LN bugs

The most common failure modes are:

- normalizing over the wrong axis
- doing LN entirely in fp16 without stable accumulation
- choosing an epsilon that is too small for the precision regime
- blaming LN for what is actually an attention-mask bug

You should know that LN itself does not use attention masks. It normalizes each token independently. If padding tokens are corrupting other tokens, the likely problem is the attention mask, not LN.
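As a quick way to check that last claim yourself, here is a small sketch (PyTorch assumed; the shapes and the perturbation are arbitrary) showing that LayerNorm cannot leak one token's values into another token's output:

```python
import torch

torch.manual_seed(0)
ln = torch.nn.LayerNorm(16)

x = torch.randn(1, 5, 16)      # 1 sequence, 5 tokens, hidden size 16
y = ln(x)

x_perturbed = x.clone()
x_perturbed[0, 2] += 100.0     # drastically change token 2 only
y_perturbed = ln(x_perturbed)

# Only token 2's normalized output changes; the other positions are identical,
# so any cross-token corruption must come from attention (and its mask), not LN.
changed = ~torch.isclose(y, y_perturbed).all(dim=-1)
print(changed)  # tensor([[False, False,  True, False, False]])
```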
### What to inspect first

If training looks unstable, check:

- hidden-state norms across depth
- per-layer gradient norms
- activation histograms before and after LN
- whether train/eval behavior differs unexpectedly
- dtype and epsilon settings in fused or exported kernels

This is the fastest way to distinguish architecture problems from pure implementation bugs.

> [!example] Notebook walkthroughs in this lecture
>
> If you want to study this note in code, use these notebook sections. If the notebook viewer ignores the URL fragment, search for the exact heading text in the notebook:
>
> - [`Manual LayerNorm matches PyTorch LayerNorm`](https://github.com/Montekkundan/llm/blob/main/notebooks/layer_normalization/lecture_walkthrough.ipynb#manual-layernorm-matches-pytorch-layernorm)
> - [`LayerNorm zero-centers and rescales each token`](https://github.com/Montekkundan/llm/blob/main/notebooks/layer_normalization/lecture_walkthrough.ipynb#layernorm-zero-centers-and-rescales-each-token)
> - [`Epsilon on nearly constant inputs`](https://github.com/Montekkundan/llm/blob/main/notebooks/layer_normalization/lecture_walkthrough.ipynb#epsilon-on-nearly-constant-inputs)
> - [`Pre-LN and Post-LN residual paths`](https://github.com/Montekkundan/llm/blob/main/notebooks/layer_normalization/lecture_walkthrough.ipynb#pre-ln-and-post-ln-residual-paths)
> - [`RMSNorm without mean subtraction`](https://github.com/Montekkundan/llm/blob/main/notebooks/layer_normalization/lecture_walkthrough.ipynb#rmsnorm-without-mean-subtraction)

> [!tip] TensorTonic practice for this lecture
>
> If you want to practice this lecture in a more implementation-focused format, work through these TensorTonic exercises:
>
> - [TensorTonic: Transformers Layer Normalization](https://www.tensortonic.com/research/transformer/transformers-layer-normalization)
> - [TensorTonic: GPT-2 LayerNorm](https://www.tensortonic.com/research/gpt2/gpt2-layernorm)
>
> They are good follow-ups because they make the normalization step concrete in both the textbook and decoder-only settings:
>
> - computing per-token mean and variance
> - applying $\gamma$ and $\beta$ correctly
> - seeing where epsilon stabilizes the denominator
> - comparing generic LayerNorm with the way GPT-style blocks use it in practice
>
> A useful study order is:
>
> 1. compute one worked LayerNorm example by hand
> 2. inspect epsilon and nearly constant inputs
> 3. compare Pre-LN and Post-LN residual paths
> 4. then compare LayerNorm and RMSNorm as design choices

<video src="https://assets.montek.dev/lectures/media/llm/concepts/Layer%20Normalization/04_norm_variants_and_kernels.mp4" controls></video>

<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
    <div><a class="internal-link" data-href="Feed-Forward Network" href="Feed-Forward%20Network">Feed-Forward Network</a></div>
  </div>
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
    <div><a class="internal-link" data-href="Encoder Block" href="Encoder%20Block">Encoder Block</a></div>
  </div>
</div>

### References
Hinton, "Layer Normalization," 2016. https://www.cs.utoronto.ca/~hinton/absps/LayerNormalization.pdf [^2]: Ruibin Xiong et al., "On Layer Normalization in the Transformer Architecture," 2020. https://arxiv.org/abs/2002.04745 [^3]: Toan Q. Nguyen and Julian Salazar, "Transformers without Tears: Improving the Normalization of Self-Attention," 2019. https://papers.nips.cc/paper_files/paper/2019/file/1e8a19426224ca89e83cef47f1e7f53b-Paper.pdf [^4]: PyTorch, "torch.nn.LayerNorm," 2025. https://docs.pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html [^5]: Ashish Vaswani et al., "Attention Is All You Need," 2017. https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf [^6]: Hongyu Wang, Shuming Ma, Li Dong, et al., "DeepNet: Scaling Transformers to 1,000 Layers," 2022. https://arxiv.org/abs/2203.00555 [^7]: Hugo Touvron et al., "Going Deeper with Image Transformers," 2021. https://openaccess.thecvf.com/content/ICCV2021/papers/Touvron_Going_Deeper_With_Image_Transformers_ICCV_2021_paper.pdf [^8]: Jeonghoon Kim, Byeongchan Lee, and Cheonbok Park, "Peri-LN: Revisiting Normalization Layer in the Transformer Architecture," 2025. https://arxiv.org/abs/2502.02732 [^9]: Biao Zhang and Rico Sennrich, "Root Mean Square Layer Normalization," 2019. https://arxiv.org/abs/1910.05895 [^10]: NVIDIA, "Apex LayerNorm," 2025. https://nvidia.github.io/apex/layernorm.html [^11]: NVIDIA, "Transformer Engine user guide," 2025. https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/getting_started/index.html