> [!info] Course code
> Use the companion repository for runnable notebooks, figures, and implementation references for this lecture:
> - Theory notebook: [notebooks/encoder_block/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/encoder_block/lecture_walkthrough.ipynb)
> - Serious model contrast: [picollm/accelerated/gpt.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/gpt.py)
## What This Concept Is
If you want one clean unit that shows how a Transformer thinks, the encoder block is a strong place to start. First, tokens exchange information through self-attention. Then each token goes through its own feed-forward computation. Residual paths and normalization hold the whole thing together so the block can be stacked many times.
Even though `picollm` is decoder-only, the encoder block is still one of the clearest ways to understand the communication-plus-computation pattern that the whole architecture is built from.
## Foundation Terms You Need First
Keep the block in two halves. The first half is **self-attention**, where tokens look at one another. The second half is the **feed-forward network**, where each token is transformed independently. Around both halves sit **residual connections** and **layer normalization**, which help preserve shape and stabilize optimization.
So when you read "encoder block," do not think of one mysterious monolith. Think of a small repeated circuit with four familiar parts: attention, feed-forward computation, residual paths, and normalization.
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Encoder%20Block/01_encoder_block_dataflow.mp4" controls></video>
## Why this note still matters even though picoLLM is decoder-only
A common question is why the course still covers encoder blocks carefully if `picollm` is decoder-only.
The answer is that encoder blocks are still the cleanest way to understand:
- the basic communication-plus-computation structure
- the role of bidirectional self-attention
- the difference between contextualization and generation
Then `picollm/accelerated/gpt.py` shows the decoder-only specialization of that general Transformer block story.
That keeps the map clear:
- this note explains the generic block family
- the decoder note explains the GPT-style specialization
- picoLLM is where the serious decoder-only implementation lives
This note covers three connected ideas:
- what an encoder block is as a composed object: self-attention, FFN, residual connections, normalization, masks, and positional injection working together rather than as isolated parts[^1]
- the canonical forward pass, the Post-LN versus Pre-LN comparison, and why the encoder differs structurally from a [[Glossary#Decoder block|decoder block]][^2]
- the optimization and systems issues around encoder blocks: gradient highways, parameter and [[Glossary#FLOP / FLOPS|FLOP]] accounting, activation memory, fused kernels, and common modern encoder-block variants[^3]
## What the encoder block does
### Communication plus computation
The cleanest way to read the encoder block is as a two-phase loop:
- **communication**: self-attention mixes information across tokens
- **computation**: the FFN applies a nonlinear map to each token independently
This is the minimal conceptual decomposition of the block.
Attention is where tokens influence one another. The FFN is where each token privately computes on the contextualized information it now carries. Residuals keep information flowing, and normalization keeps the [[Glossary#Residual stream|residual stream]] numerically trainable.
That gives a strong lecture line:
> Tokens talk in attention, then think in the FFN.
### Why the encoder is not the decoder
It is easy to blur encoder blocks and decoder blocks together because both contain self-attention and FFNs. The structural difference is important:
- encoder self-attention is bidirectional
- decoder self-attention is causally masked
- decoder blocks usually include an additional cross-attention sublayer
So the encoder is the contextualizer, not the autoregressive generator. It reads the whole input sequence at once and produces contextual representations for every token.[^1]
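The contrast is easy to see in a few lines. The sketch below is a toy single-head demo, not picollm code: the same scores go through softmax with and without a causal mask, and only the decoder-style case restricts the first token to itself.

```python
# Toy single-head attention: contrast bidirectional (encoder) attention with
# causally masked (decoder-style) attention on the same random activations.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, L, d = 1, 5, 8                                   # batch, length, width
q = k = torch.randn(B, L, d)

scores = q @ k.transpose(-2, -1) / d**0.5           # [B, L, L] raw scores

# Encoder: no causal mask, every token may attend to every token.
enc_weights = F.softmax(scores, dim=-1)

# Decoder-style: mask out attention to future positions.
causal = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
dec_weights = F.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)

print(enc_weights[0, 0])   # first token spreads mass over all 5 positions
print(dec_weights[0, 0])   # first token can only attend to itself: [1, 0, 0, 0, 0]
```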
## Why positional information is necessary inside the encoder
### Self-attention alone is permutation equivariant
Without positional information, self-attention treats the sequence as a set of token vectors. If you permute the inputs, the outputs permute in the same way. That means a stack of encoder blocks without position injection cannot represent order-sensitive functions.[^4][^5]
This is one of the most important high-level truths about the encoder:
An encoder block without position is not really a sequence model in the ordinary sense.
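The symmetry is easy to verify directly. The sketch below uses PyTorch's `nn.MultiheadAttention` as a stand-in (not the picollm attention) and shows that permuting the input tokens of a plain self-attention layer simply permutes its outputs.

```python
# Self-attention without positional injection is permutation equivariant:
# permuting the input tokens permutes the outputs in exactly the same way.
import torch

torch.manual_seed(0)
attn = torch.nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)

x = torch.randn(1, 6, 16)                     # [batch, tokens, d_model]
perm = torch.randperm(6)

with torch.no_grad():
    out, _ = attn(x, x, x)
    out_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])

# The permuted input yields the permuted output (up to float tolerance).
print(torch.allclose(out[:, perm], out_perm, atol=1e-5))   # True
```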
### Position can enter at multiple points
There are two standard ways to break this symmetry:
- add absolute positional information to the input representations
- modify attention itself with relative or bias-based positional mechanisms
That is why the encoder block is best understood as:
`positional injection + token mixing + channel mixing + residual stabilization`
rather than as “attention plus MLP.”
## The canonical forward pass
### Original Post-LN encoder block
In the original Transformer, one encoder layer applies:
1. multi-head self-attention
2. residual addition plus LayerNorm
3. position-wise FFN
4. residual addition plus LayerNorm
In compact form:
$$
A = \operatorname{LayerNorm}(x + \operatorname{MHA}(x))
$$
$$
y = \operatorname{LayerNorm}(A + \operatorname{FFN}(A))
$$
This is the original “Add & Norm” formulation.[^1]
### Modern Pre-LN encoder block
Many modern models instead use the Pre-LN formulation:
$$
A = x + \operatorname{MHA}(\operatorname{LayerNorm}(x))
$$
$$
y = A + \operatorname{FFN}(\operatorname{LayerNorm}(A))
$$
The block still does the same high-level operations, but the normalization placement changes the optimization dynamics substantially.[^2]
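A minimal side-by-side sketch makes the difference concrete. This is not the picollm implementation; the class and flag names (`EncoderBlock`, `pre_ln`) are illustrative, and the sublayers are kept identical so that only the LayerNorm placement moves.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Toy encoder block with a switchable Pre-LN / Post-LN layout."""

    def __init__(self, d_model=64, n_heads=4, d_ff=256, pre_ln=True):
        super().__init__()
        self.pre_ln = pre_ln
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x, key_padding_mask=None):
        if self.pre_ln:
            # Pre-LN: normalize inside the branch, keep the residual path clean.
            h = self.ln1(x)
            x = x + self.attn(h, h, h, key_padding_mask=key_padding_mask)[0]
            x = x + self.ffn(self.ln2(x))
        else:
            # Post-LN: "Add & Norm" after each sublayer, as in the original paper.
            x = self.ln1(x + self.attn(x, x, x, key_padding_mask=key_padding_mask)[0])
            x = self.ln2(x + self.ffn(x))
        return x

x = torch.randn(2, 10, 64)
print(EncoderBlock(pre_ln=True)(x).shape, EncoderBlock(pre_ln=False)(x).shape)
```

Either layout returns a `[B, L, d_model]` tensor, which is exactly the shape contract discussed next.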
### Why the shape is preserved
Every encoder block returns a tensor with the same shape as its input. This is what makes stacking natural:
- input: $[B, L, d_{\text{model}}]$
- output: $[B, L, d_{\text{model}}]$
Self-attention mixes tokens but returns the residual width $d_{\text{model}}$. The FFN expands internally to $d_{\text{ff}}$, but projects back down to $d_{\text{model}}$. The [[Glossary#Residual connection|residual connection]] therefore remains type-compatible at every block.
This uniform interface is one of the key reasons Transformer stacks are easy to scale in depth.
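As a quick sanity check, PyTorch's built-in encoder layer (used here only as a stand-in for any block with this contract) can be stacked arbitrarily deep without changing the interface:

```python
# Stacking many encoder layers never changes the [B, L, d_model] interface.
import torch

layer = torch.nn.TransformerEncoderLayer(
    d_model=64, nhead=4, dim_feedforward=256, batch_first=True
)
stack = torch.nn.TransformerEncoder(layer, num_layers=12)

x = torch.randn(2, 10, 64)        # [B, L, d_model]
print(stack(x).shape)             # torch.Size([2, 10, 64]), same shape out
```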
## Shapes and implementation structure
### Attention path
For self-attention inside the encoder:
- input $x$ is projected into $Q$, $K$, and $V$
- these are split into $h$ heads
- attention is computed within each head
- head outputs are concatenated and projected back
Modern code usually implements the per-head projections through one fused QKV projection followed by reshape and split.
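Here is a minimal sketch of that pattern, with illustrative names (`qkv_proj`, `to_heads`) that are not taken from picollm:

```python
import torch
import torch.nn as nn

B, L, d_model, n_heads = 2, 10, 64, 4
d_head = d_model // n_heads

qkv_proj = nn.Linear(d_model, 3 * d_model)      # one fused projection for Q, K, V
x = torch.randn(B, L, d_model)

qkv = qkv_proj(x)                               # [B, L, 3 * d_model]
q, k, v = qkv.chunk(3, dim=-1)                  # split into the three streams

def to_heads(t):
    # Reshape [B, L, d_model] -> [B, n_heads, L, d_head] for per-head attention.
    return t.view(B, L, n_heads, d_head).transpose(1, 2)

q, k, v = (to_heads(t) for t in (q, k, v))
print(q.shape)                                  # torch.Size([2, 4, 10, 16])
```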
### FFN path
The FFN then applies:
$$
d_{\text{model}} \to d_{\text{ff}} \to d_{\text{model}}
$$
independently at every position.
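A small sketch (toy dimensions, not picollm's FFN) shows both the expand-then-project shape and the per-position independence:

```python
import torch
import torch.nn as nn

d_model, d_ff = 64, 256                        # common choice: d_ff = 4 * d_model
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 10, d_model)                # [B, L, d_model]
out = ffn(x)
print(out.shape)                               # torch.Size([2, 10, 64])

# Per-position independence: a token processed alone matches its output inside
# the full sequence, because no mixing happens across positions in the FFN.
print(torch.allclose(out[0, 3], ffn(x[0, 3]), atol=1e-6))   # True
```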
This means the encoder block alternates:
- token mixing across the sequence
- channel mixing within each token
That alternation is the core architectural rhythm of the encoder stack.
## Residuals and gradient flow
### Why residual connections matter so much
For a residual map $y = x + F(x)$, the Jacobian is:
$$
\frac{dy}{dx} = I + \frac{dF}{dx}
$$
This identity term is the mathematical reason gradients can still flow even if the sublayer Jacobian is poorly conditioned.[^6]
That is not an implementation convenience. It is the central mechanism that makes deep stacks of encoder blocks trainable.
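A tiny autograd check makes the identity term visible. In the sketch below the sublayer's weights are deliberately zeroed so that $dF/dx = 0$; gradient still reaches the input through the identity path.

```python
import torch

x = torch.randn(4, requires_grad=True)
sublayer = torch.nn.Linear(4, 4)
torch.nn.init.zeros_(sublayer.weight)        # force dF/dx to be exactly zero
torch.nn.init.zeros_(sublayer.bias)

y = x + sublayer(x)                          # residual form: y = x + F(x)
y.sum().backward()
print(x.grad)                                # all ones: the identity term survives
```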
### Residuals interact with normalization placement
This is where Pre-LN and Post-LN differ sharply.
In Pre-LN, the residual branch remains a cleaner identity path because normalization is inside the sublayer branch. In Post-LN, normalization sits after the addition, so the residual signal is filtered by LayerNorm at every block boundary.[^2]
That is the shortest rigorous explanation of why LN placement changes gradient behavior with depth.
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Encoder%20Block/02_positional_injection_and_symmetry.mp4" controls></video>
## Post-LN versus Pre-LN
### Why Post-LN was historically natural
The original Transformer used Post-LN and worked well in its regime.[^1] It is conceptually simple:
- compute a sublayer output
- add the residual
- normalize the result
But as Transformer depth and scale grew, this layout became associated with stronger [[Glossary#Warmup|warmup]] dependence and more fragile optimization.
### Why Pre-LN became common
Mean-field analysis shows that Pre-LN gives more stable gradients at initialization, while Post-LN tends to produce large expected gradients near upper layers, making early large learning rates unstable.[^2]
This is a very important architecture-to-optimizer link:
Warmup is not just optimizer folklore. It is partly a consequence of block design.
### Pre-LN is still not the end of the story
Modern work keeps exploring normalization layout and residual scaling:
- Pre-LN for stable training
- RMSNorm-based stacks for simpler normalization
- DeepNorm for extreme depth
- LayerScale and similar residual-branch scaling ideas
- Peri-LN-like alternatives in newer large-scale studies[^7][^8][^9][^10]
It is worth stating this explicitly because it prevents “Pre-LN solved it” from sounding like settled doctrine.
## Parameter count, FLOPs, and bottlenecks
### Parameter distribution in a dense encoder block
Ignoring biases, a dense encoder block typically contains:
- attention projections of order $4 d_{\text{model}}^2$
- FFN parameters of order $2 d_{\text{model}} d_{\text{ff}}$
- normalization parameters that are tiny by comparison
With the common choice $d_{\text{ff}} = 4 d_{\text{model}}$, the FFN often dominates parameter count.[^11]
This is one of the highest-value practical facts in the whole note:
In many encoder blocks, attention does the token mixing, but the FFN holds most of the parameter budget.
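The arithmetic is short enough to write out. A back-of-envelope count at a BERT-base-like width, biases ignored:

```python
d_model = 768
d_ff = 4 * d_model

attn_params = 4 * d_model * d_model          # Wq, Wk, Wv, Wo projections
ffn_params = 2 * d_model * d_ff              # up-projection + down-projection

print(f"attention: {attn_params / 1e6:.1f}M, ffn: {ffn_params / 1e6:.1f}M")
# attention: 2.4M, ffn: 4.7M -> the FFN holds roughly two thirds of the block
```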
### Compute depends on sequence length
For sequence length $L$:
- dense attention scales as $O(L^2 d_{\text{model}})$
- the FFN scales as $O(L d_{\text{model}} d_{\text{ff}})$
So:
- at moderate sequence lengths, the FFN can dominate compute
- at long sequence lengths, attention’s quadratic term becomes the major issue
That is why “what is the bottleneck?” depends strongly on context length.
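Dropping constants, the ratio of the two leading terms is roughly $L / d_{\text{ff}}$, so a quick calculation shows where the crossover sits:

```python
d_model, d_ff = 768, 4 * 768

for L in (512, 32_768):
    attn_term = L * L * d_model              # O(L^2 * d_model) token mixing
    ffn_term = L * d_model * d_ff            # O(L * d_model * d_ff) channel mixing
    print(f"L={L}: attention/ffn ratio = {attn_term / ffn_term:.2f}")
# L=512: ratio ~0.17 (FFN dominates); L=32768: ratio ~10.7 (attention dominates)
```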
### Activation memory
Encoder blocks are also a memory story:
- attention either materializes $L \times L$ score and probability matrices or must manage them in a memory-aware way
- FFNs create wide $d_{\text{ff}}$ activations that are expensive to store for backward
This is why memory-optimized encoder designs often attack both sides:
- fused or tiled attention
- checkpointing
- chunked FFNs
- reversible residual formulations[^3][^12][^13]
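Of the levers just listed, checkpointing is the easiest to show in a few lines. The sketch below applies `torch.utils.checkpoint` to a stock PyTorch encoder layer: activations inside the block are recomputed during backward instead of stored, trading compute for memory.

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.TransformerEncoderLayer(
    d_model=64, nhead=4, dim_feedforward=256, batch_first=True
)
x = torch.randn(2, 10, 64, requires_grad=True)

# use_reentrant=False is the recommended mode in recent PyTorch versions.
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)                          # gradients still flow: [2, 10, 64]
```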
## Masks inside the encoder
### No causal mask by default
The encoder usually does not use a [[Glossary#Causal mask|causal mask]]. Every token may attend to every non-padding token.
This is a simple but crucial difference from the decoder.
### Padding masks still matter
The encoder does need padding masks so padding tokens do not contribute misleading attention mass. In practice, this is one of the most common implementation failure points.
You should know two things:
- padding mask semantics differ across APIs
- query-side padding behavior and key-side masking are not always handled identically by frameworks
That is why mask bugs often look like “the encoder trains but underperforms mysteriously.”
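Here is a minimal example of the PyTorch convention, where `True` marks positions to ignore (other libraries invert this, which is exactly why mask bugs are so common):

```python
import torch

attn = torch.nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
x = torch.randn(2, 5, 16)                        # batch of 2, padded to length 5

# The second sequence has only 3 real tokens; the last 2 positions are padding.
key_padding_mask = torch.tensor([[False, False, False, False, False],
                                 [False, False, False, True,  True ]])

out, weights = attn(x, x, x, key_padding_mask=key_padding_mask)
print(weights[1, 0])                             # padded keys receive zero attention
```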
> [!question] Quick check
> What bug should you suspect if an encoder seems to train but quietly underperform, especially on padded batches?
>> [!answer] A missing or incorrect padding mask. The model may be attending to padding positions and learning distorted context patterns.
## Encoder-block variants
### Attention variants in the encoder slot
The encoder block can keep the same overall skeleton while swapping the attention mechanism:
- relative position attention
- sparse or local attention
- long-context recurrence or segment reuse
- exact fused attention kernels
This is a good way to present variants without making the architecture feel chaotic. The outer block stays recognizable even when the inner attention changes.
### FFN variants in the encoder slot
The FFN slot is equally rich:
- dense ReLU or GELU FFN
- gated FFNs like GEGLU or SwiGLU
- MoE FFNs for sparse conditional computation
This means the encoder block is really a template rather than one rigid circuit.
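As one concrete example of a gated variant, here is a SwiGLU-style FFN sketch; the class name and dimensions are illustrative, not picollm's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN: the up-projection splits into a value path and a gate path."""

    def __init__(self, d_model=64, d_ff=256):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff)
        self.w_up = nn.Linear(d_model, d_ff)
        self.w_down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # SiLU(gate) * value, then project back down to the residual width.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 10, 64)
print(SwiGLUFFN()(x).shape)                    # torch.Size([2, 10, 64])
```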
### Memory-optimized encoder blocks
Reformer-style encoder variants illustrate how aggressively the block can be re-engineered while keeping the same broad communication-plus-computation pattern:
- reversible residuals
- chunked FFNs
- efficient attention patterns[^12]
That is one of the best examples of “same high-level block, different systems realization.”
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Encoder%20Block/03_preln_vs_postln_encoder.mp4" controls></video>
## Practical systems view
### Exact attention can still be engineered better
FlashAttention is especially important for encoder blocks because it shows that exact self-attention can be computed with much better IO behavior by avoiding materialization of large intermediates in high-bandwidth memory.[^3]
This is a major lesson for you:
Sometimes the right optimization is not to change the math, but to change the scheduling.
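In PyTorch, the cleanest way to see this is `torch.nn.functional.scaled_dot_product_attention`: the math is unchanged, but the backend may dispatch to a FlashAttention-style fused kernel when hardware and dtype allow it.

```python
import torch
import torch.nn.functional as F

B, H, L, d_head = 2, 4, 128, 16
q = torch.randn(B, H, L, d_head)
k = torch.randn(B, H, L, d_head)
v = torch.randn(B, H, L, d_head)

# Exact attention; the backend chooses an IO-efficient schedule when it can.
# For the encoder slot there is no causal mask (is_causal=False).
out = F.scaled_dot_product_attention(q, k, v, is_causal=False)
print(out.shape)                               # torch.Size([2, 4, 128, 16])
```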
### Fused norms and MLP kernels
Encoder blocks also benefit from fused LayerNorm-plus-linear and LayerNorm-plus-MLP kernels in high-performance training stacks. Transformer Engine and related libraries expose these fused patterns directly.[^15]
This matters because the encoder block is executed repeatedly across depth. Even small per-block savings compound dramatically.
### Mixed precision
In mixed precision, encoder blocks are sensitive in several places:
- [[Glossary#Softmax|softmax]] in attention
- variance computation in normalization
- wide FFN intermediates
- MoE routers if present
That is why serious implementations are usually selective about where fp32 accumulation is still used.
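A small illustration of that selectivity, assuming a CUDA device: under `torch.autocast`, matmul-heavy ops run in bf16 while ops such as LayerNorm are kept in float32 by the autocast policy.

```python
import torch

lin = torch.nn.Linear(64, 256).cuda()
ln = torch.nn.LayerNorm(64).cuda()
x = torch.randn(2, 10, 64, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    h = lin(x)            # runs in bf16 under autocast
    n = ln(x)             # LayerNorm is kept in fp32 by the autocast policy

print(h.dtype, n.dtype)   # torch.bfloat16 torch.float32
```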
> [!example] Notebook walkthroughs in this lecture
>
> If you want to study this note in code, use these notebook sections. If the viewer ignores the fragment, search for the exact heading text in the notebook:
>
> - [`Encoder block preserves shape`](https://github.com/Montekkundan/llm/blob/main/notebooks/encoder_block/lecture_walkthrough.ipynb#encoder-block-preserves-shape)
> - [`Encoder attention is bidirectional`](https://github.com/Montekkundan/llm/blob/main/notebooks/encoder_block/lecture_walkthrough.ipynb#encoder-attention-is-bidirectional)
> - [`Padding masks remove invalid tokens`](https://github.com/Montekkundan/llm/blob/main/notebooks/encoder_block/lecture_walkthrough.ipynb#padding-masks-remove-invalid-tokens)
> - [`Post-LN and Pre-LN block layouts`](https://github.com/Montekkundan/llm/blob/main/notebooks/encoder_block/lecture_walkthrough.ipynb#post-ln-and-pre-ln-block-layouts)
>
> A useful study order is:
>
> 1. confirm the shape-preserving block contract
> 2. compare bidirectional attention with the decoder-style causal restriction
> 3. inspect padding-mask behavior directly
> 4. then compare Post-LN and Pre-LN block layouts
>
> <video src="https://assets.montek.dev/lectures/media/llm/concepts/Encoder%20Block/04_encoder_cost_masks_and_variants.mp4" controls></video>
> [!tip] TensorTonic practice for this lecture
>
> If you want to practice this lecture in a more implementation-focused format, work through these TensorTonic exercises:
>
> - [TensorTonic: Transformers Encoder Block](https://www.tensortonic.com/research/transformer/transformers-encoder-block)
> - [TensorTonic: BERT Pooler](https://www.tensortonic.com/research/bert/bert-pooler)
> - [TensorTonic: BERT NSP](https://www.tensortonic.com/research/bert/bert-nsp)
>
> They are good follow-ups because they connect the encoder stack to common BERT-style downstream structure:
>
> - composing attention, residual, normalization, and FFN into one reusable block
> - seeing how the pooled representation is formed on top of encoder outputs
> - understanding how next-sentence-style classification used that pooled signal in the original BERT setup
<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
<div><a class="internal-link" data-href="Layer Normalization" href="Layer%20Normalization">Layer Normalization</a></div>
</div>
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
<div><a class="internal-link" data-href="Decoder Block" href="Decoder%20Block">Decoder Block</a></div>
</div>
</div>
### References
[^1]: Ashish Vaswani et al., "Attention Is All You Need," 2017. https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf
[^2]: Ruibin Xiong et al., "On Layer Normalization in the Transformer Architecture," 2020. https://proceedings.mlr.press/v119/xiong20b/xiong20b.pdf
[^3]: Tri Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness," 2022. https://openreview.net/forum?id=H4DqfPSibmx
[^4]: Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani, "Self-Attention with Relative Position Representations," 2018. https://aclanthology.org/N18-2074.pdf
[^5]: Chulhee Yun et al., "Are Transformers universal approximators of sequence-to-sequence functions?," 2019. https://arxiv.org/abs/1912.10077
[^6]: Kaiming He et al., "Deep Residual Learning for Image Recognition," 2015. https://arxiv.org/abs/1512.03385
[^7]: Hongyu Wang, Shuming Ma, Li Dong, et al., "DeepNet: Scaling Transformers to 1,000 Layers," 2022. https://arxiv.org/abs/2203.00555
[^8]: Toan Q. Nguyen and Julian Salazar, "Transformers without Tears: Improving the Normalization of Self-Attention," 2019. https://papers.nips.cc/paper_files/paper/2019/file/1e8a19426224ca89e83cef47f1e7f53b-Paper.pdf
[^9]: Hugo Touvron et al., "Going Deeper with Image Transformers," 2021. https://openaccess.thecvf.com/content/ICCV2021/papers/Touvron_Going_Deeper_With_Image_Transformers_ICCV_2021_paper.pdf
[^10]: Jeonghoon Kim, Byeongchan Lee, and Cheonbok Park, "Peri-LN: Revisiting Normalization Layer in the Transformer Architecture," 2025. https://arxiv.org/abs/2502.02732
[^11]: Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy, "Transformer Feed-Forward Layers Are Key-Value Memories," 2021. https://aclanthology.org/2021.emnlp-main.446.pdf
[^12]: Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya, "Reformer: The Efficient Transformer," 2020. https://openreview.net/forum?id=rkgNKkHtvB
[^13]: Aidan N. Gomez et al., "The Reversible Residual Network: Backpropagation Without Storing Activations," 2017. https://arxiv.org/abs/1707.04585
[^15]: NVIDIA, "Transformer Engine user guide," 2025. https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/getting_started/index.html