> [!info] Course code
> Use the companion repository for runnable notebooks, figures, and implementation references for this lecture:
> - [notebooks/decoder_block/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/decoder_block/lecture_walkthrough.ipynb)
## What This Concept Is
A decoder block is what turns the general Transformer story into a text generator. It still has attention, feed-forward computation, residual paths, and normalization, but now the attention is constrained so each position can only look backward. That one restriction is what makes next-token generation possible.
If you want the shortest intuition, it is this: the decoder block is the repeated cell that lets the model read the past without peeking at the future.
## Foundation Terms You Need First
The most important extra object in this note is the **[[Glossary#Causal mask|causal mask]]**. Self-attention is still present, but it is no longer fully bidirectional. The block also still contains a **feed-forward network**, **residual connections**, and **normalization**, just like the encoder story.
So the real shift from encoder block to decoder block is not that everything changes. Most of the machinery stays familiar. What changes is the visibility rule inside attention, and that rule changes the whole behavior of the model.
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Decoder%20Block/01_decoder_block_flow.mp4" controls></video>
## How this note fits the course-to-picoLLM map
This is one of the most important bridge lectures in the entire course.
- the notebook shows the canonical decoder block
- `picollm/accelerated/gpt.py` shows the serious decoder-only runtime
- `picollm/accelerated/pretrain/train.py` shows the shape knobs that actually instantiate the stack
You should also keep the reference split straight:
- `rasbt` is the clean concept-first external reference
- `picollm` is the course’s real implementation path
- `nanochat` is the systems-oriented external comparison
## What the decoder block must do
A decoder-only block has four jobs:
- enforce causality: each token can attend to previous tokens but not future ones
- increase expressivity: transform representations with a learned nonlinear map
- preserve information flow: residual connections provide an identity path
- stabilize optimization: normalization keeps training well-conditioned
That list matters because it tells you something structural: the decoder block is not one idea. It is a negotiated settlement between information flow, expressivity, and optimization stability. [^3]
## Decoder block vs Transformer decoder
The original Transformer paper uses an encoder-decoder architecture. In that setting, the *decoder* contains:
1. masked [[Glossary#Self-attention|self-attention]]
2. encoder-decoder attention (cross-attention)
3. a [[Glossary#Feed-forward network (FFN)|feed-forward network]]
A GPT-style model is decoder-only. There is no encoder sequence, so there is no cross-attention. The GPT block is:
- masked self-attention
- feed-forward network
Everything else (residuals, normalization, dropout variants) is wiring and optimization infrastructure. [^4]
## The canonical pre-norm structure
In a pre-norm GPT-style block, the forward pass is conceptually:
1. normalize the input
2. run masked multi-head self-attention
3. add a [[Glossary#Residual connection|residual connection]]
4. normalize again
5. run the feed-forward network
6. add another residual connection
Abstractly:
$$
h_1 = h + \operatorname{MHA}(\operatorname{LN}(h))
$$
$$
h_2 = h_1 + \operatorname{FFN}(\operatorname{LN}(h_1))
$$
This choice (LayerNorm inside the residual branch) is not cosmetic. Pre-norm placement is strongly associated with more stable gradients in deep Transformers and can reduce reliance on careful [[Glossary#Warmup|warmup]] schedules. [^3] [^5]
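A minimal PyTorch sketch of this pre-norm wiring follows. The module names, the GELU activation, and the 4x hidden width are illustrative placeholders, not the course repo's exact API:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise MLP: expand, nonlinearity, project back."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class DecoderBlock(nn.Module):
    """Pre-norm GPT-style block: LN -> masked MHA -> residual, LN -> FFN -> residual."""
    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model, 4 * d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        T = h.size(1)
        # causal mask: True marks positions that must NOT be attended to
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=h.device), diagonal=1)
        x = self.ln1(h)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask, need_weights=False)
        h = h + attn_out                  # h_1 = h + MHA(LN(h))
        h = h + self.ffn(self.ln2(h))     # h_2 = h_1 + FFN(LN(h_1))
        return h
```

A quick shape check mirrors the notebook's "Decoder block preserves shape" section: feeding a `(batch, seq, d_model)` tensor through the block returns a tensor of the same shape.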
## How picoLLM implements the serious version
The accelerated stack keeps the same residual logic, but you should know that it is not a vanilla GPT-2 block.
In `picollm/accelerated/gpt.py`, you will see:
- RMSNorm instead of LayerNorm
- RoPE instead of absolute position embeddings
- grouped-query attention through `n_head` and `n_kv_head`
- optional sliding-window attention through `window_pattern`
- a ReLU-squared MLP (sketched below)
- extra value embeddings in alternating layers
That is an important transition in the course. First you learn the canonical decoder block. Then you inspect how a serious repo specializes that template to chase better optimization and faster inference.
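One of those variants is easy to show in isolation. Here is a hedged sketch of a ReLU-squared MLP, one common reading of that item; the actual module in `picollm/accelerated/gpt.py` may differ in dimensions and details:

```python
import torch
import torch.nn as nn

class ReluSquaredMLP(nn.Module):
    """Feed-forward branch with a squared-ReLU nonlinearity: down(relu(up(x)) ** 2)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # squaring the ReLU keeps the activation non-negative and sharpens large values
        return self.down(torch.relu(self.up(x)) ** 2)
```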
> [!example] Notebook follow-up
> - [`Decoder block preserves shape`](https://github.com/Montekkundan/llm/blob/main/notebooks/decoder_block/lecture_walkthrough.ipynb#decoder-block-preserves-shape)
> - [`Compare with unmasked encoder-style attention`](https://github.com/Montekkundan/llm/blob/main/notebooks/decoder_block/lecture_walkthrough.ipynb#compare-with-unmasked-encoder-style-attention)
> Use these notebook sections here to compare the canonical masked block against the unmasked encoder-style pattern.
> [!tip] TensorTonic follow-up
> - [TensorTonic: GPT-2 Decoder Block](https://www.tensortonic.com/research/gpt2/gpt2-decoder-block)
> Use it here to assemble the whole GPT-style block after the canonical structure is clear.
## Why residual connections matter
Without residual connections, deep stacks become brittle: each layer must reconstruct a useful representation from scratch. Residual paths change the semantics of a layer from "replace the state" to "refine the state."
This is the same optimization intuition that made very deep residual networks trainable in other domains: learn a correction on top of an identity map. [^6]
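A tiny illustration of the "refine, don't replace" point. If the residual branch starts at zero (a deliberately extreme initialization, used here only to make the identity path visible), the block is exactly the identity at initialization, so training only has to learn a correction on top of it:

```python
import torch
import torch.nn as nn

class ZeroInitBranch(nn.Module):
    """Residual block whose branch projection starts at zero, so the block starts as the identity."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.proj(torch.relu(h))  # refine the state, never discard it

h = torch.randn(2, 5, 8)
block = ZeroInitBranch(8)
print(torch.equal(block(h), h))  # True at init: the identity path carries everything
```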
> [!tip] TensorTonic follow-up
> - [TensorTonic: GPT-2 Init Scaling](https://www.tensortonic.com/research/gpt2/gpt2-init-scaling)
> Use it here to connect the residual-depth story to GPT-style initialization and scaling choices.
> [!question] Quick check
> Which story better explains why deeper residual stacks can work?
> 1. each layer forgets the previous one
> 2. each layer preserves and improves it
>> [!answer] 2. Residual connections keep an identity path, so each layer can refine a useful state instead of rebuilding it from scratch.
## Why masking belongs inside the block
A decoder block is what makes autoregressive language modeling possible because its attention is *causally masked*. Each position can only attend to itself and earlier positions. That is the structural enforcement of the factorization:
$$
p(x) = \prod_t p(x_t \mid x_{<t})
$$
### The actual mask you apply
In scaled dot-product attention, you typically implement causal masking by adding a matrix $M$ to the attention [[Glossary#Logits|logits]] before [[Glossary#Softmax|softmax]]:
$$
A = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)
$$
where:
- $M[i, j] = 0$ if $j \le i$
- $M[i, j] = -\infty$ if $j > i$
This produces a lower-triangular attention pattern: each query position keeps nonzero weight only on itself and earlier positions. The sketch below builds this mask explicitly.
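A minimal sketch of that additive mask, with placeholder sizes and random stand-in projections, using the same $M$ convention as above:

```python
import torch
import torch.nn.functional as F

T, d_k = 5, 16
q = torch.randn(T, d_k)
k = torch.randn(T, d_k)

# M[i, j] = 0 where j <= i, -inf where j > i
M = torch.full((T, T), float("-inf")).triu(diagonal=1)

scores = q @ k.T / d_k**0.5 + M
A = F.softmax(scores, dim=-1)

# rows sum to 1, and every weight on a future position is exactly zero
assert torch.allclose(A.sum(dim=-1), torch.ones(T))
assert torch.equal(A, A.tril())
print(A)
```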
> [!example] Notebook follow-up
> - [`Causal mask blocks future positions`](https://github.com/Montekkundan/llm/blob/main/notebooks/decoder_block/lecture_walkthrough.ipynb#causal-mask-blocks-future-positions)
> Use this notebook section here to see the lower-triangular mask as an explicit attention constraint.
> [!tip] TensorTonic follow-up
> - [TensorTonic: GPT-2 Causal Attention](https://www.tensortonic.com/research/gpt2/gpt2-causal-attention)
> Use it here to practice the masked-attention step directly.
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Decoder%20Block/02_causal_mask_inside_block.mp4" controls></video>
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Decoder%20Block/03_training_vs_inference.mp4" controls></video>
## Training-time view vs inference-time view
It is easy to conflate "masked" with "sequential." The important separation is:
- training can compute attention in parallel over the whole sequence while still enforcing causality via masking (the check below makes this concrete)
- inference must generate tokens sequentially by construction
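A quick way to see the first point, under an illustrative setup with PyTorch's built-in multi-head attention: run one masked attention pass over the whole sequence (the training-time view), then perturb only the last token and confirm that no earlier output changes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, d_model, n_head = 6, 32, 4
attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

x = torch.randn(1, T, d_model)
y, _ = attn(x, x, x, attn_mask=mask, need_weights=False)

# perturb only the last position; earlier outputs must stay the same
x2 = x.clone()
x2[:, -1] += 1.0
y2, _ = attn(x2, x2, x2, attn_mask=mask, need_weights=False)

print(torch.allclose(y[:, :-1], y2[:, :-1], atol=1e-6))  # True: no peeking at the future
```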
### KV cache: why decoding is a systems problem
In inference, the decoder repeatedly reuses past keys and values. Implementations therefore cache K and V tensors per layer (often called a [[Glossary#KV cache|KV cache]] or `past_key_values`) to avoid recomputing attention over the entire prefix at every step. [^7]
This is also why production code is full of attention variants that target KV cache bandwidth:
- Multi-Query Attention (MQA): share keys/values across heads to make decoding cheaper [^8]
- Grouped-Query Attention (GQA): share keys/values across groups of heads to recover quality while staying fast [^9]
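To make the caching idea concrete, here is a minimal single-head sketch with random stand-in projections instead of a real model: each decode step appends one key/value row and attends over the whole cache instead of recomputing the prefix:

```python
import torch
import torch.nn.functional as F

def attend(q, K, V, d_k):
    """Single-query attention over all cached positions."""
    scores = q @ K.T / d_k**0.5           # (1, t)
    return F.softmax(scores, dim=-1) @ V  # (1, d_k)

d_k = 16
K_cache = torch.empty(0, d_k)  # grows by one row per generated token
V_cache = torch.empty(0, d_k)

for step in range(5):
    # in a real model, q, k, v come from projecting the newest token's hidden state
    q = torch.randn(1, d_k)
    k_new = torch.randn(1, d_k)
    v_new = torch.randn(1, d_k)

    # append this step's key/value instead of recomputing the whole prefix
    K_cache = torch.cat([K_cache, k_new], dim=0)
    V_cache = torch.cat([V_cache, v_new], dim=0)

    out = attend(q, K_cache, V_cache, d_k)
    print(step, K_cache.shape, out.shape)
```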
## How depth changes behavior
Depth is both a capability knob and a systems knob:
- deeper stacks can represent more complex, compositional transformations (more repeated mixing + nonlinearity)
- deeper stacks also increase [[Glossary#Latency|latency]], memory, and optimization difficulty
There is also a useful nuance here: empirical scaling studies find that, within a wide range, overall scale (parameters, data, compute) dominates many "shape" details, even though depth still changes runtime and implementation tradeoffs. [^10]
## Relationship to the production app
In the canonical view, the decoder block is explicit and simple. In the production path, it may include extra choices:
- RMSNorm instead of LayerNorm [^11] (a minimal definition is sketched after this list)
- RoPE instead of absolute positional encodings [^12]
- MQA or GQA to reduce KV cache cost [^8] [^9]
- fused attention kernels (FlashAttention-family) for IO efficiency [^13]
- FP8-friendly linear layers and dtype-aware projections in the accelerated stack when the hardware supports them
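Of these, RMSNorm is the simplest to write down. A sketch following the Zhang and Sennrich definition; the epsilon placement and dtype handling vary between implementations, including the accelerated stack's:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Scale by the root mean square of the features; no mean subtraction, no bias."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.weight * x / rms
```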
This note names those variants without letting them obscure the core mechanism.
## Key takeaway
The decoder block is the place where the Transformer stops being a bag of components and becomes an actual language-modeling machine.
---
[^1]: Alec Radford et al., "Language Models are Unsupervised Multitask Learners," 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
[^2]: Alec Radford et al., "Improving Language Understanding by Generative Pre-Training," 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
[^3]: Ruibin Xiong et al., "On Layer Normalization in the Transformer Architecture," 2020. https://arxiv.org/abs/2002.04745
[^4]: Ashish Vaswani et al., "Attention Is All You Need," 2017. https://arxiv.org/abs/1706.03762
[^5]: Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton, "Layer Normalization," 2016. https://arxiv.org/abs/1607.06450
[^6]: Kaiming He et al., "Deep Residual Learning for Image Recognition," 2015. https://arxiv.org/abs/1512.03385
[^7]: Hugging Face documentation on KV cache and `past_key_values`. https://huggingface.co/docs/transformers/main/cache_explanation
[^8]: Noam Shazeer, "Fast Transformer Decoding: One Write-Head is All You Need," 2019. https://arxiv.org/abs/1911.02150
[^9]: Joshua Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints," 2023. https://arxiv.org/abs/2305.13245
[^10]: Jared Kaplan et al., "Scaling Laws for Neural Language Models," 2020. https://arxiv.org/abs/2001.08361
[^11]: Biao Zhang and Rico Sennrich, "Root Mean Square Layer Normalization," 2019. https://arxiv.org/abs/1910.07467
[^12]: Jianlin Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding," 2021. https://arxiv.org/abs/2104.09864
[^13]: Tri Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness," 2022. https://arxiv.org/abs/2205.14135
> [!example] Notebook walkthroughs in this lecture
>
> Use this order:
>
> 1. [notebooks/decoder_block/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/decoder_block/lecture_walkthrough.ipynb)
> Use these sections as you read:
> - [`Decoder block preserves shape`](https://github.com/Montekkundan/llm/blob/main/notebooks/decoder_block/lecture_walkthrough.ipynb#decoder-block-preserves-shape)
> - [`Causal mask blocks future positions`](https://github.com/Montekkundan/llm/blob/main/notebooks/decoder_block/lecture_walkthrough.ipynb#causal-mask-blocks-future-positions)
> - [`Compare with unmasked encoder-style attention`](https://github.com/Montekkundan/llm/blob/main/notebooks/decoder_block/lecture_walkthrough.ipynb#compare-with-unmasked-encoder-style-attention)
> - [`Decoder blocks drive autoregressive GPT models`](https://github.com/Montekkundan/llm/blob/main/notebooks/decoder_block/lecture_walkthrough.ipynb#decoder-blocks-drive-autoregressive-gpt-models)
> 2. [picollm/accelerated/gpt.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/gpt.py)
> Use it to compare the canonical block against the accelerated production block.
>
> <video src="https://assets.montek.dev/lectures/media/llm/concepts/Decoder%20Block/04_canonical_vs_serious_block.mp4" controls></video>
> [!tip] TensorTonic practice for this lecture
>
> If you want to practice this lecture in a more implementation-focused format, work through these TensorTonic exercises:
>
> - [TensorTonic: GPT-2 Causal Attention](https://www.tensortonic.com/research/gpt2/gpt2-causal-attention)
> - [TensorTonic: GPT-2 Decoder Block](https://www.tensortonic.com/research/gpt2/gpt2-decoder-block)
> - [TensorTonic: GPT-2 Init Scaling](https://www.tensortonic.com/research/gpt2/gpt2-init-scaling)
>
> They are good follow-ups because they isolate the three parts that make a GPT-style block behave the way it does:
>
> - masking future positions inside self-attention
> - composing the full residual block from its sublayers
> - seeing why initialization and residual scaling matter for stable deep decoding stacks
<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
<div><a class="internal-link" data-href="Encoder Block" href="Encoder%20Block">Encoder Block</a></div>
</div>
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
<div><a class="internal-link" data-href="Causal Language Modeling" href="Causal%20Language%20Modeling">Causal Language Modeling</a></div>
</div>
</div>