> [!info] Course code
> Use the companion repository for runnable notebooks, figures, and implementation references for this lecture:
> - Theory notebook: [notebooks/scaled_dot_product_attention/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/scaled_dot_product_attention/lecture_walkthrough.ipynb)
> - Serious model anchor: [picollm/accelerated/gpt.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/gpt.py)
> - Kernel/runtime anchor: [picollm/accelerated/flash_attention.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/flash_attention.py)

## What This Concept Is

Suppose a model is reading the word `bank` in a sentence. To understand it, the model may need to look back at nearby words and decide whether this means a river bank or a financial bank. Scaled dot-product attention is the operator that scores those possible connections and mixes the useful information back into the current position.

So instead of thinking of attention as magic, think of it as a structured way for one token to ask, "which other tokens matter for me right now, and how much?"

## Foundation Terms You Need First

Use the question-answer analogy carefully. A **query** is what the current position is asking for. A **key** is how another position advertises what it can offer. A **value** is the information that actually gets returned if that position is attended to. The model compares queries against keys, turns the scores into weights with **[[Glossary#Softmax|softmax]]**, and uses those weights to mix the values.

That means attention has two stages you should not blur together: first scoring, then mixing. The score says how relevant another position looks. The value is what actually flows into the output.

```mermaid
flowchart TD
    A["Queries Q"] --> D["QK^T / sqrt(d_k)"]
    B["Keys K"] --> D
    D --> E["Apply causal or padding mask"]
    E --> F["Softmax attention weights"]
    F --> G["Weighted sum with V"]
    C["Values V"] --> G
    G --> H["Context vectors"]
```

## How the clean formula becomes a systems operator

In the notebook, we teach attention in its clean mathematical form:

$\operatorname{softmax}(QK^\top / \sqrt{d_k})V$

That is exactly the right place to start. But by the time you reach `picollm/accelerated/gpt.py`, you should see how the same operator gets wrapped in stricter systems constraints:

- queries, keys, and values are projected separately
- RoPE is applied to `q` and `k`
- QK normalization is applied for scale hygiene
- grouped-query attention reduces KV-cache cost
- `picollm/accelerated/flash_attention.py` chooses FlashAttention 3 on supported Hopper GPUs and falls back to PyTorch SDPA otherwise

The key point is:

> production attention is still the same mathematics, but it lives inside a much more demanding memory-and-[[Glossary#Throughput|throughput]] environment

That is why you should first master the equation, then inspect the kernel-aware version in `picollm`.

## How this lecture fits the course-to-picoLLM map

This lecture is a good place to reinforce the course reading rule:

- learn the clean operator in the notebook first
- inspect the accelerated implementation in `picollm/accelerated/gpt.py`
- inspect the kernel and hardware path in `picollm/accelerated/flash_attention.py`
- use `rasbt/LLMs-from-scratch` as the concept-first comparison
- use `nanochat` as the systems-first comparison

That prevents you from confusing the mathematical definition of attention with the optimized code that runs during the accelerated build.
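To make the question-answer analogy from the Foundation Terms section concrete, here is a minimal single-query sketch. It is an illustrative toy, not code from the course repository; the vector sizes are arbitrary choices.

```python
import torch
import torch.nn.functional as F

d_k = 4
q = torch.randn(d_k)           # query: what the current position is asking for
K = torch.randn(3, d_k)        # keys: how three other positions advertise themselves
V = torch.randn(3, 8)          # values: the information each position would return

scores = K @ q / d_k ** 0.5          # stage 1: score every key against the query
weights = F.softmax(scores, dim=-1)  # turn scores into a probability distribution
output = weights @ V                 # stage 2: mix the values with those weights

print(weights)        # one probability per position, summing to 1
print(output.shape)   # torch.Size([8])
```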
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Scaled%20Dot-Product%20Attention/01_attention_as_soft_retrieval.mp4" controls></video> This note covers three connected ideas: - the scaled dot-product attention derivation: the score matrix, the softmax normalization, the output aggregation, and why the $1 / \sqrt{d_k}$ factor matters for stable training[^1] - deeper interpretations of attention: differentiable retrieval, normalized exponential kernel smoothing, and the design space of scoring functions[^3] - the engineering reality around the clean formula: softmax saturation, mixed-precision stability, masking pitfalls, quadratic memory cost, FlashAttention, and the main approximation families for long sequences[^4] ## Why scaled dot-product attention exists ### Attention as differentiable retrieval The cleanest way to introduce attention is not as “magic weighting,” but as differentiable retrieval. - queries ask - keys advertise - values contain what gets returned Vaswani et al. define attention as a mapping from a query and a set of key-value pairs to an output, where the output is a weighted sum of values and the weights come from a compatibility function between the query and each key.[^1] This formulation matters because it shows that scaled dot-product attention is not “attention in general.” It is one specific choice inside a broader family of alignment-and-aggregation mechanisms. ### Dot-product attention is the hardware-friendly choice Before Transformers, additive attention was the dominant alignment model in neural machine translation. Bahdanau attention computes a score with a small neural network, then applies softmax over those scores.[^6] Luong et al. compare multiple score functions, including dot, general bilinear, and additive/concat scoring.[^7] The Transformer’s dot-product choice is important not because it is the only reasonable similarity function, but because it maps cleanly to dense matrix multiplication. That makes it fast on modern accelerators, where GEMM kernels are extremely optimized.[^1] ## The core formulation to internalize ### Matrix form Let: - $Q \in \mathbb{R}^{L_q \times d_k}$ be the query matrix - $K \in \mathbb{R}^{L_k \times d_k}$ be the key matrix - $V \in \mathbb{R}^{L_k \times d_v}$ be the value matrix Then scaled dot-product attention is: $ \operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V $ This is the equation to be able to write from memory.[^1] ### What each step means The computation happens in three conceptual stages: 1. Compute compatibility scores: $S = QK^\top$ 2. Normalize row-wise with softmax: $P = \operatorname{softmax}(S / \sqrt{d_k})$ 3. Aggregate values: $O = PV$ Each row of `P` is an attention distribution for one query over all keys. Each row of `O` is therefore a weighted combination of value vectors. > [!question] Quick check > What does the row-wise softmax mean in attention? >> [!answer] each query gets its own probability distribution over keys, so every row of the score matrix is normalized separately. ### Masking is part of the definition in practice In autoregressive Transformers, causal masking blocks access to future positions by setting forbidden logits to $-\infty$ before softmax.[^1] Padding masks and structural masks work the same way conceptually. That means the production formula is really: $ P = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right) $ where $M$ contains $0$ for allowed connections and $-\infty$ for disallowed ones. 
> [!example] Code for this section
> - Notebook: [notebooks/scaled_dot_product_attention/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/scaled_dot_product_attention/lecture_walkthrough.ipynb)

> [!tip] TensorTonic follow-up
> - [TensorTonic: Transformers Attention](https://www.tensortonic.com/research/transformer/transformers-attention)
> Use it here to practice the exact score-matrix, softmax, and weighted-sum path from this section.

## Why divide by the square root of d_k?

### The variance argument

This is one of the highest-value derivations in the lecture.

Assume the components of query and key vectors are independent, mean-zero, variance-one random variables. Then the raw dot product

$$
q \cdot k = \sum_i q_i k_i
$$

has variance proportional to $d_k$, because it sums $d_k$ terms.[^1]

So as head dimension grows:

- raw logits get larger in magnitude
- softmax becomes sharper and more saturated
- gradients through softmax become very small

Dividing by $\sqrt{d_k}$ normalizes the scale of logits so their variance remains approximately dimension-independent.[^1]

### The gradient-health interpretation

Vaswani et al. explicitly motivate scaling as a way to avoid pushing softmax into saturated regimes with tiny gradients.[^1]

This is the clean conceptual line: $1 / \sqrt{d_k}$ is not cosmetic. It is a default [[Glossary#Temperature|temperature]] chosen to keep the softmax trainable. People often remember the variance derivation, but the real operational reason is gradient health.

> [!question] Quick check
> If subtracting the max stabilizes softmax numerically, why do we still need scaling by $1 / \sqrt{d_k}$?
>> [!answer] Subtracting the max prevents overflow, but it does not stop the distribution from becoming too sharp as dot-product magnitudes grow.

<video src="https://assets.montek.dev/lectures/media/llm/concepts/Scaled%20Dot-Product%20Attention/02_why_scale_by_sqrt_dk.mp4" controls></video>

## Softmax, temperature, and saturation

### Attention is a probability distribution over keys

Once the score matrix is passed through a row-wise softmax, each query gets a probability distribution over keys. That means attention sharpness is governed by the same mechanism as any softmax temperature system.

If we define

$$
\operatorname{softmax}(z / \tau)
$$

then:

- smaller $\tau$ makes the distribution sharper
- larger $\tau$ makes it flatter

Scaled dot-product attention implicitly sets $\tau$ proportional to $\sqrt{d_k}$.

### Why saturation hurts learning

The softmax Jacobian tells the story:

$$
\frac{\partial p_i}{\partial z_j} = p_i (\delta_{ij} - p_j)
$$

When one attention weight becomes very close to `1` and the others become very close to `0`, most Jacobian entries become tiny. That means the keys that lose the competition receive almost no gradient signal.

This produces the standard training pathology:

- too sharp too early: brittle, winner-take-all attention
- too flat for too long: slow learning and weak selection

This is one of the reasons attention often benefits from careful initialization, scaling, and stable kernels even when the formula looks simple.
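Both the variance argument and the saturation effect are easy to verify numerically. The following sketch uses illustrative sample sizes and head dimensions of my own choosing (it is not taken from the course notebook): it estimates the variance of raw versus scaled dot products as $d_k$ grows, then shows how the unscaled softmax collapses onto a single key.

```python
import torch

torch.manual_seed(0)

# Variance of raw vs. scaled dot products of i.i.d. unit-variance vectors
for d_k in (16, 64, 256, 1024):
    q = torch.randn(10_000, d_k)
    k = torch.randn(10_000, d_k)
    raw = (q * k).sum(dim=-1)          # variance grows roughly like d_k
    scaled = raw / d_k ** 0.5          # variance stays near 1
    print(f"d_k={d_k:5d}  var(raw)={raw.var().item():8.1f}  var(scaled)={scaled.var().item():.2f}")

# Sharpness: one query scored against 64 keys, with and without scaling
d_k = 1024
scores = torch.randn(64, d_k) @ torch.randn(d_k)
for name, z in (("unscaled", scores), ("scaled", scores / d_k ** 0.5)):
    p = torch.softmax(z, dim=-1)
    print(f"{name}: largest attention weight = {p.max().item():.3f}")
```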
## A deeper interpretation: attention as a kernel smoother

### The kernel view

A useful interpretation is: softmax attention is normalized exponential kernel regression. The query-key score determines a kernel similarity, softmax normalizes the similarities, and the value vectors are averaged under that normalized kernel.[^3]

This helps you understand why so many efficient attention methods look like "kernel tricks." They are often trying to preserve the kernel-smoothing behavior while avoiding the explicit $L \times L$ attention matrix.

### Why this view matters

The kernel perspective clarifies several things at once:

- why dot-product similarity is only one choice among many
- why approximations can target the attention kernel rather than the full matrix
- why linear-attention methods often rely on feature maps and associativity

Performer and linear-attention methods make this perspective computationally explicit.[^8][^9]

## Classical attention variants and scoring functions

### Additive attention

Bahdanau attention uses a learned MLP scoring function rather than a raw dot product.[^6]

This is historically important because it shows that the essential object is not the dot product itself, but the general pattern:

- compute compatibility
- normalize
- aggregate

### Multiplicative and bilinear attention

Luong et al. present dot attention, general bilinear attention, and concat/additive variants inside one unified framework.[^7] This matters because it prevents you from over-identifying attention with one specific formula.

### Cosine attention

Cosine attention normalizes query and key vectors before scoring them. This can stabilize similarity scales, and modern work uses cosine-style constructions in some linear-attention variants.[^10]

The high-level point is: choosing a scoring function means choosing an inductive bias about similarity.

## Why attention becomes expensive

### Quadratic compute and memory

The original Transformer analysis makes the central cost explicit: [[Glossary#Self-attention|self-attention]] scales quadratically with sequence length because it compares every query with every key.[^1]

For sequence length $L$, the score matrix is $L \times L$. That means:

- quadratic compute for score formation
- quadratic memory whenever the score and probability matrices are materialized, or when an implementation effectively behaves as though they were
- increasing pressure on bandwidth during forward and backward passes

This is the practical reason attention becomes the bottleneck at long context lengths.

### The real bottleneck is often IO, not just FLOPs

A key modern systems lesson is that naive attention is often memory-bound, not simply arithmetic-bound. Materializing the full score matrix and probability matrix creates heavy memory traffic. FlashAttention shows that exact attention can be much faster simply by respecting GPU memory hierarchy better.[^4]

This is an excellent point to reset intuition: "efficient attention" does not always mean "approximate attention." Sometimes it means "same math, better algorithmic scheduling."

## Efficient attention families

### Sparse attention

Sparse attention methods reduce cost by restricting which query-key pairs are evaluated.

Examples:

- Sparse Transformer uses structured sparsity patterns.[^11]
- Longformer combines sliding-window local attention with global tokens.[^12]
- BigBird mixes local, random, and global connections and provides strong theoretical arguments about expressivity.[^13]

The tradeoff is clear:

- cheaper computation
- stronger structural inductive bias
- possible [[Glossary#Loss|loss]] of unrestricted token-to-token interaction
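To see what "restricting which query-key pairs are evaluated" means mechanically, here is a toy dense simulation of a sliding-window pattern in the spirit of Longformer's local attention. The window size and shapes are arbitrary illustrative choices, and a real sparse implementation would never materialize the full masked matrix.

```python
import torch

def local_window_mask(L, window):
    """Additive mask allowing position i to attend only to positions j with |i - j| <= window."""
    idx = torch.arange(L)
    allowed = (idx[None, :] - idx[:, None]).abs() <= window
    mask = torch.zeros(L, L)
    mask.masked_fill_(~allowed, float("-inf"))
    return mask

L, d_k = 8, 16
Q, K, V = torch.randn(L, d_k), torch.randn(L, d_k), torch.randn(L, d_k)

S = Q @ K.T / d_k ** 0.5 + local_window_mask(L, window=2)
P = torch.softmax(S, dim=-1)
print((P > 0).int())   # banded pattern: each row only mixes a local neighborhood
```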
### Low-rank approximations

Low-rank methods assume the attention matrix can be approximated compactly.

Examples:

- Linformer projects keys and values into lower-dimensional spaces.[^14]
- Nyströmformer uses landmark-based approximation of the attention matrix.[^15]

These methods trade exact fine-grained interactions for lower memory and time complexity.

### Kernelized or linearized attention

Linear-attention methods reinterpret attention through feature maps so associativity can be used to avoid explicitly forming the full attention matrix.

Examples:

- Katharopoulos et al. derive linear attention with kernel feature maps.[^9]
- Performer approximates the softmax kernel using random features in FAVOR+.[^8]

This is one of the most elegant places where the kernel-smoother interpretation becomes operational rather than philosophical.

### Exact fused attention

FlashAttention is the canonical example of exact efficient attention.[^4] Instead of approximating the attention result, it changes how the computation is scheduled:

- tile the computation
- do stable softmax incrementally
- avoid storing the full attention matrix in high-bandwidth memory
- recompute strategically in the backward pass

FlashAttention-2 pushes the same idea further with improved parallelism and work partitioning.[^16]

<video src="https://assets.montek.dev/lectures/media/llm/concepts/Scaled%20Dot-Product%20Attention/03_attention_cost_and_efficiency.mp4" controls></video>

## Numerical stability and implementation realities

### Stable softmax is mandatory

Softmax is numerically dangerous because $\exp(z)$ can overflow or underflow. The standard fix is to subtract the row maximum before exponentiating:

$$
\operatorname{softmax}(z)_i = \frac{\exp(z_i - \max_k z_k)}{\sum_j \exp(z_j - \max_k z_k)}
$$

This prevents overflow in practice and is the standard stable formulation.[^17]

This is a crucial distinction to teach:

- subtract-max fixes numerical overflow
- scaling by $\sqrt{d_k}$ fixes training-scale saturation

It is easy to conflate those two problems.

### Mixed precision makes the issue more serious

In fp16, exponentials can overflow much more easily than in fp32, and bfloat16's reduced mantissa makes accumulation less precise. That is why mixed-precision attention implementations should:

- accumulate normalization statistics in higher precision when possible
- use stable softmax
- prefer fused kernels designed for low-precision safety[^4][^17]

### Masking bugs are common and worth stating explicitly

A practical point: boolean mask semantics differ across frameworks and APIs. PyTorch's fused `scaled_dot_product_attention` has its own conventions and requires care when migrating from older `MultiheadAttention` mask code.[^18]

A safe implementation pattern is:

- keep one internal representation of the mask
- convert it once to an additive mask with $0$ and $-\infty$

This avoids an entire class of avoidable bugs.

### Dropout placement matters

Attention dropout is usually applied after softmax and before the weighted sum with values. In modern fused APIs, dropout semantics may surprise you if you assume evaluation mode turns it off automatically. PyTorch explicitly documents that you should pass `dropout_p = 0.0` yourself in eval mode for `scaled_dot_product_attention`.[^18]

> [!question] Quick check
> Why can attention outputs still look stochastic in eval mode if you call a fused attention API incorrectly?
>> [!answer] Because some APIs require you to pass `dropout_p = 0.0` explicitly in eval mode instead of assuming the framework will disable it for you.
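The sketch below ties the last three subsections together: a single boolean mask is converted once to an additive $0 / -\infty$ mask, softmax is computed with max subtraction, and the result is compared against PyTorch's fused `scaled_dot_product_attention` with `dropout_p=0.0` passed explicitly. It is a minimal illustration of the pattern under assumed shapes, not the picollm code path.

```python
import math
import torch
import torch.nn.functional as F

def to_additive_mask(allowed):
    """Convert a boolean mask (True = may attend) into a 0 / -inf additive mask."""
    return torch.zeros(allowed.shape).masked_fill(~allowed, float("-inf"))

def stable_masked_attention(Q, K, V, allowed):
    S = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1)) + to_additive_mask(allowed)
    S = S - S.max(dim=-1, keepdim=True).values   # subtract the row max before exponentiating
    P = torch.exp(S)
    P = P / P.sum(dim=-1, keepdim=True)
    return P @ V

L, d = 6, 32
Q, K, V = torch.randn(1, L, d), torch.randn(1, L, d), torch.randn(1, L, d)
causal = torch.tril(torch.ones(L, L, dtype=torch.bool))   # one internal mask representation

manual = stable_masked_attention(Q, K, V, causal)
# dropout_p is passed explicitly, as the PyTorch docs advise for eval mode
fused = F.scaled_dot_product_attention(Q, K, V, attn_mask=causal, dropout_p=0.0)
print((manual - fused).abs().max())   # tiny difference: same math up to kernel rounding
```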
## Training dynamics and inductive bias

### Multi-head attention changes the statistics

One underappreciated point is that multi-head attention changes logit statistics because each head uses a smaller $d_k$. That partly helps keep dot products in a manageable scale regime, while also letting different heads specialize to different subspaces.[^1]

This leads to a useful contrast:

- one huge head: larger raw inner-product scale, fewer subspaces
- multiple heads: smaller head dimension, more representational diversity

That is not the only reason multi-head attention works, but it is one piece of the story you can reason about directly.

### Efficient attention methods also change inductive bias

Every efficiency trick changes not just compute, but what the model prefers:

- sparse patterns encourage locality or structured routing
- low-rank approximations favor compressible attention structure
- kernelized methods may blur extremely sharp peaks unless approximation quality is high
- exact fused kernels preserve the original inductive bias because they preserve the exact result

This is why method choice should depend on task structure, not only on [[Glossary#Benchmark|benchmark]] throughput.

> [!example] Notebook walkthroughs in this lecture
>
> If you want to study this note in code, use these notebook sections. If the viewer ignores the fragment, search for the exact heading text in the notebook:
>
> - [`Compute score matrix, softmax, and weighted sum`](https://github.com/Montekkundan/llm/blob/main/notebooks/scaled_dot_product_attention/lecture_walkthrough.ipynb#compute-score-matrix-softmax-and-weighted-sum)
> - [`Masking blocks illegal positions`](https://github.com/Montekkundan/llm/blob/main/notebooks/scaled_dot_product_attention/lecture_walkthrough.ipynb#masking-blocks-illegal-positions)
> - [`Why scaling matters`](https://github.com/Montekkundan/llm/blob/main/notebooks/scaled_dot_product_attention/lecture_walkthrough.ipynb#why-scaling-matters)
> - [`Softmax temperature and saturation`](https://github.com/Montekkundan/llm/blob/main/notebooks/scaled_dot_product_attention/lecture_walkthrough.ipynb#softmax-temperature-and-saturation)
> - [`Exact PyTorch SDPA matches the manual formula`](https://github.com/Montekkundan/llm/blob/main/notebooks/scaled_dot_product_attention/lecture_walkthrough.ipynb#exact-pytorch-sdpa-matches-the-manual-formula)
>
> A useful study order is:
>
> 1. derive the score matrix and row-wise softmax
> 2. add masking and stable softmax
> 3. compare the manual computation with PyTorch SDPA
> 4. then connect the exact operator to fused kernels and long-sequence engineering

<video src="https://assets.montek.dev/lectures/media/llm/concepts/Scaled%20Dot-Product%20Attention/04_stable_softmax_and_masking.mp4" controls></video>

> [!tip] TensorTonic follow-up
> - [TensorTonic: GPT-2 Attention](https://www.tensortonic.com/research/gpt2/gpt2-attention)
> Work through it here to connect the general attention operator to the causal GPT-style attention path.
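As a closing numerical check on the multi-head statistics point above, the sketch below compares the standard deviation of raw (unscaled) dot-product logits for one wide head against one of several narrow heads. The dimensions are hypothetical and chosen only to make the $\sqrt{d_k}$ scale difference visible; none of this comes from the course repository.

```python
import torch

torch.manual_seed(0)
d_model, n_heads, n_samples = 512, 8, 20_000
d_head = d_model // n_heads

q_big, k_big = torch.randn(n_samples, d_model), torch.randn(n_samples, d_model)
q_small, k_small = torch.randn(n_samples, d_head), torch.randn(n_samples, d_head)

# standard deviation of raw (unscaled) dot-product logits
print("one wide head  :", (q_big * k_big).sum(-1).std().item())      # ~ sqrt(512) ≈ 22.6
print("one narrow head:", (q_small * k_small).sum(-1).std().item())  # ~ sqrt(64) = 8
```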
> [!tip] TensorTonic practice for this lecture
>
> If you want to practice this lecture in a more implementation-focused format, work through these TensorTonic exercises:
>
> - [TensorTonic: Transformers Attention](https://www.tensortonic.com/research/transformer/transformers-attention)
> - [TensorTonic: GPT-2 Attention](https://www.tensortonic.com/research/gpt2/gpt2-attention)
>
> They are good follow-ups because they force you to compute the core attention path directly:
>
> - building query, key, and value projections
> - forming scaled attention scores
> - applying softmax over the correct axis
> - comparing the generic mechanism with a GPT-style implementation

<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
    <div><a class="internal-link" data-href="Positional Encoding" href="Positional%20Encoding">Positional Encoding</a></div>
  </div>
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
    <div><a class="internal-link" data-href="Multi-head Attention" href="Multi-head%20Attention">Multi-head Attention</a></div>
  </div>
</div>

### References

[^1]: Ashish Vaswani et al., "Attention Is All You Need," 2017. https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf
[^3]: Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, et al., "Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel," 2019. https://arxiv.org/abs/1908.11775
[^4]: Tri Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness," 2022. https://openreview.net/forum?id=H4DqfPSibmx
[^6]: Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate," 2014. https://arxiv.org/abs/1409.0473
[^7]: Minh-Thang Luong, Hieu Pham, and Christopher D. Manning, "Effective Approaches to Attention-based Neural Machine Translation," 2015. https://aclanthology.org/D15-1166.pdf
[^8]: Krzysztof Choromanski et al., "Rethinking Attention with Performers," 2020. https://arxiv.org/abs/2009.14794
[^9]: Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Francois Fleuret, "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention," 2020. https://arxiv.org/abs/2006.16236
[^10]: Zhen Qin, Weixuan Sun, Hui Deng, et al., "cosFormer: Rethinking Softmax in Attention," 2022. https://arxiv.org/abs/2202.08791
[^11]: Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever, "Generating Long Sequences with Sparse Transformers," 2019. https://arxiv.org/abs/1904.10509
[^12]: Iz Beltagy, Matthew E. Peters, and Arman Cohan, "Longformer: The Long-Document Transformer," 2020. https://arxiv.org/abs/2004.05150
[^13]: Manzil Zaheer et al., "Big Bird: Transformers for Longer Sequences," 2020. https://arxiv.org/abs/2007.14062
[^14]: Sinong Wang, Belinda Z. Li, Madian Khabsa, et al., "Linformer: Self-Attention with Linear Complexity," 2020. https://arxiv.org/abs/2006.04768
[^15]: Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, et al., "Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention," 2021. https://arxiv.org/abs/2102.03902
[^16]: Tri Dao, "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning," 2023. https://arxiv.org/abs/2307.08691
[^17]: Pierre Blanchard, Desmond J. Higham, and Nicholas J. Higham, "Accurate Computation of the Log-Sum-Exp and Softmax Functions," 2019. https://arxiv.org/abs/1909.03469
[^18]: PyTorch, "torch.nn.functional.scaled_dot_product_attention," 2025. https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
[^19]: Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, "Distilling the Knowledge in a Neural Network," 2015. https://arxiv.org/abs/1503.02531