> [!info] Course code
> Use the companion repository for runnable notebooks, figures, and implementation references for this lecture:
> - Theory notebook: [notebooks/positional_encoding/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/positional_encoding/lecture_walkthrough.ipynb)
> - Serious model anchor: [picollm/accelerated/gpt.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/gpt.py)
## What This Concept Is
Look at the pair `dog bites man` and `man bites dog`. They use the same words, but they do not mean the same thing. A plain attention mechanism only compares content vectors, so by itself it has no built-in sense of left-to-right order. Positional encoding is the extra signal that tells the model where each token sits in the sequence.
This note is really about one question: once token identity is known, how does the model keep track of order?
## Foundation Terms You Need First
Start with the simplest picture. A token already has an embedding that says what it is. What is missing is a **position signal** that says where it is. Some schemes add that signal directly to the token embedding. Other schemes change the attention score so relative distance matters. Either way, the goal is the same: let the model tell first, second, and last apart.
As you read, keep the distinction clear between **token identity** and **sequence order**. This note is about the second one. The model already knows which token it is looking at; positional methods tell it where that token appears.
```mermaid
flowchart TD
A["'dog bites man'"]
B["'man bites dog'"]
C["Same tokens, different order"]
D["Sinusoidal: add absolute position vectors"]
E["Relative bias: modify attention scores by distance"]
F["RoPE: rotate queries and keys by position"]
A --> C
B --> C
C --> D
C --> E
C --> F
```
## A first concrete picture of the design space
Before you think about any one implementation, keep the main families of options separate in your head:
- sinusoidal absolute encodings
- learned absolute tables
- relative bias methods
- rotary embeddings
- ALiBi-style biasing
The key difference between them is not branding. It is where position enters the computation. Some methods add position to the token representation at the input. Some change the attention score. Some transform queries and keys directly so relative offsets affect dot products.
That is the map you want before the note gets more formal.
## How this lecture fits the course-to-picoLLM map
You should read this note in three layers:
- the notebook introduces the clean family of positional methods
- `picollm/accelerated/gpt.py` shows the one method the accelerated run actually commits to
- external references are used only for comparison: `rasbt/LLMs-from-scratch` for concept clarity and `nanochat` for systems orientation
Once the family is clear, come back to the serious path. In `picollm/accelerated/gpt.py`, the actual implementation choice is rotary position [[Glossary#Embedding|embedding]], where queries and keys are rotated by `apply_rotary_emb(...)` using cached cosine and sine tables.
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Positional%20Encoding/01_order_needs_signal.mp4" controls></video>
This note covers three connected ideas:
- why positional information is mathematically necessary: without it, self-attention is permutation-equivariant, which sharply restricts what sequence functions the model can represent[^2]
- the major positional methods by where they enter the computation: additive absolute encodings, relative or bias-based attention modifications, and multiplicative schemes like RoPE[^3]
- the engineering tradeoffs: extrapolation to longer contexts, recency bias, clipping and bucketing effects, training stability, and why long-context scaling is often mostly a positional story[^3]
## Why positional information is needed at all
### Self-attention is order-agnostic without extra structure
A plain self-attention layer does not contain an intrinsic notion of “first,” “next,” or “before.” It receives a set of token representations, computes queries, keys, and values, and mixes them by pairwise similarity. If you permute the input tokens, the outputs are simply permuted in the same way. That is the core reason a Transformer without positional information behaves like a set-processing architecture rather than a sequence model.[^2]
This is the rigorous version of the intuition you may already have:
> A Transformer processes tokens in parallel, so order must be injected rather than assumed.
A sharper way to say it is this: positional encodings enlarge the model class. Without them, the model is restricted to permutation-equivariant sequence functions; with them, it can represent general order-sensitive functions.[^5]
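It helps to see that symmetry concretely before moving on. The sketch below is a minimal, hypothetical PyTorch snippet (not taken from the companion repo): it runs one unmasked attention layer on the same tokens in two different orders and checks that the outputs only differ by the same reordering.
```python
# Minimal check that attention without positions is permutation-equivariant.
# Hypothetical standalone sketch in PyTorch, not the companion repo's code.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 16
x = torch.randn(5, d)                              # 5 tokens, no positional signal
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attend(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / d ** 0.5
    return F.softmax(scores, dim=-1) @ v

perm = torch.randperm(5)
out = attend(x)
out_perm = attend(x[perm])

# Permuting the inputs just permutes the outputs: order carries no information.
print(torch.allclose(out[perm], out_perm, atol=1e-5))  # True
```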
> [!question] Quick check
> Can plain self-attention, without any positional signal, distinguish `dog bites man` from `man bites dog`?
>> [!answer] not reliably. Without positional information, self-attention is permutation-equivariant, so reordering the tokens simply reorders its outputs and the two sentences cannot be told apart.
### Position can leak in through other channels, but weakly
There is an important subtlety here. Even when you remove explicit positional embeddings, some order information can still leak through causal masks, boundary tokens, or architectural asymmetries. Recent work studies how positional information can emerge in causal Transformers even without explicit position encoding.[^6]
But that does not eliminate the main point. It just means “no explicit positional encoding” is not the same as “absolutely no positional signal anywhere.” This is a useful distinction because people often jump too quickly from “NoPE sometimes works” to “position never mattered.”
## A clean taxonomy: where does position enter the math?
The most useful way to teach positional encoding is not by chronology, but by insertion point.
There are three major places where position can enter:
- **Input-level additive methods**: add a position vector to the token embedding.
- **Attention-logit modifications**: bias attention scores using relative distance or buckets.
- **Query/key reparameterizations**: transform queries and keys so dot products depend on relative offsets.
This framing gives you a compact organizing principle:
Position is not one trick. It is a family of ways to break permutation symmetry.
> [!example] Code for this section
> - Notebook: [notebooks/positional_encoding/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/positional_encoding/lecture_walkthrough.ipynb)
> [!tip] TensorTonic follow-up
> - [TensorTonic: Transformers Positional Encoding](https://www.tensortonic.com/research/transformer/transformers-positional-encoding)
> Use it here to practice building the positional signal right after the math in this section.
## Absolute additive positional encodings
### Sinusoidal positional encoding
The original Transformer introduces a deterministic positional vector $p_i$ for each position $i$, and adds it to the token embedding $x_i$ before attention.[^1]
The standard formula is:
$$
PE(\operatorname{pos}, 2k) = \sin\left(\frac{\operatorname{pos}}{10000^{2k / d_{\text{model}}}}\right)
$$
$$
PE(\operatorname{pos}, 2k+1) = \cos\left(\frac{\operatorname{pos}}{10000^{2k / d_{\text{model}}}}\right)
$$
The paper’s stated motivations are worth presenting directly:
- the model has no recurrence or convolution, so order must be injected explicitly
- multiple frequencies provide both fast-varying and slow-varying positional components
- fixed offsets should correspond to simple transformations of the encoding, making relative reasoning easier[^1]
A strong board derivation is the paired sine/cosine identity:
$$
\cos(a-b) = \cos(a)\cos(b) + \sin(a)\sin(b)
$$
Once you see that, you understand why sinusoidal encodings are absolute in implementation but still support relative comparisons in spirit.
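To make the formula concrete, here is a minimal sketch of the table construction, assuming PyTorch and a hypothetical helper name (the notebook has its own version):
```python
# Build the sinusoidal table from the formulas above.
# `sinusoidal_table` is an illustrative name, not the notebook's exact code.
import torch

def sinusoidal_table(max_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    two_k = torch.arange(0, d_model, 2, dtype=torch.float32)        # even dims 2k
    inv_freq = 1.0 / (10000.0 ** (two_k / d_model))
    angles = pos * inv_freq                                         # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # PE(pos, 2k)
    pe[:, 1::2] = torch.cos(angles)   # PE(pos, 2k+1)
    return pe

pe = sinusoidal_table(max_len=512, d_model=64)
print(pe.shape)  # torch.Size([512, 64])
```
Plotting `pe` as a heatmap is the fastest way to see the fast-varying and slow-varying channels the paper describes.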
### Sinusoids are not magical
There is a very useful caveat to teach here. Sinusoids do not give you a perfect, infinite, collision-free position hash. They are periodic. In finite precision, large positions can become numerically awkward. In practice, they are a coordinate system, not a guarantee of exact positional uniqueness.
That matters because many extrapolation failures are really positional distribution-shift failures: the model is asked to operate on positional phases it did not experience during training.[^7]
> [!question] Quick check
> What should you notice in a sinusoidal heatmap?
>> [!answer] some channels vary quickly while others vary slowly, which is how the encoding mixes short-range and long-range positional frequencies.
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Positional%20Encoding/02_sinusoidal_encoding.mp4" controls></video>
### Learned absolute positional embeddings
Instead of a fixed sinusoidal table, you can learn a positional embedding table and add those vectors to token embeddings. The original Transformer reports that learned positional embeddings performed similarly to sinusoids on the translation tasks they studied, while preferring sinusoids partly for their hoped-for extrapolation behavior.[^1]
BERT is the canonical lecture example here. It constructs each input representation as the sum of token embeddings, segment embeddings, and learned position embeddings.[^8]
This makes a crucial limitation obvious:
- learned absolute embeddings are tied to the maximum trained length
- beyond that, you need an extension rule such as copying, interpolation, or reinitialization
- each extension rule changes the geometry the model learned during training
This is why learned absolute methods are easy to explain but often brittle under context extension.
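A minimal sketch makes the cap visible in code. Assuming PyTorch and illustrative module names (not the companion repo's classes):
```python
# Learned absolute positions: a trainable table added to token embeddings.
# The assert is the hard length cap discussed above.
import torch
import torch.nn as nn

class LearnedPositions(nn.Module):
    def __init__(self, vocab_size: int, max_len: int, d_model: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)   # only max_len rows exist
        self.max_len = max_len

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        T = idx.size(-1)
        assert T <= self.max_len, "no trained position embedding beyond max_len"
        positions = torch.arange(T, device=idx.device)
        return self.tok(idx) + self.pos(positions)  # BERT-style additive fusion
```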
### Axial and factorized variants
For very long sequences, a full learned position table can become expensive. Axial positional embeddings factor the position space into coarser and finer components, reducing parameter cost by decomposing the positional table into smaller parts.[^9]
This is a good place to remember that positional encoding is also a memory-design question, not only a representation-design question.
## Relative position methods: position enters attention itself
### Shaw-style relative position representations
Peter Shaw and coauthors introduced an extremely influential relative-position approach for self-attention.[^10] Instead of adding a position vector only at the input, they inject learned relative-distance vectors into attention computations.
The core idea is:
- let the attention from token `i` to token `j` depend on the relative offset `i - j`
- optionally inject relative information into both the key-side score computation and the value-side aggregation
- clip distances so large offsets share a learned representation
This clipping is important. It reduces parameter growth and can improve generalization because distances beyond a threshold are treated as equivalent.[^10]
The lecture-friendly summary is:
Absolute methods tell each token where it is. Relative methods tell each pair how far apart they are.
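The clipping step is easy to sketch. Assuming PyTorch and illustrative names (this is the idea from the paper, not its exact implementation):
```python
# Shaw-style clipped relative offsets: each (i, j) pair looks up a learned
# vector indexed by clip(j - i, -k, k); offsets beyond |k| share one entry.
import torch
import torch.nn as nn

max_rel = 4                                   # clipping distance k
rel_emb = nn.Embedding(2 * max_rel + 1, 16)   # one vector per clipped offset

T = 10
i = torch.arange(T).unsqueeze(1)              # query positions
j = torch.arange(T).unsqueeze(0)              # key positions
offsets = torch.clamp(j - i, -max_rel, max_rel) + max_rel   # shift into [0, 2k]

a = rel_emb(offsets)                          # (T, T, 16): one vector per pair
# In the paper these vectors enter the key-side score and, optionally,
# the value-side aggregation.
```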
### Transformer-XL and relative attention decomposition
Transformer-XL turns relative attention into a more structured decomposition that supports state reuse across segments without positional confusion.[^11]
Its formulation is especially useful because it separates multiple interaction types:
- content-content interaction
- content-position interaction
- global content bias
- global positional bias
This is a very good moment to slow down and say clearly that “relative position” is not a single formula. It is a design space: each formulation decides which content-position interactions are allowed.
### T5 bucketed relative bias
T5 uses a simpler and extremely practical version of relative position: a learned scalar bias added to the attention [[Glossary#Logits|logits]] based on a bucketed relative offset.[^2]
The design is elegant:
- near distances get finer resolution
- far distances are grouped logarithmically into coarser buckets
- beyond a threshold, the model becomes insensitive to exact distance within a single layer
This is one of the clearest examples of a deliberate bias-variance tradeoff in architecture design. Exact far-distance resolution is thrown away in exchange for parameter efficiency and smoother generalization.
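A simplified sketch of the bucketing idea, assuming PyTorch; this ignores direction and is not the exact T5 code, but it shows fine buckets nearby, logarithmic buckets far away, and a learned per-head scalar per bucket added to the logits:
```python
# Simplified T5-style relative bias: fine buckets for small offsets,
# logarithmic buckets for large ones, one learned scalar per bucket per head.
import math
import torch
import torch.nn as nn

def bucket(rel: torch.Tensor, num_buckets: int = 32, max_distance: int = 128) -> torch.Tensor:
    rel = rel.abs()                          # simplification: ignore direction
    exact = num_buckets // 2                 # small offsets keep exact resolution
    is_small = rel < exact
    log_bucket = exact + (
        torch.log(rel.float().clamp(min=1) / exact)
        / math.log(max_distance / exact)
        * (num_buckets - exact)
    ).long()
    log_bucket = log_bucket.clamp(max=num_buckets - 1)
    return torch.where(is_small, rel, log_bucket)

T, num_heads = 16, 4
rel = torch.arange(T).unsqueeze(0) - torch.arange(T).unsqueeze(1)   # j - i
bias_table = nn.Embedding(32, num_heads)
bias = bias_table(bucket(rel)).permute(2, 0, 1)   # (heads, T, T), added to logits
```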
> [!question] Quick check
> Why does T5 bucket nearby distances more finely than far-away distances?
>> [!answer] the model often needs exact resolution nearby, while coarse distance groups are usually enough for very large offsets.
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Positional%20Encoding/03_relative_position_methods.mp4" controls></video>
### Disentangled content and position
DeBERTa makes the separation between content and position explicit by using disentangled attention, where content and position vectors interact through separate pathways rather than being naively mixed from the start.[^13]
TUPE makes a related critique: adding word embeddings and positional embeddings directly can mix heterogeneous information too early and introduce noisy correlations.[^14]
This is a valuable point because many people implicitly assume “just add the vectors” is a neutral choice. It is not neutral. It is a modeling commitment about how content and position should be fused.
## Rotary and multiplicative methods
### RoPE: position as rotation
Rotary Position Embedding (RoPE) is one of the most important modern positional methods for LLMs.[^15] Instead of adding a position vector to the representation, it rotates the query and key vectors by a position-dependent angle.
The key design target is:
- encode absolute positions at the vector-transform level
- make the query-key dot product depend naturally on relative offset
- preserve vector norms under the transform
That combination is why RoPE is so attractive. It behaves like a relative method at the level that matters most, the attention score, while remaining simple to integrate into standard attention computation.[^15]
Two engineering properties matter in lecture:
- rotations preserve vector norms
- different frequency bands induce a distance-sensitive phase interaction, which gives the model a built-in prior over locality and offset relationships[^15]
This is one of the rare positional methods whose geometric intuition is actually worth presenting, because you can see position as angle rather than as an added coordinate.
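A minimal rotary sketch, assuming PyTorch; `rope` is an illustrative helper, not the repo's `apply_rotary_emb`, and it recomputes the angle tables that the accelerated path caches:
```python
# Rotate each (even, odd) channel pair of q or k by a position-dependent angle.
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    T, d = x.shape                                                   # d must be even
    pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)          # (T, 1)
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))   # (d/2,)
    angles = pos * inv_freq                                          # (T, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin      # 2-D rotation per channel pair
    out[:, 1::2] = x1 * sin + x2 * cos      # rotation preserves the pair's norm
    return out

q, k = torch.randn(8, 64), torch.randn(8, 64)
scores = rope(q) @ rope(k).T                # dot products now depend on offsets
```
Because each pair is rotated by `pos * inv_freq`, the dot product between a rotated query and a rotated key depends on the difference of their positions, which is exactly the relative behavior described above.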
### RoPE is powerful, but context extension is delicate
RoPE often extrapolates better than learned absolute tables, but naïve extrapolation to much longer contexts can still fail badly. The reason is not mysterious: the model is exposed to phase relationships far outside what it saw during training, and attention scores can become unstable.[^7]
This is why so much long-context engineering focuses on RoPE scaling:
- Position Interpolation rescales positions so longer contexts map back toward the pretrained range.[^7]
- YaRN extends context with additional scaling refinements and lower training cost.[^17]
- LongRoPE and related methods push the same family of ideas to very large context windows.[^18]
This is a major modern systems point:
Long-context scaling is often mostly a positional-scaling problem, not a whole-architecture redesign.
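The core Position Interpolation move is small enough to show inline. A sketch under the assumption of a RoPE-style angle computation, with illustrative lengths:
```python
# Position Interpolation: rescale position indices so a longer context maps
# back into the phase range the model saw during training (then fine-tune).
import torch

train_len, target_len = 2048, 8192
scale = train_len / target_len                     # interpolation factor

positions = torch.arange(target_len, dtype=torch.float32)
interpolated = positions * scale                   # max index shrinks to ~train_len
# Use `interpolated` instead of raw integer positions when computing RoPE angles,
# e.g. angles = interpolated.unsqueeze(1) * inv_freq.
```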
### ALiBi: position as a linear attention bias
ALiBi removes additive position embeddings entirely and instead adds a head-specific linear distance penalty directly to the attention logits.[^19]
Its inductive bias is easy to state:
- recent tokens are preferred
- different heads get different slope magnitudes
- the model gets a spectrum from highly local to less local attention behavior
ALiBi is famous because it extrapolates well in language modeling settings and is computationally cheap.[^19] But it also provides a great cautionary tale: strong [[Glossary#Perplexity|perplexity]] extrapolation does not automatically mean strong downstream length generalization on reasoning or algorithmic tasks.[^3]
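A minimal sketch of the bias, assuming PyTorch; the slope values here are illustrative, the paper derives its own geometric schedule from the head count:
```python
# ALiBi: add a per-head linear distance penalty to the causal attention logits.
import torch

T, num_heads = 8, 4
slopes = torch.tensor([2.0 ** -(i + 1) for i in range(num_heads)])  # (heads,)

i = torch.arange(T).unsqueeze(1)                   # query positions
j = torch.arange(T).unsqueeze(0)                   # key positions
distance = (i - j).clamp(min=0)                    # causal: only look left
bias = -slopes.view(num_heads, 1, 1) * distance    # (heads, T, T)

# logits = q @ k.transpose(-2, -1) / d ** 0.5 + bias   # then mask and softmax
```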
> [!question] Quick check
> What inductive bias does ALiBi add to attention?
>> [!answer] it adds a distance penalty that favors more recent tokens, with different heads using different slope strengths.
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Positional%20Encoding/04_rope_and_alibi.mp4" controls></video>
## Expressivity, inductive bias, and failure modes
### Positional encoding is symmetry breaking
The cleanest theoretical statement in the lecture is:
Without positional information, self-attention is permutation-equivariant.
With positional information, the model can approximate general order-sensitive sequence functions.[^5]
This is where you stop seeing positional encoding as an implementation detail and start seeing it as a group-symmetry design decision.
### Different schemes impose different priors
Each positional method creates a characteristic attention geometry:
- learned absolute embeddings favor absolute anchoring and fixed layouts
- Shaw-style relative methods emphasize distance and translation equivariance
- T5 buckets favor coarse distance sensitivity
- ALiBi creates explicit recency bias
- RoPE encodes relative behavior through rotation and frequency mixing
That is why there is no universally best positional encoding. The right choice depends on what biases the task actually rewards.
### Important failure modes
It is worth naming the common failure modes directly:
- **Hard length cap**: learned absolute embeddings stop at the trained maximum length.[^8]
- **Saturation**: bucketed or clipped relative methods lose exact far-distance resolution.[^10][^2]
- **Recency bias**: ALiBi may over-favor local context on tasks that require truly global dependence.[^20]
- **Phase distribution shift**: RoPE can destabilize when extrapolated too far without rescaling.[^7]
- **Mixed-correlation noise**: additive absolute methods can entangle content and position too early.[^14]
## Training dynamics and scaling behavior
### Positional signals need sane scale
The original Transformer multiplies token embeddings by $\sqrt{d_{\text{model}}}$ and adds positional encodings to that result.[^1] This matters because positional information is not useful if it is numerically drowned out, and not stable if it dominates the content signal.
That connects directly to the previous embedding lecture:
- token identity and position must live on comparable scales early in training
- the sum of embeddings and positional signals is part of the optimization problem, not just an input pre-processing step
The original Transformer also applies dropout to the sum of token embeddings and positional encodings, another under-taught detail that affects how the model learns to balance content and position.[^1]
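A sketch of that input path, assuming the additive sinusoidal setup and PyTorch (illustrative names, not the repo's code):
```python
# Scale token embeddings by sqrt(d_model), add positions, then apply dropout.
import math
import torch
import torch.nn as nn

d_model, vocab_size, max_len = 64, 1000, 512
tok_emb = nn.Embedding(vocab_size, d_model)
drop = nn.Dropout(p=0.1)

# Sinusoidal table (same construction as earlier in this note).
pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, d_model, 2).float() / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2], pe[:, 1::2] = torch.sin(pos * inv_freq), torch.cos(pos * inv_freq)

idx = torch.randint(0, vocab_size, (4, 128))       # (batch, T)
x = tok_emb(idx) * math.sqrt(d_model)              # keep token identity audible
x = drop(x + pe[: idx.size(1)])                    # add positions, then dropout
```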
### Compute constraints shape position learning
BERT gives a concrete example of this. It pretrains mostly on shorter sequences and only later uses longer sequences, explicitly because long self-attention is expensive.[^8]
That means positional learning is not only about architecture. It is also about what sequence lengths the optimizer actually sees during training.
> [!question] Quick check
> If a model trains mostly on sequences of length 128, why should you be cautious about behavior at position 500?
>> [!answer] because the model may be operating on positional regimes it barely saw during training, so extrapolation can fail.
### Perplexity extrapolation is not the same as systematic length generalization
This is one of the most important modern distinctions.
Some methods, especially ALiBi and RoPE variants, can extrapolate well in language-model perplexity. But downstream tasks that require systematic reasoning over much longer sequences may still fail.[^3]
That is because the inductive bias that preserves next-token prediction quality is not always the same inductive bias needed for compositional or algorithmic reasoning.
This is a high-value correction to a common mistake:
Better long-context perplexity does not imply better long-context reasoning.
> [!example] Notebook walkthroughs in this lecture
>
> If you want to study this note in code, use these notebook sections. If your notebook viewer ignores the URL fragment, search for the exact heading text in the notebook instead:
>
> - [`Self-attention is permutation-equivariant without position`](https://github.com/Montekkundan/llm/blob/main/notebooks/positional_encoding/lecture_walkthrough.ipynb#self-attention-is-permutation-equivariant-without-position)
> - [`Sinusoidal absolute positions`](https://github.com/Montekkundan/llm/blob/main/notebooks/positional_encoding/lecture_walkthrough.ipynb#sinusoidal-absolute-positions)
> - [`Learned absolute positional embeddings`](https://github.com/Montekkundan/llm/blob/main/notebooks/positional_encoding/lecture_walkthrough.ipynb#learned-absolute-positional-embeddings)
> - [`Relative distance bias`](https://github.com/Montekkundan/llm/blob/main/notebooks/positional_encoding/lecture_walkthrough.ipynb#relative-distance-bias)
> - [`Rotary position embeddings`](https://github.com/Montekkundan/llm/blob/main/notebooks/positional_encoding/lecture_walkthrough.ipynb#rotary-position-embeddings)
>
> A useful study order is:
>
> 1. verify the permutation-symmetry problem first
> 2. inspect sinusoidal and learned absolute methods side by side
> 3. compare relative bias and rotary methods at the attention level
> 4. then connect those choices to context extension and long-range behavior
>
> <video src="https://assets.montek.dev/lectures/media/llm/concepts/Positional%20Encoding/05_positional_symmetry_and_failures.mp4" controls></video>
> [!tip] TensorTonic practice for this lecture
>
> If you want to practice this lecture in a more implementation-focused format, work through the TensorTonic positional encoding exercise:
>
> - [TensorTonic: Transformers Positional Encoding](https://www.tensortonic.com/research/transformer/transformers-positional-encoding)
>
> It is a good follow-up because it forces you to implement the core mechanics directly:
>
> - building the sinusoidal encoding matrix
> - separating sine and cosine dimensions correctly
> - understanding how position and frequency interact
> - returning a reusable positional table that can be added to embeddings
<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
<div><a class="internal-link" data-href="Embedding Layer" href="Embedding%20Layer">Embedding Layer</a></div>
</div>
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
<div><a class="internal-link" data-href="Scaled Dot-Product Attention" href="Scaled%20Dot-Product%20Attention">Scaled Dot-Product Attention</a></div>
</div>
</div>
### References
[^1]: Ashish Vaswani et al., "Attention Is All You Need," 2017. https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf
[^2]: Philipp Dufter, Martin Schmitt, and Hinrich Schütze, "Position Information in Transformers: An Overview," 2021. https://arxiv.org/abs/2102.11090
[^3]: Amirhossein Kazemnejad et al., "The Impact of Positional Encoding on Length Generalization in Transformers," 2023. https://arxiv.org/abs/2305.19466
[^5]: Chulhee Yun et al., "Are Transformers universal approximators of sequence-to-sequence functions?," 2019. https://arxiv.org/abs/1912.10077
[^6]: Adi Haviv, Ori Ram, and Ofir Press, "Transformer Language Models without Positional Encodings Still Learn Positional Information," 2022. https://arxiv.org/abs/2203.16634
[^7]: Shouyuan Chen, Sherman Wong, and Liangjian Chen, "Extending Context Window of Large Language Models via Positional Interpolation," 2023. https://arxiv.org/abs/2306.15595
[^8]: Jacob Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," 2019. https://aclanthology.org/N19-1423.pdf
[^9]: Jonathan Ho, Nal Kalchbrenner, and Dirk Weissenborn, "Axial Attention in Multidimensional Transformers," 2019. https://arxiv.org/abs/1912.12180
[^10]: Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani, "Self-Attention with Relative Position Representations," 2018. https://aclanthology.org/N18-2074.pdf
[^11]: Zihang Dai et al., "Transformer-XL: Attentive Language Models beyond a Fixed-Length Context," 2019. https://aclanthology.org/P19-1285.pdf
[^13]: Pengcheng He, Xiaodong Liu, and Jianfeng Gao, "DeBERTa: Decoding-enhanced BERT with Disentangled Attention," 2020. https://arxiv.org/abs/2006.03654
[^14]: Guolin Ke, Di He, and Tie-Yan Liu, "Rethinking Positional Encoding in Language Pre-training," 2020. https://arxiv.org/abs/2006.15595
[^15]: Jianlin Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding," 2021. https://arxiv.org/abs/2104.09864
[^17]: Bowen Peng, Jeffrey Quesnelle, and Honglu Fan, "YaRN: Efficient Context Window Extension of Large Language Models," 2023. https://arxiv.org/abs/2309.00071
[^18]: Yiran Ding et al., "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens," 2024. https://arxiv.org/abs/2402.13753
[^19]: Ofir Press, Noah A. Smith, and Mike Lewis, "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation," 2021. https://arxiv.org/abs/2108.12409
[^20]: Shanda Li, Chong You, and Guru Guruganesh, "Functional Interpolation for Relative Positions Improves Long Context Transformers," 2023. https://arxiv.org/abs/2310.04418