> [!info] Course code
> Use the companion repository for runnable notebooks, figures, and implementation references for this lecture:
> - [notebooks/embedding_layer/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/embedding_layer/lecture_walkthrough.ipynb)
> - [course_tools/runtime.py](https://github.com/Montekkundan/llm/blob/main/course_tools/runtime.py)
> - [picollm/accelerated/gpt.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/gpt.py)
## What This Concept Is
Imagine the tokenizer has already done its job and handed the model IDs like `57, 12, 908`. Those numbers are useful as labels, but they do not yet carry usable geometry. The embedding layer is the moment those labels become learned numeric objects the rest of the Transformer can work with.
If tokenization answers "which symbol is this?", the embedding layer starts answering "where does that symbol live in the model's learned space?"
## Foundation Terms You Need First
Keep four objects separate as you read. A **[[Glossary#Token ID|token ID]]** is just an index. A **[[Glossary#Matrix|matrix]]** is the full table that stores one learned row per vocabulary entry. An **[[Glossary#Embedding|embedding]]** is the row you retrieve for one specific token ID. A **[[Glossary#Vector|vector]]** is the shape of that row: an ordered list of numbers that later layers can add, compare, project, and normalize.
So when someone says "the model embeds a token," what actually happens is less mystical than it sounds. The model uses the token ID as an address, looks up one learned row in a table, and passes that vector forward.
```mermaid
flowchart TD
A["Token IDs: 57, 12, 908"] --> B["Embedding matrix E"]
B --> C["Lookup rows E[57], E[12], E[908]"]
C --> D["Dense vectors in R^d_model"]
D --> E["Add positional information"]
E --> F["Enter the Transformer stack"]
```
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Embedding%20Layer/01_token_id_to_vector.mp4" controls></video>
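A minimal PyTorch sketch of that lookup, using made-up sizes (the vocabulary size, model width, and token IDs below are illustrative, not taken from the course code):
```python
import torch
import torch.nn as nn

torch.manual_seed(0)

V, d_model = 1000, 16                  # illustrative vocabulary size and model width
embedding = nn.Embedding(V, d_model)   # the learned matrix E, one row per token ID

token_ids = torch.tensor([57, 12, 908])  # IDs handed over by the tokenizer
vectors = embedding(token_ids)           # row lookup: E[57], E[12], E[908]

print(vectors.shape)  # torch.Size([3, 16]) -- one d_model-dimensional vector per ID
```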
## Which code surfaces matter
- `notebooks/embedding_layer/lecture_walkthrough.ipynb` is the main runnable surface for lookup, one-hot equivalence, scaling, and weight tying.
- `course_tools/runtime.py` is the smallest end-to-end runtime. Use it when you want the cleanest possible path from token IDs to vectors.
- `picollm/accelerated/gpt.py` shows how token embeddings, output logits, and extra learned tables appear inside a full model.
## Why the embedding step exists
Token IDs are useful as stable labels, but the model cannot reason with those labels directly. Token ID `57` is not "larger" or "more meaningful" than token ID `12`. The IDs are arbitrary indices into a learned [[Glossary#Matrix|matrix]].
> [!question] Why is tokenization not enough by itself?
>> [!answer] Tokenization gives you discrete IDs. The embedding layer turns those IDs into learned vectors so attention, normalization, and feed-forward layers can compute on them.
That is why the embedding layer is the point where symbolic identity becomes geometry.
> [!question] Quick check
> If all token IDs were permuted consistently everywhere, would the model still work?
>> [!answer] Yes. Meaning is not stored in the raw integer labels. It is stored in the row assignment and the learned parameters attached to those rows.
> [!example] Code for this section
> - Notebook: [notebooks/embedding_layer/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/embedding_layer/lecture_walkthrough.ipynb)
> - Runtime: [course_tools/runtime.py](https://github.com/Montekkundan/llm/blob/main/course_tools/runtime.py)
> [!tip] TensorTonic follow-up
> - [TensorTonic: Transformers Embedding](https://www.tensortonic.com/research/transformer/transformers-embedding)
> Use it here to practice the token-ID-to-vector lookup story immediately after this section.
## The core mathematics
### The embedding matrix
The fundamental object is an embedding matrix:
$
E \in \mathbb{R}^{V \times d_{\text{model}}}
$
Here, $V$ is the [[Glossary#Vocabulary size|vocabulary size]], $d_{\text{model}}$ is the width of each vector, and each row is the learned vector associated with one token ID.[^1]
Given token ID $i$, the embedding output is:
$
x_i = E[i]
$
For a sequence of length $n$, the embedding layer maps a list of token IDs into a matrix in $\mathbb{R}^{n \times d_{\text{model}}}$. In code, this is a row-gather operation rather than a dense matrix multiplication.[^1][^5]
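As a sketch of that shape story (the batch size, sequence length, and sizes below are illustrative):
```python
import torch
import torch.nn as nn

V, d_model = 1000, 16
E = nn.Embedding(V, d_model)

ids = torch.randint(0, V, (2, 5))   # a batch of 2 sequences, each of length n = 5
x = E(ids)                          # row gather, not a dense matrix multiplication
print(x.shape)                      # torch.Size([2, 5, 16]) == (batch, n, d_model)

# The same gather, written as direct indexing into the weight matrix:
assert torch.equal(x, E.weight[ids])
```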
### The one-hot identity
The same lookup can be written mathematically as:
$
x_i = e_i^\top E
$
where $e_i$ is the [[Glossary#One-hot vector|one-hot vector]] for token $i$.
This matters because it makes the implementation idea precise: an embedding layer is mathematically a linear map applied to a one-hot input, even though the code uses efficient row lookup instead of explicitly building the one-hot vector.[^5]
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Embedding%20Layer/05_one_hot_lookup_identity.mp4" controls></video>
This also clarifies a common misunderstanding. You might hear that "the embedding layer is just a lookup table." Implementation-wise, that is true. Conceptually, it is incomplete. The lookup is an efficient way to realize a linear map from a discrete input into a learned vector space. The embedding matrix contains learned vectors whose roles are shaped entirely by training and the downstream objective.[^1][^6]
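A small sketch of that equivalence, with illustrative sizes; the explicit one-hot route is only there to make the identity visible, not something you would run in practice:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, d_model = 50, 8
E = nn.Embedding(V, d_model)

i = torch.tensor([3])                          # one token ID
lookup = E(i)                                  # efficient row lookup

one_hot = F.one_hot(i, num_classes=V).float()  # shape (1, V)
matmul = one_hot @ E.weight                    # linear map applied to a one-hot input

print(torch.allclose(lookup, matmul))          # True: same vector, two descriptions
```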
## Scaling and embedding bundles
### Why the Transformer scales embeddings
The original Transformer scales embedding vectors by $\sqrt{d_{\text{model}}}$ before combining them with positional information.[^1]
If embedding components are initialized with variance $\sigma^2$, a typical embedding norm is on the order of $\sqrt{d_{\text{model}}}\,\sigma$. With the small $\sigma$ used in common initializations, that norm sits well below the magnitude of the positional signal, so multiplying by $\sqrt{d_{\text{model}}}$ brings token identity and positional information into a comparable numerical range at the start of training.
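A minimal sketch of where the factor sits, assuming the default `nn.Embedding` initialization; `pos_encoding` below is just a placeholder for whatever positional scheme the model actually uses:
```python
import math
import torch
import torch.nn as nn

V, d_model, n = 1000, 512, 3
tok_emb = nn.Embedding(V, d_model)
pos_encoding = torch.zeros(n, d_model)     # placeholder positional signal

ids = torch.tensor([[57, 12, 908]])
x = tok_emb(ids) * math.sqrt(d_model)      # scale token vectors up first
x = x + pos_encoding                       # then add positional information

print(x.shape)                             # torch.Size([1, 3, 512])
```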
> [!question] Quick check
> Could the model learn the right scale later without this factor?
>> [!answer] Yes, but that would waste optimization effort recovering a sensible signal scale first. The explicit factor makes early training more stable.
### Embedding bundles
In some models, the input representation is not only one token embedding table. BERT is the canonical example: the model sums token embeddings, segment embeddings, and position embeddings.[^3]
That is why "embedding layer" often really means an embedding bundle: several learned tables whose outputs are added together before the Transformer stack starts.
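A sketch of that additive bundle with BERT-base-like sizes; the real implementation also applies layer normalization and dropout after the sum, which this sketch omits, and the token IDs are just examples:
```python
import torch
import torch.nn as nn

V, d_model, max_len, n_segments = 30522, 768, 512, 2   # BERT-base-like sizes

tok_emb = nn.Embedding(V, d_model)           # token table
pos_emb = nn.Embedding(max_len, d_model)     # learned position table
seg_emb = nn.Embedding(n_segments, d_model)  # segment (sentence A / B) table

ids = torch.tensor([[101, 7592, 102]])                     # example token IDs
positions = torch.arange(ids.size(1)).unsqueeze(0)         # 0, 1, 2, ...
segments = torch.zeros_like(ids)                           # everything in sentence A

x = tok_emb(ids) + pos_emb(positions) + seg_emb(segments)  # the additive bundle
print(x.shape)  # torch.Size([1, 3, 768])
```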
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Embedding%20Layer/04_embedding_bundle_and_scaling.mp4" controls></video>
> [!example] Code for this section
> - Notebook: [notebooks/embedding_layer/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/embedding_layer/lecture_walkthrough.ipynb)
> - Full model: [picollm/accelerated/gpt.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/gpt.py)
> [!tip] TensorTonic follow-up
> - [TensorTonic: BERT Segment Embedding](https://www.tensortonic.com/research/bert/bert-segment-embedding)
> Use it here to practice how multiple embedding components are added together in a BERT-style input bundle.
## Weight tying
The original Transformer shares the embedding matrix with the pre-softmax output projection.[^1]
If the output weights are tied to the input embedding matrix, predicting the next token becomes an alignment problem between the hidden state and the vocabulary vectors:
$
\text{logit}_j = h^\top E_j
$
where $h$ is the current hidden state and $E_j$ is the embedding row for token $j$.[^7]
That means the model reads tokens and writes tokens using the same geometry. Weight tying reduces the parameter count, and it also makes the input and output spaces line up more cleanly.
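A minimal sketch of tying in PyTorch with illustrative sizes; the course model in `picollm/accelerated/gpt.py` is the place to check how this is actually wired up:
```python
import torch
import torch.nn as nn

V, d_model = 1000, 64

tok_emb = nn.Embedding(V, d_model)
lm_head = nn.Linear(d_model, V, bias=False)
lm_head.weight = tok_emb.weight          # tie: one matrix reads and writes tokens

h = torch.randn(1, d_model)              # a hidden state from the final layer
logits = lm_head(h)                      # logit_j = h . E_j for every token j

# The same numbers, computed as explicit dot products against embedding rows:
assert torch.allclose(logits, h @ tok_emb.weight.T)
```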
> [!question] Quick check
> If $V = 50{,}000$ and $d_{\text{model}} = 1024$, how many parameters are in the input embedding matrix?
>> [!answer] $50{,}000 \times 1024 = 51{,}200{,}000$ parameters. If the output projection is untied, that adds another full $1024 \times 50{,}000$ matrix.
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Embedding%20Layer/02_weight_tying_geometry.mp4" controls></video>
> [!example] Code for this section
> - Notebook: [notebooks/embedding_layer/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/embedding_layer/lecture_walkthrough.ipynb)
> - Full model: [picollm/accelerated/gpt.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/gpt.py)
## Static embeddings versus contextual representations
The embedding layer itself is static: token ID $i$ always retrieves the same row $E[i]$.
> [!example] Notebook follow-up
> - [`Static rows versus contextual hidden states`](https://github.com/Montekkundan/llm/blob/main/notebooks/embedding_layer/lecture_walkthrough.ipynb#static-rows-versus-contextual-hidden-states)
> Use this notebook section here to compare static lookup rows with later contextual hidden states.
> [!tip] TensorTonic follow-up
> - [TensorTonic: GPT-2 Embedding](https://www.tensortonic.com/research/gpt2/gpt2-embedding)
> Work through it here to reinforce how decoder-only models use embeddings at the start of the stack.
But later Transformer layers produce contextual representations. Once attention starts operating, the representation of a token depends on the surrounding tokens in the sequence. ELMo and BERT are standard reference points for that distinction.[^3][^11]
This is the practical difference:
- the embedding row gives a token its initial coordinate assignment
- the later hidden states reshape that representation using context
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Embedding%20Layer/03_static_vs_contextual.mp4" controls></video>
> [!question] Quick check
> Why can the word "bank" have one embedding row but many different contextual meanings?
>> [!answer] The static embedding row is only the starting point. Later layers use surrounding tokens to produce different contextual representations for different uses of the same word.
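A sketch of that answer with made-up token IDs, using a single `nn.TransformerEncoderLayer` as a stand-in for the real stack (no positional information is added here, which a real model would need):
```python
import torch
import torch.nn as nn

V, d_model = 1000, 32
bank_id = 7                                    # hypothetical ID for "bank"

emb = nn.Embedding(V, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
layer.eval()                                   # disable dropout for a clean comparison

river = torch.tensor([[5, bank_id, 9]])        # "river bank ..." (made-up IDs)
money = torch.tensor([[3, bank_id, 2]])        # "money bank ..." (made-up IDs)

# Static: the embedding row for "bank" is identical in both sequences.
print(torch.equal(emb(river)[0, 1], emb(money)[0, 1]))     # True

# Contextual: after one attention layer, the two "bank" states differ.
with torch.no_grad():
    h_river, h_money = layer(emb(river)), layer(emb(money))
print(torch.allclose(h_river[0, 1], h_money[0, 1]))        # False
```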
## Practical issues that matter
### Rare tokens and subword structure
Rare tokens receive fewer direct updates, so their embeddings are often under-trained. Subword tokenization is one response to this problem: instead of forcing a rare word to live as one poorly trained atomic ID, represent it as a sequence of more frequent pieces.[^16]
This is one reason tokenization and embeddings should be understood together. A weak embedding for a rare token is often a tokenization problem in disguise.
### Padding behavior
Standard embedding APIs often let you specify a padding index whose row stays fixed and is excluded from gradient updates.[^5] That keeps the padding token from accidentally learning semantic content.
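A small sketch of the padding behavior, assuming PyTorch's `padding_idx` argument (sizes are illustrative):
```python
import torch
import torch.nn as nn

PAD = 0
emb = nn.Embedding(10, 4, padding_idx=PAD)   # row 0 is reserved for padding

ids = torch.tensor([PAD, 3, 3, PAD])
emb(ids).sum().backward()

print(emb.weight.grad[PAD])   # all zeros: the padding row receives no gradient
print(emb.weight.grad[3])     # non-zero: token 3 appeared twice in the batch
```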
### Frequency-scaled gradients and norm constraints
Embedding implementations may also expose options such as frequency-scaled gradients or max-norm constraints.[^5] These exist because frequent tokens can dominate learning dynamics, while some rows can become numerically oversized if they are left completely unconstrained.
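A sketch of those two knobs on PyTorch's `nn.Embedding` (sizes are illustrative):
```python
import torch
import torch.nn as nn

# scale_grad_by_freq divides each row's gradient by how often its ID appears in
# the current mini-batch, damping the influence of very frequent tokens.
# max_norm renormalizes any looked-up row whose L2 norm exceeds the limit.
emb = nn.Embedding(100, 8, max_norm=1.0, scale_grad_by_freq=True)

ids = torch.tensor([5, 5, 5, 7])   # token 5 appears three times, token 7 once
out = emb(ids)                     # rows over the limit are clipped at lookup time
print(out.norm(dim=-1))            # every norm is <= 1.0
```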
> [!example] Notebook walkthroughs in this lecture
>
> If you want to run the implementation walkthrough while reading this note, use these notebook sections. If your notebook viewer does not follow the fragment links, search for the exact heading text in the notebook instead:
>
> - [`Token ID to vector mapping`](https://github.com/Montekkundan/llm/blob/main/notebooks/embedding_layer/lecture_walkthrough.ipynb#token-id-to-vector-mapping)
> - [`Sequence lookup and output shape`](https://github.com/Montekkundan/llm/blob/main/notebooks/embedding_layer/lecture_walkthrough.ipynb#sequence-lookup-and-output-shape)
> - [`Padding rows and gradient flow`](https://github.com/Montekkundan/llm/blob/main/notebooks/embedding_layer/lecture_walkthrough.ipynb#padding-rows-and-gradient-flow)
> - [`Vocabulary size and parameter count`](https://github.com/Montekkundan/llm/blob/main/notebooks/embedding_layer/lecture_walkthrough.ipynb#vocabulary-size-and-parameter-count)
> - [`Weight tying and shared geometry`](https://github.com/Montekkundan/llm/blob/main/notebooks/embedding_layer/lecture_walkthrough.ipynb#weight-tying-and-shared-geometry)
> - [`Static rows versus contextual hidden states`](https://github.com/Montekkundan/llm/blob/main/notebooks/embedding_layer/lecture_walkthrough.ipynb#static-rows-versus-contextual-hidden-states)
> [!tip] TensorTonic practice for this lecture
>
> If you want to practice this lecture in a more implementation-focused format, work through these TensorTonic exercises:
>
> - [TensorTonic: Transformers Embedding](https://www.tensortonic.com/research/transformer/transformers-embedding)
> - [TensorTonic: BERT Segment Embedding](https://www.tensortonic.com/research/bert/bert-segment-embedding)
> - [TensorTonic: GPT-2 Embedding](https://www.tensortonic.com/research/gpt2/gpt2-embedding)
>
> They are good follow-ups because they make the embedding stack concrete in three important settings:
>
> - token IDs becoming learned vectors
> - additive input construction in BERT with token, position, and segment information
> - decoder-only embedding lookup and shape flow in GPT-style models
<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
<div><a class="internal-link" data-href="Tokenization" href="Tokenization">Tokenization</a></div>
</div>
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
<div><a class="internal-link" data-href="Positional Encoding" href="Positional%20Encoding">Positional Encoding</a></div>
</div>
</div>
## References
[^1]: Ashish Vaswani et al., "Attention Is All You Need," 2017. https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf
[^3]: Jacob Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," 2019. https://aclanthology.org/N19-1423.pdf
[^5]: PyTorch, "torch.nn.Embedding," 2025. https://docs.pytorch.org/docs/stable/generated/torch.nn.Embedding.html
[^6]: Yoshua Bengio et al., "A Neural Probabilistic Language Model," 2003. https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
[^7]: Ofir Press and Lior Wolf, "Using the Output Embedding to Improve Language Models," 2016. https://arxiv.org/abs/1608.05859
[^8]: Zellig S. Harris, "Distributional Structure," 1954. https://www.its.caltech.edu/~matilde/ZelligHarrisDistributionalStructure1954.pdf
[^9]: Tomas Mikolov et al., "Efficient Estimation of Word Representations in Vector Space," 2013. https://arxiv.org/abs/1301.3781
[^11]: Matthew E. Peters et al., "Deep contextualized word representations," 2018. https://arxiv.org/abs/1802.05365
[^16]: Rico Sennrich, Barry Haddow, and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units," 2015. https://arxiv.org/abs/1508.07909