That mask is what makes the architecture compatible with the autoregressive factorization.

<video src="https://assets.montek.dev/lectures/media/llm/concepts/Causal%20Language%20Modeling/03_many_local_decisions.mp4" controls></video>

## A concrete worked example

Take the sequence: `"The cat sat on the mat"`

Token-wise supervision conceptually looks like:

- the prefix `"The"` predicts `"cat"` as the next token
- the prefix `"The cat"` predicts `"sat"` as the next token
- the prefix `"The cat sat"` predicts `"on"` as the next token
- and so on

Every position becomes a supervised example. One training sequence therefore contains many prediction tasks.

This is the point people usually miss. The model is not trained on whole answers. It is trained on many local next-token decisions.

## Why this objective is surprisingly powerful

The next-token objective is local, but repeated over large corpora it pressures the model to learn:

- syntax
- long-range statistical dependencies
- discourse structure
- code and markup regularities
- broad world-pattern correlations present in text

Many high-level behaviors that feel like reasoning or recall emerge because the model becomes good at conditional continuation over many kinds of sequences. [^1] [^3]

## What causal LM does not solve by itself

Base pretraining does not automatically create:

- instruction following
- stable assistant behavior
- role-aware dialogue
- product guardrails
- tool usage conventions

Those require later changes to the training distribution and serving stack. In this course, that bridge appears in [[Chat Format and SFT]].

> [!tip] TensorTonic follow-up
>
> - [TensorTonic: BERT Masked LM](https://www.tensortonic.com/research/bert/bert-masked-lm)
>
> Use it here as a contrast exercise so you can compare causal next-token prediction with the masked-LM objective.

## Common confusions

### "Is the model predicting words or sentences?"

Neither. It predicts the next token. Sentences and answers appear only because repeated token prediction can be unrolled into longer text.

### "Why can training be parallel if generation is sequential?"

Because training sees the full sequence at once but still applies a causal mask. Inference has to generate one new token at a time because future tokens do not exist yet.

### "Is this unsupervised learning?"

In modern usage it is better described as self-supervised learning. The labels are derived from the data itself.

## A useful exercise

Take one short prompt and manually write out:

- the tokenized sequence
- the shifted targets
- the logits over the vocabulary at one position
- the correct target token

Then connect that to the actual loss call in the companion code. A minimal sketch of the same steps follows below.
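The sketch below walks those four items in running code. It is not the companion notebook's code; it assumes the Hugging Face `transformers` and `torch` packages are installed and uses the public `gpt2` checkpoint purely as a stand-in causal LM.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The cat sat on the mat"
input_ids = tokenizer(text, return_tensors="pt").input_ids   # (1, T) token ids

with torch.no_grad():
    logits = model(input_ids).logits                         # (1, T, vocab_size)

# Position t is scored against the token at position t + 1:
shift_logits = logits[:, :-1, :]     # predictions made at positions 0 .. T-2
shift_labels = input_ids[:, 1:]      # targets: the input shifted left by one

loss = F.cross_entropy(
    shift_logits.reshape(-1, shift_logits.size(-1)),
    shift_labels.reshape(-1),
)

# Inspect one position (the exact index depends on how the tokenizer splits the text).
pos = shift_labels.size(1) - 1       # the last prediction: full prefix -> final token
prefix = tokenizer.decode(input_ids[0, : pos + 1])
target = tokenizer.decode([shift_labels[0, pos].item()])
guess = tokenizer.decode([shift_logits[0, pos].argmax().item()])
print(f"prefix: {prefix!r}  target: {target!r}  model's top guess: {guess!r}")
print(f"mean next-token loss over all positions: {loss.item():.3f}")
```

Passing `labels=input_ids` to the model's forward call computes the same loss, because `transformers` performs the shift internally; the explicit slicing above just makes the off-by-one alignment visible.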
## Key takeaway

Causal language modeling is the bridge between decoder architecture and generative behavior. Without it, you have a sequence processor. With it, you have a next-token engine that can be rolled forward into text generation.

> [!example] Notebook walkthroughs in this lecture
>
> Use this order:
>
> 1. [notebooks/causal_language_modeling/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/causal_language_modeling/lecture_walkthrough.ipynb)
>
> Use these sections as you read:
>
> - [`Next-token targets are the input shifted by one`](https://github.com/Montekkundan/llm/blob/main/notebooks/causal_language_modeling/lecture_walkthrough.ipynb#next-token-targets-are-the-input-shifted-by-one)
> - [`A decoder-only model produces logits for every position`](https://github.com/Montekkundan/llm/blob/main/notebooks/causal_language_modeling/lecture_walkthrough.ipynb#a-decoder-only-model-produces-logits-for-every-position)
> - [`Each position predicts the next token, not itself`](https://github.com/Montekkundan/llm/blob/main/notebooks/causal_language_modeling/lecture_walkthrough.ipynb#each-position-predicts-the-next-token-not-itself)
> - [`Teacher forcing scores all positions in parallel`](https://github.com/Montekkundan/llm/blob/main/notebooks/causal_language_modeling/lecture_walkthrough.ipynb#teacher-forcing-scores-all-positions-in-parallel)
>
> That code walk makes the mathematical story concrete:
>
> - the dataset gives token windows
> - the model maps tokens to logits
> - the loss compares logits to shifted targets
> - the optimizer updates parameters
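To see the shape of those four bullets without opening the notebook, here is a toy sketch of the loop. The model is deliberately trivial (an embedding plus a linear head rather than a masked decoder block), and the "corpus" is random token windows; only the structure of the steps mirrors the real training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes; none of these match the companion model.
vocab_size, seq_len, batch_size, d_model = 100, 16, 8, 32

class TinyCausalLM(nn.Module):
    """Embedding -> linear head; a stand-in for a masked decoder stack."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):                 # ids: (B, T)
        return self.head(self.embed(ids))   # logits: (B, T, vocab_size)

model = TinyCausalLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(3):
    # 1. the dataset gives token windows
    batch = torch.randint(0, vocab_size, (batch_size, seq_len))

    # 2. the model maps tokens to logits
    logits = model(batch)

    # 3. the loss compares logits to shifted targets
    loss = F.cross_entropy(
        logits[:, :-1, :].reshape(-1, vocab_size),  # predictions at position t
        batch[:, 1:].reshape(-1),                   # targets are position t + 1
    )

    # 4. the optimizer updates parameters
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss = {loss.item():.3f}")
```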
> [!tip] TensorTonic practice for this lecture
>
> If you want to practice this lecture in a more implementation-focused format, work through these TensorTonic exercises:
>
> - [TensorTonic: BERT Masked LM](https://www.tensortonic.com/research/bert/bert-masked-lm)
> - [TensorTonic: GPT-2 Forward](https://www.tensortonic.com/research/gpt2/gpt2-forward)
>
> They are good follow-ups because they let you compare the two core pretraining objective families directly:
>
> - masked-token reconstruction in an encoder-style setup
> - next-token prediction in a decoder-only setup
> - how logits are produced before the loss is applied
> - why objective choice changes what kind of behavior the model is naturally built for

<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
    <div><a class="internal-link" data-href="Decoder Block" href="Decoder%20Block">Decoder Block</a></div>
  </div>
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
    <div><a class="internal-link" data-href="Training Loop" href="Training%20Loop">Training Loop</a></div>
  </div>
</div>

## Further reading

- Yoshua Bengio et al., "A Neural Probabilistic Language Model," 2003. https://www.jmlr.org/papers/v3/bengio03a.html
- Alec Radford et al., "Language Models are Unsupervised Multitask Learners," 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- Jared Kaplan et al., "Scaling Laws for Neural Language Models," 2020. https://arxiv.org/abs/2001.08361
- Hugging Face, "Causal language modeling," 2025. https://huggingface.co/docs/transformers/tasks/language_modeling

---

[^1]: Alec Radford et al., "Language Models are Unsupervised Multitask Learners," 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
[^2]: Yoshua Bengio et al., "A Neural Probabilistic Language Model," 2003. https://www.jmlr.org/papers/v3/bengio03a.html
[^3]: Jared Kaplan et al., "Scaling Laws for Neural Language Models," 2020. https://arxiv.org/abs/2001.08361