That mask is what makes the architecture compatible with the autoregressive factorization.

<video src="https://assets.montek.dev/lectures/media/llm/concepts/Causal%20Language%20Modeling/03_many_local_decisions.mp4" controls></video>

## A concrete worked example

Take the sequence: `"The cat sat on the mat"`

Token-wise supervision conceptually looks like:

- the prefix `"The"` predicts `"cat"` as the next token
- the prefix `"The cat"` predicts `"sat"` as the next token
- the prefix `"The cat sat"` predicts `"on"` as the next token
- and so on

Every position becomes a supervised example. One training sequence therefore contains many prediction tasks.

This is the point people usually miss. The model is not trained on whole answers. It is trained on many local next-token decisions.

## Why this objective is surprisingly powerful

The next-token objective is local, but repeated over large corpora it pressures the model to learn:

- syntax
- long-range statistical dependencies
- discourse structure
- code and markup regularities
- broad world-pattern correlations present in text

Many high-level behaviors that feel like reasoning or recall emerge because the model becomes good at conditional continuation over many kinds of sequences. [^1] [^3]

## What causal LM does not solve by itself

Base pretraining does not automatically create:

- instruction following
- stable assistant behavior
- role-aware dialogue
- product guardrails
- tool usage conventions

Those require later changes to the training distribution and serving stack. In this course, that bridge appears in [[Chat Format and SFT]].

> [!tip] TensorTonic follow-up
>
> - [TensorTonic: BERT Masked LM](https://www.tensortonic.com/research/bert/bert-masked-lm)
>
> Use it here as a contrast exercise so you can compare causal next-token prediction with the masked-LM objective.

## Common confusions

### "Is the model predicting words or sentences?"

Neither. It predicts the next token. Sentences and answers appear only because repeated token prediction can be unrolled into longer text.

### "Why can training be parallel if generation is sequential?"

Because training sees the full sequence at once but still applies a causal mask. Inference has to generate one new token at a time because future tokens do not exist yet.

### "Is this unsupervised learning?"

In modern usage it is better described as self-supervised learning. The labels are derived from the data itself.

## A useful exercise

Take one short prompt and manually write out:

- the tokenized sequence
- the shifted targets
- the logits over the vocabulary at one position
- the correct target token

Then connect that to the actual loss call in the companion code. A minimal sketch of the same steps follows below.
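The sketch below walks those four items in running code. It is not the companion notebook's code; it assumes the Hugging Face `transformers` and `torch` packages are installed and uses the public `gpt2` checkpoint purely as a stand-in causal LM.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The cat sat on the mat"
input_ids = tokenizer(text, return_tensors="pt").input_ids   # (1, T) token ids

with torch.no_grad():
    logits = model(input_ids).logits                         # (1, T, vocab_size)

# Position t is scored against the token at position t + 1:
shift_logits = logits[:, :-1, :]     # predictions made at positions 0 .. T-2
shift_labels = input_ids[:, 1:]      # targets: the input shifted left by one

loss = F.cross_entropy(
    shift_logits.reshape(-1, shift_logits.size(-1)),
    shift_labels.reshape(-1),
)

# Inspect one position (the exact index depends on how the tokenizer splits the text).
pos = shift_labels.size(1) - 1       # the last prediction: full prefix -> final token
prefix = tokenizer.decode(input_ids[0, : pos + 1])
target = tokenizer.decode([shift_labels[0, pos].item()])
guess = tokenizer.decode([shift_logits[0, pos].argmax().item()])
print(f"prefix: {prefix!r}  target: {target!r}  model's top guess: {guess!r}")
print(f"mean next-token loss over all positions: {loss.item():.3f}")
```

Passing `labels=input_ids` to the model's forward call computes the same loss, because `transformers` performs the shift internally; the explicit slicing above just makes the off-by-one alignment visible.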
## Key takeaway

Causal language modeling is the bridge between decoder architecture and generative behavior. Without it, you have a sequence processor. With it, you have a next-token engine that can be rolled forward into text generation.

> [!example] Notebook walkthroughs in this lecture
>
> Use this order:
>
> 1. [notebooks/causal_language_modeling/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/causal_language_modeling/lecture_walkthrough.ipynb)
>
> Use these sections as you read:
>
> - [`Next-token targets are the input shifted by one`](https://github.com/Montekkundan/llm/blob/main/notebooks/causal_language_modeling/lecture_walkthrough.ipynb#next-token-targets-are-the-input-shifted-by-one)
> - [`A decoder-only model produces logits for every position`](https://github.com/Montekkundan/llm/blob/main/notebooks/causal_language_modeling/lecture_walkthrough.ipynb#a-decoder-only-model-produces-logits-for-every-position)
> - [`Each position predicts the next token, not itself`](https://github.com/Montekkundan/llm/blob/main/notebooks/causal_language_modeling/lecture_walkthrough.ipynb#each-position-predicts-the-next-token-not-itself)
> - [`Teacher forcing scores all positions in parallel`](https://github.com/Montekkundan/llm/blob/main/notebooks/causal_language_modeling/lecture_walkthrough.ipynb#teacher-forcing-scores-all-positions-in-parallel)
>
> That code walk makes the mathematical story concrete:
>
> - the dataset gives token windows
> - the model maps tokens to logits
> - the loss compares logits to shifted targets
> - the optimizer updates parameters
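To see the shape of those four bullets without opening the notebook, here is a toy sketch of the loop. The model is deliberately trivial (an embedding plus a linear head rather than a masked decoder block), and the "corpus" is random token windows; only the structure of the steps mirrors the real training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes; none of these match the companion model.
vocab_size, seq_len, batch_size, d_model = 100, 16, 8, 32

class TinyCausalLM(nn.Module):
    """Embedding -> linear head; a stand-in for a masked decoder stack."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):                 # ids: (B, T)
        return self.head(self.embed(ids))   # logits: (B, T, vocab_size)

model = TinyCausalLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(3):
    # 1. the dataset gives token windows
    batch = torch.randint(0, vocab_size, (batch_size, seq_len))

    # 2. the model maps tokens to logits
    logits = model(batch)

    # 3. the loss compares logits to shifted targets
    loss = F.cross_entropy(
        logits[:, :-1, :].reshape(-1, vocab_size),  # predictions at position t
        batch[:, 1:].reshape(-1),                   # targets are position t + 1
    )

    # 4. the optimizer updates parameters
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss = {loss.item():.3f}")
```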
> [!tip] TensorTonic practice for this lecture
>
> If you want to practice this lecture in a more implementation-focused format, work through these TensorTonic exercises:
>
> - [TensorTonic: BERT Masked LM](https://www.tensortonic.com/research/bert/bert-masked-lm)
> - [TensorTonic: GPT-2 Forward](https://www.tensortonic.com/research/gpt2/gpt2-forward)
>
> They are good follow-ups because they let you compare the two core pretraining objective families directly:
>
> - masked-token reconstruction in an encoder-style setup
> - next-token prediction in a decoder-only setup
> - how logits are produced before the loss is applied
> - why objective choice changes what kind of behavior the model is naturally built for

<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
    <div><a class="internal-link" data-href="Decoder Block" href="Decoder%20Block">Decoder Block</a></div>
  </div>
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
    <div><a class="internal-link" data-href="Training Loop" href="Training%20Loop">Training Loop</a></div>
  </div>
</div>

## Further reading

- Yoshua Bengio et al., "A Neural Probabilistic Language Model," 2003. https://www.jmlr.org/papers/v3/bengio03a.html
- Alec Radford et al., "Language Models are Unsupervised Multitask Learners," 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- Jared Kaplan et al., "Scaling Laws for Neural Language Models," 2020. https://arxiv.org/abs/2001.08361
- Hugging Face, "Causal language modeling," 2025. https://huggingface.co/docs/transformers/tasks/language_modeling

---

[^1]: Alec Radford et al., "Language Models are Unsupervised Multitask Learners," 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
[^2]: Yoshua Bengio et al., "A Neural Probabilistic Language Model," 2003. https://www.jmlr.org/papers/v3/bengio03a.html
[^3]: Jared Kaplan et al., "Scaling Laws for Neural Language Models," 2020. https://arxiv.org/abs/2001.08361