> [!info] Course code
> Use the companion repository for runnable notebooks, figures, and implementation references for this lecture:
> - [notebooks/inference_and_sampling/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/inference_and_sampling/lecture_walkthrough.ipynb)
> - [picollm/accelerated/engine.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/engine.py)
> - [picollm/accelerated/chat/cli.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/cli.py)
> - [picollm/accelerated/chat/web.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/web.py)

## What This Concept Is

At inference time, the model does not output one final answer directly. It produces a distribution over the next token, then some rule chooses one token, appends it, and asks again. Inference and sampling are the parts of the story that turn a trained model into an actual text generator.

So if training is about making the distribution better, inference is about deciding how to use that distribution at runtime.

## Foundation Terms You Need First

The core object here is the **[[Glossary#Logits|logit]]** vector: raw scores over the vocabulary for the next token. A **probability distribution** is what you get after normalizing those scores. **[[Glossary#Temperature|Temperature]]** changes how sharp or flat that distribution is. A **sampling rule** such as greedy, top-k, or top-p decides which token is actually chosen.

That means this note lives at the boundary between model belief and runtime choice. The logits come from the model. The sampling rule decides how deterministic, risky, or diverse the final output will be.

```mermaid
flowchart TD
    A["Prompt tokens"] --> B["Prefill pass"]
    B --> C["KV cache and next-token logits"]
    C --> D["Decode one token at a time"]
    D --> E["Greedy: argmax"]
    D --> F["Sampling: temperature / top-k / top-p"]
    E --> G["Append token and continue"]
    F --> G
```

## How this lecture maps to picoLLM

The notebook introduces the clean inference ideas first. Then you inspect the serious runtime surfaces:

- `engine.py` for [[Glossary#Prefill|prefill]], decode, and batch generation
- `chat/cli.py` for a direct user-facing runtime
- `chat/web.py` for streaming and API-compatible serving

This is also where the course should keep the reference hierarchy explicit:

- `rasbt` is the concept-first external reference
- `picollm` is the real implementation path you will build on
- `nanochat` is the systems-oriented comparison reference

## The prefill-decode view

At inference time there are two phases:

- **prefill**: run the prompt through the model to build hidden state and cache
- **decode**: repeatedly generate one new token at a time

The companion runtime makes this explicit in the inference demo with:

- `prefill_prompt(...)`
- `decode_next_token(...)`
- `generate_text(...)`
- `stream_text(...)`

That is a very clear design because you can see the runtime shape of generation directly.

## Greedy decoding

The simplest rule is greedy decoding:

- compute probabilities from logits
- choose the highest-probability token
- append it
- repeat

This is deterministic when the model and prompt are fixed. It is also often brittle or repetitive because it always takes the local argmax. In the companion runtime, setting `temperature = 0.0` produces greedy behavior.
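To make the argmax rule concrete, here is a minimal sketch of a greedy loop. It is not the companion runtime's code: `model` is assumed to be a callable that takes a `(1, seq_len)` tensor of token ids and returns logits of shape `(1, seq_len, vocab_size)`.

```python
import torch

def greedy_decode(model, token_ids, max_new_tokens, eos_id=None):
    """Minimal greedy loop: always take the argmax of the next-token logits."""
    for _ in range(max_new_tokens):
        with torch.no_grad():
            # Recomputes the whole prefix every step; the KV cache section
            # later in this note removes exactly this waste.
            logits = model(token_ids)                      # (1, seq_len, vocab_size)
        next_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
        token_ids = torch.cat([token_ids, next_id], dim=1)
        if eos_id is not None and next_id.item() == eos_id:
            break                                          # stop at end-of-sequence
    return token_ids
```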
> [!example] Notebook follow-up
> - [`Greedy decoding`](https://github.com/Montekkundan/llm/blob/main/notebooks/inference_and_sampling/lecture_walkthrough.ipynb#greedy-decoding)
>   Use this notebook section here to see the argmax path step by step.

> [!tip] TensorTonic follow-up
> - [TensorTonic: GPT-2 Greedy Decode](https://www.tensortonic.com/research/gpt2/gpt2-greedy-decode)
>   Work through it here to practice the same deterministic decoding rule directly.

## Temperature and top-k

The current companion runtime implements two main probability-shaping knobs:

- `temperature`
- `top_k`

### Temperature

[[Glossary#Temperature|Temperature]] rescales logits before [[Glossary#Softmax|softmax]].

- lower temperature sharpens the distribution
- higher temperature flattens it

It helps to understand this as probability calibration, not as a magic creativity dial.

### Top-k

[[Glossary#Top-k|Top-k]] keeps only the `k` most likely tokens and suppresses the rest. This prevents the sampler from wandering into extremely low-probability tails.

That logic appears directly in the `decode_next_token(...)` helper in the companion runtime code.

> [!example] Notebook follow-up
> - [`Temperature and top-k on the same logits`](https://github.com/Montekkundan/llm/blob/main/notebooks/inference_and_sampling/lecture_walkthrough.ipynb#temperature-and-top-k-on-the-same-logits)
>   Use this notebook section here to compare how the same logits change under different decoding settings.

> [!tip] TensorTonic follow-up
> - [TensorTonic: GPT-2 Top-k Sampling](https://www.tensortonic.com/research/gpt2/gpt2-topk-sampling)
>   Work through it here to practice the sampling path from this paragraph.

## Top-p is a common extension

It is also worth learning about nucleus sampling, or [[Glossary#Top-p|top-p]] decoding, because it is common in production systems. [^1]

- sort tokens by probability
- keep the smallest set whose cumulative probability exceeds $p$
- sample from that truncated distribution

Important note:

- the small concept runtime can be taught with temperature and top-k first
- the serious picoLLM runtime now also exposes `top_p`, `min_p`, `max_tokens`, and `seed` in its CLI and web surfaces

So you should understand top-p both as a standard literature technique and as a real runtime control in the serious picoLLM path.

> [!example] Notebook follow-up
> - [`Top-p as a common extension`](https://github.com/Montekkundan/llm/blob/main/notebooks/inference_and_sampling/lecture_walkthrough.ipynb#top-p-as-a-common-extension)
>   Use this notebook section here to compare top-p against the earlier greedy and top-k rules.

## Why sampling matters

Sampling is the cleanest place to show that output quality is not only a model question. A weak model can look a little better under safer decoding. A strong model can look much worse under reckless sampling.

That gives you a more mature mental model:

- training defines the learned distribution
- decoding defines how we traverse that distribution

## Stopping conditions

A practical inference loop also needs termination rules:

- end-of-sequence token
- maximum new tokens
- application-level stop strings, if used

Without bounded decoding, demos become fragile and products become unpredictable. In the companion runtime, `max_new_tokens` is part of the generation config, and the loop stops early when the end token is generated.
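Before moving to the systems side, here is a minimal sketch that ties these decoding knobs together in one helper. It assumes a PyTorch tensor of next-token logits; the function name and signature are illustrative and are not the companion runtime's `decode_next_token(...)`.

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Pick one token id from next-token logits using the knobs from this lecture."""
    if temperature == 0.0:
        return torch.argmax(logits, dim=-1)              # greedy: deterministic argmax
    logits = logits / temperature                        # temperature rescales logits
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))   # drop the tail
    probs = torch.softmax(logits, dim=-1)
    if top_p is not None:
        sorted_probs, sorted_ids = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        # keep the smallest prefix whose cumulative probability reaches p
        keep = cumulative - sorted_probs < top_p
        sorted_probs = sorted_probs * keep
        probs = torch.zeros_like(probs).scatter(-1, sorted_ids, sorted_probs)
        probs = probs / probs.sum(dim=-1, keepdim=True)  # renormalize the nucleus
    return torch.multinomial(probs, num_samples=1).squeeze(-1)
```

Note how `temperature = 0.0` short-circuits to argmax, matching the greedy behavior described earlier, while top-k and top-p both act as truncation rules applied before the random draw.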
## KV cache and why inference is a systems problem

Once generation starts, recomputing the full attention history on every step is wasteful. That is why modern runtimes cache previous keys and values. [^2]

The companion runtime supports this directly:

- prefill computes the initial cache
- each decode step appends to `past_kvs`

This is an essential systems idea. Inference is not only probability theory. It is [[Glossary#Latency|latency]] engineering.

Be able to restate the practical runtime story clearly:

- previous keys and values do not need to be recomputed from scratch
- the [[Glossary#KV cache|KV cache]] stores them
- only the new token's contribution must be processed

That is why interactive chat feels much better with a cache than without one.

When you want to show the product-facing runtime path directly, point to:

- [notebooks/inference_runtime_and_kv_cache/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/inference_runtime_and_kv_cache/lecture_walkthrough.ipynb)

The key helper names to highlight are:

- `prefill_prompt(...)`
- `decode_next_token(...)`
- `stream_text(...)`

## Streaming is a product choice built on the same decoder

The web app does not use a different model from the CLI. It uses the same generation loop but emits partial text as it arrives. That is why `stream_text(...)` is useful here.

You can see that:

- the model is still generating one token at a time
- the UI just exposes partial outputs earlier

## Common confusions

### "If training is parallel, why is inference sequential?"

Because at generation time the future tokens do not exist yet. You can only condition on what has already been produced.

### "Does sampling change the model?"

No. It only changes how you select from the distribution the model already produced.

### "Is deterministic output always better?"

Not necessarily. Determinism can improve repeatability, but it can also make outputs repetitive or brittle depending on the task.

## A useful comparison

Use one prompt and run it four ways; a toy version of this comparison is sketched after the list:

1. greedy
2. low temperature
3. medium temperature plus small `top_k`
4. higher temperature plus larger `top_k`

Then explicitly note that top-p is another standard family you will encounter in other runtimes.
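To see the effect without loading a full model, here is a self-contained toy version of that comparison on one fixed logits vector. The vocabulary and numbers are made up for illustration; with the real runtime you would run the same four settings on an actual prompt through the companion's generation helpers.

```python
import torch

torch.manual_seed(0)

# Toy next-token logits over a tiny vocabulary, so the four decoding
# settings can be compared on identical model output.
vocab = ["the", "a", "cat", "dog", "pizza", "quantum", "zebra", "!"]
logits = torch.tensor([2.5, 2.3, 1.8, 1.6, 0.5, -0.5, -1.0, -2.0])

def pick(logits, temperature, top_k=None):
    if temperature == 0.0:
        return int(torch.argmax(logits))                  # greedy
    scaled = logits / temperature                         # temperature rescaling
    if top_k is not None:
        kth = torch.topk(scaled, top_k).values[-1]
        scaled = scaled.masked_fill(scaled < kth, float("-inf"))  # keep only top-k
    probs = torch.softmax(scaled, dim=-1)
    return int(torch.multinomial(probs, 1))

settings = [
    ("greedy",                    dict(temperature=0.0)),
    ("low temperature",           dict(temperature=0.3)),
    ("medium temp, small top_k",  dict(temperature=0.7, top_k=3)),
    ("higher temp, larger top_k", dict(temperature=1.2, top_k=6)),
]

for label, kwargs in settings:
    picks = [vocab[pick(logits, **kwargs)] for _ in range(8)]
    print(f"{label:28s} {picks}")
```

Greedy prints the same token eight times, low temperature stays close to it, and the looser settings spread the picks across more of the vocabulary, which is exactly the traversal-of-the-distribution point from earlier.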
## Key takeaway

Inference is where a probability model becomes an interactive artifact. The main thing to leave this note with is that generation quality depends both on what the model learned and on how we choose to decode from it.

> [!example] Notebook walkthroughs in this lecture
>
> Use this order:
>
> 1. [notebooks/inference_and_sampling/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/inference_and_sampling/lecture_walkthrough.ipynb)
>    Use these sections as you read:
>    - [`Prefill then decode`](https://github.com/Montekkundan/llm/blob/main/notebooks/inference_and_sampling/lecture_walkthrough.ipynb#prefill-then-decode)
>    - [`Greedy decoding`](https://github.com/Montekkundan/llm/blob/main/notebooks/inference_and_sampling/lecture_walkthrough.ipynb#greedy-decoding)
>    - [`Temperature and top-k on the same logits`](https://github.com/Montekkundan/llm/blob/main/notebooks/inference_and_sampling/lecture_walkthrough.ipynb#temperature-and-top-k-on-the-same-logits)
>    - [`Top-p as a common extension`](https://github.com/Montekkundan/llm/blob/main/notebooks/inference_and_sampling/lecture_walkthrough.ipynb#top-p-as-a-common-extension)
> 2. [picollm/accelerated/engine.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/engine.py)
>    Use it to connect the notebook ideas to the real runtime path.

> [!tip] TensorTonic practice for this lecture
>
> If you want to practice this lecture in a more implementation-focused format, work through these TensorTonic exercises:
>
> - [TensorTonic: GPT-2 Greedy Decode](https://www.tensortonic.com/research/gpt2/gpt2-greedy-decode)
> - [TensorTonic: GPT-2 Top-k Sampling](https://www.tensortonic.com/research/gpt2/gpt2-topk-sampling)
>
> They are good follow-ups because they make the decoding-policy choice visible on the same model outputs:
>
> - taking the argmax path token by token
> - restricting the candidate set before sampling
> - seeing how a decoding rule changes behavior without changing the model weights
> - comparing deterministic and stochastic generation directly

<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
    <div><a class="internal-link" data-href="Training Configuration and Hyperparameters" href="Training%20Configuration%20and%20Hyperparameters">Training Configuration and Hyperparameters</a></div>
  </div>
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
    <div><a class="internal-link" data-href="Evaluation and Model Quality" href="Evaluation%20and%20Model%20Quality">Evaluation and Model Quality</a></div>
  </div>
</div>

## Further reading

- Angela Fan, Mike Lewis, and Yann Dauphin, "Hierarchical Neural Story Generation," 2018. https://arxiv.org/abs/1805.04833
- Hugging Face, "Generation strategies," 2025. https://huggingface.co/docs/transformers/en/generation_strategies

---

## References

[^1]: Ari Holtzman et al., "The Curious Case of Neural Text Degeneration," 2020. https://arxiv.org/abs/1904.09751
[^2]: Hugging Face documentation on KV cache and `past_key_values`. https://huggingface.co/docs/transformers/main/cache_explanation