> [!info] Course code
> Use the companion repository for runnable notebooks, figures, and implementation references for this lecture:
> - [notebooks/inference_runtime_and_kv_cache/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/inference_runtime_and_kv_cache/lecture_walkthrough.ipynb)
> - [course_tools/runtime.py](https://github.com/Montekkundan/llm/blob/main/course_tools/runtime.py)
> - [picollm/accelerated/engine.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/engine.py)

## What This Concept Is

When a chat model feels responsive, a lot of hidden runtime work is going right. The model has to process the prompt, remember the right internal state, and generate new tokens without recomputing everything from scratch every time. This note is about that runtime story. It sits exactly between the clean theory of generation and the harder systems story of serious serving.

## Foundation Terms You Need First

Two phases matter immediately. **[[Glossary#Prefill|Prefill]]** is the prompt-processing pass that builds the initial state. **[[Glossary#Decode|Decode]]** is the repeated one-token-at-a-time generation phase after that. The **[[Glossary#KV cache|KV cache]]** stores reusable attention state between decode steps. The **runtime loop** is the cycle of scoring, choosing, and appending the next token.

So as you read, keep one contrast in mind: prefill is about absorbing the prompt, while decode is about extending it efficiently.

```mermaid
flowchart TD
    A["Prompt tokens"] --> B["picoLLM Engine receives prompt IDs"]
    B --> C["Prefill forward pass through Transformer layers"]
    C --> D["Write K/V for prompt positions into KV cache"]
    D --> E["Decode loop begins"]
    E --> F["Newest token plus cached history goes back through the model"]
    F --> G["Append one new K/V position per layer"]
    G --> H["Return logits for the next token"]
    H --> I["Sample next token"]
    I --> J{"More tokens to generate?"}
    J -->|yes| F
    J -->|no| K["Emit completed response"]
```

## The core idea

Without caching, every new generated token would require recomputing attention over the entire prompt history again and again. With a [[Glossary#KV cache|KV cache]]:

- the prompt is processed once in the [[Glossary#Prefill|prefill]] phase
- keys and values are stored
- each decode step appends only one new position per layer

That is why cached decoding is the standard runtime path in modern chat systems.

## What the notebook demonstrates

The companion notebook is intentionally small. It shows:

- prefill builds the initial cache
- decode adds one token at a time
- cached decoding is measurably faster than naive recomputation

That is the right conceptual surface. You do not need a full vLLM implementation to understand the systems idea.

> [!example] Notebook follow-up
> - [notebooks/inference_runtime_and_kv_cache/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/inference_runtime_and_kv_cache/lecture_walkthrough.ipynb)
>
> Use this notebook here to inspect prefill, decode, and cache growth immediately after the runtime summary above.
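To make the cached-versus-naive contrast concrete before moving on, here is a minimal, self-contained sketch of the comparison the notebook measures. It is a toy, not the picoLLM engine or the notebook code: one layer, one attention head, no tokenizer or sampling, and the next "token" is just the attention output fed straight back in. All weights, sizes, and function names below are illustrative.

```python
# Toy contrast between naive recomputation and KV-cached decoding.
import time
import torch

torch.manual_seed(0)
d = 256                                  # toy hidden size
W_q = torch.randn(d, d) / d ** 0.5
W_k = torch.randn(d, d) / d ** 0.5
W_v = torch.randn(d, d) / d ** 0.5

def attend(q, K, V):
    """Single-query attention: q is (1, d), K and V are (t, d)."""
    scores = (q @ K.T) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ V

def naive_decode(x, steps):
    # Recompute K and V for the full history at every step.
    for _ in range(steps):
        K, V = x @ W_k, x @ W_v
        q = x[-1:] @ W_q
        x = torch.cat([x, attend(q, K, V)], dim=0)
    return x

def cached_decode(x, steps):
    # Prefill: project the whole prompt once, then append one K/V row per step.
    K, V = x @ W_k, x @ W_v
    for _ in range(steps):
        q = x[-1:] @ W_q
        new = attend(q, K, V)            # a real model would sample a token and re-embed it
        x = torch.cat([x, new], dim=0)
        K = torch.cat([K, new @ W_k], dim=0)
        V = torch.cat([V, new @ W_v], dim=0)
    return x

prompt = torch.randn(1024, d)            # stand-in for a 1024-token prompt
for fn in (naive_decode, cached_decode):
    t0 = time.perf_counter()
    fn(prompt.clone(), steps=256)
    print(f"{fn.__name__}: {time.perf_counter() - t0:.2f}s")
```

The asymmetry is the whole point: naive decode redoes the K/V projections for the entire history at every step, while cached decode only projects the newest position and reuses everything already sitting in `K` and `V`.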
## Why this matters

It is easy to think of inference as "just calling generate." That hides the real runtime story.

The right mental model is:

- prefill is prompt ingestion
- decode is iterative generation
- cache growth is a memory cost
- responsiveness depends on how efficiently those phases are handled

## How picoLLM exposes this runtime in practice

The current serious runtime surface is `picollm/accelerated/engine.py`, which is used by:

- `picollm/accelerated/pretrain/eval.py` for base-model sampling
- `picollm/accelerated/chat/cli.py` for terminal chat
- `picollm/accelerated/chat/web.py` for browser chat and API-compatible serving

That is why runtime concepts such as prefill, decode, and sampling are now visible in the actual published surfaces.

## Sampling controls now visible at the runtime boundary

The current CLI and web surfaces expose:

- `temperature`
- `top_k`
- `top_p`
- `min_p`
- `max_tokens`
- `seed`

That matters because the sampling theory note can now be connected directly to concrete runtime controls. A sketch of how these knobs act on a single logits vector appears at the end of this note.

## Relationship to the rest of the course

Teach this after:

- [[Decoder Block]]
- [[Inference and Sampling]]

Teach it before:

- [[Serving, Latency, and Observability]]
- [[Advanced Inference Systems]]

That sequence preserves the abstraction ladder from mechanism to runtime to systems optimization.

## Key takeaway

The KV cache is not an implementation detail. It is one of the central ideas that turns autoregressive generation from a toy loop into a usable product runtime.

> [!example] Notebook walkthroughs in this lecture
> Use these companion notebook links as you read or review this lecture:
>
> - [notebooks/inference_runtime_and_kv_cache/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/inference_runtime_and_kv_cache/lecture_walkthrough.ipynb)

<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
    <div><a class="internal-link" data-href="Reproducibility and Research Method" href="Reproducibility%20and%20Research%20Method">Reproducibility and Research Method</a></div>
  </div>
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
    <div><a class="internal-link" data-href="Quantization" href="Quantization">Quantization</a></div>
  </div>
</div>

## Further reading

- Hugging Face, "Caching," 2025. https://huggingface.co/docs/transformers/main/cache_explanation
- vLLM team, "Efficient Memory Management for Large Language Model Serving with PagedAttention," 2023. https://arxiv.org/abs/2309.06180
- Tri Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness," 2022. https://arxiv.org/abs/2205.14135
- vLLM, "Documentation," 2025. https://docs.vllm.ai/
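To close the loop with the sampling-controls section above, here is a hedged sketch of how `temperature`, `top_k`, `top_p`, `min_p`, and `seed` can act on a single logits vector; `max_tokens` simply bounds how many times the decode loop runs. The helper is illustrative plain PyTorch, not picoLLM's actual sampler, and real runtimes may order or combine these filters differently.

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, min_p=0.0, seed=None):
    """Hypothetical sampler: applies the runtime knobs to one (vocab,) logits vector."""
    gen = torch.Generator().manual_seed(seed) if seed is not None else None

    # temperature rescales logits before softmax; lower values sharpen the distribution
    probs = torch.softmax(logits / max(temperature, 1e-5), dim=-1)

    # min_p: drop tokens whose probability is below min_p times the top probability
    if min_p > 0.0:
        probs = torch.where(probs >= min_p * probs.max(), probs, torch.zeros_like(probs))

    # top_k: keep only the k most likely tokens (ties may keep a few more)
    if top_k > 0:
        kth = torch.topk(probs, min(top_k, probs.numel())).values[-1]
        probs = torch.where(probs >= kth, probs, torch.zeros_like(probs))

    # top_p (nucleus): keep the smallest high-probability set whose mass reaches top_p
    if top_p < 1.0:
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        keep = (torch.cumsum(sorted_probs, dim=-1) - sorted_probs) < top_p
        mask = torch.zeros_like(probs).scatter(0, sorted_idx, keep.float())
        probs = torch.where(mask > 0, probs, torch.zeros_like(probs))

    probs = probs / probs.sum()  # renormalize over the surviving tokens
    return torch.multinomial(probs, 1, generator=gen).item()

# tiny fake vocabulary of 8 tokens, just to show the call shape
logits = torch.tensor([2.0, 1.5, 0.3, -1.0, 0.0, 0.8, -2.0, 1.1])
print(sample_next_token(logits, temperature=0.7, top_k=5, top_p=0.9, min_p=0.05, seed=42))
```

In this toy each call reseeds its own generator; a real runtime would typically seed one generator for the entire decode loop so the full completion is reproducible.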