> [!info] Course code
> Use these repo paths together with this note:
> - [picollm/accelerated/engine.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/engine.py)
> - [picollm/accelerated/flash_attention.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/flash_attention.py)
> - [picollm/accelerated/chat/web.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/web.py)

## What This Concept Is

Basic inference tells you how a model generates one response. Advanced inference asks what happens when many users show up, prompts get long, memory gets tight, and the system still has to stay fast. That is the jump this note cares about.

This note is not about changing the model's objective. It is about operating the frozen model as a serious runtime system.

## Foundation Terms You Need First

Start from one overloaded server in your head. It receives prompts, builds state, keeps caches alive, schedules work, and tries not to waste memory.

Two phases matter immediately: **[[Glossary#Prefill|prefill]]**, where the prompt is processed and the initial runtime state is built, and **[[Glossary#Decode|decode]]**, where one new token is generated at a time. The **[[Glossary#KV cache|KV cache]]** stores reusable attention state between decode steps. A **serving system** is the larger runtime layer that manages scheduling, memory, batching, and API behavior around the checkpoint.

If you keep those layers separate, the systems trade-offs in this note become much easier to follow.

## Prefill versus decode

Inference has two different phases. [[Glossary#Prefill|Prefill]] processes the prompt and builds the initial internal state. Decode generates new tokens autoregressively, one step at a time. The two phases have different performance profiles: prefill can process every prompt token in parallel and tends to be compute-bound, while decode produces one token per step and is usually limited by memory bandwidth. That is why serious serving systems analyze them separately.[^1]

You should learn to ask separately:

- how expensive is prompt ingestion?
- how expensive is each generated token?

That distinction explains why long prompts and long completions stress the system in different ways.

## KV-cache scaling

The [[Glossary#KV cache|KV cache]] is essential because it avoids recomputing attention over the entire prompt history for every generated token. But the cache consumes memory, and that memory grows with [[Glossary#Context window|context window]] length and [[Glossary#Concurrency|concurrency]]. As a result, serving often becomes constrained not only by compute but by memory capacity and bandwidth.[^2]

This is the practical reason that multi-user serving can become memory-bound even when the GPU still looks underutilized from a naive compute perspective.
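To make that memory pressure concrete, here is a back-of-envelope KV-cache sizing sketch. The model shape and precision are illustrative (roughly a 7B-class decoder with full multi-head attention and an fp16 cache), not values taken from `picollm`; substitute your own layer count, head count, and head dimension.

```python
"""Back-of-envelope KV-cache sizing. The config below is illustrative
(roughly a 7B-class model with full multi-head attention, fp16 cache)."""

def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    n_requests: int,
    bytes_per_elem: int = 2,  # fp16/bf16 cache entries
) -> int:
    # Two tensors (K and V) per layer, each of shape [n_kv_heads, seq_len, head_dim].
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * n_requests

# Illustrative 7B-ish shape: one long-context request vs. batches of shorter ones.
cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128)
for seq_len, n_req in [(4096, 1), (4096, 16), (1024, 64)]:
    gib = kv_cache_bytes(**cfg, seq_len=seq_len, n_requests=n_req) / 2**30
    print(f"ctx={seq_len:5d}  concurrency={n_req:3d}  KV cache ~ {gib:6.1f} GiB")
```

With these illustrative numbers the cache costs about 0.5 MiB per token, so a single 4k-token request already holds roughly 2 GiB of cache, and sixteen of them hold 32 GiB before weights or activations are counted. That context-times-concurrency product is exactly what the section above is warning about.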
## Batching schedulers

Batching improves [[Glossary#Throughput|throughput]] by serving multiple requests together, but it can also increase [[Glossary#Latency|latency]] for individual users if queueing becomes significant. Good inference systems therefore balance batching aggressiveness against responsiveness.

At research level, this becomes a scheduling problem:

- static batching is simple but wasteful
- continuous batching raises utilization
- queue discipline changes tail latency

The serving objective is not "maximize throughput" in isolation. It is to optimize under an SLA or product goal.

## Speculative decoding

[[Glossary#Speculative decoding|Speculative decoding]] is one of the more interesting ideas in modern inference optimization. A smaller or auxiliary model proposes tokens, and the larger target model verifies them. If the proposals are good enough, generation can accelerate significantly. This is a nice example of how systems ingenuity can improve user-visible speed without changing the [[Glossary#Base model|base model]] weights.[^3]

The key advanced insight is that serving can improve by changing the algorithm around the model, not only the model itself.
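The sketch below shows the propose-then-verify control flow with stand-in next-token functions instead of real models. It is the simplified greedy-acceptance variant, not the full rejection-sampling scheme from the speculative decoding paper, and the "models" are made-up arithmetic functions purely for illustration.

```python
"""Toy greedy speculative decoding: a cheap draft proposes k tokens,
the expensive target verifies them and keeps the longest matching prefix."""

def target_next_token(seq: list[int]) -> int:
    # Stand-in for one expensive forward pass of the large target model.
    return (3 * seq[-1] + len(seq)) % 50

def draft_next_token(seq: list[int]) -> int:
    # Cheap approximation that agrees with the target most of the time.
    guess = (3 * seq[-1] + len(seq)) % 50
    return guess if len(seq) % 7 else (guess + 1) % 50  # deliberately wrong sometimes

def speculative_generate(prompt: list[int], max_new: int, k: int = 4) -> list[int]:
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1) Draft proposes k tokens autoregressively (cheap, sequential).
        draft = []
        for _ in range(k):
            draft.append(draft_next_token(seq + draft))

        # 2) Target verifies the proposals. A real engine scores all k
        #    positions in one batched forward pass; here it is a plain loop.
        accepted = []
        for tok in draft:
            t = target_next_token(seq + accepted)  # what the target would emit here
            if t == tok:
                accepted.append(tok)   # proposal matches: kept essentially for free
            else:
                accepted.append(t)     # first mismatch: take the target's token, stop
                break
        else:
            # Every proposal matched, so the verification pass yields one bonus token.
            accepted.append(target_next_token(seq + accepted))

        seq.extend(accepted)           # always makes progress: at least one token per round
    return seq[: len(prompt) + max_new]

print(speculative_generate([1, 2, 3], max_new=12))
```

In this greedy variant the output is identical to what the target alone would produce; the speedup comes from the target checking all k proposed positions in one batched pass instead of generating them one sequential step at a time.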
## Quantization trade-offs for serving

[[Glossary#Quantization|Quantization]] can reduce memory footprint and sometimes improve serving efficiency, but it is still a trade-off. Lower precision may affect quality, kernel behavior, or hardware compatibility. Good serving engineering therefore treats quantization as a measured decision, not a universal win.

You should therefore evaluate quantized serving with:

- quality regression checks
- latency checks
- memory headroom checks

## Throughput versus latency

Throughput and latency are related but not identical. The best serving configuration depends on whether the goal is single-user responsiveness, many-user throughput, or a compromise between them. Serving is optimization under product constraints, not a single number to maximize.

Tail latency matters especially, because users do not experience averages. They experience the slow worst cases.

## Admission control and observability

A real serving system should expose enough [[Glossary#Observability|observability]] to answer:

- are requests waiting in queue?
- is prefill dominating?
- is decode dominating?
- did concurrency push the box into memory pressure?

This is why latency benchmarking and observability belong together.

## Scope of this course and where to go deeper

This course introduces the main serving concepts and grounds them in the accelerated runtime files: `picollm/accelerated/engine.py` for token generation flow, `picollm/accelerated/flash_attention.py` for the fast attention path, and `picollm/accelerated/chat/web.py` for the product-facing server. That is enough to show how to reason about prefill, decode, KV-cache cost, and throughput-versus-latency trade-offs in a concrete way.

A deeper follow-on module could focus entirely on scheduler design, paged attention, speculative decoding systems, cache eviction strategies, continuous batching, and production inference backends such as TGI or vLLM.

If you want to go deeper after this note, the best next steps are:

- compare latency and throughput across different batch and decoding settings
- study speculative decoding and paged-attention papers
- [[Glossary#Benchmark|benchmark]] prompt length versus response length separately
- treat serving as a systems optimization problem rather than only a frontend integration problem

<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
    <div><a class="internal-link" data-href="Advanced Distributed Training Systems" href="Advanced%20Distributed%20Training%20Systems">Advanced Distributed Training Systems</a></div>
  </div>
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
    <div><a class="internal-link" data-href="Post-Training Beyond SFT" href="Post-Training%20Beyond%20SFT">Post-Training Beyond SFT</a></div>
  </div>
</div>

## References

[^1]: NVIDIA, [Mastering LLM Techniques: Inference Optimization](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/)
[^2]: Hugging Face, [Text Generation Inference](https://huggingface.co/docs/text-generation-inference/index)
[^3]: Yaniv Leviathan, Matan Kalman, and Yossi Matias, Google, [Fast Inference from Transformers via Speculative Decoding](https://arxiv.org/abs/2211.17192)
[^4]: Hugging Face TB, [The Smol Training Guide](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook)