> [!info] Course code
> Use these repo paths together with this note:[^1]
> - [picollm/accelerated/chat/web.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/web.py)
> - [picollm/accelerated/report.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/report.py)
> - [picollm/accelerated/engine.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/engine.py)
> - [apps/vercel_ai_sdk_chat/README.md](https://github.com/Montekkundan/llm/blob/main/apps/vercel_ai_sdk_chat/README.md)
## What This Concept Is
Training gives you a checkpoint. Serving asks a much more operational question: if a real user sends a real request right now, can this system answer quickly enough, cheaply enough, and clearly enough that you would trust it in production? This note is about that backend reality.
It is where the system becomes something you have to run, not just something you can admire.
## Foundation Terms You Need First
Start from one user request and then widen the view. **[[Glossary#Latency|Latency]]** is how long one request or generation step takes. **[[Glossary#Throughput|Throughput]]** is how much useful serving work finishes per unit time. **[[Glossary#Concurrency|Concurrency]]** is how many requests the system handles at once. **[[Glossary#Observability|Observability]]** is the logging and measurement layer that tells you what the system is doing.
Those terms are easy to mix together, but they answer different operator questions: how long does one request take, how much useful work finishes per unit time, how many requests are in flight at once, and how do we know what is happening?
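To make the first two terms concrete, here is a minimal sketch that times a stand-in `generate` call. The function and its timings are hypothetical, not a picoLLM API, but the two printed numbers correspond to the per-request view (latency) and the system view (throughput).

```python
import time

def generate(prompt: str) -> str:
    # Hypothetical stand-in for a call into the serving backend.
    time.sleep(0.05)  # pretend this is generation work
    return prompt.upper()

prompts = ["hello", "how long is one request?", "how many finish per second?"]

latencies = []
start = time.perf_counter()
for p in prompts:
    t0 = time.perf_counter()
    generate(p)
    latencies.append(time.perf_counter() - t0)  # latency: one request's wall time
elapsed = time.perf_counter() - start

print(f"median latency: {sorted(latencies)[len(latencies) // 2]:.3f} s")
print(f"throughput:     {len(prompts) / elapsed:.2f} requests/s")
```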
## The picoLLM-first framing
For this course, the serving stack should be explained in this order:
- `picollm/accelerated/engine.py` for generation runtime
- `picollm/accelerated/chat/web.py` for the API boundary
- `picollm/accelerated/report.py` and health/stats endpoints for [[Glossary#Observability|observability]] hooks
- product clients such as the Vercel app or OpenTUI only after the backend is clear
That keeps the backend as the main object of study instead of letting the client apps dominate the discussion.
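To see what "API boundary" means in this ordering, here is a deliberately minimal FastAPI-style sketch. It is not the actual `web.py`; the route names, request model, and stubbed `generate` are illustrative assumptions only.

```python
# Hypothetical sketch of an API boundary around a generation engine.
# NOT the actual picollm/accelerated/chat/web.py; all names are illustrative.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128

def generate(prompt: str, max_new_tokens: int) -> str:
    # Stand-in for the engine call (in the real stack, engine.py owns generation).
    return f"echo: {prompt[:max_new_tokens]}"

@app.post("/chat")
def chat(req: ChatRequest) -> dict:
    # The boundary's job: translate HTTP requests into engine calls and back.
    return {"text": generate(req.prompt, req.max_new_tokens)}

@app.get("/health")
def health() -> dict:
    # Availability check plus the default generation config, as the note describes.
    return {"status": "ok", "default_max_new_tokens": 128}
```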
## Latency and throughput are different goals
Latency asks: how long does one request take? Throughput asks: how many requests or tokens can the system serve per unit time? These goals are related but not identical. A configuration optimized for single-user responsiveness may not maximize total system throughput, and a configuration that serves many requests efficiently may still feel slow to one user if batching and queueing are heavy.[^2]
You should learn this distinction early because production trade-offs become much easier to reason about once these two terms are separated clearly.
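The sketch below uses a made-up linear cost model (the constants are assumptions, not measurements from this repo) just to show the shape of the trade-off: larger batches raise throughput but also raise the latency each user sees per step.

```python
# Toy cost model: per-step time grows with batch size, so throughput rises
# while each user's per-step latency also rises. Constants are invented.
def step_time_s(batch_size: int, base: float = 0.020, per_request: float = 0.004) -> float:
    return base + per_request * batch_size

for batch_size in (1, 4, 16, 64):
    t = step_time_s(batch_size)
    throughput = batch_size / t  # requests progressed per second
    print(f"batch={batch_size:>3}  per-step latency={t * 1000:5.1f} ms  "
          f"throughput={throughput:6.1f} req-steps/s")
```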
## Why inference is still expensive
Inference is cheaper than training per token, but it is still expensive because serving is autoregressive, memory-hungry, and continuous. The model generates one token at a time. The [[Glossary#KV cache|KV cache]] grows with context length. Multiple users increase concurrency pressure. And unlike training, the service may need to stay online all day.[^3]
That is why product teams often care deeply about smaller models, [[Glossary#Quantization|quantization]], batching strategy, and observability. A slightly better model that doubles latency and serving cost may not be the right product choice.
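A quick back-of-envelope calculation makes the KV-cache point tangible. The model shape below is invented for illustration and is not picoLLM's configuration.

```python
# Back-of-envelope KV cache size for a decoder-only transformer with fp16
# keys and values (2 bytes each). All numbers are illustrative assumptions.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x for keys and values; one entry per layer, head, and position, per request.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Example: a small 12-layer model, 12 KV heads of dim 64, 2048-token context, 8 users.
size = kv_cache_bytes(n_layers=12, n_kv_heads=12, head_dim=64, seq_len=2048, batch=8)
print(f"{size / 1e9:.2f} GB of KV cache")  # grows linearly with context length and concurrency
```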
## Observability: what to watch in production
Observability in serving is the product-side analogue of experiment tracking in training. The service should make it possible to answer questions like:
- Are requests succeeding?
- What is the end-to-end latency?
- Is token generation slowing down?
- Are users hitting long contexts?
- Is GPU memory pressure increasing?
- Are certain prompts causing failure patterns?
Without observability, operators are flying blind. They may know that users are unhappy, but not why.
In the current picoLLM path, you should explicitly watch:
- `/health` for service availability and default generation config
- `/stats` for lightweight runtime information
- report sections and `run_manifest.json` from the training side
- smoke tests from `scripts/deployment/smoke_test_accelerated.py`
That is enough to teach the idea without pretending this repo is already a full Prometheus/OpenTelemetry production stack.
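As a minimal operator-side example, the sketch below polls the two endpoints listed above. The base URL and whatever fields appear in the JSON responses are assumptions about a local run, not a documented schema.

```python
# Minimal operator check against the serving endpoints the note lists.
# The base URL and response contents are assumptions, not a spec.
import requests

BASE_URL = "http://localhost:8000"  # wherever the accelerated chat server is running

for path in ("/health", "/stats"):
    resp = requests.get(f"{BASE_URL}{path}", timeout=5)
    resp.raise_for_status()      # fail loudly if the service is down or unhealthy
    print(path, resp.json())     # inspect availability and lightweight runtime info
```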
## Relationship to rasbt and nanochat
The reference split should stay consistent here too:
- `rasbt/LLMs-from-scratch` is the clean concept-first reference for understanding why generation works at all
- `picollm` is the course’s real serving path
- `nanochat` remains a useful external systems comparison if you want to inspect a more optimized end-to-end stack
## Why this matters for the course
A course that teaches only how to train is incomplete. A course that teaches training plus serving begins to look like real systems work. Adding observability to that serving story makes it much closer to what production ML teams actually do.
<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
<div><a class="internal-link" data-href="Quantization" href="Quantization">Quantization</a></div>
</div>
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
<div><a class="internal-link" data-href="FastAPI Chat App" href="FastAPI%20Chat%20App">FastAPI Chat App</a></div>
</div>
</div>
## Further reading
- vLLM team, "Efficient Memory Management for Large Language Model Serving with PagedAttention," 2023. https://arxiv.org/abs/2309.06180
- Tri Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness," 2022. https://arxiv.org/abs/2205.14135
- OpenTelemetry, "Documentation," 2025. https://opentelemetry.io/docs/
- Prometheus, "Documentation overview," 2025. https://prometheus.io/docs/introduction/overview/
## References
[^1]: Montekkundan, [llm repository](https://github.com/Montekkundan/llm)
[^2]: Hugging Face, [Text generation inference concepts](https://huggingface.co/docs/text-generation-inference/index)
[^3]: NVIDIA, [Mastering LLM Techniques: Inference Optimization](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/)