> [!info] Course code
> Use the companion repository for runnable demos and scripts for this lecture:
> - [picollm/accelerated/chat/web.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/web.py)
> - [picollm/accelerated/ui.html](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/ui.html)
> - [scripts/deployment/smoke_test_accelerated.py](https://github.com/Montekkundan/llm/blob/main/scripts/deployment/smoke_test_accelerated.py)

## What This Concept Is

At some point a model has to stop being just a notebook experiment and become something other programs can call. This note explains that step by putting the model behind an HTTP service.

If the earlier notes teach how the model thinks, this note teaches how other software talks to it.

## Foundation Terms You Need First

A **[[Glossary#Backend|backend]]** is the server-side code that loads the model and handles requests. A **route** is one HTTP endpoint exposed by that backend. A **request-response cycle** is the simple loop of receiving input, running generation, and sending output back. A **service contract** is the input and output shape the client relies on.

The main shift in this note, then, is from model internals to interface boundaries. You are not changing what the model is; you are changing how it is exposed.

```mermaid
flowchart TD
    A["Browser UI or API client"] --> B["FastAPI routes in chat/web.py"]
    B --> C["Worker pool and loaded checkpoint"]
    C --> D["picoLLM Engine.generate_batch(...)"]
    D --> E["JSON response or stream chunks"]
```

## Why an API layer is necessary

A trained model is not yet a product. Users and other services need a stable way to interact with it. That means exposing endpoints with defined request and response shapes.

In the course repo, FastAPI is the boundary where:

- prompts become HTTP requests
- generation becomes a service
- model metadata becomes inspectable
- the frontend becomes only one client among many

## What the companion demo app exposes

The current app exposes:

- `GET /` for the browser UI
- `GET /health` for service health and default generation settings
- `GET /stats` for runtime stats
- `GET /v1/models` for model discovery
- `POST /chat/completions` for browser streaming
- `POST /v1/chat/completions` for [[Glossary#OpenAI-compatible API|OpenAI-compatible]] chat clients

This is better than a toy single-endpoint demo because you see that production apps are defined by contracts, not just by a `generate()` call.

## Why FastAPI fits this note

FastAPI is explicit enough for you to understand request validation and response shaping, but small enough that the ML story is not buried under framework ceremony.[^1][^2]

It also teaches a valuable engineering principle:

> validation belongs at the system boundary.

## What the API validates

The app uses Pydantic models to validate inputs such as:

- message lists
- token limits
- [[Glossary#Temperature|temperature]]
- [[Glossary#Top-k|top-k]] settings
- [[Glossary#Top-p|top-p]] and min-p settings
- optional seed selection

That matters because many demo failures come from sloppy boundaries, not from bad ML.
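To make the boundary concrete, here is a minimal sketch of what such a request model can look like, assuming Pydantic v2. The field names, defaults, and ranges are illustrative guesses, not the repo's actual definitions; `chat/web.py` has the real models.

```python
# Illustrative sketch only: mirrors the validated inputs listed above, but
# the field names, defaults, and ranges are assumptions, not the repo's code.
from typing import Literal, Optional

from pydantic import BaseModel, Field


class Message(BaseModel):
    role: Literal["system", "user", "assistant"]
    content: str


class ChatRequest(BaseModel):
    messages: list[Message] = Field(min_length=1)            # message lists
    max_tokens: int = Field(default=256, ge=1, le=4096)      # token limits
    temperature: float = Field(default=0.8, ge=0.0, le=2.0)  # temperature
    top_k: int = Field(default=40, ge=0)                     # top-k
    top_p: float = Field(default=0.95, gt=0.0, le=1.0)       # top-p
    min_p: float = Field(default=0.0, ge=0.0, le=1.0)        # min-p
    seed: Optional[int] = None                               # optional seed
```

With a model like this as the endpoint's parameter type, FastAPI returns a `422` for malformed input before any generation code runs, which is "validation belongs at the system boundary" in practice.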
## Streaming versus non-streaming responses

A chat API may expose:

- one-shot responses
- token-streaming responses

The companion demo supports both. The same model runtime can therefore serve:

- a simple local browser chat
- OpenAI-compatible streaming clients

This helps you separate model behavior from transport behavior.

## Relationship to the UI

The web UI is a client, not the system itself. The API is the durable contract.

Once you understand that split, you understand why:

- the CLI can exist beside the browser
- a different frontend could be swapped in later
- the same service can be tested independently of the UI

The current built-in UI also exposes runtime details that are worth understanding directly:

- persisted generation settings
- presets such as balanced, creative, and deterministic
- model source badge (`base` versus `sft`)
- copy and clear actions
- explicit seed handling

## CLI and web chat as shared clients

It is useful to show two interfaces over the same generation runtime:

- the CLI is ideal for debugging and deterministic testing
- the browser is ideal for product intuition and user interaction

What should stay shared between them is:

- prompt formatting
- [[Glossary#Checkpoint|checkpoint]] loading
- generation config
- stop behavior
- model identity

If those drift apart, you stop seeing one system and start seeing unrelated demos.

For the product-facing client examples, look at:

- [scripts/cli_and_web_chat/chat_cli.py](https://github.com/Montekkundan/llm/blob/main/scripts/cli_and_web_chat/chat_cli.py)
- [scripts/cli_and_web_chat/chat_web.py](https://github.com/Montekkundan/llm/blob/main/scripts/cli_and_web_chat/chat_web.py)
- [scripts/cli_and_web_chat/ui/index.html](https://github.com/Montekkundan/llm/blob/main/scripts/cli_and_web_chat/ui/index.html)

## What to inspect in code

Walk through these locations:

1. [picollm/accelerated/chat/web.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/web.py)
   This defines request models, endpoints, streaming behavior, worker loading, and OpenAI-compatible routing.
2. [picollm/accelerated/ui.html](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/ui.html)
   This shows that the browser is only a thin client over the API.
3. [scripts/deployment/smoke_test_accelerated.py](https://github.com/Montekkundan/llm/blob/main/scripts/deployment/smoke_test_accelerated.py)
   This shows how to validate the running backend from outside the UI.

## Common confusions

### "Is the web page the app?"

No. The page is one interface over the API.

### "Why bother with `/health` or `/stats`?"

Because real services must be inspectable and debuggable from outside the model logic.

### "Why use an OpenAI-compatible route?"

Because compatibility lets other clients and SDKs call the local model without inventing a custom protocol.[^3]

## A useful check

Do three things in sequence:

1. open the browser UI
2. call `GET /health`
3. call `POST /v1/chat/completions`

That makes it clear that one model can sit behind three different interaction surfaces. Scripted versions of steps 2 and 3 follow.
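First, a minimal non-streaming sketch, assuming the server is running locally. The base URL, port, and model id are placeholders; `GET /v1/models` reports the real ids.

```python
# Minimal outside-in check of the running service. The base URL and model id
# are placeholders; adjust them to match how you launched chat/web.py.
import requests

BASE = "http://localhost:8000"

# Step 2: the health check confirms the service is up and reports defaults.
health = requests.get(f"{BASE}/health", timeout=10)
health.raise_for_status()
print("health:", health.json())

# Step 3: a one-shot completion through the OpenAI-compatible route.
resp = requests.post(
    f"{BASE}/v1/chat/completions",
    json={
        "model": "picollm",  # placeholder; query GET /v1/models for real ids
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32,
    },
    timeout=60,
)
resp.raise_for_status()
print("reply:", resp.json()["choices"][0]["message"]["content"])
```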
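The streaming route deserves the same outside-in treatment. The sketch below assumes the common OpenAI-style convention of Server-Sent Events, one `data: {json}` line per chunk terminated by `data: [DONE]`; if the local route frames its stream differently, adjust the parsing.

```python
# Streaming variant of the check above. Assumes the common OpenAI-style
# SSE framing ("data: {json}" chunks ending with "data: [DONE]").
import json

import requests

BASE = "http://localhost:8000"  # placeholder; match your launch settings

with requests.post(
    f"{BASE}/v1/chat/completions",
    json={
        "model": "picollm",  # placeholder; see GET /v1/models
        "messages": [{"role": "user", "content": "Count to five."}],
        "stream": True,
    },
    stream=True,  # let requests hand chunks over as they arrive
    timeout=60,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue  # skip keep-alives and blank separator lines
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        print(delta.get("content", ""), end="", flush=True)
    print()
```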
## Key takeaway

The FastAPI app is where the course stops being "some ML code" and becomes a proper software system with an interface, runtime behavior, and deployment target.

<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
    <div><a class="internal-link" data-href="Serving, Latency, and Observability" href="Serving%2C%20Latency%2C%20and%20Observability">Serving, Latency, and Observability</a></div>
  </div>
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
    <div><a class="internal-link" data-href="Deployment" href="Deployment">Deployment</a></div>
  </div>
</div>

## Further reading

- Sebastián Ramírez, "StreamingResponse," 2025. https://fastapi.tiangolo.com/advanced/custom-response/#streamingresponse

---

## References

[^1]: Sebastián Ramírez, "FastAPI documentation," 2025. https://fastapi.tiangolo.com/
[^2]: Samuel Colvin et al., "Pydantic documentation," 2025. https://docs.pydantic.dev/latest/
[^3]: OpenAI, "Chat Completions API reference," 2025. https://platform.openai.com/docs/api-reference/chat