> [!info] Course code
> Use the companion repository for runnable demos and scripts for this lecture:
> - [picollm/accelerated/README.md](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/README.md)
> - [scripts/deployment/smoke_test_accelerated.py](https://github.com/Montekkundan/llm/blob/main/scripts/deployment/smoke_test_accelerated.py)
> - [scripts/upload_picollm_model_to_hf.py](https://github.com/Montekkundan/llm/blob/main/scripts/upload_picollm_model_to_hf.py)
> - [scripts/upload_picollm_archive_to_hf.py](https://github.com/Montekkundan/llm/blob/main/scripts/upload_picollm_archive_to_hf.py)
> - [scripts/restore_picollm_from_hf.py](https://github.com/Montekkundan/llm/blob/main/scripts/restore_picollm_from_hf.py)
> - [scripts/export_picollm_to_transformers.py](https://github.com/Montekkundan/llm/blob/main/scripts/export_picollm_to_transformers.py)
> - [scripts/export_picollm_to_gguf.py](https://github.com/Montekkundan/llm/blob/main/scripts/export_picollm_to_gguf.py)

## What This Concept Is

Suppose the model works on your machine today. That is not the same thing as having a system you can ship, restore, test, or run somewhere else tomorrow. Deployment is the step where you package the model, move it, restore it, test it, and prove that it can run outside the original training context.

This is the note where the project has to become reproducible instead of personal.

## Foundation Terms You Need First

Keep one very practical picture in your head: you trained a model, saved some files, copied them somewhere else, and now you need to know whether the whole thing still works.

An **artifact** is the saved bundle representing the model and its metadata. **Restore** means rebuilding a runnable environment from those saved files. A **smoke test** is a quick end-to-end check that the restored system actually works. A **client surface** is the CLI, web app, or API that talks to the deployed model.

So as you read, keep the bar in mind: deployment is not only "copy files somewhere." It is "make the system runnable and checkable on another machine or in another environment."

```mermaid
flowchart TD
    A["Local artifacts in PICOLLM_BASE_DIR"] --> B["Inference bundle or archive bundle"]
    B --> C["Optional HF upload"]
    C --> D["Restore on another machine"]
    D --> E["smoke_test_accelerated.py"]
    E --> F["CLI, web UI, or compatible API client"]
```

## What deployment actually means

For this course, deployment means:

- packaging the application and dependencies
- defining the service entrypoint
- making [[Glossary#Checkpoint|checkpoint]] and [[Glossary#Tokenizer|tokenizer]] artifacts available
- exposing the API on a reachable host
- verifying the running service behaves correctly

This is not "DevOps after the real work." It is part of the real work.

## The deployment target in this repo

The default deployment path is:

- a Docker container
- a single cloud VM

That target is a good one because it exposes real operational concerns without burying you in orchestration complexity.

## Why Docker is the default packaging layer

Docker gives you one concrete artifact[^1] [^2] that bundles:

- the Python environment
- the project code
- the startup command
- runtime dependencies

It is the cleanest way to teach the difference between "works in my shell" and "works from a clean image." A minimal sketch of that packaging idea follows.
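To make the clean-image idea concrete, here is a minimal Dockerfile sketch for a FastAPI-style serving app. It is **not** the repo's actual Dockerfile: the `requirements.txt` layout, the `app.main:app` entrypoint, and port `8000` are all illustrative assumptions.

```dockerfile
# Minimal sketch, not the repo's real Dockerfile.
FROM python:3.11-slim

WORKDIR /app

# Install pinned dependencies first so this layer caches across code edits.
# Assumes fastapi and uvicorn are listed in requirements.txt.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the project code. Model artifacts are restored or mounted at runtime
# rather than baked into the image, so the checkpoint can change without a
# rebuild.
COPY . .

EXPOSE 8000

# Hypothetical entrypoint: replace app.main:app with the real service module.
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

If `docker run` of this image behaves differently from your shell session, that difference is exactly the hidden assumption Docker exists to surface.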
## What you must remember to ship with the model

Deployment is not only code shipping. It is artifact shipping. You must ensure:

- the checkpoint and tokenizer belong together
- the runtime config matches the checkpoint
- identity and manifest assumptions are explicit when post-training behavior depends on them
- ports and bindings are correct
- the API can start from a clean environment

This is where earlier system discipline pays off.

## Common deployment failure classes

You should explicitly learn these failure modes:

- missing checkpoint files
- mismatched tokenizer and checkpoint
- wrong host or port binding
- missing environment or startup assumptions
- a health endpoint that reports "up" while the model is not really ready
- memory exhaustion from a checkpoint that is too large for the target machine

The point is not to scare you. The point is that deployment reveals hidden assumptions.

## What to inspect in code and docs

Use this order:

1. [picollm/accelerated/README.md](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/README.md)
   The current operator-facing guide for speedrun, smoke tests, restore, upload, and export.
2. [scripts/deployment/smoke_test_accelerated.py](https://github.com/Montekkundan/llm/blob/main/scripts/deployment/smoke_test_accelerated.py)
   The current smoke test for a running accelerated backend.
3. [scripts/upload_picollm_model_to_hf.py](https://github.com/Montekkundan/llm/blob/main/scripts/upload_picollm_model_to_hf.py)
   Publishes the inference-focused artifact bundle.
4. [scripts/upload_picollm_archive_to_hf.py](https://github.com/Montekkundan/llm/blob/main/scripts/upload_picollm_archive_to_hf.py)
   Publishes the fuller archive bundle.
5. [scripts/restore_picollm_from_hf.py](https://github.com/Montekkundan/llm/blob/main/scripts/restore_picollm_from_hf.py)
   Proves that a published model repo is actually restorable.

## The artifact split you must understand

Current picoLLM deployment deliberately separates:

- **model repo**: lighter, inference-oriented, meant for restore and chat
- **archive dataset repo**: fuller checkpoint history, including optimizer shards when present, meant for preservation and resume

That split teaches an important systems lesson: a release artifact and an exact training-state backup are not the same thing.

## Deployment now includes export bridges

The current release story also includes:

- export to a Transformers `trust_remote_code` bundle
- export to GGUF

Learn the limitation clearly, too: the GGUF export is real, but stock `llama.cpp` does not yet include a picoLLM runtime, so the native picoLLM checkpoint format remains the fully supported runtime path in this repo.

## A good deployment sequence

1. run the app locally
2. build the Docker image
3. run the image locally
4. verify `/health`, `/v1/models`, and `/v1/chat/completions`
5. move the same container to a VM

That sequence teaches continuity instead of making deployment feel like a disconnected final trick. The sketch after this list shows what step 4 looks like as a script.
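Here is a minimal standalone sketch of that verification step. The authoritative check lives in [scripts/deployment/smoke_test_accelerated.py](https://github.com/Montekkundan/llm/blob/main/scripts/deployment/smoke_test_accelerated.py); this version only assumes an OpenAI-compatible server, and the host, port, prompt, and model-selection logic below are illustrative assumptions, not repo behavior.

```python
"""Standalone smoke-test sketch for an OpenAI-compatible serving endpoint.

Assumptions (not from the repo): server bound to localhost:8000, standard
/health, /v1/models, and /v1/chat/completions routes.
"""
import json
import urllib.request

BASE = "http://localhost:8000"  # assumption: where the container is bound


def get(path: str) -> dict:
    with urllib.request.urlopen(f"{BASE}{path}", timeout=30) as resp:
        return json.load(resp)


def post(path: str, payload: dict) -> dict:
    req = urllib.request.Request(
        f"{BASE}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)


# 1. Liveness: the process answers at all.
health = get("/health")
print("health:", health)

# 2. Readiness: the model is actually loaded and listed.
models = get("/v1/models")["data"]
assert models, "server is up but reports no models"

# 3. End to end: a real completion comes back with content. This is the check
# that catches "health says up but the model is not really ready."
chat = post("/v1/chat/completions", {
    "model": models[0]["id"],
    "messages": [{"role": "user", "content": "Say hi in one word."}],
    "max_tokens": 16,
})
assert chat["choices"][0]["message"]["content"].strip(), "empty completion"
print("smoke test passed")
```

Run it once against the local container and once against the VM; the same script passing in both places is the continuity the sequence is meant to teach.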
## Key takeaway

Deployment is how an LLM project proves that its theory, systems, and product layers are coherent enough to survive outside the notebook where it was born. [^3]

<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
    <div><a class="internal-link" data-href="FastAPI Chat App" href="FastAPI%20Chat%20App">FastAPI Chat App</a></div>
  </div>
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
    <div><a class="internal-link" data-href="OpenTUI Terminal Chat App" href="OpenTUI%20Terminal%20Chat%20App">OpenTUI Terminal Chat App</a></div>
  </div>
</div>

## Further reading

- Sebastián Ramírez, "FastAPI in containers - Docker," 2025. https://fastapi.tiangolo.com/deployment/docker/

---

## References

[^1]: Docker, "Docker documentation," 2025. https://docs.docker.com/
[^2]: Open Container Initiative, "OCI Image Spec," 2024. https://specs.opencontainers.org/image-spec/
[^3]: Adam Wiggins, "The Twelve-Factor App," 2017. https://12factor.net/