> [!info] Course code
> Use the companion repository for runnable notebooks, figures, and implementation references for this lecture:
> - [prompts/real_chatbot_workflow/base_vs_chat_sft_prompts.md](https://github.com/Montekkundan/llm/blob/main/prompts/real_chatbot_workflow/base_vs_chat_sft_prompts.md)
> - [picollm/accelerated/README.md](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/README.md)
> - [picollm/accelerated/speedrun.sh](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/speedrun.sh)
> - [picollm/accelerated/pretrain/train.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/pretrain/train.py)
> - [picollm/accelerated/chat/sft.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/sft.py)
> - [picollm/accelerated/chat/eval.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/eval.py)
> - [picollm/accelerated/chat/cli.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/cli.py)
> - [picollm/accelerated/chat/web.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/web.py)

## What This Concept Is

If the earlier notes teach you pieces of the system one at a time, this note puts the whole practical path back together. Imagine walking from raw data all the way to a chatbot someone can actually type into. That full walk is what this note is for.

This is the note where the course title finally becomes concrete.

## Foundation Terms You Need First

Keep the run in four mental pieces. A **base model** is the checkpoint trained on general next-token prediction before assistant specialization. **[[Glossary#Post-training|Post-training]]** is the stage that reshapes that base model into a chatbot. An **evaluation artifact** is the set of prompts, reports, and metrics that let you compare stages honestly. A **launch surface** is the CLI or web interface used to test the final checkpoint.

So this note is not only about the final demo. It is about keeping track of how each stage changes the system and how you verify that change.

```mermaid
flowchart TD
    A["Dataset bootstrap"] --> B["Tokenizer train + tokenizer eval"]
    B --> C["Base pretraining"]
    C --> D["Base eval: BPB, CORE, samples"]
    D --> E["Identity asset verification"]
    E --> F["Chat SFT"]
    F --> G["Chat eval"]
    G --> H["Report + run_manifest.json"]
    H --> I["Optional HF upload"]
    I --> J["CLI or web chat"]
```

## What success looks like

By the end of this practical arc, the project should actually match the course title:

- create your own [[Glossary#Tokenizer|tokenizer]] and [[Glossary#Base model|base model]] from scratch
- post-train that model into a chatbot
- evaluate the base and chat checkpoints honestly
- serve the final [[Glossary#Checkpoint|checkpoint]] through a CLI or web UI
- understand how that backend later connects to product-facing apps

If you wonder why the cloud run is expensive or why multi-GPU matters, look back at:

- [[Compute, Time, and Cost of LLMs]]
- [[picollm Code Map]]

## Why we need a separate workflow

The tiny from-scratch model is valuable because you can understand every part of it. But it is not strong enough to behave like a modern assistant. So the course ends with a separate workflow: the accelerated `picollm` path, which behaves much more like a real training system.[^1][^2]

## The four stages in `picollm/accelerated`

### 1. Tokenizer and dataset preparation

This stage makes the raw data usable for the model.

- `picollm/accelerated/dataset.py` downloads and prepares the shard set
- `picollm/accelerated/tokenizer.py` defines the serious tokenizer and chat tokens
- `picollm/accelerated/pretrain/train_tokenizer.py` trains the tokenizer
- `picollm/accelerated/pretrain/tokenizer_eval.py` checks the resulting tokenizer

This is the first place where the repo starts to feel more like `nanochat`: the tokenizer is its own stage, not a forgotten preprocessing footnote.[^1]
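To make the "chat tokens" idea concrete, here is a minimal sketch of how a multi-turn conversation can be flattened into a single training string before SFT. The special-token names and the `render_conversation` helper below are hypothetical illustrations, not the actual API of `picollm/accelerated/tokenizer.py`.

```python
# Hypothetical sketch of chat-format rendering. The real logic lives in
# picollm/accelerated/tokenizer.py; the token names below are placeholders.
from typing import Dict, List

BOS = "<|bos|>"              # assumed document-start token
USER = "<|user|>"            # assumed user-turn delimiter
ASSISTANT = "<|assistant|>"  # assumed assistant-turn delimiter
END = "<|end|>"              # assumed turn-end delimiter

def render_conversation(messages: List[Dict[str, str]]) -> str:
    """Flatten {role, content} messages into one training string.

    SFT trains on streams shaped like this, which is why the chat tokens
    must already exist in the vocabulary when the tokenizer is trained.
    """
    parts = [BOS]
    for message in messages:
        tag = USER if message["role"] == "user" else ASSISTANT
        parts.append(f"{tag}{message['content']}{END}")
    return "".join(parts)

print(render_conversation([
    {"role": "user", "content": "Why is the sky blue?"},
    {"role": "assistant", "content": "Rayleigh scattering favors short wavelengths."},
]))
```

The design point is that the renderer, not the model, owns the turn structure: the model only ever sees the flattened stream, so the token interface has to be fixed before pretraining begins.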
### 2. Base pretraining

This stage teaches the model general language behavior before assistant specialization.

- `picollm/accelerated/pretrain/train.py` is the serious base-training entrypoint
- `picollm/accelerated/gpt.py` defines the actual model stack
- `picollm/accelerated/checkpoint_manager.py` handles checkpoint save and load

This is where you should connect earlier theory notes back to a real system:

- tokenization determines the IDs
- embeddings and RoPE define the representation space
- the [[Glossary#Decoder block|decoder block]] is stacked in `gpt.py`
- distributed launch and precision choices determine the runtime story

### 3. Chat SFT and evaluation

This stage bends the base model into assistant behavior.

- `picollm/accelerated/chat/sft.py` runs supervised fine-tuning
- `picollm/accelerated/tasks/` defines the task mixture
- `picollm/accelerated/chat/eval.py` checks the final conversational behavior
- `picollm/accelerated/report.py` summarizes the run

The key point is that [[Glossary#SFT|SFT]] here is not one tiny prompt-response dataset. It is a deliberate mixture of conversation, identity, reasoning, and spelling-oriented tasks.[^2][^3][^4][^5]

### 4. CLI and web interaction

This is the visible end of the pipeline:

- `picollm/accelerated/chat/cli.py`
- `picollm/accelerated/chat/web.py`

The final lesson is that a training run is not finished when a checkpoint exists. It is finished when a human can actually interact with the resulting model and judge it.

## The exact `speedrun.sh` stages

For the serious capstone, treat the script as an operator runbook:

| Stage | What it does | Why it exists |
|---|---|---|
| Preflight | runs `speedrun_doctor` and a distributed synthetic preflight | catches hardware and config failures early |
| Dataset Bootstrap | resets reports, starts dataset work, optionally starts periodic archive sync | makes the run observable and resumable |
| Tokenizer | trains the tokenizer and runs the tokenizer eval | fixes the token interface before pretraining |
| Base Pretrain | runs `pretrain/train.py` | learns broad language behavior |
| Base Eval | runs `pretrain/eval.py` | measures bits per byte (BPB), CORE, and sample quality |
| Identity Verification | verifies canonical or hosted identity data against the manifest | makes post-training provenance explicit |
| SFT | runs `chat/sft.py` | reshapes the base model into an assistant |
| Chat Eval | runs `chat/eval.py` | measures post-SFT task performance |
| Report | generates markdown report sections and `run_manifest.json` | creates durable experimental evidence |
| HF Upload | optionally publishes model and archive artifacts separately | separates runnable release artifacts from resume-training artifacts |
| Launch | opens CLI or web chat | turns the run into an interactive product |

## Why the identity file is verified before SFT

The identity file is part of the assistant behavior contract. That is why `speedrun.sh` now either:

- uses the repo-local canonical file, or
- downloads the hosted mirror and verifies it against the manifest checksum before SFT

This is a real systems lesson: post-training behavior depends on data provenance, not only on model code.
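To make the provenance check concrete, here is a minimal sketch of manifest-checksum verification, assuming the manifest stores a SHA-256 hex digest under a JSON key. The key name, file names, and helper names are assumptions for illustration, not the script's actual layout.

```python
# Minimal sketch of manifest-based identity verification, assuming the
# manifest stores a SHA-256 hex digest. Key names, file names, and helper
# names here are illustrative, not speedrun.sh's actual layout.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file in 1 MiB chunks so large assets never sit fully in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_identity_file(identity_path: Path, manifest_path: Path) -> None:
    """Refuse to proceed to SFT if the identity data does not match the manifest."""
    manifest = json.loads(manifest_path.read_text())
    expected = manifest["identity_sha256"]  # assumed key name
    actual = sha256_of(identity_path)
    if actual != expected:
        raise RuntimeError(
            f"identity data mismatch: expected {expected}, got {actual}"
        )

if __name__ == "__main__":
    # Self-contained demo: fabricate a tiny identity file plus a matching manifest.
    identity = Path("identity_demo.jsonl")
    identity.write_text('{"role": "system", "content": "You are the course assistant."}\n')
    manifest = Path("manifest_demo.json")
    manifest.write_text(json.dumps({"identity_sha256": sha256_of(identity)}))
    verify_identity_file(identity, manifest)
    print("identity data verified against manifest")
```

Failing loudly before SFT is the cheap option: a silent mismatch would only surface later as confusing assistant behavior that looks like a modeling bug.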
## What the end of the run produces

By the time the speedrun finishes, you should expect:

- `base_checkpoints/`
- `chatsft_checkpoints/`
- `report/`
- `run_manifest.json`
- optional Hugging Face model repo upload
- optional Hugging Face archive dataset upload

## Best end-of-course demo

The best sequence here is:

1. trace one example through the tokenizer and chat-format rendering
2. show the base-pretraining entrypoint and explain the serious hardware/runtime story
3. compare a base checkpoint against the chat-SFT checkpoint on the same prompts
4. launch the final checkpoint in the CLI or web UI
5. connect that final runtime back to the course notes you already studied

## Suggested walkthrough order

Use this order if you want to trace the full chatbot build without getting lost:

1. inspect the accelerated folder map
2. inspect tokenizer rendering and [[Glossary#Special tokens|special tokens]]
3. inspect the base-training command and model architecture
4. inspect the SFT task mixture
5. compare base and chat checkpoints on fixed prompts
6. launch the final chatbot in CLI or web mode
7. connect the same backend idea to the frontend-app notes

If you want the shortest cloud path, use the one-command accelerated speedrun. If you want to see the stages clearly, inspect the tokenizer, base-train, and SFT entrypoints separately first, then see how `speedrun.sh` chains them together.

## Prompt set for base vs chat SFT

Use the same prompts on both models so you can see the behavioral shift directly:

- `Explain tokenization for a beginner.`
- `Why is the sky blue?`
- `Use one analogy to explain self-attention.`
- `Give me a two-step study plan for learning transformers.`

Run each prompt through:

1. the base checkpoint
2. the chat-SFT checkpoint

Then compare:

- style
- whether the answer sounds like a general assistant or a lecture assistant
- whether the SFT model follows the user role more clearly
- whether repetitive base-model continuation turns into a more assistant-like reply

The point is not that SFT invents knowledge from nowhere. The point is that it reshapes how the model uses the knowledge the base run already started to learn.
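A minimal harness for this side-by-side could look like the sketch below. The `generate` stub is a hypothetical stand-in for the repo's real checkpoint loading and sampling code; the checkpoint directory names are the ones the speedrun produces. The only real requirement is that both checkpoints see identical prompts and decoding settings.

```python
# Hypothetical comparison harness: run identical prompts through the base and
# chat-SFT checkpoints so post-training is the only variable. The generate()
# stub below stands in for the repo's real loading + sampling code.
PROMPTS = [
    "Explain tokenization for a beginner.",
    "Why is the sky blue?",
    "Use one analogy to explain self-attention.",
    "Give me a two-step study plan for learning transformers.",
]

CHECKPOINTS = {
    "base": "base_checkpoints",     # pre-SFT checkpoint directory
    "chat": "chatsft_checkpoints",  # post-SFT checkpoint directory
}

def generate(checkpoint_dir: str, prompt: str, max_new_tokens: int = 128) -> str:
    # Placeholder: swap in the actual picollm inference path (see chat/cli.py).
    return f"<sample from {checkpoint_dir} for {prompt!r}>"

for prompt in PROMPTS:
    print(f"=== {prompt}")
    for stage, ckpt in CHECKPOINTS.items():
        print(f"[{stage}] {generate(ckpt, prompt)}")
```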
## Hardware story

Leave this lecture with a realistic hardware map:

- Apple Silicon Macs are good for inference, inspection, and small local demos
- NVIDIA GPUs are better for heavier local tuning
- full pretraining is best done in the cloud

This matters because deployment is not only about code. It is also about matching the workflow to the hardware budget.

## What this lecture is not

This lecture is not trying to replace nanochat as a production research codebase.[^1] Instead, it gives you a clean final ladder:

- understand the model
- run the model seriously
- specialize the model into a chatbot
- serve the model
- connect the model to a production-style web app
- then study nanochat for deeper optimization

If you want a broader, industry-scale planning guide after this lecture, see the Hugging Face Smol Training Playbook:[^2]

- [The Smol Training Guide: The Secrets to Building World-Class LLMs](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook#attention)

## What to inspect in code

Use this order:

1. [prompts/real_chatbot_workflow/base_vs_chat_sft_prompts.md](https://github.com/Montekkundan/llm/blob/main/prompts/real_chatbot_workflow/base_vs_chat_sft_prompts.md)
2. [picollm/accelerated/README.md](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/README.md)
3. [picollm/accelerated/tokenizer.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/tokenizer.py)
4. [picollm/accelerated/pretrain/train.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/pretrain/train.py)
5. [picollm/accelerated/chat/sft.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/sft.py)
6. [picollm/accelerated/chat/eval.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/eval.py)
7. [picollm/accelerated/chat/cli.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/cli.py)
8. [apps/vercel_ai_sdk_chat/README.md](https://github.com/Montekkundan/llm/blob/main/apps/vercel_ai_sdk_chat/README.md)

## Key takeaway

The real chatbot workflow is where theory becomes an actual system. It should feel like the natural end of the course, not a disconnected engineering afterthought.

The final project is not only "I trained a checkpoint." It is:

1. I trained my own tokenizer and base model
2. I turned that base model into a chatbot with chat post-training
3. I evaluated the result honestly
4. I served it through a real interaction surface

<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
    <div><a class="internal-link" data-href="OpenTUI Terminal Chat App" href="OpenTUI%20Terminal%20Chat%20App">OpenTUI Terminal Chat App</a></div>
  </div>
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
    <div><a class="internal-link" data-href="picollm Code Map" href="picollm%20Code%20Map">picollm Code Map</a></div>
  </div>
</div>

## References

[^1]: Andrej Karpathy, [nanochat](https://github.com/karpathy/nanochat)
[^2]: Hugging Face TB, [The Smol Training Guide](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook)
[^3]: Hugging Face, [SmolTalk dataset card](https://huggingface.co/datasets/HuggingFaceTB/smoltalk)
[^4]: Dan Hendrycks et al., [Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300)
[^5]: Karl Cobbe et al., [Training Verifiers to Solve Math Word Problems](https://arxiv.org/abs/2110.14168)