> [!info] Course code
> Use the companion repository for runnable notebooks, figures, and implementation references for this lecture:
> - [notebooks/chat_format_and_sft/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/chat_format_and_sft/lecture_walkthrough.ipynb)
> - [picollm/accelerated/tokenizer.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/tokenizer.py)
> - [picollm/accelerated/chat/sft.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/sft.py)
> - [picollm/accelerated/chat/eval.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/eval.py)
## What This Concept Is
Suppose you already have a base model that can continue text. How do you make it behave more like an assistant that answers users in the right role, with the right boundaries, and in the right format? This note explains that shift.
The short answer is: you do not replace the whole model. You change the conversation format, the supervision target, and the data distribution the model trains on.
## Foundation Terms You Need First
Start with a role-structured conversation: user, assistant, and sometimes system messages. A **[[Glossary#Chat template|chat template]]** or rendering function flattens those messages into one token stream using **[[Glossary#Special tokens|special tokens]]** that mark the boundaries. A **training mask** decides which parts of that stream count toward the loss. **[[Glossary#SFT|SFT]]** then teaches the model to predict the assistant side of those examples.
So this note is less about inventing a new architecture and more about controlling what the existing architecture is asked to imitate.
```mermaid
flowchart TD
A["Structured conversation"] --> B["tokenizer.render_conversation(...)"]
B --> C["Flattened token sequence with chat boundary tokens"]
C --> D["Assistant-only supervision mask"]
D --> E["SFT loss"]
E --> F["Assistant-style checkpoint"]
```
## A first toy conversation
Start with something small enough to hold in your head:
```text
System: You are a helpful assistant.
User: What is attention?
Assistant: Attention lets one token weigh other tokens.
```
A base model does not naturally think in roles. It only sees a sequence and learns to continue it. Chat formatting makes those roles explicit, and SFT teaches the model that the assistant part is the continuation style we care about.
That is the cleanest first-pass summary of the whole note:
- the architecture stays the same
- the sequence format changes
- the supervised targets change
- the behavior changes with them
## How the accelerated picoLLM path formats and masks conversations
Once that toy example is clear, the serious `picollm` path becomes much easier to read. Chat formatting is not hidden inside a vague helper. It is a named, inspectable part of the system.
- `picollm/accelerated/tokenizer.py` defines the [[Glossary#Special tokens|special tokens]] such as `<|user_start|>` and `<|assistant_start|>`.
- The same file implements `render_conversation(...)`, which flattens a role-structured conversation into one causal token stream.
- That function also creates a training mask so the assistant tokens are learned as targets while user-side control tokens are not treated as assistant answers.
This is the crucial point:
> the model is still doing next-token prediction, but we decide which next tokens count as "the assistant behavior we want"
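To make that concrete, here is a minimal sketch of the render-and-mask idea. It is deliberately not the real `render_conversation(...)`: the special-token names mirror the ones above, but the toy `tokenizer.encode(...)` interface, the end tokens, and the masking details are assumptions for illustration.
```python
# A minimal sketch of render-plus-mask, NOT the actual picollm implementation.
# Assumes a tokenizer with encode(str) -> list[int] that knows the special tokens.

def render_conversation(messages, tokenizer):
    """Flatten role-structured messages into (token_ids, train_mask)."""
    token_ids: list[int] = []
    train_mask: list[bool] = []
    for msg in messages:
        start = tokenizer.encode(f"<|{msg['role']}_start|>")
        body = tokenizer.encode(msg["content"])
        end = tokenizer.encode(f"<|{msg['role']}_end|>")
        # Only assistant text counts toward the loss; system/user text and
        # the control tokens themselves are context, not targets.
        supervised = msg["role"] == "assistant"
        token_ids += start + body + end
        train_mask += [False] * len(start) + [supervised] * (len(body) + len(end))
    return token_ids, train_mask
```
Positions where the mask is `False` still flow through the model as context; they are simply excluded when the loss is computed.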
Then `picollm/accelerated/chat/sft.py` builds a real task mixture on top of that formatting. It does not train only on one conversation dataset. It mixes:
- SmolTalk
- identity conversations
- MMLU
- GSM8K
- spelling-style tasks
That means you can see [[Glossary#SFT|SFT]] not only as "teach the model to chat," but as "reshape the model into a more useful assistant with a deliberately mixed curriculum."
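Mechanically, a mixture like that often reduces to weighted sampling over per-task datasets. Here is a sketch under assumed weights; the real proportions in `sft.py` are not reproduced here.
```python
import random

# Illustrative weights only; these are NOT the real sft.py proportions.
MIXTURE = {
    "smoltalk": 0.50,
    "identity_conversations": 0.05,
    "mmlu": 0.20,
    "gsm8k": 0.15,
    "spelling": 0.10,
}

def sample_task(rng: random.Random) -> str:
    """Pick the dataset to draw the next training example from."""
    names = list(MIXTURE)
    return rng.choices(names, weights=[MIXTURE[n] for n in names], k=1)[0]
```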
## Why the current picoLLM SFT mixture exists
Each current SFT component covers a different weakness:
- **SmolTalk** teaches natural assistant turn-taking and response style.
- **identity conversations** stabilize project-specific identity behavior around picoLLM and Montek Singh Kundan.
- **MMLU** keeps broad academic and factual question answering alive during post-training.
- **GSM8K** keeps arithmetic and stepwise reasoning behavior from collapsing.
- **spelling-style tasks** force exact-format and counting behavior that fluent chat models often still get wrong.
This is an important point: SFT is curriculum design, not just "one chat dataset."
## The identity dataset is now part of the release contract
The serious path now relies on the canonical identity file:
- `picollm/accelerated/data/identity_conversations.jsonl`
And `speedrun.sh` can optionally fetch the hosted mirror and verify it against:
- `picollm/accelerated/data/identity_conversations.manifest.json`
That is worth stating explicitly. If you change the identity file, you have changed the assistant behavior.
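For intuition, "verify it against the manifest" is typically a content-hash check. This is a generic sketch that assumes the manifest is JSON with a top-level `sha256` field; the real manifest schema may differ.
```python
import hashlib
import json
from pathlib import Path

def verify_against_manifest(data_path: str, manifest_path: str) -> bool:
    """Check that a downloaded file matches the hash pinned in a manifest.

    Assumes the manifest is JSON with a "sha256" field; the actual
    identity_conversations.manifest.json schema may differ.
    """
    manifest = json.loads(Path(manifest_path).read_text())
    digest = hashlib.sha256(Path(data_path).read_bytes()).hexdigest()
    return digest == manifest["sha256"]
```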
## Why base models are not chat products
A base language model predicts the next token. A chat model behaves like an assistant. The difference is not mystical. It is created by data formatting, supervised fine-tuning, and serving conventions.
A [[Glossary#Base model|base model]] trained on generic text learns statistical continuation. If you prompt it with a question, it may answer, continue a dialogue, imitate a document, or drift stylistically. That ambiguity is expected because the pretraining objective never required the model to adopt the role of a helpful assistant.
## Chat formatting is a data intervention
Chat formatting changes the structure of the training data. Instead of generic text, the model sees role-structured turns such as:
- system message
- user message
- assistant reply
In the companion code, this formatting step is made explicit in the chat-format demo helper, where `format_messages(...)` linearizes a message list into one causal text sequence.
In the serious `picollm` path, the same idea shows up in two concrete places:
- `picollm/accelerated/tokenizer.py` turns message lists into trainable token streams with explicit chat boundary tokens
- `picollm/accelerated/chat/sft.py` fine-tunes the base [[Glossary#Checkpoint|checkpoint]] on those rendered conversations
That is the key point:
> Chat is not only a frontend format. It is a training-distribution format.
> [!important]
> If training format and inference format disagree, the model often looks much worse than it really is.
> That is why the `picollm` chat prompt format matters: the assistant should see the same control-token structure during serving that it learned during SFT.
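One way to honor that contract in code is to build the serving prompt with the same renderer used at training time. This sketch reuses the illustrative `render_conversation(...)` from earlier; the open-ended assistant-start suffix is an assumption about the serving convention.
```python
def build_serving_prompt(history, tokenizer):
    """Render prior turns exactly as during SFT, then open an assistant turn.

    The model then generates until it emits the assistant end token, so
    serving sees the same control-token structure as training.
    """
    token_ids, _ = render_conversation(history, tokenizer)  # sketch from above
    return token_ids + tokenizer.encode("<|assistant_start|>")
```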
## How role-structured data becomes a token stream
The model is still autoregressive, so a conversation must be flattened into text before [[LLM/concepts/Tokenization|tokenization]].
> [!example] Notebook follow-up
> - [`A conversation becomes one causal token stream`](https://github.com/Montekkundan/llm/blob/main/notebooks/chat_format_and_sft/lecture_walkthrough.ipynb#a-conversation-becomes-one-causal-token-stream)
> Use this notebook section here to see how chat roles become one autoregressive training sequence.
A simplified sequence looks like:
```text
System: You are a helpful assistant.
User: What is self-attention?
Assistant: ...
```
After that flattening, the same next-token objective applies. The model simply sees a sequence where certain role prefixes correlate with certain kinds of continuation.
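Concretely, the objective is just the stream shifted by one position. A minimal sketch, assuming `token_ids` and `train_mask` come from a renderer like the sketch above:
```python
# Next-token setup on the flattened stream: predict position t+1 from 0..t.
inputs = token_ids[:-1]
targets = token_ids[1:]
# A target only contributes to the loss if it is a supervised assistant token.
target_mask = train_mask[1:]
```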
This is the same broad logic you see in established projects:
- [`rasbt/LLMs-from-scratch`](https://github.com/rasbt/LLMs-from-scratch) is excellent for understanding why the model still stays autoregressive after chat formatting
- [`nanochat`](https://github.com/karpathy/nanochat) is a good reference for how that formatting becomes part of a full cloud training and serving workflow
## What SFT does
Supervised fine-tuning takes a pretrained base model and continues training it on curated prompt-response examples or conversations. [^1]
> [!example] Notebook follow-up
> - [`The architecture stayed the same and the training distribution changed`](https://github.com/Montekkundan/llm/blob/main/notebooks/chat_format_and_sft/lecture_walkthrough.ipynb#the-architecture-stayed-the-same-and-the-training-distribution-changed)
> Use this notebook section here to connect the base-model architecture to the new training distribution.
> [!tip] TensorTonic follow-up
> - [TensorTonic: BERT Fine-Tuning](https://www.tensortonic.com/research/bert/bert-fine-tuning)
> Use it here as the closest adaptation-style contrast to the SFT story in this section.
The high-level idea is:
- pretraining gives broad language competence
- SFT bends that competence toward a useful interaction style
SFT does not replace pretraining. It specializes it.
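In loss terms, "specializes" means the same cross-entropy as pretraining, restricted to the masked positions. Here is a PyTorch sketch under the shifted-target convention above; the tensor shapes are assumptions, not picollm's actual trainer code.
```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, targets: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over supervised (assistant) positions only.

    logits:  (batch, seq, vocab) outputs for the shifted inputs
    targets: (batch, seq) next-token ids
    mask:    (batch, seq) True where the target is an assistant token
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch*seq, vocab)
        targets.reshape(-1),                  # (batch*seq,)
        reduction="none",
    )
    per_token = per_token * mask.reshape(-1).float()
    # Average over supervised positions; clamp avoids division by zero.
    return per_token.sum() / mask.sum().clamp(min=1)
```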
In course terms, SFT sits directly on top of:
- [[Causal Language Modeling]]
- [[Training Loop]]
- [[Training Configuration and Hyperparameters]]
## What the companion code does
The live lecture walkthrough lives in [notebooks/chat_format_and_sft/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/chat_format_and_sft/lecture_walkthrough.ipynb).
That is enough for you to see the whole adaptation loop:
- start from a base checkpoint
- load conversation-style data
- continue training
- compare behavior before and after
> [!tip]
> When you read the code, compare the formatting helper, the SFT trainer, and the serving prompt side by side.
> That is one of the cleanest ways to see that "chat" is a distributional contract, not a magical extra layer.
## Canonical schema for the course
For clarity, the course standardizes one schema:
- `system`
- `user`
- `assistant`
Using one schema across notebooks, CLI examples, API examples, and SFT data prevents unnecessary confusion.
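Under that schema, one training conversation is just a list of role-tagged messages, for example:
```python
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is attention?"},
    {"role": "assistant", "content": "Attention lets one token weigh other tokens."},
]
```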
## Why the comparison to the base model matters
Without a base-vs-SFT comparison, it is easy to over-credit the architecture. But the architecture did not suddenly become "chatty." The training distribution changed.
That is why the right experiment here is:
1. run the same prompt on the base checkpoint
2. run it again on the SFT checkpoint
3. compare style, relevance, and role-following
This makes the effect of SFT visible instead of mystical.
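Here is a hedged sketch of that experiment using Hugging Face `transformers`; the checkpoint paths are placeholders, not real picollm artifacts, and a chat checkpoint would normally get its prompt rendered through the chat template first.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT = "Explain self-attention to a beginner in 4 sentences."

# Placeholder paths; substitute your own base and SFT checkpoints.
for name in ("path/to/base-checkpoint", "path/to/sft-checkpoint"):
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    inputs = tokenizer(PROMPT, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=128)
    print(f"=== {name} ===")
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```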
> [!example] Notebook follow-up
> - [`Base model versus SFT model`](https://github.com/Montekkundan/llm/blob/main/notebooks/chat_format_and_sft/lecture_walkthrough.ipynb#base-model-versus-sft-model)
> Use this notebook section here to compare behavior before and after SFT on the same prompt set.
## The SFT flow in practice
It helps to see the adaptation path as a short, reproducible systems flow:
1. load a base checkpoint
2. load conversation data under one canonical schema
3. format each conversation into a causal sequence
4. fine-tune for a controlled number of steps
5. compare the resulting checkpoint against the base model
That flow is the product-facing version of the conceptual material in this note.
For the runnable path, look at:
- [notebooks/sft_flow/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/sft_flow/lecture_walkthrough.ipynb)
The key point is not only that SFT improves behavior. It is that the improvement comes from a controlled change in data distribution and training procedure.
## What SFT does not automatically solve
It is also worth being clear about the limits:
- SFT does not guarantee safety
- SFT does not guarantee truthfulness
- SFT does not by itself create tool-use infrastructure
- SFT does not replace careful evaluation
It is a distributional specialization step, not a complete product alignment stack.
## Common confusions
### "Is chat formatting just prompt engineering?"
No. In this context it is part of the training data representation and serving contract.
### "Does SFT change the architecture?"
Usually no. It changes the parameters through additional training on specialized data.
### "Why not just instruct the base model better?"
Sometimes prompting helps, but SFT shifts the model's learned continuation behavior more reliably across many inputs.
## A useful comparison
Take one base checkpoint and one SFT checkpoint. Ask both:
`Explain self-attention to a beginner in 4 sentences.`
Then compare:
- structure
- directness
- role-following
- verbosity
You will immediately see what changed.
## Key takeaway
Chat is not a separate species of model. It is a language model whose training distribution, prompt format, and serving conventions were changed so that the next-token objective manifests as assistant-like behavior.
> [!example] Notebook walkthroughs in this lecture
>
> Use this order:
>
> 1. [notebooks/chat_format_and_sft/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/chat_format_and_sft/lecture_walkthrough.ipynb)
> Use these sections as you read:
> - [`A conversation becomes one causal token stream`](https://github.com/Montekkundan/llm/blob/main/notebooks/chat_format_and_sft/lecture_walkthrough.ipynb#a-conversation-becomes-one-causal-token-stream)
> - [`Base model versus SFT model`](https://github.com/Montekkundan/llm/blob/main/notebooks/chat_format_and_sft/lecture_walkthrough.ipynb#base-model-versus-sft-model)
> - [`The architecture stayed the same and the training distribution changed`](https://github.com/Montekkundan/llm/blob/main/notebooks/chat_format_and_sft/lecture_walkthrough.ipynb#the-architecture-stayed-the-same-and-the-training-distribution-changed)
> 2. [picollm/accelerated/tokenizer.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/tokenizer.py)
> 3. [picollm/accelerated/chat/sft.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/sft.py)
> [!tip] TensorTonic practice for this lecture
>
> If you want to practice this lecture in a more implementation-focused format, work through this TensorTonic exercise:
>
> - [TensorTonic: BERT Fine-Tuning](https://www.tensortonic.com/research/bert/bert-fine-tuning)
>
> It is a good follow-up because it helps you separate pretraining from task adaptation:
>
> - starting from a pretrained checkpoint
> - attaching a task-specific objective on top
> - updating the model for a narrower behavior target
> - comparing that adaptation mindset with instruction tuning and chat-style specialization
<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
<div><a class="internal-link" data-href="Data Curation and Dataset Quality" href="Data%20Curation%20and%20Dataset%20Quality">Data Curation and Dataset Quality</a></div>
</div>
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
<div><a class="internal-link" data-href="SFT Flow" href="SFT%20Flow">SFT Flow</a></div>
</div>
</div>
## Further reading
- Hugging Face, "Chat templates," 2025. https://huggingface.co/docs/transformers/chat_templating
- Hugging Face, "SFT Trainer," 2025. https://huggingface.co/docs/trl/en/sft_trainer
- Hugo Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models," 2023. https://arxiv.org/abs/2307.09288
---
## References
[^1]: Long Ouyang et al., "Training language models to follow instructions with human feedback," 2022. https://arxiv.org/abs/2203.02155