> [!info] Course code
> Use these repo paths together with this note:
> - [picollm/accelerated/chat/sft.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/sft.py)
> - [picollm/accelerated/chat/eval.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/eval.py)
> - [picollm/accelerated/report.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/report.py)

## What This Concept Is

SFT gets you from a base model to an assistant-like model, but many modern pipelines do not stop there. After demonstrations, teams often add preference learning, ranking signals, and other post-training stages to shape quality more finely. This note is about that second layer of shaping.

## Foundation Terms You Need First

- **Demonstration data** shows the model what a good answer looks like.
- **Preference data** shows which of two answers people prefer.
- A **[[Glossary#Reward model|reward model]]** is trained to score outputs using those preferences.
- **Post-training** is the umbrella term for stages after base pretraining that specialize behavior.

So if SFT teaches the model the broad style of an assistant, later post-training often teaches it how to rank close alternatives and behave better under tricky trade-offs.

## Why SFT is not the end

SFT is powerful because it teaches the model the general style of desirable behavior. It learns how an assistant should respond, how roles are formatted, and what a typical helpful answer looks like.

But demonstration data alone does not always tell the model how to rank two plausible answers when both are fluent. Preference data exists because quality often lives in comparisons, not just in demonstrations.

## RLHF pipeline intuition

[[Glossary#RLHF|RLHF]] usually involves three conceptual stages:

1. supervised fine-tuning on demonstrations
2. training a [[Glossary#Reward model|reward model]] from preference comparisons
3. reinforcement learning to optimize policy behavior against that reward signal

The details vary, but the intuition is stable: first teach the model the general task, then refine it using preference information.[^2]

## Reward models

A reward model is trained to predict which response humans prefer among alternatives. That model becomes a learned proxy for response quality.

Reward models are useful because they transform pairwise preference data into a signal that can guide further optimization. But they also introduce a new source of modeling error, because the reward model is itself only an approximation of human preference.[^2] At the advanced level, you should treat reward misspecification as a real research issue, not a footnote.

## DPO

Direct Preference Optimization became influential because it showed that preference learning could be formulated more directly, without the full PPO-style RLHF stack. [[Glossary#DPO|DPO]] reframes the problem in a way that is often simpler to implement and reason about, which is why it became important in open-source post-training workflows.[^3]

The key research idea is that post-training can be expressed as relative preference learning against a reference policy rather than only as explicit online RL, as the sketch below illustrates.
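To make that reframing concrete, here is a minimal sketch of the DPO loss, assuming you have already summed the token log-probabilities of the chosen and rejected responses under both the current policy and a frozen reference model. The function name, the toy tensors, and the `beta` value are illustrative only; none of this is part of picollm's production path, which stops at SFT.

```python
# Minimal DPO loss sketch (Rafailov et al., 2023), not the course implementation.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Pairwise preference loss measured relative to a frozen reference policy."""
    # Log-ratios of the current policy against the reference, per response.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Beta controls how strongly the policy may move away from the reference.
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log sigmoid(logits): the loss falls as the chosen response is favored.
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up summed log-probabilities for a batch of two pairs.
policy_chosen = torch.tensor([-12.3, -9.8])
policy_rejected = torch.tensor([-14.1, -9.5])
ref_chosen = torch.tensor([-13.0, -10.2])
ref_rejected = torch.tensor([-13.5, -9.9])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

The same pairwise log-sigmoid structure appears in reward-model training, where the logits would be a score difference between the two responses rather than scaled policy log-ratios.[^2] Notice also that the reference model enters only through log-ratios; that is the implicit pressure to stay near the reference policy discussed under policy drift below.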
## Preference data quality

Preference optimization is only as good as the comparisons it learns from. You should therefore ask:

- who produced the preferences?
- under what rubric?
- how noisy are the comparisons?
- what behaviors are overrepresented?

This is exactly parallel to ordinary data engineering: post-training data is still data, and it can still encode bias and label error.

## KL control and policy drift

One of the central concerns in post-training is policy drift. If the optimized model moves too far from the reference policy, it may become unstable, overly stylized, or reward-hacked. This is why many post-training methods include an explicit or implicit pressure to stay near the reference model.

You do not need the full derivation on first pass, but you should understand that alignment optimization is constrained optimization, not unconstrained score chasing.

## Why you should understand this even if we do not implement it

Even if the course stays implementation-light beyond SFT, you should still know where the field goes next. Otherwise it is easy to leave with the mistaken impression that SFT is the full modern assistant-training story. It is not. SFT is the foundation; preference optimization is one of the major layers built on top of it.

This course keeps the production path focused on SFT. That is enough to teach the main idea cleanly. A deeper follow-on module could cover reward-model training, PPO-style policy optimization, preference data collection pipelines, online sampling loops, and post-training evaluation design in much more detail.

If you want to go deeper after this note, the best next steps are:

- compare SFT and DPO outputs on the same [[Glossary#Checkpoint|checkpoint]] using a fixed prompt suite
- study how reward models are trained and audited
- read the original RLHF and DPO papers closely
- build a small preference dataset and test how sensitive DPO is to its quality

<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
    <div><a class="internal-link" data-href="Advanced Inference Systems" href="Advanced%20Inference%20Systems">Advanced Inference Systems</a></div>
  </div>
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
    <div><a class="internal-link" data-href="Safety and Alignment Evaluation" href="Safety%20and%20Alignment%20Evaluation">Safety and Alignment Evaluation</a></div>
  </div>
</div>

## References

[^1]: Sebastian Raschka, [LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch)
[^2]: Long Ouyang et al., OpenAI, [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
[^3]: Rafael Rafailov et al., Stanford, [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://arxiv.org/abs/2305.18290)
[^4]: Nisan Stiennon et al., OpenAI, [Learning to summarize from human feedback](https://arxiv.org/abs/2009.01325)