Safety and Alignment Evaluation

> [!info] Course code > Use these repo paths together with this note: > - [picollm/accelerated/chat/eval.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/eval.py) > - [picollm/accelerated/report.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/report.py) ## What This Concept Is A model can sound fluent and still behave badly. It can answer harmful requests too readily, refuse safe requests too often, or become inconsistent under adversarial phrasing. This note is about how to evaluate that side of model behavior instead of treating safety as a vague feeling. In practice, alignment evaluation is where "helpful" gets tested against actual boundaries. ## Foundation Terms You Need First An **alignment target** is the intended pattern of helpful but bounded behavior. **Harmful compliance** means answering something the system should refuse. **False refusal** means blocking a request that should have been answered. An **evaluation protocol** is the set of prompts, categories, and scoring rules used to judge that behavior. So this note is not asking "does the model ever refuse?" It is asking whether the model refuses in the right places and stays useful in the others. ## Refusal behavior A useful assistant should refuse some requests. The hard part is that refusal quality is not binary. A model can under-refuse and comply with harmful requests. It can also over-refuse and become uselessly restrictive. Alignment evaluation therefore asks whether the model refuses in the right regions of request space, not simply whether it ever says no. You should explicitly distinguish: - correct refusal - correct compliance - harmful compliance - harmless false refusal That confusion matrix is more informative than a generic statement such as "the model seemed safe." ## Prompt injection and jailbreak susceptibility Prompt injection and jailbreak attempts matter because language models are instruction-following systems that can be manipulated by adversarial phrasing. A model that behaves well on ordinary prompts may still fail under adversarial wrappers, role-play attacks, or instruction overrides. This is why safety evaluation includes adversarial prompting and red-teaming, not just ordinary user questions.[^2] The research habit to teach is: - test direct unsafe requests - test wrapped unsafe requests - test role-play and fictional framing - test instruction-conflict cases The point is not to sensationalize jailbreaks. It is to see whether the system remains policy-consistent when the surface form changes. ## Harmful-output evaluation Harmful-output evaluation usually probes categories such as self-harm assistance, illicit instructions, hate content, or dangerous misinformation. The exact taxonomy depends on the product and its risk model, but the core idea is stable: align the evaluation with concrete harm categories rather than vague discomfort. For classroom rigor, you should record: - the harm category - whether the model refused - whether the refusal was sufficiently clear - whether the answer still leaked actionable harmful detail That final point matters because a polite refusal can still be unsafe if it includes operational instructions around the edges. ## Over-refusal and utility loss Safety work becomes unserious if it ignores the opposite error. A model that refuses benign educational, medical, or technical questions too aggressively may be safer in one narrow sense and much worse in real use. This is why evaluation should include benign-but-sensitive prompts, not only clearly disallowed ones. Reviewers should be able to see whether safety tuning produced: - safer refusal boundaries - or blunt avoidance behavior that harms usefulness ## Why deployment-time trade-offs matter A slightly more helpful model that is much easier to jailbreak may be a worse deployment choice. Likewise, a very safe model that over-refuses harmless questions may damage product usefulness. This is why alignment is a trade-off surface, not a one-dimensional score. This is also why safety evaluation should be reported alongside ordinary quality evaluation. Safety results in isolation can hide a utility collapse. Utility results in isolation can hide dangerous compliance. ## Evaluator protocol At graduate level, you should specify how safety judgments were made. For example: - what categories were tested - what counted as refusal success - what counted as harmful leakage - whether raters used a rubric - whether examples were reviewed pairwise across checkpoints That protocol detail is what moves the work from "we tried a few bad prompts" to something a reviewer can actually inspect. ## Relation to post-training Safety is not only an evaluation problem. It is also a post-training problem. Even in a course path that stops at [[Glossary#SFT|SFT]], refusal behavior and policy consistency can shift after chat tuning. That means safety testing should be part of [[Glossary#Checkpoint|checkpoint]] comparison, not a final ceremonial pass after the model is already chosen. ## Scope of this course and where to go deeper This course gives you a concrete starting point through the accelerated chat evaluation path and the shared reporting flow. That is enough to teach the difference between casual chatbot testing and deliberate alignment evaluation, even before a dedicated standalone safety harness is added. What this course is not trying to do is replicate a full industrial safety program. A deeper follow-on module could focus entirely on red-team operations, adversarial prompt taxonomies, policy design, moderation systems, evaluator rubrics, constitutional methods, and deployment governance. If you want to go deeper after this note, the best next steps are: - expand the safety prompt suite by harm category and refusal style - compare base and SFT checkpoints on the same safety probes - read more about constitutional methods, red-teaming, and adversarial prompting - separate harmless false refusals from dangerous non-refusals in your evaluation notes <div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;"> <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);"> <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div> <div><a class="internal-link" data-href="Post-Training Beyond SFT" href="Post-Training%20Beyond%20SFT">Post-Training Beyond SFT</a></div> </div> <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);"> <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div> <div><a class="internal-link" data-href="Advanced Data Engineering for LLMs" href="Advanced%20Data%20Engineering%20for%20LLMs">Advanced Data Engineering for LLMs</a></div> </div> </div> ## References [^1]: Yuntao Bai et al., Anthropic, [Constitutional AI: Harmlessness from AI Feedback](https://arxiv.org/abs/2212.08073) [^2]: NIST, [Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations](https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-2e2023.pdf) [^3]: OpenAI, [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155) [^4]: Princeton Center for Information Technology Policy et al., [Red-Teaming for Generative AI](https://arxiv.org/abs/2401.15897)