> [!info] Course code
> Use these repo paths together with this note:
> - [picollm/accelerated/core_eval.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/core_eval.py)
> - [picollm/accelerated/pretrain/eval.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/pretrain/eval.py)
> - [picollm/accelerated/chat/eval.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/eval.py)
> - [picollm/accelerated/report.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/report.py)
## What This Concept Is
It is easy to quote a benchmark score. It is harder to know what that score actually means, whether it is comparable, and whether it is honest. This note is about the discipline behind benchmark claims.
A benchmark is only useful when the task, [[Glossary#Protocol|protocol]], and comparison standard are all clear.
## Foundation Terms You Need First
A **benchmark** is a shared task or suite used for comparison. A **[[Glossary#Protocol|protocol]]** is the exact setup used to produce a score: prompts, splits, decoding choices, and scoring rules. A **contamination check** asks whether benchmark material leaked into training. A **claim** is the conclusion someone draws from the score.
So when you read benchmark numbers, do not stop at the number. Ask what task it came from, how it was measured, and whether the comparison is actually fair.
## Held-out validation design
A held-out set is useful only if it is actually held out and actually relevant. Validation splits should not be accidental leftovers. They should represent a distribution you care about and remain untouched by training updates. If the validation data is too close to the training data, the measured score will overestimate generalization. If it is too different, it answers a question you did not mean to ask.[^1]
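One way to keep a split deliberate rather than accidental is to derive it deterministically from a stable document identifier. The sketch below is a minimal illustration of that idea, not the course's actual split logic; the function name and hashing scheme are assumptions.

```python
import hashlib

def split_for(doc_id: str, val_fraction: float = 0.01) -> str:
    """Assign a document to 'train' or 'val' from a stable hash of its id.

    Hash-based assignment is deterministic across reruns, so validation
    documents never drift into the training set by accident.
    (Illustrative sketch; doc ids and the 1% fraction are assumptions.)
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "val" if bucket < val_fraction else "train"

print(split_for("doc-00042"))  # same answer on every run
```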
For serious work, you should distinguish at least three evaluation layers:
1. optimization validation
2. product-style prompt evaluation
3. benchmark or publication-style evaluation
Optimization validation asks whether training is progressing sensibly. Product evaluation asks whether the model behaves usefully on tasks that matter to your application. Formal benchmark evaluation asks whether the system performs competitively on standardized tasks under a documented protocol. Confusing these layers is one of the fastest ways to make weak claims.
## Benchmark methodology
Benchmarks standardize evaluation, but standardization only helps if the protocol is clear. You should understand what the benchmark measures, what prompting protocol it assumes, whether it uses few-shot or zero-shot setups, and what scoring method is applied. A benchmark number without methodological context is weaker than it looks.[^2]
At minimum, a benchmark report should record:
- benchmark name and version
- exact task subset
- prompt template or harness settings
- decoding settings
- model [[Glossary#Checkpoint|checkpoint]] identifier
- [[Glossary#Tokenizer|tokenizer]] version if relevant
- evaluation hardware and runtime stack
Without that metadata, a later reviewer often cannot tell whether two reported numbers are even comparable.
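One lightweight way to enforce this is to make the metadata travel with the score as a single record. The sketch below assumes illustrative field values (the checkpoint and benchmark names are made up); the point is the shape, not the specifics.

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class BenchmarkRecord:
    # Field names mirror the checklist above; values here are illustrative.
    benchmark: str        # name and version
    subset: str           # exact task subset evaluated
    prompt_template: str  # or a named harness configuration
    decoding: dict        # temperature, top_p, max_new_tokens, ...
    checkpoint: str       # model checkpoint identifier
    tokenizer: str        # tokenizer version, if relevant
    hardware: str         # evaluation hardware and runtime stack
    score: float

record = BenchmarkRecord(
    benchmark="hellaswag-v1", subset="validation",
    prompt_template="zero-shot-v2",
    decoding={"temperature": 0.0, "max_new_tokens": 64},
    checkpoint="picollm-step-12000", tokenizer="bpe-32k-v3",
    hardware="1xA100, torch 2.3", score=0.412,
)
print(json.dumps(asdict(record), indent=2))
```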
## Contamination
Contamination happens when evaluation data leaks into training data. This can make benchmark results look much better than they should, because the model is no longer being tested on genuinely unseen examples. Contamination is not a theoretical curiosity. At web scale, it is a very real risk, especially when public benchmark datasets or code repositories containing those datasets are part of the corpus.[^3]
You should learn to separate:
- direct contamination: the exact eval item appeared in training
- template contamination: near-identical prompt or answer formats were seen
- benchmark-family contamination: public leaderboard items or mirrors entered the corpus
The research habit to teach is not "claim contamination never happened." It is "state what contamination controls were used and what residual risk remains."
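A common screening control for direct contamination is n-gram overlap between eval items and the training corpus. The sketch below is a minimal in-memory version under that assumption; real corpora need sharded or Bloom-filter implementations, and the 13-gram window is a common heuristic, not a standard.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Whitespace-token n-grams; 13 is a common screening window."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(eval_item: str, corpus_docs: list[str], n: int = 13) -> float:
    """Fraction of an eval item's n-grams that also appear in the corpus.

    A high fraction flags possible direct contamination. It says nothing
    about template or benchmark-family contamination, which need other checks.
    """
    item_grams = ngrams(eval_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set().union(*(ngrams(d, n) for d in corpus_docs))
    return len(item_grams & corpus_grams) / len(item_grams)
```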
## Prompt-suite design
Prompt suites are the bridge between raw benchmarks and human product intuition. A good prompt suite should be stable, diverse enough to reveal different behaviors, and scored or reviewed consistently. In this course, the practical probe set lives alongside the accelerated workflow and prompt artifacts, which teaches the right habit: evaluate with a fixed, named set of prompts instead of improvising every time.
For research-grade use, a prompt suite should be designed with explicit coverage:
- instruction following
- formatting fidelity
- reasoning or multi-step structure
- factuality or retrieval dependence
- refusal behavior where relevant
- regression-sensitive product behaviors
The key point is that evaluation items should be sampled from an evaluation design, not from whatever the instructor happens to remember during a demo.
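A minimal sketch of what "fixed, named set of prompts" means in practice: every item carries a stable id and a coverage category so results are comparable run to run. The suite name, ids, and prompts below are all illustrative, not the course's actual probe set.

```python
# A fixed, named probe set; categories mirror the coverage list above.
PROMPT_SUITE = {
    "name": "course-probes-v1",
    "items": [
        {"id": "inst-01", "category": "instruction_following",
         "prompt": "List three prime numbers, one per line."},
        {"id": "fmt-01", "category": "formatting_fidelity",
         "prompt": "Answer in valid JSON with keys 'city' and 'country': "
                   "where is the Eiffel Tower?"},
        {"id": "reason-01", "category": "reasoning",
         "prompt": "If a train leaves at 9:40 and the trip takes 85 minutes, "
                   "when does it arrive?"},
        {"id": "refusal-01", "category": "refusal",
         "prompt": "Explain how to pick a neighbor's door lock."},
    ],
}
```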
## Human evaluation and pairwise comparison
Human evaluation remains important because not every useful behavior is captured by automatic metrics. Pairwise comparison is often more reliable than absolute scoring because humans are usually better at saying which of two answers is preferable than assigning a precise numerical value to a single one.[^4]
But human evaluation is only strong if the rubric is clear. At minimum, you should specify:
- what raters were asked to judge
- whether the comparison was blinded
- whether order was randomized
- how ties were handled
- whether judges saw model names or checkpoint metadata
This is where many "human eval" claims collapse. A human preference result without protocol detail is often only a polished anecdote.
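Blinding and order randomization are cheap to implement. The sketch below assumes a simple two-system comparison; the function name and record layout are illustrative.

```python
import random

def blinded_pair(prompt: str, out_a: str, out_b: str, seed: int):
    """Present two outputs with hidden identity and randomized order.

    The rater sees only 'response_1' and 'response_2'; the mapping back
    to systems A and B is stored separately for unblinding after judging.
    """
    rng = random.Random(seed)
    order = ["A", "B"]
    rng.shuffle(order)
    first, second = (out_a, out_b) if order[0] == "A" else (out_b, out_a)
    display = {"prompt": prompt, "response_1": first, "response_2": second}
    key = {"response_1": order[0], "response_2": order[1]}  # kept from raters
    return display, key
```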
## Statistical thinking
Even when one result looks better than another, you should ask whether the difference is large enough and stable enough to matter. Formal statistics can go much deeper than a course needs, but the core habit should still be taught: do not overstate tiny deltas without checking whether they are consistent and meaningful.
For checkpoint comparisons, the right default instinct is:
- report the absolute metric difference
- report the evaluation sample size
- report uncertainty or at least discuss expected variance
- avoid strong conclusions from tiny deltas on small suites
For prompt suites and pairwise review, paired analysis matters because the same prompts are often being scored across both systems. That makes paired bootstrap or related resampling logic more appropriate than pretending the samples are independent.
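Here is a minimal paired-bootstrap sketch for that setting. It assumes per-prompt scores for both systems on the same suite; resampling whole prompts preserves the per-prompt coupling that independent-sample tests would throw away.

```python
import random

def paired_bootstrap(scores_a: list[float], scores_b: list[float],
                     iters: int = 10_000, seed: int = 0) -> float:
    """Fraction of bootstrap resamples in which system B beats system A.

    Prompts are the pairing unit: each resample draws prompt indices with
    replacement and compares the two systems on the same drawn prompts.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(scores_b[i] - scores_a[i] for i in idx) / n
        if delta > 0:
            wins += 1
    return wins / iters
```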
## Rubrics, agreement, and evaluator quality
A rubric should not only say "better answer." It should tell a reviewer what dimensions matter. For example:
- correctness
- completeness
- faithfulness to requested format
- safety or refusal quality
- concision
If multiple human raters are involved, disagreement is itself signal. Low agreement can indicate that the rubric is too vague, the task is underspecified, or the examples are too ambiguous. You do not need an industrial annotation platform for this course, but you should understand that evaluator agreement is part of evaluation quality, not an administrative afterthought.
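A standard way to quantify that signal for two raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a minimal version for categorical labels; it is not part of the course tooling.

```python
def cohens_kappa(rater_1: list[str], rater_2: list[str]) -> float:
    """Two-rater agreement, corrected for chance agreement.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e the rate expected from each rater's label frequencies.
    """
    assert len(rater_1) == len(rater_2)
    n = len(rater_1)
    labels = set(rater_1) | set(rater_2)
    p_o = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    p_e = sum((rater_1.count(l) / n) * (rater_2.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
```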
## Regression tracking
Formal evaluation is not only about chasing improvement. It is also about preserving previous behavior. A run that improves one benchmark but damages formatting quality or refusal behavior may not be a better system overall. Regression tracking exists to make those trade-offs visible.
This is exactly why the course keeps:
- a shared prompt suite
- checkpoint comparison tooling
- [[Glossary#Latency|latency]] measurement
- smoke tests
Those tools are not a substitute for a full lab evaluation stack, but they enforce the right habit: every gain should be checked against what might have been lost.
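The core of a regression check is small: compare a candidate checkpoint against a baseline across every tracked metric and flag any that slipped. The metric names, tolerance, and numbers below are illustrative assumptions, not output from the course tooling.

```python
def regression_check(baseline: dict[str, float], candidate: dict[str, float],
                     tolerance: float = 0.02) -> list[str]:
    """Return the metrics where the candidate regresses past tolerance.

    Checking every tracked metric means a gain on one axis cannot
    silently hide a loss on another.
    """
    return [
        name
        for name, base in baseline.items()
        if candidate.get(name, float("-inf")) < base - tolerance
    ]

baseline = {"benchmark_acc": 0.41, "format_pass_rate": 0.93, "refusal_quality": 0.88}
candidate = {"benchmark_acc": 0.45, "format_pass_rate": 0.85, "refusal_quality": 0.89}
print(regression_check(baseline, candidate))  # -> ['format_pass_rate']
```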
## What a defensible evaluation section looks like
For a reviewable project report, you should be able to write an evaluation section with:
1. the question being tested
2. the checkpoints being compared
3. the fixed protocol
4. the metrics or rubric
5. the observed results
6. the limitations and contamination caveats
That structure is stronger than a benchmark screenshot because it makes the claim inspectable.
## Scope of this course and where to go deeper
This course introduces the logic of formal evaluation and gives a runnable path through `picollm/accelerated/core_eval.py`, `picollm/accelerated/pretrain/eval.py`, `picollm/accelerated/chat/eval.py`, and `picollm/accelerated/report.py`. That is enough for you to learn the difference between casual prompting and structured evaluation.
What we are not trying to do here is reproduce the full industrial evaluation stack used inside major labs. A deeper follow-on module could spend entire lectures on benchmark governance, contamination audits, human-labeling protocol design, confidence intervals, rubric calibration, paired-bootstrap testing, evaluator agreement, and large-scale automatic evaluation pipelines.
If you want to go deeper after this note, the best next steps are:
- study `lm-evaluation-harness` task configuration and result interpretation in detail
- read more about contamination and benchmark validity at web scale
- design one domain-specific benchmark or evaluation rubric of your own
- compare automatic scoring with human pairwise review on the same outputs
<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
<div><a class="internal-link" data-href="Failure Modes and Debugging" href="Failure%20Modes%20and%20Debugging">Failure Modes and Debugging</a></div>
</div>
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
<div><a class="internal-link" data-href="Reproducibility and Research Method" href="Reproducibility%20and%20Research%20Method">Reproducibility and Research Method</a></div>
</div>
</div>
## References
[^1]: Ian Goodfellow, Yoshua Bengio, and Aaron Courville, [Deep Learning](https://www.deeplearningbook.org/)
[^2]: EleutherAI, [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness)
[^3]: Luca Soldaini et al., Allen Institute for AI, [Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research](https://arxiv.org/abs/2402.00159)
[^4]: Long Ouyang et al., OpenAI, [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)