> [!info] Course code
> Use these repo paths together with this note:
> - [picollm/accelerated/pretrain/eval.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/pretrain/eval.py)
> - [picollm/accelerated/chat/eval.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/eval.py)
> - [picollm/accelerated/report.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/report.py)
## What This Concept Is
Imagine two checkpoints sitting side by side. One trained longer; the other has a lower loss. One of them gave a nicer answer on a prompt you happened to try. Which one is actually better? This note is about how to answer that question without fooling yourself.
In practice, evaluation is where vague impressions get replaced by evidence.
## Foundation Terms You Need First
Stay with that two-checkpoint picture for a second. A **metric** is the rule you use to score their behavior. A **validation set** is held-out data used for measurement rather than parameter updates. A **baseline** is the earlier run or reference system you compare against. A **protocol** is the full measurement setup: prompts, splits, scoring rules, and reporting choices.
That last word matters a lot. A metric without a clear protocol is easy to misread. Two scores only mean something when the measurement setup is comparable and explicit.
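To make "explicit" concrete, here is a minimal sketch of a protocol pinned down as data rather than folklore. Every field name below is illustrative, not something from the `picollm` code; the point is that two scores are only comparable when all of these choices are held fixed and written down.

```python
from dataclasses import dataclass

# Hypothetical protocol record: none of these fields come from picollm;
# they just make the "measurement setup" explicit and comparable.
@dataclass(frozen=True)
class EvalProtocol:
    prompt_set: str        # which fixed prompts or tasks are scored
    split: str             # which held-out data is used
    scoring_rule: str      # e.g. "exact_match" or "multiple_choice_logits"
    temperature: float     # decoding settings are part of the protocol
    max_new_tokens: int    # so are truncation and stopping choices
```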
## The picoLLM evaluation stack you will actually run
In the accelerated `picollm` path, evaluation is not one monolithic score. It is a staged stack, and each stage answers a different question.
```mermaid
flowchart TD
A["Base pretraining loop"] --> B["Validation BPB every eval_every steps"]
A --> C["CORE benchmark every core_metric_every steps"]
A --> D["Greedy samples every sample_every steps"]
B --> E["Can the model compress held-out text better?"]
C --> F["Has general capability improved beyond random baselines?"]
D --> G["Does the model sound sane on short probes?"]
H["Chat SFT checkpoint"] --> I["Chat eval: ARC, MMLU, GSM8K, HumanEval, SpellingBee"]
I --> J["ChatCORE metric"]
```
The code paths are:
- `picollm/accelerated/pretrain/train.py`
- `picollm/accelerated/pretrain/eval.py`
- `picollm/accelerated/loss_eval.py`
- `picollm/accelerated/chat/eval.py`
The central question at each layer is different:
- [[Glossary#Bits per byte (BPB)|BPB]] answers: is the [[Glossary#Base model|base model]] fitting held-out language better?
- CORE answers: is the base model improving on a mixed reasoning and knowledge suite?
- samples answer: do quick textual probes look sane, degenerate, or unexpectedly strong?
- chat eval answers: after [[Glossary#SFT|SFT]], can the assistant solve the kinds of user-facing tasks we actually care about?
## What `Validation bpb` means
When the training log prints a line such as:
```text
Step 05568 | Validation bpb: 0.717964
```
it is reporting **bits per byte** on held-out validation text.
In `picollm/accelerated/loss_eval.py`, BPB is computed as:
```text
bpb = total_nats / (ln(2) * total_bytes)
```
The key idea is that BPB normalizes by the number of UTF-8 bytes represented by the target tokens, not just by the number of tokens. That matters because token-level loss is partly entangled with [[Glossary#Tokenizer|tokenizer]] design. If you change the [[Glossary#Vocabulary|vocabulary]], the same underlying text can break into a different number of tokens. BPB is a cleaner compression-style measure for comparing language-model quality across tokenizations.
The implementation also excludes:
- [[Glossary#Special tokens|special tokens]] such as `<|bos|>`
- masked positions with `ignore_index`
- any token IDs whose byte length is defined as zero
So BPB is not a decorative metric. It is an attempt to measure how efficiently the model predicts real textual content instead of rewarding or punishing tokenization accidents.
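A minimal sketch of that computation, assuming a `byte_lens` lookup tensor that maps each token ID to its UTF-8 byte length (with special tokens mapped to zero). The names here are illustrative; the actual code in `loss_eval.py` differs in its details.

```python
import math
import torch
import torch.nn.functional as F

def batch_bpb(logits, targets, byte_lens, ignore_index=-100):
    """Bits per byte over one batch; a sketch of the idea, not the repo's code."""
    # Per-position cross-entropy in nats; ignored positions contribute 0.
    nats = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        targets.view(-1),
        ignore_index=ignore_index,
        reduction="none",
    )
    flat = targets.view(-1)
    valid = flat != ignore_index
    # Byte length of each valid target token; special tokens map to 0 bytes.
    n_bytes = byte_lens[flat.clamp(min=0)] * valid
    total_nats = nats[n_bytes > 0].sum()   # zero-byte tokens drop out entirely
    total_bytes = n_bytes.sum()
    # bpb = total_nats / (ln(2) * total_bytes)
    return (total_nats / (math.log(2) * total_bytes)).item()
```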
## What `CORE metric` means
When the log prints:
```text
Step 02000 | CORE metric: 0.2667
```
the training loop has called `evaluate_core(...)` from `picollm/accelerated/pretrain/eval.py`.
The CORE pipeline does four things:
1. load the `eval_bundle` and `core.yaml`
2. iterate through the configured ICL tasks
3. compute raw accuracy for each task
4. compute a **centered** score by subtracting the random baseline and normalizing the remaining headroom
The centered formula in the current code is:
```text
centered = (accuracy - random_baseline) / (1 - random_baseline)
```
with the implementation reading `random_baseline` from `eval_meta_data.csv` and converting it from percentage form.
Why center the score? Because `0.55` accuracy means very different things on a 2-choice task, a 4-choice task, and an open-ended task. Centering asks a more comparable question: how much of the gap above chance has the model closed?
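As a tiny, self-contained illustration of that formula (the function below is mine, not the repo's):

```python
def centered_score(accuracy: float, random_baseline: float) -> float:
    """Fraction of the above-chance headroom closed (0 = chance, 1 = perfect)."""
    return (accuracy - random_baseline) / (1.0 - random_baseline)

# The same 0.55 raw accuracy looks very different once chance is removed:
print(centered_score(0.55, 0.25))  # 0.40 on a 4-choice task
print(centered_score(0.55, 0.50))  # 0.10 on a 2-choice task
```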
## Why the run prints `Evaluating: ... accuracy: ... centered: ...`
A typical block:
```text
Evaluating: hellaswag_zeroshot (0-shot, type: multiple_choice)... accuracy: 0.5500 | centered: 0.4000 | time: 1.19s
```
means:
- `hellaswag_zeroshot` is the task label from `core.yaml`
- `0-shot` is the few-shot count for that task
- `type: multiple_choice` tells the evaluator how to score continuations
- `accuracy` is raw task accuracy
- `centered` is the accuracy adjusted by the task's random baseline
- `time` is wall-clock time for that task on the current hardware
The current CORE suite used by picoLLM covers a deliberately mixed set of capabilities:
- commonsense multiple choice: HellaSwag, PIQA, CommonsenseQA, OpenBookQA, BoolQ, COPA
- school-style reasoning and QA: ARC-Easy, ARC-Challenge, SQuAD, CoQA
- language modeling probes: Jeopardy, LAMBADA, selected BIG-bench tasks
- schema and pronoun resolution: Winograd, WinoGrande
- higher-difficulty reasoning probes: AGIEval LSAT-AR
That mixture is not arbitrary. It reflects a long-standing evaluation practice in language modeling: do not trust any single [[Glossary#Benchmark|benchmark]] family to stand in for "general ability."
## What the sample lines and `<|bos|>` mean
At periodic intervals the training loop also prints short samples such as:
```text
<|bos|>The capital of France is Paris.
```
`<|bos|>` is the **beginning-of-sequence token**. In the base evaluator, conditioned and unconditioned samples are explicitly started with `tokenizer(..., prepend="<|bos|>")`.
This matters for two reasons:
- it makes the causal boundary token visible instead of hiding it
- it reminds you that generation always begins from a concrete tokenized prefix, even when the prompt looks like plain text
The two sample modes are:
- **conditioned samples**: the model is seeded with a short prompt such as `"The capital of France is"`
- **unconditioned samples**: the model is seeded only with BOS and asked to continue freely
Conditioned samples are quick sanity checks for factual continuation and stylistic coherence. Unconditioned samples are better for spotting degeneration, repetition loops, or strange unconditional priors.
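A minimal sketch of the two modes under a hypothetical tokenizer and model API. The `prepend` keyword mirrors the call quoted above, but everything else here is an assumption, not the evaluator's actual code.

```python
import torch

@torch.no_grad()
def greedy_sample(model, tokenizer, prompt: str = "", max_new_tokens: int = 32):
    # Assumed API: mirrors tokenizer(..., prepend="<|bos|>") from the note.
    ids = torch.tensor([tokenizer.encode(prompt, prepend="<|bos|>")])
    for _ in range(max_new_tokens):
        logits = model(ids)                    # (1, T, vocab_size), assumed
        next_id = logits[0, -1].argmax()       # greedy pick, no sampling noise
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return tokenizer.decode(ids[0].tolist())

# conditioned probe:   greedy_sample(model, tok, "The capital of France is")
# unconditioned probe: greedy_sample(model, tok, "")   # starts from BOS alone
```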
## How chat evaluation differs from base evaluation
After SFT, `speedrun.sh` runs:
```bash
torchrun --standalone --nproc_per_node="$PICOLLM_NPROC_PER_NODE" -m picollm.accelerated.chat.eval -- \
-i sft \
-b "$PICOLLM_CHAT_EVAL_BATCH_SIZE"
```
The current chat-eval tasks are:
- `ARC-Easy`
- `ARC-Challenge`
- `MMLU`
- `GSM8K`
- `HumanEval`
- `SpellingBee`
These are split into two evaluation styles in `picollm/accelerated/chat/eval.py`:
- **categorical**: the evaluator scores [[Glossary#Logits|logits]] over answer letters such as `A/B/C/D`
- **generative**: the model must produce a textual completion that the task-specific evaluator grades
The resulting `ChatCORE metric` is a centered mean across those tasks, using simple random baselines such as `0.25` for four-choice multiple-choice tasks and `0.0` for open-ended tasks.
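In spirit, the aggregation looks like the sketch below. The task names and the simple baselines follow this note's examples; the function itself is illustrative, and you should check `chat/eval.py` for the exact values and weighting.

```python
# Baselines as described above: 0.25 for 4-choice tasks, 0.0 for open-ended.
RANDOM_BASELINE = {
    "ARC-Easy": 0.25, "ARC-Challenge": 0.25, "MMLU": 0.25,
    "GSM8K": 0.0, "HumanEval": 0.0, "SpellingBee": 0.0,
}

def chat_core(accuracies: dict[str, float]) -> float:
    centered = [
        (acc - RANDOM_BASELINE[task]) / (1.0 - RANDOM_BASELINE[task])
        for task, acc in accuracies.items()
    ]
    return sum(centered) / len(centered)

print(chat_core({"MMLU": 0.3681, "GSM8K": 0.05}))  # illustrative numbers only
```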
## Why these datasets were chosen
You should understand the suite historically, not treat it as a bag of brand names.
The current picoLLM eval choices cover distinct failure modes:
- **HellaSwag, PIQA, CommonsenseQA, COPA, WinoGrande** test commonsense completion and disambiguation, which became important because early language models could memorize surface forms yet still fail obvious-to-humans causal or physical reasoning.[^10][^11][^12][^13]
- **ARC** was designed to be harder than fact lookup and simple retrieval, pushing toward grade-school reasoning rather than pattern matching.[^14]
- **MMLU** became a standard broad-coverage knowledge benchmark because it spans many school and professional subjects under one protocol.[^15]
- **GSM8K** and **HumanEval** matter because arithmetic and code expose brittle reasoning and planning failures more clearly than fluent prose does.[^16][^17]
- **SQuAD** and **CoQA** capture reading comprehension and conversational QA, which are closer to many assistant-style product tasks than raw completion alone.[^18][^19]
- **BIG-bench** and **AGIEval** represent the modern tendency to aggregate many narrow probes into a larger capability picture rather than over-trust one flagship benchmark.[^20][^21]
## Reading the distributed chat-eval logs
When you see:
```text
[MMLU] Loaded 14042 examples. Evaluating 14042 across 8 ranks (~1756/rank).
Rank 7 | 0/3 (0.00%)
Final: 5169/14042 (36.81%)
```
the important idea is that **rank** means one distributed process in the `torchrun` job.
In other words:
- 8 ranks = 8 worker processes
- each rank gets its own slice of the evaluation set
- the partial counts are reduced across ranks at the end
- `Final:` is the globally aggregated result
This is why the log prints both local rank progress and one final merged score. It is not eight separate evaluations. It is one distributed evaluation job.
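A minimal sketch of that pattern with `torch.distributed`. The real slicing and scoring in `chat/eval.py` differ; this only shows the shard-then-all-reduce shape of the job.

```python
import torch
import torch.distributed as dist

def eval_sharded(examples, score_fn, device="cuda"):
    rank, world = dist.get_rank(), dist.get_world_size()
    local = examples[rank::world]       # rank r scores examples r, r+world, ...
    correct = sum(score_fn(ex) for ex in local)
    counts = torch.tensor([correct, len(local)], device=device)
    dist.all_reduce(counts, op=dist.ReduceOp.SUM)   # merge partial tallies
    return counts[0].item() / counts[1].item()      # the single "Final:" number
```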
## Where the benchmark bundle comes from
If the logs show a line like:
```text
Downloading https://karpathy-public.s3.us-west-2.amazonaws.com/eval_bundle.zip...
```
that is the runtime fetching the public eval bundle used by `evaluate_core(...)`.
In the current picoLLM stack, the public dependency manifest prefers a picoLLM-controlled mirror first and only falls back to the historical S3 URL when needed. That is part of the broader systems point that serious stacks eventually take ownership of their external assets instead of relying on unstated historical URLs forever.
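The mirror-first pattern is easy to sketch. The URLs and file name below are placeholders, not the repo's actual manifest entries.

```python
import urllib.request

def fetch_with_fallback(filename: str, mirrors: list[str]) -> str:
    """Try each mirror in order; placeholder logic, not picollm's downloader."""
    last_err = None
    for base in mirrors:
        try:
            urllib.request.urlretrieve(f"{base}/{filename}", filename)
            return filename
        except OSError as err:  # network failures: move on to the next mirror
            last_err = err
    raise RuntimeError(f"all mirrors failed for {filename}") from last_err
```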
## The first distinction: optimization is not the same thing as quality
Training optimizes an objective. For decoder-only language models, that objective is usually next-token prediction. The model is rewarded when it assigns high probability to the correct continuation of the sequence. That is mathematically clean and operationally powerful, but it is still only a proxy. Users do not experience "cross-entropy loss." They experience answers, refusals, formatting, reasoning traces, latency, and consistency.
That gap matters. A model can reduce next-token loss by getting better at common continuations in web text, but users may care more about whether it answers clearly, avoids repetition, keeps structure in markdown, and follows instructions in chat. This is why real evaluation usually needs multiple layers: training metrics, validation metrics, targeted prompt-based checks, and sometimes benchmark-style task evaluation.[^3]
The core lesson is simple: loss tells you whether optimization is progressing, not whether the product is already good enough.
## Training loss, validation loss, and why both matter
The most basic evaluation distinction is between training loss and [[Glossary#Validation loss|validation loss]]. Training loss tells you how well the model fits the data it is actively learning from. Validation loss tells you how well it generalizes to held-out data that was not used for parameter updates. If the training loss falls while the validation loss stalls or worsens, that is often the first sign of overfitting.[^4]
A common question is why validation matters when the dataset is already huge. The answer is that size alone does not protect you from mismatch, memorization, or bad data curation. If the training distribution is messy or the held-out examples represent different surface patterns, a model can still learn to fit the wrong thing. This is especially relevant in post-training. An SFT run on chat data may reduce training loss while teaching the model brittle prompt habits, unnatural verbosity, or overconfident stylistic patterns that only become obvious once you look outside the training set.
In production-adjacent work, validation is also a decision tool. It helps you decide whether to keep training, stop early, or compare checkpoints. Without it, engineers often drift into a weak pattern: "this run feels good, so let us keep it." Good teams do not rely on feelings first. They check held-out behavior.
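One way to make that decision rule concrete is a toy early-stopping check; a minimal sketch, not `picollm` code:

```python
def should_stop(history, patience: int = 3) -> bool:
    """history: list of (step, train_loss, val_loss) tuples."""
    val_losses = [v for _, _, v in history]
    best = min(range(len(val_losses)), key=val_losses.__getitem__)
    # If held-out loss has not improved for `patience` evaluations while
    # training continues, treat it as the classic overfitting signal.
    return len(val_losses) - 1 - best >= patience
```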
## Why perplexity is useful, and why it is not enough
[[Glossary#Perplexity|Perplexity]] is a standard way to summarize how surprised a language model is by text. Lower perplexity means the model is assigning higher probability to the observed continuation. It is closely tied to cross-entropy and therefore helpful when you want a compact measure of general language-model fit.[^5]
But perplexity is not the same as product quality. A chatbot with better perplexity can still be a worse assistant if it is more verbose, more repetitive, or worse aligned to the chat format. This is why assistant-style systems are usually judged with a combination of automatic metrics and human or prompt-based evaluations. For a course project, that means you should not treat "lower perplexity" as the whole story. It is one useful instrument on the panel, not the entire cockpit.
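The conversions between these instruments are standard and worth keeping at hand (this is textbook math, not repo code):

```python
import math

def perplexity(mean_nats_per_token: float) -> float:
    # exp of the mean cross-entropy in nats: the "effective branching factor".
    return math.exp(mean_nats_per_token)

def bits_per_token(mean_nats_per_token: float) -> float:
    # Same quantity rescaled to bits, the unit BPB also lives in.
    return mean_nats_per_token / math.log(2)

print(perplexity(2.0))      # ~7.39: as surprised as a 7.4-way uniform guess
print(bits_per_token(2.0))  # ~2.89 bits of code length per token
```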
## Prompt-based evaluation: the bridge between metrics and human experience
For conversational models, prompt-based evaluation is often the most intuitive intermediate layer. You select a stable set of prompts and run them against different checkpoints under controlled decoding settings. The point is not to cherry-pick beautiful outputs. The point is to create a repeatable probe that reveals whether a model improved on the behaviors you actually care about.
A good prompt set should cover more than generic "tell me a joke" queries. It should include short greetings, factual explanation prompts, structured formatting tasks, style transfer, refusal boundaries, instruction-following checks, and multi-turn continuation. If the course is about building a chatbot, then the evaluation set should include the sorts of messages you would actually type into the CLI or web UI.
This is also where consistency matters. If you compare checkpoints using different temperatures, different stop conditions, or different chat formatting wrappers, the comparison is contaminated. Real evaluation requires keeping the inference setup stable enough that the [[Glossary#Checkpoint|checkpoint]], not the sampling configuration, explains the difference in behavior.[^6]
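A minimal sketch of such a frozen probe. Here `generate` stands in for whatever inference entry point you use, and both the prompts and the decoding settings are deliberately pinned so only the checkpoint varies.

```python
# Hypothetical prompt suite: the names and settings are illustrative.
PROMPTS = [
    "Hi!",
    "Explain overfitting in two sentences.",
    "Format these as a markdown table: apples 3, pears 5.",
]
DECODING = {"temperature": 0.0, "max_new_tokens": 128}  # identical every run

def run_suite(checkpoint, generate):
    # One output per prompt, under the same decoding settings for every checkpoint.
    return {p: generate(checkpoint, p, **DECODING) for p in PROMPTS}
```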
## Benchmark-style evaluation and why teams still use it carefully
Benchmarks are useful because they offer standardization. A benchmark can tell you how a model performs on question answering, mathematical reasoning, coding, summarization, or safety-specific tasks under a consistent protocol. This helps compare models or checkpoints beyond anecdotal prompts.[^7]
But benchmarks are not neutral truth machines. They can be contaminated, overfit, poorly matched to your use case, or too narrow to capture product behavior. A model may score well on a benchmark and still feel bad in real conversation. This is why strong teams treat benchmarks as one lens, not as a complete replacement for domain-relevant evaluation.
The most honest pattern is:
- use training and validation loss to understand optimization
- use prompt suites to understand chat behavior
- use task benchmarks when you need standardized comparison
- do not confuse any single one of these with "the whole truth"
## Evaluating before and after SFT
One of the most important capstone lessons is to compare the base checkpoint and the post-trained chat checkpoint on the same prompts. This is where you start to see the difference between "a language model that can continue text" and "a model that behaves like a conversational assistant." The base model may know language reasonably well but still respond in awkward continuations, unstable role-play fragments, or raw sequence-completion style text. After SFT, the model may become much more consistent in greeting, formatting, and turn-taking.
That comparison matters because it grounds the purpose of chat formatting and SFT in observable behavior. The evaluation is not theoretical anymore. It becomes a direct before/after demonstration.
## Regression testing: why one good demo is not enough
A common beginner mistake is to test one or two favorite prompts, see that they improved, and conclude that the model is better. Real ML practice is harsher than that. Every model update should also be checked for regressions. Did it become more repetitive? Did it get worse at short direct answers? Did markdown formatting degrade? Did the model start hallucinating more confidently after SFT?
This is where regression testing enters. In small course projects, a regression suite can simply be a fixed prompt set with saved outputs or scoring rubrics. In larger research or production systems, it may include benchmark deltas, automatic formatting checks, safety evaluations, latency measurements, and human reviews.[^8]
The important idea is not the exact tooling. It is the habit of asking "what got worse?" in addition to "what got better?"
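Even a tiny version of this habit pays off. A minimal sketch, assuming outputs come from a fixed prompt suite like the one above:

```python
import json
import pathlib

def regression_diff(new: dict[str, str], baseline_path: str = "baseline.json"):
    """Diff this checkpoint's outputs against the last accepted checkpoint's."""
    path = pathlib.Path(baseline_path)
    if not path.exists():                    # the first run seeds the baseline
        path.write_text(json.dumps(new, indent=2))
        return set()
    old = json.loads(path.read_text())
    # Changed outputs are not automatically regressions; they are the
    # candidates you inspect by hand with "what got worse?" in mind.
    return {prompt for prompt in new if old.get(prompt) != new[prompt]}
```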
## The best practical evaluation stack for this course
For `picollm`, a realistic evaluation stack should include four layers.
First, training and validation metrics should tell you whether the optimization process is behaving sensibly. That includes loss, learning-rate schedule, gradient norms, tokens processed, and [[Glossary#Throughput|throughput]].
Second, a stable prompt suite should compare checkpoints using identical chat wrappers and decoding settings. This is the easiest way to make base-vs-SFT differences concrete.
Third, you should keep a small regression sheet or saved outputs for recurring prompts. This turns evaluation into a scientific habit rather than a vague memory.
Fourth, when the course grows further, you can attach benchmark-style probes for domain-specific tasks. But even before that, the first three layers already put the course closer to real research practice than a simple "train and eyeball" workflow.
## What researchers and production teams do differently
The difference is not that researchers have secret metrics. The difference is discipline. Good teams define evaluation before they launch the run. They know what success would look like, what failure would look like, and how they will compare checkpoints. They do not wait until the run completes to decide what counts as "good."
Production teams add another layer: they evaluate not only output quality but also operational behavior. A model that answers slightly better but doubles serving cost or latency may be the wrong production choice. This is why evaluation and systems thinking cannot really be separated. The best model is often the best model under constraints, not the model with the nicest isolated sample.[^9]
## What you should remember
If you remember only one sentence from this note, let it be this: a finished training run is not a result; it is only a candidate result. The real result is the measured behavior of the checkpoint under a clear evaluation procedure.
That is how real researchers think. That is how serious ML engineers think. And that is the mindset that turns training code into actual model work.
<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
<div><a class="internal-link" data-href="Inference and Sampling" href="Inference%20and%20Sampling">Inference and Sampling</a></div>
</div>
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
<div><a class="internal-link" data-href="Compute, Time, and Cost of LLMs" href="Compute%2C%20Time%2C%20and%20Cost%20of%20LLMs">Compute, Time, and Cost of LLMs</a></div>
</div>
</div>
## References
[^1]: Sebastian Raschka, [LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch)
[^2]: Hugging Face, [Evaluate](https://huggingface.co/docs/evaluate/index)
[^3]: Hugging Face TRL, [SFTTrainer documentation](https://huggingface.co/docs/trl/sft_trainer)
[^4]: Ian Goodfellow, Yoshua Bengio, and Aaron Courville, [Deep Learning](https://www.deeplearningbook.org/)
[^5]: Hugging Face, [Perplexity of fixed-length models](https://huggingface.co/docs/transformers/en/perplexity)
[^6]: OpenAI, [Best practices for prompt engineering](https://platform.openai.com/docs/guides/prompt-engineering)
[^7]: EleutherAI, [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness)
[^8]: Andrej Karpathy, [nanochat](https://github.com/karpathy/nanochat)
[^9]: Hugging Face TB, [The Smol Training Guide](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook)
[^10]: Rowan Zellers et al., [HellaSwag: Can a Machine Really Finish Your Sentence?](https://arxiv.org/abs/1905.07830)
[^11]: Yonatan Bisk et al., [PIQA: Reasoning about Physical Commonsense in Natural Language](https://arxiv.org/abs/1911.11641)
[^12]: Alon Talmor et al., [CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge](https://arxiv.org/abs/1811.00937)
[^13]: Keisuke Sakaguchi et al., [WinoGrande: An Adversarial Winograd Schema Challenge at Scale](https://arxiv.org/abs/1907.10641)
[^14]: Peter Clark et al., [Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge](https://arxiv.org/abs/1803.05457)
[^15]: Dan Hendrycks et al., [Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300)
[^16]: Karl Cobbe et al., [Training Verifiers to Solve Math Word Problems](https://arxiv.org/abs/2110.14168)
[^17]: Mark Chen et al., [Evaluating Large Language Models Trained on Code](https://arxiv.org/abs/2107.03374)
[^18]: Pranav Rajpurkar et al., [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://arxiv.org/abs/1606.05250)
[^19]: Siva Reddy et al., [CoQA: A Conversational Question Answering Challenge](https://arxiv.org/abs/1808.07042)
[^20]: Aarohi Srivastava et al., [Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models](https://arxiv.org/abs/2206.04615)
[^21]: Wanjun Zhong et al., [AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models](https://arxiv.org/abs/2304.06364)