> [!info] Course code
> Use these repo paths together with this note:
> - [picollm/accelerated/speedrun.sh](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/speedrun.sh)
> - [picollm/accelerated/checkpoint_manager.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/checkpoint_manager.py)
> - [picollm/accelerated/report.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/report.py)

## What This Concept Is

A convincing experiment is not just one that worked once on your machine. It is one that another person can inspect, rerun, and challenge using the recorded method and artifacts. This note is about building that standard into your work: turning runs into credible evidence.

## Foundation Terms You Need First

**[[Glossary#Repeatability|Repeatability]]** asks whether the same team can rerun the same setup and get materially similar results. **Reproducibility** asks whether another team can reconstruct the result from the documentation and artifacts. An **artifact trail** is the set of configs, checkpoints, reports, and assumptions that makes that possible. **[[Glossary#Robustness|Robustness]]** asks whether the claim survives small changes such as seed or hardware differences.

Those words are often blurred together, but keeping them separate makes the research standard much clearer.

## Seed sensitivity

Random seeds matter because training is not perfectly deterministic across all stacks and hardware settings. Small differences in initialization, ordering, or low-level execution can produce measurable output differences. This does not make research impossible; it means single-run outcomes should be interpreted with care.

The mature habit is:

- do not report one lucky seed as if it were the system
- do not assume a tiny improvement is real if it disappears on rerun
- distinguish "this exact run worked" from "this method is reliably better"

For expensive runs, a full seed sweep may be unrealistic. But even then, authors should acknowledge the limitation explicitly instead of silently implying certainty.

## Run documentation

Every serious run should carry enough metadata to answer basic questions later:

- what code version was used?
- what dataset and split were used?
- what model size and [[Glossary#Tokenizer|tokenizer]] were used?
- what hardware was used?
- what were the key hyperparameters?
- what was the evaluation procedure?

Without this information, many comparisons decay into guesswork. In practice, you should learn to preserve at least:

- commit hash or tagged code snapshot
- config file or CLI command
- dataset identifier and any local preprocessing assumptions
- [[Glossary#Checkpoint|checkpoint]] path and artifact naming convention
- evaluation command and decoding settings

## Determinism tiers

Not all parts of an ML workflow are equally deterministic. It is useful to keep a small determinism taxonomy in mind:

1. exact deterministic reruns on identical hardware and software
2. statistically similar reruns on the same general stack
3. claim-level robustness across reasonable implementation differences

For modern LLM work, tier 1 is often difficult across different hardware or library versions. Tier 2 is more realistic for most course projects. Tier 3 is what matters most scientifically: does the conclusion survive modest perturbations?
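If you do need tier 1 on a single machine, the usual starting point is to pin every random seed and ask the framework for deterministic kernels. The snippet below is a minimal sketch of that setup in PyTorch; the function name and the exact set of flags are illustrative rather than part of the course code, and the PyTorch reproducibility notes linked under Further reading cover the remaining caveats.

```python
import os
import random

import numpy as np
import torch


def make_deterministic(seed: int = 1337) -> None:
    """Best-effort tier-1 determinism for a single-process PyTorch run.

    Even with all of this, bitwise reproducibility is only expected on the
    same hardware, driver, and library versions.
    """
    random.seed(seed)                 # Python-level RNG (shuffles, sampling)
    np.random.seed(seed)              # NumPy RNG (data pipelines)
    torch.manual_seed(seed)           # torch CPU RNG
    torch.cuda.manual_seed_all(seed)  # all visible GPUs

    # Prefer deterministic kernels; warn instead of erroring when none exists.
    torch.use_deterministic_algorithms(True, warn_only=True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # Required by some deterministic cuBLAS paths (see the PyTorch notes).
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
```

Even with this in place, data loader worker seeding and distributed launch order remain separate concerns, which is why the tier 2 and tier 3 framing still matters.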
## Data and artifact provenance

Good research method includes data provenance, not only model configs. You should be able to answer:

- which dataset snapshot was used?
- what filtering or formatting logic changed the raw source?
- what tokenizer version transformed the data?
- which artifacts were retained for later inspection?

This matters because many downstream discrepancies are not optimizer bugs at all. They are provenance bugs. The dataset quietly changed. The tokenizer was rebuilt. The prompt template drifted. The evaluation suite was edited in place. Reproducibility starts failing long before anyone notices.

## Reproducibility limits

Even with careful logging, exact bit-for-bit reproducibility may be difficult across hardware, libraries, and distributed environments. Good scientific practice does not require pretending those limits do not exist. It requires being explicit about them.

This is especially important in distributed training, mixed precision, and cloud workflows, where non-deterministic kernels, data ordering differences, or interrupted runs can all affect the trajectory.

## Writing claims narrowly and honestly

One of the strongest research habits is learning to phrase conclusions with the right scope. Instead of saying "this model is better," say "this model performed better on our prompt suite and validation setup under these decoding settings." That is a narrower claim, but it is also a much more trustworthy one.

At review time, narrow honest claims are usually stronger than broad vague ones because they let a reader see what was actually tested.

## Structuring a research report

You should learn to structure experiment writeups with a few stable sections:

- question or hypothesis
- setup
- metrics
- observations
- comparison against baseline
- conclusion with limits

That structure forces clarity and makes the work easier to review. For stronger work, add:

- artifact provenance
- uncertainty or seed caveats
- evaluation protocol details
- failure cases or negative results

Negative results belong in a research notebook because they explain which hypotheses failed and prevent later self-deception.

## Minimal reproducibility checklist

Before you claim a result is final, you should be able to check:

- I can reconstruct the command or config.
- I know which code snapshot produced the run.
- I know which data source and tokenizer were used.
- I preserved the checkpoint and evaluation outputs.
- I can explain what could still vary on rerun.

That checklist is simple, but it already moves the work from "demo memory" toward actual method.
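One way to make that checklist mechanical rather than aspirational is to write a small machine-readable record next to every run's artifacts. The sketch below shows one possible shape for such a record; the function name, file layout, and example identifiers are hypothetical and are not the format used by checkpoint_manager.py or report.py in the repo.

```python
import json
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path


def write_run_record(out_dir: str, config: dict, extra: dict | None = None) -> Path:
    """Write a small JSON record next to a run's artifacts.

    Captures the checklist items: launch command, code snapshot, data and
    tokenizer identifiers, and where the checkpoint and eval outputs live.
    """
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"

    record = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "command": " ".join(sys.argv),  # how the run was launched
        "git_commit": commit,           # code snapshot
        "config": config,               # hyperparameters, model size, seed
        **(extra or {}),
    }

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "run_record.json"
    path.write_text(json.dumps(record, indent=2))
    return path


if __name__ == "__main__":
    # Illustrative values only; a real run would pass its actual config.
    write_run_record(
        "runs/example",
        config={"model": "picollm-small", "seed": 1337, "lr": 3e-4},
        extra={
            "dataset": "dataset-snapshot-v2",  # placeholder identifier
            "tokenizer": "gpt2-bpe, vocab 50257",
            "checkpoint_path": "runs/example/ckpt_last.pt",
            "eval_command": "python evaluate.py --ckpt runs/example/ckpt_last.pt",  # placeholder
        },
    )
```

Whatever the record cannot capture automatically, such as dataset snapshot assumptions or local preprocessing, is exactly the part worth writing down by hand.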
<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
    <div><a class="internal-link" data-href="Formal Evaluation and Benchmarking" href="Formal%20Evaluation%20and%20Benchmarking">Formal Evaluation and Benchmarking</a></div>
  </div>
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
    <div><a class="internal-link" data-href="Inference Runtime and KV Cache" href="Inference%20Runtime%20and%20KV%20Cache">Inference Runtime and KV Cache</a></div>
  </div>
</div>

## Further reading

- Joelle Pineau et al., "Improving Reproducibility in Machine Learning Research," 2021. https://jmlr.org/papers/v22/20-303.html
- NeurIPS, "Paper Checklist Guidelines," 2025. https://neurips.cc/public/guides/PaperChecklist
- PyTorch, "Reproducibility," 2025. https://docs.pytorch.org/docs/stable/notes/randomness.html