> [!info] Course code
> Use these repo paths together with this note:
> - [picollm/accelerated/README.md](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/README.md)
> - [picollm/accelerated/speedrun.sh](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/speedrun.sh)
## What This Concept Is
A lot of people can launch experiments. Fewer can explain what question the experiment was meant to answer and what evidence supports the conclusion. This note is about that more disciplined way of working.
If you remember one thing, remember this: changing things is easy; learning from the change is the real skill.
## Foundation Terms You Need First
A **[[Glossary#Hypothesis|hypothesis]]** is the claim a run is meant to test. A **[[Glossary#Baseline|baseline]]** is the control you compare against. An **[[Glossary#Ablation|ablation]]** changes one important factor while trying to keep the others fixed. A **conclusion** is the written interpretation supported by that comparison.
So a good research workflow is not only "run more experiments." It is "ask a clear question, isolate a variable, measure the change, and write down what the evidence actually supports."
## The basic research loop
At a high level, the workflow is simple:
1. form a hypothesis
2. design a run that tests it
3. measure the result
4. compare against a baseline
5. write down the conclusion carefully
The important thing is not that this loop is complicated. The important thing is that many people skip half of it. They change several variables at once, forget to define a baseline, look at a few outputs, and conclude that the new run is better. That is not research. That is improvisation.
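One way to make the loop concrete is to treat each run as a record that is not finished until all five steps have been filled in. Below is a minimal sketch; the `ExperimentRecord` class and its fields are illustrative conventions for this note, not part of `picollm`:

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentRecord:
    """One run = one question. Illustrative structure, not a picollm API."""
    hypothesis: str    # the claim this run tests
    intervention: str  # what changed relative to the baseline
    baseline_run: str  # run name/id of the control being compared against
    metrics: dict = field(default_factory=dict)  # measured results
    conclusion: str = ""  # written only after comparing to the baseline

    def is_complete(self) -> bool:
        # A run without a baseline, metrics, and a conclusion
        # has not actually finished the loop.
        return bool(self.hypothesis and self.baseline_run
                    and self.metrics and self.conclusion)

record = ExperimentRecord(
    hypothesis="New SFT dataset improves short answers without hurting formatting",
    intervention="swap sft_dataset v1 -> v2, everything else fixed",
    baseline_run="sft-v1-baseline",
    metrics={"short_answer_win_rate": 0.61, "format_consistency": 0.97},
    conclusion="v2 improved short answers on our fixed prompt suite; formatting unchanged.",
)
assert record.is_complete()
```

The point of the structure is the missing-field check: if you cannot fill in the baseline or the conclusion, the run was improvisation, not an experiment.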
## What a good hypothesis looks like
A good hypothesis is narrow enough that the result can actually teach you something. "Can we make the model better?" is too vague. "Does switching from one [[Glossary#SFT|SFT]] dataset to another improve greeting and short-answer behavior without hurting formatting consistency?" is much better. It points to a concrete intervention and implies a concrete evaluation procedure.
The reason this matters is that experiment tracking becomes much more useful when the run has an explicit question behind it. A run name, a W&B note, or a short experiment sheet should tell future-you what the run was trying to learn.
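A hedged sketch of what this looks like with W&B: encode the question into the run itself so future-you can recover it. The project name, run name, and config values here are illustrative, but `wandb.init` with `name`, `notes`, and `config` is the standard API:

```python
import wandb

run = wandb.init(
    project="picollm-sft",  # illustrative project name
    name="sft-dataset-v2-vs-v1",
    notes=(
        "Hypothesis: dataset v2 improves greeting/short-answer behavior "
        "without hurting formatting consistency. Baseline: sft-dataset-v1 run."
    ),
    config={"sft_dataset": "v2", "baseline_run": "sft-dataset-v1"},
)
# ... training and evaluation happen here ...
run.finish()
```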
## Why baselines matter
A baseline is the reference point that makes a comparison meaningful. In `picollm`, a baseline might be:
- the base pretrained [[Glossary#Checkpoint|checkpoint]] before SFT
- the old SFT dataset before a new one is introduced
- the old warmup schedule before the new one is tested
- the earlier hardware or batch setting before a [[Glossary#Throughput|throughput]] optimization is attempted
Without a baseline, people often end up evaluating a run in isolation. That is a weak habit because almost every interesting claim in ML is comparative.
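The comparative habit can be mechanized. This is a minimal sketch with made-up metric names and numbers, assuming both runs logged the same metrics:

```python
def compare_to_baseline(baseline: dict, candidate: dict) -> dict:
    """Per-metric deltas; a claim is only meaningful relative to the control."""
    return {k: candidate[k] - baseline[k] for k in baseline if k in candidate}

baseline = {"val_loss": 2.41, "format_consistency": 0.97, "tokens_per_sec": 51_000}
candidate = {"val_loss": 2.38, "format_consistency": 0.94, "tokens_per_sec": 63_000}

for metric, delta in compare_to_baseline(baseline, candidate).items():
    print(f"{metric}: {delta:+.3f}")
```

Note the mixed result in the fake numbers (faster, lower loss, worse formatting): that is exactly the kind of trade-off that evaluating a run in isolation would hide.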
## What counts as a proper ablation
A proper ablation changes a limited part of the system and then measures the consequences. If you change the dataset, [[Glossary#Tokenizer|tokenizer]], model width, sequence length, and scheduler all in one shot, then even a better result does not tell you which of those choices mattered most.
This is why experienced teams run smaller, sharper comparisons. They do not necessarily change only a single config value, but they keep the scope narrow enough that the outcome teaches them something reusable.
In a course setting, that often means comparing:
- one dataset choice at a time
- one optimizer or scheduler setting at a time
- one prompt formatting convention at a time
- one serving or decoding change at a time
The best ablation is the one that answers a question clearly enough that the next experiment becomes easier to design.
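In code, the discipline is simple: pin a base config and override exactly one key per run. A minimal sketch, with illustrative config keys rather than `picollm`'s actual ones:

```python
import copy

base_config = {
    "sft_dataset": "v1",
    "scheduler": "cosine_warmup",
    "prompt_format": "chatml",
    "max_seq_len": 2048,
}

# One ablation = one overridden key; everything else stays pinned to the base.
ablations = [
    {"sft_dataset": "v2"},
    {"scheduler": "linear_warmup"},
    {"prompt_format": "plain"},
]

for override in ablations:
    cfg = copy.deepcopy(base_config)
    cfg.update(override)
    key, value = next(iter(override.items()))
    print(f"run ablate-{key}={value}: {cfg}")
```

If a result surprises you, the single overridden key tells you exactly where to look next.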
## Why writing matters in research workflow
It is easy to think the real research happens only in code. But a large amount of serious research work is actually note-taking: writing the hypothesis, the configuration, the observed metrics, the outputs, and the interpretation. If a run cannot be described clearly afterward, then the knowledge gained from it is fragile.
This is one reason tools like W&B notes, run names, and experiment sheets matter. They create a paper trail. The point is not bureaucracy. The point is preserving the reasoning process.
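If you do not use a tracking tool, even a plain text template enforces the habit. The fields below are a suggestion, not a standard; any of them could be adapted to a W&B note or a shared experiment sheet:

```python
NOTE_TEMPLATE = """\
## Run: {name}
- Hypothesis: {hypothesis}
- Config delta vs baseline: {delta}
- Metrics: {metrics}
- Interpretation: {interpretation}
"""

note = NOTE_TEMPLATE.format(
    name="sft-dataset-v2-vs-v1",
    hypothesis="dataset v2 improves short answers without hurting formatting",
    delta="sft_dataset: v1 -> v2",
    metrics="short_answer_win_rate 0.55 -> 0.61; format_consistency unchanged",
    interpretation="Improvement holds on our fixed prompt suite only.",
)
print(note)  # or append it to an experiments.md sheet
```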
If you want to see a more workflow-oriented tooling example, you can also look at Tangle. It is an open-source collaborative platform with a drag-and-drop editor for ML and data pipelines, useful as an external reference for experiment organization and reproducible workflow design, especially when teams outgrow ad hoc notebooks and shell scripts. But it should stay an optional tooling example, not a prerequisite for understanding the scientific logic of ablations.[^4]
## The danger of over-claiming
One of the most important research habits is learning how not to overstate results. A better-looking output on a few prompts does not necessarily prove a model is broadly better. A lower [[Glossary#Loss|loss]] on one dataset split does not necessarily imply improved user-facing quality. A faster run does not necessarily imply a better training recipe if it sacrifices stability or output quality.
You should learn to write claims with the right scope. For example:
- "This SFT dataset improved short conversational prompts in our fixed prompt suite."
- "This scheduler reduced early instability in this training recipe."
- "This batch configuration improved throughput on our H100 node."
These are strong claims because they say exactly what changed and under what conditions.
## Why this matters for professors and researchers reviewing the course
If the course only shows how to launch end-to-end runs, it looks practical but not fully scientific. Adding research workflow and ablations changes that. It shows how to reason about evidence, isolate variables, and communicate claims responsibly. That is one of the biggest differences between a good engineering demo course and a course that researchers will respect.
<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
<div><a class="internal-link" data-href="Experiment Tracking and Run Analysis" href="Experiment%20Tracking%20and%20Run%20Analysis">Experiment Tracking and Run Analysis</a></div>
</div>
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
<div><a class="internal-link" data-href="Data Curation and Dataset Quality" href="Data%20Curation%20and%20Dataset%20Quality">Data Curation and Dataset Quality</a></div>
</div>
</div>
## References
[^1]: Sebastian Raschka, [LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch)
[^2]: Andrej Karpathy, [nanochat](https://github.com/karpathy/nanochat)
[^3]: Hugging Face TB, [The Smol Training Guide](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook)
[^4]: Tangle, [Build ML and data pipelines collaboratively](https://tangleml.com/)