> [!info] Course code > Use these repo paths together with this note: > - [picollm/accelerated/speedrun.sh](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/speedrun.sh) > - [picollm/accelerated/pretrain/train.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/pretrain/train.py) > - [picollm/accelerated/chat/sft.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/sft.py) ## What This Concept Is During a serious run, the logs, metrics, checkpoints, and reports are not background decoration. They are the evidence trail that tells you what the run actually did. This note is about learning to read that trail instead of only glancing at one final number. If training is the experiment, run analysis is the part where you learn what really happened. ## Foundation Terms You Need First A **run** is one concrete training or evaluation execution with a specific config and artifact trail. A **metric history** is how a quantity such as loss or throughput changes over time. An **[[Glossary#Artifact|artifact]]** is a saved output such as a checkpoint, manifest, or report. **Run analysis** is the act of connecting those signals into a defensible explanation. So the habit to build here is simple: do not ask only "did the run finish?" Ask what the curve shapes, checkpoints, samples, and reports are saying together. ## How to read the picoLLM training log line One representative line from `picollm/accelerated/pretrain/train.py` looks like this: ```text step 05567/05568 (99.98%) | loss: 2.336916 | lrm: 0.05 | dt: 997.40ms | tok/sec: 1,051,311 | bf16_mfu: 63.45 | epoch: 1 pq: 122 rg: 80 | total time: 92.46m | eta: 0.0m ``` You should be able to decode every field: - `step 05567/05568`: current optimizer step over total planned steps - `(99.98%)`: percent of the scheduled base-pretraining run completed - `loss`: debiased EMA-smoothed training loss, not one raw noisy micro-batch - `lrm`: learning-rate multiplier from the [[Glossary#Warmup|warmup]]/warmdown schedule - `dt`: wall-clock step time - `tok/sec`: effective token throughput across the whole distributed job - `bf16_mfu`: model [[Glossary#FLOP / FLOPS|flops]] utilization estimate relative to hardware peak [[Glossary#BF16|BF16]] throughput - `epoch`: approximate dataset pass count from the dataloader state - `pq`: current parquet-file index in the distributed dataloader - `rg`: current Parquet row-group index inside that file - `total time`: accumulated wall-clock training time - `eta`: estimated remaining runtime from recent average step time That line is not cosmetic logging. It is the compact state of the run. ## What `pq` and `rg` mean `picollm/accelerated/dataloader.py` streams the training corpus from Parquet shards. The loader state includes: - `pq_idx`: which Parquet file the loader is currently reading - `rg_idx`: which row group inside that file is currently being consumed - `epoch`: how many full passes through the dataset have been completed So when the log says: ```text epoch: 1 pq: 122 rg: 80 ``` it means: - the run is in its first full dataset pass - it is currently on Parquet shard 122 - inside that shard it is reading row group 80 That is especially useful for resumability and debugging. If a run dies, the [[Glossary#Checkpoint|checkpoint]] metadata knows roughly where to restart. ## What `rank` means in the logs In accelerated picoLLM, `rank` means one distributed process launched by `torchrun`. 
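To make that concrete, below is a minimal sketch of how a rank learns its identity and how per-rank partial metrics get merged into one global number. It assumes a `torchrun` launch and uses only the standard `torch.distributed` API; it is not code from the repo.

```python
# Minimal sketch: what a "rank" is under torchrun (illustrative, not code from the repo).
# torchrun launches one copy of this script per GPU and sets RANK / LOCAL_RANK / WORLD_SIZE.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")             # joins all ranks into one process group
    rank = dist.get_rank()                              # this worker's identity: 0 .. world_size-1
    world_size = dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Each rank evaluates only its own slice of the data and produces a partial count ...
    local_metric = torch.tensor([float(rank + 1)], device="cuda")  # stand-in for a real metric

    # ... and the partial counts are reduced into one global value,
    # which is what a "Final: ..." line reports.
    dist.all_reduce(local_metric, op=dist.ReduceOp.SUM)
    if rank == 0:
        print(f"global metric across {world_size} ranks: {local_metric.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=8 sketch.py`, each process prints nothing except rank 0, which reports the merged result.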
```mermaid flowchart TD A["torchrun"] --> B["rank 0"] A --> C["rank 1"] A --> D["..."] A --> E["rank 7"] B --> F["partial metrics"] C --> F D --> F E --> F F["all_reduce / merge"] --> G["global score or checkpoint state"] ``` That explains several common log patterns: - `Rank 7 | 0/3 (0.00%)` in chat eval means rank 7 is reporting its local slice progress - `Final: ...` means those local counts were reduced into one global result - `optim_005568_rank3.pt` means optimizer state for rank 3 was saved separately, because distributed optimizer state is sharded across processes You should not read `rank` as "quality tier" or "GPU number only." It is a worker identity in a distributed job. ## How the training loop emits telemetry The pretrain loop is also a telemetry pipeline: ```mermaid flowchart TD A["Dataloader state"] --> B["train.py step loop"] B --> C["terminal logs"] B --> D["W&B or DummyWandb"] B --> E["checkpoint metadata"] B --> F["report markdown files"] E --> G["resume state: step, pq_idx, rg_idx, epoch, val_bpb"] F --> H["final report and run_manifest.json"] ``` This is the right mental model for serious runs. A training job is not only updating weights. It is also continuously producing evidence about what happened. ## The telemetry artifacts that matter most in picoLLM Beyond terminal lines, the accelerated stack now emits several durable artifacts: - `report/` markdown sections written by `picollm/accelerated/report.py` - `run_manifest.json` written near the end of `speedrun.sh` - checkpoint metadata JSON saved beside model and optimizer state - optional W&B traces if the run is not using `WANDB_RUN=dummy` The current course should push you to inspect all four. If the run only "looked okay in the terminal," that is not enough. ## Why tracking matters Without tracking, ML work collapses into guesswork. It is easy to remember that "the run looked okay" or that "the loss was going down," but those memories are unreliable. Good experiment tracking gives you a record of what happened, when it happened, how long it took, and which configuration produced the result. This matters for at least three reasons. First, it makes reruns and comparisons honest. If one run used a different [[Glossary#Batch size|batch size]], another used a different warmup schedule, and a third used a different dataset slice, the tracker provides a shared factual record. Second, it helps you stop bad runs earlier. If the loss plateaus too early, throughput collapses, or gradient norms become unstable, you can intervene before wasting many more GPU hours. Third, it helps you communicate results. If you want to show a professor, colleague, or collaborator what happened during a run, a clean run log or dashboard is much more convincing than a memory or a screenshot of terminal output.[^2] ## TensorBoard and Weights & Biases: when to use which `picollm` supports both TensorBoard and W&B because they serve slightly different learning and operational needs. TensorBoard is the easier local default. It works well when you are training on your own machine or on a short-lived environment and want a lightweight local dashboard. It is simple, stable, and fits naturally with PyTorch and the Hugging Face training stack.[^3] [^4] Weights & Biases becomes more useful when runs are long, remote, or expensive. It is especially helpful for cloud GPUs because it gives live syncing, experiment comparison, system metrics, organized run metadata, and shareable links. 
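For reference, the logging side of that is small. Here is a minimal sketch against the public `wandb` API; the project, run, and config values are made up rather than the repo's actual wiring, and the metric values are synthetic stand-ins.

```python
# Minimal W&B logging sketch (hypothetical project/run/config values, not the repo's setup).
import math
import wandb

wandb.init(
    project="picollm-pretrain",                 # hypothetical project name
    name="d20-baseline",                        # name the run after its hypothesis
    config={"lr": 3e-4, "device_batch_size": 32, "seq_len": 2048},
)

for step in range(100):
    # In a real run these values come from the training step; here they are synthetic.
    loss = 10.0 * math.exp(-step / 30.0) + 2.3
    lrm = min(1.0, step / 20.0)                 # toy warmup multiplier, like the logged "lrm"
    wandb.log({"train/loss": loss, "train/lrm": lrm}, step=step)

wandb.finish()
```

When the run is started with `WANDB_RUN=dummy`, the training loop logs to the DummyWandb stand-in instead, so the rest of the telemetry (terminal lines, checkpoints, reports) is unchanged.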
If a capstone run is happening on Vast.ai for several hours, W&B is usually the better operational choice because you can monitor it without being glued to the terminal session.[^5] The course should teach both, because the real difference is not which tool is "better in the abstract." The real difference is context. TensorBoard is often the better answer for local iteration. W&B is often the better answer for cloud experimentation and sharing. ## The minimum metrics you should understand The most important tracked metrics are not the fanciest ones. They are the few signals that let you tell whether learning and systems behavior are both healthy. ### Loss Loss is the first metric everyone watches, and for good reason. If it is not trending down at all, the run is almost certainly wrong. But loss is only useful when interpreted with its slope and context. A sharp early drop followed by a slower taper is normal. A flat line near initialization may suggest the optimizer is not doing useful work. Sudden spikes can reflect unstable batches, exploding gradients, precision issues, or bad data. A loss curve is not read point-by-point. It is read as a shape. ### Learning rate It is easy to underappreciate the learning-rate curve because it is configured once and then forgotten. But the learning rate explains a lot of model behavior. If the loss drops rapidly during warmup and then stabilizes as the scheduler decays, that is often healthy. If the learning rate is too high, the run may oscillate or diverge. If it is too low, learning can appear stubbornly slow. The learning-rate chart is therefore one of the main diagnostics when a run feels too unstable or too sluggish.[^6] ### Gradient norm Gradient norm is one of the best simple indicators of optimization stability. If it explodes, training can become unstable. If it collapses to nearly nothing too early, the model may not be learning effectively. Stable, bounded gradient norms usually mean the optimization process is behaving sensibly. They do not guarantee a good model, but they do tell you whether the update mechanics are calm or chaotic. ### Throughput and step time Throughput is where experiment tracking meets systems engineering. A healthy loss curve can still hide an inefficient run. If step time increases over the course of a job, you need to know why. Is the data pipeline stalling? Are checkpoint saves blocking progress? Is multi-GPU communication becoming the bottleneck? A researcher who ignores throughput is only half-reading the experiment. For this course, you should learn to compute ETA directly from step time. If a run is doing `5000` steps at about `2.8s/it`, that is about `3.9` hours for pretraining. This is not just arithmetic; it is operational awareness. ### System metrics System metrics become especially important on cloud hardware. GPU utilization, memory usage, memory bandwidth, [[Glossary#Temperature|temperature]], power draw, and communication health all help interpret whether the job is compute-bound, memory-bound, or simply underutilizing the rented machine. A beautifully smooth loss curve on half-idle GPUs is not a victory. It is a sign that the run may be leaving performance on the table.[^7] ## How to read a W&B run properly When you first open W&B, it is easy to focus on the most visible chart, usually loss. That is a start, but it is not enough. A proper reading order is better. Start with the run metadata. Check the run name, project, hardware assumptions, key hyperparameters, dataset identity, and any notes. 
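As an illustration, the record you want attached to every run can be as small as the sketch below; the field names and values are made up, not the schema of `run_manifest.json` or the repo's checkpoint metadata.

```python
# Illustrative run-metadata record (hypothetical fields, not the repo's manifest schema).
import json

run_metadata = {
    "run_name": "pretrain-d20-blk2048",          # encode the hypothesis in the name
    "project": "picollm-pretrain",
    "hardware": "8x H100 SXM (rented)",
    "dataset": "<dataset id and slice>",
    "hyperparameters": {"lr": 3e-4, "device_batch_size": 32, "total_steps": 5568},
    "notes": "baseline before the block-size ablation",
}

with open("run_metadata.json", "w") as f:
    json.dump(run_metadata, f, indent=2)
```

Whether this lives in the W&B config, the checkpoint metadata, or a plain JSON file matters less than it existing at all.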
Without that, even a beautiful chart can be hard to interpret because you do not know what configuration produced it. Then look at the learning-rate curve. This gives the schedule context. A loss curve only makes sense relative to what the optimizer is being asked to do. Then look at training loss and, if available, [[Glossary#Validation loss|validation loss]]. Read them together, not separately. Ask whether the run is still improving at a meaningful rate and whether the gap between train and validation is widening. Then examine gradient norm. This is where you notice whether the optimization dynamics are calm or erratic. Only after that should you inspect system panels: GPU utilization, memory usage, power, and throughput. These panels help you distinguish "the model is learning slowly" from "the run is wasting hardware." The key is that W&B should be read like a control panel, not like a single decorative line chart. ## How to use TensorBoard properly TensorBoard is easiest to misuse when you treat it as "the local version of W&B" and stop there. It is better to think of it as a focused local visualization tool. TensorBoard is excellent for checking whether logging is wired correctly, whether scalar histories are sane, and whether a local run is evolving as expected. It is especially good when you want fast feedback without extra account setup.[^3] The right habit is the same as with W&B: do not open it just to admire the loss curve. Open it to answer a concrete question. Is the learning-rate warmup behaving correctly? Is the run slowing down? Did a change in batch size alter the shape of optimization? Is the validation loss separating from the training loss? Used that way, TensorBoard is a sharp instrument rather than a passive dashboard. ## Comparing runs scientifically The real value of experiment tracking appears when you compare runs. But comparison is only meaningful if the runs are close enough that the changed variable actually explains the observed difference. If one run changed the dataset, the sequence length, the batch size, the warmup schedule, and the architecture all at once, then almost nothing can be concluded cleanly. This is where scientific discipline enters. Change one major variable at a time when possible. Write down the intent of the run. Name the run so the hypothesis is visible later. Then compare the resulting curves and outputs against that specific question. For example, a clean comparison might ask: - Did increasing block size improve validation behavior enough to justify slower throughput? - Did switching datasets improve conversational behavior after [[Glossary#SFT|SFT]]? - Did a different warmup schedule reduce instability at the start of training? Those are good experiment questions because they isolate one design decision at a time. ## Common signs of a healthy run A healthy run does not mean every metric is perfectly smooth. It usually means the system shows a coherent pattern: - the learning rate follows the intended schedule - the loss falls at a plausible rate - gradient norms stay bounded - throughput is stable enough to trust ETA - checkpointing and logging do not freeze progress - no persistent network, dataloader, or precision errors appear The goal is not aesthetic perfection. The goal is internal consistency. ## Common signs of trouble The opposite pattern is also worth stating explicitly. 
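The warning signs listed right after this sketch can also be approximated by simple automated guards over the logged history; the thresholds and field names below are illustrative, not taken from the repo.

```python
# Toy guards over a logged metric history (illustrative thresholds, not code from the repo).
# Each entry in `history` is assumed to hold per-step "loss", "grad_norm", and "dt" (seconds).
import math

def flag_suspicious(history):
    flags = []
    if any(math.isnan(h["grad_norm"]) for h in history):
        flags.append("gradient norm became NaN")
    if len(history) >= 200 and history[-1]["loss"] >= history[-200]["loss"]:
        flags.append("loss flat or rising over the last 200 steps")
    if len(history) >= 100:
        early_dt = sum(h["dt"] for h in history[:50]) / 50
        recent_dt = sum(h["dt"] for h in history[-50:]) / 50
        if recent_dt > 1.5 * early_dt:
            flags.append("step time drifted up by more than 50%")
    return flags
```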
A run should be treated as suspicious when:

- loss is flat or rises for too long without explanation
- gradient norm spikes wildly or becomes `nan`
- step time drifts upward sharply
- GPU utilization remains low for long periods
- repeated dataset retries or I/O stalls dominate the logs
- validation behavior diverges from training in a worrying way

These are not reasons to panic immediately. They are reasons to investigate.

## What a good run report looks like

For a capstone course, you should be able to produce a short run report after training. It does not have to be complicated.

A strong report might include:

- hardware used
- runtime
- main hyperparameters
- loss curve summary
- throughput summary
- final checkpoint path or Hub link
- a few prompt comparisons before and after SFT
- one paragraph explaining whether the run was worth its cost

This is the sort of artifact that makes a course look serious to researchers and professors, because it reflects the habits of actual experimental work rather than a one-off demo mindset.

## What you should remember

Experiment tracking is not about making dashboards look professional. It is about making your conclusions trustworthy. TensorBoard and W&B are only useful if they help you answer concrete questions about optimization, stability, throughput, and result quality.

Once you understand that, you stop treating telemetry as optional decoration and start using it as part of the scientific method.

<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
<div><a class="internal-link" data-href="Distributed Training and Multi-GPU" href="Distributed%20Training%20and%20Multi-GPU">Distributed Training and Multi-GPU</a></div>
</div>
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
<div><a class="internal-link" data-href="Research Workflow and Ablations" href="Research%20Workflow%20and%20Ablations">Research Workflow and Ablations</a></div>
</div>
</div>

## References

[^1]: Montekkundan, [llm repository](https://github.com/Montekkundan/llm)
[^2]: Andrej Karpathy, [nanochat](https://github.com/karpathy/nanochat)
[^3]: TensorFlow, [TensorBoard guide](https://www.tensorflow.org/tensorboard)
[^4]: PyTorch, [TensorBoard support in PyTorch](https://docs.pytorch.org/docs/stable/tensorboard.html)
[^5]: Weights & Biases, [Experiment tracking guides](https://docs.wandb.ai/guides/track)
[^6]: Hugging Face, [Trainer documentation](https://huggingface.co/docs/transformers/main_classes/trainer)
[^7]: NVIDIA, [Performance analysis and monitoring resources](https://developer.nvidia.com/performance-analysis-tools)