> [!info] Course code
> Use the companion repository for runnable notebooks, figures, and implementation references for this lecture:
> - [notebooks/training_loop/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/training_loop/lecture_walkthrough.ipynb)
> - [course_tools/runtime.py](https://github.com/Montekkundan/llm/blob/main/course_tools/runtime.py)
> - [scripts/base_training_flow/run.py](https://github.com/Montekkundan/llm/blob/main/scripts/base_training_flow/run.py)
> - [picollm/accelerated/pretrain/train.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/pretrain/train.py)
> - [picollm/accelerated/checkpoint_manager.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/checkpoint_manager.py)

## What This Concept Is

Imagine showing the model one batch of data, letting it guess, measuring how wrong it was, and then nudging the weights a little. Now imagine doing that again and again for millions of updates. That repeated cycle is the training loop. This note matters because it takes the model out of the realm of diagrams and turns it into a process that actually changes over time.

## Foundation Terms You Need First

Keep the loop in four beats. A **batch** is the chunk of examples you show at once. The **[[Glossary#Forward pass|forward pass]]** produces logits and loss from the current weights. The **[[Glossary#Backward pass|backward pass]]** computes gradients that say how the weights should move. The **optimizer step** applies that movement. Once those four beats are clear, the larger engineering details make more sense. Evaluation cadence, checkpointing, mixed precision, and distributed launch are all additions around that same core cycle.

```mermaid
flowchart TD
    A["Batch"] --> B["Forward pass"]
    B --> C["Loss"]
    C --> D["Backward pass"]
    D --> E["Optimizer step"]
    E --> F["Next batch"]
    E --> G["Eval, samples, and checkpoints"]
```

## A toy loop before the systems details

Imagine one very small batch of text. The model reads it, predicts the next tokens, gets some of those predictions wrong, and produces a loss. Backpropagation turns that loss into gradients, and the optimizer uses those gradients to slightly change the weights. Then you do it again on the next batch.

That is the core loop. Everything else in this note exists because real training runs need that same loop to survive for hours or days while still producing evidence you can trust.

## The loop appears twice in this course

I deliberately teach the training loop in two layers.

- `course_tools/runtime.py` and the notebook show the smallest loop that you can still understand end to end.
- `picollm/accelerated/pretrain/train.py` shows the serious loop that powers the real multi-GPU run.

> [!example] Notebook follow-up
> - [`Forward backward step repeat`](https://github.com/Montekkundan/llm/blob/main/notebooks/training_loop/lecture_walkthrough.ipynb#forward-backward-step-repeat)
> Use this notebook section to ground the small training-loop story before the larger systems discussion.

The conceptual loop is the same in both places. What changes is the systems surface:

- distributed launch
- [[Glossary#Checkpoint|checkpoint]] recovery
- evaluation cadence
- mixed precision
- [[Glossary#MFU|MFU]] and [[Glossary#Throughput|throughput]] estimates
- long-running reporting and artifact management

The training loop is where the abstract model becomes a learning system. It is easy to underestimate this note because "training loop" can sound like boilerplate. It is not boilerplate. It is the mechanism that turns data, parameters, gradients, hardware, and metrics into actual optimization. [^1]
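To make those four beats concrete before any systems machinery, here is a minimal runnable sketch. Everything in it is illustrative: the tiny embedding-plus-linear model, the random token batches, and the hyperparameters are placeholders, not the notebook's or the repo's actual code.

```python
import torch
import torch.nn as nn

# Toy stand-ins: a tiny "model" and random token batches so the
# four-beat control flow is visible without any real data.
vocab_size, block_size = 256, 32
model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(100):
    # 1. Batch: sample token windows (random here; real code streams a corpus).
    x = torch.randint(0, vocab_size, (8, block_size))
    y = torch.randint(0, vocab_size, (8, block_size))  # next-token targets

    # 2. Forward pass: logits and loss from the current weights.
    logits = model(x)
    loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), y.view(-1))

    # 3. Backward pass: gradients that say how the weights should move.
    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    # 4. Optimizer step: apply that movement, then repeat on the next batch.
    optimizer.step()
```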
## The actual evaluation and checkpoint triggers in picoLLM

Once the small four-beat loop is clear, the right next question is: *what code line decides to run [[Glossary#Bits per byte (BPB)|BPB]], CORE, samples, and checkpoints, and at which step?*

In `picollm/accelerated/pretrain/train.py`, the loop explicitly checks these conditions:

```python
if args.eval_every > 0 and (last_step or step % args.eval_every == 0):
    ...
    val_bpb = evaluate_bpb(...)
```

```python
if args.core_metric_every > 0 and (last_step or (step > 0 and step % args.core_metric_every == 0)):
    ...
    results = evaluate_core(...)
```

```python
if args.sample_every > 0 and master_process and (last_step or (step > 0 and step % args.sample_every == 0)):
    ...
```

```python
if last_step or (step > 0 and step != args.resume_from_step and args.save_every > 0 and step % args.save_every == 0):
    save_checkpoint(...)
```

The default flag values at parser definition time are:

- `--eval-every=250`
- `--core-metric-every=2000`
- `--sample-every=2000`
- `--save-every=-1`

So by default:

- validation BPB runs every 250 steps
- CORE runs every 2000 steps
- samples run every 2000 steps on the master process
- checkpoints save only at the end unless `save-every` is set

`speedrun.sh` keeps the eval cadence defaults and mainly overrides the batch and hardware-related settings. The main save override comes from `PICOLLM_BASE_SAVE_EVERY`, especially when periodic Hugging Face archive sync is enabled.

## Why the loop runs one extra iteration at the end

The pretrain loop defines:

```python
last_step = step == num_iterations
```

and the comment says the loop runs `num_iterations + 1` times so it can still evaluate and save at the end. This is an important systems detail. Without it, a run whose final step is not aligned to `eval_every` or `core_metric_every` could finish without a clean terminal evaluation snapshot.

## Training-loop control flow in the serious path

```mermaid
flowchart TD
    A["Load or resume state"] --> B["Pull next token batch"]
    B --> C["Forward pass and loss"]
    C --> D["Backward pass"]
    D --> E["Optimizer step"]
    E --> F{"step hits eval cadence?"}
    F -->|yes| G["Validation BPB"]
    F -->|no| H{"step hits CORE cadence?"}
    G --> H
    H -->|yes| I["CORE eval"]
    H -->|no| J{"step hits sample cadence?"}
    I --> J
    J -->|yes| K["Short sample generation"]
    J -->|no| L{"step hits save cadence or final step?"}
    K --> L
    L -->|yes| M["Save checkpoint + metadata"]
    L -->|no| N["Log step metrics"]
    M --> N
    N --> O["Increment step and repeat"]
```

## What `rank`, `pq`, and `rg` mean inside the loop

Three log fields cause confusion because they mix optimization and data-loading state.

- `rank`: distributed process identity
- `pq`: current Parquet shard index
- `rg`: current Parquet row-group index inside that shard

This is not random instrumentation. These values exist because the accelerated loader is streaming a sharded Parquet corpus across multiple distributed workers, and the run needs enough state to resume approximately where it left off. That is why the checkpoint metadata stores:

- `step`
- `val_bpb`
- `dataloader_state_dict`
- `loop_state` with `min_val_bpb`, `smooth_train_loss`, and `total_training_time`
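To make the resume story concrete, here is a hedged sketch of what saving and restoring that state could look like. The function names and the assumption that the loader exposes `state_dict()` / `load_state_dict()` are illustrative; the repo's `checkpoint_manager.py` is the actual reference.

```python
import torch

# Hypothetical resumable-checkpoint sketch. `loader`, `loop_state`,
# and the function names are assumptions for illustration, not the
# repo's checkpoint_manager API.
def save_checkpoint_sketch(ckpt_path, step, model, optimizer, loader, loop_state):
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            # Shard and row-group cursors (the `pq`/`rg` story) live here,
            # so a resumed run can continue near where the stream stopped.
            "dataloader_state_dict": loader.state_dict(),
            "loop_state": loop_state,  # e.g. min_val_bpb, smooth_train_loss
        },
        ckpt_path,
    )

def load_checkpoint_sketch(ckpt_path, model, optimizer, loader):
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    loader.load_state_dict(ckpt["dataloader_state_dict"])
    return ckpt["step"], ckpt["loop_state"]
```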
If you want a flag-by-flag explanation of the cloud training knobs such as `vocab-size`, `block-size`, `layers`, `batch-size`, `grad-accum`, `warmup-steps`, `max-steps`, `save-steps`, `bf16`, or `--alternating-chat-roles`, read [[Training Configuration and Hyperparameters]] alongside this note.

## The minimal loop

A minimal language-model training loop does five things:

1. sample a batch of token sequences
2. run a forward pass to compute [[Glossary#Logits|logits]] and [[Glossary#Loss|loss]]
3. backpropagate to compute gradients
4. update parameters with an optimizer
5. log metrics and repeat

If you cannot trace those five steps through code, you do not yet understand how the model changes over time.

## The companion-code version of that story

In the companion repo, the conceptual loop is implemented with real engineering layers around it:

- `TokenDataset.batch()` samples training windows
- `GPT.forward()` computes logits and loss
- the optimizer updates the model
- periodic evaluation estimates [[Glossary#Validation loss|validation loss]] and BPB
- checkpoints and reports are written to artifacts directories

This shows that industrial complexity is layered on top of a small conceptual core. In the serious `picollm` path, those same ideas reappear as:

- `tokenizing_distributed_data_loader_*` for distributed batch construction
- `GPT.forward()` in `picollm/accelerated/gpt.py`
- optimizer setup inside `model.setup_optimizer(...)`
- periodic `evaluate_bpb(...)` and task-oriented evals
- `save_checkpoint(...)` and resumable state in `checkpoint_manager.py`

## Why batching exists

We do not train on one sequence at a time because modern accelerators are built for parallel numerical work. Batching improves throughput and makes the gradient estimate less noisy.

But batching is not free. It changes:

- memory usage
- gradient statistics
- optimization stability
- throughput
- the effective number of tokens seen per update

Learn to distinguish:

- per-device [[Glossary#Batch size|batch size]]
- total batch size
- sequence length
- tokens per update

Those are different knobs.

## Gradient accumulation and effective batch size

One of the most important systems lessons in the repo is that the batch size you want may be larger than the batch size your device memory can hold. So training often uses [[Glossary#Gradient accumulation|gradient accumulation]], as shown in the sketch after this section:

- run several micro-batches
- accumulate their gradients by summing
- apply one optimizer step after enough accumulation

That means the effective global batch size depends on:

- device batch size
- number of processes
- accumulation steps

This is why it helps to stop thinking about "batch size" as one scalar with one meaning.
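Here is that accumulation pattern as a minimal sketch, under toy assumptions: the model, `get_batch`, and all the counts are placeholders rather than repo code.

```python
import torch
import torch.nn as nn

# Toy setup so the accumulation pattern runs standalone.
vocab_size = 256
model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

def get_batch():
    # Random micro-batch of 4 sequences; a placeholder for a real loader.
    x = torch.randint(0, vocab_size, (4, 32))
    y = torch.randint(0, vocab_size, (4, 32))
    return x, y

grad_accum_steps = 8  # 8 micro-batches of 4 -> effective batch of 32

optimizer.zero_grad(set_to_none=True)
for micro_step in range(grad_accum_steps):
    x, y = get_batch()
    logits = model(x)
    loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    # Divide so the accumulated gradient equals the mean over micro-batches,
    # matching what one big batch would have produced. Gradients sum across
    # the backward() calls because we do not zero them in between.
    (loss / grad_accum_steps).backward()

# One optimizer step after enough accumulation.
optimizer.step()
```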
## Steps versus epochs

This is one of the easiest places to get confused, so it is worth slowing down and naming the pieces clearly.

A **step** is one optimizer update. That means:

- a batch goes through the model
- the loss is computed
- gradients are backpropagated
- the optimizer updates the weights
- the global step counter increases by one

An **epoch** is one full pass through the dataset. That means:

- if the dataset has enough batches to cover every training example once
- and the model processes all of them
- then you have completed one epoch

So the simplest distinction is:

- **step** = update count
- **epoch** = dataset-coverage count

They are related, but they are not the same thing.

### Why logs show fractional epochs

Training logs often show values like:

- `epoch = 0.13`
- `epoch = 1.74`

Those are normal. They mean:

- `0.13` = about 13 percent of one full pass through the dataset
- `1.74` = one full pass plus about 74 percent of a second pass

So an epoch does not need to be a whole number while training is in progress. The logger is just reporting where the current run sits relative to one complete sweep of the dataset.

### Why `max_steps` and `num_train_epochs` feel different

There are two common ways to stop training:

- stop after a fixed number of optimizer updates
- stop after a fixed number of full passes through the dataset

In Hugging Face training code, those usually appear as:

- `max_steps`
- `num_train_epochs`

If you set:

- `max_steps = 12000`

you are saying:

- stop after 12000 optimizer updates, no matter how many full passes through the dataset that turns out to be

If you set:

- `num_train_epochs = 3`

you are saying:

- stop after the model has seen the dataset three full times

So:

- `max_steps` controls **training budget**
- `epochs` describe **dataset coverage**

### Why a large dataset can still show a small epoch value

If the dataset is large, one full epoch can require a very large number of steps. That is why a run can be thousands of steps in and still show:

- `epoch = 0.13`

That does not mean training is broken or that the model is not learning. It only means the dataset is large enough that one complete pass takes many optimizer updates.

### Where gradient accumulation fits into this

Gradient accumulation changes how often the optimizer steps relative to how many micro-batches you process. So it affects:

- how many tokens are seen before one optimizer update
- how quickly the step counter moves
- how much data is consumed per step

That is one more reason steps and epochs should not be treated as interchangeable.

### The practical way to read a training run

If you want a clean mental model while watching logs, read them like this:

- `loss` tells you how training is going
- `step` tells you how many optimizer updates have happened
- `epoch` tells you how much of the dataset has been covered

So if a run says:

- `step = 4500`
- `epoch = 0.13`

the right reading is:

- the optimizer has already updated the model many times
- but the dataset is large enough that the run has still completed only a small fraction of one full pass

### Why fixed-step runs are common in practice

Many practical runs use a fixed `max_steps` instead of a fixed epoch count because it makes training budgets easier to compare. That is especially useful when:

- comparing experiments across different datasets
- renting cloud GPUs by time
- running short demo jobs
- stopping early once quality is already good enough

So a run that stops before `epoch = 1.0` is not automatically incomplete. It is just a run with a chosen optimization budget.

## The optimizer is part of the learning algorithm

The optimizer is not a side choice. It changes what training looks like in practice. [^2] [^3]

Pay attention to:

- learning rate
- weight decay
- [[Glossary#Warmup|warmup]]
- gradient clipping
- precision mode
- checkpoint cadence

Architecture gives capacity. Optimization determines whether training can actually reach useful parameters.
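As a concrete anchor for those knobs, here is a hedged sketch that wires together AdamW with weight decay, a linear-warmup-plus-cosine learning-rate schedule, and gradient clipping. The schedule shape and every constant are illustrative choices, not the course run's settings.

```python
import math
import torch

# Illustrative optimizer-side setup: the model, loss, and all constants
# are placeholders chosen only to make the knobs visible.
model = torch.nn.Linear(64, 64)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

max_steps, warmup_steps = 1_000, 100

def lr_at(step: int, base_lr: float = 3e-4) -> float:
    if step < warmup_steps:
        # Linear warmup from near zero up to the base learning rate.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr toward zero after warmup.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in range(max_steps):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)  # schedule applied before each update

    loss = model(torch.randn(8, 64)).square().mean()  # stand-in loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    # Clip the global gradient norm before the update for stability.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```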
## Validation belongs inside the loop

Training loss is not enough. A real loop also needs regular validation. In the companion code, the training demos compute held-out metrics and write run summaries. That matters because a useful run should answer:

- did the loss improve on unseen text?
- what was the best checkpoint?
- how much throughput did the run achieve?
- which [[Glossary#Tokenizer|tokenizer]] and dataset were used?

Once you start seeing validation, checkpoints, and reports as part of the loop, the training run stops feeling like a notebook demo and starts feeling like an actual engineering system.

## Checkpointing and reporting are not optional extras

The main idea here is:

> A run is not only the final weights. A run is weights plus metadata plus artifacts plus a reproducible path back to how it happened.

That is why the loop writes:

- checkpoints
- best-checkpoint metadata
- validation metrics
- run reports
- dataset and tokenizer references

Those outputs are the memory of the training system.

## What "writing model shards" means

During training, you may see logs such as:

- `Writing model shards`

That does not mean the model is being split across multiple machines at that moment. In this context it usually means the checkpoint is being saved into multiple files instead of one giant file.

Why this happens:

- large checkpoints are easier to save and move in smaller pieces
- some tooling writes multiple `safetensors` files automatically once a model crosses a size threshold
- loading code can stitch those files back together as one model

Put more directly:

- the model in memory is still one checkpoint
- the saved artifact on disk may be split into several shard files
- sharding here is a storage format detail, not a new learning algorithm

## The base training flow in practice

It helps to see the training loop as a reproducible artifact pipeline, not only as optimization pseudocode. The practical base training flow is:

1. ensure dataset artifacts exist
2. ensure tokenizer artifacts exist
3. construct the model from a config
4. run training with validation and checkpointing
5. emit reports and machine-readable metadata

That is why the product scripts matter. They show that a real run is organized around dependencies and outputs, not around one notebook cell.

If you want to inspect the product-facing version, look at:

- [scripts/base_training_flow/](https://github.com/Montekkundan/llm/tree/main/scripts/base_training_flow)

The most useful questions to ask there are:

- where the preset is selected
- how model and training configs are created
- where checkpoints are written
- where validation and report generation happen

## The base evaluation flow in practice

Evaluation should not be treated as something that happens only after training is over. It is a first-class scriptable part of the system. The base evaluation flow exists to answer one question honestly:

> what did this checkpoint actually learn?

Its outputs should include:

- validation loss
- bits per byte
- sample generations
- task or [[Glossary#Benchmark|benchmark]] breakdowns when available
- structured reports

That is the practical counterpart to the conceptual point made earlier in this note: training quality only becomes believable when it is paired with metrics and artifacts.

If you want to inspect the product-facing evaluation path, look at:

- [scripts/base_evaluation_flow/](https://github.com/Montekkundan/llm/tree/main/scripts/base_evaluation_flow)

The right recurring question to ask is:

> if two checkpoints look different in chat, can we explain that difference with metrics and artifacts, or are we only relying on anecdotes?
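For intuition, here is a minimal sketch of writing one machine-readable run report. The schema, field names, file path, and every value below are invented for illustration; the repo's scripts define the real format.

```python
import json
import time
from pathlib import Path

# Illustrative run-report writer. Nothing here is the repo's actual
# schema; it only shows why a structured report makes a run auditable.
def write_run_report(run_dir: Path, metrics: dict) -> Path:
    report = {
        "finished_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "val_loss": metrics["val_loss"],
        "val_bpb": metrics["val_bpb"],
        "best_checkpoint": metrics["best_checkpoint"],
        "tokenizer": metrics["tokenizer"],
        "dataset": metrics["dataset"],
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    path = run_dir / "report.json"
    path.write_text(json.dumps(report, indent=2))
    return path

# Placeholder values, purely to show the shape of the artifact.
write_run_report(
    Path("artifacts/demo_run"),
    {
        "val_loss": 3.21,
        "val_bpb": 0.98,
        "best_checkpoint": "artifacts/demo_run/ckpt_04500",
        "tokenizer": "demo-bpe-32k",
        "dataset": "demo-corpus-v1",
    },
)
```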
## Distributed training changes the surface, not the objective

The learning rule is still next-token prediction. Distributed training only changes how we execute it at scale. Understand the consequences:

- ranks must be coordinated
- logging usually happens on rank 0
- total batch size changes across workers
- checkpointing and evaluation must be synchronized carefully

That is why distributed training belongs in the systems track even if not everyone runs multi-GPU jobs.

### Single GPU versus multi-GPU

On one GPU, one process owns the whole model and one shard of the current batch. That is the easiest setup to reason about and usually the best place to start.

On multiple GPUs, the most common first step is still data-parallel training:

- each GPU gets its own copy of the model
- each GPU processes a different mini-batch shard
- gradients are synchronized before the optimizer step

So the model is not learning a different objective. It is learning the same objective with more hardware working in parallel.

The most important quantity to keep track of is the effective global batch:

- $\text{per\_device\_batch\_size} \times \text{num\_gpus} \times \text{grad\_accum}$

That is why a multi-GPU run can behave differently even when the code looks almost the same:

- throughput usually improves
- optimizer steps may become less noisy because the effective batch is larger
- memory pressure per GPU can stay the same if you keep the per-device batch fixed
- learning-rate and warmup choices may need to be revisited once the effective batch changes

In the cloud pretraining code for this course, the practical difference looks like this:

Single GPU:

```bash
uv run python -m picollm.accelerated.pretrain.train ...
```

Multi-GPU on one machine:

```bash
uv run torchrun --nproc_per_node=2 -m picollm.accelerated.pretrain.train ...
```

That change is mostly about launch strategy. The training objective, model code, dataset, and checkpoint story remain the same.
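The arithmetic behind that formula is worth computing once by hand. The numbers below are illustrative, not the course configuration:

```python
# Effective-batch arithmetic from the formula above, with made-up
# numbers chosen only for the example.
per_device_batch_size = 16   # sequences per GPU per micro-batch
num_gpus = 2                 # e.g. torchrun --nproc_per_node=2
grad_accum = 4               # micro-batches per optimizer step
seq_len = 1024               # tokens per sequence

effective_batch = per_device_batch_size * num_gpus * grad_accum
tokens_per_update = effective_batch * seq_len

print(effective_batch)    # 128 sequences per optimizer step
print(tokens_per_update)  # 131072 tokens per optimizer step
```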
## Common confusions

### "Is a training loop just a `for` loop?"

At the smallest scale, yes. At the systems scale, no. It is a control structure that must coordinate artifacts, metrics, failure recovery, and hardware efficiency.

### "Why is validation not just done at the end?"

Because you need signals during training to detect divergence, overfitting, or the best checkpoint.

### "If the architecture is correct, should training just work?"

No. Bad optimization or bad data handling can ruin a correct architecture.

## A useful study pattern

Study the loop twice:

1. first as 15 to 20 lines of pseudocode so the control flow stays visible
2. then as the companion-code walkthrough in [notebooks/training_loop/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/training_loop/lecture_walkthrough.ipynb)

That makes the continuity between theory and engineering much easier to see.

## Key takeaway

The training loop is the heartbeat of the project. If you can trace data, loss, gradients, updates, validation, checkpoints, and reports through one run, you have crossed from passive understanding into engineering understanding.

> [!example] Notebook walkthroughs in this lecture
>
> Use this order:
>
> 1. [notebooks/training_loop/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/training_loop/lecture_walkthrough.ipynb)
>    Use these sections as you read:
>    - [`Forward backward step repeat`](https://github.com/Montekkundan/llm/blob/main/notebooks/training_loop/lecture_walkthrough.ipynb#forward-backward-step-repeat)
>    - [`Validation belongs inside the run`](https://github.com/Montekkundan/llm/blob/main/notebooks/training_loop/lecture_walkthrough.ipynb#validation-belongs-inside-the-run)
>    - [`Batch size and accumulation change effective tokens per update`](https://github.com/Montekkundan/llm/blob/main/notebooks/training_loop/lecture_walkthrough.ipynb#batch-size-and-accumulation-change-effective-tokens-per-update)
>    - [`Checkpoints and reports are part of the loop`](https://github.com/Montekkundan/llm/blob/main/notebooks/training_loop/lecture_walkthrough.ipynb#checkpoints-and-reports-are-part-of-the-loop)

<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
    <div><a class="internal-link" data-href="Causal Language Modeling" href="Causal%20Language%20Modeling">Causal Language Modeling</a></div>
  </div>
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
    <div><a class="internal-link" data-href="Training Configuration and Hyperparameters" href="Training%20Configuration%20and%20Hyperparameters">Training Configuration and Hyperparameters</a></div>
  </div>
</div>

## Further reading

- PyTorch, "Autograd," 2025. https://docs.pytorch.org/docs/stable/autograd.html
- PyTorch, "torchrun (Elastic Launch)," 2025. https://docs.pytorch.org/docs/stable/elastic/run.html

---

## References

[^1]: Diederik P. Kingma and Jimmy Ba, "Adam: A Method for Stochastic Optimization," 2015. https://arxiv.org/abs/1412.6980
[^2]: Ilya Loshchilov and Frank Hutter, "Decoupled Weight Decay Regularization," 2019. https://arxiv.org/abs/1711.05101
[^3]: Paulius Micikevicius et al., "Mixed Precision Training," 2018. https://arxiv.org/abs/1710.03740