> [!info] Course code
> Use the companion repository for the capstone code paths discussed in this note:
> - [picollm/accelerated/README.md](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/README.md)
> - [picollm/accelerated/pretrain/train.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/pretrain/train.py)
> - [picollm/accelerated/common.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/common.py)
> - [picollm/accelerated/speedrun.sh](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/speedrun.sh)

## What This Concept Is

Open a training command and it can look almost harmless: a few flags, a few paths, maybe one `torchrun`. Then you look at the GPU bill, the wall-clock time, and the hardware limits, and suddenly the real story is somewhere else. This note is about building the mental model that explains that gap. Once that model is clear, cost stops feeling mysterious. You start seeing training as a budget of tokens, compute, memory movement, and wall-clock time.

## Foundation Terms You Need First

Keep four quantities separate right away. A **token budget** is how many training tokens the run will process. **[[Glossary#FLOP / FLOPS|FLOP / FLOPS]]** tell you how much numerical work exists and how quickly hardware can perform it. **[[Glossary#Throughput|Throughput]]** tells you how much useful work the system is actually finishing. **Wall-clock time** is how long you wait in real life. If those units blur together, the cost story becomes confusing very quickly. If they stay separate, you can reason about why a run is slow, expensive, or unexpectedly efficient.

## The first mental shift: the model is not trained on files, it is trained on tokens

One of the biggest conceptual mistakes beginners make is to estimate training cost from the size of the dataset in gigabytes. But the model never "sees gigabytes." It sees token IDs. A dataset that looks small on disk can still expand into a huge number of training tokens, and a tokenizer can make the same human-readable sentence cheap or expensive depending on how efficiently it segments the text. This is one of the reasons [[LLM/concepts/Tokenization]] matters operationally and not only linguistically.[^2]

In a decoder-only language model, training is fundamentally next-token prediction. Each step pushes a sequence of token IDs through a Transformer, computes [[Glossary#Logits|logits]] over the [[Glossary#Vocabulary|vocabulary]], compares those logits to the ground-truth next tokens, backpropagates the [[Glossary#Loss|loss]], and updates parameters. That entire loop must be repeated across an enormous token budget. A modern LLM run is expensive not because any single step is magical, but because the same very large network is applied over and over again to billions of tokens.[^3]
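If you want to make that point concrete, count tokens directly instead of reading file sizes. The minimal sketch below uses the GPT-2 encoding from `tiktoken` purely as a stand-in tokenizer; the capstone's own tokenizer segments text differently, so treat the exact counts as illustrative.

```python
# Minimal sketch: the training budget is measured in tokens, not gigabytes.
# tiktoken's GPT-2 encoding is only a stand-in here; a different tokenizer
# will expand the same text into a different number of token IDs.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

def count_tokens(path: str) -> int:
    """Tokenize one text file and return how many training tokens it expands into."""
    with open(path, "r", encoding="utf-8") as f:
        return len(enc.encode(f.read()))

# The same human-readable text can be cheap or expensive depending on segmentation.
for text in ["the cat sat on the mat",
             "ThisCamelCaseIdentifierLooksLikeOneWordButTokenizesIntoMany"]:
    print(f"{len(enc.encode(text)):3d} tokens <- {text!r}")
```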
## The units that matter

To reason clearly about cost, time, and hardware, you need a few units to stay fixed in your head.

### FLOP and FLOPS

A FLOP is one floating-point operation. If you multiply two floating-point numbers, that is a [[Glossary#FLOP / FLOPS|FLOP]]. If you add them, that is another FLOP. In practice, neural network layers execute vast numbers of these operations inside fused GPU kernels, so no engineer counts them one by one. But the concept still matters because it tells us what "work" means in training. Training is not expensive in an abstract way. It is expensive because it requires an enormous number of floating-point operations.[^4]

FLOPS means floating-point operations per second. That is a rate, not an amount. If FLOP is "how much work," then FLOPS is "how fast the machine can do that work." GPU marketing pages love FLOPS because they advertise peak speed, but peak speed is never the same thing as end-to-end training speed. The GPU may be capable of an enormous number of operations per second in ideal conditions, while the real training job only reaches a fraction of that due to memory traffic, communication overhead, kernel inefficiency, padding, checkpoint writes, and data-loading delays.[^5]

### Parameters

Parameters are the learned weights of the model. A larger model usually has more expressive capacity, but it also usually requires more compute, more memory, and more training data to be used efficiently. This is why model size cannot be chosen in isolation. A billion-parameter model sounds exciting, but if the training budget is too short, the result may simply be an undertrained large model rather than a good one.[^6]

### Tokens and sequence length

The token is the economic unit of LLM training. Training compute scales much more directly with parameter count and total tokens seen than with the raw size of the dataset on disk. Sequence length matters because [[Glossary#Self-attention|self-attention]] becomes more expensive as the sequence grows; longer contexts are not free, even when the code for `block_size=1024` looks like one harmless argument.[^3] [^7]

### Batch size, gradient accumulation, and GPU count

It is easy to treat `batch_size`, `grad_accum`, and `num_gpus` as separate knobs, but operationally they combine into one quantity: how many tokens are processed per optimizer step. In other words, these values determine how much training work gets done before the optimizer updates the model weights once. For a decoder-only run, a practical approximation is:

$$
\text{tokens per step} = \text{per-device batch size} \times \text{grad accum} \times \text{num gpus} \times \text{sequence length}
$$

That one line explains a lot of runtime behavior. If you double the sequence length while keeping everything else constant, you are processing about twice as many tokens per step, but you may also increase attention cost and memory pressure. If you double the number of GPUs, you can potentially process more tokens per step in the same wall-clock time, but only if the system scales well enough.[^8]

### MFU: the gap between marketing and reality

MFU, or model FLOPs utilization, is one of the most useful "grown-up" concepts in LLM systems work. It measures how much of the hardware's theoretical peak compute you are actually converting into useful training work. If a GPU advertises enormous [[Glossary#BF16|BF16]] Tensor Core [[Glossary#Throughput|throughput]], you should not expect your full training script to sustain that exact number. Real runs lose time to memory movement, synchronization, input pipeline delays, non-matmul kernels, imperfect operator fusion, and communication overhead. [[Glossary#MFU|MFU]] is therefore the bridge between hardware theory and wall-clock reality.[^9]
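Here is a minimal sketch that connects the tokens-per-step formula to MFU, assuming the usual rough approximation of about $6N$ training FLOPs per token. The batch configuration and the per-GPU peak figure (`1.0e15` FLOP/s) are placeholders chosen for illustration, not values read from the capstone scripts; substitute your own log numbers and your GPU's published spec.

```python
# Minimal MFU sketch under the ~6*N FLOPs-per-token approximation.
# All concrete numbers below are illustrative placeholders, not measurements.

def tokens_per_step(per_device_batch: int, grad_accum: int, num_gpus: int, seq_len: int) -> int:
    # The quantity from the formula above: tokens consumed per optimizer update.
    return per_device_batch * grad_accum * num_gpus * seq_len

def estimate_mfu(n_params: float, toks_per_step: int, sec_per_step: float,
                 num_gpus: int, peak_flops_per_gpu: float) -> float:
    achieved = 6.0 * n_params * toks_per_step / sec_per_step   # FLOP/s actually sustained
    peak = num_gpus * peak_flops_per_gpu                       # FLOP/s the spec sheet advertises
    return achieved / peak

# Hypothetical batch configuration; it happens to reproduce the ~1.05M tokens/step
# implied by this note's 5.24B-token, 5000-step example.
tps = tokens_per_step(per_device_batch=16, grad_accum=8, num_gpus=8, seq_len=1024)
print(f"tokens per optimizer step: {tps:,}")
print(f"rough MFU: {estimate_mfu(3.36e8, tps, sec_per_step=2.84, num_gpus=8, peak_flops_per_gpu=1.0e15):.1%}")
```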
## Device names and precision words you should stop fearing

This note is also the right place to normalize some basic vocabulary that can sound much scarier than it really is.

- `cpu` means the ordinary processor path. It is good for correctness and tiny demos.
- `mps` ([[Glossary#MPS|MPS]]) means the Apple Metal backend used on Apple Silicon Macs.
- `cuda` means the NVIDIA GPU backend. This is the serious training path for the accelerated `picollm` run.

In `picollm/accelerated/common.py`, that detection happens explicitly in `autodetect_device_type()` and `_detect_compute_dtype()`. So when the logs say `Autodetected device type: cuda` or `COMPUTE_DTYPE: torch.bfloat16`, the code is not doing anything mystical. It is just checking what hardware is available and choosing an execution format that matches it.

The precision words matter for cost and speed:

- `fp32` is the old safe default
- `bf16` is the common fast training default on Ampere and Hopper GPUs
- `fp16` is still common, but often needs gradient scaling
- `fp8` is the aggressive Hopper-era optimization path used only in selected accelerated runs

That is why the serious run uses terms like `bf16`, `FlashAttention 3`, and `fp8`. They are not separate model ideas. They are throughput and memory ideas.

## Two ways to estimate training time

There are two practical styles of estimation, and both are worth learning.

### 1. The live-log estimate

This is the most useful estimate while a job is already running. Suppose a run shows:

- `21 / 5000` steps completed
- `2.84s/it`

Then the rough wall-clock for pretraining is simply:

$$
5000 \times 2.84\,\text{s} \approx 14{,}200\,\text{s} \approx 3.95\,\text{hours}
$$

That estimate is powerful because it uses the real system you are actually paying for, not the ideal system you wish you had. It automatically includes hidden costs like data stalls, checkpoint overhead, distributed launch friction, logging, and hardware-specific scaling inefficiencies. In the real world, this estimate is often more trustworthy than a purely theoretical FLOP calculation once the job has started.

Then you add the downstream stages. If [[Glossary#SFT|SFT]] is likely to take another 30 to 60 minutes, and final checkpoint packaging plus upload takes a few minutes more, the full one-command pipeline becomes roughly $4.5$ to $5$ hours. This is exactly the style of reasoning you should learn to do from logs rather than from hope.

### 2. The FLOP estimate

The more theoretical estimate begins with how much computation the run is expected to perform. For dense autoregressive transformer pretraining, a common back-of-the-envelope rule is:

$$
\text{training FLOPs} \approx 6 \times N \times D
$$

where $N$ is the number of parameters and $D$ is the number of training tokens.[^6] This formula is not a full simulator of every kernel, but it is good enough to teach the core idea that model size and token budget jointly determine training work.

If the model has about $336$M parameters and you plan to process roughly $5.24$B tokens, then:

$$
\text{training FLOPs} \approx 6 \times 3.36 \times 10^8 \times 5.24 \times 10^9 \approx 1.06 \times 10^{19}\ \text{FLOPs}
$$

Now estimate the hardware throughput you can actually sustain:

$$
\text{effective FLOPS} \approx \text{num gpus} \times \text{peak gpu FLOPS} \times \text{MFU}
$$

For an $8 \times \text{H100}$ run, NVIDIA publishes the relevant Tensor Core throughput figures on the H100 product page.[^5] But even with a very fast node, what matters is the sustained throughput, not the marketing maximum. If MFU is mediocre, your runtime grows immediately. This is why two teams can rent the "same" GPU class and still report meaningfully different runtimes. Their kernels, sequence lengths, dataloaders, communication patterns, and software stack quality may differ enough to change wall-clock training by hours.
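As a worked version of that estimate, the sketch below plugs in the running example's numbers. The per-GPU peak and both MFU values are assumptions, not measurements; the point is how strongly the assumed utilization moves the forecast.

```python
# Minimal sketch of the 6*N*D forecast. peak_flops_per_gpu and mfu are placeholders
# you replace with your GPU's published figure and your own sustained utilization.

def forecast_hours(n_params: float, n_tokens: float,
                   num_gpus: int, peak_flops_per_gpu: float, mfu: float) -> float:
    total_flops = 6.0 * n_params * n_tokens                  # training work: ~6 * N * D
    effective_flops = num_gpus * peak_flops_per_gpu * mfu    # FLOP/s you expect to sustain
    return total_flops / effective_flops / 3600.0

# ~336M parameters, ~5.24B tokens, 8 GPUs.
for mfu in (0.40, 0.10):   # optimistic vs. mediocre utilization, both assumed
    print(f"MFU {mfu:.0%}: ~{forecast_hours(3.36e8, 5.24e9, 8, 1.0e15, mfu):.1f} hours")
```

At the mediocre-utilization end the forecast lands near the live-log figure from the previous subsection, which is exactly the kind of cross-check worth doing before renting the node.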
## Why training still takes so long on excellent GPUs

The short answer is that a modern GPU is fast, but the workload is absurdly large.

Every training step includes a forward pass, loss computation, backward pass, and optimizer update. Unlike plain inference, training must retain activations for backward, compute gradients for all trainable parameters, and often maintain optimizer state such as first and second moments. In practice, this means training is not merely "inference plus a little extra." It is substantially more expensive both in compute and in memory footprint.[^10]

Then there is attention. The original Transformer made self-attention practical and powerful, but it also made sequence length a first-class systems constraint. Longer sequences are useful because they let the model condition on more context, but they also raise the cost of attention. This is one reason long-context training and inference remain expensive even on cutting-edge hardware.[^3] [^7]

Another source of delay is memory bandwidth. It is tempting to imagine GPUs as pure arithmetic engines, but large-model training is often bottlenecked by moving tensors around: loading parameters, reading activations, writing gradients, fetching optimizer states, and managing the KV-style structures used later at inference time. This is why GPU memory bandwidth is so prominent in accelerator specs. It is not a side detail. It is often one of the main throughput constraints.[^5]

## Why multi-GPU training is faster, and why it is not perfectly linear

Adding GPUs helps because it increases the amount of work the system can perform in parallel. If one GPU can process a certain number of tokens per step, then multiple GPUs can potentially process more tokens in the same wall-clock interval. This is the basic reason distributed training exists.

But the speedup is not perfectly linear. When you move from one GPU to eight GPUs, you also create a communication problem. Gradients have to be synchronized. Parameters or optimizer states may need to be partitioned or broadcast. Some GPUs may wait for slower peers. This overhead becomes more visible as models, batch structure, or node topology become less favorable. So the ideal story is "eight GPUs equals eight times faster," but the real story is "eight GPUs can be much faster, as long as the software stack and communication path are good enough."[^11]

This is exactly why high-end training nodes are not priced only by raw FLOPS. Interconnect quality, memory capacity, and bandwidth all matter. A node with eight GPUs that communicate well with each other is worth more than eight disconnected accelerators that spend too much time waiting.

## Why training becomes expensive in dollars

Once you understand the wall-clock side, the dollar side becomes straightforward. GPU providers charge by time, not by elegance of code. If a run takes five hours on an `$18/hour` node, the rough compute bill is:

$$
5 \times \$18 = \$90
$$

That is before considering bandwidth charges, storage, failed restarts, or multiple experiments. Training cost is therefore not just "one run." It is often "the final run plus all the calibration runs and mistakes that made the final run possible."

This is why even small design choices affect cost. If a run is undertrained and you have to rerun it, cost doubles. If your dataloader is unstable and causes repeated stalls, you are paying for idle hardware. If your checkpoint cadence is too aggressive, you may spend more time than necessary writing to disk. Good systems engineering is therefore not separate from cost control. It is cost control.
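To make the "final run plus all the calibration runs" framing concrete, here is a tiny sketch. The hourly rate comes from the example above; the number of throwaway runs is an assumption you have to supply for your own project.

```python
# Minimal sketch of the dollar side: every calibration or failed run of similar
# length pays the full hourly price again.

def training_bill(hours_per_run: float, usd_per_hour: float, extra_runs: int = 0) -> float:
    """Cost of the final run plus any extra runs of comparable length."""
    return (1 + extra_runs) * hours_per_run * usd_per_hour

print(f"single clean run:         ${training_bill(5.0, 18.0):.0f}")
print(f"plus two throwaway runs:  ${training_bill(5.0, 18.0, extra_runs=2):.0f}")
```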
## Why inference is expensive in a different way

You may hear "training is expensive but inference is cheap." That statement is only partially true. Inference is cheaper than training per token, because you do not perform backpropagation or optimizer updates. But inference becomes expensive when it must serve real users continuously.

The first reason is autoregressive decoding. A decoder-only language model generates one token at a time. That means the system cannot simply output the whole answer in one parallel pass. It repeatedly runs the model to extend the sequence token by token. This makes [[Glossary#Latency|latency]] and throughput deeply tied to hardware performance and memory movement.[^12]

The second reason is the [[Glossary#KV cache|KV cache]]. During generation, the model stores key and value tensors from previous tokens so it does not have to recompute the entire history from scratch. This is necessary for efficient serving, but it also consumes memory, and the cache grows with context length and concurrency. Serving many users at once with long contexts can therefore turn inference into a memory-capacity problem as much as a compute problem.[^13]

The third reason is availability. Training jobs can stop when they finish. Production inference systems must often stay up all day. Even if no single user request is extremely expensive, a system that is always available on expensive hardware accumulates cost over time. This is why a smaller well-trained model can be the better product choice if it is much cheaper to serve while still delivering acceptable quality.[^8]

## Why this matters for the capstone

For the `picollm` capstone, this note is not an abstract systems detour. It explains why the accelerated path is designed the way it is: explicit tokenizer stage, large rented node, fused attention when possible, and a compressed wall-clock budget. If a run is too small, the result is weak or unstable. If it is too large, the runtime and bill become unreasonable for a course project. Good project design therefore sits between those extremes: large enough to produce a functional conversational model, but small enough to complete on a serious rented node without turning the course into a cluster-operations class.

That is also why the live-log estimate matters. You should know how to launch a run, look at `steps / total`, `seconds per iteration`, and current loss, and form a defensible estimate of completion time and cost. That skill is part of real ML engineering.
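As a closing sketch of that habit, here is the same arithmetic in code. The 45-minute downstream allowance and the $18/hour rate are assumptions carried over from earlier examples; replace them with whatever your own run and provider actually show.

```python
# Minimal sketch: turn the numbers a progress bar already shows into a finish-time
# and cost estimate. Downstream allowance and hourly rate are assumptions.

def eta_hours(current_step: int, total_steps: int, sec_per_it: float,
              downstream_minutes: float = 0.0) -> float:
    """Remaining pretraining time plus a fixed allowance for SFT and packaging."""
    remaining_s = (total_steps - current_step) * sec_per_it
    return remaining_s / 3600.0 + downstream_minutes / 60.0

# Numbers from this note's log example: 21/5000 steps at 2.84 s/it, plus ~45 min downstream.
hours_left = eta_hours(21, 5000, 2.84, downstream_minutes=45)
print(f"~{hours_left:.1f} hours remaining, ~${hours_left * 18:.0f} at $18/hour")
```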
<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
    <div><a class="internal-link" data-href="Evaluation and Model Quality" href="Evaluation%20and%20Model%20Quality">Evaluation and Model Quality</a></div>
  </div>
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
    <div><a class="internal-link" data-href="Distributed Training and Multi-GPU" href="Distributed%20Training%20and%20Multi-GPU">Distributed Training and Multi-GPU</a></div>
  </div>
</div>

## References

[^1]: Montekkundan, [llm repository](https://github.com/Montekkundan/llm)
[^2]: OpenAI, [What are tokens and how to count them?](https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-do-i-count-them)
[^3]: Ashish Vaswani et al., Google Research, [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
[^4]: NVIDIA, [Tensor Core GPU fundamentals and performance discussions](https://developer.nvidia.com/blog)
[^5]: NVIDIA, [H100 Tensor Core GPU](https://www.nvidia.com/en-us/data-center/h100/)
[^6]: Jordan Hoffmann et al., DeepMind, [Training Compute-Optimal Large Language Models](https://arxiv.org/abs/2203.15556)
[^7]: Tri Dao et al., Princeton University, [FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness](https://arxiv.org/abs/2205.14135)
[^8]: Hugging Face TB, [Smol Training Playbook](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook)
[^9]: Andrej Karpathy, [nanochat](https://github.com/karpathy/nanochat)
[^10]: Sebastian Raschka, [LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch)
[^11]: NVIDIA, [NCCL and multi-GPU communication resources](https://developer.nvidia.com/nccl)
[^12]: Hugging Face, [Text generation and decoder-only inference concepts](https://huggingface.co/docs/transformers/main_classes/text_generation)
[^13]: NVIDIA, [Mastering LLM Techniques: Inference Optimization](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/)