> [!info] Course code
> Use these repo paths together with this note:
> - [picollm/accelerated/speedrun.sh](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/speedrun.sh)
> - [picollm/accelerated/pretrain/train.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/pretrain/train.py)
> - [picollm/accelerated/flash_attention.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/flash_attention.py)
> - [picollm/accelerated/fp8.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/fp8.py)

## What This Concept Is

Once you move past the sentence "use more GPUs," distributed training stops being a slogan and becomes a design problem. Now you have to decide where parameters live, where activations live, when gradients synchronize, and how much of the machine is actually computing instead of waiting. This note is about that deeper design space.

## Foundation Terms You Need First

Picture a larger cluster for a second. The model is too large for one device, the batches are bigger, and the expensive part is no longer just the math. Keep four questions in your head as you read:

- Where do the parameters live?
- Where do the activations live?
- When do workers synchronize?
- How much time is spent computing versus communicating?

Those questions define the parallelism strategy more clearly than any single buzzword does. That is why this note is less about memorizing parallelism names and more about learning the trade-offs behind them.

## FlashAttention 3, FP8, and why H100-class nodes matter

This is where the note becomes concrete for the current `picollm` stack. The accelerated path is trying to behave more like a modern high-[[Glossary#Throughput|throughput]] training system, not like a generic PyTorch script. That is why you will see terms such as:

- FlashAttention 3
- FP8
- Hopper
- tensorwise scaling
- [[Glossary#BF16|BF16]] Tensor Cores

Those are systems choices, not new learning objectives.

- [picollm/accelerated/flash_attention.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/flash_attention.py) handles the fast fused-attention path when the GPU stack supports it.
- [picollm/accelerated/fp8.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/fp8.py) handles the optional FP8 training conversion for selected linear layers.

You should understand the motivation clearly:

- FlashAttention reduces memory traffic and makes exact attention much more IO-efficient.
- FP8 reduces bandwidth and storage for selected matrix operations, but only on hardware and software stacks designed for it.
- H100-class machines matter because they are the first place where these more aggressive optimizations become realistic for a serious course-scale run.

## Data parallelism, tensor parallelism, and pipeline parallelism

Data parallelism replicates the model and splits the data across workers. [[Glossary#Tensor parallelism|Tensor parallelism]] splits large tensor operations across devices so that one layer can be computed jointly by several GPUs. [[Glossary#Pipeline parallelism|Pipeline parallelism]] splits the model by layers or stages and passes micro-batches through that pipeline.

These are different answers to different bottlenecks. Data parallelism helps when the model fits per device and you mainly want more throughput.
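To make the data-parallel case concrete, here is a minimal DDP-style sketch, assuming a `torchrun` launch. It is an illustrative placeholder, not the `picollm` training loop: the model, data, and hyperparameters are stand-ins.

```python
# Minimal data-parallel sketch with torch.distributed + DDP.
# Launch with: torchrun --nproc_per_node=8 this_script.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank holds a full replica of the model (data parallelism).
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

    for step in range(10):
        # Each rank sees a different shard of the global batch.
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(x).pow(2).mean()
        loss.backward()  # gradients are all-reduced across ranks during backward
        opt.step()
        opt.zero_grad(set_to_none=True)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The important detail is that the gradient all-reduce happens inside `backward()`; that communication cost is exactly what the rest of this note keeps returning to.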
Tensor and pipeline parallelism become more important when the model or activation footprint no longer fits comfortably on one device.[^1] [^2]

The conceptual rule is:

- if memory is the bottleneck, shard model state or activations
- if throughput is the bottleneck, increase parallel work while limiting communication
- if both are bottlenecks, combine strategies and accept new scheduling complexity

## ZeRO and optimizer sharding

[[Glossary#ZeRO|ZeRO]] was important because it showed that optimizer states, gradients, and parameters themselves could be partitioned across workers instead of being fully replicated everywhere. This sharply reduced memory pressure and made larger models trainable on the same hardware budget. You should understand ZeRO not as magic, but as a memory-accounting idea: do not replicate expensive state everywhere unless you truly need to.[^3]

At a practical level, this is one of the most important graduate-level insights in training systems: optimizer state often dominates memory long before the researcher emotionally notices that it should.

## Activation checkpointing

Activation checkpointing is another memory trade-off. Instead of storing every intermediate activation for the backward pass, the system stores fewer of them and recomputes some values during backpropagation. This reduces memory at the cost of extra compute. It is one of the cleanest examples of a systems trade-off in modern deep learning: save memory now, pay recomputation later.[^4]

You should learn to reason about it quantitatively:

- activation memory scales with batch, sequence length, hidden size, and depth
- checkpointing lowers the stored-activation term
- recomputation raises total arithmetic cost

That trade-off is often favorable because memory ceilings are hard constraints while extra compute is sometimes tolerable.

## Communication bottlenecks

Distributed training performance is shaped as much by communication as by arithmetic. A cluster may have excellent GPUs, but if collectives are slow or the topology is unfavorable, workers spend time waiting rather than computing. This is why NCCL, interconnect quality, and cluster topology matter. FLOPS alone do not tell the whole story.[^5]

You should internalize that all-reduce time is not bookkeeping. It is part of the runtime equation. A cluster with slower interconnects can erase the expected gain from adding more GPUs.

## Overlap, topology, and scheduling

Good distributed systems try to overlap communication with useful computation. Whether that works depends on graph structure, kernel timing, and framework implementation details. This is also where topology matters: NVLink-connected devices, PCIe-only boxes, and cross-node Ethernet or InfiniBand setups do not behave the same way even if the GPU SKU looks identical on a listing.

This is why serious performance debugging asks:

- was communication overlapped or serialized?
- were micro-batches large enough to keep the devices busy?
- did the input pipeline stall the workers?
- did [[Glossary#Checkpoint|checkpoint]] cadence interrupt useful work too often?

## MFU in real clusters

[[Glossary#MFU|MFU]] is useful precisely because it compresses many of these realities into one practical question: how much of the theoretical compute are we actually converting into useful model work? A low MFU can be a sign of memory bottlenecks, communication inefficiency, weak input pipelines, or poor kernel efficiency.

A good researcher does not just report runtime. They ask why the runtime is what it is. MFU is not a moral score. It is a diagnostic tool.
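To make the bookkeeping concrete, here is a back-of-the-envelope MFU estimate. It uses the common ~6 × parameter-count approximation for FLOPs per token in a dense transformer (forward plus backward); the parameter count, throughput, and per-GPU peak-FLOPS figures below are placeholder assumptions, not measurements from the `picollm` runs.

```python
# Rough MFU estimate: achieved FLOP/s divided by theoretical peak FLOP/s.
# Assumes ~6 * params FLOPs per token for a dense transformer fwd+bwd pass.

def estimate_mfu(params: float, tokens_per_second: float, peak_flops: float) -> float:
    flops_per_token = 6.0 * params                 # forward + backward, dense model
    achieved_flops = flops_per_token * tokens_per_second
    return achieved_flops / peak_flops

# Example: a 1.5e9-parameter model at 400k tokens/s on 8 GPUs, assuming
# ~990e12 dense BF16 FLOP/s peak per GPU (check the datasheet for your part).
mfu = estimate_mfu(
    params=1.5e9,
    tokens_per_second=400_000,
    peak_flops=8 * 990e12,
)
print(f"MFU ≈ {mfu:.2%}")
```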
## Fault tolerance and checkpoint strategy

Large runs fail in ways that toy experiments do not. Nodes disappear. Leases end. Processes crash. Filesystems stall. This is why fault tolerance is part of training systems, not project management.

Checkpoint strategy matters because:

- frequent checkpoints improve recoverability
- frequent checkpoints also cost IO time and storage
- too few checkpoints can turn a crash into a full-[[Glossary#Loss|loss]] event

That is exactly why the `picollm` cloud path emphasizes resumable checkpoints, checkpoint pruning, and explicit recovery logic rather than assuming the machine will stay alive forever.

## Why identical hardware listings can behave differently

Two clusters can both advertise high-end accelerators and still train at noticeably different speeds. The reasons include interconnect differences, storage path quality, CPU overhead, software stack maturity, launch configuration, sequence length, batch structure, and checkpoint cadence. This is why serious teams calibrate with real runs rather than relying only on spec-sheet optimism.

## What a research-grade systems writeup should include

If you report distributed runtime seriously, you should record:

- GPU model and count
- node count
- interconnect assumptions if known
- effective [[Glossary#Batch size|batch size]]
- sequence length
- precision mode
- checkpoint cadence
- tokens per second or steps per second
- any observed stalls or failure events

That is the difference between "we used 8 GPUs" and an actually interpretable systems claim.
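A minimal sketch of what such a record might look like is below; the field names and values are illustrative placeholders, not the `picollm` logging format.

```python
# Illustrative run record for a distributed training report.
# All field names and values here are placeholders.
import json
import time

run_record = {
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "gpu_model": "H100-SXM",                 # as reported by nvidia-smi
    "gpus_per_node": 8,
    "node_count": 1,
    "interconnect": "NVLink (intra-node)",   # or "unknown" if unverified
    "effective_batch_size": 524_288,         # tokens per optimizer step
    "sequence_length": 2048,
    "precision": "bf16 + fp8 linears",
    "checkpoint_every_steps": 1000,
    "tokens_per_second": 400_000,
    "observed_stalls": ["dataloader stall at step 3200"],
}

with open("run_record.json", "w") as f:
    json.dump(run_record, f, indent=2)
```

Dumping a record like this next to the checkpoints costs almost nothing and makes later runtime comparisons interpretable.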
## Scope of this course and where to go deeper

This note is designed to give a conceptual map of serious distributed training without turning the whole course into a cluster-systems class. The capstone uses multi-GPU training directly, so you should understand the main parallelism families, checkpointing logic, and memory strategies well enough to read papers and interpret runtime differences. A deeper follow-on module could go much further into ZeRO stages, optimizer-state partitioning, fault tolerance, elastic training, topology-aware placement, kernel fusion, and cluster scheduling.

If you want to go deeper after this note, the best next steps are:

- compare [[Glossary#DDP|DDP]]-style runs against theoretical speedup expectations
- read the Megatron-LM, GPipe, and ZeRO papers closely
- inspect MFU and throughput differences across real hardware listings
- treat distributed runtime as a systems-measurement problem, not just a GPU-count problem

<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
    <div><a class="internal-link" data-href="Optimizer Theory for Transformer Training" href="Optimizer%20Theory%20for%20Transformer%20Training">Optimizer Theory for Transformer Training</a></div>
  </div>
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
    <div><a class="internal-link" data-href="Advanced Inference Systems" href="Advanced%20Inference%20Systems">Advanced Inference Systems</a></div>
  </div>
</div>

## References

[^1]: Mohammad Shoeybi et al., NVIDIA, [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
[^2]: Yanping Huang et al., Google, [GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism](https://arxiv.org/abs/1811.06965)
[^3]: Samyam Rajbhandari et al., Microsoft, [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)
[^4]: Tianqi Chen et al., [Training Deep Nets with Sublinear Memory Cost](https://arxiv.org/abs/1604.06174)
[^5]: NVIDIA, [NCCL](https://developer.nvidia.com/nccl)