> [!info] Course code
> Use these repo paths together with this note:
> - [picollm/accelerated/speedrun.sh](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/speedrun.sh)
> - [picollm/accelerated/pretrain/train.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/pretrain/train.py)
> - [picollm/accelerated/common.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/common.py)
> - [picollm/accelerated/chat/eval.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/eval.py)
## What This Concept Is
Imagine the model no longer fits comfortably on one GPU, or one GPU is simply too slow for the run you want. The first instinct is easy: use more GPUs. The hard part starts right after that sentence. Once work is spread across multiple devices, you have to decide what gets copied, what gets split, and how the devices stay in sync.
This note is about that transition from single-device training to coordinated multi-device training.
## Foundation Terms You Need First
Start with the simplest multi-GPU picture. A **worker** is one process or device taking part in the run, and its **rank** is the integer index that identifies it within the group. **Synchronization** is how workers keep the model aligned by sharing gradients or parameters. **Communication overhead** is the time lost moving tensors between workers instead of computing. **[[Glossary#DDP|DDP]]** is the common starting pattern where each worker holds a full model replica and gradients are synchronized after local backward passes.
So the main question in this note is not just "how do we add GPUs?" It is "how do we add GPUs without losing too much time to waiting and coordination?"
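To make "worker" and "rank" concrete, here is a tiny probe script (not part of picoLLM; the filename `probe.py` is made up) that relies only on the environment variables `torchrun` sets for every process it spawns:

```python
# Hypothetical probe: launch under torchrun to see one process per GPU.
# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker it spawns.
import os

rank = int(os.environ.get("RANK", 0))              # global index of this worker
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # index within this node
world_size = int(os.environ.get("WORLD_SIZE", 1))  # total number of workers

print(f"worker {rank}/{world_size} (local GPU {local_rank}) is alive")
```

Run it with `torchrun --standalone --nproc_per_node=2 probe.py` and you get two lines of output, one per worker, because two copies of the same script are running.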
## How this lecture maps to picoLLM
Distributed training is not a side topic here. It is part of the main picoLLM training path.
- `speedrun.sh` launches the multi-process jobs
- `common.py` initializes [[Glossary#DDP|DDP]] and device state
- `pretrain/train.py` and `chat/eval.py` expose `rank`, distributed reductions, and worker-local progress directly in the logs
That is why this note should be taught together with [[Experiment Tracking and Run Analysis]] and [[Evaluation and Model Quality]]. The distributed story is visible in the logs you will actually read.
## What the `picollm` launch command is actually doing
It is easy to see a command like:
```bash
torchrun --standalone --nproc_per_node=8 -m picollm.accelerated.pretrain.train
```
and treat it as a magical incantation. It is not. It is just a distributed launcher.
- `torchrun` starts one Python worker per GPU
- `--nproc_per_node=8` means eight local worker processes
- each worker gets its own rank
- `compute_init()` in `picollm/accelerated/common.py` initializes NCCL when [[Glossary#CUDA|CUDA]] DDP is active
- gradients are synchronized so the model replicas stay aligned
So you should learn to read this command as: "start eight copies of the same training program, shard the work across them, then keep them numerically consistent."
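The real initialization lives in `compute_init()` in `picollm/accelerated/common.py`; its exact body may differ, but a minimal sketch of the same idea, using only the standard `torch.distributed` API, looks like this:

```python
# Minimal sketch of a DDP init helper, in the spirit of compute_init().
# The real function in picollm/accelerated/common.py may differ in detail.
import os
import torch
import torch.distributed as dist

def compute_init_sketch():
    rank = int(os.environ.get("RANK", 0))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    if world_size > 1:
        dist.init_process_group(backend="nccl")  # NCCL backend for CUDA tensors
        torch.cuda.set_device(local_rank)        # pin this worker to its own GPU
    return rank, world_size, torch.device("cuda", local_rank)
```

Every worker runs this same function; only the environment variables differ, and that is what turns eight identical scripts into eight coordinated workers.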
## Why multiple GPUs can be much faster
In data-parallel training, each GPU processes a shard of the batch, computes gradients locally, and then synchronizes those gradients so that all replicas stay consistent. This means more examples can be processed in the same wall-clock interval than on a single GPU. If the workload is well shaped and the interconnect is good, the speedup can be large.[^2]
This is the main intuition you need first: multiple GPUs do not change the learning objective. They change how much work can be done per unit time.
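One standard way to give each GPU its own shard of the batch is `torch.utils.data.DistributedSampler`. picoLLM may shard its token stream differently; the toy dataset and batch size below are placeholders, and the sketch only shows the shape of the idea:

```python
# Sketch: shard a dataset across workers so each GPU sees a disjoint slice.
# Assumes the process group is already initialized; the dataset is a toy stand-in.
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(1024))  # pretend these are token batches
sampler = DistributedSampler(dataset)        # reads rank/world_size from the group
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)                 # reshuffle the shards each epoch
    for (batch,) in loader:
        ...                                  # forward/backward on this shard only
```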
## Why scaling is not perfectly linear
If eight GPUs always gave exactly eight times the useful training speed of one GPU, distributed training would be easy to reason about. But scaling is never perfectly linear, because the GPUs must communicate. Gradients must be reduced, some state must stay synchronized, and a slow worker forces every other worker to wait at each synchronization point. As the model, batch structure, and software complexity grow, communication overhead becomes more visible.[^3]
This is why multi-GPU work is partly a systems problem rather than purely a mathematical one. The machine is not just "more GPUs." It is also the interconnect, the topology, the collective communication library, and the efficiency of the framework kernels.
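A toy model makes the shape of the problem visible. Suppose each step costs a fixed compute time plus a communication time that does not shrink as GPUs are added (the numbers below are illustrative, not measurements of any real system):

```python
# Toy scaling model: illustrative numbers only, not a real measurement.
def effective_speedup(n_gpus: int, compute_s: float = 1.0, comm_s: float = 0.15) -> float:
    """Work per second relative to one GPU, with a fixed per-step comm cost."""
    comm = comm_s if n_gpus > 1 else 0.0  # a single GPU never synchronizes
    return n_gpus * compute_s / (compute_s + comm)

for n in (1, 2, 4, 8):
    print(f"{n} GPUs -> {effective_speedup(n):.2f}x")
# 1 GPUs -> 1.00x, 2 -> 1.74x, 4 -> 3.48x, 8 -> 6.96x
```

In this toy model, eight GPUs buy roughly seven GPUs' worth of useful work, and the gap grows if communication gets more expensive with scale, which in practice it often does.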
## What you should know about DDP
`torchrun` and distributed launchers often feel magical at first, but the core idea of data parallelism is simple enough to teach. Each worker gets data, computes local gradients, and participates in synchronizing those gradients so the model replicas stay aligned. Distributed Data Parallel is therefore not a new optimization algorithm. It is a way of coordinating the same optimization process across several devices.[^4]
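A minimal sketch of that coordination, using the stock PyTorch `DistributedDataParallel` wrapper (the toy model, loss, and hyperparameters are placeholders, not picoLLM's):

```python
# Minimal DDP sketch: same optimizer step on every replica, gradients averaged.
# Assumes the process group was already initialized (see the init sketch above).
import os
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ.get("LOCAL_RANK", 0))
model = nn.Linear(512, 512).cuda(local_rank)  # toy model, not picoLLM's
model = DDP(model, device_ids=[local_rank])   # hooks gradient sync into backward
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

x = torch.randn(8, 512, device=f"cuda:{local_rank}")  # this worker's data shard
loss = model(x).square().mean()  # placeholder loss
loss.backward()                  # DDP all-reduces gradients during this call
opt.step()                       # every replica applies the same averaged update
opt.zero_grad()
```

The optimizer itself is unchanged; DDP's only job is to make sure each replica's `.grad` tensors hold the same averaged values before `opt.step()` runs.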
This is the right level of abstraction for a lecture note: enough clarity to demystify the launcher, without turning the course into a full distributed systems class.
## Why node quality matters, not just GPU count
A node with many GPUs is not valuable only because it has many GPUs. It is valuable because those GPUs can often communicate well with each other and because the surrounding CPU, storage, and network stack are strong enough to keep them busy. A weak storage path or poor interconnect can make an expensive machine underperform. This is one reason provider listings show details like bandwidth, NVLink or PCIe characteristics, and system memory.[^3]
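One rough way to check interconnect quality empirically is to time a large `all_reduce` (a sketch, not a calibrated benchmark; for serious numbers use NVIDIA's nccl-tests):

```python
# Rough interconnect probe: average time of a large all_reduce across workers.
# A sketch only; results are comparable just within the same machine and setup.
import time
import torch
import torch.distributed as dist

def allreduce_seconds(numel: int = 64 * 1024 * 1024, iters: int = 10) -> float:
    x = torch.randn(numel, device="cuda")  # ~256 MB of float32
    dist.all_reduce(x)                     # warm up NCCL before timing
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()               # collectives launch asynchronously
    return (time.perf_counter() - t0) / iters
```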
## What you should remember
The practical lesson is: more GPUs usually mean faster training, but not in a perfectly proportional way. The scientific lesson is: step time and [[Glossary#Throughput|throughput]] must be measured on the real system instead of assumed from GPU count alone. That is why live logs and experiment tracking remain essential even in distributed runs.
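Concretely, the number worth logging is measured throughput, not one derived from GPU count. A sketch with placeholder names (`step_fn`, `batch_tokens`, and `world_size` stand in for the run's actual step function and sizes):

```python
# Sketch: measure real step time and global token throughput in the loop.
import time
import torch

def timed_step(step_fn, batch_tokens: int, world_size: int):
    torch.cuda.synchronize()  # don't start the clock on queued async work
    t0 = time.perf_counter()
    step_fn()                 # one forward/backward/optimizer step
    torch.cuda.synchronize()
    dt = time.perf_counter() - t0
    return dt, batch_tokens * world_size / dt  # step seconds, global tokens/sec
```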
<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
<div><a class="internal-link" data-href="Compute, Time, and Cost of LLMs" href="Compute%2C%20Time%2C%20and%20Cost%20of%20LLMs">Compute, Time, and Cost of LLMs</a></div>
</div>
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
<div><a class="internal-link" data-href="Experiment Tracking and Run Analysis" href="Experiment%20Tracking%20and%20Run%20Analysis">Experiment Tracking and Run Analysis</a></div>
</div>
</div>
## References
[^1]: Montekkundan, [llm repository](https://github.com/Montekkundan/llm)
[^2]: PyTorch, [Distributed Data Parallel documentation](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)
[^3]: NVIDIA, [NCCL](https://developer.nvidia.com/nccl)
[^4]: Hugging Face, [Distributed training guides](https://huggingface.co/docs/transformers/perf_train_gpu_many)