> [!info] Course code
> Use these repo paths together with this note:
> - [picollm/accelerated/speedrun.sh](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/speedrun.sh)
> - [picollm/accelerated/pretrain/train.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/pretrain/train.py)
> - [picollm/accelerated/common.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/common.py)

## What This Concept Is

People often say "bigger models get better," but that is only the rough headline. The more interesting question is how model size, data tokens, and compute budget should scale together if you want to spend your budget well. This note is about that more disciplined view.

Scaling laws matter because they turn guesswork into budgeted reasoning.

## Foundation Terms You Need First

A **scale axis** is one quantity you can increase, such as parameters, tokens, or compute. A **loss trend** is how model loss changes as those quantities grow. A **compute-optimal point** is the balance where parameters and token budget are matched well enough that the run is not obviously wasteful. An **undertrained model** is one that is too large for the amount of data or compute it received.

So the core question in this note is not "how big can we make it?" It is "given a budget, what mix of size and data makes sense?"

## The Kaplan-era intuition: bigger models look amazing if you keep scaling everything

The Kaplan et al. scaling-law work was influential because it showed smooth power-law behavior in language modeling as model size, dataset size, and compute increased. That result helped convince the field that scaling was not a chaotic lottery. There was structure, and that structure made large-model investment scientifically legible.[^1]

The simplified takeaway people often remember is "bigger models get better." But that is only half of the story.
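To make "smooth power-law behavior" concrete, here is a minimal numerical sketch of the parameter-only form of the trend, loss `L(N) = (N_c / N) ** alpha_N`. The constants below are of the magnitude reported in the Kaplan paper, but treat them, and the helper name `loss_from_params`, as illustrative assumptions rather than a reproduction of the actual fits:

```python
# Illustrative power-law loss trend in the spirit of Kaplan et al.
# L(N) = (N_c / N) ** alpha_N: predicted loss as a function of
# non-embedding parameter count N, with data and compute unconstrained.
# alpha_N and N_c are fit constants (values here are illustrative).

def loss_from_params(n_params: float,
                     n_c: float = 8.8e13,
                     alpha_n: float = 0.076) -> float:
    """Predicted language-model loss for a model with n_params parameters."""
    return (n_c / n_params) ** alpha_n

# The fit predicts smooth, diminishing improvement as N grows 10x at a time.
for n in (1e8, 1e9, 1e10):
    print(f"{n:.0e} params -> predicted loss {loss_from_params(n):.3f}")
```

Note the key property of a power law: every 10x increase in parameters shrinks the predicted loss by the same constant factor, which is what makes forecasting possible at all.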
The deeper lesson was that loss improved predictably as different axes increased, which suggested that model performance could be forecast and budgeted more systematically than before.

## The Chinchilla correction: many large models were undertrained

The Chinchilla work changed the conversation by arguing that many earlier large models were too parameter-heavy relative to the number of training tokens they saw. In other words, it was not enough to make a model bigger. If token count did not scale appropriately, the model could be overparameterized for the budget and therefore undertrained.[^2]

This is one of the most important ideas you can learn: a model can be too small for the data budget, but it can also be too large for the token budget. Bigger is not automatically more compute-optimal. If you have a fixed compute budget, there is a trade-off between spending it on more parameters and spending it on more training tokens.

## Parameter count versus token count

This is the central trade-off in compute-optimal training. More parameters increase capacity. More tokens improve how fully that capacity is used.

If you allocate too much compute to parameters and not enough to tokens, the model may never be trained enough to justify its size. If you allocate too much to tokens and too little to model capacity, you may underfit because the model is not expressive enough to absorb the data efficiently.[^2]

This is why serious training planning always asks two questions together:

- How large is the model?
- How many effective training tokens will it see?

Those two numbers mean more than either one alone.

## Compute-optimal allocation

Compute-optimal training means choosing the ratio of parameters to tokens so that the model uses the available training budget efficiently. It does not mean the largest model that can be instantiated.
It means the model-data pairing that is expected to give the best loss or capability return for the available [[Glossary#FLOP / FLOPS|FLOP]] budget.[^2]

This is a deeply useful idea because it explains why some impressive-looking runs are actually poor design choices. You might spend a lot of money on a model with too many parameters, then stop training too early because the bill becomes painful. The result is not a "large model." It is a partially trained large model, which is often much worse than a better-balanced smaller run.

## How to think about undertrained versus overparameterized models

An undertrained model is one that has not seen enough tokens or optimization time to make good use of its parameters. This often looks like a model that "should be smart" because its architecture is large, but still responds poorly or inconsistently.

An overparameterized model in this context is not necessarily bad in the abstract. Overparameterization can aid optimization. But under a fixed compute budget, it can become inefficient if the model is so large that the run cannot provide enough tokens to train it properly. This is exactly the kind of reasoning that Chinchilla-style analysis sharpens.[^2]

The right question is therefore not "Is overparameterization good or bad?" The right question is "Given this compute budget, is this parameter count being trained enough to justify itself?"

## Reasoning about token budgets before launching a run

This is where scaling-law thinking becomes operational. Before launching a run, you should estimate:

- total parameters
- tokens per step
- total steps
- total tokens seen
- approximate total training FLOPs
- approximate wall-clock time on the actual hardware

This lets you ask whether the run is likely to be seriously undertrained before you spend money. That habit is more important than memorizing any one scaling-law exponent.

For a course capstone, the best version of this reasoning is modest but disciplined.
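The pre-launch estimates above fit in a few lines of arithmetic. A minimal sketch, assuming the common `C ≈ 6 · N · D` approximation for training FLOPs and the Chinchilla-style rule of thumb of roughly 20 tokens per parameter; the function name `budget_report`, the MFU figure, and the example numbers are hypothetical, not taken from any particular run:

```python
# Back-of-envelope pre-launch budget check (all names and numbers illustrative).
# Assumptions: training compute C ~ 6 * N * D FLOPs for N parameters and
# D tokens, and a Chinchilla-style heuristic of ~20 tokens per parameter.

def budget_report(n_params: float,
                  tokens_per_step: float,
                  total_steps: int,
                  hardware_flops_per_sec: float,
                  mfu: float = 0.35) -> dict:
    """Estimate total tokens, training FLOPs, and wall-clock time for a run."""
    total_tokens = tokens_per_step * total_steps
    train_flops = 6.0 * n_params * total_tokens            # C ~ 6ND
    wall_clock_s = train_flops / (hardware_flops_per_sec * mfu)
    chinchilla_tokens = 20.0 * n_params                    # rough heuristic
    return {
        "total_tokens": total_tokens,
        "train_flops": train_flops,
        "wall_clock_hours": wall_clock_s / 3600.0,
        # << 1.0 suggests a seriously undertrained run; ~1.0 is balanced.
        "tokens_vs_chinchilla": total_tokens / chinchilla_tokens,
    }

# Hypothetical example: a 124M-parameter model, 0.5M tokens/step, 5000 steps,
# on hardware sustaining ~100 TFLOP/s at 35% model FLOPs utilization.
report = budget_report(124e6, 0.5e6, 5000, 100e12)
for key, value in report.items():
    print(f"{key}: {value:.3g}")
```

The point of the `tokens_vs_chinchilla` ratio is the habit described above: a number far below 1.0 flags, before any money is spent, that the parameter count is probably too large for the token budget.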
You do not need to reproduce frontier-scale scaling-law curves. You need to ensure that the architecture, token budget, and runtime form a coherent story.

## Why scaling laws matter even for small runs

It is easy to assume scaling laws matter only at OpenAI, Anthropic, Google, or DeepMind scale. That is not true. Even small runs benefit from scaling-law thinking because the same trade-offs still exist. The absolute numbers are smaller, but the design logic is the same: balance model size against the number of tokens and the available compute.

This is one of the reasons the capstone should be framed as a calibrated run rather than a random run. The moment you learn to choose a model and token budget together, you start thinking more like a real researcher.

<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
<div><a class="internal-link" data-href="Vercel AI SDK Chat App" href="Vercel%20AI%20SDK%20Chat%20App">Vercel AI SDK Chat App</a></div>
</div>
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
<div><a class="internal-link" data-href="Optimizer Theory for Transformer Training" href="Optimizer%20Theory%20for%20Transformer%20Training">Optimizer Theory for Transformer Training</a></div>
</div>
</div>

## References

[^1]: Jared Kaplan et al., OpenAI, [Scaling Laws for Neural Language Models](https://arxiv.org/abs/2001.08361)
[^2]: Jordan Hoffmann et al., DeepMind, [Training Compute-Optimal Large Language Models](https://arxiv.org/abs/2203.15556)
[^3]: Hugging Face TB, [The Smol Training Playbook](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook)
[^4]: Andrej Karpathy, [nanochat](https://github.com/karpathy/nanochat)