> [!info] Course code > Use these code paths together with this note: > - [picollm/accelerated/pretrain/train.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/pretrain/train.py) > - [picollm/accelerated/common.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/common.py) > - [picollm/accelerated/speedrun.sh](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/speedrun.sh) ## What This Concept Is When you first look at a long training command, it is easy to see only a wall of flags and numbers. This note slows that down and turns those knobs back into decisions you can reason about. Each hyperparameter changes some part of the training story: model shape, data flow, optimization, precision, or runtime behavior. The point is not to memorize flags. The point is to stop seeing the command line as noise. ## Foundation Terms You Need First A **[[Glossary#Hyperparameter|hyperparameter]]** is a setting chosen by the operator rather than learned by the model. Some hyperparameters describe the **model shape**, such as depth or width. Others describe the **batch shape**, such as batch size, sequence length, or gradient accumulation. Others describe the **optimization budget**, such as steps, warmup, and learning-rate schedule. If you sort the flags into those buckets while reading, the command line becomes much easier to understand. You stop asking "what does this number do?" and start asking "which part of the system is this controlling?" ## The accelerated picoLLM path The current serious from-scratch path is [picollm/accelerated/pretrain/train.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/pretrain/train.py). It groups the main training decisions like this: - dataset preparation is mostly handled before training - tokenizer training lives in its own explicit stage - model shape is controlled by `--depth`, `--aspect-ratio`, `--head-dim`, `--max-seq-len`, and `--window-pattern` - training horizon is controlled by `--target-param-data-ratio`, `--num-iterations`, or `--target-flops` - device and precision are controlled by `--device-type`, auto-detected compute dtype, and optional `--fp8` So you should read this note through the accelerated stack itself, not through a second older training surface. ## The big picture When you run a command like: ```bash uv run python -m picollm.accelerated.pretrain.train \ --depth 24 \ --aspect-ratio 64 \ --head-dim 128 \ --max-seq-len 2048 \ --target-param-data-ratio 8 \ --device-batch-size 16 \ --warmup-steps 40 \ --fp8 ``` you are making decisions in four layers: 1. what data the model sees 2. how text is tokenized and chunked 3. how large the model is 4. how aggressively and how long it trains That is why changing one number can affect: - memory usage - speed - stability - output quality - whether the run crashes - whether the model becomes more coherent or simply overfits ## Device and precision basics you should know first When you first open the serious run, three device words appear immediately: - `cpu` - `[[Glossary#MPS|mps]]` - `cuda` These are not model architectures. They are execution backends. - `cpu` means the ordinary processor. It is good for correctness checks and tiny demonstrations, but not for serious LLM training. - `mps` is PyTorch's Metal backend for Apple Silicon. It lets MacBooks and Mac Studios run local inference and small experiments on the integrated GPU. - `cuda` is the NVIDIA path. 
This is the serious training backend for the accelerated stack, and it is the path used on rented H100 or A100 boxes.

In the current `picollm` path, this logic lives in [picollm/accelerated/common.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/common.py), mainly in `_detect_compute_dtype()`, `autodetect_device_type()`, and `compute_init()`.

Precision words matter just as much:

- `fp32`: the classic 32-bit floating-point format. Safest, but slower and larger.
- `tf32`: an NVIDIA matmul mode that keeps the dynamic range of fp32 while trading a little mantissa precision for faster tensor-core matrix multiplication.
- `fp16`: 16-bit floating point with a smaller exponent range. Fast, but often needs gradient scaling.
- `bf16`: also 16-bit, but with a larger exponent range. On Ampere and Hopper GPUs it is usually the best default training precision.
- `fp8`: an even smaller format used only in specialized accelerated training paths. In `picollm`, it is enabled explicitly with `--fp8` and only makes sense on newer NVIDIA hardware such as H100-class GPUs.

This note treats precision as a systems choice, not as a prestige badge. Smaller formats are useful because they reduce memory traffic and speed up tensor-core kernels, but they only help when the hardware and software stack are designed to use them safely.

## Dataset selection

### `--dataset-name`

This chooses the dataset loader. In our cloud path, examples include:

- `HuggingFaceFW/fineweb-edu`
- `HuggingFaceTB/everyday-conversations-llama3.1-2k`

The most important idea is that dataset choice is not only about size. It also determines what kind of language behavior the model sees.

If you use `HuggingFaceFW/fineweb-edu`, the model sees broad natural text. That makes it suitable for base pretraining, where the goal is to teach the model general language statistics before it learns an assistant role.

If you use `HuggingFaceTB/everyday-conversations-llama3.1-2k`, the model sees explicit multi-turn chat messages with roles. That makes it useful for the chat-[[Glossary#SFT|SFT]] stage, where you want the already-pretrained [[Glossary#Checkpoint|checkpoint]] to learn:

- greetings
- short replies
- back-and-forth dialogue rhythm

So if you change the dataset, you are not just changing the amount of data. You are changing the distribution of language the model tries to imitate. [^4]

### `--dataset-config`

Some Hugging Face datasets contain multiple configurations under one dataset name. In earlier runs, `wikitext` needed a config such as:

- `wikitext-2-raw-v1`
- `wikitext-103-raw-v1`

That extra config tells the loader which concrete version of the dataset you want. If a dataset has only one default configuration, you may not need this flag. If it has several, the loader cannot guess safely. [^4]

### `--dataset-split`

This chooses which part of the dataset you are using. The standard split meanings are:

- `train`: used to update the model weights
- `validation`: used to check progress without training on those examples
- `test`: used for a final held-out check

This matters because using the same examples for both training and evaluation gives you an overly optimistic picture of quality. The simple mental model is:

- `train` teaches the model
- `validation` monitors the model
- `test` audits the finished model

In practice, our scripts usually train on `train` and evaluate on `validation`. [^4]

### `--text-column`

Datasets often contain multiple fields. This flag says which column should be turned into training text. Examples:

- `text` for plain text datasets like `fineweb-edu`
- `messages` for chat datasets like `everyday-conversations-llama3.1-2k`

If you choose the wrong column, the training script may:

- crash
- train on empty values
- train on metadata instead of language

So this is a data-selection flag, not a model-quality trick.
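Before committing GPU hours, it is worth spending thirty seconds looking at real rows to confirm that the dataset name, config, split, and text column all line up. Here is a minimal inspection sketch using the `datasets` loading API cited above; the dataset names and column names are simply the examples from this note, and the actual `picollm` data stage prepares its data in its own way.

```python
# Sanity-check dataset name, config, split, and text column before a long run.
# Dataset names and columns here are the examples used in this note, not a recipe.
from datasets import load_dataset

# Streaming lets you peek at a huge corpus without downloading all of it.
ds = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
row = next(iter(ds))
print(sorted(row.keys()))   # confirm the column you plan to train on exists
print(row["text"][:200])    # eyeball what the model will actually imitate

# A config-bearing dataset takes the config name as the second argument.
wiki = load_dataset("wikitext", "wikitext-2-raw-v1", split="validation")
print(wiki.column_names, len(wiki))
```

Most wrong-column and wrong-split mistakes are caught at this stage rather than several minutes into a paid run.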
## Chat-structure formatting

### Structured chat roles

In the current repo, structured chat roles live directly in:

- [picollm/accelerated/tokenizer.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/tokenizer.py)
- [picollm/accelerated/chat/sft.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/sft.py)

When a dataset row is a list of turns, the pipeline rewrites the row into text with role markers:

```text
<|user|> hi
<|assistant|> hello
<|user|> how are you?
<|assistant|> i am good
```

Without this formatting, the model would still see dialogue content, but it would not clearly see who is speaking when. With role markers, the model can learn that:

- one turn belongs to the user
- the next turn belongs to the assistant
- an assistant response should come after a user message

That idea matches the broader chat-templating pattern used in modern instruct and chat models. Hugging Face's chat-template docs explain that chat inputs are turned into token sequences with control tokens such as `<|user|>` and `<|assistant|>` so the model can see the conversation structure. [^5] In the accelerated path, `picollm/accelerated/tokenizer.py` bakes structured chat tokens and rendering rules into the tokenizer and chat pipeline itself.

If this formatting is dropped for a dialogue dataset:

- the model may still learn local conversational text patterns
- role structure becomes easier to blur across turns
- but the model is less clearly taught the boundary between user turns and assistant turns

If you apply it to a plain text dataset with no alternating turns, it does not help and may even produce a strange training format.

So this formatting is not "always better." It is better when the data is actually conversational.

## Tokenizer knobs

### `--vocab-size`

This controls the final size of the tokenizer [[Glossary#Vocabulary|vocabulary]]. Hugging Face's `BpeTrainer` defines it as the size of the final vocabulary, including tokens and alphabet. [^3]

This is one of the most important tokenizer decisions because it changes both:

- how text is broken up
- how large parts of the model become

If `vocab-size` is smaller:

- the tokenizer must represent text using fewer learned units
- words are more likely to be broken into many smaller pieces
- the [[Glossary#Embedding|embedding]] table and output projection are smaller
- training and inference may use less memory

If `vocab-size` is larger:

- the tokenizer can keep more common subwords or whole words intact
- average token sequences may become shorter
- the embedding table and output layer get larger
- memory and parameter count increase

A small vocabulary is not automatically bad. In fact, very small local models often need a modest vocabulary because otherwise too many parameters get spent on embeddings instead of the rest of the network.

A larger vocabulary is not automatically better either. If the training corpus is small, a very large vocabulary can waste capacity on rare fragments the model never learns well.

So in practice:

- smaller `vocab-size` trades linguistic granularity for efficiency
- larger `vocab-size` trades efficiency for potentially cleaner tokenization
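The two what-if questions below are easy to see for yourself by training throwaway tokenizers at different sizes. Here is a minimal sketch using the `BpeTrainer` cited above; the corpus, vocabulary sizes, and special tokens are toy placeholders, and the real tokenizer stage lives in `picollm/accelerated/tokenizer.py`.

```python
# Train two throwaway BPE tokenizers and compare how they split the same
# sentence. The corpus and the vocab sizes are toy placeholders.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

corpus = [
    "the training run finished overnight",
    "tokenization decisions change sequence length",
    "small models spend a lot of parameters on embeddings",
] * 500

def train_bpe(vocab_size: int) -> Tokenizer:
    tok = Tokenizer(BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tok.train_from_iterator(corpus, trainer)
    return tok

for size in (60, 500):
    tok = train_bpe(size)
    enc = tok.encode("tokenization decisions change sequence length")
    # Smaller vocab -> more, shorter pieces per word.
    # Larger vocab  -> fewer, longer pieces, but a bigger embedding table later.
    print(size, tok.get_vocab_size(), len(enc.tokens), enc.tokens)
```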
### What if `vocab-size` is too low?

If it is too low, you often see:

- long token sequences
- more fragmented words
- harder next-token prediction because the model must compose meaning from smaller pieces

### What if `vocab-size` is too high?

If it is too high, you often see:

- more memory used by embeddings and output [[Glossary#Logits|logits]]
- more parameters for a tiny model to learn
- worse parameter efficiency if the corpus is small

## Sequence length

### `--block-size` / `--max-seq-len`

This controls the maximum sequence length used in training chunks in our script. It also becomes the GPT-style model's positional capacity because we pass it into GPT-2 config fields such as `n_positions`. Hugging Face defines `n_positions` as the maximum sequence length the model might be used with. [^2]

If `block-size` is smaller:

- each example contains less context
- memory usage drops
- training is faster
- the model learns shorter-range dependencies

If `block-size` is larger:

- each example contains more context
- memory usage rises sharply
- attention cost increases
- the model can learn longer-range relationships

For small local models, $256$ is a practical compromise. For the serious `picollm` cloud capstone, $1024$ is more appropriate because it gives the model a more realistic working context while still fitting comfortably on the calibrated H100 recipe.

If you raise `block-size` without increasing GPU memory, you can hit out-of-memory errors even if nothing else changes. So this is a context-length knob with major memory consequences.

> [!important]
> In the `picollm` capstone path, `--batch-size` is per GPU.
> The full token budget depends on all four of these together:
> $\text{per\_device\_batch\_size} \times \text{num\_gpus} \times \text{grad\_accum} \times \text{block\_size}$.
> This is the number you should use when you estimate how long a run will take.

## Model-size knobs

These are the core architecture-size controls in a GPT-style model.

### `--layers` / `--depth`

This maps to GPT-2 config field `n_layer`, which Hugging Face defines as the number of hidden layers in the Transformer decoder. [^2]

If you increase `layers`:

- the model becomes deeper
- it can represent more complex transformations
- parameter count rises
- training gets slower
- memory use rises

If you decrease `layers`:

- the model becomes shallower
- training is cheaper
- the model has less representational depth

For a tiny model, depth helps, but only if the rest of the network is also large enough to use it well.

### `--heads` / `--head-dim`

This maps to GPT-2 config field `n_head`, defined by Hugging Face as the number of attention heads for each attention layer in the Transformer decoder. [^2]

Attention heads let the model compute multiple attention patterns in parallel within each layer.

If you increase `heads`:

- the model has more parallel attention subspaces
- attention computation structure changes
- parameter and compute costs can rise depending on the hidden size

If you decrease `heads`:

- the attention mechanism becomes simpler
- the model may lose flexibility in how it distributes attention patterns

But `heads` does not live alone. It interacts with hidden size.

### `--hidden-size` / `--aspect-ratio`

This maps to GPT-2 config field `n_embd`, which Hugging Face defines as the dimensionality of the embeddings and hidden states. [^2]

This is the width of the model.

If you increase `hidden-size`:

- each token representation becomes wider
- model capacity rises
- compute and memory rise substantially

If you decrease `hidden-size`:

- the model becomes narrower and cheaper
- parameter count drops
- the model may underfit more easily

### How `layers`, `heads`, and `hidden-size` interact

It is tempting to ask which one matters most. That is not the most useful question. The real question is whether the model has a balanced budget.

For example:

- a very wide model with very few layers may be under-deep
- a deep model with too little hidden size may be bottlenecked
- many heads with an undersized hidden dimension can become awkward because the per-head dimension gets too small

So you should stop thinking of these as independent sliders. They form the model's capacity shape.
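One way to build intuition for capacity shape is to instantiate a few throwaway configs and count parameters. Here is a minimal sketch using the `GPT2Config` fields cited above; the two shapes are illustrative and are not the `picollm` presets.

```python
# Compare parameter counts and per-head dimensions for two capacity shapes.
# The shapes below are illustrative, not the picollm presets.
from transformers import GPT2Config, GPT2LMHeadModel

shapes = {
    "wide-and-shallow": dict(n_layer=2, n_head=8, n_embd=512),
    "narrow-and-deep": dict(n_layer=8, n_head=4, n_embd=256),
}

for name, shape in shapes.items():
    config = GPT2Config(vocab_size=8192, n_positions=256, **shape)
    model = GPT2LMHeadModel(config)  # randomly initialized, never trained
    head_dim = config.n_embd // config.n_head  # width each attention head works in
    params_m = model.num_parameters() / 1e6
    print(f"{name}: {params_m:.1f}M params, head_dim={head_dim}")
```

Swapping depth for width, or changing vocabulary size, moves parameters between the embedding table and the transformer blocks, which is exactly the budget question this subsection is asking.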
## Training-throughput knobs

### `--batch-size`

In Hugging Face `TrainingArguments`, `per_device_train_batch_size` is the [[Glossary#Batch size|batch size]] per device, and the global batch size in multi-device setups is $\text{per\_device\_train\_batch\_size} \times \text{number\_of\_devices}$. [^1]

In our scripts, `--batch-size` is per device.

If you increase it:

- each optimizer update sees more examples at once
- [[Glossary#Throughput|throughput]] may improve
- memory usage increases
- gradient estimates may become less noisy

If you decrease it:

- memory usage drops
- training may become noisier
- throughput may drop

If it is too large for the GPU, the run fails with out-of-memory.

### `--grad-accum`

This maps to `gradient_accumulation_steps`. Hugging Face defines it as the number of update steps to accumulate gradients before performing a backward/update pass, which simulates larger batch sizes without additional memory. Effective batch size is $\text{per\_device\_train\_batch\_size} \times \text{num\_devices} \times \text{gradient\_accumulation\_steps}$. [^1]

This is one of the most useful knobs when GPU memory is tight.

If you increase `grad-accum`:

- the effective batch becomes larger
- you can simulate a larger batch without fitting it all in memory at once
- each optimizer update happens less often
- each update takes longer in wall-clock time

If you decrease `grad-accum`:

- optimizer steps happen more often
- effective batch shrinks
- training may become noisier

So this knob is mainly about trading time for memory.

## Optimization-budget knobs

### `--learning-rate`

Hugging Face defines `learning_rate` as the initial learning rate for the optimizer and notes that it is typically the peak learning rate when using a scheduler with [[Glossary#Warmup|warmup]]. [^1]

This controls how aggressively the optimizer moves the weights.

If it is too high:

- [[Glossary#Loss|loss]] may become unstable
- training may diverge
- the model may never settle

If it is too low:

- training may be stable but painfully slow
- the model may undertrain within the available step budget

This is why learning rate is usually discussed together with warmup and total steps.

### `--warmup-steps`

Hugging Face defines this as the number of steps for a linear warmup from $0$ to `learning_rate`, and explicitly notes that warmup helps stabilize training in the initial phase. [^1]

The idea is simple:

- do not start with the full learning rate immediately
- ramp up instead

If `warmup-steps` is too small:

- early optimization can be unstable

If it is too large:

- too much of the run is spent at a lower-than-intended learning rate

For short runs, warmup matters a lot because a few hundred steps can be a non-trivial fraction of the full training budget.
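To see what warmup actually does to the step size, here is a minimal sketch of a linear-warmup-then-cosine-decay schedule in plain PyTorch. The peak learning rate, warmup length, and step budget are placeholder numbers, and the exact schedule used by `train.py` may differ.

```python
# Linear warmup to the peak learning rate, then cosine decay toward zero.
# All numbers here are placeholders, not the picollm recipe.
import math
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # lr is the peak value

warmup_steps, max_steps = 40, 5000

def lr_scale(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)          # ramp from 0 up to 1
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # decay from 1 down to 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)
# In the training loop you would call optimizer.step() and then scheduler.step().

for step in (0, 20, 40, 2500, 4999):
    print(step, f"{3e-4 * lr_scale(step):.2e}")  # learning rate at that step
```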
### `--max-steps`

Hugging Face defines `max_steps` as the total number of training steps to perform and says that it overrides `num_train_epochs`. For a finite dataset, training keeps iterating through the dataset until `max_steps` is reached. [^1]

This is the explicit optimizer-update budget.

If you increase `max-steps`:

- the model gets more learning opportunities
- training takes longer
- overfitting becomes more possible on small datasets

If you decrease `max-steps`:

- training is cheaper and faster
- the model may stop before it has learned enough

This is why a run can end at a fractional epoch. The stopping rule was about steps, not full dataset passes.

### `--save-steps`

Hugging Face defines `save_steps` as the number of update steps between two checkpoint saves when the save strategy is `"steps"`. [^1]

If you decrease `save-steps`:

- you get more checkpoints
- recovery is easier
- disk usage rises
- saving overhead rises

If you increase `save-steps`:

- you get fewer checkpoints
- disk usage drops
- but you have fewer recovery points if the run fails

So this is mostly a reliability and storage knob, not a quality knob.

## Precision knob

### `--bf16` / auto dtype / `--fp8`

In Hugging Face `TrainingArguments`, `bf16=True` enables bfloat16 mixed precision training and is generally preferred over FP16 due to better numerical stability and no loss scaling requirement. [^1] PyTorch's mixed-precision documentation explains the broader idea: mixed precision uses both full precision and lower precision to improve performance while maintaining accuracy. [^6]

In the accelerated `picollm` path, the compute dtype is auto-detected in `picollm/accelerated/common.py` rather than set by hand, and `--fp8` is a separate explicit opt-in that only makes sense on H100-class hardware.

If your hardware supports [[Glossary#BF16|BF16]] well, this often gives you:

- lower memory use than full FP32
- better throughput
- more stable training than naive FP16 in many setups

If you turn BF16 off:

- training may use more memory
- training may be slower
- but compatibility can improve on hardware that does not support BF16 properly

If you replace BF16 with FP16:

- memory and speed can still improve relative to FP32
- but numerical stability can be worse
- some setups need gradient scaling

So BF16 is not a magic quality booster. It is a precision/performance tradeoff with good stability properties on modern supported hardware.

## A practical way to reason about changes

When you change a training command, ask these questions in order:

1. Did I change the data distribution?
2. Did I change the tokenizer granularity?
3. Did I change model capacity?
4. Did I change memory usage?
5. Did I change optimization stability?
6. Did I change training budget?

For example:

- increasing `vocab-size` changes tokenizer shape and parameter count
- increasing `block-size` changes context length and memory use
- increasing `hidden-size` changes model capacity and memory use
- increasing `batch-size` changes throughput and memory use
- increasing `grad-accum` changes effective batch without increasing micro-batch memory
- increasing `max-steps` changes the optimization budget
- turning on `bf16` changes precision and hardware efficiency

That is the level on which these flags should make sense to you.
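Questions 4 and 6 in that checklist are mostly arithmetic, so it helps to keep the math in one small helper. This is a sketch of the same arithmetic the next section walks through on a live log; the input values below are just the calibrated-preset numbers quoted there, so swap in your own knob values to estimate a different run.

```python
# Back-of-envelope run math: effective batch, tokens per optimizer step,
# total token budget, and rough wall-clock time. Inputs are placeholders
# taken from the calibrated preset discussed in the next section.

def run_math(per_device_batch, num_gpus, grad_accum, block_size, max_steps, sec_per_step):
    effective_batch = per_device_batch * num_gpus * grad_accum
    tokens_per_step = effective_batch * block_size
    total_tokens = tokens_per_step * max_steps
    hours = max_steps * sec_per_step / 3600
    return effective_batch, tokens_per_step, total_tokens, hours

eb, tps, total, hours = run_math(
    per_device_batch=8, num_gpus=8, grad_accum=16,
    block_size=1024, max_steps=5000, sec_per_step=2.84,
)
print(f"effective batch : {eb}")
print(f"tokens per step : {tps:,}")
print(f"total tokens    : {total / 1e9:.2f}B")
print(f"rough wall clock: {hours:.1f} h")
```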
## Runtime and cost math from a live log Once a run starts, the easiest ETA estimate comes from the progress bar. If the log says: - `21 / 5000` - about `2.84s/it` then the pretraining ETA is roughly: - $5000 \times 2.84\,\text{s} \approx 14{,}200\,\text{s}$ - about $3.95$ hours For the calibrated `8x H100` accelerated preset in `picollm/accelerated/speedrun.sh`, the token math is: - $\text{per\_device\_batch\_size} = 8$ - $\text{num\_gpus} = 8$ - $\text{grad\_accum} = 16$ - $\text{block\_size} = 1024$ So: - $\text{tokens\_per\_step} = 8 \times 8 \times 16 \times 1024 = 1{,}048{,}576$ That means the $5000$-step pretraining stage processes about $5.24\text{B}$ tokens before chat SFT starts. > [!tip] > Keep two questions separate: > 1. "How much language does the model see?" That is about token budget. > 2. "How long will this exact run take?" That is about step time. > The first is mostly about training value. The second is mostly about engineering throughput. For the deeper system explanation, read [[Compute, Time, and Cost of LLMs]]. ## What you should remember most The most important mistake to avoid is treating these values as random defaults copied from a tutorial. They are not random. They are the compact interface for deciding: - what language the model sees - how text is represented - how large the model is - how much memory the run needs - how long optimization lasts That is why changing even one of them can materially change the result. <div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;"> <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);"> <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div> <div><a class="internal-link" data-href="Training Loop" href="Training%20Loop">Training Loop</a></div> </div> <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);"> <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div> <div><a class="internal-link" data-href="Inference and Sampling" href="Inference%20and%20Sampling">Inference and Sampling</a></div> </div> </div> ## References [^1]: Hugging Face Transformers, `Trainer` and `TrainingArguments`: [https://huggingface.co/docs/transformers/main_classes/trainer](https://huggingface.co/docs/transformers/main_classes/trainer) [^2]: Hugging Face Transformers, `GPT2Config`: [https://huggingface.co/docs/transformers/en/model_doc/gpt2#transformers.GPT2Config](https://huggingface.co/docs/transformers/en/model_doc/gpt2#transformers.GPT2Config) [^3]: Hugging Face Tokenizers, `BpeTrainer`: [https://huggingface.co/docs/tokenizers/en/api/trainers#tokenizers.trainers.BpeTrainer](https://huggingface.co/docs/tokenizers/en/api/trainers#tokenizers.trainers.BpeTrainer) [^4]: Hugging Face Datasets loading and splits: [https://huggingface.co/docs/datasets/en/loading](https://huggingface.co/docs/datasets/en/loading) [^5]: Hugging Face chat templates: [https://huggingface.co/docs/transformers/main/en/chat_templating](https://huggingface.co/docs/transformers/main/en/chat_templating) [^6]: PyTorch mixed precision notes: [https://docs.pytorch.org/docs/stable/notes/amp_examples.html](https://docs.pytorch.org/docs/stable/notes/amp_examples.html)