This page is a reference list for core terms used across the course. Use it when a note names an unfamiliar concept and you want a short neutral definition rather than a full explanation.

## Core Terms

### Ablation

A controlled experiment where one important part of the system is removed or changed so you can study what effect that change actually had.

### Adam

An adaptive gradient-based optimizer that maintains running estimates of the first and second gradient moments.

### AdamW

An Adam-style optimizer with decoupled weight decay, widely used in Transformer training.

### Artifact

A saved output from a run, such as a checkpoint, manifest, report, or evaluation file.

### Attention head

One parallel attention pathway inside multi-head attention. Different heads can learn different patterns.

### Autoregressive model

A model that predicts the next token conditioned on previous tokens.

### Backend

The server-side part of a system that loads the model, handles requests, and returns results to clients.

### Backward pass

The stage of training where gradients are computed from the loss and propagated back through the model.

### Base model

A model trained on general next-token prediction before it is adapted for chat or instruction following.

### Baseline

The reference system, run, or configuration used for comparison in an experiment.

### Batch size

The number of sequences processed together in one forward pass, on one device unless stated otherwise.

### Benchmark

A standardized evaluation task or suite used to compare models under a shared protocol.

### BF16

Bfloat16. A 16-bit floating-point format commonly used for modern training because it is faster and smaller than FP32 while usually remaining more stable than naive FP16.

### Bits per byte (BPB)

A tokenizer-aware compression-style metric that normalizes performance by byte count rather than token count.

### BPE

Byte Pair Encoding. A common subword tokenization method that repeatedly merges frequent symbol pairs.

### Causal mask

A mask that prevents attention from seeing future positions.

### Chat template

The exact rule used to turn structured messages into one linear token sequence before training or inference.

### Checkpoint

A saved training state including model weights and often optimizer, scheduler, tokenizer, and metadata.

### Collapsing runs

Replacing repeated whitespace or repeated formatting characters with a simpler canonical version, such as turning many spaces into one space.

### Contamination

Leakage of evaluation or benchmark content into the training data, which can make results look better than they should.

### Context window

The maximum sequence length the model can process in one forward pass.

### CUDA

NVIDIA’s GPU compute platform, commonly used for local or cloud model training and inference.

### DDP

Distributed Data Parallel. A common strategy for synchronizing model training across multiple processes or GPUs.

### Decode

The autoregressive phase where the model generates one new token at a time.

### Decoder block

The repeated building block in a decoder-only Transformer, usually containing masked self-attention, normalization, residual connections, and a feed-forward network.

### Deduplication

Removing exact or near-duplicate examples from a dataset so the model does not overlearn repeated content.

### DPO

Direct Preference Optimization. A preference-learning method that avoids the full classical RLHF pipeline while still optimizing on chosen-versus-rejected comparisons.
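To make the *Causal mask* entry above concrete, here is a minimal PyTorch sketch. The sequence length and scores are made-up illustration values, not anything from the course code: it builds an upper-triangular mask and applies it to attention scores before the softmax.

```python
import torch

# Minimal causal-mask sketch (hypothetical sequence length and random scores).
seq_len = 4
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)  # True above the diagonal = future positions
scores = torch.randn(seq_len, seq_len)               # stand-in attention scores
scores = scores.masked_fill(future, float("-inf"))   # block attention to future positions
weights = torch.softmax(scores, dim=-1)              # each row sums to 1 over visible positions only
```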
### Embedding

A learned vector representation associated with a token or some other discrete feature.

### Feed-forward network (FFN)

The per-token nonlinear transformation inside a Transformer block, usually applied after attention.

### FLOP / FLOPS

`FLOP` is one floating-point operation. `FLOPS` is the rate of floating-point operations per second.

### Forward pass

The stage of training or inference where inputs move through the model to produce activations, logits, and sometimes loss.

### Gradient accumulation

Accumulating gradients over several micro-batches before taking one optimizer step. This increases the effective batch size without requiring the full batch to fit in GPU memory at once.

### Hyperparameter

A training or system setting chosen by the operator rather than learned by the model.

### Hypothesis

A specific claim or question that an experiment is designed to test.

### KV cache

A runtime optimization that stores keys and values from previous decoding steps so the model does not recompute the entire prefix each time.

### Latency

The time it takes one request or generation step to complete.

### Layer normalization

A normalization layer used to stabilize optimization in deep Transformers.

### Logits

Unnormalized scores over the vocabulary before softmax.

### Loss

A scalar objective minimized during training. For language modeling this is typically cross-entropy on next-token prediction.

### Matrix

A rectangular grid of numbers. Embedding tables, weight matrices, and attention projections are all matrices.

### MFU

Model FLOPs utilization. A rough measure of how much of the hardware's theoretical compute you actually achieve during a real training run.

### MPS

Apple’s Metal Performance Shaders backend used for local PyTorch acceleration on Apple Silicon.

### NFC

Unicode Normalization Form C. A standard way to combine characters into a composed canonical form when possible. For example, a letter plus an accent mark can be collapsed into one combined Unicode character.

### NFKC

Unicode Normalization Form KC. A more aggressive normalization form that also folds some compatibility variants into a common form.

### Observability

The ability to understand what a running training job or serving system is doing by using logs, metrics, traces, and dashboards.

### One-hot vector

A vector that is zero everywhere except at one index, where it is one. It is a standard way to represent a discrete choice mathematically.

### OpenAI-compatible API

An API surface that follows the OpenAI-style request and response shape so many clients and SDKs can work with it.

### Out-of-vocabulary (OOV)

The failure mode where a word or symbol is not directly available in a tokenizer's vocabulary.

### Parameter

A learned number inside the model. Embedding tables, projection matrices, and other weights are all made of parameters.

### Perplexity

A traditional language-model metric, computed as the exponential of the average negative log-likelihood per token.

### Pipeline parallelism

Splitting a model into stages across devices and sending micro-batches through those stages like a pipeline.

### Post-training

The stages after base pretraining that reshape a model into a more specialized assistant or product behavior.

### Prefill

The first inference phase where the prompt is processed and the initial KV cache is built.

### Protocol

The exact setup used to run an evaluation or experiment, including prompts, splits, scoring rules, and reporting choices.
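As a small illustration of the *Gradient accumulation* entry above, the following sketch uses a toy linear model. The model, data, learning rate, and step counts are all hypothetical; the point is the pattern of accumulating gradients over several micro-batches per optimizer step.

```python
import torch

# Minimal gradient-accumulation sketch (toy model and made-up data).
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()
accum_steps = 4  # one optimizer step per 4 micro-batches

optimizer.zero_grad()
for step in range(16):
    x, y = torch.randn(2, 8), torch.randn(2, 1)   # one micro-batch of 2 examples
    loss = loss_fn(model(x), y) / accum_steps      # scale so accumulated grads average over micro-batches
    loss.backward()                                # gradients add into .grad across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()                           # effective batch size = 2 * accum_steps = 8
        optimizer.zero_grad()
```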
### Quantization

Reducing weight or activation precision to lower memory use and sometimes increase inference efficiency.

### Regression test

A repeated check used to make sure a new run or checkpoint did not get worse on behaviors that previously worked.

### Repeatability

The ability of the same team to rerun the same setup and obtain materially similar results.

### Residual connection

An identity shortcut that lets a layer refine a representation instead of replacing it completely.

### Residual stream

The running hidden representation passed through the Transformer by residual connections, often central in mechanistic interpretability discussions.

### Reward model

A model trained on preference data to estimate which outputs humans are likely to prefer.

### RLHF

Reinforcement Learning from Human Feedback. A post-training pipeline that uses demonstrations and preference signals to improve assistant behavior.

### Robustness

The degree to which a result or claim still holds under small changes such as seed, hardware, or minor configuration differences.

### Self-attention

A mechanism that lets each token mix information from other tokens in the same sequence.

### SentencePiece

A tokenizer framework that treats tokenization as a trainable model and often uses explicit whitespace markers.

### SFT

Supervised fine-tuning. Training on curated prompt-response pairs so a base model behaves more like an instruction-following assistant.

### Softmax

A function that converts logits into a probability distribution.

### Special tokens

Control tokens such as `<|user|>`, `<|assistant|>`, or padding markers that tell the model about role boundaries or sequence structure.

### Speculative decoding

An inference acceleration strategy where a smaller or auxiliary model proposes tokens and the larger target model verifies them.

### Streaming

Returning model output incrementally as it is generated instead of waiting for the full response to finish.

### Stripping

Removing unwanted leading or trailing whitespace from text before further processing.

### Temperature

A decoding control that sharpens or flattens the output distribution before sampling.

### Tensor parallelism

Splitting large tensor operations across multiple devices so one layer is computed jointly by several GPUs.

### Throughput

The amount of useful work completed per unit time, often measured in requests per second, tokens per second, or optimizer steps per second.

### Token

A discrete symbol ID consumed by the model. Tokens are produced by a tokenizer and are not the same thing as words.

### Token ID

The integer index assigned to one token in the vocabulary. The model reads token IDs as inputs before the embedding layer turns them into vectors.

### Tokenizer

The system that converts raw text into token IDs and can decode token IDs back into text.

### Top-k

A decoding strategy that limits sampling to the `k` most likely next tokens.

### Top-p

A decoding strategy that samples from the smallest set of tokens whose cumulative probability exceeds `p`.

### Unigram tokenization

A tokenization method that chooses a high-probability segmentation from a learned set of candidate pieces instead of repeatedly merging pairs like BPE.

### Validation loss

A loss measured on held-out data that is not used for parameter updates. It is one of the simplest ways to check whether the model is generalizing rather than only fitting the training set.
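To tie together the *Softmax*, *Temperature*, and *Top-k* entries above, here is a minimal sampling sketch. The logits, temperature, and `k` are made-up illustration values, not recommended settings.

```python
import torch

# Minimal temperature + top-k sampling sketch (hypothetical logits and settings).
logits = torch.tensor([2.0, 1.0, 0.5, -1.0])          # unnormalized scores over a toy 4-token vocabulary
temperature, k = 0.8, 3

scaled = logits / temperature                          # <1.0 sharpens the distribution, >1.0 flattens it
topk_vals, topk_idx = torch.topk(scaled, k)            # keep only the k highest-scoring tokens
probs = torch.softmax(topk_vals, dim=-1)               # renormalize over the surviving tokens
next_token_id = topk_idx[torch.multinomial(probs, 1)]  # sample one token ID
```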
### Vector

An ordered list of numbers treated as one geometric object. In Transformers, vectors are the basic representation used for tokens, positions, hidden states, and outputs.

### Vocabulary

The finite set of token IDs the model can read and predict.

### Vocabulary size

The number of distinct token entries in the tokenizer vocabulary. A vocabulary size of `50,000` means the model can emit any of `50,000` distinct token IDs.

### Warmup

A short initial training phase where the learning rate ramps up gradually instead of jumping straight to its maximum value.

### WordPiece

A subword tokenization family used in models like BERT, with its own merge objective and greedy longest-match decoding behavior.

### ZeRO

A family of techniques that shards optimizer state, gradients, and sometimes parameters across workers to reduce memory overhead.
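For the *Warmup* entry above, here is a minimal sketch of one common linear warmup rule. The peak learning rate and step counts are arbitrary illustration values.

```python
# Minimal linear learning-rate warmup sketch (hypothetical peak LR and step counts).
peak_lr = 3e-4
warmup_steps = 100

def lr_at(step: int) -> float:
    # Ramp linearly from near zero up to peak_lr over warmup_steps, then hold flat.
    return peak_lr * min(1.0, (step + 1) / warmup_steps)

print(lr_at(0), lr_at(49), lr_at(500))  # 3e-06 0.00015 0.0003
```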