This page is a reference list for core terms used across the course.
Use it when a note names an unfamiliar concept and you want a short neutral definition rather than a full explanation.
## Core Terms
### Ablation
A controlled experiment where one component of the system is removed, disabled, or changed so you can measure how much that component actually contributed.
### Adam
An adaptive gradient-based optimizer that maintains running estimates of first and second gradient moments.
### AdamW
An Adam-style optimizer with decoupled weight decay, widely used in Transformer training.
### Artifact
A saved output from a run, such as a checkpoint, manifest, report, or evaluation file.
### Attention head
One parallel attention pathway inside multi-head attention. Different heads can learn different patterns.
### Autoregressive model
A model that predicts the next token conditioned on previous tokens.
### Backend
The server-side part of a system that loads the model, handles requests, and returns results to clients.
### Backward pass
The stage of training where gradients are computed from the loss and propagated back through the model.
### Base model
A model trained on general next-token prediction before it is adapted for chat or instruction following.
### Baseline
The reference system, run, or configuration used for comparison in an experiment.
### Batch size
The number of sequences processed together in one forward pass, usually counted per device unless stated otherwise.
### Benchmark
A standardized evaluation task or suite used to compare models under a shared protocol.
### BF16
Bfloat16. A 16-bit floating-point format commonly used for modern training because it is faster and smaller than FP32 while usually remaining more stable than naive FP16.
### Bits per byte (BPB)
A tokenizer-aware, compression-style metric that normalizes language-modeling loss by byte count rather than token count, which makes comparisons across different tokenizers fairer.
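One common formulation, as a sketch, assuming the per-token loss $\mathcal{L}$ is cross-entropy in nats and $N_{\text{bytes}}$ counts the raw UTF-8 bytes of the evaluated text:

$$\text{BPB} = \frac{N_{\text{tokens}} \cdot \mathcal{L}}{N_{\text{bytes}} \cdot \ln 2}$$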
### BPE
Byte Pair Encoding. A common subword tokenization method that repeatedly merges frequent symbol pairs.
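A toy training sketch (illustrative only; the tiny corpus and merge count are made up, and real BPE implementations work at much larger scale, often over bytes):

```python
from collections import Counter

# Each word starts as a tuple of single characters; we repeatedly merge
# the most frequent adjacent pair into one new symbol.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 6}

def most_frequent_pair(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # fuse the pair into one symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(3):  # learn three merges
    corpus = merge(corpus, most_frequent_pair(corpus))
print(corpus)
```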
### Causal mask
A mask that prevents attention from seeing future positions.
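A minimal PyTorch sketch, assuming raw attention scores of shape `[seq_len, seq_len]`:

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)  # raw attention scores
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
masked = scores.masked_fill(future, float("-inf"))  # hide future positions
weights = torch.softmax(masked, dim=-1)             # future positions get zero weight
```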
### Chat template
The exact rule used to turn structured messages into one linear token sequence before training or inference.
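A toy rendering sketch; the role markers below are illustrative, not the exact tokens any particular model family uses:

```python
def render(messages):
    # Flatten structured messages into one string using made-up role markers.
    parts = [f"<|{m['role']}|>{m['content']}<|end|>" for m in messages]
    parts.append("<|assistant|>")  # cue the model to produce the next reply
    return "".join(parts)

print(render([{"role": "user", "content": "Hi!"}]))
# <|user|>Hi!<|end|><|assistant|>
```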
### Checkpoint
A saved training state including model weights and often optimizer, scheduler, tokenizer, and metadata.
### Collapsing runs
Replacing repeated whitespace or repeated formatting characters with a simpler canonical version, such as turning many spaces into one space.
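A minimal sketch of one common convention (collapsing every whitespace run into a single space):

```python
import re

text = "lots   of \t spaces\n\nhere"
collapsed = re.sub(r"\s+", " ", text)
print(collapsed)  # "lots of spaces here"
```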
### Contamination
Leakage of evaluation or benchmark content into the training data, which can make results look better than they should.
### Context window
The maximum sequence length the model can process in one forward pass.
### CUDA
NVIDIA’s GPU compute platform, commonly used for local or cloud model training and inference.
### DDP
Distributed Data Parallel. A common strategy for synchronizing model training across multiple processes or GPUs.
### Decode
The autoregressive phase where the model generates one new token at a time.
### Decoder block
The repeated building block in a decoder-only Transformer, usually containing masked self-attention, normalization, residual connections, and a feed-forward network.
### Deduplication
Removing exact or near-duplicate examples from a dataset so the model does not overlearn repeated content.
### DPO
Direct Preference Optimization. A preference-learning method that avoids the full classical RLHF pipeline while still optimizing on chosen-versus-rejected comparisons.
### Embedding
A learned vector representation associated with a token or some other discrete feature.
### Feed-forward network (FFN)
The per-token nonlinear transformation inside a Transformer block, usually applied after attention.
### FLOP / FLOPS
`FLOP` is one floating-point operation. `FLOPS` is the rate of floating-point operations per second.
### Forward pass
The stage of training or inference where inputs move through the model to produce activations, logits, and sometimes loss.
### Gradient accumulation
Accumulating gradients over several micro-batches before taking one optimizer step. This increases the effective batch size without requiring the full batch to fit in GPU memory at once.
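A minimal PyTorch sketch with a toy model; the accumulation count of 4 is arbitrary:

```python
import torch

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
accum_steps = 4  # effective batch size = accum_steps * micro-batch size

optimizer.zero_grad()
for step in range(1, 13):
    inputs, targets = torch.randn(2, 8), torch.randn(2, 1)  # one micro-batch
    loss = loss_fn(model(inputs), targets)
    (loss / accum_steps).backward()  # scale so the summed gradient matches one large batch
    if step % accum_steps == 0:
        optimizer.step()             # one optimizer step per accum_steps micro-batches
        optimizer.zero_grad()
```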
### Hyperparameter
A training or system setting chosen by the operator rather than learned by the model.
### Hypothesis
A specific claim or question that an experiment is designed to test.
### KV cache
A runtime optimization that stores keys and values from previous decoding steps so the model does not recompute the entire prefix each time.
### Latency
The time it takes one request or generation step to complete.
### Layer normalization
A normalization layer used to stabilize optimization in deep Transformers.
### Logits
Unnormalized scores over the vocabulary before softmax.
### Loss
A scalar objective minimized during training. For language modeling this is typically cross-entropy on next-token prediction.
### Matrix
A rectangular grid of numbers. Embedding tables, weight matrices, and attention projections are all matrices.
### MFU
Model FLOPs utilization. A rough measure of how much of the hardware's theoretical compute you actually achieve during a real training run.
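A rough back-of-the-envelope sketch; the numbers are invented, and the `6 * params * tokens` training-FLOPs estimate is a common approximation rather than an exact count:

```python
params = 1.5e9            # model parameters (made-up example)
tokens_per_sec = 20_000   # observed training throughput
peak_flops = 312e12       # one accelerator's advertised BF16 peak (example value)

achieved_flops = 6 * params * tokens_per_sec  # approximate training FLOPs per second
mfu = achieved_flops / peak_flops
print(f"MFU ~ {mfu:.1%}")  # ~ 57.7%
```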
### MPS
Apple’s Metal Performance Shaders backend used for local PyTorch acceleration on Apple Silicon.
### NFC
Unicode Normalization Form C. A standard way to combine characters into a composed canonical form when possible. For example, a letter plus accent mark can be collapsed into one combined Unicode character.
### NFKC
Unicode Normalization Form KC. A more aggressive normalization form that also folds some compatibility variants into a common form.
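A minimal sketch with Python's standard `unicodedata` module, covering both forms:

```python
import unicodedata

decomposed = "e\u0301"                            # "e" + combining acute accent
print(unicodedata.normalize("NFC", decomposed))   # one composed "é"
print(unicodedata.normalize("NFKC", "\ufb01le"))  # ligature "ﬁ" folded, prints "file"
```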
### Observability
The ability to understand what a running training job or serving system is doing by using logs, metrics, traces, and dashboards.
### One-hot vector
A vector that is zero everywhere except at one index, where it is one. It is a standard way to represent a discrete choice mathematically.
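A small sketch for index 2 in a vocabulary of size 5:

```python
index, size = 2, 5
one_hot = [1 if i == index else 0 for i in range(size)]
print(one_hot)  # [0, 0, 1, 0, 0]
```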
### OpenAI-compatible API
An API surface that follows the OpenAI-style request and response shape so many clients and SDKs can work with it.
### Out-of-vocabulary (OOV)
The failure mode where a word or symbol is not directly available in a tokenizer's vocabulary.
### Parameter
A learned number inside the model. Embedding tables, projection matrices, and other weights are all made of parameters.
### Perplexity
A traditional language-model metric derived from average negative log-likelihood.
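One standard formulation over $N$ tokens, assuming a natural-log cross-entropy:

$$\text{PPL} = \exp\!\left(\frac{1}{N}\sum_{i=1}^{N} -\log p(x_i \mid x_{<i})\right)$$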
### Pipeline parallelism
Splitting a model into stages across devices and sending micro-batches through those stages like a pipeline.
### Post-training
The stages after base pretraining that reshape a model into a more specialized assistant or product behavior.
### Prefill
The first inference phase where the prompt is processed and the initial KV cache is built.
### Protocol
The exact setup used to run an evaluation or experiment, including prompts, splits, scoring rules, and reporting choices.
### Quantization
Reducing weight or activation precision to lower memory use and sometimes increase inference efficiency.
### Regression test
A repeated check used to make sure a new run or checkpoint did not get worse on behaviors that previously worked.
### Repeatability
The ability of the same team to rerun the same setup and obtain materially similar results.
### Residual connection
An identity shortcut that lets a layer refine a representation instead of replacing it completely.
### Residual stream
The running hidden representation passed through the Transformer by residual connections, often central in mechanistic interpretability discussions.
### Reward model
A model trained on preference data to estimate which outputs humans are likely to prefer.
### RLHF
Reinforcement Learning from Human Feedback. A post-training pipeline that uses demonstrations and preference signals to improve assistant behavior.
### Robustness
The degree to which a result or claim still holds under small changes such as seed, hardware, or minor configuration differences.
### Self-attention
A mechanism that lets each token mix information from other tokens in the same sequence.
### SentencePiece
A tokenizer framework that treats tokenization as a trainable model and often uses explicit whitespace markers.
### SFT
Supervised fine-tuning. Training on curated prompt-response pairs so a base model behaves more like an instruction-following assistant.
### Softmax
A function that converts logits into a probability distribution.
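For a logit vector $z$:

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$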
### Special tokens
Control tokens such as `<|user|>`, `<|assistant|>`, or padding markers that tell the model about role boundaries or sequence structure.
### Speculative decoding
An inference acceleration strategy where a smaller or auxiliary model proposes tokens and the larger target model verifies them.
### Streaming
Returning model output incrementally as it is generated instead of waiting for the full response to finish.
### Stripping
Removing unwanted leading or trailing whitespace from text before further processing.
### Temperature
A decoding control that sharpens or flattens the output distribution before sampling.
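A minimal sketch; dividing logits by the temperature $T$ before softmax is the usual formulation:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1])
for T in (0.5, 1.0, 2.0):
    probs = torch.softmax(logits / T, dim=-1)
    print(T, probs)  # low T sharpens the distribution, high T flattens it
```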
### Tensor parallelism
Splitting large tensor operations across multiple devices so one layer is computed jointly by several GPUs.
### Throughput
The amount of useful work completed per unit time, often measured in requests per second, tokens per second, or optimizer steps per second.
### Token
A discrete symbol ID consumed by the model. Tokens are produced by a tokenizer and are not the same thing as words.
### Token ID
The integer index assigned to one token in the vocabulary. The model reads token IDs as inputs before the embedding layer turns them into vectors.
### Tokenizer
The system that converts raw text into token IDs and can decode token IDs back into text.
### Top-k
A decoding strategy that limits sampling to the `k` most likely next tokens.
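A minimal sketch with an arbitrary `k`:

```python
import torch

probs = torch.softmax(torch.tensor([2.0, 1.5, 0.3, -1.0, -2.0]), dim=-1)
k = 2
topk_probs, topk_ids = torch.topk(probs, k)  # keep the k most likely tokens
topk_probs = topk_probs / topk_probs.sum()   # renormalize before sampling
next_id = topk_ids[torch.multinomial(topk_probs, 1)]
```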
### Top-p
A decoding strategy that samples from the smallest set of tokens whose cumulative probability exceeds $p$.
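A minimal nucleus-filtering sketch with an arbitrary `p`:

```python
import torch

probs = torch.softmax(torch.tensor([2.0, 1.5, 0.3, -1.0, -2.0]), dim=-1)
p = 0.9
sorted_probs, sorted_ids = torch.sort(probs, descending=True)
# Keep tokens until the cumulative mass first reaches p (the crossing token stays).
keep = torch.cumsum(sorted_probs, dim=-1) - sorted_probs < p
nucleus_probs = sorted_probs[keep] / sorted_probs[keep].sum()
next_id = sorted_ids[keep][torch.multinomial(nucleus_probs, 1)]
```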
### Unigram tokenization
A tokenization method that chooses a high-probability segmentation from a learned set of candidate pieces instead of repeatedly merging pairs like BPE.
### Validation loss
A loss measured on held-out data that is not used for parameter updates. It is one of the simplest ways to check whether the model is generalizing rather than only fitting the training set.
### Vector
An ordered list of numbers treated as one geometric object. In Transformers, vectors are the basic representation used for tokens, positions, hidden states, and outputs.
### Vocabulary
The finite set of token IDs the model can read and predict.
### Vocabulary size
The number of distinct token entries in the tokenizer vocabulary. A vocabulary size of `50,000` means there are `50,000` distinct token IDs the model can read and emit.
### Warmup
A short initial training phase where the learning rate ramps up gradually instead of jumping straight to its maximum value.
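A linear-warmup sketch; real schedules usually decay the learning rate after the warmup phase rather than holding it flat:

```python
def lr_at(step, peak_lr=3e-4, warmup_steps=1000):
    # Ramp linearly from near zero to peak_lr over warmup_steps, then hold.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

print(lr_at(0), lr_at(500), lr_at(2000))
```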
### WordPiece
A subword tokenization family used in models like BERT, with its own merge objective and greedy longest-match segmentation behavior.
### ZeRO
A family of techniques that shards optimizer state, gradients, and sometimes parameters across workers to reduce memory overhead.