> [!info] Course code
> Use the companion repository for runnable notebooks, figures, and implementation references for this lecture:
> - [notebooks/quantization/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/quantization/lecture_walkthrough.ipynb)
> - [picollm/accelerated/chat/web.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/web.py)
> - End-to-end run doc: [picollm/accelerated/README.md](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/README.md)
> - [picollm/accelerated/fp8.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/fp8.py)
> - [scripts/export_picollm_to_gguf.py](https://github.com/Montekkundan/llm/blob/main/scripts/export_picollm_to_gguf.py)
## What This Concept Is
A model can be capable enough to be worth using and still be too large or expensive to run comfortably on real hardware. Quantization is one of the main ways we push back on that problem by storing or computing with fewer bits.
This note is about what you gain, what you risk, and why smaller numerical formats matter so much in practice.
## Foundation Terms You Need First
**Precision** tells you how many bits are used to represent numbers. Lower precision reduces the **memory footprint**, but it introduces **approximation error** because the numbers are represented less exactly. The interesting question is the **runtime trade-off**: how much size or speed improvement you get before quality drops too far.
So as you read, keep one tension in mind: quantization is valuable exactly because it is not free. It is a controlled compromise.
```mermaid
flowchart TD
A["fp32: highest memory, reference accuracy"] --> B["bf16 / fp16: lower memory, small approximation cost"]
B --> C["int8: larger memory savings, more approximation"]
C --> D["4-bit: smallest footprint, highest quality risk"]
```
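To put rough numbers on this ladder, here is a back-of-the-envelope sketch for a hypothetical 7B-parameter model. The parameter count and per-format bit widths are assumptions for illustration; real deployments also spend memory on activations, the KV cache, and quantization metadata.

```python
# Back-of-the-envelope weight memory at each rung of the ladder.
# PARAMS is an illustrative assumption (a hypothetical 7B-parameter model),
# not a measurement of any specific checkpoint.
PARAMS = 7e9

bits_per_weight = {
    "fp32": 32,
    "bf16 / fp16": 16,
    "int8": 8,
    "4-bit": 4,
}

for fmt, bits in bits_per_weight.items():
    gb = PARAMS * bits / 8 / 1e9  # bits -> bytes -> gigabytes (decimal)
    print(f"{fmt:>12}: ~{gb:5.1f} GB of weights")
```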
## How this lecture maps to picoLLM
This lecture teaches quantization in two layers:
- the notebook explains the general numerical tradeoff
- picoLLM shows where precision choices become real operator decisions
In the serious stack:
- `fp8.py` covers the aggressive Hopper-class training-side precision path
- the runtime and deployment notes explain the serving-side precision tradeoffs
- the GGUF export path shows how lower-footprint release artifacts fit into the broader product story
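As a taste of what an fp8 path is doing numerically, here is a toy round-trip sketch, assuming a PyTorch build recent enough to ship the float8 dtypes (2.1+). It is illustrative only and is not what `picollm/accelerated/fp8.py` implements; production recipes keep per-tensor (or finer) scales updated during training and run the matmuls themselves in fp8.

```python
# Toy fp8 (e4m3) round-trip; requires a PyTorch build with float8 dtypes (2.1+).
# Illustrative sketch only, not what picollm/accelerated/fp8.py implements.
import torch

torch.manual_seed(0)
x = torch.randn(1024) * 5.0  # synthetic activations

# e4m3 has a small dynamic range (max finite value ~448), so fp8 recipes carry
# an explicit scale to keep values inside the representable range.
fp8 = torch.float8_e4m3fn
scale = torch.finfo(fp8).max / x.abs().max()

x_fp8 = (x * scale).to(fp8)               # scaled cast down to 8 bits
x_back = x_fp8.to(torch.float32) / scale  # cast back up and undo the scale
print("mean abs round-trip error:", (x - x_back).abs().mean().item())
```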
## Core idea
Quantization stores or computes with lower-precision numbers than full `float32`. The goal is to reduce:
- memory footprint
- bandwidth pressure
- sometimes inference cost and [[Glossary#Latency|latency]]
The tradeoff is approximation error.
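A crude sketch of why bandwidth pressure connects directly to latency: at batch size 1, each decoded token has to stream roughly every weight through the memory system once, so the bytes per weight largely set the throughput ceiling. All numbers below are illustrative assumptions, and the bound ignores compute, activations, and KV-cache traffic.

```python
# Rough decode-throughput ceiling: tokens/s <= memory bandwidth / model bytes.
# Both PARAMS and BANDWIDTH_GB_S are illustrative assumptions.
PARAMS = 7e9           # hypothetical 7B-parameter model
BANDWIDTH_GB_S = 100   # hypothetical laptop-class memory bandwidth

for fmt, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    model_gb = PARAMS * bits / 8 / 1e9
    tok_per_s = BANDWIDTH_GB_S / model_gb  # crude upper bound at batch size 1
    print(f"{fmt:>6}: ~{model_gb:4.1f} GB weights -> ~{tok_per_s:4.1f} tok/s ceiling")
```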
## Why this appears late in the course
You should first understand:
- tokenization
- embeddings
- attention
- decoding
- training
Only after that foundation does quantization make sense as an engineering tradeoff rather than a mysterious trick.
## Conceptual ladder
A clean way to read the quantization ladder is:
1. `float32` as the reference
2. `float16` / `bfloat16` as lower-precision floating-point
3. `int8` and lower-bit quantization as more aggressive compression
4. the general rule that lower precision always buys efficiency by accepting approximation error
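To put numbers on rungs 1 and 2, here is a minimal round-trip sketch: cast an fp32 reference tensor down to fp16 and bf16 and back, and measure what is lost. The tensor is synthetic, so the exact error values are illustrative; the point is only that the error is small but nonzero.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4096, dtype=torch.float32)  # fp32 reference values

# Round-trip each lower-precision floating-point format back to fp32
# and measure the mean absolute error against the reference.
for name, dtype in [("fp16", torch.float16), ("bf16", torch.bfloat16)]:
    err = (x - x.to(dtype).to(torch.float32)).abs().mean().item()
    print(f"{name}: mean abs round-trip error ~{err:.2e}")
```

bf16 keeps fp32's exponent range with fewer mantissa bits, so on values of this scale its round-trip error is typically larger than fp16's, while being much more robust to overflow.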
## What the companion notebook shows
The notebook demonstrates:
- memory savings for different dtypes
- toy symmetric quantization on small tensors (a version is sketched below)
- reconstruction error after dequantization
- why smaller storage is valuable in local serving
That is enough for you to see the tradeoff without drowning in kernel details.
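Here is a minimal sketch of the kind of toy symmetric int8 quantization the notebook walks through; the tensor, shapes, and names are illustrative rather than the notebook's actual cells.

```python
import torch

torch.manual_seed(0)
w = torch.randn(256, 256)  # toy weight matrix standing in for a real layer

# Symmetric per-tensor int8 quantization: a single scale maps the largest
# absolute value onto the signed 8-bit range [-127, 127].
scale = w.abs().max() / 127.0
q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)

# Dequantize and measure reconstruction error and storage savings.
w_hat = q.to(torch.float32) * scale
print("mean abs error:", (w - w_hat).abs().mean().item())
print("max abs error: ", (w - w_hat).abs().max().item())
print("bytes fp32 -> int8:", w.numel() * 4, "->", q.numel() * 1)
```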
> [!example] Notebook follow-up
> - [`Memory by dtype`](https://github.com/Montekkundan/llm/blob/main/notebooks/quantization/lecture_walkthrough.ipynb#memory-by-dtype)
> - [`Toy int8 quantization`](https://github.com/Montekkundan/llm/blob/main/notebooks/quantization/lecture_walkthrough.ipynb#toy-int8-quantization)
> Use these notebook sections to make the precision-versus-reconstruction tradeoff concrete before the systems discussion.
## What to emphasize in lecture
Quantization is not magic compression with zero cost.
It is a numerical approximation decision.
The right question is not "is lower precision good?" The right question is:
> what accuracy [[Glossary#Loss|loss]] is acceptable for the memory and serving gain we get?
## Product relevance
Quantization is what often makes the difference between:
- "this model only fits on a strong GPU"
- "this model can run on a laptop or smaller workstation"
That makes it directly relevant to the real chatbot workflow.
## What not to over-teach first
Do not start with:
- kernel fusion
- custom GPU backends
- highly optimized FP8 pipelines
Those are important systems topics, but they are not the first thing you need.
For this course, the practical bridge is:
- understand quantization as a storage/compute tradeoff
- then inspect how the accelerated stack handles precision choices in [picollm/accelerated/fp8.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/fp8.py)
- then connect that to serving choices in [picollm/accelerated/chat/web.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/web.py)
## Key takeaway
Quantization turns "this model exists" into "this model is feasible to run here," but always by trading numerical fidelity for efficiency.
> [!example] Notebook walkthroughs in this lecture
>
> Use this order:
>
> 1. [notebooks/quantization/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/quantization/lecture_walkthrough.ipynb)
> Use these sections as you read:
> - [`Memory by dtype`](https://github.com/Montekkundan/llm/blob/main/notebooks/quantization/lecture_walkthrough.ipynb#memory-by-dtype)
> - [`Toy int8 quantization`](https://github.com/Montekkundan/llm/blob/main/notebooks/quantization/lecture_walkthrough.ipynb#toy-int8-quantization)
> - [`Platform-aware serving choices`](https://github.com/Montekkundan/llm/blob/main/notebooks/quantization/lecture_walkthrough.ipynb#platform-aware-serving-choices)
> - [`Where quantization appears in the serious model path`](https://github.com/Montekkundan/llm/blob/main/notebooks/quantization/lecture_walkthrough.ipynb#where-quantization-appears-in-the-serious-model-path)
> 2. [picollm/accelerated/fp8.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/fp8.py)
> 3. [picollm/accelerated/README.md](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/README.md)
<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
<div><a class="internal-link" data-href="Inference Runtime and KV Cache" href="Inference%20Runtime%20and%20KV%20Cache">Inference Runtime and KV Cache</a></div>
</div>
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
<div><a class="internal-link" data-href="Serving, Latency, and Observability" href="Serving%2C%20Latency%2C%20and%20Observability">Serving, Latency, and Observability</a></div>
</div>
</div>
## Further reading
- Tim Dettmers et al., "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale," 2022. https://arxiv.org/abs/2208.07339
- Elias Frantar et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers," 2023. https://arxiv.org/abs/2210.17323
- Ji Lin et al., "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration," 2024. https://arxiv.org/abs/2306.00978
- Hugging Face, "Quantization," 2025. https://huggingface.co/docs/transformers/en/main_classes/quantization