> [!info] Course code
> Use these repo paths together with this note:
> - [picollm/accelerated/chat/cli.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/cli.py)
> - [picollm/accelerated/chat/web.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/chat/web.py)
> - [picollm/accelerated/report.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/report.py)
## What This Concept Is
When an LLM system fails, the visible symptom is often not the real problem. A weird answer might come from bad data, a broken prompt format, a regression in sampling, a checkpoint issue, or a serving bug. This note is about learning to separate the symptom from the cause.
In other words, it is a note about disciplined debugging instead of guessing.
## Foundation Terms You Need First
A **failure mode** is a repeatable way the model or system breaks down. A **regression** is when something that used to work becomes worse in a newer run. An **observable symptom** is the visible sign, such as NaNs, latency spikes, or broken replies. The **root cause** is the underlying reason those symptoms appear.
That distinction matters because good debugging starts by narrowing the layer where the problem lives: data, training, inference, runtime, or product plumbing.
## Gibberish is usually a pipeline symptom, not a mystical property
When a model answers with unstable or nonsense text, the cause is often more concrete than you may first assume. Sometimes the [[Glossary#Checkpoint|checkpoint]] is simply too weak or undertrained. Sometimes the serving path is accidentally using the base checkpoint instead of the chat post-trained checkpoint. Sometimes the inference prompt format does not match the formatting used during [[Glossary#SFT|SFT]]. Sometimes decoding settings make the model appear worse than it really is. And sometimes a [[Glossary#Tokenizer|tokenizer]] mismatch or broken special-token setup causes deeper corruption.
The important lesson is that gibberish should trigger diagnosis, not panic. Ask first: which checkpoint is being served? Which tokenizer is loaded? Which prompt wrapper is being used? Are we calling the model as a base LM or as a chat-formatted assistant? These questions often solve the mystery faster than staring at the output itself.[^2]
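A minimal sketch of that checklist, assuming a Hugging Face-style checkpoint directory and tokenizer; the paths and the `PROMPT_WRAPPER` string below are hypothetical placeholders, not the repo's actual configuration.

```python
# Sketch: first questions to answer when output turns to gibberish.
from transformers import AutoTokenizer

CHECKPOINT_DIR = "out/chat_sft"                 # which checkpoint is actually served?
PROMPT_WRAPPER = "<|user|>{msg}<|assistant|>"   # does this match the SFT formatting?

tok = AutoTokenizer.from_pretrained(CHECKPOINT_DIR)

# 1. Which tokenizer is loaded, and are its special tokens intact?
print("special tokens:", tok.special_tokens_map)

# 2. Does the prompt wrapper round-trip through the tokenizer cleanly?
wrapped = PROMPT_WRAPPER.format(msg="hello")
ids = tok.encode(wrapped)
print("round-trip:", tok.decode(ids))

# 3. Are we calling a chat-formatted assistant or a raw base LM?
if getattr(tok, "chat_template", None) is not None:
    print(tok.apply_chat_template(
        [{"role": "user", "content": "hello"}],
        tokenize=False,
        add_generation_prompt=True,
    ))
else:
    print("no chat template attached -- model is being called as a base LM")
```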
## Repetition loops and low-quality completions
Repetition is one of the most common signs that something is off in the training or inference setup. It can come from a weak checkpoint, but it can also come from poor decoding settings, small or repetitive training data, or unstable chat formatting. The key lesson is that repetitive output is not a single bug. It is a symptom with multiple possible causes.
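One cheap way to separate "weak checkpoint" from "bad decoding settings" is to generate the same prompt under two configurations. This is a sketch assuming a Hugging Face-style checkpoint; the path and generation knobs are illustrative, not tuned values from the repo.

```python
# Sketch: if sampling with a mild repetition penalty fixes the loop,
# the problem was decoding configuration, not the model itself.
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "out/chat_sft"  # hypothetical checkpoint path
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

prompt = "Explain what a tokenizer does."
inputs = tok(prompt, return_tensors="pt")

# Greedy decoding: loops here point more toward the checkpoint or the data.
greedy = model.generate(**inputs, max_new_tokens=80, do_sample=False)

# Sampling with temperature, top-p, and a small repetition penalty.
sampled = model.generate(
    **inputs,
    max_new_tokens=80,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    repetition_penalty=1.1,
)

for name, out in [("greedy", greedy), ("sampled", sampled)]:
    print(name, "->", tok.decode(out[0], skip_special_tokens=True))
```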
## Training instability
On the training side, instability often appears as exploding loss, `nan` values, sudden gradient spikes, or abrupt behavior changes after a scheduler transition. When this happens, the debugging questions are different. Is the learning rate too aggressive? Did precision or optimizer state become unstable? Is the batch configuration pushing the system too hard? Did a bad data example enter the stream?
The best debug habit is to isolate whether the problem is in optimization, data, or systems. That classification turns a vague "training is broken" complaint into a tractable investigation.
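A few guardrails inside the training step make that classification much easier, because they catch the problem at the moment it happens. This is a minimal sketch, assuming a Hugging Face-style model whose forward pass returns an object with a `.loss` attribute; adapt the call to whatever the real loop uses.

```python
# Sketch: cheap instability guardrails inside one training step.
import math
import torch

def training_step(model, optimizer, batch, step, max_grad_norm=1.0):
    loss = model(**batch).loss
    if not math.isfinite(loss.item()):
        # Non-finite loss: stop and inspect this batch, precision, and optimizer state.
        raise RuntimeError(f"non-finite loss {loss.item()} at step {step}")

    loss.backward()

    # Clipping returns the pre-clip norm, so gradient spikes are visible
    # here long before they show up in the loss curve.
    grad_norm = float(torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm))
    if grad_norm > 10 * max_grad_norm:
        print(f"step {step}: gradient norm {grad_norm:.1f}, check LR, data, or precision")

    optimizer.step()
    optimizer.zero_grad()
    return loss.item(), grad_norm
```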
## Production bugs are often boundary bugs
In serving systems, failures often happen at the boundaries between components. The model may be fine, but the API wrapper could be wrong. The CLI may be fine, but the web app could be calling the wrong endpoint. The server could be healthy, but the frontend may be rendering stale state. The key insight is that a chatbot system is not one object. It is a chain: checkpoint, tokenizer, prompt wrapper, generation settings, server, client.
The best debugging pattern is therefore to test the chain one boundary at a time.
A small [[Glossary#Regression test|regression test]] suite helps a lot here. If you keep a few stable prompts, expected outputs, latency budgets, and API checks around, you can catch when a new change quietly breaks behavior that used to work.
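A minimal sketch of such a suite, assuming a local chat server exposing a `/chat` endpoint that returns JSON with a `reply` field; the URL, schema, and latency budget are assumptions for illustration, not the repo's actual API.

```python
# Sketch: a tiny regression suite that checks one boundary per assertion.
import time
import requests

STABLE_PROMPTS = {
    "identity": "What is your name?",
    "arithmetic": "What is 2 + 2?",
}

def test_server_answers_and_stays_fast():
    for name, prompt in STABLE_PROMPTS.items():
        start = time.monotonic()
        resp = requests.post(
            "http://localhost:8000/chat",   # hypothetical endpoint
            json={"message": prompt},
            timeout=30,
        )
        latency = time.monotonic() - start

        # Boundary checks: HTTP layer, response schema, and latency budget.
        assert resp.status_code == 200, f"{name}: bad status {resp.status_code}"
        assert resp.json().get("reply"), f"{name}: empty reply"
        assert latency < 5.0, f"{name}: latency regression ({latency:.1f}s)"
```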
## What you should remember
Failures become much less intimidating once they are categorized. The right question is rarely "why is AI weird?" It is usually something sharper, such as "is this a checkpoint-quality problem, a prompt-format problem, a decoding problem, or a systems-integration problem?" That framing is what turns debugging into a teachable skill.
<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
<div><a class="internal-link" data-href="SFT Flow" href="SFT%20Flow">SFT Flow</a></div>
</div>
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
<div><a class="internal-link" data-href="Formal Evaluation and Benchmarking" href="Formal%20Evaluation%20and%20Benchmarking">Formal Evaluation and Benchmarking</a></div>
</div>
</div>
## Further reading
- PyTorch, "Reproducibility," 2025. https://docs.pytorch.org/docs/stable/notes/randomness.html
- Hugging Face, "Caching," 2025. https://huggingface.co/docs/transformers/main/cache_explanation
- vLLM, "Documentation," 2025. https://docs.vllm.ai/
## References
[^1]: Sebastian Raschka, [LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch)
[^2]: Andrej Karpathy, [nanochat](https://github.com/karpathy/nanochat)