> [!info] Course code
> Use this note after you already understand the Transformer internals and the `picollm` code paths.

## What This Concept Is

Most of the course asks what the model does. This note asks what seems to be happening inside while it does it.

Interpretability and mechanistic analysis try to move from output-level description to internal explanation. That makes this note less about product behavior and more about opening the box.

## Foundation Terms You Need First

An **activation** is an internal value produced while the model processes a sequence. The **[[Glossary#Residual stream|residual stream]]** is the running representation passed from block to block. A **circuit** is a recurring internal computation pattern made of interacting components. **Mechanistic analysis** is the attempt to trace those internal patterns in a concrete way.[^1]

So if earlier notes asked "what answer did the model give?", this note asks "what internal pathway helped produce that answer?"

## Probing representations

One common approach is probing: train lightweight classifiers or regressors on hidden states to see what information appears recoverable from them. Probing can reveal whether representations encode syntax, parts of speech, factual signals, or task-relevant structure.

But probing does not automatically prove that the model "uses" that information causally. It only shows that the information is present or decodable.[^2] This distinction matters because decodability is weaker than mechanism.

## Attention pattern interpretation limits

Attention visualizations are tempting because they are easy to plot and intuitively appealing. But "this token attended strongly to that token" is not the same thing as a full explanation of model behavior. Attention is one part of a larger residual-stream computation.

Overinterpreting attention maps is therefore a known failure mode when explaining model behavior.[^3] Treat attention heatmaps as descriptive cues, not as sufficient explanations.

## Feature circuits and mechanistic interpretability

Mechanistic interpretability goes deeper by trying to identify circuits, features, and pathways that implement specific computations. The Transformer Circuits line of work is influential here because it treats the model as a composed algorithmic system that may be partially reverse-engineered rather than only probed statistically.[^4]

That shift from "what is encoded?" to "what computation is implemented?" is the main conceptual jump into mechanistic work.

## Residual stream and hidden states

Operationally, you should understand that hidden states and the residual stream are where information is carried forward across the network. They are not just intermediate tensors for debugging. They are the medium in which features, computations, and later predictions are built.

## Causal intervention methods

At an advanced level, interpretation should move beyond passive inspection. This is where [[Glossary#Ablation|ablation]], activation patching, and causal intervention ideas become important. The question changes from:

- what can I visualize?

to:

- what changes if I perturb or replace a specific component?

That is much closer to the standard of evidence expected in serious interpretability research.
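To make the intervention idea concrete, here is a minimal activation-patching sketch. It assumes a toy PyTorch model whose transformer blocks are exposed as `model.blocks`, each returning a single tensor, and that the clean and corrupted prompts have the same token length; these names are illustrative assumptions, not part of the `picollm` code.

```python
import torch

def activation_patch(model, clean_ids, corrupted_ids, layer_idx):
    """Run the corrupted prompt with one layer's activation replaced by the
    activation recorded on the clean prompt, and return all three outputs."""
    cache = {}

    def save_hook(module, inputs, output):
        cache["clean"] = output.detach()

    def patch_hook(module, inputs, output):
        return cache["clean"]  # returning a tensor from a forward hook overrides the block's output

    block = model.blocks[layer_idx]

    # 1. Clean run: record the activation to patch in later.
    handle = block.register_forward_hook(save_hook)
    with torch.no_grad():
        clean_logits = model(clean_ids)
    handle.remove()

    # 2. Corrupted run with the clean activation patched in at this layer.
    handle = block.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_logits = model(corrupted_ids)
    handle.remove()

    # 3. Baseline corrupted run with no intervention.
    with torch.no_grad():
        corrupted_logits = model(corrupted_ids)

    return clean_logits, corrupted_logits, patched_logits
```

If patching a single layer moves the corrupted output substantially toward the clean output (for example, measured as a logit difference on the answer token), that is evidence the patched activation carries behavior-relevant information rather than merely containing decodable signal.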
## Sparse features and dictionaries

A major modern direction is trying to express distributed hidden activity in terms of more interpretable feature dictionaries, including sparse autoencoder approaches. The appeal is straightforward: if the raw [[Glossary#Residual stream|residual stream]] is too entangled to read directly, perhaps a learned sparse basis reveals reusable human-meaningful features. You should know this direction exists even though the core course does not implement it; a minimal sketch appears at the end of this note.

## Why interpretability claims are hard

Interpretability is hard because models are high-dimensional, distributed systems. Many explanations are partial, local, or intervention-sensitive. This does not make the field useless. It means you should learn humility: interpretability can be informative, but strong claims require careful evidence.

If you want a concrete external tool to explore this area, Neuronpedia is worth knowing about. It is not part of the core course workflow, and you do not need it to understand the theory in this note. But it is a strong example of how the interpretability community organizes features, sparse autoencoders, probes, and shared inspection tools around real models.[^5]

## Scope of this course and where to go deeper

This course introduces the main interpretability questions and includes a lightweight inspection helper in `picollm/analysis/inspect_activations.py`, so you can at least see hidden-state and attention summaries attached to real prompts. That should be understood as an entry point, not as a complete mechanistic interpretability curriculum. A deeper follow-on lecture sequence could focus on probing design, activation patching, causal interventions, sparse autoencoders, logit lens methods, and circuit-level case studies.

If you want to go deeper after this note, the best next steps are:

- compare hidden-state summaries across prompts and checkpoints
- read the Transformer Circuits material carefully
- study probing methodology with attention to its limitations
- treat every interpretability claim as a hypothesis that still needs evidence

<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
<div><a class="internal-link" data-href="Advanced Data Engineering for LLMs" href="Advanced%20Data%20Engineering%20for%20LLMs">Advanced Data Engineering for LLMs</a></div>
</div>
</div>

## References

[^1]: Neel Nanda, [A Comprehensive Mechanistic Interpretability Explainer and Glossary](https://www.neelnanda.io/mechanistic-interpretability/glossary)
[^2]: Yonatan Belinkov, [Probing Classifiers: Promises, Shortcomings, and Advances](https://direct.mit.edu/coli/article/48/1/207/108847/Probing-Classifiers-Promises-Shortcomings-and)
[^3]: Sarthak Jain and Byron C. Wallace, [Attention is not Explanation](https://arxiv.org/abs/1902.10186)
[^4]: Anthropic, [A Mathematical Framework for Transformer Circuits](https://transformer-circuits.pub/2021/framework/index.html)
[^5]: Neuronpedia, [Introduction](https://docs.neuronpedia.org/)
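## Appendix: a minimal sparse autoencoder sketch

To make the sparse-dictionary idea from the "Sparse features and dictionaries" section concrete, here is a minimal sparse autoencoder over residual-stream activations. The layer sizes, loss coefficient, and toy training step are illustrative assumptions and are not part of the `picollm` course code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Learn an overcomplete, sparse feature dictionary for residual-stream vectors.
    d_model and d_dict are illustrative sizes, not values taken from the course."""

    def __init__(self, d_model: int = 512, d_dict: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        features = F.relu(self.encoder(x))       # non-negative feature activations
        reconstruction = self.decoder(features)  # map back into the residual stream
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error keeps the dictionary faithful to the activations;
    # the L1 penalty pushes most features toward zero, so each input activates few of them.
    return F.mse_loss(reconstruction, x) + l1_coeff * features.abs().mean()

# Toy training step on a batch of cached activations (random stand-in data here).
sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
activations = torch.randn(64, 512)
reconstruction, features = sae(activations)
loss = sae_loss(activations, reconstruction, features)
loss.backward()
optimizer.step()
```

After training on many cached activations, individual dictionary directions can be inspected, for example by listing the inputs that activate them most strongly, to see whether they correspond to human-meaningful features.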