> [!info] Course code > Use the companion repository for runnable notebooks, figures, and implementation references for this lecture: > - Theory notebook: [notebooks/feed_forward_network/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/feed_forward_network/lecture_walkthrough.ipynb) > - Serious model anchor: [picollm/accelerated/gpt.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/gpt.py) ## What This Concept Is Attention lets tokens exchange information, but it is not the whole block. After tokens have looked at one another, each token still needs its own local computation step. The feed-forward network is that step: a per-token transformation that expands, reshapes, and compresses the representation. If attention is the part that says "look around," the feed-forward network is the part that says "now do something nonlinear with what you know." ## Foundation Terms You Need First Focus on one token position at a time. The token already has a current representation, which is a vector. A **linear layer** changes that vector by multiplying it with learned weights. A **nonlinearity** makes the transformation more expressive than one plain matrix multiply. The **hidden width** is the larger intermediate space the token moves through before coming back to the model width. What matters here is that the FFN does not mix different sequence positions directly. It takes each token's current representation and transforms it on its own, using the same learned function at every position. <video src="https://assets.montek.dev/lectures/media/llm/concepts/Feed-Forward%20Network/01_attention_vs_ffn_roles.mp4" controls></video> ## How this lecture maps to picoLLM You should first learn the canonical FFN as $d_{\text{model}} \to d_{\text{ff}} \to d_{\text{model}}$. Then inspect `picollm/accelerated/gpt.py` and see that the serious stack uses a more opinionated MLP choice. That is a key course transition: - this note explains the standard FFN role - picoLLM shows how a real system specializes that role Keep the external-reference map simple: - `rasbt` is the concept-first external reference - `picollm` is the course’s real implementation path - `nanochat` is the systems-oriented comparison reference This note covers three connected ideas: - why Transformers still need the FFN even though attention already contains a [[Glossary#Softmax|softmax]] and already computes contextual representations[^2] - the standard FFN, parameter counts, [[Glossary#FLOP / FLOPS|FLOP]] scaling, activation choices, and gated variants like GEGLU and SwiGLU[^3] - the systems and scaling questions around FFNs: why they often dominate parameters, how norm placement affects training, why activation memory matters, and how MoE turns them into sparse capacity engines[^4] ## Why the FFN exists when we already have attention ### Attention mixes tokens, FFNs mix channels The cleanest decomposition of a Transformer block is: - attention handles **token mixing** - the FFN handles **channel mixing plus nonlinearity** Attention tells each token representation what other positions to read from. But after that contextual mixing happens, the model still needs a powerful nonlinear map that reshapes the resulting feature vector. That is the FFN’s job. The original Transformer explicitly calls this sublayer “position-wise” because the same network is applied independently and identically to every position.[^1] The lecture line worth repeating is: > Attention decides where to read. 
The FFN decides what to compute once you have read it. > [!question] Quick check > If attention already includes a softmax, why is the FFN still needed? >> [!answer] the softmax changes token-to-token weighting, but the FFN adds per-token nonlinear feature construction after contextual mixing. ### The FFN is not a minor accessory One of the most important practical corrections to the phrase “Attention Is All You Need” is that, in modern Transformers, the FFN often dominates both parameter count and a large fraction of the compute budget.[^2] With the common choice $d_{\text{ff}} = 4 d_{\text{model}}$, the FFN contributes roughly: $ 2 d_{\text{model}} d_{\text{ff}} = 8 d_{\text{model}}^2 $ parameters, ignoring biases. That is often substantially larger than the attention projection matrices. So the FFN is not just a helper block. It is one of the main places where model capacity lives. ## The canonical FFN formulation ### Standard two-layer FFN For one token representation $x \in \mathbb{R}^{d_{\text{model}}}$, the classic Transformer FFN is: $ \operatorname{FFN}(x) = W_2 \phi(W_1 x + b_1) + b_2 $ where: - $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ - $W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ - $\phi$ is the activation function In the original Transformer, $\phi$ is ReLU and $d_{\text{ff}} = 2048$ when $d_{\text{model}} = 512$.[^1] For a batch of sequences, the same FFN is applied independently at every position. ### Equivalent $1 \times 1$ convolution view The original paper notes that the position-wise FFN is equivalent to two convolutions with kernel size $1$.[^1] That helps you see the FFN as: - local across positions - dense across channels It does not create new token-token interactions. It transforms feature channels for each token separately. ### Parameter and compute accounting For the dense FFN: - parameters are approximately $2 d_{\text{model}} d_{\text{ff}}$ - forward FLOPs per token are also dominated by those two matrix multiplies - activation memory includes storing the expanded hidden state of width $d_{\text{ff}}$ This is why $d_{\text{ff}}$ is such an important knob. It controls not only capacity, but also activation memory and runtime. > [!question] Quick check > Why does FFN parameter accounting matter as much as attention parameter accounting? >> [!answer] because the FFN often carries a large fraction of the parameters and compute, especially when $d_{\text{ff}}$ is much larger than $d_{\text{model}}$. > [!example] Code for this section > - Notebook: [notebooks/feed_forward_network/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/feed_forward_network/lecture_walkthrough.ipynb) > [!tip] TensorTonic follow-up > - [TensorTonic: Transformers Feed Forward](https://www.tensortonic.com/research/transformer/transformers-feed-forward) > Use it here to practice the width-expansion and per-token MLP contract from this section. ## Why d_ff is usually large ### Expansion buys nonlinear feature capacity Why do we expand from $d_{\text{model}}$ to a larger hidden width instead of keeping the FFN narrow? The short answer is that the expansion gives the model more room to build nonlinear intermediate features before projecting back into the [[Glossary#Residual stream|residual stream]]. 
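To make the expand-then-compress contract concrete before the design discussion continues, here is a minimal PyTorch sketch of the canonical two-layer FFN from the previous section, assuming the original paper's sizes $d_{\text{model}} = 512$ and $d_{\text{ff}} = 2048$. The class and variable names (`FeedForward`, `up`, `down`) are illustrative only and are not taken from `picollm/accelerated/gpt.py`.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: d_model -> d_ff -> d_model, applied to every token independently."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)    # W_1, b_1: expand to the hidden width
        self.act = nn.ReLU()                  # phi: the original Transformer uses ReLU
        self.down = nn.Linear(d_ff, d_model)  # W_2, b_2: project back to model width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); the same weights are reused at every position
        return self.down(self.act(self.up(x)))

ffn = FeedForward()
x = torch.randn(2, 16, 512)                   # batch of 2 sequences, 16 tokens each
print(ffn(x).shape)                           # torch.Size([2, 16, 512])
# Parameter count is ~2 * d_model * d_ff, plus biases:
print(sum(p.numel() for p in ffn.parameters()))  # 2_099_712 = 2*512*2048 + 2048 + 512
```

Note that nothing in the forward pass mixes the 16 positions: the same `up` and `down` weights act on each token's channel vector separately, which is exactly the position-wise contract described above.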
This is one of the simplest but most important design moves in deep networks:

- go to a wider hidden space
- apply a nonlinear transformation there
- compress back down

In Transformer FFNs, that width is often where a large fraction of the model’s per-token representational capacity lives.

### The FFN is also a parameter-allocation decision

Another useful lens is resource allocation:

- attention allocates parameters to cross-token mixing
- the FFN allocates parameters to local per-token computation

Changing $d_{\text{ff}}$ is therefore one way of deciding where the model’s capacity should live. This is one reason FFN design choices often have such large effects on quality and scaling.

## FFNs as memories and pattern detectors

### Mechanistic interpretation

Mechanistic interpretability work argues that FFN layers can often be usefully understood as key-value memories.[^2] The rough intuition is:

- some neurons or hidden directions detect specific patterns
- once activated, they write stereotyped downstream effects back into the residual stream

That gives a much sharper interpretation than “the FFN adds more parameters.” It suggests that the FFN stores reusable feature detectors and transformation templates.

### Why this interpretation matters

This interpretation helps you understand why a position-wise network can still matter deeply for language modeling even though it never directly attends to other tokens. It operates on already-contextualized vectors, and it can convert those context-rich vectors into new, sharper, more task-relevant features for the next attention layer.

## Activation functions

### ReLU in the original Transformer

The original Transformer uses ReLU in the FFN.[^1] That gives the canonical early formula:

$ \operatorname{FFN}(x) = W_2 \operatorname{ReLU}(W_1 x + b_1) + b_2 $

ReLU is simple and efficient, but it also introduces hard zero regions and can create dead activations.

### GELU in BERT and GPT-era models

Later language models shifted to smoother activations. BERT explicitly uses GELU instead of ReLU, following earlier GPT choices.[^5] GELU can be written as:

$ \operatorname{GELU}(x) = x \Phi(x) $

where $\Phi$ is the Gaussian CDF.[^6] In practice, many frameworks use a tanh approximation for speed and consistency. PyTorch documents both the exact and approximate forms.[^7]

### Swish and SiLU

Swish, and its closely related SiLU form, uses:

$ \operatorname{Swish}(x) = x \sigma(\beta x) $

with the common case $\beta = 1$ corresponding to SiLU.[^8] The key point is that these smoother activations often provide better gradient behavior and empirical quality than plain ReLU in large Transformer stacks.

> [!question] Quick check
> Why can GELU or Swish behave better than plain ReLU in large Transformer stacks?
>> [!answer] their smoother activation curves often produce gentler gradient behavior and less brittle gating than a hard zero cutoff.

> [!tip] TensorTonic follow-up
> - [TensorTonic: GPT-2 GELU](https://www.tensortonic.com/research/gpt2/gpt2-gelu)
> Use it here to practice the activation behavior that GPT-style FFNs rely on.
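As a quick comparison of the activations discussed above, the sketch below evaluates ReLU, exact GELU, the tanh-approximate GELU, and SiLU on the same inputs using standard PyTorch functional calls; the seven-point sample grid is arbitrary and only meant to expose the behavior near zero.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, steps=7)

relu = F.relu(x)
gelu_exact = F.gelu(x)                     # x * Phi(x), Gaussian CDF form
gelu_tanh = F.gelu(x, approximate="tanh")  # tanh approximation many frameworks use
silu = F.silu(x)                           # x * sigmoid(x), i.e. Swish with beta = 1

for name, y in [("relu", relu), ("gelu", gelu_exact), ("gelu~tanh", gelu_tanh), ("silu", silu)]:
    print(f"{name:>9}: {[round(v, 3) for v in y.tolist()]}")
# For small negative inputs the smooth activations pass small negative values through,
# while ReLU clips them to exactly zero (the "hard zero region" mentioned above).
```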
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Feed-Forward%20Network/02_activation_functions_and_gating.mp4" controls></video> ## Gated FFNs: GEGLU and SwiGLU ### The gating idea A gated FFN replaces the single hidden activation path with two paths: - a content path - a gate path Then the two are multiplied elementwise. A generic gated form looks like: $ \operatorname{FFN}_{\text{gated}}(x) = W_{\text{down}} \left(\phi(W_{\text{gate}} x) \ast (W_{\text{up}} x)\right) $ This creates multiplicative interactions and lets one stream modulate the other. ### Why gating matters The most compact intuition is: Gating lets the FFN decide not only what feature to construct, but also whether and how strongly that feature should pass through. That is one reason gated FFNs can behave like a soft form of conditional computation even without explicit discrete routing. > [!tip] TensorTonic follow-up > - [TensorTonic: GPT-2 FFN](https://www.tensortonic.com/research/gpt2/gpt2-ffn) > Work through it here to inspect the full GPT-style FFN path after the gating and activation discussion. ### GLU variants and the compute-matching trick GLU-style Transformer FFNs add a third weight matrix compared with the standard two-matrix FFN. The key practical trick is to reduce hidden width by a factor of about `2/3` so total compute and parameter count stay roughly matched.[^3] That is one of the best architecture-design lessons in the whole lecture: When you change functional form, compare at matched compute rather than raw width. GEGLU and SwiGLU are especially important because they became common in large modern LLMs. PaLM and LLaMA-style model families explicitly adopt SwiGLU-style FFNs in place of classic ReLU MLPs.[^3][^10] ### Gradient intuition for gating If the FFN output is: $ y = W_{\text{down}}(g \ast u) $ where `g` is the gate branch and `u` is the content branch, then gradients into each branch are modulated by the other. That means: - if the gate is near zero, content gradients get suppressed - if content is weak, the gate branch may also receive little useful signal This is a clean mathematical reason gating can behave like routing. ## Residuals, normalization, and training stability ### Post-LN versus Pre-LN The original Transformer uses the Post-LN pattern: $ \operatorname{LayerNorm}(x + \operatorname{Sublayer}(x)) $ for both attention and FFN sublayers.[^1] Later work shows that normalization placement strongly affects gradient behavior and training stability. Mean-field analyses argue that Post-LN leads to more fragile gradient scaling at initialization and helps explain the need for [[Glossary#Warmup|warmup]], while Pre-LN often trains more stably in deep stacks.[^11] ### Why this matters for FFNs This is not only an “attention issue.” FFNs sit inside the same residual system, so their gradients and activations are shaped by exactly this residual-plus-normalization structure. That means FFN behavior depends on: - activation function - hidden width - residual placement - normalization type all at once. ### RMSNorm and modern LLM stacks Many modern decoder-only LLMs move to pre-normalization and often use RMSNorm rather than LayerNorm. LLaMA is a canonical example, combining pre-normalization, RMSNorm, and SwiGLU-style MLP design.[^10] This is worth stating clearly because it helps you see that the FFN design is part of a larger architecture system, not a plug-in module chosen in isolation. 
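Before moving on to memory and systems behavior, here is a hedged sketch of a SwiGLU-style gated FFN together with the two-thirds width adjustment from the GLU-variants discussion. It follows the generic gated form above, not the specific MLP in `picollm/accelerated/gpt.py`, and the `2/3` scaling is the compute-matching heuristic described in the GLU-variants paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN: down( silu(gate(x)) * up(x) ), applied per token."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)  # W_gate
        self.up = nn.Linear(d_model, d_hidden, bias=False)    # W_up (content path)
        self.down = nn.Linear(d_hidden, d_model, bias=False)  # W_down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))   # elementwise gating

d_model = 512
dense_hidden = 4 * d_model                 # classic two-matrix FFN width
gated_hidden = int(2 / 3 * dense_hidden)   # shrink to ~2/3 so three matrices ≈ two in size

dense_params = 2 * d_model * dense_hidden
gated_params = 3 * d_model * gated_hidden
print(dense_params, gated_params)          # 2097152 vs 2096640: roughly matched
```

And a minimal sketch of the Post-LN versus Pre-LN wiring for the FFN sublayer, assuming a standard `nn.LayerNorm`; as noted above, modern stacks often substitute RMSNorm here.

```python
import torch
import torch.nn as nn

d_model = 512
norm = nn.LayerNorm(d_model)
ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))

x = torch.randn(2, 16, d_model)

# Post-LN (original Transformer): normalize after the residual add.
post_ln = norm(x + ffn(x))

# Pre-LN (most modern decoder stacks): normalize the sublayer input
# and leave the residual stream itself untouched.
pre_ln = x + ffn(norm(x))
```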
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Feed-Forward%20Network/03_norm_placement_and_memory.mp4" controls></video>

## Memory, compute, and systems behavior

### FFN activation memory is expensive

Even though attention gets most of the attention in long-context discussions, FFN activation memory is often a major part of the real training budget. The expanded hidden representation of width $d_{\text{ff}}$ must typically be stored for backward, and $d_{\text{ff}}$ is often much larger than $d_{\text{model}}$.

Reformer explicitly calls out FFN activation storage as a significant memory cost and motivates chunking as a way to reduce peak activation memory without changing the mathematical result.[^12]

### Checkpointing and custom backward

Large-model training reports also emphasize FFN activation management. LLaMA discusses activation checkpointing and memory-aware backward implementations as part of practical large-scale training.[^10]

This gives a strong systems lesson:

The FFN is not only a parameter sink. It is also an activation-memory sink.

### Fused kernels matter

FFNs are matmul-heavy and bandwidth-sensitive, so fused kernels for:

- bias plus activation
- gated activation patterns
- custom backward paths

can materially affect [[Glossary#Throughput|throughput]]. PyTorch, Megatron-Core, and inference systems like vLLM all expose or rely on fused activation patterns for exactly this reason.[^7][^13]

## MoE: when FFNs go sparse

### The basic idea

Mixture-of-experts (MoE) is best taught as a sparse FFN family. Instead of one dense FFN applied to every token, the model has multiple expert FFNs and routes each token to one or a few experts. That means the model can scale total parameter count without activating all parameters on every token.[^14]

This is one of the most important modern ways to think about FFN scaling.

### Switch and GShard

Primary-source milestones make the story concrete:

- Sparsely-gated MoE introduces the basic routing idea.[^14]
- GShard scales Transformer MoE layers to very large systems.[^15]
- Switch simplifies routing to top-1 and emphasizes training stability tricks such as selective precision in the router.[^4]
- GLaM shows very large MoE language models with strong efficiency framing.[^17]

### Capacity factor and dropped tokens

MoE introduces a few new engineering concepts worth knowing:

- capacity factor
- load balancing
- dropped tokens when experts overflow
- router entropy
- selective higher precision in the router

That makes MoE a great case study in how FFN design turns directly into distributed-systems design.

> [!question] Quick check
> What does a routing histogram reveal in a Mixture-of-Experts FFN?
>> [!answer] it shows how unevenly tokens are being assigned across experts, which makes expert imbalance or collapse easier to spot.
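To ground the MoE vocabulary above, here is a small single-device sketch of top-1 routing with a capacity factor. The router, expert widths, and capacity formula are illustrative assumptions in the spirit of Switch-style routing, not code taken from Switch, GShard, or picoLLM.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_experts, d_model, num_tokens = 4, 64, 32
capacity_factor = 1.25
capacity = int(capacity_factor * num_tokens / num_experts)   # max tokens per expert

router = nn.Linear(d_model, num_experts)                     # produces routing logits
experts = nn.ModuleList([
    nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
    for _ in range(num_experts)
])

x = torch.randn(num_tokens, d_model)
probs = torch.softmax(router(x), dim=-1)                      # (tokens, experts)
weight, expert_id = probs.max(dim=-1)                         # top-1 routing decision

out = torch.zeros_like(x)
dropped = 0
for e in range(num_experts):
    idx = (expert_id == e).nonzero(as_tuple=True)[0]
    kept, overflow = idx[:capacity], idx[capacity:]           # enforce the capacity factor
    dropped += overflow.numel()                               # overflow tokens are dropped here
    if kept.numel():
        out[kept] = weight[kept].unsqueeze(1) * experts[e](x[kept])  # scale by router prob

histogram = torch.bincount(expert_id, minlength=num_experts)  # the routing histogram
print("tokens per expert:", histogram.tolist(), "dropped:", dropped)
```

A flat histogram means the load-balancing objective is working; a spiky one means some experts are overflowing their capacity while others sit idle, which is exactly the imbalance the quick check above asks about.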
> [!example] Notebook walkthroughs in this lecture
>
> If you want to study this note in code, use these notebook sections. If your notebook viewer ignores the URL fragment, search for the exact heading text in the notebook:
>
> - [`FFN acts independently at every token`](https://github.com/Montekkundan/llm/blob/main/notebooks/feed_forward_network/lecture_walkthrough.ipynb#ffn-acts-independently-at-every-token)
> - [`Parameter count grows with FFN width`](https://github.com/Montekkundan/llm/blob/main/notebooks/feed_forward_network/lecture_walkthrough.ipynb#parameter-count-grows-with-ffn-width)
> - [`Activation choice changes the hidden map`](https://github.com/Montekkundan/llm/blob/main/notebooks/feed_forward_network/lecture_walkthrough.ipynb#activation-choice-changes-the-hidden-map)
> - [`Gated FFNs add a modulation branch`](https://github.com/Montekkundan/llm/blob/main/notebooks/feed_forward_network/lecture_walkthrough.ipynb#gated-ffns-add-a-modulation-branch)
> - [`Attention mixes tokens and FFN mixes channels`](https://github.com/Montekkundan/llm/blob/main/notebooks/feed_forward_network/lecture_walkthrough.ipynb#attention-mixes-tokens-and-ffn-mixes-channels)
>
> A useful study order is:
>
> 1. inspect the token-wise FFN contract and shape changes
> 2. compare dense and gated parameter counts at matched width
> 3. study how activation choice changes the hidden map
> 4. connect dense FFNs to MoE routing and systems trade-offs
>
> <video src="https://assets.montek.dev/lectures/media/llm/concepts/Feed-Forward%20Network/04_moe_sparse_ffn.mp4" controls></video>

> [!tip] TensorTonic practice for this lecture
>
> If you want to practice this lecture in a more implementation-focused format, work through these TensorTonic exercises:
>
> - [TensorTonic: Transformers Feed Forward](https://www.tensortonic.com/research/transformer/transformers-feed-forward)
> - [TensorTonic: GPT-2 GELU](https://www.tensortonic.com/research/gpt2/gpt2-gelu)
> - [TensorTonic: GPT-2 FFN](https://www.tensortonic.com/research/gpt2/gpt2-ffn)
>
> They are good follow-ups because they let you isolate the main moving parts of the MLP path:
>
> - expansion from $d_{\text{model}}$ to $d_{\text{ff}}$
> - nonlinear activation in the hidden layer
> - projection back to model width
> - the specific role GELU plays in GPT-style FFNs

<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
    <div><a class="internal-link" data-href="Multi-head Attention" href="Multi-head%20Attention">Multi-head Attention</a></div>
  </div>
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
    <div><a class="internal-link" data-href="Layer Normalization" href="Layer%20Normalization">Layer Normalization</a></div>
  </div>
</div>

### References

[^1]: Ashish Vaswani et al., "Attention Is All You Need," 2017. https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf
[^2]: Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy, "Transformer Feed-Forward Layers Are Key-Value Memories," 2021.
https://aclanthology.org/2021.emnlp-main.446.pdf [^3]: Noam Shazeer, "GLU Variants Improve Transformer," 2020. https://arxiv.org/abs/2002.05202 [^4]: William Fedus, Barret Zoph, and Noam Shazeer, "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity," 2022. https://jmlr.org/papers/v23/21-0998.html [^5]: Jacob Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," 2019. https://aclanthology.org/N19-1423.pdf [^6]: Dan Hendrycks and Kevin Gimpel, "Gaussian Error Linear Units (GELUs)," 2016. https://arxiv.org/abs/1606.08415 [^7]: PyTorch, "torch.nn.GELU," 2025. https://docs.pytorch.org/docs/stable/generated/torch.nn.GELU.html [^8]: Prajit Ramachandran, Barret Zoph, and Quoc V. Le, "Searching for Activation Functions," 2017. https://arxiv.org/abs/1710.05941 [^10]: Hugo Touvron et al., "LLaMA: Open and Efficient Foundation Language Models," 2023. https://arxiv.org/abs/2302.13971 [^11]: Ruibin Xiong et al., "On Layer Normalization in the Transformer Architecture," 2020. https://arxiv.org/abs/2002.04745 [^12]: Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya, "Reformer: The Efficient Transformer," 2020. https://openreview.net/forum?id=rkgNKkHtvB [^13]: NVIDIA, "Megatron Core fusions," 2025. https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/fusions.html [^14]: Noam Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer," 2017. https://arxiv.org/abs/1701.06538 [^15]: Dmitry Lepikhin, HyoukJoong Lee, and Yuanzhong Xu, "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding," 2020. https://arxiv.org/abs/2006.16668 [^17]: Nan Du et al., "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts," 2022. https://proceedings.mlr.press/v162/du22c/du22c.pdf