06 / block
Wrap it in a block, stack four
Attention by itself isn't quite enough. The output of an attention head is just a re-mix of the input values — it can decide where to look, but each position's output is still a linear combination of the input vectors. To learn richer patterns, we need to add some non-linear processing on top.
That's the transformer block: attention, then a small feed-forward network, plus two stabilizing tricks wrapped around them (residual connections and LayerNorm) that make deep stacks trainable.
One block looks like this. We stack 4 of them.
class Block(nn.Module):
    """One transformer block: multi-head attention + feed-forward,
    with residuals + LayerNorm."""

    def __init__(self):
        super().__init__()
        self.attn = MultiHead()
        self.ffwd = nn.Sequential(
            nn.Linear(N_EMBD, 4 * N_EMBD),
            nn.ReLU(),
            nn.Linear(4 * N_EMBD, N_EMBD),
        )
        self.ln1 = nn.LayerNorm(N_EMBD)
        self.ln2 = nn.LayerNorm(N_EMBD)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # attention with residual
        x = x + self.ffwd(self.ln2(x))   # feed-forward with residual
        return x

What's happening in those two lines
Both residual lines follow the same pre-norm pattern. For attention: x → ln1(x) → attn(ln1(x)) → x + attn(ln1(x)). For the feed-forward step: x → ln2(...) → ffwd(ln2(...)) → x + ffwd(...). Normalize, transform, then add the result back onto the original x.
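Unrolled into intermediate variables, the same forward pass looks like the sketch below. It is equivalent to the two lines above; the intermediate names are mine, not the original code's.

def forward(self, x):                  # x: (batch, time, N_EMBD)
    # attention sublayer: pre-norm, mix information across positions, residual add
    normed = self.ln1(x)
    attended = self.attn(normed)
    x = x + attended
    # feed-forward sublayer: pre-norm, per-position non-linear processing, residual add
    normed = self.ln2(x)
    thought = self.ffwd(normed)
    x = x + thought
    return x                           # same shape in, same shape out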
Stack them up — that's the model

self.blocks = nn.Sequential(*[Block() for _ in range(N_LAYER)])

That's it. 4 identical blocks in a row. The input flows through block 0, then block 1, then block 2, then block 3. Each block sees the previous block's output and refines it a little further. By the time we reach the top, each position has had 4 chances to look around the sequence and 4 chances to think.
The full parameter inventory
Here is every learned weight in the model — pulled directly from kid.pt. Total: 824,897 numbers. That's the entire kid.
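A count like that is easy to reproduce: load the checkpoint and sum the element counts of every tensor in it. A minimal sketch, assuming kid.pt holds a plain state_dict (if it wraps one inside a larger dict, pull that field out first):

import torch

state_dict = torch.load("kid.pt", map_location="cpu")

total = 0
for name, tensor in state_dict.items():
    # one row per weight tensor: name, shape, element count
    print(f"{name:<40s} {str(tuple(tensor.shape)):<18s} {tensor.numel():>9,}")
    total += tensor.numel()

print(f"total parameters: {total:,}")   # 824,897 for this model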