06 / block
Wrap it in a block, stack four
Attention by itself isn't quite enough. The output of an attention head is just a re-mix of the input values — it can decide where to look, but each position's output is still a linear combination of the input vectors. To learn richer patterns, we need to add some non-linear processing on top.
That's the transformer block: attention, then a small feed-forward network, plus two stabilizing tricks wrapped around them (residual connections and LayerNorm) that make deep stacks trainable.
One block looks like this. We stack 4 of them.
class Block(nn.Module):
    """One transformer block: multi-head attention + feed-forward,
    with residuals + LayerNorm."""

    def __init__(self):
        super().__init__()
        self.attn = MultiHead()
        self.ffwd = nn.Sequential(
            nn.Linear(N_EMBD, 4 * N_EMBD),
            nn.ReLU(),
            nn.Linear(4 * N_EMBD, N_EMBD),
        )
        self.ln1 = nn.LayerNorm(N_EMBD)
        self.ln2 = nn.LayerNorm(N_EMBD)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # attention with residual
        x = x + self.ffwd(self.ln2(x))   # feed-forward with residual
        return x

What's happening in those two lines
Both residual lines follow the same pre-norm pattern. For attention: x → ln1(x) → attn(ln1(x)) → x + attn(ln1(x)). For the feed-forward step: x → ln2(...) → ffwd(ln2(...)) → x + ffwd(...). Normalize, transform, then add the result back onto the original x.
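Unrolled into intermediate variables, the same forward pass looks like the sketch below. It is equivalent to the two lines above; the intermediate names are mine, not the original code's.

def forward(self, x):                  # x: (batch, time, N_EMBD)
    # attention sublayer: pre-norm, mix information across positions, residual add
    normed = self.ln1(x)
    attended = self.attn(normed)
    x = x + attended
    # feed-forward sublayer: pre-norm, per-position non-linear processing, residual add
    normed = self.ln2(x)
    thought = self.ffwd(normed)
    x = x + thought
    return x                           # same shape in, same shape out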
Stack them up — that's the model

self.blocks = nn.Sequential(*[Block() for _ in range(N_LAYER)])

That's it. 4 identical blocks in a row. The input flows through block 0, then block 1, then block 2, then block 3. Each block sees the previous block's output and refines it a little further. By the time we reach the top, each position has had 4 chances to look around the sequence and 4 chances to think.
The full parameter inventory
Here is every learned weight in the model — pulled directly from kid.pt. Total: 824,897 numbers. That's the entire kid.
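A count like that is easy to reproduce: load the checkpoint and sum the element counts of every tensor in it. A minimal sketch, assuming kid.pt holds a plain state_dict (if it wraps one inside a larger dict, pull that field out first):

import torch

state_dict = torch.load("kid.pt", map_location="cpu")

total = 0
for name, tensor in state_dict.items():
    # one row per weight tensor: name, shape, element count
    print(f"{name:<40s} {str(tuple(tensor.shape)):<18s} {tensor.numel():>9,}")
    total += tensor.numel()

print(f"total parameters: {total:,}")   # 824,897 for this model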