09 / process
Behind the scenes
We built two things, in this order: a one-file training script called tiny-llm that trains the kid, and this site, peek, which walks people through how it works. Both were small. Both taught us something we didn't expect.
The training script
The whole training pipeline lives in one Python file, about 200 lines. It downloads tinyshakespeare, builds the vocab, defines the model, trains for 5,000 steps, and saves the result to kid.pt. One file, no config, runnable on a laptop in a few minutes. We treated it as a teaching artifact: every line should be readable, every choice should have an obvious reason.
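For orientation, the shape of that file is roughly the sketch below. This is not train.py itself: the model is a stand-in bigram table and the sizes are made up, but the pipeline (load text, build a character vocab, sample batches, train, save) is the same.

# Minimal sketch of the pipeline shape (not the real train.py).
# The model here is a stand-in bigram table; sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

text = open("input.txt").read()               # tinyshakespeare, fetched beforehand
chars = sorted(set(text))                     # character-level vocab
stoi = {ch: i for i, ch in enumerate(chars)}
data = torch.tensor([stoi[ch] for ch in text], dtype=torch.long)

block_size, batch_size = 12, 32               # toy sizes
model = nn.Embedding(len(chars), len(chars))  # stand-in for the transformer
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(5_000):
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    xb = torch.stack([data[i : i + block_size] for i in ix])
    yb = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    logits = model(xb)                        # (B, T, vocab)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))
    opt.zero_grad(); loss.backward(); opt.step()

torch.save(model.state_dict(), "kid.pt")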
The constraint we set: the only piece you write yourself is attention. Everything else is plumbing — embeddings, the block, the training loop, the optimizer. Those have one obvious right answer. Attention is the actual interesting idea, so we made it the only thing the reader has to type out.
# ─────────────────────────────────────────────────────────────────────
# THE ONE THING YOU WRITE
# ─────────────────────────────────────────────────────────────────────
class Head(nn.Module):
    """Single self-attention head — the core of the transformer."""

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(N_EMBD, head_size, bias=False)
        self.query = nn.Linear(N_EMBD, head_size, bias=False)
        self.value = nn.Linear(N_EMBD, head_size, bias=False)
        self.register_buffer("mask", torch.tril(torch.ones(BLOCK_SIZE, BLOCK_SIZE)))
        self.head_size = head_size

    def forward(self, x):
        B, T, C = x.shape
        q = self.query(x)  # what is each position looking for?
        k = self.key(x)    # what does each position contain?
        v = self.value(x)  # what does each position pass along?
        scores = q @ k.transpose(-2, -1) / (self.head_size ** 0.5)
        scores = scores.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        return weights @ v
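For contrast, here is roughly what the plumbing around Head looks like: several heads run in parallel, their outputs concatenated and projected back to the embedding width. The class name and the output projection below are our sketch, not necessarily the script's exact code.

# Sketch of the plumbing around Head: parallel heads, concatenated.
# The name MultiHead and the output projection are assumptions.
class MultiHead(nn.Module):
    def __init__(self, n_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(n_heads)])
        self.proj = nn.Linear(n_heads * head_size, N_EMBD)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, n_heads * head_size)
        return self.proj(out)                                # back to (B, T, N_EMBD)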
Things that surprised us
- How fast it learns. By step 100 the kid had already figured out that uppercase letters cluster and that colons follow them, i.e. the structure of Shakespeare character labels (ROMEO:, JULIET:). After 100 batches. We hadn't told it anything about characters or labels.
- How small the model is. 825,000 numbers fit in a 4 MB file. A single Word document of poetry is bigger than our entire LLM. (You can check this and the next point yourself; see the snippet after this list.)
- How visible everything is. Every weight is inspectable. Every attention head's pattern is a 12 × 12 matrix you can stare at. There is no "magic sauce": just numbers, multiplied and added in a particular order.
- The val loss starts diverging from train loss late. Around step 4,000 the train loss keeps falling but the val loss stops improving: classic overfitting onset. We could probably fight it with dropout, but since the kid is a teaching artifact we left it in; it's honest about what training actually looks like.
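Both the size claim and the inspectability claim are easy to verify. A short sketch, assuming kid.pt is a plain state dict saved with torch.save (the tensor key in the last comment is a guess at how the module tree is named):

# Count the parameters in kid.pt and estimate the float32 file size.
import torch

state = torch.load("kid.pt", map_location="cpu")
n = sum(t.numel() for t in state.values())
print(f"{n:,} parameters, ~{n * 4 / 1e6:.1f} MB as float32")

# Every tensor is right there to inspect, e.g. (key name assumed):
# print(state["blocks.0.sa.heads.0.key.weight"].shape)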
The site
For the website, the design constraint was: every page shows real data. Not toy examples, not made-up numbers — the literal weights and intermediates from kid.pt. That meant building an export step (export_for_web.py) that runs the model once, captures everything we want to show (embeddings, position vectors, attention weights for a sample prompt, the parsed training log), and dumps it as JSON into /public/data/.
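The mechanics of that export step are simple: one forward pass, stash the intermediates, json.dump. A stripped-down sketch reusing the Head class above; the real export_for_web.py captures more (embeddings, positions, the training log) and loads the trained weights rather than a fresh head.

# Stripped-down export: recompute one head's attention pattern for a
# prompt and dump it as JSON. Paths and shapes here are illustrative.
import json
import torch
import torch.nn.functional as F

head = Head(head_size=16)        # in reality: loaded from kid.pt
x = torch.randn(1, 12, N_EMBD)   # stand-in for an embedded prompt; T <= BLOCK_SIZE
T = x.shape[1]

with torch.no_grad():
    q, k = head.query(x), head.key(x)
    scores = q @ k.transpose(-2, -1) / head.head_size ** 0.5
    scores = scores.masked_fill(head.mask[:T, :T] == 0, float("-inf"))
    attn = F.softmax(scores, dim=-1)  # the T × T pattern you can stare at

with open("public/data/attention.json", "w") as f:
    json.dump({"attention": attn.squeeze(0).tolist()}, f)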
The site is a static Next.js app — no server, no database, no inference at runtime. Everything you see is precomputed. That choice keeps it fast and free to host, and it forces us to be deliberate about what data we show.
Run it yourself
The full source lives in two repos. To train your own kid:
git clone <tiny-llm repo>
cd tiny-llm
python -m venv .venv && source .venv/bin/activate
pip install torch
python train.py # ~5 min on M-series Mac, ~15 min on CPU
python show_model.py # peek inside what you trained

To rebuild the site against your own kid:
python export_for_web.py # dumps JSON into ../peek/public/data/
cd ../peek
bun install && bun dev
open http://localhost:3000

What's next
One of these ideas has already shipped: try the playground. A few more directions this could grow:
- Watch a single weight learn. Save not just text samples at each checkpoint but a single weight value, and animate its trajectory over the 5,000 steps (a sketch follows this list).
- Bigger kid, same explanation. Run a 10M-param version overnight, see whether the explanations still hold up.
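For the first idea, the change to the training loop is small. A toy sketch of the mechanism; the model, objective, and checkpoint interval here are dummies, not train.py's:

# Sketch of "watch a single weight learn": record one scalar per
# checkpoint. Model, objective, and interval are stand-ins.
import json
import torch
import torch.nn as nn

model = nn.Linear(4, 4)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
trace = []

for step in range(5_000):
    loss = model(torch.randn(8, 4)).pow(2).mean()  # dummy objective
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 50 == 0:
        trace.append({"step": step, "value": model.weight[0, 0].item()})

with open("one_weight.json", "w") as f:
    json.dump(trace, f)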