09 / process

Behind the scenes

We built two things, in this order: a one-file training script called tiny-llm that trains the kid, and this site — peek — that walks people through how it works. Both were small. Both taught us something we didn't expect.

The training script

The whole training pipeline lives in one Python file, about 200 lines. It downloads tinyshakespeare, builds the vocab, defines the model, trains for 5,000 steps, and saves the result to kid.pt. One file, no config, runnable on a laptop in a few minutes. We treated it as a teaching artifact: every line should be readable, every choice should have an obvious reason.
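
The skeleton of that pipeline looks roughly like the sketch below. It is illustrative rather than a copy of train.py: the hyperparameter values, the get_batch helper, and the TinyTransformer name are assumptions, but the data prep and the loop follow the same idea.

import torch

# Character-level vocab: every unique character in the corpus gets an id.
text = open("input.txt").read()                  # tinyshakespeare, downloaded earlier
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
data = torch.tensor([stoi[ch] for ch in text], dtype=torch.long)

BLOCK_SIZE, BATCH_SIZE = 64, 32                  # illustrative values, not train.py's

def get_batch():
    """Sample random chunks; the target is the input shifted by one character."""
    ix = torch.randint(len(data) - BLOCK_SIZE, (BATCH_SIZE,))
    x = torch.stack([data[i : i + BLOCK_SIZE] for i in ix])
    y = torch.stack([data[i + 1 : i + 1 + BLOCK_SIZE] for i in ix])
    return x, y

# The rest is the familiar PyTorch loop (the model name here is hypothetical):
# model = TinyTransformer(vocab_size=len(chars))
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# for step in range(5_000):
#     xb, yb = get_batch()
#     logits, loss = model(xb, yb)
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
# torch.save(model.state_dict(), "kid.pt")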

The constraint we set: the only piece you write yourself is attention. Everything else is plumbing — embeddings, the block, the training loop, the optimizer. Those have one obvious right answer. Attention is the actual interesting idea, so we made it the only thing the reader has to type out.

train.py, lines 70–96 — the only piece you write yourself
# ─────────────────────────────────────────────────────────────────────
#  THE ONE THING YOU WRITE
# ─────────────────────────────────────────────────────────────────────
class Head(nn.Module):
    """Single self-attention head — the core of the transformer."""
    def __init__(self, head_size):
        super().__init__()
        self.key   = nn.Linear(N_EMBD, head_size, bias=False)
        self.query = nn.Linear(N_EMBD, head_size, bias=False)
        self.value = nn.Linear(N_EMBD, head_size, bias=False)
        self.register_buffer("mask", torch.tril(torch.ones(BLOCK_SIZE, BLOCK_SIZE)))
        self.head_size = head_size

    def forward(self, x):
        B, T, C = x.shape                                            # batch, time, channels
        q = self.query(x)                                            # (B, T, head_size)
        k = self.key(x)                                              # (B, T, head_size)
        v = self.value(x)                                            # (B, T, head_size)
        scores = q @ k.transpose(-2, -1) / (self.head_size ** 0.5)   # (B, T, T) similarities
        scores = scores.masked_fill(self.mask[:T, :T] == 0, float("-inf"))  # hide the future
        weights = F.softmax(scores, dim=-1)                          # each row sums to 1
        return weights @ v                                           # weighted mix of values
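
For a sense of the plumbing around it, here is a rough sketch of how several Heads combine into the multi-head layer used by the block. The class name and the output projection are assumptions for illustration; train.py may organize this part differently.

class MultiHeadAttention(nn.Module):
    """Run several Heads in parallel and concatenate their outputs."""
    def __init__(self, n_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(n_heads)])
        self.proj = nn.Linear(n_heads * head_size, N_EMBD)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)   # (B, T, n_heads * head_size)
        return self.proj(out)                                  # back to (B, T, N_EMBD)

# Quick shape check with a dummy batch of 4 sequences, 16 tokens each:
# x = torch.randn(4, 16, N_EMBD)
# MultiHeadAttention(n_heads=4, head_size=N_EMBD // 4)(x).shape   # (4, 16, N_EMBD)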

Things that surprised us

The site

For the website, the design constraint was: every page shows real data. Not toy examples, not made-up numbers — the literal weights and intermediates from kid.pt. That meant building an export step (export_for_web.py) that runs the model once, captures everything we want to show (embeddings, position vectors, attention weights for a sample prompt, the parsed training log), and dumps it as JSON into /public/data/.
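
The export step is essentially one forward pass plus a JSON dump. A minimal sketch follows; the attribute names (token_embedding, position_embedding), the payload keys, and the hook comment describe an assumed structure, not necessarily what export_for_web.py does line for line.

import json
import torch

def export_for_web(model, prompt_ids, out_path="../peek/public/data/model.json"):
    """Run the trained model once and write the tensors the site needs as JSON."""
    model.eval()
    payload = {}
    with torch.no_grad():
        # Static weights: token and position embeddings (attribute names assumed).
        payload["token_embeddings"] = model.token_embedding.weight.tolist()
        payload["position_embeddings"] = model.position_embedding.weight.tolist()
        # Intermediates for the sample prompt (e.g. per-head attention weights)
        # would be captured here via forward hooks on each Head, then added.
        model(prompt_ids)
    with open(out_path, "w") as f:
        json.dump(payload, f)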

The site is a static Next.js app — no server, no database, no inference at runtime. Everything you see is precomputed. That choice keeps it fast and free to host, and it forces us to be deliberate about what data we show.

Run it yourself

The full source is two repos. To train your own kid:

git clone <tiny-llm repo>
cd tiny-llm
python -m venv .venv && source .venv/bin/activate
pip install torch
python train.py        # ~5 min on M-series Mac, ~15 min on CPU
python show_model.py   # peek inside what you trained

To rebuild the site against your own kid:

python export_for_web.py   # dumps JSON into ../peek/public/data/
cd ../peek
bun install && bun dev
open http://localhost:3000

What's next

We shipped one — try the playground. A few more directions this could grow:

Thanks for reading. If anything was confusing, that's our fault — open an issue on the repo and we'll take another pass at it. The whole point of building this was to remove the magic from LLMs, and an explanation that's itself opaque doesn't move the ball.