08 / training
Watch it learn
We have all the machinery — embeddings, attention, blocks, the output projection. But every weight is still random. Hand the kid a prompt right now and you'll get nonsense.
Training is the part where we actually nudge those 825K weights into something useful. Show the kid a batch of Shakespeare, measure how wrong its predictions were, compute which direction each weight should move to be a tiny bit less wrong, take a tiny step, repeat. 5,000 times.
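If "measure how wrong" feels abstract, here is a minimal, self-contained sketch of the loss being computed. The shapes (batch B, context T, vocab V = 65, the usual character count for tiny Shakespeare) and the zero logits are stand-ins for illustration, not the model's real outputs:

import math
import torch
import torch.nn.functional as F

B, T, V = 4, 8, 65                       # hypothetical batch, context, vocab sizes
logits = torch.zeros(B, T, V)            # stand-in: a model that guesses uniformly
targets = torch.randint(0, V, (B, T))    # stand-in for the true next characters

# Cross-entropy compares the predicted distribution over the next character
# with the character that actually came next, averaged over every position.
loss = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))
print(loss.item(), math.log(V))          # both ≈ 4.17: the "knows nothing" baseline

That ≈ 4.17 is where training starts; watching the number fall below it is the first sign the weights are moving somewhere useful.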
We saved the kid's sample for the prompt ROMEO: at every checkpoint along the way, so you can watch it go from gibberish to almost-Shakespeare.
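The loop below pulls its batches from a get_batch helper. Here is a minimal sketch of what such a helper looks like, assuming the corpus was encoded earlier into a 1-D integer tensor (train_data, val_data, block_size, and batch_size are illustrative names, not quotes from the real code):

import torch

block_size, batch_size = 8, 4            # toy values; the real model uses larger ones
text = "ROMEO: But, soft! what light through yonder window breaks?"
chars = sorted(set(text))
train_data = torch.tensor([chars.index(c) for c in text])
val_data = train_data                    # toy stand-in; real code holds out a split

def get_batch(split):
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))        # random offsets
    x = torch.stack([data[i : i + block_size] for i in ix])          # inputs
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])  # targets, shifted by one
    return x, y

With batches in hand, the whole training loop is short enough to read in one breath: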
for step in range(MAX_STEPS + 1):
    if step in CHECKPOINTS:
        # ... eval + save a sample ...
    xb, yb = get_batch("train")          # one random batch of (input, target) windows
    _, loss = model(xb, yb)              # forward pass: how wrong were the predictions?
    opt.zero_grad(set_to_none=True)      # clear the previous step's gradients
    loss.backward()                      # backprop: which way should each weight move?
    opt.step()                           # nudge all 825K weights a tiny bit

What changes between checkpoints? Step 0: uniformly random characters. Step 100: it has learned that uppercase letters cluster and that a colon follows an uppercase run (Shakespeare speaker names!). Step 1,000: words are mostly the right length, and vowels and consonants alternate. Step 5,000: it's producing recognizable scene structures, character names like WARWICK and HENRY, and grammatical-ish English. Same model. Same 825K weights. Just nudged 5,000 times.
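As for the elided "eval + save a sample" step, here is one hypothetical shape for it. encode, decode, and model.generate are assumed helpers in the style of earlier sections; treat this as a sketch of what the checkpoint hook could do, not the article's actual code:

import torch

samples = {}                                          # step -> generated text

@torch.no_grad()
def save_sample(step):
    model.eval()                                      # e.g. disable dropout while sampling
    prompt = torch.tensor([encode("ROMEO:")])         # batch of one prompt
    out = model.generate(prompt, max_new_tokens=200)  # continue for 200 characters
    samples[step] = decode(out[0].tolist())           # stash the text for this checkpoint
    model.train()                                     # back to training mode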