08 / training
Watch it learn
We have all the machinery — embeddings, attention, blocks, the output projection. But every weight is still random. Hand the kid a prompt right now and you'll get nonsense.
Training is the part where we actually nudge those 825K weights into something useful. Show the kid a batch of Shakespeare, measure how wrong its predictions were, compute which direction each weight should move to be a tiny bit less wrong, take a tiny step, repeat. 5,000 times.
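If "measure how wrong" feels abstract, here is a minimal, self-contained sketch of the loss being computed. The shapes (batch B, context T, vocab V = 65, the usual character count for tiny Shakespeare) and the zero logits are stand-ins for illustration, not the model's real outputs:

import math
import torch
import torch.nn.functional as F

B, T, V = 4, 8, 65                       # hypothetical batch, context, vocab sizes
logits = torch.zeros(B, T, V)            # stand-in: a model that guesses uniformly
targets = torch.randint(0, V, (B, T))    # stand-in for the true next characters

# Cross-entropy compares the predicted distribution over the next character
# with the character that actually came next, averaged over every position.
loss = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))
print(loss.item(), math.log(V))          # both ≈ 4.17: the "knows nothing" baseline

That ≈ 4.17 is where training starts; watching the number fall below it is the first sign the weights are moving somewhere useful.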
We saved the kid's sample for the prompt ROMEO: at every checkpoint along the way, so you can watch it go from gibberish to almost-Shakespeare.
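The loop below pulls its batches from a get_batch helper. Here is a minimal sketch of what such a helper looks like, assuming the corpus was encoded earlier into a 1-D integer tensor (train_data, val_data, block_size, and batch_size are illustrative names, not quotes from the real code):

import torch

block_size, batch_size = 8, 4            # toy values; the real model uses larger ones
text = "ROMEO: But, soft! what light through yonder window breaks?"
chars = sorted(set(text))
train_data = torch.tensor([chars.index(c) for c in text])
val_data = train_data                    # toy stand-in; real code holds out a split

def get_batch(split):
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))        # random offsets
    x = torch.stack([data[i : i + block_size] for i in ix])          # inputs
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])  # targets, shifted by one
    return x, y

With batches in hand, the whole training loop is short enough to read in one breath: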
for step in range(MAX_STEPS + 1):
    if step in CHECKPOINTS:
        # ... eval + save a sample ...
    xb, yb = get_batch("train")          # one random batch of (input, target) windows
    _, loss = model(xb, yb)              # forward pass: how wrong were the predictions?
    opt.zero_grad(set_to_none=True)      # clear the previous step's gradients
    loss.backward()                      # backprop: which way should each weight move?
    opt.step()                           # nudge all 825K weights a tiny bit

What changes between checkpoints? Step 0: uniformly random characters. Step 100: it has learned that uppercase letters cluster and that a colon follows an uppercase run (Shakespeare speaker names!). Step 1,000: words are mostly the right length, and vowels and consonants alternate. Step 5,000: it's producing recognizable scene structures, character names like WARWICK and HENRY, and grammatical-ish English. Same model. Same 825K weights. Just nudged 5,000 times.
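As for the elided "eval + save a sample" step, here is one hypothetical shape for it. encode, decode, and model.generate are assumed helpers in the style of earlier sections; treat this as a sketch of what the checkpoint hook could do, not the article's actual code:

import torch

samples = {}                                          # step -> generated text

@torch.no_grad()
def save_sample(step):
    model.eval()                                      # e.g. disable dropout while sampling
    prompt = torch.tensor([encode("ROMEO:")])         # batch of one prompt
    out = model.generate(prompt, max_new_tokens=200)  # continue for 200 characters
    samples[step] = decode(out[0].tolist())           # stash the text for this checkpoint
    model.train()                                     # back to training mode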