Build your own LLM,
one Shakespeare-flavored step at a time.

We trained a tiny language model on the complete works of Shakespeare: 825,000 weights, saved to a 4 MB file, built on the same core transformer recipe as models like GPT-4, just much smaller.

We're going to call it the kid. It shows up empty-headed — every weight a random number — and we hand it 5,000 batches of Shakespeare, one tiny correction at a time, until it learns to write the stuff itself. The file we save at the end is named kid.pt for a reason.

This walkthrough is the actual journey we took to raise it — every step, every snippet of code from train.py, every formula, the real numbers that came out. If you follow along, you'll be able to train your own.

01

Pick a task

We give the kid one job: read some Shakespeare, then guess the next character. That's it. The whole rest of the model is in service of getting better at this.
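In code, the job looks something like this. A tiny sketch, not train.py verbatim; the sample text and block_size are just for illustration:

```python
# Every training example is a chunk of text plus the same chunk
# shifted one character to the right.
text = "ROMEO: But soft, what light through yonder window breaks?"

block_size = 8                       # context length; an illustrative value
x = text[:block_size]                # what the kid sees:     "ROMEO: B"
y = text[1:block_size + 1]           # what it must predict:  "OMEO: Bu"

for i in range(1, block_size + 1):
    context, target = x[:i], y[i - 1]
    print(f"given {context!r} -> predict {target!r}")
```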

02

Build a vocabulary

Computers can't do math on letters. So we list every distinct character in our text — 65 of them — and give each one a number.
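A sketch of that step, assuming the corpus sits in a file called input.txt (the filename is our assumption):

```python
text = open("input.txt", encoding="utf-8").read()

chars = sorted(set(text))            # every distinct character, in a fixed order
vocab_size = len(chars)              # 65 for our Shakespeare text

stoi = {ch: i for i, ch in enumerate(chars)}     # character -> integer ID
itos = {i: ch for ch, i in stoi.items()}         # integer ID -> character

encode = lambda s: [stoi[c] for c in s]          # "ROMEO" -> five integers
decode = lambda ids: "".join(itos[i] for i in ids)
```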

03

Give each letter meaning

An ID like 7 isn't useful by itself. We hand each letter a vector of 128 numbers. They start random; training shapes them into something meaningful.
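In PyTorch that's one lookup table. A sketch with the sizes from this walkthrough; the variable names are ours, not necessarily train.py's:

```python
import torch
import torch.nn as nn

vocab_size, n_embd = 65, 128
token_embedding = nn.Embedding(vocab_size, n_embd)   # starts out random

ids = torch.tensor([7, 2, 40])       # three character IDs (made-up values)
vectors = token_embedding(ids)       # shape: (3, 128); training reshapes these
```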

04

Tell it where things are

An 'E' at the start of a line should mean something different from an 'E' five chars in. We add a second 128-number vector for each position.
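Something like this, where block_size (the longest context the model ever sees) is our assumption:

```python
import torch
import torch.nn as nn

block_size, n_embd = 64, 128
position_embedding = nn.Embedding(block_size, n_embd)

T = 16                                   # length of this particular input
pos = torch.arange(T)                    # [0, 1, 2, ..., 15]
tok_emb = torch.randn(T, n_embd)         # stand-in for the token embeddings
x = tok_emb + position_embedding(pos)    # same 'E', different position,
                                         # different combined vector
```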

05

Let positions look at each other

The heart of the transformer — and the one piece of code we wrote ourselves. Every position decides which earlier positions to look at, and how much.
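Here's the shape of it: a minimal single-head sketch, assuming the 128-number vectors from earlier and a 64-character context. The version in train.py may differ (multiple heads, dropout, and so on):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, n_embd=128, block_size=64):
        super().__init__()
        self.key = nn.Linear(n_embd, n_embd, bias=False)
        self.query = nn.Linear(n_embd, n_embd, bias=False)
        self.value = nn.Linear(n_embd, n_embd, bias=False)
        # lower-triangular mask: position t may only look at positions <= t
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):                         # x: (batch, time, channels)
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        att = q @ k.transpose(-2, -1) / C**0.5    # how much each position cares
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)              # attention weights sum to 1
        return att @ v                            # weighted mix of earlier values
```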

06

Wrap it in a block, stack four

Attention plus a small feed-forward network plus residual connections plus LayerNorm = one transformer block. Stack four of them.
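As code, again a sketch: it reuses the SelfAttention class from step 05 and puts LayerNorm before each sub-layer (the common "pre-norm" arrangement; train.py may order things differently):

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, n_embd=128, block_size=64):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = SelfAttention(n_embd, block_size)   # from step 05
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(                       # the small feed-forward net
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))    # residual connection #1
        x = x + self.mlp(self.ln2(x))     # residual connection #2
        return x

blocks = nn.Sequential(*[Block() for _ in range(4)])    # stack four
```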

07

Project back to letters

After all the blocks, we turn the final 128-number vector back into 65 scores — one per possible next letter. Softmax turns scores into probabilities.
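The projection itself is a single linear layer; softmax does the rest. A sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

lm_head = nn.Linear(128, 65)        # 128-number vector -> 65 raw scores

x = torch.randn(1, 128)             # the final vector for one position
logits = lm_head(x)                 # one score per possible next letter
probs = F.softmax(logits, dim=-1)   # scores -> probabilities
print(probs.sum())                  # always ~1.0
```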

08

Watch it learn

Random weights produce random gibberish. We show it 5,000 batches of Shakespeare and nudge the weights each time. Watch the same prompt get smarter at every checkpoint.
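The loop itself is short. This sketch keeps its shape; the batch size, learning rate, and the toy stand-in model are our assumptions, there so the snippet runs on its own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

data = torch.randint(0, 65, (10_000,))      # stand-in for the encoded corpus
block_size, batch_size = 64, 32

def get_batch():
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    return x, y

# Swap this toy model for the real embeddings + four blocks + projection.
model = nn.Sequential(nn.Embedding(65, 128), nn.Linear(128, 65))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(5000):
    xb, yb = get_batch()                    # chunks and their one-char shifts
    logits = model(xb)                      # (batch, time, 65) scores
    loss = F.cross_entropy(logits.view(-1, 65), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()                         # which way should each weight move?
    optimizer.step()                        # nudge
```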

09

Behind the scenes

How we built this — the one-file training script, the 4 MB saved model, what surprised us, and how to run it yourself.
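Saving and reloading the kid comes down to a couple of calls, assuming a PyTorch model object named model:

```python
import torch

torch.save(model.state_dict(), "kid.pt")    # the 4 MB file

# Later, on any machine:
state = torch.load("kid.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()                                # inference mode: no more nudging
```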

10

Now you try

Everything you've read, running in your browser. Type a prompt, slide the temperature, watch the kid write you something. The model is loaded into your tab; nothing is sent to a server.
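The browser demo isn't Python, but the temperature slider implements this exact idea. A sketch, where logits stands in for the model's 65 scores:

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=1.0):
    # temperature < 1 sharpens the distribution: safer, more repetitive text.
    # temperature > 1 flattens it: wilder, riskier guesses.
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)

logits = torch.randn(65)                    # stand-in scores
next_id = sample_next(logits, temperature=0.8)
```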

The model on this site has 825,000 parameters. GPT-4 is rumored to have ~1.8 trillion. Same playbook, same ten steps, at a very different scale.