02 / vocab
Build a vocabulary
We have 1.1 MB of Shakespeare. The model is going to do math on it — matrix multiplications, gradient descent, all of that. Math doesn't work on letters; it works on numbers. So step one of the actual code is to give every distinct character a number.
Sort the unique characters alphabetically, then assign each one its index. The kid we trained ended up with 65 characters total — every distinct symbol (letters in both cases, punctuation, space, newline) that appears anywhere in the Shakespeare file.
chars = sorted(set(text))                    # every distinct character, in a fixed order
VOCAB = len(chars)                           # vocabulary size
stoi = {c: i for i, c in enumerate(chars)}   # character -> integer ID
itos = {i: c for i, c in enumerate(chars)}   # integer ID -> character
encode = lambda s: [stoi[c] for c in s]      # text -> list of IDs
decode = lambda ids: "".join(itos[i] for i in ids)   # list of IDs -> text
Try it yourself
Type some text below and watch encode() turn it into integer IDs. Each badge shows a character and its assigned number from our vocab.
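Here is the same round trip as a runnable sketch, using a short stand-in string instead of the full Shakespeare file:

```python
text = "To be, or not to be"   # stand-in for the 1.1 MB Shakespeare file

chars = sorted(set(text))                    # 9 distinct characters in this toy text
stoi = {c: i for i, c in enumerate(chars)}   # character -> integer ID
itos = {i: c for i, c in enumerate(chars)}   # integer ID -> character
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

ids = encode("not")
print(ids)                     # [5, 6, 8]
print(decode(ids))             # "not"

# Encoding then decoding is lossless: every character maps to exactly one ID.
assert decode(encode(text)) == text
```

Note that the IDs depend entirely on which characters appear in `text`: a different corpus yields a different vocabulary and different numbers.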
Note: Real LLMs use byte-pair encoding — tokens are word-pieces averaging ~4 characters each. Same idea, different granularity. Character-level keeps the vocab tiny so we can show every weight on a single screen.
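To make the BPE comparison concrete, here is a minimal sketch of one merge step — count adjacent pairs, fuse the most frequent one into a single token. (This is an illustration of the idea, not the tokenizer any real model ships; `bpe_merge_step` is a hypothetical helper name.)

```python
from collections import Counter

def bpe_merge_step(tokens):
    """One byte-pair-encoding merge: fuse the most frequent adjacent pair."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]   # most frequent adjacent pair
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)          # fuse the pair into one token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("the thin theme")
tokens = bpe_merge_step(tokens)   # "th" appears three times, so it fuses first
print(tokens)                     # ['th', 'e', ' ', 'th', 'i', 'n', ' ', 'th', 'e', 'm', 'e']
```

Run this repeatedly and frequent character runs grow into word-pieces — the same character-level vocabulary idea, just coarser units.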