02 / vocab
Build a vocabulary
We have 1.1 MB of Shakespeare. The model is going to do math on it — matrix multiplications, gradient descent, all of that. Math doesn't work on letters; it works on numbers. So step one of the actual code is to give every distinct character a number.
Sort the unique characters alphabetically, then assign each one its index. The kid we trained ended up with 65 characters total — every distinct symbol (letters in both cases, punctuation, space, newline) that appears anywhere in the Shakespeare file.
chars = sorted(set(text))                    # every distinct character, in a fixed order
VOCAB = len(chars)                           # vocabulary size
stoi = {c: i for i, c in enumerate(chars)}   # character -> integer ID
itos = {i: c for i, c in enumerate(chars)}   # integer ID -> character
encode = lambda s: [stoi[c] for c in s]      # text -> list of IDs
decode = lambda ids: "".join(itos[i] for i in ids)   # list of IDs -> text
Try it yourself
Type some text below and watch encode() turn it into integer IDs. Each badge shows a character and its assigned number from our vocab.
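Here is the same round trip as a runnable sketch, using a short stand-in string instead of the full Shakespeare file:

```python
text = "To be, or not to be"   # stand-in for the 1.1 MB Shakespeare file

chars = sorted(set(text))                    # 9 distinct characters in this toy text
stoi = {c: i for i, c in enumerate(chars)}   # character -> integer ID
itos = {i: c for i, c in enumerate(chars)}   # integer ID -> character
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

ids = encode("not")
print(ids)                     # [5, 6, 8]
print(decode(ids))             # "not"

# Encoding then decoding is lossless: every character maps to exactly one ID.
assert decode(encode(text)) == text
```

Note that the IDs depend entirely on which characters appear in `text`: a different corpus yields a different vocabulary and different numbers.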
Note: Real LLMs use byte-pair encoding — tokens are word-pieces averaging ~4 characters each. Same idea, different granularity. Character-level keeps the vocab tiny so we can show every weight on a single screen.
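To make the BPE comparison concrete, here is a minimal sketch of one merge step — count adjacent pairs, fuse the most frequent one into a single token. (This is an illustration of the idea, not the tokenizer any real model ships; `bpe_merge_step` is a hypothetical helper name.)

```python
from collections import Counter

def bpe_merge_step(tokens):
    """One byte-pair-encoding merge: fuse the most frequent adjacent pair."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]   # most frequent adjacent pair
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)          # fuse the pair into one token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("the thin theme")
tokens = bpe_merge_step(tokens)   # "th" appears three times, so it fuses first
print(tokens)                     # ['th', 'e', ' ', 'th', 'i', 'n', ' ', 'th', 'e', 'm', 'e']
```

Run this repeatedly and frequent character runs grow into word-pieces — the same character-level vocabulary idea, just coarser units.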