03 / embeddings
Give each letter a meaning
An ID like 7 isn't useful by itself. The number 7 is one integer apart from 6 and 8 — but the character at position 7 in our vocab has nothing in common with positions 6 and 8.
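A minimal sketch makes the point concrete. This assumes the usual convention of sorting the text's unique characters into an ID table; the filename input.txt is hypothetical, not a detail from this section:

```python
text = open("input.txt").read()               # hypothetical training text
chars = sorted(set(text))                     # e.g. 65 distinct characters
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> integer ID

# IDs 6, 7, 8 are neighbors only by accident of sort order;
# the characters they name share nothing.
print(chars[6], chars[7], chars[8])
```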
So we hand each letter 128 numbers: a vector. They start out completely random; training nudges them until each letter's 128 numbers somehow encode what that letter "means." The full table is 65 × 128 = 8,320 learned numbers.
We don't pick what the dimensions mean. The kid invents them. Slot 0 means whatever is most useful for predicting the next character; same for slot 1, slot 2, all 128.
```python
self.tok_emb = nn.Embedding(VOCAB, N_EMBD)  # what each char means
```

That one line creates the table. The forward pass is just an indexing operation: hand it a character ID and it returns that row of 128 numbers.
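Here is that lookup as a runnable sketch; everything outside `nn.Embedding` itself (the variable names, the example IDs) is illustrative:

```python
import torch
import torch.nn as nn

VOCAB, N_EMBD = 65, 128                  # sizes from the text

tok_emb = nn.Embedding(VOCAB, N_EMBD)    # 65 x 128 table, randomly initialized
print(tok_emb.weight.shape)              # torch.Size([65, 128])

ids = torch.tensor([7, 6, 8])            # three character IDs
vectors = tok_emb(ids)                   # lookup: one 128-dim row per ID
print(vectors.shape)                     # torch.Size([3, 128])
```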
See the actual numbers
Pick two letters. Each one's 128-dim vector is shown as 128 vertical bars — taller bar = stronger signal on that dimension, up = positive, down = negative. The same dimension means the same thing for every letter. That's why you can compare them.
The bars come straight from the trained checkpoint, kid.pt. We don't know what each dimension means in human terms — it might be "is this a consonant" or "does this start a proper noun" or something we can't name. But whatever the kid invented, it was useful enough to bring the loss from 4.33 down to 1.48.
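If you want to poke at the trained vectors yourself, here's a hedged sketch. It assumes kid.pt stores a plain state dict with the embedding table under the key "tok_emb.weight"; both are guesses, so adjust to whatever the checkpoint actually contains:

```python
import torch
import torch.nn.functional as F

# Assumed layout: a state dict with the table at "tok_emb.weight".
state = torch.load("kid.pt", map_location="cpu")
table = state["tok_emb.weight"]          # shape (65, 128)

a, b = table[7], table[8]                # two letters' 128-dim vectors
print(F.cosine_similarity(a, b, dim=0))  # how aligned are they?
```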