04 / position

Tell the model where things are

Embeddings give every letter a meaning vector. But we've lost something important: order. The letter "E" five characters into a line should mean something different from an "E" at the very start of one. Right now they get the same 128 numbers.

The fix is simple and slightly unbelievable: make a second embedding table — this one indexed by position instead of by character. Then add the two vectors together. That's it. Each position number 0, 1, 2, … 127 gets its own learned 128-number vector that gets added on top.

Why does adding work? Because the model has 4 layers of attention after this to disentangle the "what" from the "where." Both signals are baked into the same vector, and training figures out how to use them.

train.py, lines 139–140, 145–148
self.tok_emb = nn.Embedding(VOCAB, N_EMBD)        # what each char means
self.pos_emb = nn.Embedding(BLOCK_SIZE, N_EMBD)   # where it is in the sequence

# in forward():
pos = torch.arange(T, device=idx.device)
x = self.tok_emb(idx) + self.pos_emb(pos)
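
A quick shape check you can run on its own (a sketch, not part of train.py; the vocab size of 65 and the little 4x32 batch are made up for illustration). It shows why the single add line works: PyTorch broadcasts the (T, 128) position tensor across the batch dimension of the (B, T, 128) token tensor.

import torch
import torch.nn as nn

VOCAB, BLOCK_SIZE, N_EMBD = 65, 128, 128       # 65 is a stand-in vocab size
tok_emb = nn.Embedding(VOCAB, N_EMBD)
pos_emb = nn.Embedding(BLOCK_SIZE, N_EMBD)

idx = torch.randint(0, VOCAB, (4, 32))         # pretend batch: 4 sequences, 32 chars each
pos = torch.arange(32)

print(tok_emb(idx).shape)                      # torch.Size([4, 32, 128]): one vector per char
print(pos_emb(pos).shape)                      # torch.Size([32, 128]):    one vector per slot
print((tok_emb(idx) + pos_emb(pos)).shape)     # torch.Size([4, 32, 128]): the position vectors
                                               #   are broadcast across the whole batch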

See it for one prompt

Below is the prompt "ROMEO: To be". Click any character to see (a) its token embedding, (b) the position embedding for that slot, and (c) the sum that actually enters the first transformer block.

tok_emb[?] for character "R"  +  pos_emb[0] for position 0 in the sequence  =  x[t], what the first transformer block actually sees
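
If you want to poke at the same arithmetic outside the widget, here is a minimal sketch. The stoi map and the untrained embedding tables are stand-ins for the real ones in train.py, so the numbers will differ, but the three pieces are the same.

import torch
import torch.nn as nn

prompt = "ROMEO: To be"
stoi = {ch: i for i, ch in enumerate(sorted(set(prompt)))}   # toy char-to-index map; the real one covers the full vocab
idx = torch.tensor([stoi[ch] for ch in prompt])              # (T,) token indices

N_EMBD = 128
tok_emb = nn.Embedding(len(stoi), N_EMBD)                    # untrained stand-ins, shapes only
pos_emb = nn.Embedding(len(prompt), N_EMBD)

t = 0                                                        # the "R" in slot 0
what  = tok_emb(idx[t])                                      # (a) its token embedding
where = pos_emb(torch.tensor(t))                             # (b) the position embedding for that slot
x_t   = what + where                                         # (c) the sum that enters the first block
print(x_t.shape)                                             # torch.Size([128])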
Notice: the position embeddings are learned, just like everything else — they were random at the start of training. Some other transformer designs use fixed sinusoidal patterns for positions (the original 2017 paper did). We do it Karpathy-style and let the kid invent its own.
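
For contrast, the fixed sinusoidal recipe from that 2017 paper looks roughly like this; it's a sketch of the alternative, not something train.py uses:

import math
import torch

def sinusoidal_positions(block_size, n_embd):
    pos = torch.arange(block_size).unsqueeze(1)                               # (T, 1)
    div = torch.exp(torch.arange(0, n_embd, 2) * (-math.log(10000.0) / n_embd))
    pe = torch.zeros(block_size, n_embd)
    pe[:, 0::2] = torch.sin(pos * div)                                        # even dims get sines
    pe[:, 1::2] = torch.cos(pos * div)                                        # odd dims get cosines
    return pe                                                                 # (T, C), never trained

print(sinusoidal_positions(128, 128).shape)                                   # torch.Size([128, 128])

Those waves never change during training; our table starts out random and learns whatever pattern it finds useful.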
Next → 05: Let positions look at each other — the heart of the transformer