
Atlas — the kid, all in one picture

Every operation. Every weight. Every shape. The full forward pass for one character of generation, on a single page. Use the chapters for why; this is the what, all in one view.

ROMEO:

Text in

The user types a prompt.

6 chars
prompt = "ROMEO:"

Tokenize

Each character becomes its index in the 65-char vocabulary.

ids = [vocab.index(c) for c in text]
6 ints: [30, 27, 25, 17, 27, 10]
vocab.json (65 entries)
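
A minimal sketch of this stage in Python; the vocab below is a stand-in for the real 65-entry list in vocab.json:

def tokenize(text, vocab):
    """Map each character to its index in the vocabulary list."""
    stoi = {ch: i for i, ch in enumerate(vocab)}  # char -> id lookup table
    return [stoi[ch] for ch in text]

# Stand-in vocabulary; the real model reads its 65 characters from vocab.json.
vocab = sorted(set("ROMEO: says\nhello, world!"))
print(tokenize("ROMEO:", vocab))  # six ints, one per character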

Embed + Position

Look up each char's meaning vector and add a position vector that says where it sits in the sequence.

x = tok_emb[ids] + pos_emb[positions]
6 × 128 matrix
tok_emb: 8,320
pos_emb: 16,384
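
A sketch of the lookup-and-add in NumPy, with random stand-ins for the two trained tables; the ids are the ones from the tokenize stage:

import numpy as np

rng = np.random.default_rng(0)
tok_emb = rng.normal(size=(65, 128))    # one meaning vector per character
pos_emb = rng.normal(size=(128, 128))   # one vector per sequence position

ids = [30, 27, 25, 17, 27, 10]          # "ROMEO:" as indices
positions = np.arange(len(ids))         # 0, 1, 2, 3, 4, 5
x = tok_emb[ids] + pos_emb[positions]   # row lookups, then elementwise add
print(x.shape)                          # (6, 128)

Indexing with a list of ids pulls out one row per character, so the shapes line up without a loop.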

Transformer block ×4

LayerNorm. 4-head causal self-attention. + residual. LayerNorm. 2-layer MLP (128→512→128) with ReLU. + residual. Repeat 4 times.

for blk in blocks: x = blk(x)
# inside blk: x = x + attn(ln1(x))
#             x = x + mlp(ln2(x))
4 blocks
197,888 weights / block
791,552 total
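
Here is one block spelled out in NumPy: a sketch under the shapes above (128-wide model, 4 heads, and assuming no biases on the qkv projection, which is the assumption that makes the count land on 197,888), not the project's actual code:

import numpy as np

D, H = 128, 4                               # model width, attention heads

def layer_norm(x, g, b, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    return g * (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps) + b

def softmax(z):
    z = z - z.max(-1, keepdims=True)        # subtract max, as in the softmax stage
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def attn(x, Wqkv, Wo, bo):
    T = x.shape[0]
    q, k, v = np.split(x @ Wqkv, 3, axis=-1)                  # each T × 128
    q, k, v = (a.reshape(T, H, D // H).transpose(1, 0, 2) for a in (q, k, v))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(D // H)       # H × T × T
    scores += np.triu(np.full((T, T), -np.inf), k=1)          # causal mask
    out = softmax(scores) @ v                                 # H × T × 32
    return out.transpose(1, 0, 2).reshape(T, D) @ Wo + bo

def mlp(x, W1, b1, W2, b2):
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2               # 128→512→128, ReLU

def blk(x, p):
    x = x + attn(layer_norm(x, p["g1"], p["b1"]), p["Wqkv"], p["Wo"], p["bo"])
    x = x + mlp(layer_norm(x, p["g2"], p["b2"]), p["W1"], p["c1"], p["W2"], p["c2"])
    return x

rng = np.random.default_rng(0)
p = {"g1": np.ones(D), "b1": np.zeros(D),                     # ln1: 256
     "Wqkv": rng.normal(0, 0.02, (D, 3 * D)),                 # 49,152, no bias
     "Wo": rng.normal(0, 0.02, (D, D)), "bo": np.zeros(D),    # 16,512
     "g2": np.ones(D), "b2": np.zeros(D),                     # ln2: 256
     "W1": rng.normal(0, 0.02, (D, 4 * D)), "c1": np.zeros(4 * D),
     "W2": rng.normal(0, 0.02, (4 * D, D)), "c2": np.zeros(D)}  # MLP: 131,712
x = rng.normal(size=(6, D))
print(blk(x, p).shape)                  # (6, 128): residuals preserve the shape
print(sum(w.size for w in p.values())) # 197,888, matching the count above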

Final LayerNorm

Stabilize the last position's vector before the projection to vocab.

h = ln_f(x[-1])
128 numbers
ln_f: 256 weights
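
The whole operation, as a sketch; gain and bias are the 256 learned weights, shown here at their usual starting values:

import numpy as np

def ln_f(h, gain, bias, eps=1e-5):
    """Zero the mean, scale to unit variance, then rescale and shift."""
    return gain * (h - h.mean()) / np.sqrt(h.var() + eps) + bias

h = np.random.default_rng(0).normal(3, 10, size=128)  # a drifted 128-vector
gain, bias = np.ones(128), np.zeros(128)              # ln_f's 256 weights
out = ln_f(h, gain, bias)
print(round(out.mean(), 3), round(out.std(), 3))      # ≈ 0.0 and ≈ 1.0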

Project to vocab (lm_head)

Multiply the 128-vector by the 65×128 lm_head matrix and add a 65-vector bias. One score per possible next character.

logits = h @ lm_head.weight.T + lm_head.bias
65 logits
lm_head: 8,385 weights
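
The same matmul in NumPy, with random stand-ins for the trained matrix and bias:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(65, 128))        # lm_head.weight: one row per character
b = np.zeros(65)                      # lm_head.bias
h = rng.normal(size=128)              # the normalized last-position vector

logits = h @ W.T + b                  # dot h against all 65 rows at once
print(logits.shape, W.size + b.size)  # (65,) and 8,385 weights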

Softmax

Turn logits into probabilities. Subtract max for numerical stability, then exp(x_i) / sum(exp(x_j)).

probs = softmax(logits / temperature)
65 probabilities
temperature scales spread
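
Written out with the stability trick from the description (a sketch, not library code):

import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()            # subtract max so exp() never overflows
    e = np.exp(z)
    return e / e.sum()         # exp(z_i) / sum(exp(z_j))

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))         # sums to 1; the ordering of logits is preserved
print(softmax(logits, 0.5))    # lower temperature → sharper distribution

Dividing by the temperature before the exp is what scales the spread: below 1 it exaggerates the gaps between logits, above 1 it flattens them.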

Sample

Draw a random number; walk the cumulative probability mass. At temperature 0, this collapses to argmax.

next_id = sample(probs)
1 character
feeds back into stage 1
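
A sketch of that cumulative walk (np.random.choice would do the same job in one call):

import numpy as np

def sample(probs, rng):
    """Walk the cumulative mass until it passes a uniform draw."""
    r = rng.random()                  # uniform in [0, 1)
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1             # guard against float round-off

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.7, 0.2])
print(sample(probs, rng))             # 1 most of the time, the heaviest index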

Total weights: 824,897

tok_emb 8,320 · pos_emb 16,384 · 4 blocks 791,552 · ln_f 256 · lm_head 8,385
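
The total is just those five pieces summed; a quick check:

parts = {"tok_emb": 65 * 128,         # 8,320
         "pos_emb": 128 * 128,        # 16,384
         "blocks":  4 * 197_888,      # 791,552
         "ln_f":    2 * 128,          # 256
         "lm_head": 65 * 128 + 65}    # 8,385
print(sum(parts.values()))            # 824,897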

Training is what fills these in. We start them at random and nudge them 5,000 times until the model can write Shakespeare-flavored text. → training

← Back to the start