
Atlas — the kid, all in one picture

Every operation. Every weight. Every shape. The full forward pass for one character of generation, on a single page. Use the chapters for why; this is the what, all in one view.

ROMEO:

Text in

The user types a prompt.

6 chars
prompt = "ROMEO:"

Tokenize

Each character becomes its index in the 65-char vocabulary.

ids = [vocab.index(c) for c in text]
6 ints: [30, 27, 25, 17, 27, 10]
vocab.json (65 entries)
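
A minimal sketch of this stage in Python; the vocab below is a stand-in for the real 65-entry list in vocab.json:

def tokenize(text, vocab):
    """Map each character to its index in the vocabulary list."""
    stoi = {ch: i for i, ch in enumerate(vocab)}  # char -> id lookup table
    return [stoi[ch] for ch in text]

# Stand-in vocabulary; the real model reads its 65 characters from vocab.json.
vocab = sorted(set("ROMEO: says\nhello, world!"))
print(tokenize("ROMEO:", vocab))  # six ints, one per character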

Embed + Position

Look up each char's meaning vector and add a position vector that says where it sits in the sequence.

x = tok_emb[ids] + pos_emb[positions]
6 × 128 matrix
tok_emb: 8,320
pos_emb: 16,384
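
A sketch of the lookup-and-add in NumPy, with random stand-ins for the two trained tables; the ids are the ones from the tokenize stage:

import numpy as np

rng = np.random.default_rng(0)
tok_emb = rng.normal(size=(65, 128))    # one meaning vector per character
pos_emb = rng.normal(size=(128, 128))   # one vector per sequence position

ids = [30, 27, 25, 17, 27, 10]          # "ROMEO:" as indices
positions = np.arange(len(ids))         # 0, 1, 2, 3, 4, 5
x = tok_emb[ids] + pos_emb[positions]   # row lookups, then elementwise add
print(x.shape)                          # (6, 128)

Indexing with a list of ids pulls out one row per character, so the shapes line up without a loop.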

Transformer block ×4

LayerNorm. 4-head causal self-attention. + residual. LayerNorm. 2-layer MLP (128→512→128) with ReLU. + residual. Repeat 4 times.

for blk in blocks: x = blk(x)
# inside blk: x = x + attn(ln1(x))
#             x = x + mlp(ln2(x))
4 blocks
197,888 weights / block
791,552 total
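
Here is one block spelled out in NumPy: a sketch under the shapes above (128-wide model, 4 heads, and assuming no biases on the qkv projection, which is the assumption that makes the count land on 197,888), not the project's actual code:

import numpy as np

D, H = 128, 4                               # model width, attention heads

def layer_norm(x, g, b, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    return g * (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps) + b

def softmax(z):
    z = z - z.max(-1, keepdims=True)        # subtract max, as in the softmax stage
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def attn(x, Wqkv, Wo, bo):
    T = x.shape[0]
    q, k, v = np.split(x @ Wqkv, 3, axis=-1)                  # each T × 128
    q, k, v = (a.reshape(T, H, D // H).transpose(1, 0, 2) for a in (q, k, v))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(D // H)       # H × T × T
    scores += np.triu(np.full((T, T), -np.inf), k=1)          # causal mask
    out = softmax(scores) @ v                                 # H × T × 32
    return out.transpose(1, 0, 2).reshape(T, D) @ Wo + bo

def mlp(x, W1, b1, W2, b2):
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2               # 128→512→128, ReLU

def blk(x, p):
    x = x + attn(layer_norm(x, p["g1"], p["b1"]), p["Wqkv"], p["Wo"], p["bo"])
    x = x + mlp(layer_norm(x, p["g2"], p["b2"]), p["W1"], p["c1"], p["W2"], p["c2"])
    return x

rng = np.random.default_rng(0)
p = {"g1": np.ones(D), "b1": np.zeros(D),                     # ln1: 256
     "Wqkv": rng.normal(0, 0.02, (D, 3 * D)),                 # 49,152, no bias
     "Wo": rng.normal(0, 0.02, (D, D)), "bo": np.zeros(D),    # 16,512
     "g2": np.ones(D), "b2": np.zeros(D),                     # ln2: 256
     "W1": rng.normal(0, 0.02, (D, 4 * D)), "c1": np.zeros(4 * D),
     "W2": rng.normal(0, 0.02, (4 * D, D)), "c2": np.zeros(D)}  # MLP: 131,712
x = rng.normal(size=(6, D))
print(blk(x, p).shape)                  # (6, 128): residuals preserve the shape
print(sum(w.size for w in p.values())) # 197,888, matching the count above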

Final LayerNorm

Stabilize the last position's vector before the projection to vocab.

h = ln_f(x[-1])
128 numbers
ln_f: 256 weights
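
The whole operation, as a sketch; gain and bias are the 256 learned weights, shown here at their usual starting values:

import numpy as np

def ln_f(h, gain, bias, eps=1e-5):
    """Zero the mean, scale to unit variance, then rescale and shift."""
    return gain * (h - h.mean()) / np.sqrt(h.var() + eps) + bias

h = np.random.default_rng(0).normal(3, 10, size=128)  # a drifted 128-vector
gain, bias = np.ones(128), np.zeros(128)              # ln_f's 256 weights
out = ln_f(h, gain, bias)
print(round(out.mean(), 3), round(out.std(), 3))      # ≈ 0.0 and ≈ 1.0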

Project to vocab (lm_head)

Multiply the 128-vector by the 65×128 lm_head matrix and add a 65-vector bias. One score per possible next character.

logits = h @ lm_head.weight.T + lm_head.bias
65 logits
lm_head: 8,385 weights
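
The same matmul in NumPy, with random stand-ins for the trained matrix and bias:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(65, 128))        # lm_head.weight: one row per character
b = np.zeros(65)                      # lm_head.bias
h = rng.normal(size=128)              # the normalized last-position vector

logits = h @ W.T + b                  # dot h against all 65 rows at once
print(logits.shape, W.size + b.size)  # (65,) and 8,385 weights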

Softmax

Turn logits into probabilities. Subtract max for numerical stability, then exp(x_i) / sum(exp(x_j)).

probs = softmax(logits / temperature)
65 probabilities
temperature scales spread
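
Written out with the stability trick from the description (a sketch, not library code):

import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()            # subtract max so exp() never overflows
    e = np.exp(z)
    return e / e.sum()         # exp(z_i) / sum(exp(z_j))

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))         # sums to 1; the ordering of logits is preserved
print(softmax(logits, 0.5))    # lower temperature → sharper distribution

Dividing by the temperature before the exp is what scales the spread: below 1 it exaggerates the gaps between logits, above 1 it flattens them.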

Sample

Draw a random number; walk the cumulative probability mass. At temperature 0, this collapses to argmax.

next_id = sample(probs)
1 character
feeds back into stage 1
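
A sketch of that cumulative walk (np.random.choice would do the same job in one call):

import numpy as np

def sample(probs, rng):
    """Walk the cumulative mass until it passes a uniform draw."""
    r = rng.random()                  # uniform in [0, 1)
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1             # guard against float round-off

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.7, 0.2])
print(sample(probs, rng))             # 1 most of the time, the heaviest index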

Total weights: 824,897

tok_emb 8,320 · pos_emb 16,384 · 4 blocks 791,552 · ln_f 256 · lm_head 8,385
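
The total is just those five pieces summed; a quick check:

parts = {"tok_emb": 65 * 128,         # 8,320
         "pos_emb": 128 * 128,        # 16,384
         "blocks":  4 * 197_888,      # 791,552
         "ln_f":    2 * 128,          # 256
         "lm_head": 65 * 128 + 65}    # 8,385
print(sum(parts.values()))            # 824,897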

Training is what fills these in. We start them at random and nudge them 5,000 times until the model can write Shakespeare-flavored text. → training

← Back to the start