Atlas — the kid, all in one picture
Every operation. Every weight. Every shape. The full forward pass for one character of generation, on a single page. Use the chapters for why; this is the what, all in one view.
Text in
The user types a prompt.
Tokenize
Each character becomes its index in the 65-char vocabulary.
ids = [vocab.indexOf(c) for c in text]
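Spelled out as a small Python sketch; the stand-in corpus string and the stoi dict are illustrative, standing in for however the page actually builds its 65-character vocab:
# sketch only: the real vocab is the 65 distinct characters of the training text
corpus_text = "ROMEO: O, speak again, bright angel!"   # stand-in corpus for the demo
vocab = sorted(set(corpus_text))
stoi = {ch: i for i, ch in enumerate(vocab)}            # char -> index, same job as vocab.indexOf
prompt = "ROMEO"
ids = [stoi[c] for c in prompt]
print(ids)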
Embed + Position
Look up each char's meaning vector and add a position vector that says where it sits in the sequence.
x = tok_emb[ids] + pos_emb[positions]
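A minimal NumPy sketch of the same lookup. The sizes (65×128 token table, 128×128 position table) follow from this page's weight counts; the random values and the example ids are stand-ins for the trained tables:
import numpy as np

d_model, vocab_size, block_size = 128, 65, 128           # sizes implied by this page's weight counts
tok_emb = np.random.randn(vocab_size, d_model) * 0.02    # stand-in for the trained 65×128 table
pos_emb = np.random.randn(block_size, d_model) * 0.02    # stand-in for the trained 128×128 table
ids = np.array([20, 43, 50, 50, 53])                     # illustrative token ids
positions = np.arange(len(ids))                          # 0, 1, 2, ...
x = tok_emb[ids] + pos_emb[positions]                    # (5, 128): meaning + where-it-sits
print(x.shape)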
Transformer block ×4
LayerNorm. 4-head causal self-attention. + residual. LayerNorm. 2-layer MLP (128→512→128) with ReLU. + residual. Repeat 4 times.
for blk in blocks: x = blk(x)
# inside each blk: x = x + attn(ln1(x))
#                  x = x + mlp(ln2(x))
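One block, written out as a NumPy sketch. The shapes (128-dim model, 4 heads of 32, 512-wide MLP, causal mask, ReLU) come from this page; the parameter names and the bias-free qkv projection are illustrative assumptions, not the page's exact code:
import numpy as np

d, n_head, d_head, d_mlp = 128, 4, 32, 512   # sizes from this page

def layernorm(x, gain, bias, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gain + bias

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def attn(x, p):
    T = x.shape[0]
    q, k, v = np.split(x @ p["W_qkv"], 3, axis=-1)       # (T, 128) each; no qkv bias assumed
    q = q.reshape(T, n_head, d_head).transpose(1, 0, 2)  # split 128 dims into 4 heads of 32
    k = k.reshape(T, n_head, d_head).transpose(1, 0, 2)
    v = v.reshape(T, n_head, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (4, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # causal mask: no peeking at later chars
    scores = np.where(future, -1e9, scores)
    out = softmax(scores) @ v                            # (4, T, 32)
    return out.transpose(1, 0, 2).reshape(T, d) @ p["W_proj"] + p["b_proj"]

def mlp(x, p):
    h = np.maximum(x @ p["W_fc"] + p["b_fc"], 0)         # 128 -> 512, ReLU
    return h @ p["W_out"] + p["b_out"]                   # 512 -> 128

def block(x, p):
    x = x + attn(layernorm(x, p["g1"], p["b1"]), p)      # ln1 -> attention -> residual
    x = x + mlp(layernorm(x, p["g2"], p["b2"]), p)       # ln2 -> MLP -> residual
    return x

rng = np.random.default_rng(0)
p = {"g1": np.ones(d), "b1": np.zeros(d), "g2": np.ones(d), "b2": np.zeros(d),
     "W_qkv": rng.normal(0, 0.02, (d, 3 * d)),
     "W_proj": rng.normal(0, 0.02, (d, d)), "b_proj": np.zeros(d),
     "W_fc": rng.normal(0, 0.02, (d, d_mlp)), "b_fc": np.zeros(d_mlp),
     "W_out": rng.normal(0, 0.02, (d_mlp, d)), "b_out": np.zeros(d)}
x = rng.normal(0, 1, (5, d))       # 5 positions in, 5 positions out
for _ in range(4):                 # the real model has separate weights per block
    x = block(x, p)
print(x.shape)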
Final LayerNorm
Stabilize the last position's vector before the projection to vocab.
h = ln_f(x[-1])
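A sketch of what ln_f does to that one vector; ln_f's 256 weights are a 128-wide gain and a 128-wide bias, shown here at their untrained defaults:
import numpy as np

def ln_f(v, gain, bias, eps=1e-5):
    return (v - v.mean()) / np.sqrt(v.var() + eps) * gain + bias

x = np.random.randn(5, 128)                # stand-in output of the last block, 5 positions
gain, bias = np.ones(128), np.zeros(128)   # ln_f's 256 weights: one gain and one bias per dimension
h = ln_f(x[-1], gain, bias)                # only the last position is used to predict the next char
print(h.shape, float(h.mean()), float(h.std()))   # (128,), mean near 0, std near 1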
Project to vocab (lm_head)
Multiply the 128-vector by the 65×128 lm_head matrix and add a 65-vector bias. One score per possible next character.
logits = h @ lm_head.weight.T + lm_head.bias
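Shapes only, with random stand-ins for the trained matrix and bias; each logit is the dot product of h with one 128-dim row, plus that row's bias:
import numpy as np

h = np.random.randn(128)              # stand-in for the normalized last-position vector
W = np.random.randn(65, 128) * 0.02   # stand-in for lm_head.weight: one 128-dim row per character
b = np.zeros(65)                      # stand-in for lm_head.bias
logits = h @ W.T + b                  # (65,): one raw score per character
print(logits.shape)                   # 65 × 128 + 65 = 8,385 weights in this step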
Softmax
Turn logits into probabilities. Subtract max for numerical stability, then exp(x_i) / sum(exp(x_j)).
probs = softmax(logits / temperature)
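The same step on a toy 3-character example, including the temperature division from the line above:
import numpy as np

def softmax(z):
    z = z - z.max()        # shifting by the max leaves the result unchanged but keeps exp() finite
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0])    # toy scores for 3 characters
print(softmax(logits))                 # probabilities, sum to 1
print(softmax(logits / 0.5))           # temperature < 1: sharper, favors the top score
print(softmax(logits / 2.0))           # temperature > 1: flatter, more adventurous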
Sample
Draw a random number; walk the cumulative probability mass. At temperature 0, this collapses to argmax.
next_id = sample(probs)
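A sketch of the draw-and-walk sampling described above, on a toy 3-way distribution; the real step walks the 65 character probabilities the same way:
import numpy as np

def sample(probs, rng):
    r = rng.random()                                   # uniform draw in [0, 1)
    idx = int(np.searchsorted(np.cumsum(probs), r))    # first bucket whose cumulative mass passes r
    return min(idx, len(probs) - 1)                    # guard against float round-off at the top

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.6, 0.3])                      # toy stand-in for the 65 character probabilities
draws = [sample(probs, rng) for _ in range(10_000)]
print(np.bincount(draws, minlength=3) / 10_000)        # roughly [0.1, 0.6, 0.3]
# at temperature 0 the whole step collapses to int(np.argmax(probs))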
Total weights: 824,897
tok_emb 8,320 · pos_emb 16,384 · 4 blocks 791,552 · ln_f 256 · lm_head 8,385
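The same total, reproduced as arithmetic. The split inside a block is an inference: the numbers only land on 791,552 if the qkv projection carries no bias, so treat that detail as a guess that fits rather than something stated on this page:
d, vocab, ctx, d_mlp = 128, 65, 128, 512
tok_emb = vocab * d                          # 8,320
pos_emb = ctx * d                            # 16,384
ln = 2 * d                                   # 256 per LayerNorm (gain + bias)
attn = d * 3 * d + (d * d + d)               # qkv (no bias assumed) + output projection = 65,664
mlp = (d * d_mlp + d_mlp) + (d_mlp * d + d)  # 66,048 + 65,664 = 131,712
block = ln + attn + ln + mlp                 # 197,888 per block
lm_head = vocab * d + vocab                  # 8,385
print(tok_emb + pos_emb + 4 * block + ln + lm_head)   # 824,897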
Training is what fills these in. We start them at random and nudge them 5,000 times until the model can write Shakespeare-flavored text. → training