Deep Dive
From raw text to coherent language — the mechanics behind the magic. Interactive demos included.
01 — The Problem
Before transformers, the dominant approach was Recurrent Neural Networks (RNNs). They processed text one word at a time, left to right, carrying a "hidden state" — a compressed memory of everything seen so far. This created two fatal problems:
The first problem: that hidden state is a fixed-size bottleneck. Information from early tokens gets overwritten as the sequence grows, and vanishing gradients make long-range dependencies very hard to learn.
The second problem: RNNs are inherently sequential. Each step depends on the previous one. You can't parallelize across a sequence. With modern hardware (GPUs with thousands of cores), this means most of your compute sits idle.
In 2017, Google Brain published "Attention Is All You Need" and threw away the recurrence entirely. The transformer processes all tokens simultaneously and lets every token directly "look at" every other token. That's the fundamental shift.
02 — Step One: Tokenization
LLMs don't see words or characters — they see tokens. A token is typically a chunk of 3–4 characters, but it varies. "transformers" might become ["transform", "ers"]. "unhappiness" becomes ["un", "happiness"]. Punctuation is usually its own token.
The most common algorithm is Byte-Pair Encoding (BPE): start with individual characters, then greedily merge the most frequent adjacent pairs until you reach a target vocabulary size (GPT-4's tokenizer has a vocabulary of ~100,000 tokens). This handles any language and rare words gracefully by falling back to subwords.
Note: This is a simplified demonstration. Real BPE tokenizers use learned merge rules over a fixed vocabulary.
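To make the merge loop concrete, here is a minimal sketch in Python: count adjacent symbol pairs across a tiny corpus and greedily merge the most frequent one. The corpus and number of merges are invented for illustration; this is not the actual GPT-4 tokenizer, which ships with pre-trained merge rules.

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE: learn merge rules from a tiny corpus (illustrative only)."""
    # Represent each word as a tuple of symbols, starting from single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Merge every occurrence of that pair into a single new symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

print(bpe_train(["low", "lower", "lowest", "newest", "widest"], num_merges=5))
```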
Each token is assigned an integer ID. "The" might be token 464. These IDs are the model's actual input — nothing else. The entire model operates on these numbers.
03 — Step Two: Embeddings
An integer ID like 464 carries no meaning on its own. The first layer of a transformer is an embedding lookup table — a learned matrix where each row is a high-dimensional vector (e.g., 768 dimensions for GPT-2, 12,288 for GPT-3).
Token 464 ("The") maps to some vector [0.21, -0.08, 0.94, ...]. This vector is learned during training and encodes semantic information. Famous result: in the embedding space, king - man + woman ≈ queen. Geometry encodes meaning.
Why 768 dimensions? More dimensions = more capacity to encode distinctions, but more compute. It's a hyperparameter tuned during model design. Think of each dimension as a learned "feature axis" — the model decides what those axes mean.
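A minimal sketch of what the lookup actually is: indexing rows of a matrix. The vocabulary size and the random matrix below are placeholders; in a real model the matrix is learned during training.

```python
import numpy as np

vocab_size, d_model = 50_000, 768  # illustrative sizes (GPT-2-ish width)
rng = np.random.default_rng(0)

# In a real model this matrix is learned; here it's random just to show the mechanics.
embedding_table = rng.standard_normal((vocab_size, d_model), dtype=np.float32)

token_ids = np.array([464, 2368, 318])      # e.g. IDs produced by the tokenizer
token_vectors = embedding_table[token_ids]  # lookup = indexing rows of the matrix

print(token_vectors.shape)                  # (3, 768): one 768-dim vector per token
```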
04 — Step Three: Positional Encoding
Since transformers process all tokens simultaneously (no left-to-right order), they'd have no idea where in the sequence each token sits. "The cat chased the dog" and "The dog chased the cat" would look identical without position information.
The solution: add a position embedding to each token embedding. The original transformer used a clever mathematical formula — sines and cosines at different frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
Different dimensions of the embedding vector oscillate at different frequencies — like a binary counter but continuous. This lets the model infer relative distances between tokens. Modern LLMs use learned positional embeddings or more sophisticated schemes like RoPE (Rotary Position Embedding).
Each row = one position in the sequence. Each column = one dimension. Color encodes sine/cosine value from -1 (dark) to +1 (light). Notice how the first dimensions oscillate rapidly (high frequency) while later dimensions change more and more slowly.
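Here is a short sketch that computes the sinusoidal matrix from the formula above; the sequence length and dimension are arbitrary illustrative values.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Original 'Attention Is All You Need' encoding: sin on even dims, cos on odd dims."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]            # (1, d_model/2), the 2i index
    angles = positions / (10_000 ** (dims / d_model))   # low dims: fast; high dims: slow
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(seq_len=50, d_model=64)
print(pe.shape)  # (50, 64); each row gets added to the token embedding at that position
```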
05 — The Magic Sauce: Self-Attention
This is where transformers are fundamentally different from everything before. For each token, self-attention asks: which other tokens in this sequence are relevant to understanding me? And it answers that question for every token simultaneously.
Each token embedding is linearly projected into three vectors — Q (query), K (key), V (value) — using learned weight matrices: Q = x·W_Q, K = x·W_K, V = x·W_V. Attention scores are the dot products Q·K^T, scaled by √d_k and passed through a softmax; the output is those weights applied to V.
The analogy: imagine a library. A Query is your search request. Each book has a Key (its index card). You compare your Query against all Keys to get a relevance score. Then you retrieve the book's content — its Value. The attention output is a weighted sum of all Values, weighted by how well each Key matched your Query.
Click any word to see what it attends to. Brighter green = stronger attention weight.
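A minimal single-head sketch of scaled dot-product self-attention, assuming random toy weights and omitting the causal mask that GPT-style models apply so tokens cannot attend to the future.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over one sequence (no masking)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v  # project tokens into queries/keys/values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # every token scored against every other token
    weights = softmax(scores, axis=-1)   # each row sums to 1: "how much do I attend to you?"
    return weights @ V, weights          # output = attention-weighted mix of values

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8      # illustrative sizes
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, weights = self_attention(x, W_q, W_k, W_v)
print(out.shape, weights.shape)          # (5, 8) (5, 5)
```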
One set of Q/K/V matrices can only capture one kind of relationship. Real transformers run multiple attention heads in parallel — each with its own W_Q, W_K, W_V matrices. GPT-2 has 12 heads, GPT-3 has 96.
Different heads specialize spontaneously during training. Research has found heads that track subject-verb agreement, coreference (pronoun → noun), positional dependencies, and syntactic structure — none of this was programmed in.
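A sketch of the multi-head mechanics: project once, split the projection into per-head slices, attend in parallel, concatenate, and mix with an output projection. The sizes and random weights below are illustrative, not any real model's.

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """Run n_heads attention operations in parallel, then concatenate and mix."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    Q, K, V = x @ W_q, x @ W_k, x @ W_v

    def split(t):
        # Give each head its own d_head-sized slice: (heads, seq, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)         # (heads, seq, seq)
    scores = scores - scores.max(axis=-1, keepdims=True)          # softmax, head by head
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    heads = weights @ Vh                                          # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)   # stitch heads back together
    return concat @ W_o                                           # final output projection

rng = np.random.default_rng(1)
seq_len, d_model, n_heads = 6, 48, 12                             # GPT-2 uses 12 heads
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads).shape)  # (6, 48)
```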
06 — Full Picture
Self-attention alone isn't enough. Each transformer "block" (or "layer") wraps attention with two critical additions: a feed-forward network and residual connections + layer normalization.
Each sub-layer's output is added back to its input: output = sublayer(x) + x. This creates a "highway" for gradients to flow directly through deep networks without passing through every transformation. Without residuals, 96-layer models like GPT-3 would be untrainable.
Normalizes activations within each layer to have zero mean and unit variance. Stabilizes training by preventing any single layer from producing explosively large or vanishingly small values.
After attention mixes information across tokens, each token is processed individually through a 2-layer MLP with a non-linearity (usually GELU): FFN(x) = GELU(x·W1 + b1)·W2 + b2, where the hidden layer is typically 4× wider than the embedding (3072 for GPT-2's 768).
GPT-2: 12 layers. GPT-3: 96 layers. GPT-4: ~120 layers (estimated). Each layer refines the representation — early layers handle syntax and local context, later layers handle semantics, reasoning, and world knowledge.
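Putting the pieces together, here is a sketch of one block in the pre-norm arrangement GPT-2 uses: layer norm, a sub-layer, and a residual add, applied twice. The attention sub-layer is stubbed out with an identity function to keep the example short (see the attention sketch above), and the weights are random placeholders.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise 2-layer MLP; hidden width is typically 4x the model width."""
    return gelu(x @ W1 + b1) @ W2 + b2

def transformer_block(x, attention_fn, ffn_params):
    """One block: (attention + residual) then (feed-forward + residual), each layer-normed."""
    x = x + attention_fn(layer_norm(x))                # residual "highway" around attention
    x = x + feed_forward(layer_norm(x), *ffn_params)   # residual around the MLP
    return x

rng = np.random.default_rng(2)
seq_len, d_model = 5, 64
x = rng.normal(size=(seq_len, d_model))
ffn_params = (rng.normal(size=(d_model, 4 * d_model)) * 0.02,
              np.zeros(4 * d_model),
              rng.normal(size=(4 * d_model, d_model)) * 0.02,
              np.zeros(d_model))
# Stand-in for self-attention; identity keeps the example short.
print(transformer_block(x, attention_fn=lambda t: t, ffn_params=ffn_params).shape)  # (5, 64)
```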
07 — How It Learns
The training objective is devastatingly simple: given the previous tokens, predict the next one. That's it. No labels, no human annotations needed — just raw text from the internet.
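In code, that objective is just cross-entropy between the model's logits at position t and the actual token at position t+1, averaged over the sequence. The sketch below uses made-up logits; a real run would use the model's outputs over billions of tokens.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Average cross-entropy of predicting token t+1 from the logits at position t."""
    # logits: (seq_len, vocab_size) from the model; token_ids: (seq_len,) the actual text
    preds = logits[:-1]              # predictions at positions 0..n-2
    targets = token_ids[1:]          # the "right answers": each next token
    shifted = preds - preds.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(3)
vocab, seq = 1000, 6                 # illustrative sizes
print(next_token_loss(rng.normal(size=(seq, vocab)), rng.integers(0, vocab, size=seq)))
```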
Raw pre-trained models are good predictors but bad assistants — they complete text, they don't answer questions helpfully. Post-training aligns them:
Supervised Fine-Tuning (SFT): Train on human-written (prompt, ideal response) pairs. Teaches the model to act as an assistant.
RLHF (Reinforcement Learning from Human Feedback): Humans rank model outputs. A reward model is trained on these rankings. The policy (LLM) is then optimized with PPO to maximize the reward model's score. This is what makes Claude prefer helpful, honest, harmless responses.
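As one concrete piece of that pipeline, here is a sketch of the pairwise loss typically used to train the reward model: push the score of the human-preferred response above the rejected one. The scores are invented, and the PPO policy-optimization step is omitted entirely.

```python
import numpy as np

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected)."""
    # The loss shrinks as the margin between preferred and rejected responses grows.
    margin = reward_chosen - reward_rejected
    return float(np.mean(np.log1p(np.exp(-margin))))

# Toy scores the reward model assigned to (chosen, rejected) response pairs
print(reward_model_loss(np.array([2.1, 0.3, 1.5]), np.array([0.4, 0.9, 1.4])))
```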
08 — Inference
At inference time, the model produces a vector of raw scores (logits) over the entire vocabulary for the next position. These get converted to probabilities via softmax — then one token is sampled. That token is appended, and the process repeats (this is called autoregressive generation).
How you sample from the probability distribution dramatically changes the output. The key parameter is temperature: the logits are divided by the temperature before the softmax, so values below 1 sharpen the distribution toward the most likely tokens and values above 1 flatten it toward uniform randomness.
Context: "The capital of France is" — adjust temperature to see how the next-token probability distribution changes.
Top-p (nucleus) sampling: instead of sampling from the full vocabulary, restrict to the smallest set of tokens whose cumulative probability exceeds p (e.g., 0.9). This avoids the long tail of absurd tokens while still allowing diversity.
Top-k sampling: only sample from the k most likely tokens (e.g., top 40). Simpler than top-p but less adaptive to the distribution's shape.
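Both truncation schemes are a few lines each; the sketch below filters a toy five-token distribution with top-k and top-p and renormalizes what remains.

```python
import numpy as np

def filter_top_k(probs, k):
    """Zero out everything but the k most likely tokens, then renormalize."""
    cutoff = np.sort(probs)[-k]
    kept = np.where(probs >= cutoff, probs, 0.0)
    return kept / kept.sum()

def filter_top_p(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]                   # most likely first
    cumulative = np.cumsum(probs[order])
    n_keep = int(np.searchsorted(cumulative, p) + 1)  # first index where the running total reaches p
    kept = np.zeros_like(probs)
    kept[order[:n_keep]] = probs[order[:n_keep]]
    return kept / kept.sum()

probs = np.array([0.50, 0.25, 0.15, 0.06, 0.04])       # toy next-token distribution
print(filter_top_k(probs, k=2))                        # only the top 2 tokens remain
print(filter_top_p(probs, p=0.9))                      # top 3 tokens cover 0.90 cumulative
```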
09 — The Deep Question
Something strange happens when you scale transformers. Capabilities that simply don't exist at smaller scales appear suddenly at larger ones — this is called emergence. Multi-step arithmetic: absent at 1B parameters, present at 100B. Chain-of-thought reasoning: absent in small models, present in large ones.
The billions of floating-point numbers in a model's weights encode everything the model "knows": grammar and syntax, facts about the world, semantic associations, reasoning patterns.
We don't fully understand how this information is stored and retrieved. We can observe attention heads and probe activations, but the mapping from weights to capabilities is opaque — this is the core challenge of interpretability research.
Models can fail on simple tasks while succeeding at far harder ones. They can be confident and wrong. They can be right for the wrong reasons. The next frontier is making these systems understand rather than just predict — and we're still figuring out what that distinction even means.