How Transformers and LLMs Actually Work

From raw text to coherent language — the mechanics behind the magic. Interactive demos included.

Why did we need something new?

Before transformers, the dominant approach was Recurrent Neural Networks (RNNs). They processed text one word at a time, left to right, carrying a "hidden state" — a compressed memory of everything seen so far. This created two fatal problems:

The vanishing gradient problem: When you train on sequences hundreds of tokens long, gradients — the signal used to update weights — become infinitesimally small by the time they travel back to early tokens. The network effectively forgets the beginning of long sequences.

The sequential bottleneck: RNNs are inherently sequential. Each step depends on the previous one, so you can't parallelize across a sequence. On modern hardware (GPUs with thousands of cores), that leaves most of your compute sitting idle.

In 2017, Google Brain published "Attention Is All You Need" and threw away the recurrence entirely. The transformer processes all tokens simultaneously and lets every token directly "look at" every other token. That's the fundamental shift.

The key insight: Instead of passing information sequentially through time, let every word in a sentence attend directly to every other word — in parallel — and let the model learn which relationships matter.

Tokenization: breaking text into pieces

LLMs don't see words or characters — they see tokens. A token is typically a chunk of 3–4 characters, but it varies. "transformers" might become ["transform", "ers"]. "unhappiness" becomes ["un", "happiness"]. Punctuation is usually its own token.

The most common algorithm is Byte-Pair Encoding (BPE): start with individual characters, then greedily merge the most frequent pairs until you reach a vocabulary size (GPT-4 uses ~100,000 tokens). This handles any language and rare words gracefully by falling back to subwords.
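
To make the merge loop concrete, here is a toy Python sketch of BPE training on a four-word corpus (illustrative only; real tokenizers operate on bytes, maintain a full vocabulary, and record the merge rules for later use):

# Toy sketch of BPE training: repeatedly merge the most frequent adjacent pair
from collections import Counter

corpus = [list("low"), list("lower"), list("lowest"), list("newest")]  # tiny "training corpus"
for step in range(5):                                  # a few merge steps for illustration
    pairs = Counter()
    for word in corpus:
        pairs.update(zip(word, word[1:]))              # count adjacent symbol pairs
    if not pairs:
        break
    (a, b), count = pairs.most_common(1)[0]            # the most frequent pair wins
    for word in corpus:                                # merge every occurrence of that pair
        i = 0
        while i < len(word) - 1:
            if word[i] == a and word[i + 1] == b:
                word[i:i + 2] = [a + b]
            i += 1
print(corpus)  # each word is now a sequence of learned subword units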

Interactive — Tokenizer Playground

Note: This is a simplified demonstration. Real BPE tokenizers use learned merge rules over a fixed vocabulary.

Each token is assigned an integer ID. "The" might be token 464. These IDs are the model's actual input — nothing else. The entire model operates on these numbers.
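
If you want to see real token IDs, the tiktoken library (one common tokenizer implementation, assumed installed here; other BPE tokenizers work the same way) exposes GPT-2's vocabulary directly:

# Encode text with GPT-2's BPE vocabulary (requires the tiktoken package)
import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("The transformer model")         # text -> integer token IDs
print(ids)                                        # a short list of integers
print([enc.decode([i]) for i in ids])             # each ID mapped back to its text chunk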


Embeddings: tokens become vectors

An integer ID like 464 carries no meaning on its own. The first layer of a transformer is an embedding lookup table — a learned matrix where each row is a high-dimensional vector (e.g., 768 dimensions for GPT-2, 12,288 for GPT-3).

Token 464 ("The") maps to some vector [0.21, -0.08, 0.94, ...]. This vector is learned during training and encodes semantic information. Famous result: in the embedding space, king - man + woman ≈ queen. Geometry encodes meaning.

# Embedding lookup: one learned 768-dimensional vector per vocabulary entry
import torch
import torch.nn as nn

E = nn.Embedding(num_embeddings=50257, embedding_dim=768)   # GPT-2-sized vocabulary
token_ids = torch.tensor([464, 47356, 2746])                # ["The", "transform", "er", ...]
x = E(token_ids)                                            # shape: [seq_len, 768]

Why 768 dimensions? More dimensions = more capacity to encode distinctions, but more compute. It's a hyperparameter tuned during model design. Think of each dimension as a learned "feature axis" — the model decides what those axes mean.

Important: Unlike word2vec, where embeddings are trained as a separate, standalone step, transformer embeddings are trained end-to-end with the rest of the model. They shift and adapt to whatever makes the whole system work best.

Positional encoding: where are you in the sequence?

Since transformers process all tokens simultaneously (no left-to-right order), they'd have no idea where in the sequence each token sits. "The cat chased the dog" and "The dog chased the cat" would look identical without position information.

The solution: add a position embedding to each token embedding. The original transformer used a clever mathematical formula — sines and cosines at different frequencies:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Different dimensions of the embedding vector oscillate at different frequencies — like a binary counter but continuous. This lets the model infer relative distances between tokens. Modern LLMs use learned positional embeddings or more sophisticated schemes like RoPE (Rotary Position Embedding).
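
As a concrete reference, here is a short PyTorch sketch of the sinusoidal formula above (a minimal implementation, not the code from any particular model):

# Minimal sketch of sinusoidal positional encodings
import torch

def sinusoidal_pe(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len).unsqueeze(1)          # [seq_len, 1]
    i = torch.arange(0, d_model, 2)                   # even dimension indices
    freq = 10000 ** (i / d_model)                     # one frequency per dimension pair
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos / freq)               # even dims: sine
    pe[:, 1::2] = torch.cos(pos / freq)               # odd dims: cosine
    return pe                                         # added to the token embeddings

pe = sinusoidal_pe(seq_len=128, d_model=768)          # shape: [128, 768]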

Interactive — Positional Encoding Visualization

Each row = one position in the sequence. Each column = one dimension. Color encodes sine/cosine value from -1 (dark) to +1 (light). Notice how earlier dimensions change slowly (low frequency) while later dimensions oscillate rapidly.


Self-Attention: the mechanism that changed everything

This is where transformers are fundamentally different from everything before. For each token, self-attention asks: which other tokens in this sequence are relevant to understanding me? And it answers that question for every token simultaneously.

Query, Key, Value

Each token embedding is linearly projected into three vectors — Q, K, V — using learned weight matrices:

Q = X @ W_Q   # "What am I looking for?"
K = X @ W_K   # "What do I contain?"
V = X @ W_V   # "What do I contribute if selected?"

The analogy: imagine a library. A Query is your search request. Each book has a Key (its index card). You compare your Query against all Keys to get a relevance score. Then you retrieve the book's content — its Value. The attention output is a weighted sum of all Values, weighted by how well each Key matched your Query.

Attention(Q, K, V) = softmax( Q @ K^T / sqrt(d_k) ) @ V
# sqrt(d_k): scaling that keeps dot products from growing with dimension, so softmax doesn't saturate
# softmax converts raw scores to probabilities that sum to 1
# Result: for each token, a weighted blend of all other tokens' values
Why this works: The model learns W_Q, W_K, W_V so that tokens that are semantically related end up with high Q·K dot products. "It" in "The cat sat because it was tired" learns to produce a Query that matches the Key of "cat". This is learned from data — not hardcoded.
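
In code, one attention head is only a few lines. The sketch below uses random vectors as placeholder token embeddings; the learned projections would normally come from a trained model:

# Sketch: single-head scaled dot-product self-attention, as defined above
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_k = 768, 64
W_Q = nn.Linear(d_model, d_k, bias=False)    # learned projection: "what am I looking for?"
W_K = nn.Linear(d_model, d_k, bias=False)    # learned projection: "what do I contain?"
W_V = nn.Linear(d_model, d_k, bias=False)    # learned projection: "what do I contribute?"

X = torch.randn(10, d_model)                 # 10 token embeddings (placeholder values)
Q, K, V = W_Q(X), W_K(X), W_V(X)             # [10, d_k] each
scores = Q @ K.T / math.sqrt(d_k)            # pairwise relevance, scaled by sqrt(d_k)
weights = F.softmax(scores, dim=-1)          # each row sums to 1
out = weights @ V                            # weighted blend of values, shape [10, d_k]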

Interactive Attention Heatmap

Click any word to see what it attends to. Brighter green = stronger attention weight.

Attention Explorer — "The cat sat because it was tired"
Row = attending FROM. Column = attending TO. Color encodes attention weight from low (dark) to high (bright green).
Click a word label on the left to highlight its attention pattern.

Multi-Head Attention

One set of Q/K/V matrices can only capture one kind of relationship. Real transformers run multiple attention heads in parallel — each with its own W_Q, W_K, W_V matrices. GPT-2 has 12 heads, GPT-3 has 96.

Different heads specialize spontaneously during training. Research has found heads that track subject-verb agreement, coreference (pronoun → noun), positional dependencies, and syntactic structure — none of this was programmed in.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) @ W_O
head_i = Attention(Q @ W_Qi, K @ W_Ki, V @ W_Vi)
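
PyTorch ships this as a ready-made module; a minimal self-attention call (with random placeholder inputs) might look like this:

# Sketch: PyTorch's built-in multi-head attention (12 heads, as in GPT-2)
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
x = torch.randn(1, 10, 768)                  # [batch, seq_len, d_model], placeholder values
out, attn_weights = mha(x, x, x)             # self-attention: Q, K, V all come from x
print(out.shape, attn_weights.shape)         # [1, 10, 768] and [1, 10, 10] (heads averaged)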

The Transformer Block

Self-attention alone isn't enough. Each transformer "block" (or "layer") wraps attention with two critical additions: a feed-forward network and residual connections + layer normalization.

Architecture of the stack, bottom to top:
Input Tokens (IDs) → Token + Position Embeddings → N × Transformer Blocks → Linear + Softmax → Output Tokens (Logits)
Inside each block: Multi-Head Self-Attention + Residual → Layer Norm → Feed-Forward (MLP) + Residual → Layer Norm

Residual Connections

Each sub-layer's output is added back to its input: output = sublayer(x) + x. This creates a "highway" for gradients to flow directly through deep networks without passing through every transformation. Without residuals, 96-layer models like GPT-3 would be untrainable.

Layer Normalization

Rescales each token's activation vector to zero mean and unit variance (followed by a learned scale and shift). This stabilizes training by preventing any single layer from producing explosively large or vanishingly small values.

Feed-Forward Network

After attention mixes information across tokens, each token is processed individually through a 2-layer MLP with a non-linearity (usually GELU):

FFN(x) = GELU(x @ W_1 + b_1) @ W_2 + b_2
# Inner dimension is 4x larger than d_model
# GPT-3: d_model=12288, inner dim=49152
# This is where ~2/3 of parameters live — it stores "factual" knowledge
What does FFN actually do? Research suggests it acts as a key-value memory store — particular input patterns activate specific "memories" (facts, patterns) stored in the weights. Attention figures out what to look up; FFN does the lookup.
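
Putting the pieces together, here is a simplified sketch of one block. It uses the pre-norm layout found in GPT-2 (LayerNorm before each sub-layer, rather than after as in the original paper) and omits dropout and the causal mask:

# Sketch: one pre-norm transformer block (simplified)
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                  # 2-layer MLP with 4x inner dimension
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):                          # x: [batch, seq_len, d_model]
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]              # attention sub-layer + residual
        x = x + self.ffn(self.ln2(x))              # feed-forward sub-layer + residual
        return x

block = TransformerBlock()
y = block(torch.randn(1, 10, 768))                 # shape preserved: [1, 10, 768]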

Stack N layers deep

GPT-2: 12 layers. GPT-3: 96 layers. GPT-4: ~120 layers (estimated). Each layer refines the representation — early layers handle syntax and local context, later layers handle semantics, reasoning, and world knowledge.


Training: next-token prediction at massive scale

The training objective is devastatingly simple: given the previous tokens, predict the next one. That's it. No labels, no human annotations needed — just raw text from the internet.

Why does next-token prediction lead to reasoning? To accurately predict "The Eiffel Tower is located in", the model must "know" Paris. To predict the next word in a complex argument, it must "understand" the argument. Prediction forces compression of world knowledge.
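
In code, that objective is a single cross-entropy loss on shifted targets. The sketch below uses a toy stand-in for the model (just an embedding and an output head) to keep it self-contained; a real LLM puts the full transformer stack in between:

# Sketch: the next-token prediction objective (cross-entropy on shifted targets)
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 50257, 64
model = nn.Sequential(nn.Embedding(vocab_size, d_model),     # toy stand-in "model"
                      nn.Linear(d_model, vocab_size))

tokens = torch.randint(0, vocab_size, (1, 128))              # [batch, seq_len] of token IDs
inputs, targets = tokens[:, :-1], tokens[:, 1:]              # predict token t+1 from tokens up to t
logits = model(inputs)                                       # [batch, seq_len-1, vocab_size]
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                              # this one loss trains every weight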

Fine-tuning and RLHF

Raw pre-trained models are good predictors but bad assistants — they complete text, they don't answer questions helpfully. Post-training aligns them:

Supervised Fine-Tuning (SFT): Train on human-written (prompt, ideal response) pairs. Teaches the model to act as an assistant.

RLHF (Reinforcement Learning from Human Feedback): Humans rank model outputs. A reward model is trained on these rankings. The policy (LLM) is then optimized with PPO to maximize the reward model's score. This is what makes Claude prefer helpful, honest, harmless responses.
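
The reward-model step is the easiest part to show in code: given pairs of (preferred, rejected) responses, it is trained with a pairwise ranking loss so the preferred response scores higher. Here is a toy sketch with placeholder features (a real reward model scores full token sequences, not random vectors):

# Pairwise ranking loss for a toy reward model
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Linear(768, 1)                              # maps features to a scalar score
chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)   # 8 human preference pairs
r_chosen, r_rejected = reward_model(chosen), reward_model(rejected)
loss = -F.logsigmoid(r_chosen - r_rejected).mean()            # preferred response should score higher
loss.backward()                                               # the trained scores then guide PPO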


Generation: how models choose the next word

At inference time, the model produces a vector of raw scores (logits) over the entire vocabulary for the next position. These get converted to probabilities via softmax — then one token is sampled. That token is appended, and the process repeats (this is called autoregressive generation).

How you sample from the probability distribution dramatically changes the output. The key parameter is temperature:

P(token) = softmax(logits / temperature)
# temperature = 1.0 → use model probabilities as-is
# temperature < 1.0 → sharpen distribution (more confident/repetitive)
# temperature > 1.0 → flatten distribution (more creative/random)
Interactive — Temperature & Sampling

Context: "The capital of France is" — adjust temperature to see how the next-token probability distribution changes.


Top-p (Nucleus) Sampling

Instead of sampling from the full vocabulary, restrict to the smallest set of tokens whose cumulative probability exceeds p (e.g., 0.9). This avoids the long tail of absurd tokens while still allowing diversity.

Top-k Sampling

Only sample from the k most likely tokens (e.g., top 40). Simpler than top-p but less adaptive to the distribution's shape.

Greedy decoding (always pick the highest-probability token) sounds like the safest choice, but it produces repetitive, boring text. A little randomness is essential for natural-feeling generation.
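
The sketch below combines all three knobs: temperature scaling, a top-k cutoff, and a top-p (nucleus) cutoff, then samples one token. It is illustrative only; production decoders apply these filters to the full distribution and run this inside the autoregressive loop, appending each sampled token to the context:

# Sketch: choosing the next token from logits with temperature, top-k, and top-p
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=1.0, top_k=40, top_p=0.9):
    logits = logits / temperature                        # sharpen (<1) or flatten (>1)
    topk_vals, topk_idx = torch.topk(logits, top_k)      # top-k: keep the k highest scores
    probs = F.softmax(topk_vals, dim=-1)
    sorted_probs, order = torch.sort(probs, descending=True)
    keep = torch.cumsum(sorted_probs, dim=-1) - sorted_probs < top_p   # nucleus of mass < p
    keep[0] = True                                       # always keep the single best token
    probs = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    probs = probs / probs.sum()                          # renormalize the surviving tokens
    choice = torch.multinomial(probs, num_samples=1)     # sample one token
    return topk_idx[order[choice]]                       # map back to the vocabulary ID

logits = torch.randn(50257)                              # placeholder next-token logits
next_token_id = sample_next(logits)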

Scale, emergence, and what we don't fully understand

Something strange happens when you scale transformers. Capabilities that simply don't exist at smaller scales appear suddenly at larger ones — this is called emergence. Multi-step arithmetic: absent at 1B parameters, present at 100B. Chain-of-thought reasoning: absent in small models, present in large ones.

Scaling laws (Kaplan et al., 2020; Chinchilla, 2022): Model performance follows a power law with compute, parameters, and data. Double the compute, get predictable improvement. This has held for 8+ orders of magnitude — from tiny models to GPT-4.
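
In the Kaplan et al. form, test loss L falls off as a power law in each resource when the others aren't the bottleneck (the constants N_c, D_c, C_c and the exponents are fit empirically):

L(N) ≈ (N_c / N)^α_N   # loss vs. parameter count N
L(D) ≈ (D_c / D)^α_D   # loss vs. dataset size D
L(C) ≈ (C_c / C)^α_C   # loss vs. training compute C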

What are the weights actually storing?

The billions of floating-point numbers in a model's weights encode grammar and syntax, facts about the world, statistical patterns of reasoning, and stylistic regularities, all of it learned implicitly from next-token prediction and distributed across the attention and feed-forward weights.

What's still mysterious

We don't fully understand how this information is stored and retrieved. We can observe attention heads and probe activations, but the mapping from weights to capabilities is opaque — this is the core challenge of interpretability research.

Models can fail on simple tasks while succeeding at far harder ones. They can be confident and wrong. They can be right for the wrong reasons. The next frontier is making these systems understand rather than just predict — and we're still figuring out what that distinction even means.

Bottom line: Transformers are a beautiful marriage of linear algebra (matrix multiplications), probability theory (softmax, sampling), and optimization (gradient descent) — applied to language at a scale that produces genuinely surprising emergent behavior. The "magic" is that we don't fully understand where intelligence ends and statistical pattern-matching begins.