The Paper That Changed Everything
The 2017 paper "Attention Is All You Need" by Vaswani et al. introduced the transformer architecture, which has become the foundation of most state-of-the-art models in NLP and is widely used in computer vision, audio processing, and beyond. Understanding the original transformer, not just using it, is essential for anyone working in modern AI.
This article walks through the transformer's key innovations, explains why they work, and connects the mathematical formulation to intuitive understanding. We'll use the original Seq2Seq machine translation task as context.
The Problem with Recurrent Networks
Before transformers, sequence modeling used recurrent neural networks (RNNs), particularly LSTMs and GRUs. These process sequences token by token, maintaining a hidden state that in principle captures all past context. In practice, information from the beginning of a long sequence often degrades by the time it reaches the end, and gradients shrink as they are propagated back through many time steps (the "vanishing gradient" problem), which makes long-range dependencies hard to learn. LSTMs and GRUs mitigate this but do not eliminate it. RNNs also can't be parallelized across time steps during training, because each step depends on the previous hidden state.
Attention mechanisms (first proposed for machine translation by Bahdanau et al. in 2015) let the decoder look directly at every encoder state at each output step, instead of relying on a single fixed-size summary of the input. Those models still used recurrence alongside attention. The transformer discarded the recurrence entirely, relying wholly on attention. Hence: "attention is all you need."
Scaled Dot-Product Attention
The core operation of a transformer is scaled dot-product attention:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
Unpacking this:
- Q (queries): What is this token looking for?
- K (keys): What does each token offer?
- V (values): What information does each token carry?
The dot product QK^T measures compatibility: how relevant is each key to each query? We scale by sqrt(d_k) because, if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance d_k; without scaling, large d_k would produce large scores and push the softmax into saturation, where gradients are tiny. The softmax is applied row by row, so each query gets a probability distribution over the key positions. The output is a weighted sum: for each query position, collect information from all value positions, weighted by how compatible their keys were.
This is attention as information retrieval: you have a query, you look up which keys match, and you retrieve the associated values.
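The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it handles a single unbatched sequence, and the function names, shapes, and random inputs are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output of shape (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) query-key compatibility
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights          # weighted sum of the value vectors

# Illustrative inputs: 3 queries attending over 5 key/value pairs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 4))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1, and `out` has one row per query, matching the retrieval picture: one blended value vector per lookup.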
Multi-Head Attention
A single attention operation captures one "type" of relationship between tokens. Multi-head attention runs h independent attention operations in parallel (the "heads"), each with different learned projection matrices W_Q, W_K, W_V. The outputs are concatenated and projected:
MultiHead(Q,K,V) = Concat(head_1, ..., head_h) W_O
where head_i = Attention(Q W_Qi, K W_Ki, V W_Vi)
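Different heads can learn to attend to different relationships, such as syntax, coreference, positional proximity, or semantic similarity, although individual heads are not always cleanly interpretable. In the original paper each head works in a reduced dimension d_k = d_model / h, so the total cost is similar to that of single-head attention at full dimension. A layer with 12 heads can therefore compute 12 different attention patterns over the same tokens at once.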
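As a rough sketch of the MultiHead formula above, the following NumPy code implements multi-head self-attention for one unbatched sequence. It assumes each head has dimension d_model / h and stores the per-head matrices W_Qi, W_Ki, W_Vi as column blocks of single (d_model, d_model) matrices, which is a common implementation shortcut rather than something the formula requires. All names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    # Works on (n, d) inputs or on a stack of heads with shape (h, n, d).
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Self-attention over X: (n, d_model); every W_* is (d_model, d_model)."""
    n, d_model = X.shape
    d_head = d_model // h  # assumes d_model is divisible by h

    def split_heads(M):
        # (n, d_model) -> (h, n, d_head): head i gets column block i.
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_Q)
    K = split_heads(X @ W_K)
    V = split_heads(X @ W_V)
    heads = attention(Q, K, V)                             # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # Concat(head_1..head_h)
    return concat @ W_O

rng = np.random.default_rng(0)
n, d_model, h = 4, 8, 2
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
```

Computing all heads with one reshaped matrix product gives the same result as looping over h separate projections; it just keeps the work in a few large matrix multiplications.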
Positional Encoding
Self-attention on its own is permutation-equivariant: if you shuffle the input tokens, the outputs are simply shuffled in the same way, because nothing in the mechanism itself knows positions. To give the model positional information, the original transformer adds sinusoidal position encodings to the input embeddings:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
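The two formulas above translate directly into a lookup table with one row per position. The sketch below assumes an even d_model; the function name and the sizes in the example call are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table of encodings; d_model assumed even."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1) positions
    two_i = np.arange(0, d_model, 2)[None, :]    # the even indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)   # PE(pos, 2i+1)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
# In the model, the first n rows are added to the (n, d_model) token embeddings.
```

Low-index dimensions oscillate quickly with position and high-index dimensions slowly, so each position receives a distinct pattern across the d_model dimensions.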
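Later models often replace these with learned positional embeddings (BERT, GPT-2), relative position encodings (T5, Transformer-XL), or rotary position embeddings (RoPE, used in Llama, GPT-NeoX, and many recent LLMs). RoPE rotates queries and keys so that their dot product depends on the relative offset between positions. Extending it to contexts much longer than those seen during training usually still requires additional techniques such as position interpolation.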
Feed-Forward Networks and Layer Normalization
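Each encoder layer has two sub-layers: multi-head self-attention and a position-wise feed-forward network (FFN). Decoder layers add a third sub-layer, cross-attention, between them. The FFN is applied to every position independently: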
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
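The FFN expands the representation to a higher dimension (typically 4× the model dimension; 512 to 2048 in the original base model), applies a nonlinearity, then projects back down. Modern models often use GELU or gated variants such as SwiGLU instead of ReLU. Each sub-layer is wrapped in a residual connection with layer normalization to stabilize training: the original transformer normalizes after the residual addition (post-norm), while most modern models normalize the sub-layer input instead (pre-norm).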
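To make the FFN and its wrapping concrete, here is a minimal NumPy sketch. It shows the ReLU feed-forward network from the formula above together with two ways of combining it with a residual connection and layer normalization: post-norm, LayerNorm(x + Sublayer(x)), as in the original paper, and pre-norm, x + Sublayer(LayerNorm(x)), as in many later models. The learned gain and bias of layer normalization are omitted for brevity, and the names, sizes, and weight scale are assumptions for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector (learned gain/bias omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, applied to each row independently.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def post_norm_sublayer(x, W1, b1, W2, b2):
    # Original transformer: normalize after adding the residual.
    return layer_norm(x + ffn(x, W1, b1, W2, b2))

def pre_norm_sublayer(x, W1, b1, W2, b2):
    # Common later variant: normalize the sub-layer input, keep the residual raw.
    return x + ffn(layer_norm(x), W1, b1, W2, b2)

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32   # d_ff = 4 * d_model
x = rng.normal(size=(5, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)
y_post = post_norm_sublayer(x, W1, b1, W2, b2)
y_pre = pre_norm_sublayer(x, W1, b1, W2, b2)
```

Both variants preserve the (sequence length, d_model) shape, which is what lets layers be stacked. The attention sub-layer is wrapped in the same way.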
The Encoder-Decoder Architecture
The original transformer had both encoder and decoder. The encoder processes the input sequence with self-attention; the decoder generates the output sequence with masked self-attention (can't look at future tokens) and cross-attention (attends to encoder output). This architecture works well for translation, summarization, and other Seq2Seq tasks.
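The "can't look at future tokens" constraint is usually enforced with a causal mask on the attention scores. The sketch below is one common way to do it, assuming a single unbatched sequence: scores for future positions are set to negative infinity before the softmax, so their weights become exactly zero. Function names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_self_attention(Q, K, V):
    """Masked self-attention: position i may attend only to positions j <= i."""
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # True strictly above the diagonal marks future positions (j > i).
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)   # exp(-inf) = 0, so future weights vanish
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out, weights = causal_self_attention(Q, K, V)
```

The resulting `weights` matrix is lower triangular, so the output at position i is unaffected by any change to later tokens. That property is what allows the decoder to be trained on all positions of a sequence in parallel while still generating left to right at inference time.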
Modern LLMs are predominantly decoder-only (the GPT family): they keep only the masked self-attention stack and drop the encoder and cross-attention. BERT is encoder-only. T5 uses the original encoder-decoder layout. Each has different inductive biases suited to different tasks.
Why It Works
Transformers succeed for three reasons. First, they're highly parallelizable: during training, attention for every position in the sequence can be computed at once, enabling efficient GPU utilization. Second, they capture long-range dependencies directly: a token can attend to any other token in a single step regardless of distance, at the cost of compute and memory that grow quadratically with sequence length. Third, they scale: transformer performance continues to improve with more data and parameters in a remarkably predictable way, as documented by the scaling laws literature.
The attention mechanism specifically excels at tasks requiring flexible information routing — deciding on the fly which previous context is relevant to the current computation. This is the key capability that makes transformers so general.