Positional Encoding: RoPE, ALiBi, and How Models Handle Position

Position in Transformers

Self-attention is permutation-equivariant: without a causal mask or a positional signal, shuffling the input tokens simply shuffles the outputs in the same way, so the architecture itself has no built-in notion of order. For tasks where order matters, which covers essentially all of language, positional information has to be injected. The choice of positional encoding affects how well the model handles long sequences and whether it generalizes to sequence lengths beyond those seen during training.
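
The order-blindness of plain self-attention can be checked numerically. The following is a minimal sketch, assuming a single attention head with random weights, no causal mask, and no positional encoding; the function and variable names are illustrative only.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no mask and no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = rng.permutation(seq_len)
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Shuffling the inputs only shuffles the outputs in the same way:
# each token's output is identical regardless of where it sits.
same_up_to_order = np.allclose(out[perm], out_shuffled)
print(same_up_to_order)
```

Each token receives the same output vector no matter where it appears in the sequence, which is why some positional signal has to be added.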

Sinusoidal Encoding (Original Transformer)

The original Transformer used fixed sinusoidal encodings added to the token embeddings. Each pair of embedding dimensions holds a sine and a cosine at a different frequency, so the encoding at position p + k is a linear function of the encoding at position p, and the dot product between two encodings depends only on their offset. In theory this structure supports extrapolation to longer sequences. In practice, models trained with sinusoidal encodings generalize poorly to sequences longer than those seen during training.
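
As a small illustration of the relative-position property, the sketch below builds the sinusoidal table and compares dot products at equal offsets. It assumes an even embedding size and the commonly used base of 10000; the function name is illustrative.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """Fixed sinusoidal position table of shape (num_positions, d_model).

    Assumes d_model is even: even columns hold sines, odd columns cosines.
    """
    positions = np.arange(num_positions)[:, None]          # (P, 1)
    freqs = base ** (-np.arange(0, d_model, 2) / d_model)  # (d_model / 2,)
    angles = positions * freqs                             # (P, d_model / 2)
    table = np.zeros((num_positions, d_model))
    table[:, 0::2] = np.sin(angles)
    table[:, 1::2] = np.cos(angles)
    return table

pe = sinusoidal_encoding(num_positions=128, d_model=64)

# Both pairs are 5 positions apart, so their dot products match:
# sin(a)sin(b) + cos(a)cos(b) = cos(a - b) for every frequency.
near_start = pe[10] @ pe[15]
near_end = pe[100] @ pe[105]
print(np.isclose(near_start, near_end))
```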

Learned Absolute Positions

GPT-2 and BERT use learned absolute position embeddings: each position 1, 2, ..., N gets a trainable embedding vector that is added to the token embedding. This is simple and effective within the training context length, but it imposes a hard limit: position N+1 has no embedding. Going beyond the trained length requires either extrapolating the embedding table (which gives poor results) or further training on longer sequences.
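
The hard limit is easy to see in code. This is a toy sketch, with random tables standing in for trained parameters and hypothetical sizes; it uses zero-based indices, so the table covers positions 0 to max_positions - 1.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_positions, d_model = 100, 16, 8  # hypothetical sizes

# Random values stand in for trained embedding tables.
token_table = rng.normal(size=(vocab_size, d_model))
position_table = rng.normal(size=(max_positions, d_model))

def embed(token_ids):
    """Add a learned absolute position vector to each token embedding."""
    token_ids = np.asarray(token_ids)
    if len(token_ids) > max_positions:
        # There is simply no row in the table for these positions.
        raise ValueError(
            f"sequence length {len(token_ids)} exceeds max_positions={max_positions}"
        )
    positions = np.arange(len(token_ids))
    return token_table[token_ids] + position_table[positions]

ok = embed(np.zeros(max_positions, dtype=int))  # exactly at the limit
try:
    embed(np.zeros(max_positions + 1, dtype=int))
    overflow_rejected = False
except ValueError:
    overflow_rejected = True
print(ok.shape, overflow_rejected)
```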

Relative Positional Encodings

Rather than encoding absolute positions, relative encodings represent the distance between pairs of tokens. T5 uses learned relative bias terms added to the attention logits: the model learns a separate bias for each distance bucket, where small distances each get their own bucket and larger distances share logarithmically spaced buckets, up to a maximum distance beyond which all offsets share one bucket. This tends to generalize better to longer sequences because the model only needs to reason about relative positions, not absolute ones.
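
A bucketed relative bias can be sketched as follows. This is a simplified, hypothetical bucketing function written for illustration; it is not T5's exact implementation, and it ignores the direction of the offset and uses a single head. The bucket counts and the random bias table are assumptions.

```python
import numpy as np

def bucket_distance(distance, num_exact=8, num_buckets=16, max_distance=128):
    """Map non-negative token distances to bucket ids (simplified scheme).

    Distances below num_exact get their own bucket; larger distances share
    log-spaced buckets; everything past max_distance lands in the last one.
    """
    distance = np.asarray(distance)
    safe = np.maximum(distance, num_exact)  # keeps the log argument >= 1
    log_ratio = np.log(safe / num_exact) / np.log(max_distance / num_exact)
    large = num_exact + (log_ratio * (num_buckets - num_exact)).astype(int)
    large = np.minimum(large, num_buckets - 1)
    return np.where(distance < num_exact, distance, large)

rng = np.random.default_rng(0)
bias_table = rng.normal(size=16)  # stands in for learned per-bucket scalars

seq_len = 512
pos = np.arange(seq_len)
distance = np.abs(pos[:, None] - pos[None, :])
bias = bias_table[bucket_distance(distance)]  # (seq_len, seq_len), added to logits
print(bias.shape)
```

Because the bias depends only on the bucketed distance, any pair of tokens the same distance apart gets the same bias, and arbitrarily distant pairs reuse the final bucket instead of requiring new parameters.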

RoPE: Rotary Position Embedding

RoPE (Su et al., 2021) applies a rotation to query and key vectors based on their absolute position. The mathematical insight: when a query at position m and a key at position n are each rotated by an angle proportional to their position, their dot product depends only on the difference of the two angles, which is proportional to the relative offset m - n. This makes RoPE a clean, computationally efficient way to inject relative position information using only absolute positions.
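
The relative-position property can be verified with a few lines of NumPy. This sketch assumes the interleaved pairing of feature dimensions (features 0 and 1 form the first rotated pair, and so on) and a base of 10000; some implementations pair the first and second halves of the vector instead, which changes the layout but not the property.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x by position-dependent angles.

    x: float array of shape (seq_len, d) with d even.
    positions: integer array of shape (seq_len,).
    """
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)     # (d / 2,)
    angles = positions[:, None] * freqs[None, :]  # (seq_len, d / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 16))
k = rng.normal(size=(1, 16))

def score(m, n):
    """Attention logit for query q at position m and key k at position n."""
    return (rope(q, np.array([m])) @ rope(k, np.array([n])).T).item()

# Same offset (n - m = 4) at very different absolute positions.
print(np.isclose(score(3, 7), score(103, 107)))
```

The rotation also leaves vector norms unchanged, so it does not alter the scale of queries and keys.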

RoPE has become the dominant positional encoding for modern LLMs: Llama 2/3, GPT-NeoX, Falcon, Mistral, and many other recent models use it. Key advantages: it works within the standard attention mechanism without modifying its structure; it scales well to long sequences; and it is compatible with context-length extension methods such as RoPE scaling and YaRN.

ALiBi: Attention with Linear Biases

ALiBi (Press et al., 2022) adds a fixed, non-learned negative bias to the attention logits, proportional to the distance between the query and key tokens, with a different slope for each attention head. No positional vectors are added to the embeddings; the position information lives entirely in the attention bias. Key property: tokens that are far apart receive a stronger negative bias, which builds in a preference for nearby context. Because the bias is defined for any distance, a model can be trained on shorter sequences and then evaluated on longer ones, a major advantage over absolute encoding schemes. ALiBi is used in BLOOM and several MPT models.
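
The bias itself is cheap to construct. The sketch below assumes a causal model and a power-of-two number of heads, for which the ALiBi paper uses a geometric sequence of slopes; the function names are illustrative, and the causal mask is assumed to be applied separately.

```python
import numpy as np

def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8).

    Assumes num_heads is a power of two.
    """
    return 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)

def alibi_bias(num_heads, seq_len):
    """Bias of shape (num_heads, seq_len, seq_len) to add to attention logits."""
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]  # query index minus key index
    distance = np.maximum(distance, 0)      # future keys are handled by the causal mask
    return -alibi_slopes(num_heads)[:, None, None] * distance[None, :, :]

slopes = alibi_slopes(8)
bias = alibi_bias(num_heads=8, seq_len=16)
print(slopes[0], slopes[-1], bias.shape)
```

Nothing in this construction depends on the training length, so the same function produces a valid bias for any seq_len at inference time.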

Context Length Extension with RoPE Scaling

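A model trained with RoPE at a context of, say, 4K tokens can be extended to much longer contexts (128K+ tokens has been reported) by rescaling the rotation angles, for example by increasing the RoPE base so that the rotation frequencies decrease. Methods include NTK-aware scaling (which leaves the high-frequency components nearly unchanged and stretches mainly the low-frequency ones), YaRN (Yet another RoPE extensioN), and LongRoPE. In practice, some additional fine-tuning on long sequences is usually required for reliable long-context performance, but RoPE scaling provides a much better starting point than training a long-context model from scratch.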
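
To make the idea concrete, the sketch below compares two simple ways of rescaling RoPE angles: dividing positions by a scale factor (often called position interpolation, which is not named above and is included here only for comparison) and the commonly cited NTK-aware base adjustment, base * scale^(d / (d - 2)). The head dimension and scale factor are assumptions, and YaRN and LongRoPE involve additional steps not shown here.

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0):
    """Per-pair rotation frequencies, from highest (index 0) to lowest."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def position_interpolation_angles(positions, head_dim, scale, base=10000.0):
    """Squeeze positions by `scale` so they fall back inside the trained range."""
    return (positions[:, None] / scale) * rope_frequencies(head_dim, base)

def ntk_aware_angles(positions, head_dim, scale, base=10000.0):
    """Raise the base so low frequencies stretch while the highest is unchanged."""
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return positions[:, None] * rope_frequencies(head_dim, new_base)

head_dim, scale = 128, 8.0               # hypothetical: extend 4K to 32K
positions = np.arange(0, 32768, 1024)
pi_angles = position_interpolation_angles(positions, head_dim, scale)
ntk_angles = ntk_aware_angles(positions, head_dim, scale)

# Highest frequency: NTK-aware scaling leaves it untouched,
# while interpolation compresses it by the full scale factor.
print(np.allclose(ntk_angles[:, 0], positions))
# Lowest frequency: both methods compress it by the same factor.
print(np.allclose(ntk_angles[:, -1], pi_angles[:, -1]))
```

The comparison shows why NTK-aware scaling is described as preserving high-frequency components: fine-grained local position information is kept as trained, and only the slowly varying dimensions are stretched to cover the longer context.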