Context Windows: From 4K to 1M Tokens and What It Changes

The Length Obsession

In 2020, GPT-3 had a context window of 2,048 tokens — enough for a few paragraphs. In 2023, Claude 2 announced 100,000 tokens. In 2024, Gemini 1.5 Pro demonstrated 1 million tokens. Context length has become one of the primary competitive dimensions in LLM development, and understanding what these numbers mean — and don't mean — is important for anyone building with these systems.

What a Context Window Is

The context window is the maximum amount of text a model can consider at once. Everything the model "knows" about the current interaction — system prompt, conversation history, retrieved documents, user input — must fit within this window. Text outside the window is invisible to the model.

Context length is measured in tokens, not words or characters. At typical tokenizer rates for English, 1,000 tokens ≈ 750 words, or roughly 1.5 pages of single-spaced text (about 3 pages double-spaced). So 128K tokens ≈ 96,000 words, about the length of a full novel.
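
The rule of thumb above can be turned into a quick budget check. This is a minimal sketch, assuming the 0.75 words-per-token ratio holds for your text; the function names are illustrative, and a real tokenizer will give different counts, especially for code or non-English text.

```python
# Rule of thumb from the text: 1,000 tokens ≈ 750 English words.
# Assumption: this ratio varies by tokenizer, language, and content type.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Rough token estimate based on a whitespace word count."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)


def fits_in_window(text: str, window_tokens: int, reserve_for_output: int = 0) -> bool:
    """True if the estimated input leaves room for the reserved output tokens."""
    return estimate_tokens(text) + reserve_for_output <= window_tokens
```

For anything close to the limit, count tokens with the model's actual tokenizer rather than relying on this estimate.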

The Quadratic Attention Problem

Standard self-attention is O(n²) in sequence length: each of the n tokens is scored against every other token, so computing attention scores takes on the order of n² comparisons. This was the primary constraint on context length until recently. Relative to an 8K window, a 32K window (4× longer) requires 16× more attention compute, and a 128K window (16× longer) requires 256× more.
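
The scaling arithmetic can be checked in a few lines. This sketch covers only the quadratic attention-score term; other parts of a transformer scale differently, so total compute grows more slowly than these ratios suggest. The function name is illustrative.

```python
def relative_attention_cost(n_tokens: int, baseline_tokens: int) -> float:
    """Ratio of attention-score compute between two sequence lengths, assuming O(n^2) scaling."""
    return (n_tokens / baseline_tokens) ** 2


if __name__ == "__main__":
    baseline = 8 * 1024
    for window in (8 * 1024, 32 * 1024, 128 * 1024):
        ratio = relative_attention_cost(window, baseline)
        print(f"{window // 1024}K vs 8K: {ratio:.0f}x attention compute")
```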

Several techniques ease this constraint: FlashAttention (computes exact attention in memory-efficient blocks without materializing the full attention matrix; the arithmetic is still quadratic, but memory use and memory traffic drop sharply), grouped-query attention (GQA, shares key/value heads across query heads to shrink the KV cache), sliding window attention (each token attends only to a fixed-size local window), and StreamingLLM (keeps only the first few tokens plus the most recent tokens in the KV cache). Together, these optimizations made long contexts economically viable.
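
Two of these ideas are easy to show on token positions alone. The sketch below is a simplified illustration, not the implementation used by any particular model or library: it builds a causal sliding-window mask and lists which positions a StreamingLLM-style cache would keep. The function names and default sizes are assumptions chosen for the example.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Causal mask where token i may attend to token j only if i - window < j <= i."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def streaming_kept_positions(seq_len: int, n_first: int = 4, n_recent: int = 1024) -> list[int]:
    """Positions kept in the KV cache: the first n_first tokens plus the n_recent most recent."""
    if seq_len <= n_first + n_recent:
        return list(range(seq_len))  # everything still fits, nothing is evicted
    return list(range(n_first)) + list(range(seq_len - n_recent, seq_len))


if __name__ == "__main__":
    mask = sliding_window_mask(seq_len=6, window=3)
    attended = sum(sum(row) for row in mask)
    print(f"attended pairs: {attended} of {6 * 6} in a full matrix")
    print(streaming_kept_positions(seq_len=10, n_first=2, n_recent=3))
```

With a fixed window, the number of attended pairs grows linearly with sequence length instead of quadratically, which is where the savings come from.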

Long-Context Evaluation: The "Lost in the Middle" Problem

Having a 1M token context window doesn't mean the model uses all 1M tokens equally. Research (Liu et al., 2023: "Lost in the Middle") showed that LLMs tend to use information at the beginning and end of the context much more effectively than information in the middle. For very long contexts, relevant information buried in the middle may effectively be invisible to the model.

Practical implication: when constructing prompts with long retrieved context, put the most important information first and last. If you need the model to attend to specific middle content, use explicit "look at document 3" instructions.
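
One way to apply this ordering advice is sketched below. It assumes the documents arrive already ranked by relevance (most relevant first) and places them so the strongest candidates sit at the two ends of the prompt, with the weakest in the middle. The function names and the "[Document N]" label format are illustrative choices, not a standard.

```python
def order_for_long_context(docs_by_relevance: list[str]) -> list[str]:
    """Place the most relevant documents at the start and end, the least relevant in the middle."""
    front, back = [], []
    for rank, doc in enumerate(docs_by_relevance):
        # Alternate between the front and the back of the prompt as relevance decreases.
        (front if rank % 2 == 0 else back).append(doc)
    return front + back[::-1]


def build_prompt(docs: list[str], question: str) -> str:
    """Label each document so an instruction can refer to it by number."""
    parts = [f"[Document {i}]\n{doc}" for i, doc in enumerate(docs, start=1)]
    return "\n\n".join(parts) + f"\n\nQuestion: {question}"


if __name__ == "__main__":
    ranked = ["A", "B", "C", "D", "E"]  # "A" is the most relevant
    print(order_for_long_context(ranked))
```

Numbered labels also make explicit pointers such as "look at document 3" unambiguous.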

What Long Context Actually Enables

Practically useful applications of long-context LLMs:

  • Whole-document analysis: Analyze a 200-page report end to end without chunking
  • Long conversation history: Maintain context across an entire day's or week's conversation without summarization
  • In-context learning: Provide dozens or hundreds of few-shot examples rather than the typical 3-5
  • Full codebase context: Include an entire medium-sized codebase when answering development questions
  • Legal and medical document analysis: Process complete contracts, medical records, or clinical trial reports

Long Context vs. RAG: A Real Tradeoff

The key tension: long contexts are expensive and may not use middle content effectively; RAG is cheaper, but the model only sees the chunks the retriever returns, so anything the retriever misses is lost. The optimal architecture depends on the application:

  • If the full document set fits in the context window and within your cost budget: do it. Quality is typically higher.
  • If documents are too large or numerous: use RAG with good retrieval.
  • Hybrid: retrieve candidate chunks with RAG, then use a longer context to re-read them together with broader document context.
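
The options above can be sketched as a simple decision rule plus a helper for the hybrid case. This is an illustration under stated assumptions: the thresholds, parameter names, and the neighbor-expansion approach to "broader document context" are hypothetical choices, not a prescribed design.

```python
def choose_strategy(corpus_tokens: int, window_tokens: int,
                    price_per_mtok: float, budget_per_query: float) -> str:
    """Use full context only if the corpus fits the window and the per-query budget."""
    full_cost = corpus_tokens / 1_000_000 * price_per_mtok
    if corpus_tokens <= window_tokens and full_cost <= budget_per_query:
        return "full_context"
    return "rag"


def expand_with_neighbors(hit_indices: list[int], n_chunks: int, radius: int = 1) -> list[int]:
    """Hybrid sketch: add the chunks adjacent to each retrieved chunk for broader context."""
    keep = set()
    for i in hit_indices:
        for j in range(max(0, i - radius), min(n_chunks, i + radius + 1)):
            keep.add(j)
    return sorted(keep)  # document order, duplicates removed


if __name__ == "__main__":
    print(choose_strategy(50_000, 200_000, price_per_mtok=3.0, budget_per_query=0.50))
    print(expand_with_neighbors([0, 5], n_chunks=7))
```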

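Cost matters. At an input price of about $3 per million tokens (Claude 3.5 Sonnet's list price, although that model's window is 200K tokens rather than 1M), a single query that fills a 1M token context costs ~$3 before output tokens. A RAG-based equivalent with 10K tokens retrieved costs ~$0.03. For high-volume applications, the 100× cost difference often dominates the quality comparison.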
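
The comparison is simple arithmetic, sketched here assuming the $3 per million input tokens implied by the figures above and ignoring output tokens and any caching discounts. The query volume is a made-up number for illustration.

```python
def input_cost_usd(tokens: int, price_per_mtok: float = 3.0) -> float:
    """Input-side cost of one query, in US dollars."""
    return tokens / 1_000_000 * price_per_mtok


if __name__ == "__main__":
    long_context = input_cost_usd(1_000_000)  # about $3 per query
    rag = input_cost_usd(10_000)              # about $0.03 per query
    queries_per_day = 10_000                  # hypothetical volume
    print(f"per query: ${long_context:.2f} vs ${rag:.2f}")
    print(f"per day:   ${long_context * queries_per_day:,.0f} vs ${rag * queries_per_day:,.0f}")
```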