RAG Architecture: When to Retrieve, When to Generate

The Knowledge Problem

Every LLM has a knowledge cutoff: it knows what was in its training data, nothing more. For many applications, this is fine — you're asking about general knowledge that doesn't change. But for applications involving current events, proprietary information, personal data, or rapidly changing domains, static training knowledge isn't enough.

Retrieval-Augmented Generation (RAG) addresses the knowledge problem by giving the model access to a document store that can be queried at inference time. Rather than training the knowledge into the model's weights, you retrieve it on demand and include it in the prompt context. This is a widely used architecture for enterprise AI applications.

Basic RAG Architecture

The simplest RAG system has four components:

Document store: A collection of documents (PDFs, web pages, database records) chunked into segments of appropriate size (typically 200-500 tokens).
Retriever: A component that, given a user query, returns the most relevant document chunks. It can use BM25 (keyword matching), dense retrieval (vector search), or a hybrid of the two.
Prompt assembler: Combines retrieved chunks with the user query into an LLM prompt.
Generator: The LLM that produces the final response from the assembled prompt.
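
The four components above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the Chunk class, the word-overlap scoring, the prompt wording, and the generate callable (a stand-in for whatever LLM client you use) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Chunk:
    """One segment of a source document (hypothetical structure)."""
    doc_id: str
    text: str


def retrieve(query: str, store: list[Chunk], k: int = 2) -> list[Chunk]:
    """Toy retriever: rank chunks by the number of words shared with the query."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(c.text.lower().split())), c) for c in store]
    scored = [(score, c) for score, c in scored if score > 0]
    # Sort on the score only; ties keep their original store order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def assemble_prompt(query: str, chunks: list[Chunk]) -> str:
    """Prompt assembler: label each chunk with its source ID, then append the question."""
    context = "\n".join(f"[{c.doc_id}] {c.text}" for c in chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


def answer(query: str, store: list[Chunk],
           generate: Callable[[str], str], k: int = 2) -> str:
    """Full pipeline: retrieve, assemble, generate."""
    return generate(assemble_prompt(query, retrieve(query, store, k)))
```

In a real system the toy retriever would be replaced by BM25 or vector search, and generate would call the model. Keeping the generator behind a plain callable makes the retrieval and prompt-assembly steps testable without a model.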

Chunking Strategy: The Hidden Complexity

How you chunk your documents profoundly affects RAG quality. The basic tradeoff: small chunks are more precise (easier to match a specific query) but may lack the context the LLM needs to answer. Large chunks provide more context but are harder to retrieve precisely and quickly exceed context window budgets.

Common strategies:

Fixed-size: Chunk every 512 tokens with 50-token overlap. Simple, but ignores document structure.
Semantic: Chunk at paragraph or section boundaries. Preserves context but produces variable-size chunks.
Hierarchical: Index both summaries (for retrieval) and full sections (for context), retrieving at the summary level but feeding the full section to the LLM. Works well for long documents.
Sentence window: Retrieve at the sentence level (precise), but feed neighboring sentences as context. Good for dense reference material.
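
Two of these strategies are easy to show in code. The sketch below assumes the text has already been split into tokens or sentences (a real system would use the model's tokenizer and a proper sentence splitter); the function names and the default window size are illustrative choices, not part of any library.

```python
def fixed_size_chunks(tokens: list[str], size: int = 512,
                      overlap: int = 50) -> list[list[str]]:
    """Fixed-size chunking: windows of `size` tokens, each sharing
    `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def sentence_window(sentences: list[str], hit: int, window: int = 1) -> str:
    """Sentence-window context: given the index of the retrieved sentence,
    return it together with `window` neighbors on each side."""
    lo = max(0, hit - window)
    hi = min(len(sentences), hit + window + 1)
    return " ".join(sentences[lo:hi])
```

With the defaults, fixed_size_chunks advances 462 tokens per chunk (512 minus the 50-token overlap). In the sentence-window case, only the single sentence is indexed for retrieval, while the widened span is what the LLM sees.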

Retrieval Quality: The Bottleneck

A RAG system is only as good as its retrieval. If the retriever fails to return documents relevant to the user's query, the LLM has nothing to work with. Common failure modes:

BM25 misses conceptually related documents that use different vocabulary
Dense retrieval (vector search) misses exact matches for technical terms
Both fail when the question is vague or under-specified

Hybrid retrieval, which combines BM25 and dense retrieval via reciprocal rank fusion, outperforms either alone on most realistic query distributions. Query expansion (using an LLM to generate related terms or sub-questions) further improves recall.
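
Reciprocal rank fusion itself is short enough to show in full. The sketch below assumes each retriever returns an ordered list of document IDs; the constant k = 60 is the conventional default for this method, and the document IDs are made up for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one.

    Each document scores 1 / (k + rank) for every list it appears in
    (rank starts at 1); documents are returned by descending total score.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Break exact ties by document ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["d1", "d2", "d3"]    # keyword ranking (illustrative)
dense_hits = ["d3", "d1", "d4"]   # vector-search ranking (illustrative)
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# fused == ['d1', 'd3', 'd2', 'd4']
```

Because the fusion uses only ranks, it needs no calibration between BM25 scores and vector similarities, which live on different scales. Documents that both retrievers return (d1 and d3 here) rise above documents only one retriever found.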

Advanced RAG: When Simple Isn't Enough

Basic RAG fails for complex questions that require reasoning across multiple documents, synthesis of conflicting information, or multi-hop inference (A → B → C). Advanced patterns:

Multi-query RAG: Decompose complex questions into sub-questions, retrieve for each, then synthesize.
Iterative retrieval: The LLM decides when it needs more information and requests additional retrieval rounds.
Graph RAG: Build a knowledge graph over the document set; traverse it for multi-hop queries.
Self-consistency: Run multiple retrieval-generation cycles and check agreement.
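
The first two patterns differ mainly in who drives the extra retrieval. A rough sketch, assuming hypothetical callables: decompose (an LLM call that splits a question into sub-questions), retrieve (any retriever returning chunk IDs), and next_query (an LLM call that either proposes a follow-up query or returns None when it has enough context).

```python
from typing import Callable, Optional


def multi_query_retrieve(question: str,
                         decompose: Callable[[str], list[str]],
                         retrieve: Callable[[str], list[str]],
                         per_query_k: int = 3) -> list[str]:
    """Multi-query RAG: retrieve for every sub-question, then merge
    the results, dropping duplicates while preserving first-seen order."""
    seen: set[str] = set()
    merged: list[str] = []
    for sub_question in decompose(question):
        for chunk_id in retrieve(sub_question)[:per_query_k]:
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged


def iterative_retrieve(question: str,
                       retrieve: Callable[[str], list[str]],
                       next_query: Callable[[str, list[str]], Optional[str]],
                       max_rounds: int = 3) -> list[str]:
    """Iterative retrieval: after each round the model may ask for more.
    `max_rounds` caps the loop so a confused model cannot retrieve forever."""
    context: list[str] = []
    query: Optional[str] = question
    for _ in range(max_rounds):
        for chunk_id in retrieve(query):
            if chunk_id not in context:
                context.append(chunk_id)
        query = next_query(question, context)
        if query is None:
            break  # the model reports it has enough context
    return context
```

In both cases the merged chunks are then passed to the prompt assembler as usual. The round limit in iterative_retrieve is a design choice worth keeping in any real version, since each round adds latency and cost.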

When to Use Context Window vs. RAG

With million-token context windows, is RAG even necessary? It depends on corpus size, cost, and whether answers must cite sources. For small document sets (under roughly 100 pages), stuffing everything into context is simpler and often performs better. For larger corpora, the per-query cost of sending everything becomes prohibitive, and retrieval is needed to keep costs under control. The decision matrix:

Under 100 pages, high-priority queries: Full context. Cost is manageable, and because the model sees every document, nothing is lost to a retrieval miss.
100-10,000 pages: Hybrid RAG. Retrieve top-K then synthesize.
Over 10,000 pages: RAG is necessary. Vector search with hierarchical indexing.
Need for citations: RAG, always. Retrieved chunks carry source IDs you can cite directly; with context-stuffing, attribution is much harder to verify.
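
The matrix can be written down as a small function. The page thresholds come from the rows above; the returned labels are made up for the example, and returning plain "rag" for a small corpus that needs citations is an assumption, since the matrix does not say which RAG variant to use in that case. The sketch also ignores the "high-priority queries" qualifier.

```python
def choose_architecture(pages: int, needs_citations: bool = False) -> str:
    """Map corpus size (in pages) and citation needs to an architecture label."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if pages > 10_000:
        return "rag_hierarchical"  # vector search with hierarchical indexing
    if pages >= 100:
        return "hybrid_rag"        # retrieve top-K, then synthesize
    if needs_citations:
        return "rag"               # small corpus, but answers must be attributable
    return "full_context"          # small enough to put everything in the prompt


print(choose_architecture(40))                        # full_context
print(choose_architecture(40, needs_citations=True))  # rag
print(choose_architecture(2_500))                     # hybrid_rag
print(choose_architecture(50_000))                    # rag_hierarchical
```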