Building RAG Systems That Actually Work

Why Most RAG Systems Disappoint

RAG (Retrieval-Augmented Generation) is conceptually simple: retrieve relevant documents, include them in the prompt, generate an answer. In practice, production RAG systems frequently disappoint: they miss relevant information, generate plausible-sounding wrong answers, fail on complex multi-hop questions, and hallucinate citations. Building RAG that actually works requires careful attention to each component of the pipeline.

The Retrieval Problem: The Most Common Failure

Most RAG failures are retrieval failures: the relevant information was in the corpus but wasn't retrieved. Common causes:

Chunking strategy doesn't preserve context around relevant sentences
Dense retrieval fails on technical terms and named entities
Query is ambiguous or under-specified
Relevant information is spread across multiple documents that aren't individually retrieved

Diagnostics: log what's retrieved for each query. When you triage user complaints, expect a significant fraction to trace back to retrieval failures rather than generation failures. If the answer is in the retrieved context but the LLM missed it, that's a generation problem. If the answer was in the corpus but never retrieved, that's a retrieval problem, and it is the more common of the two.
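The triage rule above can be sketched in a few lines. This is a minimal illustration, not a prescribed logging schema: the `QueryLog` fields (`retrieved_ids`, `gold_ids`, `answer_correct`) are hypothetical names, and it assumes you can label, for at least a sample of queries, which chunks actually contain the answer.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class QueryLog:
    """One logged query. All field names are hypothetical."""
    query: str
    retrieved_ids: List[str]  # chunk ids the retriever returned
    gold_ids: Set[str]        # chunk ids known to contain the answer
    answer_correct: bool      # human or automated judgment of the final answer


def classify_failure(log: QueryLog) -> str:
    """Label a logged query as 'ok', 'retrieval', or 'generation'."""
    if log.answer_correct:
        return "ok"
    # None of the answer-bearing chunks reached the prompt.
    if not log.gold_ids & set(log.retrieved_ids):
        return "retrieval"
    # The answer was in the context, but the model still got it wrong.
    return "generation"
```

Counting these labels over a week of logs gives a rough split between retrieval and generation failures, which indicates where tuning effort is likely to pay off first.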

Hybrid Retrieval in Practice

The practical recommendation: run BM25 and dense retrieval in parallel, then combine the two rankings with reciprocal rank fusion:

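# Over-fetch 50 candidates from each retriever so the two lists can overlap
bm25_results = bm25.search(query, k=50)
dense_results = vector_db.search(embed(query), k=50)
# Fuse the two ranked lists, then keep the top 20 fused results
combined = rrf_merge(bm25_results, dense_results)
top_k = combined[:20]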

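RRF (Reciprocal Rank Fusion) is surprisingly effective for how simple it is. It ignores the raw scores and uses only ranks: each document's fused score is the sum of 1/(k + rank) over the rankings it appears in, where k is a smoothing constant (60 is a common default). It needs no training and no score normalization, and it is a strong baseline that often matches or beats more complex fusion methods. The intuition: a document ranked highly by both systems is very likely to be relevant.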
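A minimal sketch of an `rrf_merge` helper like the one called above. It assumes each retriever returns a ranked list of document ids; the text does not specify what `search` returns, so that is an assumption. The `k=60` default is a commonly used value, not something the text prescribes, and ties are broken by id only to keep the output deterministic.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_merge(*rankings: List[str], k: int = 60) -> List[str]:
    """Fuse ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Higher fused score means better.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id for a stable result.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ids = ["d1", "d2", "d3"]
dense_ids = ["d3", "d1", "d4"]
print(rrf_merge(bm25_ids, dense_ids))  # ['d1', 'd3', 'd2', 'd4']
```

Here `d1` (ranks 1 and 2) edges out `d3` (ranks 3 and 1), and both beat the documents that only one retriever found.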

The Generation Problem: Hallucination and Attribution

Even with perfect retrieval, LLMs can hallucinate: generate factual claims not supported by the retrieved context. Mitigation strategies:

Explicit grounding instructions: "Only use information from the provided documents. If the answer isn't in the documents, say so."
Citation enforcement: "For each claim, cite the specific document it comes from."
Faithfulness checking: Use a second LLM call to verify that each claim in the response is supported by the retrieved context.
Conservative default: Instruct the model to prefer saying "I don't know" over speculating.
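The first two strategies can be combined in a prompt template, with a cheap post-check on the citations. The sketch below is one possible shape, under stated assumptions: the bracketed-number citation format, `build_prompt`, and `invalid_citations` are all hypothetical choices made for illustration. The check only catches citations to documents that were never provided. It does not verify that a cited document supports the claim, which is what the faithfulness check is for.

```python
import re
from typing import List, Set

# Grounding and citation instructions, taken from the strategies above.
# The "[n]" citation format is an assumption made for this sketch.
GROUNDING_RULES = (
    "Only use information from the provided documents. "
    "If the answer isn't in the documents, say so. "
    "For each claim, cite the specific document it comes from "
    "using its bracketed number."
)


def build_prompt(question: str, documents: List[str]) -> str:
    """Number each retrieved document and prepend the grounding rules."""
    numbered = "\n\n".join(
        f"[{i}] {doc}" for i, doc in enumerate(documents, start=1)
    )
    return (
        f"{GROUNDING_RULES}\n\nDocuments:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


def invalid_citations(answer: str, num_documents: int) -> Set[int]:
    """Return cited document numbers that were not in the prompt."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return {n for n in cited if not 1 <= n <= num_documents}
```

A non-empty result from `invalid_citations` is a strong signal of a hallucinated citation and a reasonable trigger for a retry or a fallback to "I don't know".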

Evaluation: The Hardest Part

RAG evaluation requires measuring both retrieval quality (did we retrieve relevant documents?) and generation quality (did we generate a correct, faithful answer?). Standard approaches:

RAGAS: The RAGAS (RAG Assessment) framework evaluates faithfulness, answer relevance, context recall, and context precision.
LLM-as-judge: Scales better than human evaluation for generation quality, but introduces its own biases.
Golden set: Build a "golden set" of 100-200 (question, correct answer, source document) triples and measure against it systematically.
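For the retrieval half, a golden set makes the measurement mechanical. The sketch below computes recall@k, assuming each golden example names a single source document id and that the retriever is exposed as a function from a question to a ranked list of ids. `GoldenExample`, `recall_at_k`, and the `retrieve` callable are hypothetical names, not part of any framework mentioned here.

```python
from typing import Callable, NamedTuple, Sequence


class GoldenExample(NamedTuple):
    """One (question, correct answer, source document) triple."""
    question: str
    answer: str
    source_id: str


def recall_at_k(
    golden_set: Sequence[GoldenExample],
    retrieve: Callable[[str], Sequence[str]],
    k: int = 20,
) -> float:
    """Fraction of questions whose source document is in the top k results."""
    if not golden_set:
        raise ValueError("golden set is empty")
    hits = sum(
        ex.source_id in retrieve(ex.question)[:k] for ex in golden_set
    )
    return hits / len(golden_set)
```

Tracking this number across changes to chunking, embeddings, or fusion shows whether a change helped retrieval before any generation-quality evaluation is run.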