Why Most RAG Systems Disappoint
RAG (Retrieval-Augmented Generation) is conceptually simple: retrieve relevant documents, include them in the prompt, generate an answer. In practice, production RAG systems frequently disappoint: they miss relevant information, generate plausible-sounding wrong answers, fail on complex multi-hop questions, and hallucinate citations. Building RAG that actually works requires careful attention to each component of the pipeline.
The Retrieval Problem: The Most Common Failure
Most RAG failures are retrieval failures: the relevant information was in the corpus but wasn't retrieved. Common causes:
- Chunking strategy doesn't preserve context around relevant sentences
- Dense retrieval misses exact-match terms such as rare technical terms, identifiers, and named entities
- Query is ambiguous or under-specified
- Relevant information is spread across multiple documents that aren't individually retrieved
Diagnostics: log what is retrieved for each query, then check each user complaint against that log. If the answer is in the retrieved context but the LLM missed it, that's a generation problem. If the answer was in the corpus but not retrieved, that's a retrieval problem, which is the more common case. Expect a significant fraction of complaints to trace back to retrieval rather than generation.
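The triage described above can be sketched as a small classifier over query logs. This is a minimal illustration, not a prescribed logging schema: the `QueryLog` fields and the label names are hypothetical, and it assumes you can mark (by hand or offline) which chunk actually contains the answer.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryLog:
    """One logged query. All field names are hypothetical."""
    query: str
    retrieved_ids: list[str]   # IDs of the chunks that were passed to the LLM
    gold_id: Optional[str]     # ID of the chunk known to hold the answer, if any
    answer_correct: bool       # judged by a human or an offline check


def classify_failure(log: QueryLog) -> str:
    """Label a logged query as ok, a corpus gap, or a retrieval/generation failure."""
    if log.answer_correct:
        return "ok"
    if log.gold_id is None:
        return "not_in_corpus"        # nothing to retrieve
    if log.gold_id in log.retrieved_ids:
        return "generation_failure"   # answer was in context, model missed it
    return "retrieval_failure"        # answer was in corpus, never retrieved


logs = [
    QueryLog("q1", ["c1", "c2"], "c1", True),
    QueryLog("q2", ["c3", "c4"], "c9", False),
    QueryLog("q3", ["c5", "c6"], "c5", False),
    QueryLog("q4", ["c7"], None, False),
]
counts = Counter(classify_failure(log) for log in logs)
```

Tallying these labels over a batch of complaints shows where to spend effort: a high `retrieval_failure` count points at chunking or retrieval, not at the prompt.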
Hybrid Retrieval in Practice
The practical recommendation: run BM25 and dense retrieval in parallel, then combine the two rankings with reciprocal rank fusion:
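# Retrieve a wide candidate set from each retriever (50 candidates each).
bm25_results = bm25.search(query, k=50)
dense_results = vector_db.search(embed(query), k=50)
# Fuse the two rankings, then keep the 20 best candidates.
combined = rrf_merge(bm25_results, dense_results)
top_k = combined[:20]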
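RRF (Reciprocal Rank Fusion) is surprisingly effective for how simple it is: each document's fused score is the sum of 1/(k + rank) over the rankings it appears in, where k is a smoothing constant (60 in the original paper). It uses only ranks, so no weights or score normalization are needed, and it has been reported to match or beat more complex fusion methods, including learned ones, on standard benchmarks. The intuition: a document ranked highly by both systems is very likely to be relevant.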
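For concreteness, here is one way `rrf_merge` might look. It is a sketch under assumptions: each retriever is assumed to return a list of document IDs ordered best first (real search clients usually return richer result objects, so you would extract IDs first), and `k=60` is the commonly used default rather than a tuned value.

```python
from collections import defaultdict


def rrf_merge(*rankings, k=60):
    """Fuse ranked lists of document IDs with reciprocal rank fusion.

    Each ranking is a list of IDs ordered best first. A document's fused
    score is the sum of 1 / (k + rank) over the rankings that contain it.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical IDs: "d1" and "d3" appear in both lists, so they rise to the top.
bm25_ids = ["d3", "d1", "d7"]
dense_ids = ["d1", "d9", "d3"]
fused = rrf_merge(bm25_ids, dense_ids)
```

Because only ranks are used, the BM25 scores and the embedding similarities never need to be put on a common scale.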
The Generation Problem: Hallucination and Attribution
Even with perfect retrieval, LLMs can hallucinate: generate factual claims not supported by the retrieved context. Mitigation strategies:
- Explicit grounding instructions: "Only use information from the provided documents. If the answer isn't in the documents, say so."
- Citation enforcement: "For each claim, cite the specific document it comes from."
- Faithfulness checking: Use a second LLM call to verify that each claim in the response is supported by the retrieved context.
- Conservative default: Instruct the model to prefer saying "I don't know" over speculating.
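The first two strategies can be sketched as a prompt builder plus a cheap post-check. This is an illustrative sketch only: the function names and the `[n]` citation format are assumptions, and the check catches citations to documents that were never provided, not claims that a real document fails to support (that still needs the faithfulness check).

```python
import re


def build_grounded_prompt(question, documents):
    """Assemble a prompt with numbered documents and grounding instructions."""
    numbered = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(documents, start=1)
    )
    return (
        "Only use information from the provided documents. "
        "If the answer isn't in the documents, say \"I don't know\".\n"
        "For each claim, cite the document it comes from as [n].\n\n"
        f"Documents:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


def invalid_citations(answer, num_documents):
    """Return cited document numbers that match no provided document."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_documents)


docs = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
prompt = build_grounded_prompt("What is the capital of France?", docs)
bad = invalid_citations("Paris [1]. It has 2M residents [3].", len(docs))
```

A non-empty result from `invalid_citations` is a signal to reject or regenerate the response before it reaches the user.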
Evaluation: The Hardest Part
RAG evaluation requires measuring both retrieval quality (did we retrieve relevant documents?) and generation quality (did we generate a correct, faithful answer?). A standard starting point is the RAGAS (Retrieval Augmented Generation Assessment) framework, which evaluates faithfulness, answer relevance, context recall, and context precision. Using an LLM as judge for generation quality scales better than human evaluation but introduces its own biases, such as a preference for longer answers. Whatever tooling you use, build a "golden set" of 100-200 (question, correct answer, source document) triples and measure against it systematically.
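The retrieval half of a golden-set evaluation can be sketched in a few lines. This assumes each triple stores the source document as an ID and that `retrieve` is your own function returning ranked IDs; both are illustrative assumptions, and generation quality still needs a separate judge.

```python
def recall_at_k(golden_set, retrieve, k=20):
    """Fraction of golden questions whose source document is in the top k results."""
    if not golden_set:
        return 0.0
    hits = 0
    for question, _answer, source_id in golden_set:
        if source_id in retrieve(question)[:k]:
            hits += 1
    return hits / len(golden_set)


# Toy golden set of (question, correct answer, source document ID) triples.
golden_set = [
    ("What is the refund window?", "30 days", "policy-7"),
    ("Who approves travel?", "Your manager", "hr-2"),
]


def toy_retrieve(question):
    """Stand-in retriever that returns the same ranking for every question."""
    return ["policy-7", "faq-1", "faq-3"]


score = recall_at_k(golden_set, toy_retrieve, k=2)
```

Tracking this number before and after each pipeline change (chunk size, embedding model, fusion settings) turns retrieval tuning into a measurable process.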