The Vocabulary Mismatch Problem
The fundamental challenge in information retrieval is vocabulary mismatch: users and documents use different words to describe the same concept. A user searching for "myocardial infarction" might miss documents that only use "heart attack." A user searching for "how to delete my account" might miss the relevant page titled "Account Deactivation Instructions." Traditional BM25 search cannot bridge these gaps; it finds documents containing the query terms, not documents about the query concept.
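To make the mismatch concrete, here is a minimal sketch of BM25 scoring over two toy documents. The function name `bm25_scores`, the whitespace tokenizer, and the parameter defaults (`k1=1.5`, `b=0.75`) are illustrative assumptions, not taken from any particular search engine.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # a term absent from the document contributes nothing
            df = sum(1 for t in tokenized if term in t)  # document frequency
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "early warning signs of a heart attack",
    "myocardial infarction treatment guidelines",
]
scores = bm25_scores("myocardial infarction", docs)
```

The first document is about the same condition, but it shares no term with the query, so its score is exactly zero and it can never be retrieved by term matching alone.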
Query expansion is the classic IR solution: broaden the query to include related terms that might appear in relevant documents. LLM-based expansion extends this idea by generating related terms from the query's meaning and intent rather than from a fixed word list.
Classic Query Expansion
Traditional expansion methods:
- Thesaurus expansion: Add WordNet synonyms for each query term. Simple but limited — thesauri don't capture domain-specific synonymy or informal language.
- Pseudo-relevance feedback (PRF): Retrieve the top K results for the original query, extract the most important terms from those results, add them to the query, then retrieve again. This improves recall when the initial results are relevant, but when they are not, the added terms pull the query further off topic.
- Query log mining: Learn query reformulations from users who reformulate unsuccessful queries. Requires large query logs.
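Of the methods above, pseudo-relevance feedback is the easiest to sketch in a few lines. The version below is a simplified illustration: the first-pass ranker is a plain term-overlap count rather than BM25, "most important terms" is approximated by raw frequency, and the names `prf_expand`, `overlap_score`, and `STOPWORDS` are hypothetical.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "to", "of", "and", "for", "how", "my", "your", "in", "is"}

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

def overlap_score(query_terms, doc_terms):
    """Toy first-pass ranker: count query-term occurrences in the document."""
    counts = Counter(doc_terms)
    return sum(counts[t] for t in query_terms)

def prf_expand(query, docs, top_k=2, n_terms=3):
    """Return the query terms plus frequent terms from the top_k results."""
    q_terms = tokenize(query)
    doc_terms = [tokenize(d) for d in docs]
    ranked = sorted(
        range(len(docs)),
        key=lambda i: overlap_score(q_terms, doc_terms[i]),
        reverse=True,
    )
    feedback = Counter()
    for i in ranked[:top_k]:
        # Skip documents that matched nothing; they carry no feedback signal.
        if overlap_score(q_terms, doc_terms[i]) > 0:
            feedback.update(t for t in doc_terms[i] if t not in q_terms)
    new_terms = [t for t, _ in feedback.most_common(n_terms)]
    return q_terms + new_terms

docs = [
    "delete account deactivation instructions for your account",
    "account deactivation steps and deactivation policy",
    "pricing plans and billing",
]
expanded = prf_expand("how to delete my account", docs)
```

Here the feedback step picks up "deactivation" from the top results, which is the term the original query was missing. If the first pass had ranked the pricing page highly instead, the same mechanism would have added billing vocabulary, which is the error-amplification risk noted above.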
LLM-Based Query Expansion
LLMs can generate query expansions that incorporate the query's semantic meaning, domain context, and likely user intent. A well-prompted LLM given "cancel subscription" can generate: "unsubscribe, cancel membership, stop billing, terminate account, end automatic renewal, deactivate subscription, stop recurring payment." These expansions go beyond word-level synonymy to capture conceptual and task-level alternatives.
The Scolta approach uses a specific prompt format that asks the model to consider what a user with this query is trying to accomplish, what vocabulary the relevant documents might use, and what alternative phrasings might find the relevant content. The resulting expansion terms are concatenated to the original query for BM25 retrieval.
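The flow described above (prompt, parse, concatenate) might look roughly like the sketch below. The prompt wording is a paraphrase written for illustration, not the actual Scolta prompt, and `llm` is a hypothetical callable that takes a prompt string and returns the model's text; a stand-in `fake_llm` is used so the example runs without a model.

```python
def build_expansion_prompt(query):
    """Build an illustrative prompt covering intent, vocabulary, and phrasings."""
    return (
        "You are helping a search engine.\n"
        f"User query: {query}\n"
        "Consider what the user is trying to accomplish, what vocabulary the "
        "relevant documents might use, and what alternative phrasings might "
        "find the relevant content.\n"
        "Reply with a comma-separated list of short search phrases only."
    )

def parse_expansions(response, max_terms=8):
    """Split a comma-separated reply into unique, lowercased phrases."""
    seen, terms = set(), []
    for part in response.split(","):
        term = part.strip().strip('."').lower()
        if term and term not in seen:
            seen.add(term)
            terms.append(term)
    return terms[:max_terms]

def expand_query(query, llm, max_terms=8):
    """Concatenate the expansion terms to the original query for BM25."""
    terms = parse_expansions(llm(build_expansion_prompt(query)), max_terms)
    terms = [t for t in terms if t != query.lower()]
    return " ".join([query] + terms)

def fake_llm(prompt):
    # Stand-in for a real model call (hypothetical fixed reply).
    return "unsubscribe, cancel membership, stop billing, Unsubscribe, end automatic renewal."

expanded_query = expand_query("cancel subscription", fake_llm)
```

Keeping the original query at the front and capping the number of added terms are design choices of this sketch; they limit how far a noisy model reply can shift the query.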
Semantic Reranking
Even after query expansion, BM25 ranking may not place the most semantically relevant documents at the top. Cross-encoder rerankers address this: given the (query, document) pair, a cross-encoder reads both together and scores relevance more accurately than BM25's term-matching approach.
Typical workflow: BM25 with expansion retrieves 100 candidates; a cross-encoder (typically a BERT-family model fine-tuned on MS MARCO) reranks the top 20. This "retrieve-then-rerank" pipeline can approach the ranking quality of vector search while keeping BM25's indexing infrastructure.
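A minimal sketch of this pipeline follows. Both stages are passed in as functions so the example stays self-contained: `first_stage` stands in for BM25 with expansion and `pair_scorer` stands in for a cross-encoder that scores a (query, document) pair. The toy implementations and all names here are assumptions for illustration only.

```python
def retrieve_then_rerank(query, first_stage, pair_scorer,
                         n_candidates=100, n_rerank=20):
    """Retrieve candidates, then reorder only the head of the list."""
    candidates = first_stage(query, n_candidates)
    head, tail = candidates[:n_rerank], candidates[n_rerank:]
    # The pair scorer sees the query and document together, one pair at a time.
    reranked = sorted(head, key=lambda doc: pair_scorer(query, doc), reverse=True)
    return reranked + tail

# Toy stand-ins so the sketch runs without a search index or a model.
corpus = [
    "billing faq",
    "stop recurring payment",
    "cancel your subscription",
    "company history",
]

def toy_first_stage(query, k):
    # Pretend the first stage returned the corpus in this order.
    return corpus[:k]

def toy_pair_scorer(query, doc):
    # Placeholder relevance score: number of shared words.
    return len(set(query.split()) & set(doc.split()))

results = retrieve_then_rerank(
    "cancel subscription", toy_first_stage, toy_pair_scorer,
    n_candidates=4, n_rerank=3,
)
```

Only the first `n_rerank` candidates are rescored because the pair scorer is the expensive step; documents below that cutoff keep their first-stage order.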
When Expansion Hurts
Query expansion can degrade performance for navigational queries (the user wants a specific page) and for very specific technical queries where the exact terms matter. An expansion of "git-log" might add "version control history" and "commit log," which could surface less relevant documentation pages. Controlling expansion aggressiveness (expanding more for natural-language queries, less for technical-term queries) is an important part of any production query expansion system.
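One simple way to control aggressiveness is to decide how many expansion terms a query is allowed before calling the expander. The heuristic below is only a sketch under stated assumptions: tokens containing punctuation such as "-" or "/", digits, or camelCase are treated as technical, and a fully quoted query is treated as a request for exact matching. The function name `expansion_budget` and the thresholds are hypothetical and would need tuning on real query logs.

```python
import re

# Assumed signals of a technical token: code-like punctuation, digits, camelCase.
TECHNICAL_TOKEN = re.compile(r"[-_./:#]|\d|[a-z][A-Z]")

def expansion_budget(query, max_terms=8):
    """Return how many expansion terms this query should be allowed."""
    query = query.strip()
    if len(query) >= 2 and query.startswith('"') and query.endswith('"'):
        return 0  # quoted query: assume the user wants exact terms
    tokens = query.split()
    if not tokens:
        return 0
    technical = sum(
        1 for t in tokens if TECHNICAL_TOKEN.search(t.strip(".,:;?!"))
    )
    if technical == len(tokens):
        return 0  # purely technical query: do not expand
    if technical > 0:
        return max_terms // 4  # mixed query: expand cautiously
    return max_terms  # natural-language query: expand fully

budgets = {
    q: expansion_budget(q)
    for q in ["git-log", "git log --oneline", "how to delete my account"]
}
```

With these rules "git-log" gets no expansion, "git log --oneline" gets a small budget, and "how to delete my account" gets the full budget. Navigational queries are harder to detect from the text alone and are not handled by this heuristic beyond the quoted-query case.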