Understand the history of search from TF-IDF to modern AI-enhanced systems. Implement BM25 from scratch. Explain how neural reranking improves search quality. Analyze Scolta's architecture as a worked example of practical AI search.
1. The Vocabulary Problem
Information retrieval's core challenge: users and documents use different words for the same concept. A user searching "heart attack symptoms" might miss a document titled "Signs of Myocardial Infarction." This vocabulary mismatch limits every keyword-based search system.
The evolution of search is largely the evolution of approaches to this problem: from user-driven synonym lists, through statistical analysis of co-occurrence, to neural models that understand semantic meaning.
2. TF-IDF: The Classic Foundation
import math
from collections import Counter
def tf_idf(query_terms, documents):
"""
TF-IDF scoring for a query against a document collection.
documents: list of tokenized documents
Returns: {doc_id: score} sorted by relevance
"""
N = len(documents)
# IDF: how rare is each term across the corpus?
df = Counter() # document frequency
for doc in documents:
for term in set(doc):
df[term] += 1
idf = {term: math.log(N / (freq + 1)) for term, freq in df.items()}
scores = {}
for i, doc in enumerate(documents):
doc_len = len(doc)
tf = Counter(doc)
score = 0
for term in query_terms:
if term in tf:
# TF: how often does the term appear in this document?
term_freq = tf[term] / doc_len
score += term_freq * idf.get(term, 0)
scores[i] = score
return sorted(scores.items(), key=lambda x: x[1], reverse=True)
TF-IDF is remarkably effective for its simplicity. Its limitations: no understanding of synonymy ("car" ≠ "automobile"), no handling of query term order, and sensitivity to document length normalization.
3. BM25: The Production Standard
BM25 (Best Match 25) addresses TF-IDF's length sensitivity and term saturation problems:
def bm25_score(query_terms, doc, documents, k1=1.5, b=0.75):
"""
BM25 scoring for one document.
k1: controls term frequency saturation (typical: 1.2-2.0)
b: controls length normalization (0=none, 1=full)
"""
N = len(documents)
avgdl = sum(len(d) for d in documents) / N
doc_len = len(doc)
tf = Counter(doc)
# Compute IDF (BPM25 variant)
df = Counter()
for d in documents:
for term in set(d):
df[term] += 1
score = 0
for term in query_terms:
if term not in tf:
continue
# TF with saturation and length normalization
tf_normalized = (tf[term] * (k1 + 1)) / (
tf[term] + k1 * (1 - b + b * doc_len / avgdl))
# IDF component
idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
score += idf * tf_normalized
return score
BM25's improvements: term frequency saturation (adding more occurrences of a term has diminishing returns) and explicit length normalization. These make it more robust than raw TF-IDF on realistic document collections.
4. Neural Reranking with Cross-Encoders
BM25 retrieves fast; cross-encoders rerank with precision. The difference: BM25 looks at query and document terms separately (O(n) retrieval); a cross-encoder reads query and document together (more expensive but more accurate).
from sentence_transformers import CrossEncoder
# Load a cross-encoder trained for reranking
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
def rerank(query, candidates):
"""
candidates: list of (doc_id, text) tuples from BM25 retrieval
Returns: reranked list
"""
pairs = [(query, text) for _, text in candidates]
scores = reranker.predict(pairs)
# Sort by cross-encoder score
reranked = sorted(zip(scores, candidates), reverse=True)
return [(doc_id, text, score) for score, (doc_id, text) in reranked]
5. Scolta: A Worked Example
Scolta integrates these ideas with LLM-based query expansion and AI overview generation. The full pipeline:
def scolta_search(user_query: str, corpus: list[dict]) -> dict:
"""Simplified Scolta pipeline."""
# 1. LLM Query Expansion
expanded_query = llm_expand(
user_query,
prompt="Generate 5-10 alternative phrasings, synonyms, and related terms "
f"for this search query: '{user_query}'. Return as comma-separated list."
)
search_query = user_query + " " + expanded_query
# 2. BM25 retrieval on expanded query
candidates = bm25_search(search_query, corpus, k=20)
# 3. AI overview from top results
top_texts = [c['text'][:500] for c in candidates[:5]]
overview = llm_summarize(
user_query,
top_texts,
prompt=f"Based on these search results, answer: '{user_query}'"
)
return {
"overview": overview,
"results": candidates,
"expanded_query": expanded_query
}
Build a minimal search system for a small corpus (100 Wikipedia article paragraphs). Implement BM25 from scratch. Measure nDCG@10 on 20 queries with known relevant documents. Then add a simple LLM query expansion step (using the Anthropic API). Measure how much nDCG improves. What types of queries improve most? Why?