Lecture: How Search Engines Use AI

How AI-powered search engines work — global data network

Learning Objectives

Understand the history of search from TF-IDF to modern AI-enhanced systems. Implement BM25 from scratch. Explain how neural reranking improves search quality. Analyze Scolta's architecture as a worked example of practical AI search.

1. The Vocabulary Problem

Information retrieval's core challenge: users and documents use different words for the same concept. A user searching "heart attack symptoms" might miss a document titled "Signs of Myocardial Infarction." This vocabulary mismatch limits every keyword-based search system.

The evolution of search is largely the evolution of approaches to this problem: from user-driven synonym lists, through statistical analysis of co-occurrence, to neural models that understand semantic meaning.

2. TF-IDF: The Classic Foundation

import math
from collections import Counter

def tf_idf(query_terms, documents):
"""
TF-IDF scoring for a query against a document collection.

documents: list of tokenized documents
Returns: {doc_id: score} sorted by relevance
"""
N = len(documents)

# IDF: how rare is each term across the corpus?
df = Counter()   # document frequency
for doc in documents:
    for term in set(doc):
        df[term] += 1

idf = {term: math.log(N / (freq + 1)) for term, freq in df.items()}

scores = {}
for i, doc in enumerate(documents):
    doc_len = len(doc)
    tf = Counter(doc)

    score = 0
    for term in query_terms:
        if term in tf:
            # TF: how often does the term appear in this document?
            term_freq = tf[term] / doc_len
            score += term_freq * idf.get(term, 0)

    scores[i] = score

return sorted(scores.items(), key=lambda x: x[1], reverse=True)

TF-IDF is remarkably effective for its simplicity. Its limitations: no understanding of synonymy ("car" ≠ "automobile"), no handling of query term order, and sensitivity to document length normalization.

3. BM25: The Production Standard

BM25 (Best Match 25) addresses TF-IDF's length sensitivity and term saturation problems:

def bm25_score(query_terms, doc, documents, k1=1.5, b=0.75):
"""
BM25 scoring for one document.
k1: controls term frequency saturation (typical: 1.2-2.0)
b: controls length normalization (0=none, 1=full)
"""
N = len(documents)
avgdl = sum(len(d) for d in documents) / N
doc_len = len(doc)
tf = Counter(doc)

# Compute IDF (BPM25 variant)
df = Counter()
for d in documents:
    for term in set(d):
        df[term] += 1

score = 0
for term in query_terms:
    if term not in tf:
        continue

    # TF with saturation and length normalization
    tf_normalized = (tf[term] * (k1 + 1)) / (
        tf[term] + k1 * (1 - b + b * doc_len / avgdl))

    # IDF component
    idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)

    score += idf * tf_normalized

return score

BM25's improvements: term frequency saturation (adding more occurrences of a term has diminishing returns) and explicit length normalization. These make it more robust than raw TF-IDF on realistic document collections.

4. Neural Reranking with Cross-Encoders

BM25 retrieves fast; cross-encoders rerank with precision. The difference: BM25 looks at query and document terms separately (O(n) retrieval); a cross-encoder reads query and document together (more expensive but more accurate).

from sentence_transformers import CrossEncoder

# Load a cross-encoder trained for reranking
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, candidates):
"""
candidates: list of (doc_id, text) tuples from BM25 retrieval
Returns: reranked list
"""
pairs = [(query, text) for _, text in candidates]
scores = reranker.predict(pairs)

# Sort by cross-encoder score
reranked = sorted(zip(scores, candidates), reverse=True)
return [(doc_id, text, score) for score, (doc_id, text) in reranked]

5. Scolta: A Worked Example

Scolta integrates these ideas with LLM-based query expansion and AI overview generation. The full pipeline:

def scolta_search(user_query: str, corpus: list[dict]) -> dict:
"""Simplified Scolta pipeline."""

# 1. LLM Query Expansion
expanded_query = llm_expand(
    user_query,
    prompt="Generate 5-10 alternative phrasings, synonyms, and related terms "
           f"for this search query: '{user_query}'. Return as comma-separated list."
)
search_query = user_query + " " + expanded_query

# 2. BM25 retrieval on expanded query
candidates = bm25_search(search_query, corpus, k=20)

# 3. AI overview from top results
top_texts = [c['text'][:500] for c in candidates[:5]]
overview = llm_summarize(
    user_query,
    top_texts,
    prompt=f"Based on these search results, answer: '{user_query}'"
)

return {
    "overview": overview,
    "results": candidates,
    "expanded_query": expanded_query
}

Exercise

Build a minimal search system for a small corpus (100 Wikipedia article paragraphs). Implement BM25 from scratch. Measure nDCG@10 on 20 queries with known relevant documents. Then add a simple LLM query expansion step (using the Anthropic API). Measure how much nDCG improves. What types of queries improve most? Why?