Resources | Meridian AI

Policy Gradients: The Math Behind Modern RL

The Optimization Problem

Reinforcement learning seeks to maximize expected cumulative reward. Formally: find a policy π(a|s) — a probability distribution over actions given states — that maximizes:

J(π) = E[Σ_t γ^t r_t | π]

Where γ is the discount factor and r_t is the reward at time t. Policy gradient methods optimize J directly by computing its gradient with respect to the policy parameters and ascending the gradient — standard optimization, but for this specific objective.

The Policy Gradient Theorem

The key result (Sutton et al., 1999) shows how to compute ∇J(π) without knowing the environment dynamics:

∇J(π_θ) = E[Q^π(s,a) ∇ log π_θ(a|s)]

Where Q^π(s,a) is the action-value function under π. This tells us: the gradient of expected return equals the expected gradient of the log-policy, weighted by the action-value. Intuitively: increase the probability of actions that lead to high value, decrease the probability of actions that lead to low value.

This "likelihood ratio" trick is fundamental: it allows us to differentiate through expectations of functions of samples from the policy, even though we don't know the environment dynamics that generate the rewards.

REINFORCE and Its Problems

REINFORCE (Williams, 1992) estimates Q^π with Monte Carlo returns: complete an episode, compute the total return from each state-action pair, use these as estimates. Simple but very high variance: returns are noisy because many other actions in the episode affect the total return. High variance means slow, unstable learning.

Actor-Critic Methods

Actor-critic methods reduce variance by using a learned value function (the "critic") to estimate Q^π rather than Monte Carlo returns. The critic is trained to minimize TD error; the actor uses the critic's estimates to compute policy gradients. This significantly reduces variance at the cost of introducing bias from the critic's approximation errors.

A2C/A3C (Asynchronous Advantage Actor-Critic) uses A^π(s,a) = Q^π(s,a) - V^π(s) (the "advantage" — how much better is this action than average?) as the policy gradient weight, further reducing variance. PPO (Proximal Policy Optimization) adds a clipping constraint that prevents the policy from changing too much in any single update, dramatically improving stability. PPO is the current default algorithm for most RL applications including RLHF.

SAC: Entropy Maximization

SAC (Soft Actor-Critic) adds an entropy term to the objective: the policy should not only maximize expected return but also maximize entropy (stay as random as possible while still achieving high reward). This encourages exploration, prevents premature convergence to suboptimal deterministic policies, and improves robustness. SAC is the dominant algorithm for continuous control tasks in robotics and simulation.

Positional Encoding: RoPE, ALiBi, and How Models Handle Position

Position in Transformers

Self-attention is inherently position-agnostic: if you shuffle the tokens, the attention patterns change but the architecture itself doesn't "know" positions. For tasks where order matters — which is basically all of language — you need to inject positional information. The choice of positional encoding determines how well the model handles long sequences and generalizes to sequence lengths beyond those seen during training.

Sinusoidal Encoding (Original Transformer)

The original transformer used fixed sinusoidal encodings added to token embeddings. The frequency pattern allows the model to compute relative positions through dot products, and the sinusoidal structure theoretically supports extrapolation to longer sequences. In practice, the original sinusoidal encoding doesn't generalize well to sequences longer than those seen during training.

Learned Absolute Positions

GPT-2 and early BERT models used learned absolute position embeddings: each position 1, 2, ..., N gets a trainable embedding vector that is added to the token embedding. Simple and effective within the training context length, but hard limit: position N+1 has no embedding. Long-context generalization requires either extrapolation (poor results) or further training on longer sequences.

Relative Positional Encodings

Rather than encoding absolute positions, relative encodings represent the distance between pairs of tokens. T5 uses learned relative bias terms added to attention logits; the model learns different attention biases for tokens that are 1 apart, 2 apart, 4 apart, etc. (with bucketing for large distances). This generalizes better to longer sequences because the model only needs to understand relative positions, not absolute ones.

RoPE: Rotary Position Embedding

RoPE (Su et al., 2021) applies a rotation to query and key vectors based on their absolute position. The mathematical insight: the dot product between rotated query and rotated key naturally captures relative position — the rotation angle difference equals the relative position. This makes RoPE a clean, computationally efficient way to inject position information.

RoPE has become the dominant positional encoding for modern LLMs: Llama 2/3, GPT-NeoX, Falcon, Mistral, and most frontier models use it. Key advantages: works within the standard attention mechanism without modifying its structure; scales well to long sequences; supports "YaRN" and "RoPE scaling" for context length extension.

ALiBi: Attention with Linear Biases

ALiBi (Press et al., 2022) adds a fixed, non-learned negative bias to attention logits, proportional to the distance between tokens. No positional vectors are added to embeddings; the position information is entirely in the attention bias. Key property: tokens that are far apart receive stronger negative attention bias, naturally discouraging attention across very long distances. This makes ALiBi particularly suited to long-context applications and enables training on shorter sequences then generalizing to longer ones — a major advantage over absolute encoding schemes. Used in BLOOM and several MPT models.

Context Length Extension with RoPE Scaling

A model trained with RoPE up to 4K tokens can be extended to 128K+ tokens by scaling the RoPE base frequency. Methods include NTK-aware scaling (preserving high-frequency components), YaRN (Yet another RoPE extensioN), and LongRoPE. In practice, some additional fine-tuning on long sequences is required for reliable long-context performance, but the RoPE scaling provides a much better initialization than starting from scratch.

Probability Theory for Deep Learning

Why Probability Is Non-Negotiable

Deep learning is fundamentally probabilistic: models output probability distributions, loss functions come from probability theory, regularization has Bayesian interpretations, and uncertainty quantification requires probabilistic reasoning. You can write code without understanding probability, but you can't debug, evaluate, or improve systems without it.

Basic Probability

A probability distribution assigns probabilities to events such that all probabilities are non-negative and sum to 1. Key distributions for ML:

Gaussian (Normal): Most common. Central limit theorem says sums of random variables converge to Gaussian. Shows up in weight initialization, noise models, VAE priors.
Bernoulli/Categorical: Binary and multi-class classification outputs. Categorical cross-entropy is the natural loss for categorical distributions.
Uniform: Used for random initialization, some regularization.
Dirichlet: Distribution over probability vectors; used in topic models and Bayesian neural networks.

Bayes' Theorem

Bayes' theorem connects prior beliefs to updated beliefs after observing data:

p(θ|D) = p(D|θ) p(θ) / p(D)

Posterior ∝ Likelihood × Prior. Maximum likelihood estimation (what most deep learning does) maximizes p(D|θ) ignoring the prior. Maximum a posteriori (MAP) estimation maximizes p(D|θ)p(θ) — which is ML with regularization. L2 regularization corresponds to a Gaussian prior on weights; L1 regularization corresponds to a Laplace prior.

Expectation, Variance, and Moments

Expected value E[X] = Σ x p(x) is the probability-weighted average. Variance Var[X] = E[(X-E[X])²] = E[X²] - (E[X])² measures spread. Standard deviation is the square root of variance.

In training: batch statistics (used in batch normalization) are estimates of population means and variances. Gradient variance is a key optimization concept — high variance means unreliable gradient estimates and unstable training. Adam's adaptive learning rates reduce the effect of variance in gradient estimates.

Monte Carlo Methods

Many integrals and expectations in ML can't be computed analytically. Monte Carlo methods approximate them by sampling: E[f(X)] ≈ (1/N) Σ f(xᵢ) where xᵢ ~ p(X). The law of large numbers guarantees convergence. Applications: SGD (Monte Carlo estimate of gradient), REINFORCE (Monte Carlo estimate of policy gradient), variational inference (Monte Carlo estimation of the ELBO).

# ============================================================ # SCHOOL OF SOCIETY & GOVERNANCE # ============================================================

Prompt Engineering Is Software Engineering: Patterns That Work

Why "Just Ask It" Is Not a Strategy

When LLMs first became accessible to developers, many assumed the interface was simple: write a question, get an answer. This worked for demos but fails in production. Building reliable LLM-powered applications requires systematic prompt engineering — the art and science of writing instructions that consistently elicit the behavior you want.

Prompt engineering is software engineering. It has idioms, patterns, anti-patterns, and testing methodologies. The gap between an amateur's prompt and a professional's prompt can be the difference between a system that works 60% of the time and one that works 95% of the time.

The System Prompt Is Your API Contract

Every production LLM application has a system prompt — the persistent instruction set that frames every user interaction. Think of it as the constructor for your LLM object. A good system prompt:

Defines the model's role and persona precisely: "You are a customer service agent for Acme Corp. You help customers with billing, shipping, and returns. You do not provide technical support."
Specifies the output format: "Always respond in JSON with keys: {answer: string, confidence: 'high'|'medium'|'low', sources: string[]}"
Establishes constraints: "Do not speculate about matters not in the provided context. If you don't know, say so."
Provides relevant context: background information, terminology, tone guidelines

A weak system prompt produces inconsistent behavior. A strong one creates a reliable interface.

Chain-of-Thought: Show Your Work

Chain-of-thought (CoT) prompting instructs the model to reason step by step before producing a final answer. The two canonical approaches:

Zero-shot CoT: "Think step by step." — simply appending this instruction to a prompt dramatically improves performance on reasoning tasks.
Few-shot CoT: Provide examples of step-by-step reasoning before the actual question. The model learns the desired reasoning style from examples.

CoT works because it forces the model to allocate computation to intermediate reasoning steps rather than jumping directly to an output. The scratchpad improves accuracy on arithmetic, logic, and multi-step inference tasks. The tradeoff: CoT uses more tokens (and therefore costs more).

Structured Output: JSON Mode and Schema Validation

If your application needs to parse the model's output programmatically, you need structured output. The naive approach (asking for JSON and hoping) fails too often. The robust approaches:

JSON mode: Most major APIs (OpenAI, Anthropic, Google) support forcing JSON output. Enable it.
Schema in the prompt: Include an explicit JSON schema in your prompt. "Respond with JSON matching this schema: {name: string, score: number, reasons: string[]}"
Pydantic/structured extraction: Libraries like instructor and outlines constrain model output to match a Pydantic schema, using constrained decoding or retry loops.
Validation + retry: Validate parsed output against your schema; if validation fails, retry with the error message in the prompt.

Retrieval Augmentation: Giving the Model Facts

LLMs have training data cutoffs and no access to your proprietary information. RAG (Retrieval-Augmented Generation) solves this by retrieving relevant documents and including them in the prompt context. The pattern:

system: You are a helpful assistant. Use the provided documents to answer questions.
user: [retrieved documents]
---
Question: {user_question}

Key prompt engineering considerations for RAG: tell the model explicitly to only use the provided context; instruct it to say "I don't know" when the context doesn't answer the question; include citation instructions if attribution is important.

Few-Shot Examples: The Most Underused Technique

Providing 3-5 examples of the input-output mapping you want is often the single most effective prompt improvement. Few-shot examples teach the model your format, style, and edge case handling in a way that instructions alone often can't. Guidelines: use diverse examples that cover different cases; use your actual data (not synthetic examples); include at least one "hard" example where the correct behavior is non-obvious.

Anti-Patterns to Avoid

Vague instructions: "Be professional" is meaningless. "Use formal language, avoid contractions, respond in 2-3 sentences" is actionable.
Overloaded prompts: Asking the model to do 12 things simultaneously reduces performance on each. Break complex tasks into steps.
No output format specification: If you don't specify format, you'll get unpredictable format.
Testing on two examples: A prompt that works on your two test cases may fail on production inputs. Evaluate on at least 50 diverse examples before shipping.
Treating prompts as write-once: Production prompts need version control, testing, and iteration. Treat them like code.

Testing Your Prompts

The professional approach to prompt engineering treats prompts as code that requires testing:

Build an eval suite of 50-200 input examples with expected outputs
Score each output (exact match, LLM-as-judge, or human review depending on the task)
Run the eval suite before and after every prompt change
Track scores over time; treat prompt regressions as bugs

RAG Architecture: When to Retrieve, When to Generate

The Knowledge Problem

Every LLM has a knowledge cutoff: it knows what was in its training data, nothing more. For many applications, this is fine — you're asking about general knowledge that doesn't change. But for applications involving current events, proprietary information, personal data, or rapidly changing domains, static training knowledge isn't enough.

Retrieval-Augmented Generation (RAG) solves the knowledge problem by giving the model access to a document store that can be queried at inference time. Rather than training the knowledge into the model's weights, you retrieve it on demand and include it in the prompt context. This is the dominant architecture for enterprise AI applications.

Basic RAG Architecture

The simplest RAG system has four components:

Document store: A collection of documents (PDFs, web pages, database records) chunked into segments of appropriate size (typically 200-500 tokens).
Retriever: A component that, given a user query, returns the most relevant document chunks. Can be BM25, dense retrieval (vector search), or hybrid.
Prompt assembler: Combines retrieved chunks with the user query into an LLM prompt.
Generator: The LLM that produces the final response from the assembled prompt.

Chunking Strategy: The Hidden Complexity

How you chunk your documents profoundly affects RAG quality. The basic tradeoff: small chunks are more precise (easier to match a specific query) but may lack the context the LLM needs to answer. Large chunks provide more context but are harder to retrieve precisely and quickly exceed context window budgets.

Common strategies:

Fixed-size: Chunk every 512 tokens with 50-token overlap. Simple, but ignores document structure.
Semantic: Chunk at paragraph or section boundaries. Preserves context but produces variable-size chunks.
Hierarchical: Index both summaries (for retrieval) and full sections (for context), retrieving at the summary level but feeding the full section to the LLM. Works well for long documents.
Sentence window: Retrieve at the sentence level (precise), but feed neighboring sentences as context. Good for dense reference material.

Retrieval Quality: The Bottleneck

A RAG system is only as good as its retrieval. If the retriever fails to return documents relevant to the user's query, the LLM has nothing to work with. The quality failure modes:

BM25 misses conceptually related documents that use different vocabulary
Dense retrieval (vector search) misses exact matches for technical terms
Both fail when the question is vague or under-specified

Hybrid retrieval — combining BM25 and dense retrieval via reciprocal rank fusion — outperforms either alone on most realistic query distributions. Query expansion (using an LLM to generate related terms or sub-questions) further improves recall. The Certificate in AI-Powered Search at Meridian covers these techniques in depth.

Advanced RAG: When Simple Isn't Enough

Basic RAG fails for complex questions that require reasoning across multiple documents, synthesis of conflicting information, or multi-hop inference (A → B → C). Advanced patterns:

Multi-query RAG: Decompose complex questions into sub-questions, retrieve for each, then synthesize.
Iterative retrieval: The LLM decides when it needs more information and requests additional retrieval rounds.
Graph RAG: Build a knowledge graph over the document set; traverse it for multi-hop queries.
Self-consistency: Run multiple retrieval-generation cycles and check agreement.

When to Use Context Window vs. RAG

With million-token context windows, is RAG even necessary? The answer is nuanced. For small document sets (under ~50 pages), stuffing everything into context is simpler and often performs better. For larger corpora, context costs become prohibitive and retrieval is necessary for cost control. The decision matrix:

Under 100 pages, high-priority queries: Full context. Cost is manageable and precision is maximum.
100-10,000 pages: Hybrid RAG. Retrieve top-K then synthesize.
Over 10,000 pages: RAG is necessary. Vector search with hierarchical indexing.
Need for citations: RAG, always. Context-stuffing makes attribution impossible.

Red-Teaming AI Systems: Methodology and Practice

What Red-Teaming Is

Red-teaming (borrowing military terminology for adversarial testing) is the practice of systematically trying to break AI systems before deployment, finding harmful behaviors, safety failures, and misuse vectors before real users do. It's a critical part of responsible AI deployment and increasingly required by regulatory frameworks (the EU AI Act mandates adversarial testing for high-risk and systemic-risk AI systems).

Types of Red-Teaming

Manual red-teaming: Human "attackers" attempt to elicit harmful behavior through creative prompting. Effective at finding novel attack vectors that automated methods miss; limited in coverage.
Automated red-teaming: AI models generate attack prompts systematically. Scales coverage but may miss attacks requiring genuine creativity or social engineering.
Structured red-teaming: Testing against a predefined taxonomy of harms (OWASP LLM Top 10, NIST AI RMF threat categories). Ensures coverage of known failure types but may miss novel ones.
Domain expert red-teaming: Subject matter experts (medical professionals, security researchers, lawyers) test AI in their domain. Essential for high-stakes deployments.

Common Attack Vectors for LLMs

Jailbreaking: Prompt constructions that bypass safety training. Many types: role-play framing, "do anything now" (DAN) prompts, gradual escalation, hypothetical framing.
Prompt injection: Malicious instructions in retrieved content (documents, web pages) that override the system prompt. Critical risk for RAG systems and agents with internet access.
Indirect harm: Eliciting outputs that are harmful not by being dangerous themselves but by being misleading, privacy-violating, or used as part of a larger harmful workflow.
Multi-turn attacks: Building context across a conversation to gradually elicit harmful outputs that would be refused in a single turn.

Dr. Okafor's Automated Red-Teaming Framework

Meridian AI's Associate Professor James Okafor developed an automated red-teaming framework that uses one LLM to systematically generate attack prompts against another. The "attacker" model is trained to find prompts that the "target" model refuses, then iteratively refines attacks based on refusal feedback. The framework has been adopted by several major AI labs and is available as open-source software. SAF-220 covers the technical details and hands-on implementation.

# ============================================================ # SCHOOL OF APPLIED INTELLIGENCE # ============================================================

RLHF vs DPO: Aligning Language Models with Human Preferences

The Alignment Tax

A pretrained language model is a powerful but wild thing: it will complete any text, whether the completion is helpful, harmful, or nonsensical. To make it useful, you have to align it — train it to be helpful and to avoid harmful outputs. The two dominant methods for doing this are Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). Understanding both — when each works, why they differ, and what they share — is essential for anyone building or evaluating production AI systems.

RLHF: The Three-Stage Pipeline

RLHF consists of three stages:

Supervised Fine-Tuning (SFT): Train the model on a curated set of (prompt, high-quality response) pairs to establish a baseline of good behavior.
Reward Model Training: Collect human preference data: for the same prompt, present two model outputs and ask which is better. Train a separate reward model to predict human preferences from (prompt, response) pairs.
RL Training: Use the reward model as a reward signal to fine-tune the SFT model via reinforcement learning (typically PPO). The policy (language model) learns to generate outputs that the reward model scores highly, while a KL divergence penalty prevents it from drifting too far from the SFT model.

RLHF's appeal is that it trains directly on human preferences, not on human demonstrations. This allows the model to discover outputs better than anything in the training data — a model trained with RLHF can theoretically learn to write better than its best training examples if the reward model's judgments are accurate.

RLHF's Problems

Despite its successes (ChatGPT, Claude), RLHF has significant practical challenges:

Reward hacking: The policy learns to game the reward model rather than truly satisfy human preferences. Sycophancy — agreeing with the user regardless of accuracy — is a classic example. The reward model likes confident, agreeable responses; the policy learns to produce them.
Instability: PPO is finicky. Hyperparameter sensitivity is high; KL coefficient must be tuned carefully; training can collapse or diverge.
Cost: Three separate training stages, plus the overhead of running the reward model during PPO training, makes RLHF expensive compared to supervised learning.
Off-policy issues: As the policy drifts during training, the reward model is increasingly asked to score outputs from a distribution it wasn't trained on.

DPO: Solving the Preference Problem Directly

Direct Preference Optimization (Rafailov et al., 2023) achieves comparable alignment quality to RLHF without the RL stage. The key insight: there is a closed-form relationship between the reward model and the optimal policy in RLHF. Given that relationship, you can directly optimize the policy to satisfy the preference data, bypassing the need to explicitly train a reward model or run PPO.

The DPO loss for a preference pair (chosen, rejected) is:

L_DPO = -log sigmoid(β (log π_θ(y_w|x)/π_ref(y_w|x) - log π_θ(y_l|x)/π_ref(y_l|x)))

Where π_θ is the policy being trained, π_ref is the reference model, y_w is the preferred ("won") response, y_l is the dispreferred ("lost") response, and β controls the KL penalty. This is a standard supervised classification loss on preference pairs — no RL required.

DPO's Tradeoffs

DPO is simpler and more stable than RLHF. It trains in one stage (assuming SFT is already done), requires only one model in memory at training time, and is far less sensitive to hyperparameters. It has become the default alignment method for open-source model fine-tuning.

Where DPO falls short: it cannot discover better responses than those in the training data. RLHF's reward model can evaluate novel generations; DPO only learns from the preference pairs you provide. This means DPO is better suited to improving an already-reasonable model than to transforming a poor one into a great one.

Several DPO variants address specific limitations: IPO (Identity Preference Optimization) for out-of-distribution stability; KTO for unpaired preferences; SimPO for simplified reference-free training. The alignment field is moving fast; expect new variants regularly.

Constitutional AI and RLAIF

Anthropic's Constitutional AI (CAI) approach replaces human preference labelers with AI feedback. A "constitution" — a set of principles like "be helpful, harmless, and honest" — is used to generate AI preference judgments at scale, which are then used in RLAIF (Reinforcement Learning from AI Feedback). CAI scales preference data collection dramatically and addresses labeler fatigue and value drift, at the cost of inheriting any biases in the AI judge.

Choosing in Practice

For most practitioners: start with DPO. It's simpler, more stable, and produces good results for most alignment tasks. Use RLHF (or its modern successors) when you need the model to discover outputs better than your training data, or when you have the resources to run the full pipeline and benefit from online preference collection.

SAM and the Democratization of Image Segmentation

The Segmentation Problem

Image segmentation — identifying which pixels belong to which object — has historically required large amounts of labeled training data specific to each application domain. A model trained on medical images can't segment satellite imagery. A model trained on indoor scenes doesn't work on microscopy slides. This data requirement has been a major barrier to deploying segmentation in new applications.

The Segment Anything Model (SAM), from Meta AI Research in 2023, changed this by building a foundation model for segmentation — one that generalizes to new domains and objects without retraining.

SAM's Architecture

SAM consists of three components: an image encoder (heavyweight ViT that runs once per image), a prompt encoder (lightweight, for interactive prompts), and a mask decoder (lightweight transformer that outputs segmentation masks). The image encoder produces a feature representation of the entire image; the prompt encoder handles point, box, mask, or text prompts that indicate what to segment; the mask decoder generates the actual segmentation mask.

The separation of heavy (image encoding) from light (prompt + decode) is key for interactive use: you precompute the expensive image encoding once, then respond to user prompts at real-time speeds.

The SA-1B Dataset

SAM's capabilities come from scale: it was trained on SA-1B, a dataset of 1 billion segmentation masks across 11 million images, assembled through a "data engine" that used SAM itself (in progressively capable versions) to annotate images with human review. No previously available segmentation dataset came close to this scale.

Zero-Shot Generalization

SAM was tested on 23 diverse segmentation benchmarks it had never seen during training, ranging from underwater footage to microscopy to satellite imagery. It outperformed prior state-of-the-art segmentation models on 16 of 23 benchmarks, without any fine-tuning. This zero-shot performance demonstrates genuine generalization rather than task-specific optimization.

SAM 2: Video Segmentation

SAM 2 (2024) extended SAM to video: given a prompt in any frame, SAM 2 propagates the segmentation through the entire video, handling occlusion and reappearance. This is significantly harder than per-frame segmentation because objects change appearance, get occluded, and require tracking memory. SAM 2 sets state-of-the-art results on video object segmentation benchmarks while maintaining the interactive, promptable interface of the original SAM.

Scolta: A Case Study in Practical AI Search

The Problem Scolta Solves

Most websites have mediocre search. Users type queries, get keyword-matched results that miss the semantic meaning, and give up. The obvious solution — vector search and dense retrieval — requires significant infrastructure investment: embedding pipelines, vector databases, reranking services, and ongoing maintenance. For the millions of websites on CMS platforms like Drupal and WordPress, this is often impractical.

Scolta, developed by Tag1 Consulting, takes a different approach: layer AI capabilities onto existing keyword search infrastructure without requiring vector databases. The result is a system that dramatically improves search quality for real-world content websites, built on technology that teams without specialized ML infrastructure can deploy and maintain.

Architecture Overview

Scolta's architecture has three core components built on top of Pagefind (a static site search system):

LLM Query Expansion: Before the query hits the search index, Scolta sends it to the Claude API to generate a richer query: synonyms, related terms, alternative phrasings, and domain-specific vocabulary the user might not have used. A query for "cancel subscription" becomes a richer query that also matches "terminate membership," "stop renewal," and "deactivate account."
Pagefind BM25 Search: The expanded query runs against a static Pagefind index — fast, client-side, no server required. The expansion dramatically improves BM25 recall without requiring semantic embeddings.
AI Overview Generation: After retrieval, Scolta sends the top results to Claude to generate a synthesized AI overview — a paragraph-length answer to the user's question, with citations to specific pages. This is what users see first, above the ranked results list.

What Makes Scolta Distinctive

The key architectural insight is pragmatic: most AI search papers optimize for absolute performance on academic benchmarks. Scolta optimizes for practicality for real-world deployments. This means:

No embedding infrastructure: Pagefind is static site search — it runs in the browser with no server-side indexing infrastructure. Scolta adds AI without adding complexity.
Incremental cost: One LLM API call per search query for expansion + one for the AI overview. Total marginal cost: under $0.05 per query at 2026 API prices.
Works with existing content: No changes to how content is structured or stored. Scolta builds on the existing Drupal (or other CMS) content model.
Transparent AI: The AI overview cites specific pages. Users see where the answer came from and can click through to verify.

Evaluation: Does It Work?

Internal evaluation comparing Scolta search to baseline Pagefind-only search on representative query sets showed: 35-45% improvement in NDCG@10 (a standard IR quality metric) and significantly higher user satisfaction ratings. The improvement is largest for natural-language and conceptual queries — exactly the cases where pure BM25 performs worst.

Teaching With Scolta

The IR-290 capstone at Meridian AI uses Scolta's open architecture as both a learning tool and a deployment option. Students who deploy Scolta implement LLM-based query expansion and AI overview generation from scratch, then compare their results to the production Scolta system. This hands-on approach produces deep understanding of how AI search systems work — from the theory through the implementation through the evaluation.

The full technical approach to AI-powered search, including Scolta's design, is documented in Tag1 Consulting's "Practical Path to AI Search" series at tag1.com/how-to/.

Search Beyond Keywords: Query Expansion and Semantic Understanding

The Vocabulary Mismatch Problem

The fundamental challenge in information retrieval is vocabulary mismatch: users and documents use different words to describe the same concept. A user searching for "myocardial infarction" might find documents that only use "heart attack." A user searching for "how to delete my account" might miss the relevant page titled "Account Deactivation Instructions." Traditional BM25 search cannot bridge these gaps; it finds documents containing the query terms, not documents about the query concept.

Query expansion is the classic IR solution: broaden the query to include related terms that might appear in relevant documents. Modern LLM-based expansion makes this dramatically more effective.

Classic Query Expansion

Traditional expansion methods:

Thesaurus expansion: Add WordNet synonyms for each query term. Simple but limited — thesauri don't capture domain-specific synonymy or informal language.
Pseudo-relevance feedback (PRF): Retrieve the top K results for the original query, extract the most important terms from those results, add them to the query, re-retrieve. Improves recall for good initial results but amplifies errors for bad ones.
Query log mining: Learn query reformulations from users who reformulate unsuccessful queries. Requires large query logs.

LLM-Based Query Expansion

LLMs can generate query expansions that incorporate the query's semantic meaning, domain context, and likely user intent. A well-prompted LLM given "cancel subscription" can generate: "unsubscribe, cancel membership, stop billing, terminate account, end automatic renewal, deactivate subscription, stop recurring payment." These expansions go beyond word-level synonymy to capture conceptual and task-level alternatives.

The Scolta approach uses a specific prompt format that asks the model to consider what a user with this query is trying to accomplish, what vocabulary the relevant documents might use, and what alternative phrasings might find the relevant content. The resulting expansion terms are concatenated to the original query for BM25 retrieval.

Semantic Reranking

Even after query expansion, BM25 ranking may not place the most semantically relevant documents at the top. Cross-encoder rerankers address this: given the (query, document) pair, a cross-encoder reads both together and scores relevance more accurately than BM25's term-matching approach.

Typical workflow: BM25+expansion retrieves 100 candidates; a cross-encoder (typically a BERT-family model fine-tuned on MS MARCO) reranks the top 20. This "retrieve-then-rerank" pipeline achieves near-vector-search quality with BM25's indexing infrastructure.

When Expansion Hurts

Query expansion can degrade performance for navigational queries (the user wants a specific page) and for very specific technical queries where the exact terms matter. An expansion of "git-log" might add "version control history" and "commit log," which could surface less relevant documentation pages. Controlling expansion aggressiveness — expanding more for natural-language queries, less for technical term queries — is an important part of production query expansion implementation.

Sim-to-Real: Training Robots in Virtual Worlds

The Training Data Problem for Robots

Neural network training requires data — lots of it. A robot learning to walk might need to attempt the task millions of times before achieving consistent performance. In the real world, a million attempts would take months, wear out hardware, and likely involve many destructive falls. Simulation solves this: in a physics simulator, a million training steps might take hours, cost nothing in hardware, and enable parallel training across hundreds of virtual instances simultaneously.

The catch is the "reality gap": simulated environments are approximations. Physics simulators don't perfectly model friction, contact dynamics, material deformation, or sensor noise. A policy trained in simulation may fail when deployed on real hardware because the simulator was wrong about something important.

Domain Randomization

The dominant technique for crossing the reality gap is domain randomization: randomize the simulation parameters during training so the robot learns a policy that works across a wide distribution of environments, rather than the specific parameters of any single simulator. Randomize: physics parameters (friction, mass, inertia), visual parameters (lighting, texture, camera noise), and dynamic parameters (actuator delays, gravity magnitude, object placement).

The intuition: if the simulation distribution includes the real world (or is close to it), the policy trained on the distribution will work in reality. More specifically: the real environment is just one more point in the distribution the policy has been trained to handle.

Dr. Chakraborty's lab at Meridian AI has developed adaptive domain randomization — automatically adjusting which parameters to randomize and by how much based on the difficulty of the current training step, prioritizing parameters that actually affect task performance.

Isaac Lab and GPU-Accelerated Simulation

NVIDIA's Isaac Lab enables GPU-accelerated physics simulation, running thousands of robot environments in parallel on a single A100. This changes the economics: a sim-to-real training run that previously took weeks now takes hours. The quality of the physics simulation has also improved dramatically: contact dynamics, deformable objects, and fluid simulation are now fast enough to use in training loops.

Privileged Information and Asymmetric Actor-Critic

A clever sim-to-real technique: during training, give the policy access to information that's available in simulation (exact contact forces, precise state of hidden objects) but not in the real world. This "privileged information" makes the training problem easier and produces better policy networks. At deployment, train a "student" policy that maps only real-sensor observations to the actions the privileged "teacher" would take. The student learns to infer what the teacher knew directly.

Results at Meridian AI

The Threshold Robotics Lab's work on indoor navigation (Project: Indoor Navigation with World Models) has achieved 87% success on novel layouts in sim-to-real transfer using this approach. For manipulation tasks, the lab's diffusion policy implementation achieves sub-centimeter grasping accuracy on objects seen only in simulation during training. These results indicate that the sim-to-real gap has narrowed to the point where many practical tasks can be solved entirely in simulation.

# ============================================================ # SCHOOL OF FOUNDATIONS & MATHEMATICS # ============================================================

Text-to-Image in 2026: FLUX, Midjourney, DALL-E, and Open Alternatives

The State of Image Generation

In 2022, a text-to-image system that could generate photorealistic images from arbitrary descriptions was science fiction for most users. By 2024, it was a commodity. By 2026, the technology is embedded in creative workflows across advertising, entertainment, design, and media production. This article provides a practical overview of the major systems and how to choose between them.

FLUX (Black Forest Labs)

FLUX.1 (2024) represents the current state of the art in open-weight text-to-image generation. Its architecture uses flow matching with a transformer backbone (MMDiT — Multimodal Diffusion Transformer) that jointly processes text and image tokens. Key strengths: excellent text rendering in images (historically a weakness of diffusion models), superior prompt adherence, highly realistic human faces and anatomy, and strong compositional understanding.

FLUX.1 Pro (via API) and FLUX.1 Dev (open weights, non-commercial) are the primary variants. FLUX.1 Schnell is a distilled version for faster generation at modest quality cost. The open weights have spawned a large ecosystem of fine-tunes and community models.

Midjourney

Midjourney (V6, V7) prioritizes artistic quality and aesthetic appeal. Its proprietary training and model architecture produce images with a distinctive "Midjourney look" — often described as painterly, dramatic, and visually striking — that differs from the photorealistic default of FLUX and DALL-E. Midjourney is the dominant choice for artistic and creative work where aesthetic quality matters more than photographic accuracy.

Midjourney is available only via API (Discord-based or web interface); no open weights. Strong community around prompting guides and style references.

DALL-E 3 (OpenAI)

DALL-E 3 is notable for its tight integration with ChatGPT, which rewrites user prompts to be more effective before sending them to the image model. This "prompt improvement" step significantly helps non-expert users but can frustrate power users who want precise control. DALL-E 3's quality is excellent, with particular strength in following complex compositional instructions. Available via OpenAI API and ChatGPT Plus.

Stable Diffusion Ecosystem

The Stable Diffusion ecosystem (Stability AI plus community) remains the choice when: you need full control and privacy (self-hosted), want to fine-tune on specific styles or subjects, or need to integrate with custom workflows. The ComfyUI and Automatic1111 UIs provide extensive control over sampling parameters, LoRA composition, ControlNet conditioning, and pipeline customization. Stable Diffusion XL (SDXL) and SD3 are the current flagship models; community fine-tunes cover essentially every style and subject imaginable.

Practical Guidance

Best quality, photorealistic: FLUX.1 Pro or DALL-E 3
Artistic / aesthetic: Midjourney V7
Self-hosted / full control: FLUX.1 Dev or SDXL
Character consistency: FLUX fine-tunes or Midjourney character reference
Text in images: FLUX (best in class)
Budget / batch generation: SDXL or Schnell