Resources

The Counterintuitive Process

Diffusion models generate images by learning to reverse a process of gradual noise addition. The insight seems backwards: to learn to create clean images, you first learn to destroy them. But this counterintuitive approach has produced the most capable image generation systems in history, from Stable Diffusion to DALL-E 3 to FLUX.

The Forward Process: Adding Noise

Training starts with real images. A diffusion model's forward process takes a clean image x₀ and adds Gaussian noise in T steps, producing increasingly noisy images x₁, x₂, ..., xₜ, until at step T the image is pure noise. The process is mathematically defined as a Markov chain:

q(xₜ|xₜ₋₁) = N(xₜ; √(1-βₜ)xₜ₋₁, βₜI)

Where βₜ is a noise schedule that controls how much noise is added at each step. After T steps (typically T=1000), the original image is completely unrecognizable. The key property of this process: for any t, we can sample xₜ directly from x₀ without running all t steps. Writing αₜ = 1-βₜ and ᾱₜ = α₁α₂⋯αₜ for the cumulative product:

q(xₜ|x₀) = N(xₜ; √ᾱₜ x₀, (1-ᾱₜ)I)
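
The closed-form jump from x₀ to xₜ can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the linear schedule from 1e-4 to 0.02 and the helper names `make_alpha_bars` and `q_sample` are assumptions made for the example, since the text does not fix a schedule.

```python
import math
import random

def make_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s), linear schedule (assumed)."""
    alpha_bars = []
    running = 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) I) in one shot; t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    signal = math.sqrt(a_bar)
    sigma = math.sqrt(1.0 - a_bar)
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * x + sigma * e for x, e in zip(x0, noise)]
    return x_t, noise

alpha_bars = make_alpha_bars()
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]                      # a toy 4-"pixel" image
x_mid, _ = q_sample(x0, 500, alpha_bars, rng)   # partly noised
x_end, _ = q_sample(x0, 1000, alpha_bars, rng)  # almost pure noise
print(alpha_bars[0], alpha_bars[-1])
```

Returning the noise alongside xₜ is deliberate: a noise-prediction network would be trained to recover exactly that vector from xₜ and t.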

The Reverse Process: Denoising

The reverse process learns to undo the noise addition step by step, starting from pure noise xₜ and progressively cleaning it toward a coherent image x₀. A neural network (typically a UNet for DDPM, or a transformer-based architecture for newer models) is trained to predict the noise that was added at each step, given the noisy image and the timestep.

During inference, we start with random Gaussian noise and repeatedly apply the learned denoising step, iteratively refining the image. After T denoising steps, we have a sample from the model's approximation of the distribution of real images: ideally a novel image rather than a copy of anything in the training data.

DDIM: Faster Sampling

DDPM's 1000-step denoising is slow. DDIM (Denoising Diffusion Implicit Models) reformulates the sampling process as a non-Markovian chain, enabling sampling in 10-50 steps with minimal quality loss. Most practical diffusion systems use DDIM or its variants rather than the original DDPM sampling schedule.

Latent Diffusion: The Key to Scalability

Applying diffusion directly in pixel space is expensive: a 512×512 RGB image has 262,144 pixels, or 786,432 values across its three color channels. Latent Diffusion Models (LDM), the basis of Stable Diffusion, first encode the image into a lower-dimensional latent representation using a trained variational autoencoder (VAE), then apply diffusion in this latent space. In Stable Diffusion the VAE downsamples each side by 8×, giving a 64×64×4 latent with 64× fewer spatial positions and roughly 48× fewer values than the original; diffusion in this space is dramatically faster.

Text Conditioning

Text-to-image generation requires conditioning the denoising process on a text description. Most systems use CLIP or a T5-family text encoder to convert the text prompt into a sequence of embeddings, which are then injected into the denoising UNet via cross-attention layers. The UNet attends to text features at each denoising step, guiding the image toward the described content.

Classifier-free guidance (CFG) scales the influence of the text condition: with CFG scale 7, the image strongly adheres to the prompt; with CFG scale 1, guidance is effectively off and the image follows the prompt only as closely as the plain conditional prediction does. Higher CFG scales produce more "prompt-accurate" but sometimes less photorealistic images.
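
Guidance is commonly written as an extrapolation from the unconditional noise prediction toward the conditional one. The sketch below assumes that standard formulation; the function name `cfg_combine` and the toy three-element "noise predictions" are hypothetical stand-ins for real network outputs.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.10, -0.20, 0.00]   # prediction with an empty prompt (toy values)
eps_cond = [0.30, -0.10, 0.05]     # prediction with the text prompt (toy values)

print(cfg_combine(eps_uncond, eps_cond, 1.0))  # reduces to the conditional prediction
print(cfg_combine(eps_uncond, eps_cond, 7.0))  # pushed well past the conditional prediction
```

In this formulation each denoising step needs two network evaluations, one with the prompt and one without, which is part of the cost of using guidance.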

FLUX and Flow Matching

FLUX (Black Forest Labs, 2024) uses flow matching rather than the classic diffusion objective. Instead of learning to denoise Gaussian noise step by step, the model learns to transport samples along straight-line trajectories between the noise distribution and the data distribution. Flow matching is mathematically cleaner and tends to reach good sample quality with fewer function evaluations. FLUX's architecture also uses a "multimodal diffusion transformer" (MMDiT) that processes text and image tokens jointly, which contributes to better text rendering in generated images than earlier architectures achieved.

Why Edge Matters

Most AI discussion assumes cloud inference: send data to a server, get results back. This works for many applications but fails for others: when latency is critical (real-time robot control), when connectivity is unreliable (drones in remote areas), when privacy is required (medical devices that can't send patient data to the cloud), or when cost makes continuous cloud API calls impractical (10 million IoT sensors).

Edge AI runs models on the device — phone, drone, microcontroller, embedded system. Getting useful AI to run efficiently on constrained hardware is a significant engineering challenge with specialized tools and techniques.

The Hardware Landscape

  • Mobile SoCs: Apple A18, Snapdragon 8 Gen 4, Samsung Exynos — all include dedicated Neural Processing Units (NPUs) that provide 10-40 TOPS (tera-operations per second) of neural network compute. Modern smartphones can run 7B LLMs (quantized) at 5-15 tokens/second.
  • Edge GPUs: NVIDIA Jetson Orin (roughly 20-275 TOPS depending on module); designed for robotics, drones, and smart cameras. Full PyTorch and TensorRT support.
  • Microcontrollers (MCU): STM32, Arduino Nano 33, Raspberry Pi Pico — milliwatts of power, kilobytes of RAM. TinyML models run classification and anomaly detection; not LLMs.
  • FPGAs: Xilinx/AMD and Intel FPGAs enable custom neural network inference at extremely low latency and power. Used in specialized industrial and medical applications.

Quantization for Edge

Standard ML models use 32-bit or 16-bit floating point weights. Edge hardware often works best with 8-bit integer (INT8) or 4-bit quantization. INT8 quantization reduces memory 4× (vs float32) and enables specialized integer MAC units that are 4-8× more efficient than float units. For LLMs on mobile, GGUF format (llama.cpp) with 4-bit or 5-bit quantization is standard.
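
To make the INT8 idea concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is one simple scheme among several (per-channel scales and asymmetric zero points are common in practice), and the function names are assumptions for illustration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float values from the stored integers and the scale."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.003, 0.9, -0.55]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

Each weight now costs one byte plus a shared scale, and the rounding error per weight is bounded by half the scale. Very small weights such as 0.003 round to zero, which is one reason outlier-heavy tensors quantize poorly with a single scale.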

Model Architecture for Edge

Edge deployment benefits from architectures designed for efficiency: MobileNet, EfficientNet, MobileViT for vision; MobileLLM, Phi-3-mini for language. Knowledge distillation — training a smaller "student" model to mimic a larger "teacher" — produces compact models that punch above their parameter count. Neural architecture search (NAS) finds Pareto-optimal architectures for specific hardware targets.

Practical Deployment

Key tools: ONNX (model export format), TensorFlow Lite (mobile and embedded; since renamed LiteRT), Core ML (Apple devices), ONNX Runtime (cross-platform), llama.cpp (LLM inference on CPUs and Apple Silicon). For Android, LiteRT, part of Google AI Edge, provides hardware-accelerated inference on Qualcomm, MediaTek, and Google Tensor NPUs. Dr. Lin's research group at Meridian AI focuses on NAS methods for these edge hardware targets.

The Mountain in the Dark

Optimization in machine learning is like trying to find the lowest point in a mountain range while blindfolded. You can feel the slope under your feet (the gradient), but you can't see the landscape. Gradient descent is the simplest strategy: always step in the direction that slopes downward. This simple idea, combined with careful tuning and clever variants, is how virtually all neural networks are trained.

Vanilla Gradient Descent

The basic update rule:

θ ← θ - η ∇_θ L(θ)

Where η is the learning rate and ∇_θ L is the gradient of the loss with respect to parameters. Take small steps in the direction of steepest descent. The learning rate η is the most critical hyperparameter: too large and training diverges; too small and training is impossibly slow.

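For a quadratic loss (the simplest case), gradient descent converges geometrically: along each eigendirection of the Hessian, the error multiplies by (1 - ηλᵢ) at each step, where λᵢ is the corresponding eigenvalue. For small η the slowest direction is the one with the smallest eigenvalue λ_min, while any η above 2/λ_max makes the stiffest direction diverge. The optimal fixed step size is η = 2/(λ_min + λ_max), which requires knowing the Hessian's spectrum and is therefore not practical for neural networks.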
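
The quadratic case is easy to check numerically. The sketch below assumes a diagonal Hessian with eigenvalues 1 and 10 (chosen purely for illustration), so each coordinate evolves independently under the update rule.

```python
def gradient_descent_quadratic(eigenvalues, theta0, lr, steps):
    """Minimize L(theta) = 0.5 * sum(lam_i * theta_i**2); the gradient is lam_i * theta_i."""
    theta = list(theta0)
    for _ in range(steps):
        theta = [th - lr * lam * th for th, lam in zip(theta, eigenvalues)]
    return theta

eigenvalues = [1.0, 10.0]   # toy Hessian spectrum, assumed known here
lr_opt = 2.0 / (min(eigenvalues) + max(eigenvalues))

one_step = gradient_descent_quadratic(eigenvalues, [1.0, 1.0], lr_opt, steps=1)
print(one_step)   # both coordinates shrink in magnitude by the same factor, 9/11

diverged = gradient_descent_quadratic(eigenvalues, [1.0, 1.0], lr=0.25, steps=50)
print(diverged)   # lr exceeds 2 / lambda_max, so the stiff coordinate blows up
```

At the optimal step size the two extreme directions contract at the same rate, (λ_max - λ_min)/(λ_max + λ_min), which is why a badly conditioned Hessian means slow convergence no matter how η is chosen.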

Stochastic Gradient Descent

Computing the exact gradient requires evaluating the loss over the full dataset — expensive for millions of examples. SGD (Stochastic Gradient Descent) approximates the gradient using a random mini-batch:

∇̃L ≈ (1/B) Σ_{i∈batch} ∇L(xᵢ, yᵢ; θ)

This is noisy but much faster per step than full-batch gradient descent. Surprisingly, SGD's noise is often beneficial: it acts as regularization, helps escape sharp minima, and is thought to bias optimization toward flatter minima that generalize better (the "flat minima" hypothesis).

Momentum

SGD oscillates in directions of high curvature. Momentum smooths this by accumulating a velocity vector that decays over time:

v ← μv - η∇L
θ ← θ + v

With momentum μ = 0.9, the optimizer accumulates gradients from the past ~10 steps, dampening oscillations and enabling larger effective learning rates. Nesterov momentum computes the gradient slightly ahead of the current position for better theoretical convergence.
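
A small experiment illustrates the effect. The sketch below applies the two-line update above to a toy ill-conditioned quadratic (eigenvalues 1 and 50, an assumption for illustration) and compares it with plain gradient descent at the same learning rate.

```python
def sgd_momentum(grad_fn, theta0, lr, mu, steps):
    """Heavy-ball momentum: v <- mu * v - lr * grad, theta <- theta + v."""
    theta = list(theta0)
    v = [0.0] * len(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        v = [mu * vi - lr * gi for vi, gi in zip(v, g)]
        theta = [th + vi for th, vi in zip(theta, v)]
    return theta

def quadratic_grad(theta):
    """Gradient of 0.5 * (1 * theta_0**2 + 50 * theta_1**2)."""
    return [1.0 * theta[0], 50.0 * theta[1]]

plain = sgd_momentum(quadratic_grad, [1.0, 1.0], lr=0.01, mu=0.0, steps=300)
with_momentum = sgd_momentum(quadratic_grad, [1.0, 1.0], lr=0.01, mu=0.9, steps=300)
print(plain)
print(with_momentum)
```

With μ = 0 the function reduces to vanilla gradient descent, which is still far from the minimum along the flat direction after 300 steps; with μ = 0.9 both coordinates are close to zero.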

Adam and AdamW

Adam adapts the learning rate per parameter based on estimates of first and second moments of the gradient:

m ← β₁m + (1-β₁)g      # first moment (mean)
v ← β₂v + (1-β₂)g²     # second moment (uncentered variance)
m̂ = m/(1-β₁ᵗ)          # bias correction
v̂ = v/(1-β₂ᵗ)          # bias correction
θ ← θ - η m̂/(√v̂ + ε)

Parameters with consistently large gradients get smaller effective learning rates (v̂ grows large). Parameters with small or noisy gradients get larger effective learning rates. This automatic adaptation is why Adam usually works well out of the box, with less learning-rate tuning than SGD needs.

AdamW (Loshchilov & Hutter, 2019) fixes a subtle bug in L2 regularization with Adam: instead of adding the regularization term to the gradient (where Adam's per-parameter scaling shrinks it for weights with large gradients), AdamW applies weight decay directly to the parameters. This produces better regularization and is the default optimizer for most large model training.
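
The update equations translate almost line for line into code. The sketch below is a plain-Python illustration on lists of floats, with the function name `adam_step` and the toy values being assumptions; setting `weight_decay` above zero gives the decoupled AdamW variant, where the decay term is added outside the adaptive scaling.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam update; weight_decay > 0 applies AdamW-style decoupled decay."""
    new_theta, new_m, new_v = [], [], []
    for th, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first moment
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second moment (uncentered)
        m_hat = m_i / (1 - beta1 ** t)             # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        adaptive = m_hat / (math.sqrt(v_hat) + eps)
        # Decay acts on the weight itself and is not divided by sqrt(v_hat).
        th = th - lr * (adaptive + weight_decay * th)
        new_theta.append(th)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

theta = [1.0, -2.0]
grad = [0.5, -100.0]     # very different gradient scales
zeros = [0.0, 0.0]
adam_theta, m, v = adam_step(theta, grad, zeros, zeros, t=1, lr=0.1)
adamw_theta, _, _ = adam_step(theta, grad, zeros, zeros, t=1, lr=0.1, weight_decay=0.01)
print(adam_theta, adamw_theta)
```

On the first step both parameters move by almost exactly the learning rate despite a 200× difference in gradient magnitude, which is the per-parameter normalization at work.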

Beyond the Hype: AI as Operational Reality

Most AI case studies describe spectacular demonstrations. Fewer document the more interesting story: how an organization systematically integrates AI into its actual operations, identifies what works, manages the failures, and builds institutional knowledge about applying AI effectively. Tag1 Consulting, the company behind Scolta, has published extensively about their applied AI journey (tag1.com/blog/tag1-applied-ai/), making them an unusually transparent case study for how a technology consultancy uses AI in the real world.

AI for Software Development

Tag1 uses Claude Code and GitHub Copilot across their development team for code generation, code review, test writing, and documentation. Their published findings (consistent with industry surveys): AI pair programming increases individual developer productivity by 20-40% on routine tasks; the gains are most pronounced for boilerplate, documentation, and test cases; senior developers benefit as much as junior ones, but for different tasks (complex refactoring vs. routine implementation).

They've also noted the risks: AI-generated code requires review; models confidently produce plausible but incorrect implementations; junior developers who rely too heavily on AI may not develop the debugging instincts that come from writing code manually. Their approach: AI is a tool, not a replacement for engineering judgment.

AI for Client Delivery

Scolta itself is Tag1's primary AI product — an AI-enhanced search system for Drupal and other CMS platforms. The development of Scolta was itself an exercise in applying AI: using LLMs to evaluate search quality, generate test queries, and identify gaps in coverage. The result is a product where AI was used throughout the development process as well as in the final product.

Beyond Scolta, Tag1 uses AI for: generating content migration scripts (tedious and error-prone by hand), automated accessibility auditing (LLMs can evaluate WCAG compliance more systematically than manual review), and performance optimization recommendations (static analysis + LLM review catches issues that automated tools miss).

AI for Knowledge Management

Like most professional services firms, Tag1 accumulates institutional knowledge that's hard to transfer. They've experimented with AI-assisted knowledge retrieval: building a private RAG system over their internal documentation, project notes, and code repositories that allows team members to query the organization's accumulated expertise. Early results are promising for onboarding and knowledge retrieval; less clear for truly novel problem solving.

What the Case Study Teaches

Three lessons from Tag1's AI integration journey that generalize beyond their specific context:

  1. Measure before and after: Impressions of productivity gain are not measurements. Tag1 tracks specific metrics (time to complete tasks, defect rates, client satisfaction) before and after AI tool adoption.
  2. The workflow matters as much as the tool: The same LLM produces very different outcomes depending on how it's integrated into existing workflows. Treating AI as a chat interface produces different results than integrating it into specific workflow steps.
  3. Skills compound: Teams that invest in prompt engineering and AI workflow design build compound advantages. The first month of AI adoption is often disappointing; the sixth month, after workflow optimization, is dramatically better.

This article is assigned reading in MLE-401 and as supplementary material in ETH-440 as an example of responsible, measured AI adoption in a professional context.

The Invisible Foundation

Before a large language model processes a single word, it processes tokens. Tokenization — the process of splitting text into chunks a model can work with — is among the most consequential and least-discussed decisions in LLM design. The way a tokenizer carves up language shapes what arithmetic the model can perform, which languages it handles gracefully, whether it can spell, and even how it reasons about code.

Most practitioners use a pre-trained tokenizer without much thought. But if you want to understand why GPT-3 can't reliably count characters, why some models handle Japanese better than others, or why coding models often tokenize differently from chat models, you need to understand how tokenization works and what it constrains.

What a Token Is (and Isn't)

A token is not a word. It is not a character. It is a unit determined by the tokenizer's training process — typically a subword unit that appears frequently enough in the training corpus to earn its own entry in the vocabulary.

In GPT-2's vocabulary (50,257 tokens), common English words like "the", "of", and "and" are single tokens. Less common words are split into multiple tokens: "tokenization" becomes ["token", "ization"]. Very rare words fragment further: "Φωτιά" (Greek for fire) might become 7 or 8 tokens using a tokenizer trained primarily on English text.

The vocabulary size matters enormously. A vocabulary of 32,000 tokens (like Llama 2) fits in 16-bit token IDs and keeps the embedding and output matrices small. A vocabulary of 100,000+ tokens (like some multilingual models) needs larger embedding and output layers but can represent the same content with fewer tokens. The sweet spot depends on the target language distribution.

Byte-Pair Encoding: How Vocabularies Are Built

The most widely used tokenization algorithm is Byte-Pair Encoding (BPE), originally developed for data compression and adapted for NLP in 2016. The algorithm starts with individual bytes (or characters) and iteratively merges the most frequent adjacent pair until the desired vocabulary size is reached.

Starting vocabulary: {h, e, l, o, w, r, d}

After merging the frequent pair "e" + "l": {h, e, l, o, w, r, d, el}

After merging "el" + "l": {h, e, l, o, w, r, d, el, ell}

After merging "ell" + "o": {h, e, l, o, w, r, d, el, ell, ello}

After merging "h" + "ello": {h, e, l, o, w, r, d, el, ell, ello, hello}

The merge rules, applied in order, constitute the tokenizer. Encoding new text means splitting it into base symbols and then applying the merges in the order they were learned, so earlier (more frequent) merges take priority. The Tag1 Feste LLM engine (documented at tag1.com/how-to/, Part 1) includes a detailed Rust implementation of BPE that demonstrates exactly how these merge rules work in practice.
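
A toy version of BPE training and encoding fits in a few dozen lines. The sketch below is illustrative only and is not the Feste implementation: it works on characters rather than bytes, ignores pre-tokenization, and breaks frequency ties lexicographically, all of which are simplifying assumptions.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the concatenated symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of training words."""
    corpus = Counter(tuple(w) for w in words)   # word (as symbol tuple) -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties broken by the pair itself so runs are deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            merged_corpus[tuple(apply_merge(list(symbols), best))] += freq
        corpus = merged_corpus
    return merges

def encode(word, merges):
    """Split into characters, then apply the merges in the order they were learned."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

merges = learn_bpe(["hello", "hello", "hello", "hell", "world"], num_merges=4)
print(merges)
print(encode("hello", merges), encode("help", merges), encode("shell", merges))
```

Words seen often in training collapse into a single token, while unseen words such as "help" or "shell" fall back to whatever fragments the learned merges happen to cover.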

The Fertility Problem

Tokenizer "fertility" measures how many tokens a given amount of text requires. For English text, a well-trained BPE tokenizer typically produces about 4 characters per token. For Spanish, it's similar. For Arabic, Hebrew, or Chinese, the ratio is often much worse — 1.5–2.5 characters per token — because these languages are underrepresented in training corpora, so their subwords don't appear frequently enough to earn their own vocabulary entries.

High fertility has three downstream consequences: higher inference cost (more tokens = more compute), shorter effective context length (a context window that holds 4,000 English words might hold only 2,000 Arabic words), and worse model performance (the model is asked to do more with less information per token).

This is a concrete way that the choice of tokenizer encodes a choice about which languages get high-quality service. Multilingual models like mBERT, XLM-R, and the Llama 3 family use expanded vocabularies and balanced training corpora to improve fertility across languages.
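
Measuring fertility on your own content takes only a tokenizer and a ratio. The sketch below uses a deliberately crude stand-in, one token per UTF-8 byte with no merges learned, to show the floor that underrepresented scripts fall back toward; `byte_tokenize` and `chars_per_token` are hypothetical helper names, and a real measurement would pass in the production tokenizer instead.

```python
def chars_per_token(text, tokenize):
    """Average number of characters covered by each token (higher is cheaper)."""
    tokens = tokenize(text)
    return len(text) / len(tokens)

def byte_tokenize(text):
    """Worst-case byte-level tokenizer: one token per UTF-8 byte, no merges."""
    return list(text.encode("utf-8"))

for sample in ["fire", "Φωτιά", "火"]:
    n_tokens = len(byte_tokenize(sample))
    print(sample, n_tokens, chars_per_token(sample, byte_tokenize))
```

ASCII text costs one byte per character, Greek letters two, and most Chinese characters three, so without learned merges for a script the same meaning consumes several times more tokens.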

Tokenization and Arithmetic

LLMs famously struggle with arithmetic. Part of why is tokenization. Consider: "1234" might be tokenized as a single token. "12345" might be tokenized as ["123", "45"]. "999 + 1" might tokenize as ["999", " +", " 1"]. The model has no way to reliably "see" the individual digits that arithmetic requires unless they're separate tokens.

Models trained heavily on code (like DeepSeek Coder) often tokenize numbers digit-by-digit for exactly this reason. Language-focused models may not. The practical takeaway: if you're building a system that requires arithmetic reasoning, examine how your tokenizer handles numbers.

Tokenization and Spelling

Ask GPT-3 to count the letters in "strawberry" and it will often get it wrong. Why? Because "strawberry" is typically encoded as one or a few subword tokens, and the model never "sees" the individual letters during processing. It learns statistical patterns about what follows "strawberry" in text, not what letters compose it. For the same reason, LLMs struggle to reverse strings, identify the third letter of a word, or perform other character-level operations: these tasks require access to information that tokenization hides.

This is a fundamental limitation, not a quirk. It explains why newer models that tokenize at the character or byte level, or that use special tokenization schemes for letter-level tasks, perform dramatically better on these benchmarks.

SentencePiece and WordPiece

BPE is not the only game in town. SentencePiece (used by T5, ALBERT, and many multilingual models) implements BPE and Unigram Language Model tokenization in a single framework. It treats the input as a raw stream of Unicode characters, with whitespace kept as an ordinary symbol (written ▁), avoiding the need for language-specific pre-tokenization rules. This makes it more portable across writing systems, including those that do not separate words with spaces.

WordPiece (used in BERT and early Google models) is similar to BPE but uses a likelihood-based merge criterion rather than frequency-based. In practice, the differences in downstream performance are modest for English but can be significant for morphologically complex languages.

Practical Implications for Engineers

If you're building with LLMs, the tokenizer matters in several practical ways:

  • Cost: Token count = API cost. For multilingual use cases, measure the token-to-character ratio in your actual content. It may be much higher than you expect.
  • Context: A 128K context window's effective capacity depends heavily on language. Budget accordingly.
  • Prompt design: Some prompt patterns require token efficiency. Others need specific tokens to be atomic (not split). Understanding your tokenizer lets you optimize for both.
  • Special tokens: Every model has special tokens ([CLS], [SEP], <|endoftext|>, <|im_start|>, etc.) that affect how prompts are parsed. Know your tokenizer's special tokens before building production systems.

Further Reading

For a complete implementation walkthrough, see the Feste blog series at tag1.com/how-to/ (Part 1 covers tokenizer implementation in Rust). For the original SentencePiece paper, see Kudo & Richardson (2018). For multilingual tokenization analysis, see the XTREME benchmark paper and analyses of fertility across languages.

Why Information Theory Matters for ML

Shannon's information theory, developed in 1948 for communications engineering, turns out to be deeply connected to machine learning. Cross-entropy — the standard loss function for classification — is an information-theoretic concept. The KL divergence that appears in VAEs, RL, and alignment training is from information theory. The mutual information used to analyze what neural networks learn is from information theory. Understanding these connections gives you a much deeper understanding of why standard ML methods work.

Entropy: Measuring Uncertainty

Shannon entropy measures the average information content of a random variable:

H(X) = -Σ_x p(x) log p(x)

High entropy = high uncertainty. With the logarithm taken base 2, entropy is measured in bits: a fair coin has entropy 1 bit, a two-headed coin has entropy 0 (no uncertainty), and a uniform distribution over 256 outcomes has entropy 8 bits.

In ML: model uncertainty (how uncertain is the model about this prediction?) is related to entropy of the output distribution. Entropy maximization in SAC (reinforcement learning) explicitly rewards high-entropy policies. The entropy of the training data distribution affects how hard the learning problem is.
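
The three coin-and-byte examples can be reproduced directly from the definition. This is a minimal sketch; the function name `entropy_bits` is an assumption, and zero-probability outcomes are skipped by the usual convention that 0 · log 0 = 0.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: H = -sum p * log2(p), skipping zero-probability outcomes."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))        # fair coin
print(entropy_bits([1.0, 0.0]))        # two-headed coin
print(entropy_bits([1 / 256] * 256))   # uniform over 256 outcomes
print(entropy_bits([0.9, 0.1]))        # biased coin: less than 1 bit
```

Applied to a classifier's softmax output, the same function gives a simple per-prediction uncertainty score.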

Cross-Entropy: The Standard Loss Function

Cross-entropy H(p, q) measures the expected log-loss when using distribution q to encode events drawn from distribution p:

H(p, q) = -Σ_x p(x) log q(x)

In classification: p is the true distribution (one-hot label), q is the model's predicted distribution. Minimizing cross-entropy is exactly maximum likelihood estimation — you're finding model parameters that make the training data most likely. Cross-entropy = entropy + KL divergence: H(p, q) = H(p) + KL(p||q).

KL Divergence: Measuring Distribution Difference

KL divergence (Kullback-Leibler divergence) measures how much one distribution differs from another:

KL(p||q) = Σ_x p(x) log(p(x)/q(x))

It's not symmetric (KL(p||q) ≠ KL(q||p)) and can be interpreted as the extra bits needed to encode samples from p when using an optimal code for q. KL appears everywhere in ML: RLHF/PPO uses KL as a constraint between the policy being trained and the reference policy; VAEs minimize KL between the posterior and the prior; information bottleneck theory uses KL to formalize compression.
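
The identity H(p, q) = H(p) + KL(p||q) and the asymmetry of KL are both easy to verify numerically. The sketch below uses two arbitrary toy distributions and assumes q is nonzero wherever p is, since otherwise the divergence is infinite.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) log2 q(x), in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum p(x) log2(p(x) / q(x)), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(p) is the cross-entropy of p with itself."""
    return cross_entropy(p, p)

p = [0.7, 0.2, 0.1]   # "true" distribution (toy values)
q = [0.5, 0.3, 0.2]   # model distribution (toy values)

print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))  # the two agree
print(kl_divergence(p, q), kl_divergence(q, p))               # the two differ

one_hot = [1.0, 0.0, 0.0]
print(cross_entropy(one_hot, q))  # equals -log2 q(correct class)
```

With a one-hot label H(p) is zero, so cross-entropy and KL coincide and both reduce to the negative log-probability the model assigns to the correct class.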

Mutual Information: What's Shared Between Variables

Mutual information measures how much information X and Y share:

I(X;Y) = KL(p(x,y) || p(x)p(y)) = H(X) - H(X|Y)

High mutual information = knowing Y tells you a lot about X. Research on "what neural networks learn" often uses mutual information between representations and input/output: good representations should have high I(representation; label) and low I(representation; spurious features). The information bottleneck principle proposes that good representations compress away irrelevant information while retaining task-relevant information.
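
For discrete variables, mutual information can be computed straight from a joint probability table. The sketch below is a minimal illustration with two extreme toy cases; the function name and the table layout `joint[x][y]` are assumptions.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a joint table joint[x][y], via KL(p(x,y) || p(x)p(y))."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

independent = [[0.25, 0.25], [0.25, 0.25]]   # X and Y are independent fair coins
copied = [[0.5, 0.0], [0.0, 0.5]]            # Y is an exact copy of X

print(mutual_information(independent))
print(mutual_information(copied))
```

Independent variables share nothing, while a perfect copy shares the full entropy of X (1 bit here). Estimating the same quantity for high-dimensional neural activations is much harder and usually relies on approximations.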

The Opacity Problem

Modern neural networks are "black boxes": they map inputs to outputs through billions of learned parameters in ways that resist human interpretation. A neural network that denies a loan application doesn't "explain" its decision — it produces a number. This opacity creates problems for trust, accountability, debugging, and safety: we can measure a model's average accuracy but don't understand when and why it will fail.

Interpretability research seeks to understand what AI systems are doing internally — not just what outputs they produce, but what computations they perform, what concepts they represent, and why they make specific predictions.

Post-Hoc Explanations

Post-hoc explanation methods explain individual predictions without modifying the model:

  • LIME: Fits a locally linear model around a specific prediction. Perturb the input slightly, see how the output changes, build an interpretable proxy.
  • SHAP: Distributes a prediction's value across features based on Shapley values from cooperative game theory. More principled than LIME but computationally expensive for large models.
  • Gradient-based attribution: The gradient of the output with respect to each input feature measures how sensitive the prediction is to that feature. Grad-CAM for images produces saliency maps highlighting which image regions the model relies on.

Caveat: post-hoc explanations often describe a simpler proxy or a local approximation, not the original neural network itself. They can be misleading, particularly for features that are correlated or for models with non-linear feature interactions.

Mechanistic Interpretability

Mechanistic interpretability attempts to reverse-engineer neural networks: identify specific circuits (groups of neurons and weights) that implement specific algorithms. This is harder than post-hoc explanation but more truthful: you're actually understanding the model's internal computation.

Key results: Elhage et al. identified "induction heads" in transformer models: circuits that support in-context learning by finding and copying patterns. Multiple groups have identified circuits for addition, indirect object identification, and simple factual recall in language models. The field has developed tools like activation patching (replace activations in one run with those from another run and see which predictions change), probing classifiers (train a linear probe to decode specific features from activations), and logit attribution (identify which attention heads most influence specific output tokens).

Superposition and the Limits of Current Methods

A major challenge: neurons are polysemantic — individual neurons respond to multiple unrelated concepts (a single neuron might respond to both "European royalty" and "curry"). This "superposition" suggests that concepts are represented in high-dimensional activation spaces rather than individual neurons, making per-neuron analysis incomplete. Sparse autoencoders (SAE) have recently shown promise for disentangling superposed features into more interpretable "dictionary elements." This is an active research area with significant implications for AI safety.

The Parameter Explosion Problem

A 7B parameter model has 7 billion parameters. Fine-tuning all of them requires storing 7B gradients plus optimizer states; Adam keeps two extra values (first and second moments) for each parameter. Counting weights, gradients, and both moments at 16-bit precision, full fine-tuning of a 70B model needs roughly 560GB of GPU memory before activations, and mixed-precision setups that keep 32-bit optimizer state need considerably more. Most researchers and companies don't have 8 H100s dedicated to a single training job.

Parameter-efficient fine-tuning (PEFT) methods adapt large models to new tasks by training only a small fraction of their parameters. LoRA and QLoRA are the dominant PEFT methods, responsible for the explosion of task-specific open-source model adaptations available on HuggingFace.

LoRA: Low-Rank Adaptation

LoRA (Hu et al., 2021) is based on a key observation: fine-tuning updates to a large weight matrix W tend to be low-rank. If the update ΔW has rank r ≪ min(d, k), we can decompose it:

ΔW = BA   where B ∈ R^(d×r), A ∈ R^(r×k)

Instead of updating the full W (d×k parameters), we train only B and A (d·r + r·k parameters). For a 4096×4096 weight matrix with r=8, LoRA reduces the trainable parameters from 16.7M to 65K — a 256× reduction.

During training, the original weights W are frozen; only A and B are trained. During inference, the adapted weight is W + BA (or equivalently, BA can be absorbed into W). At typical ranks (r=4 to r=64), LoRA recovers 90-99% of full fine-tuning quality on most tasks.

Practical LoRA Configuration

Key hyperparameters:

  • Rank (r): Lower rank = fewer parameters = less capacity. r=8 works for most tasks; use r=32 or r=64 for complex reasoning. Start with r=16.
  • Alpha (α): Scaling factor for the LoRA update. The effective scaling is α/r. Common convention: set α = 2r (so effective scale = 2).
  • Target modules: Which matrices to apply LoRA to. Typically query and value projections in attention layers; sometimes all linear layers including MLP.
  • Dropout: Applied to the input of the LoRA branch (before A) for regularization. A common value is 0.05.
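
The parameter arithmetic and the role of the α/r scaling can be checked with a tiny pure-Python sketch. The matrices below are toy values chosen for illustration (d=3, k=2, r=1), and the helper names are assumptions; a real setup would use a library such as peft rather than hand-rolled list arithmetic.

```python
def matvec(M, x):
    """Matrix-vector product on nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def matmul(A, B):
    """Matrix-matrix product on nested lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def lora_param_counts(d, k, r):
    """(parameters in the full d x k matrix, parameters in B and A combined)."""
    return d * k, d * r + r * k

def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W is frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def merge_lora(W, A, B, alpha, r):
    """Fold the adapter into the base weight: W' = W + (alpha / r) * B A."""
    BA = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

print(lora_param_counts(4096, 4096, 8))

W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # frozen base weight, d=3, k=2
A = [[0.5, -0.5]]                          # r x k
B = [[1.0], [2.0], [0.0]]                  # d x r
x = [2.0, 1.0]

y_adapter = lora_forward(W, A, B, x, alpha=2, r=1)
y_merged = matvec(merge_lora(W, A, B, alpha=2, r=1), x)
print(y_adapter, y_merged)
```

The adapter path and the merged weight give the same output, which is why a trained LoRA can be folded into W for deployment with no extra inference cost.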

QLoRA: Adding Quantization

QLoRA (Dettmers et al., 2023) combines LoRA with 4-bit quantization of the base model. The insight: during fine-tuning, we only need the base model to compute activations and gradients. If we store it in 4-bit (NF4) format, we dramatically reduce memory — and LoRA adapters are still trained in 16-bit, so gradient quality is preserved.

QLoRA memory requirements: a 7B model in 4-bit requires ~4GB. A 70B model in 4-bit requires ~40GB. The LoRA adapters (small, in 16-bit) and their gradients and optimizer state typically add another 1-4GB. This means a 70B model can be fine-tuned on 2 A100s rather than 16.

QLoRA quality is usually within 1-2% of full-precision LoRA. For most applications, this is acceptable. It has made 70B-scale fine-tuning accessible to individual researchers and small companies.
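
The weight-storage part of these figures is simple arithmetic: parameters times bits per parameter. The sketch below computes only raw weight storage; the slightly higher numbers quoted above presumably leave room for quantization metadata and other overhead, and activations are not counted at all.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Raw weight storage in gigabytes (1 GB = 1e9 bytes), ignoring all overhead."""
    return num_params * bits_per_param / 8 / 1e9

for name, params in [("7B", 7e9), ("70B", 70e9)]:
    print(name,
          "16-bit:", weight_memory_gb(params, 16), "GB",
          "4-bit:", weight_memory_gb(params, 4), "GB")
```

Going from 16-bit to 4-bit cuts base-model storage by 4×, which is what brings a 70B model within reach of a small number of GPUs.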

Practical Recommendations

  • Use QLoRA for 7B–70B models when you have limited GPU memory.
  • Use full LoRA (no quantization) for 1B–7B models where memory permits.
  • Use full fine-tuning only when you need to modify all model behavior (large domain adaptation, structural changes).
  • The peft library (HuggingFace) provides production-ready LoRA/QLoRA implementations.
  • Monitor training loss curves: LoRA training should converge similarly to full fine-tuning; significantly worse convergence suggests the rank is too low or the learning rate too high.

The Scaling Dilemma

Making LLMs better by making them bigger works — up to a point. Doubling the number of parameters roughly doubles the compute required for every forward pass. At some point, you want a bigger model but can't afford the inference cost. Mixture of Experts (MoE) architecture offers a way out: a model with many more parameters than a dense model, but with only a fraction of those parameters active during any given computation.

The MoE Architecture

In a standard (dense) transformer, every token passes through every layer, activating all parameters. In an MoE transformer, each feed-forward layer is replaced with multiple "expert" networks plus a learned "router" that selects which experts to activate for each token:

MoE(x) = Σ_i gate(x)_i · Expert_i(x)

Typically only 2 of K experts are activated per token (top-2 routing). So a model with 8 experts activates 2 for each token: 25% of the expert capacity. The total parameters include all experts, but the feed-forward FLOPs per token are roughly those of a dense model with two experts' worth of parameters.
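
Top-2 routing can be sketched with a linear router and toy experts. Everything here is illustrative: the experts are simple scaling functions rather than feed-forward networks, and renormalizing the gate weights with a softmax over only the selected experts is one common convention that is assumed here, not something the text specifies.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_weights, experts, top_k=2):
    """Route x to the top_k experts by router logit and mix their outputs."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in top])   # gate weights over selected experts only
    out = [0.0] * len(x)
    for gate, i in zip(gates, top):
        y = experts[i](x)                       # only the selected experts are evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top

def make_expert(scale):
    """Toy expert that just scales its input (stand-in for a feed-forward block)."""
    return lambda x: [scale * xi for xi in x]

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.1, 0.0], [2.0, 0.0], [1.0, 0.0], [3.0, 0.0]]   # one row per expert

out, chosen = moe_forward([1.0, 0.0], router_weights, experts)
print(chosen, out)
```

The two unselected experts are never called, which is where the compute saving comes from: their parameters exist but cost nothing for this token.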

Why This Helps

MoE separates two things that dense models conflate: the total knowledge stored in parameters (which scales with total parameters) and the compute required per forward pass (which scales with active parameters). With 8 experts and top-2 routing, an MoE layer stores 8 experts' worth of feed-forward parameters while spending the per-token compute of only 2, roughly a quarter of its expert capacity.

Mixtral 8×7B (Mistral AI, December 2023) demonstrated this concretely: with 46.7B total parameters but only ~12.9B active per forward pass, it matched or exceeded LLaMA-2 70B on most benchmarks at roughly the same inference cost as a 13B dense model.

Training MoE Models

MoE training requires careful engineering. The most important challenge is load balancing: the router tends to favor a small number of experts and ignore others ("expert collapse"), resulting in wasted capacity. Solutions include auxiliary load balancing losses that penalize uneven expert utilization, and random expert perturbation during early training. Expert buffer overflow (more tokens routed to an expert than its capacity allows) requires additional handling.

DeepSeek's MoE design innovations in V2 and V3 (shared experts, fine-grained expert segmentation, and, in V3, auxiliary-loss-free load balancing) are among the most influential recent advances in MoE training stability and efficiency.

When MoE Makes Sense

MoE is compelling for large models that need to be deployed at scale. For smaller models or low-throughput applications, the engineering complexity may not be worth the efficiency gains. The architecture also requires specialized serving infrastructure: load balancing across experts, careful batching to amortize the overhead of expert selection. For high-throughput applications with large models, MoE is increasingly the architecture of choice.

The 90% Problem

Getting a machine learning model to work in a Jupyter notebook is relatively easy. Getting it into production, keeping it running reliably, and maintaining it over time is where most AI projects fail. Widely cited industry surveys put the share of ML models that never reach production at 80-90%, and many of those that do degrade within months because of data drift, infrastructure issues, or organizational problems.

MLOps (Machine Learning Operations) is the discipline that closes this gap — applying DevOps principles and practices to the ML lifecycle to build systems that actually work in production.

The Full ML Lifecycle

Production ML involves far more than model training:

  1. Data collection and versioning: Data changes over time. Without versioning (DVC, LakeFS), you can't reproduce results or debug regressions.
  2. Feature engineering: Feature stores (Feast, Tecton) provide consistent feature computation between training and serving; inconsistent computation is a common source of training-serving skew.
  3. Training infrastructure: Experiment tracking (MLflow, Weights & Biases), distributed training, GPU resource management.
  4. Model registry: Version and stage management for trained models; audit trail of what was deployed when.
  5. Serving infrastructure: Model servers, autoscaling, load balancing, A/B testing.
  6. Monitoring: Data drift detection, model performance monitoring, alerting.
  7. Retraining: Automated triggering of retraining when drift is detected; CI/CD for ML.

Training-Serving Skew: The Silent Killer

Training-serving skew, where the data the model sees at serving time differs from what it saw in training, is one of the most common production ML failure modes. It occurs when features are computed differently in training and serving (timestamp handling, categorical encoding, scaling), when training data is not representative of production data, or when the production data distribution shifts over time.

Preventing skew requires: using the same feature computation code in training and serving (feature stores solve this), testing on held-out data from the production distribution, and monitoring feature distributions in production.
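
One way to monitor a feature distribution in production is the Population Stability Index (PSI), which compares binned frequencies of a training sample against a serving sample. The source does not name a metric, so PSI, the equal-width binning, and the function name `psi` are assumptions chosen for a short sketch.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a training sample (expected)
    and a serving sample (actual) of one numeric feature."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins taken from the training range; fall back to 1.0
    # if the training sample is constant.
    width = (hi - lo) / n_bins or 1.0

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # eps avoids log(0) for empty bins.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


train_sample = [i / 100 for i in range(100)]
shifted_sample = [v + 0.3 for v in train_sample]
```

`psi(train_sample, train_sample)` is zero, while `psi(train_sample, shifted_sample)` is clearly positive; an alert threshold would have to be tuned per feature.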

Experiment Tracking

Every training run should be tracked: hyperparameters, dataset version, training code version, evaluation metrics, system metrics (GPU utilization, memory, time). MLflow and Weights & Biases are two widely used tools for this. Without tracking, you can't reproduce previous results, compare experiments, or understand why model performance changed.
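
To show what a tracked run contains, here is a standard-library sketch that builds one record per run and appends it to a JSON Lines file. It is not the MLflow or Weights & Biases API; `make_run_record`, `append_run`, and the field names are hypothetical.

```python
import hashlib
import json
import time


def make_run_record(params, metrics, dataset_version, code_version):
    """Bundle everything needed to reproduce and compare a training run."""
    config = {
        "params": params,
        "dataset_version": dataset_version,
        "code_version": code_version,
    }
    # Stable ID: identical configurations hash to the same value,
    # regardless of dictionary key order.
    config_json = json.dumps(config, sort_keys=True)
    config_id = hashlib.sha256(config_json.encode("utf-8")).hexdigest()[:12]
    return {
        "config_id": config_id,
        "logged_at": time.time(),
        **config,
        "metrics": metrics,
    }


def append_run(path, record):
    """Append one run as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
```

Grouping records by `config_id` gives a quick view of run-to-run variance for an otherwise identical configuration, which dedicated tracking tools provide with far richer tooling.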

CI/CD for ML

Automated testing for ML pipelines must handle the stochastic nature of training: model performance will vary between runs. A good ML CI/CD pipeline runs automated tests on small data to detect code regressions, uses statistical tests to determine whether metric changes are significant, and automates deployment when all tests pass and performance meets thresholds. The Meridian AI MS in AI Engineering covers complete CI/CD pipeline implementation as part of MLE-440.
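
A statistical gate of this kind can be sketched with a two-sided permutation test on a metric collected from several seeded runs of the baseline and the candidate. The choice of a permutation test, the function name, and the defaults are assumptions; a pipeline might equally use a t-test or bootstrap intervals.

```python
import random
from statistics import mean


def permutation_p_value(baseline, candidate, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in mean metric.

    baseline, candidate: metric values from repeated runs (e.g. different seeds).
    Returns the estimated probability of a difference at least this large
    if both sets of runs came from the same distribution."""
    rng = random.Random(seed)  # fixed seed keeps the CI check deterministic
    observed = abs(mean(candidate) - mean(baseline))
    pooled = list(baseline) + list(candidate)
    n = len(baseline)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n:]) - mean(pooled[:n]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)
```

A pipeline could block deployment when the candidate's mean metric is lower and the p-value falls below a chosen threshold, and treat small differences with a large p-value as run-to-run noise.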

Beyond Single-Modality AI

Human cognition is inherently multimodal: we process visual, auditory, and textual information simultaneously, combining them to understand context that no single modality could provide alone. AI systems are increasingly following this pattern — not through a single grand unified model, but through increasingly tight integration of vision, language, and audio processing.

What "Multimodal" Means in Practice

In current AI systems, "multimodal" usually means at least one of:

  • Vision + Language: Models that take both images and text as input and respond in text. GPT-4V, Claude 3, and Gemini 1.5 can all read images, answer questions about them, analyze charts, and describe scenes.
  • Text-to-Image: Generating images from text descriptions. Stable Diffusion, DALL-E 3, FLUX, Midjourney.
  • Speech + Language: Speech recognition (Whisper), text-to-speech (ElevenLabs, OpenAI TTS), and real-time voice conversation (GPT-4o voice mode).
  • Video Understanding: Analyzing video content, describing actions, answering questions about scenes in video.

Architecture Approaches

The dominant architecture for language models that understand images uses a projection layer between a vision encoder (usually CLIP-family or DINOv2) and a language model. The vision encoder converts the image to patch embeddings; the projection layer aligns these to the language model's embedding space; the language model processes them as additional tokens. This "visual tokens" approach is used in LLaVA and its variants; InstructBLIP uses a related design in which a Q-Former compresses the patch embeddings into a small set of query tokens before projection. The architecture of GPT-4V has not been published.
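
The data flow can be sketched in plain Python with toy dimensions. The single linear layer, the choice to place visual tokens before the text tokens, and the names `project` and `build_input_sequence` are assumptions for illustration; real systems use learned weights and often a small MLP instead.

```python
def project(patch_embeddings, weight, bias):
    """Map each patch embedding from the vision width d_v to the
    language-model width d_lm with one linear layer.

    patch_embeddings: n_patches x d_v
    weight:           d_v x d_lm
    bias:             d_lm"""
    d_lm = len(bias)
    return [
        [sum(p[i] * weight[i][j] for i in range(len(p))) + bias[j]
         for j in range(d_lm)]
        for p in patch_embeddings
    ]


def build_input_sequence(patch_embeddings, text_embeddings, weight, bias):
    """Visual tokens are prepended to the text token embeddings, so the
    language model sees one sequence of d_lm-wide vectors."""
    visual_tokens = project(patch_embeddings, weight, bias)
    return visual_tokens + text_embeddings


# Toy example: 3 patches of width 2, projected to width 3, plus 2 text tokens.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [0.5, 0.5, 0.5]
text = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
sequence = build_input_sequence(patches, text, W, b)
```

The resulting sequence has five vectors of width 3; from the language model's point of view the first three are simply extra tokens.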

More tightly integrated architectures train on visual and text data together from the beginning rather than adapting a pre-trained language model, which can enable better integration of visual and linguistic reasoning. Google describes Gemini as natively multimodal in this sense; the architecture behind Claude 3's vision has not been published.

What Multimodal Models Actually Do Well

Current multimodal LLMs excel at: describing image content, answering factual questions about images, reading text in images (OCR), analyzing charts and diagrams (with some errors), identifying objects and spatial relationships, and understanding document layouts. They struggle with: precise counting and measurement, detailed spatial reasoning, identifying subtle visual differences, and understanding novel image types outside their training distribution.

Emerging Capabilities

The frontier is moving rapidly. GPT-4o, Gemini 2.0 Flash, and Llama 3.2 Vision demonstrate increasingly sophisticated visual reasoning. Audio-visual models (processing video with sound together) are emerging. Real-time vision for robotics is an active area. The combination of better vision encoders, larger language models, and richer multimodal training data continues to push capabilities forward.

The Object Detection Problem

Object detection is the task of identifying all objects of interest in an image and drawing bounding boxes around them. It's one of the most practically important computer vision tasks: it underlies autonomous driving, security systems, medical imaging, inventory management, and countless other applications. The field has undergone several architectural revolutions since 2014.

Two-Stage Detectors: Precision First

Two-stage detectors (Faster R-CNN, Mask R-CNN) first propose candidate object regions, then classify and refine them. Stage 1: a Region Proposal Network (RPN) slides over the feature maps and scores a set of anchor boxes at each location to generate candidate bounding boxes. Stage 2: an RoI pooling operation (RoIAlign in Mask R-CNN) extracts a fixed-size feature for each proposal, which is then classified and box-refined. This approach is highly accurate but slower due to the two-stage pipeline.

YOLO: Real-Time Detection

YOLO (You Only Look Once) prioritizes speed by treating detection as a single regression problem. The image is divided into an S×S grid; each grid cell predicts B bounding boxes and class probabilities. This single-pass approach achieves real-time detection speeds (30-100+ FPS) that two-stage detectors can't match. The tradeoff: historically lower accuracy, especially on small objects.
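
The grid formulation can be made concrete with a little arithmetic. The sketch below assumes the original YOLO layout, in which each box carries 5 values (x, y, w, h, confidence) and class probabilities are predicted once per cell; later YOLO versions use different heads. The function names are illustrative.

```python
def yolo_output_shape(S, B, C):
    """Shape of the prediction tensor for an S x S grid, B boxes per cell
    and C classes, assuming 5 values per box and per-cell class scores."""
    return (S, S, B * 5 + C)


def responsible_cell(cx, cy, S):
    """Grid cell (row, col) containing an object's center.

    cx, cy are the box center, normalized to [0, 1] by image width/height."""
    col = min(int(cx * S), S - 1)  # clamp centers lying exactly on the edge
    row = min(int(cy * S), S - 1)
    return row, col


shape = yolo_output_shape(S=7, B=2, C=20)
cell = responsible_cell(0.5, 0.5, S=7)
```

With S=7, B=2, and C=20 the whole image is described by a single 7×7×30 tensor produced in one forward pass, and an object centered in the image is assigned to cell (3, 3).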

YOLOv8 (Ultralytics, 2023) and YOLOv9/v10/v11 have dramatically narrowed the accuracy gap with two-stage detectors while maintaining excellent speed. On COCO, YOLOv9 achieves 55.6% mAP at 53.6 FPS on an A100 — competitive with much slower approaches.

DETR: Detection as Set Prediction

DETR (DEtection TRansformer, 2020) reformulated object detection as a set prediction problem solved with a transformer encoder-decoder. No anchor boxes, no non-maximum suppression, no hand-crafted components: the model directly outputs a fixed-size set of predictions, and training uses a one-to-one bipartite matching between predictions and ground-truth objects so that duplicate predictions are penalized rather than filtered out afterwards. The simplicity is appealing, but training is expensive and slow to converge.
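
Set prediction depends on assigning each ground-truth object to exactly one prediction before the loss is computed. The sketch below finds that assignment by brute force over permutations with an L1 box cost only; this is a simplification for tiny examples (DETR uses the Hungarian algorithm with a cost that also includes class and overlap terms), and the function names are hypothetical.

```python
from itertools import permutations


def l1_distance(box_a, box_b):
    """L1 distance between two boxes given as (cx, cy, w, h)."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match_predictions(pred_boxes, gt_boxes):
    """One-to-one matching that minimizes total L1 cost.

    Returns (matches, unmatched) where matches[g] is the index of the
    prediction assigned to ground-truth g, and unmatched lists the
    predictions that should be trained to output "no object".
    Brute force: only suitable for a handful of boxes."""
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(l1_distance(pred_boxes[p], gt_boxes[g])
                   for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    matched = set(best_perm)
    unmatched = [i for i in range(len(pred_boxes)) if i not in matched]
    return list(best_perm), unmatched


preds = [[0.9, 0.9, 0.1, 0.1], [0.1, 0.1, 0.2, 0.2], [0.5, 0.5, 0.3, 0.3]]
gts = [[0.1, 0.1, 0.2, 0.2], [0.5, 0.5, 0.3, 0.3]]
matches, unmatched = match_predictions(preds, gts)
```

Because each ground-truth box is matched to a single prediction, a second prediction covering the same object receives the "no object" target, which is why no suppression step is needed at inference.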

DETR variants (Deformable DETR, DINO-DETR, RT-DETR) have addressed the training speed problem and significantly improved performance. RT-DETR (Real-Time DETR) reports 53.1% mAP on COCO at 108 FPS on a T4 GPU for its ResNet-50 variant, showing that transformer-based detection can be both accurate and fast.

Open-Vocabulary and Grounding Models

The frontier in 2026 is open-vocabulary detection: detecting objects described in natural language rather than fixed category lists. Grounding DINO, Grounding DINO 1.5, and similar models can detect "the person wearing a red hat" or "the broken circuit board" without any category-specific training. This enables new applications like instruction-following visual agents and zero-shot quality control systems.