Mixture of Experts: How Sparse Models Scale

The Scaling Dilemma

Making LLMs better by making them bigger works, up to a point. In a dense model, doubling the number of parameters roughly doubles the compute required for every forward pass, so eventually you want a bigger model but cannot afford its inference cost. The Mixture of Experts (MoE) architecture offers a way out: a model with far more parameters than a dense model of comparable cost, but with only a fraction of those parameters active for any given token.

The MoE Architecture

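In a standard (dense) transformer, every token passes through every layer, activating all parameters. In an MoE transformer, the feed-forward block in some or all layers is replaced with several "expert" networks, each itself a feed-forward network, plus a learned "router" that selects which experts to activate for each token. The attention layers are left unchanged and are shared by all tokens. The output of an MoE layer is a gate-weighted sum of expert outputs: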

MoE(x) = Σ_i gate(x)_i · Expert_i(x)

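The gate is sparse: gate(x)_i is zero for every expert the router does not select, so those experts are never evaluated. A common choice is top-2 routing, in which only 2 of the K experts run for each token. A model with 8 experts per layer therefore uses 2 of them per token, or 25% of the expert parameters. The total parameter count includes all experts, but the feed-forward FLOPs per token are roughly those of a dense model with two experts' worth of feed-forward parameters, plus a small cost for the router.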
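The routing step can be sketched in a few lines of plain Python. This is a minimal illustration rather than any particular model's implementation: the names `top_k_gate` and `moe_forward`, the toy experts, and the router scores are all made up for the example, and renormalizing the selected gate weights so they sum to 1 is one common convention, assumed here. In a real model the router scores would come from a learned linear layer applied to x.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(router_logits, k=2):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the chosen experts
    return {i: probs[i] / norm for i in chosen}

def moe_forward(x, experts, router_logits, k=2):
    """Gate-weighted sum of the selected experts' outputs; the rest never run."""
    gates = top_k_gate(router_logits, k)
    out = [0.0] * len(x)
    for i, weight in gates.items():
        y = experts[i](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out, gates

def make_expert(scale):
    # Toy stand-in for a feed-forward network: scales its input.
    return lambda x: [scale * v for v in x]

experts = [make_expert(i + 1) for i in range(8)]
router_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]  # hypothetical scores
output, gates = moe_forward([1.0, -2.0], experts, router_logits)
print(sorted(gates))  # indices of the two experts that actually ran
```

Only the two selected experts are called, which is where the compute saving comes from: the other six contribute parameters but no FLOPs for this token.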

Why This Helps

MoE separates two things that dense models conflate: the capacity to store knowledge (which scales with total parameters) and the compute required per forward pass (which scales with active parameters). With 8 experts and top-2 routing, the expert layers hold 4× the parameters that any single token uses, so the model has the capacity of a much larger dense model at the per-token compute of a much smaller one. The saving is in compute, not memory: every expert must still be loaded to serve the model.

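Mixtral 8×7B (Mistral AI, December 2023) demonstrated this concretely. It has 46.7B total parameters but only ~12.9B active per token. The total is less than 8 × 7B = 56B because only the feed-forward blocks are replicated per expert, while the attention layers and embeddings are shared. Mistral AI reported that it matched or exceeded Llama 2 70B on most benchmarks at roughly the per-token compute of a 13B dense model.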
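The two reported figures are enough to estimate how the parameters divide between shared components and experts. The sketch below assumes a simplified accounting that is not taken from Mistral AI's documentation: total = shared + 8 × per-expert and active = shared + 2 × per-expert, where "per-expert" means one expert's feed-forward weights summed over all layers and "shared" covers everything else. The helper `split_params` is hypothetical, and because the inputs are rounded the results are approximate.

```python
def split_params(total_b, active_b, n_experts, k_active):
    """Solve total = shared + n*expert and active = shared + k*expert.

    All quantities are in billions of parameters.
    """
    per_expert = (total_b - active_b) / (n_experts - k_active)
    shared = active_b - k_active * per_expert
    return shared, per_expert

# Figures quoted in the text; the two-equation model is an assumption.
shared_b, per_expert_b = split_params(46.7, 12.9, n_experts=8, k_active=2)
active_fraction = 12.9 / 46.7

print(f"shared: {shared_b:.2f}B, per expert: {per_expert_b:.2f}B")
print(f"active fraction of total: {active_fraction:.1%}")
```

Under these assumptions the shared components come to roughly 1.6B parameters and each expert to roughly 5.6B, so a little over a quarter of the total is active for any one token.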

Training MoE Models

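MoE training requires careful engineering. The central challenge is load balancing. Left alone, the router tends to favor a few experts, which then receive more gradient updates and become even more attractive, while the rest are ignored ("expert collapse", also called routing collapse). The result is wasted capacity. Common remedies include an auxiliary load-balancing loss that penalizes uneven expert utilization, and random noise added to the router scores, particularly early in training. A related problem is expert buffer overflow: when more tokens are routed to an expert than its fixed capacity allows, the excess tokens must be dropped, rerouted, or passed through the residual connection unchanged.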
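Both mechanisms can be illustrated with a small sketch. The loss below follows the form used in the Switch Transformer (the number of experts times the sum over experts of the fraction of tokens dispatched to an expert multiplied by its mean router probability), applied here to top-1 assignments for simplicity. The function names, the coefficient `alpha`, the capacity factor of 1.25, and the toy data are illustrative assumptions, not values from any specific model.

```python
import math

def load_balance_loss(router_probs, assignments, n_experts, alpha=0.01):
    """alpha * N * sum_i f_i * P_i, minimized when routing is uniform."""
    n_tokens = len(assignments)
    frac_tokens = [0.0] * n_experts  # f_i: share of tokens sent to expert i
    for e in assignments:
        frac_tokens[e] += 1.0 / n_tokens
    # P_i: mean router probability given to expert i across the batch.
    mean_prob = [sum(p[e] for p in router_probs) / n_tokens
                 for e in range(n_experts)]
    return alpha * n_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))

def apply_capacity(assignments, n_experts, capacity_factor=1.25):
    """Keep tokens in arrival order until an expert's buffer is full."""
    capacity = math.ceil(capacity_factor * len(assignments) / n_experts)
    load = [0] * n_experts
    kept, dropped = [], []
    for tok, e in enumerate(assignments):
        if load[e] < capacity:
            load[e] += 1
            kept.append(tok)
        else:
            dropped.append(tok)  # overflow: this token skips the expert
    return capacity, kept, dropped

n_experts = 4
uniform = [[0.25] * 4 for _ in range(8)]
skewed = [[0.97, 0.01, 0.01, 0.01] for _ in range(8)]

balanced = load_balance_loss(uniform, [0, 1, 2, 3, 0, 1, 2, 3], n_experts)
collapsed = load_balance_loss(skewed, [0] * 8, n_experts)
capacity, kept, dropped = apply_capacity([0] * 8, n_experts)
print(balanced, collapsed, capacity, dropped)
```

With uniform routing the loss sits at its minimum of `alpha`; when every token goes to expert 0 it is almost four times larger, and most of the batch overflows that expert's buffer.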

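DeepSeek's MoE designs in V2 and V3 are among the most widely cited recent work on MoE training stability and efficiency. Their innovations include shared experts that every token passes through, a larger number of smaller, fine-grained routed experts, and in V3 a load-balancing scheme that adjusts per-expert bias terms instead of relying mainly on an auxiliary loss.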

When MoE Makes Sense

MoE is compelling for large models that need to be deployed at scale. For smaller models or low-throughput applications, the engineering complexity may outweigh the efficiency gains. The architecture also requires specialized serving infrastructure: experts are often spread across devices, load must be balanced among them, and tokens must be batched per expert to amortize the overhead of expert selection. For high-throughput applications with large models, MoE is increasingly the architecture of choice.
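The batching concern can be made concrete with a small sketch: once the router has chosen experts for every token in a batch, a serving system typically regroups the tokens so that each expert runs once over all of its tokens rather than once per token. The function `group_by_expert` and the routing decisions below are hypothetical, and real systems do this with tensor operations across devices rather than Python lists.

```python
from collections import defaultdict

def group_by_expert(token_to_experts):
    """Map expert id -> list of token indices routed to that expert."""
    groups = defaultdict(list)
    for tok, expert_ids in enumerate(token_to_experts):
        for e in expert_ids:
            groups[e].append(tok)
    return dict(groups)

# Hypothetical top-2 choices for a batch of four tokens.
routing = [(1, 4), (1, 2), (4, 7), (1, 4)]
batches = group_by_expert(routing)

# Each expert processes its whole group in a single call.
for expert_id in sorted(batches):
    print(expert_id, batches[expert_id])
```

The group sizes are uneven even in this tiny example, which is why serving-time load balancing matters: the slowest, most heavily loaded expert determines how long the layer takes.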