Mixture of Experts (MoE)

Mixture of Experts is a neural network architecture where a learned routing mechanism activates only a small subset of specialised sub-networks (experts) for each input token — delivering the capacity of a much larger model at a fraction of the per-token compute cost.

The central tension in large language model scaling is that bigger models are generally more capable, but also slower and more expensive to run. Mixture of Experts eases this tension by decoupling model capacity from per-token compute. An MoE model can hold several times as many parameters as a dense model while spending roughly the same inference compute per token, because only a fraction of the model is active for any given token.

MoE has become a dominant architecture for frontier models: Gemini 1.5, Mixtral, DeepSeek V2/V3, and Llama 4 are publicly documented MoE models, and GPT-4 is widely reported to be one, although OpenAI has not confirmed its architecture.

How MoE Works

A standard transformer processes every token through every layer, fully. An MoE transformer replaces some or all of the feed-forward network (FFN) layers — the compute-intensive layers that make up ~2/3 of a model’s parameters — with an MoE layer.

An MoE layer contains:

  1. N expert networks: independent FFN sub-networks, each with its own weights. A large MoE model might have 8, 64, or even 256 experts per layer.
  2. A router (gating network) — a small learned linear layer that takes each token’s representation and outputs a probability distribution over the N experts.
  3. A top-K selection — only the K highest-scoring experts (typically K=1 or K=2) process each token. All others are skipped.
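
The three components above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any particular model's implementation: the function names, the two-matrix ReLU experts, and the renormalisation of the selected routing scores are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, router_w, experts_w1, experts_w2, k=2):
    """tokens: (T, d); router_w: (d, N); experts_w1: (N, d, h); experts_w2: (N, h, d)."""
    probs = softmax(tokens @ router_w)                # router: distribution over N experts
    chosen = np.argsort(-probs, axis=-1)[:, :k]       # top-K expert indices per token
    gate = np.take_along_axis(probs, chosen, axis=-1)
    gate = gate / gate.sum(axis=-1, keepdims=True)    # assumption: renormalise over the K selected
    out = np.zeros_like(tokens)
    for e in range(router_w.shape[1]):
        token_idx, slot = np.nonzero(chosen == e)     # tokens routed to expert e
        if token_idx.size == 0:
            continue                                  # idle expert: no compute spent
        hidden = np.maximum(tokens[token_idx] @ experts_w1[e], 0.0)  # ReLU FFN (illustrative)
        out[token_idx] += gate[token_idx, slot][:, None] * (hidden @ experts_w2[e])
    return out, chosen

rng = np.random.default_rng(0)
T, d, h, N = 5, 4, 16, 8                              # hypothetical toy sizes
tokens = rng.normal(size=(T, d))
router_w = rng.normal(size=(d, N))
experts_w1 = rng.normal(size=(N, d, h))
experts_w2 = rng.normal(size=(N, h, d))
out, chosen = moe_layer(tokens, router_w, experts_w1, experts_w2, k=2)
print(out.shape, chosen.shape)
```

Looping over experts and gathering the tokens assigned to each one mirrors how sparse dispatch is usually organised: an expert that receives no tokens does no work.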

The router and the experts are trained jointly via standard gradient descent. Over training, experts tend to specialise: some may handle syntactic patterns, others domain-specific vocabulary, others certain reasoning types. This specialisation is emergent and implicit rather than explicitly designed.

The Core Efficiency Gain

Consider a model with 8 experts per MoE layer, each expert the same size as the FFN in a dense model, with top-2 routing (each token goes to 2 of 8 experts):

  • Total parameters: ~8× the dense model’s FFN parameters
  • Active parameters per token: ~2/8 = 25% of the total FFN parameters
  • FFN compute per token: ~25% of what a dense model with the same total FFN parameter count would need

This is the MoE value proposition: scale the FFN parameter count 8× while per-token FFN compute grows only 2× (two experts run instead of one FFN) rather than 8×. More parameters means higher capacity and more world knowledge; sparse activation means inference cost and latency don’t scale proportionally.
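
The arithmetic in this example can be written out directly. The helper below is a hypothetical sketch that measures everything in units of one dense FFN and ignores attention, embeddings, and the router itself.

```python
def moe_ffn_budget(dense_ffn_params, num_experts, top_k):
    """Return (total, active, active fraction) for one MoE layer's expert parameters."""
    total = num_experts * dense_ffn_params   # every expert must be stored
    active = top_k * dense_ffn_params        # only the selected experts run per token
    return total, active, active / total

total, active, fraction = moe_ffn_budget(dense_ffn_params=1.0, num_experts=8, top_k=2)
print(total, active, fraction)  # 8.0 2.0 0.25
```

Read this way, the layer stores 8 FFNs' worth of parameters but each token pays for 2 of them, which is 25% of the stored total.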

Key Design Choices

Number of Experts

More experts means higher total capacity but also higher memory requirements, because all expert weights must be loaded even if most are idle for any given token. Practical deployments balance expert count against available GPU memory.

Top-K Selection

  • Top-1 (Switch-style routing): Each token is processed by exactly one expert. Maximum efficiency, but training can be unstable.
  • Top-2 (a common default): Each token is processed by two experts, and their outputs are combined in a weighted sum using the routing scores. More stable training and better quality; used by GShard and Mixtral, while fine-grained designs select more experts per token (e.g., top-6 or top-8).

Load Balancing

Without intervention, routers tend to collapse: they learn to send most tokens to a few experts and starve the rest. Auxiliary load-balancing losses push the router toward more even expert utilisation, so that all experts receive training signal and can specialise.
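
One widely used form of such a loss is the one described for the Switch Transformer: the number of experts times the sum, over experts, of the fraction of tokens dispatched to each expert multiplied by the mean router probability for that expert. The sketch below assumes top-1 routing; the function name and the `alpha` default are illustrative choices, not values taken from this article.

```python
import numpy as np

def load_balancing_loss(router_probs, expert_index, alpha=0.01):
    """router_probs: (T, N) softmax outputs; expert_index: (T,) expert chosen per token."""
    num_tokens, num_experts = router_probs.shape
    # f[i]: fraction of tokens actually dispatched to expert i
    f = np.bincount(expert_index, minlength=num_experts) / num_tokens
    # p[i]: average probability mass the router assigns to expert i
    p = router_probs.mean(axis=0)
    return alpha * num_experts * float(np.sum(f * p))

# Balanced routing over 4 experts versus a fully collapsed router.
balanced = load_balancing_loss(np.full((4, 4), 0.25), np.array([0, 1, 2, 3]), alpha=1.0)
collapsed = load_balancing_loss(np.eye(4)[[0, 0, 0, 0]], np.array([0, 0, 0, 0]), alpha=1.0)
print(balanced, collapsed)  # 1.0 4.0
```

The loss is smallest when both the dispatch fractions and the router probabilities are uniform, so adding it to the training objective penalises collapse.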

Expert Granularity

DeepSeek V2 popularised fine-grained experts — using many smaller experts (e.g., 160 experts, each 1/16 the size of a standard FFN) rather than a few large ones, and selecting more of them per token (top-6 from 160). This improves expert specialisation and routing flexibility while keeping active compute constant.
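
The routing-flexibility argument can be made concrete by counting how many distinct expert combinations a token can be sent to. In this sketch the 16-expert, top-2 baseline and the split factor of 4 are hypothetical numbers chosen for illustration; splitting every expert into `split` smaller ones and selecting `split` times as many keeps the active fraction unchanged.

```python
from math import comb

def routing_combinations(num_experts, top_k, split=1):
    """Number of distinct expert subsets a token can be routed to."""
    return comb(num_experts * split, top_k * split)

coarse = routing_combinations(16, 2)            # 16 experts, top-2
fine = routing_combinations(16, 2, split=4)     # 64 smaller experts, top-8
print(coarse, fine)  # 120 4426165368

# Active fraction is identical in both layouts: 2/16 == 8/64.
print(2 / 16 == 8 / 64)  # True

# The configuration mentioned above (top-6 from 160) can be counted the same way.
deepseek_v2_style = comb(160, 6)
```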

MoE vs. Dense Models

| Aspect | Dense Model | MoE Model |
|---|---|---|
| Parameters | All active per token | Small fraction active per token |
| Compute per token | Proportional to parameters | ~K/N of expert parameters, plus shared layers |
| Memory | Proportional to parameters | Proportional to total parameters (full model must fit in memory) |
| Training stability | High | Lower (requires load balancing) |
| Inference throughput | High | Lower at equal active parameters (routing + expert dispatch overhead) |
| Quality at same compute | Baseline | Typically higher |

Notable MoE Models

Switch Transformer (Google, 2021): First demonstration that extreme sparsity (top-1 routing) could scale to trillion-parameter models with stable training. Proved MoE could work reliably at large scale.

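Mixtral 8×7B (Mistral AI, December 2023): A landmark open-weight MoE model with 8 experts per layer and top-2 routing. Despite the name, the total is 46.7B parameters rather than 56B, because only the FFN blocks are replicated per expert while attention and embeddings are shared; 12.9B parameters are active per token. Matched or exceeded LLaMA 2 70B on most benchmarks at a fraction of the inference cost. Made MoE accessible for open-source practitioners.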
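
The 46.7B and 12.9B figures can be reproduced with a back-of-the-envelope count. The shape values below are not stated in this article; they are assumed from the publicly released Mixtral 8×7B configuration, and the sketch counts only weight matrices and normalisation vectors.

```python
# Assumed shape values (from the public Mixtral 8x7B config, not from this article).
HIDDEN = 4096        # model width
FFN_HIDDEN = 14336   # expert FFN inner width
LAYERS = 32
EXPERTS = 8
TOP_K = 2
VOCAB = 32000
KV_DIM = 1024        # 8 key/value heads x 128 dims (grouped-query attention)

def count_params(experts_used):
    expert = 3 * HIDDEN * FFN_HIDDEN                        # gate, up and down projections
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM   # q and o, plus k and v
    router = HIDDEN * EXPERTS
    norms = 2 * HIDDEN
    per_layer = experts_used * expert + attention + router + norms
    embeddings = 2 * VOCAB * HIDDEN                         # input embedding + output head
    return LAYERS * per_layer + embeddings + HIDDEN         # + final norm

total = count_params(EXPERTS)   # all experts stored
active = count_params(TOP_K)    # only the selected experts run per token
print(f"{total / 1e9:.1f}B total, {active / 1e9:.1f}B active")  # 46.7B total, 12.9B active
```

The count shows why the total is well below 8 × 7B: attention, embeddings, and norms are shared across experts and counted once.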

GPT-4 (OpenAI, 2023): Widely reported to use an MoE architecture, though OpenAI has not officially confirmed architectural details.

DeepSeek V2 (DeepSeek, May 2024): Pushed MoE efficiency further by combining fine-grained experts with Multi-head Latent Attention (MLA), which compresses the key-value cache. The result was strong performance at dramatically lower inference cost per token, a key reason for DeepSeek’s cost advantage.

DeepSeek-V3 (December 2024): Scaled to 671B total parameters with 37B active per token, using 256 fine-grained routed experts per MoE layer. Critically, it introduced auxiliary-loss-free load balancing: previous MoE models relied on auxiliary loss terms to force even expert utilisation, which can conflict with the main training objective. DeepSeek-V3 instead adjusts a per-expert routing bias, improving training stability and final quality. Competitive with GPT-4-class models at a fraction of the inference cost.

Gemini 1.5/2.x (Google DeepMind, 2024–2025): Pairs a sparse MoE transformer with very long context windows. Sparse activation reduces the per-token FFN compute of processing million-token contexts, which would be far more expensive for a dense model of the same capability; the cost of attention over the context is a separate concern that MoE does not address.

Llama 4 (Meta, 2025): Meta’s first MoE architecture in the Llama series, marking MoE’s arrival as the default choice even for models intended for broad open-source deployment.

Qwen3 (Alibaba, 2025): The Qwen3-Next 80B-A3B model demonstrated that only 3B active parameters (from 80B total) could compete with far larger dense models on reasoning benchmarks. Qwen3-Coder-Next (2026) outperformed DeepSeek V3.2 on coding tasks with a fraction of the active compute.

Kimi K2 (Moonshot AI, 2025): Scaled an open-weight MoE to ~1 trillion total parameters with about 32B active per token, demonstrating that the architecture can reach this scale while remaining practical to serve. Uses top-8 routing across 384 experts per layer.

2025–2026: Architecture Advances Beyond Scale

The frontier has shifted from simply scaling expert count toward making routing more reliable and efficient under long training runs:

Auxiliary-loss-free load balancing (pioneered by DeepSeek-V3) replaces the loss-term approach with routing bias adjustments that update separately from gradient descent — cleaner separation of concerns and better final model quality.
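
A minimal sketch of the bias-adjustment idea follows. It is not DeepSeek's code: the function names, step size, and toy router scores are assumptions. The bias is added to the routing scores only when selecting experts, and after each batch it is nudged down for overloaded experts and up for underloaded ones, with no gradient involved.

```python
import numpy as np

def select_experts(scores, bias, k):
    """Pick top-k experts from biased scores; gating weights would still use raw scores."""
    return np.argsort(-(scores + bias), axis=-1)[:, :k]

def update_bias(bias, chosen, step):
    """Lower the bias of experts above the mean load, raise it for those below."""
    load = np.bincount(chosen.ravel(), minlength=bias.shape[0])
    return bias - step * np.sign(load - load.mean())

rng = np.random.default_rng(0)
scores = rng.random((256, 8))   # toy router scores: 256 tokens, 8 experts
scores[:, 0] += 0.5             # expert 0 is systematically favoured
bias = np.zeros(8)

initial_load = np.bincount(select_experts(scores, bias, k=2).ravel(), minlength=8)
for _ in range(200):            # repeated batches (the same toy batch here)
    chosen = select_experts(scores, bias, k=2)
    bias = update_bias(bias, chosen, step=0.01)
final_load = np.bincount(select_experts(scores, bias, k=2).ravel(), minlength=8)

print("max load before:", initial_load.max(), "after:", final_load.max())
```

Because the bias influences only which experts are selected, the balancing pressure does not add a competing term to the training loss.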

Progressive Scaling Routing (PSR) anneals the candidate expert pool during training — starting with broader routing and progressively narrowing it — to mitigate load imbalance and router collapse over long runs.

Multilinear / Factorised MoE represents the expert mapping as a factorised tensor contraction, enabling tens of thousands of differentiable micro-experts with minimal additional FLOP overhead. Allows much finer-grained specialisation than discrete top-K routing.

As of mid-2026, over 60% of open-source model releases use MoE architectures, and all top-10 open-weight models on major benchmarks are MoE-based. The architecture has crossed from research technique to production default.

Practical Implications

For teams deploying MoE models:

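  • Memory is the bottleneck, not compute. All expert weights must be loaded into GPU/CPU memory even though most are idle per token, so weight memory scales with total rather than active parameters. Plan for roughly 2–4× the memory of a same-compute dense model (for Mixtral, 46.7B total vs 12.9B active is about 3.6×).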
  • Expert parallelism requires routing tokens to the right GPU/device — distributed MoE inference is architecturally more complex than dense model serving.
  • MoE models can be quantised with the same techniques as dense models, which helps significantly with memory.
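
A rough weight-memory estimate makes the first and third points concrete. The helper below is a hypothetical sketch that counts weights only, ignoring the KV cache, activations, and any quantisation metadata; it uses Mixtral's 46.7B total parameters as the example.

```python
def weight_memory_gb(total_params, bits_per_param):
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9

MIXTRAL_TOTAL_PARAMS = 46.7e9  # total, not active: every expert must be resident

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(MIXTRAL_TOTAL_PARAMS, bits):.1f} GB")
```

Note that the estimate uses the total parameter count: the 12.9B active parameters determine compute per token, not how much memory the weights occupy.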

For teams choosing between MoE and dense models:

  • If you’re constrained by inference compute budget, MoE gives more capability per FLOP.
  • If you’re constrained by GPU memory (e.g., edge deployment), dense models are simpler.
  • For fine-tuning, MoE models behave similarly to dense models — LoRA adapters can be applied to expert layers normally.
