DeepSeek V4 — AI Glossary

DeepSeek V4 is a two-model open-source family released by DeepSeek on April 24, 2026, under the MIT License. It continues the lineage established by DeepSeek-V3 and V3.2, pushing the efficiency and capability of open-weight models to a new threshold — placing a 1.6-trillion-parameter model within 0.2 percentage points of Claude Opus 4.6 on SWE-bench Verified while charging $3.48 per million output tokens versus Claude’s $25.

Model Variants

	V4-Pro	V4-Flash
Total parameters	1.6 trillion	284 billion
Active per token	49 billion	13 billion
Training tokens	33 trillion	32 trillion
Context window	1 million tokens	1 million tokens

Both variants use a Mixture-of-Experts (MoE) architecture where only a small subset of parameters activates per token, keeping inference compute practical despite the enormous total parameter count.

Key Architectural Innovations

CSA / HCA Hybrid Attention V4 replaces standard multi-head attention with a hybrid of two complementary mechanisms:

Compressed Sparse Attention (CSA) — handles long-range dependencies efficiently by attending sparsely across the full 1M-token context.
Heavily Compressed Attention (HCA) — processes local context with aggressive compression for speed.

The result: at 1M-token context, V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared to DeepSeek-V3.2. This makes sustained long-context inference economically viable at scale.

mHC (Manifold-Constrained Hyper-Connections) Standard residual connections in very deep models suffer from signal amplification — gradients and activations compound across layers, causing instability at scale. DeepSeek’s mHC framework constrains the residual connection mixing matrices to the Birkhoff Polytope using the Sinkhorn-Knopp algorithm, capping signal amplification at 1.6× regardless of depth. This is what allowed stable pre-training at 1.6 trillion parameters on 33 trillion tokens without gradient collapse.

Muon Optimizer DeepSeek replaced the standard AdamW optimizer with Muon for pre-training — a second-order optimizer that uses orthogonal updates. Combined with mHC’s stability guarantees, Muon enabled faster convergence and maintained training stability across the full 33-trillion-token run.

On-Policy Distillation (Post-Training) Rather than reinforcement learning from human feedback (RLHF), V4’s post-training pipeline uses On-Policy Distillation — the model learns from outputs generated by a stronger teacher model under the same distribution the student will be deployed in. This eliminates reward model errors and produces more consistent reasoning behaviour than RL-based fine-tuning.

Configurable Reasoning Depth Developers can select the level of reasoning effort per request — trading latency for analytical depth. This is exposed as a parameter at inference time, making V4 practical for both low-latency applications (low reasoning) and complex tasks that benefit from extended chain-of-thought (high reasoning).

Performance

LiveCodeBench: 93.5% — ahead of Gemini (91.7%) and Claude (88.8%)
SWE-bench Verified: 80.6% — within 0.2 points of Claude Opus 4.6
Long-context efficiency: 27% of V3.2’s FLOPs at 1M tokens

Significance

DeepSeek V4 marks a new high-water mark for open-source frontier models. Each V3-generation release has closed the gap with proprietary models by a meaningful margin; V4 effectively eliminated it on coding benchmarks. The MIT license and competitive API pricing ($3.48/M output tokens) make it a credible default for production coding and reasoning workloads where cost and openness matter.