DeepSeek V4 is a two-model open-source family released by DeepSeek on April 24, 2026, under the MIT License. It continues the lineage established by DeepSeek-V3 and V3.2, pushing the efficiency and capability of open-weight models to a new threshold — placing a 1.6-trillion-parameter model within 0.2 percentage points of Claude Opus 4.6 on SWE-bench Verified while charging $3.48 per million output tokens versus Claude’s $25.
Model Variants
| V4-Pro | V4-Flash | |
|---|---|---|
| Total parameters | 1.6 trillion | 284 billion |
| Active per token | 49 billion | 13 billion |
| Training tokens | 33 trillion | 32 trillion |
| Context window | 1 million tokens | 1 million tokens |
Both variants use a Mixture-of-Experts (MoE) architecture where only a small subset of parameters activates per token, keeping inference compute practical despite the enormous total parameter count.
Key Architectural Innovations
CSA / HCA Hybrid Attention V4 replaces standard multi-head attention with a hybrid of two complementary mechanisms:
- Compressed Sparse Attention (CSA) — handles long-range dependencies efficiently by attending sparsely across the full 1M-token context.
- Heavily Compressed Attention (HCA) — processes local context with aggressive compression for speed.
The result: at 1M-token context, V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared to DeepSeek-V3.2. This makes sustained long-context inference economically viable at scale.
mHC (Manifold-Constrained Hyper-Connections) Standard residual connections in very deep models suffer from signal amplification — gradients and activations compound across layers, causing instability at scale. DeepSeek’s mHC framework constrains the residual connection mixing matrices to the Birkhoff Polytope using the Sinkhorn-Knopp algorithm, capping signal amplification at 1.6× regardless of depth. This is what allowed stable pre-training at 1.6 trillion parameters on 33 trillion tokens without gradient collapse.
Muon Optimizer DeepSeek replaced the standard AdamW optimizer with Muon for pre-training — a second-order optimizer that uses orthogonal updates. Combined with mHC’s stability guarantees, Muon enabled faster convergence and maintained training stability across the full 33-trillion-token run.
On-Policy Distillation (Post-Training) Rather than reinforcement learning from human feedback (RLHF), V4’s post-training pipeline uses On-Policy Distillation — the model learns from outputs generated by a stronger teacher model under the same distribution the student will be deployed in. This eliminates reward model errors and produces more consistent reasoning behaviour than RL-based fine-tuning.
Configurable Reasoning Depth Developers can select the level of reasoning effort per request — trading latency for analytical depth. This is exposed as a parameter at inference time, making V4 practical for both low-latency applications (low reasoning) and complex tasks that benefit from extended chain-of-thought (high reasoning).
Performance
- LiveCodeBench: 93.5% — ahead of Gemini (91.7%) and Claude (88.8%)
- SWE-bench Verified: 80.6% — within 0.2 points of Claude Opus 4.6
- Long-context efficiency: 27% of V3.2’s FLOPs at 1M tokens
Significance
DeepSeek V4 marks a new high-water mark for open-source frontier models. Each V3-generation release has closed the gap with proprietary models by a meaningful margin; V4 effectively eliminated it on coding benchmarks. The MIT license and competitive API pricing ($3.48/M output tokens) make it a credible default for production coding and reasoning workloads where cost and openness matter.