DeepSeek V4 is a two-model open-source family released by DeepSeek on April 24, 2026, under the MIT License. It continues the lineage established by DeepSeek-V3 and V3.2 and pushes the efficiency and capability of open-weight models further: its 1.6-trillion-parameter model scores within 0.2 percentage points of Claude Opus 4.6 on SWE-bench Verified while charging $3.48 per million output tokens versus Claude’s $25.
Model Variants
| | V4-Pro | V4-Flash |
|---|---|---|
| Total parameters | 1.6 trillion | 284 billion |
| Active per token | 49 billion | 13 billion |
| Training tokens | 33 trillion | 32 trillion |
| Context window | 1 million tokens | 1 million tokens |
Both variants use a Mixture-of-Experts (MoE) architecture where only a small subset of parameters activates per token, keeping inference compute practical despite the enormous total parameter count.
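To make the sparsity concrete, the short sketch below computes the share of parameters active per token from the table's figures, and shows a generic top-k expert selection. The `top_k_experts` helper and its example scores are illustrative assumptions about how MoE routing works in general, not a description of DeepSeek's actual router.

```python
# Parameter counts taken from the table above, in billions.
VARIANTS = {
    "V4-Pro": {"total_b": 1600, "active_b": 49},
    "V4-Flash": {"total_b": 284, "active_b": 13},
}


def active_fraction(total_b, active_b):
    """Share of parameters that take part in one token's forward pass."""
    return active_b / total_b


def top_k_experts(router_scores, k):
    """Generic MoE routing sketch: keep the k highest-scoring experts.

    Hypothetical helper for illustration only; the real router design
    is not described in this article.
    """
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return ranked[:k]


for name, p in VARIANTS.items():
    frac = active_fraction(p["total_b"], p["active_b"])
    print(f"{name}: {frac:.2%} of parameters active per token")

# Made-up router scores for four experts; only two are selected.
print(top_k_experts([0.10, 0.70, 0.05, 0.15], k=2))
```

By these figures, roughly 3% of V4-Pro's parameters and under 5% of V4-Flash's participate in any single token.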
Key Architectural Innovations
CSA / HCA Hybrid Attention
V4 replaces standard multi-head attention with a hybrid of two complementary mechanisms:
- Compressed Sparse Attention (CSA) — handles long-range dependencies efficiently by attending sparsely across the full 1M-token context.
- Heavily Compressed Attention (HCA) — processes local context with aggressive compression for speed.
The result: at 1M-token context, V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared to DeepSeek-V3.2. This makes sustained long-context inference economically viable at scale.
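The two ratios above translate directly into reduction factors. The sketch below does that arithmetic; the 100 GB baseline KV cache is a hypothetical number chosen only to show the scaling, not a measured figure for DeepSeek-V3.2.

```python
# Ratios reported above for V4-Pro relative to DeepSeek-V3.2 at 1M tokens.
FLOPS_RATIO = 0.27
KV_CACHE_RATIO = 0.10


def reduction_factor(ratio):
    """Convert 'uses X of the baseline' into 'N times less than baseline'."""
    return 1.0 / ratio


def scaled_kv_cache_gb(baseline_gb, ratio=KV_CACHE_RATIO):
    """KV cache size after applying the reported ratio to a baseline size."""
    return baseline_gb * ratio


print(f"FLOPs reduction: {reduction_factor(FLOPS_RATIO):.1f}x")
print(f"KV cache reduction: {reduction_factor(KV_CACHE_RATIO):.1f}x")

# Hypothetical baseline: if V3.2 needed 100 GB of KV cache at 1M tokens.
print(f"Scaled KV cache: {scaled_kv_cache_gb(100.0):.1f} GB")
```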
mHC (Manifold-Constrained Hyper-Connections)
Standard residual connections in very deep models suffer from signal amplification: activations and gradients compound across layers, causing instability at scale. DeepSeek’s mHC framework constrains the residual mixing matrices to the Birkhoff polytope (the set of doubly stochastic matrices, whose rows and columns each sum to 1) using the Sinkhorn-Knopp algorithm, capping signal amplification at 1.6× regardless of depth. This is what allowed stable pre-training at 1.6 trillion parameters on 33 trillion tokens.
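For readers unfamiliar with Sinkhorn-Knopp, the following is a minimal sketch of the textbook algorithm: alternately normalize rows and columns of a positive matrix until both sum to 1. It illustrates the projection idea only. The matrix values and iteration count are arbitrary assumptions, and how mHC applies the algorithm inside the network is not shown here.

```python
def sinkhorn_knopp(matrix, iterations=100):
    """Scale a square matrix with strictly positive entries toward a
    doubly stochastic one by alternating row and column normalization."""
    m = [row[:] for row in matrix]  # work on a copy
    n = len(m)
    for _ in range(iterations):
        # Row step: make every row sum to 1.
        for i in range(n):
            row_sum = sum(m[i])
            m[i] = [v / row_sum for v in m[i]]
        # Column step: make every column sum to 1.
        for j in range(n):
            col_sum = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= col_sum
    return m


# Arbitrary positive example matrix (an assumption for illustration).
mixing = sinkhorn_knopp([[1.0, 2.0, 3.0],
                         [4.0, 5.0, 6.0],
                         [7.0, 8.0, 10.0]])

row_sums = [sum(row) for row in mixing]
col_sums = [sum(row[j] for row in mixing) for j in range(3)]
print("row sums:", [round(s, 6) for s in row_sums])
print("col sums:", [round(s, 6) for s in col_sums])
```

Because every row and column of the result sums to 1, applying such a matrix redistributes signal across streams rather than scaling it up, which is the intuition behind using this constraint for stability.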
Muon Optimizer
DeepSeek replaced the standard AdamW optimizer with Muon for pre-training. Muon orthogonalizes each momentum-based update to a weight matrix (using a Newton-Schulz iteration) before applying it, rather than scaling updates element-wise as AdamW does. Combined with mHC’s stability constraints, Muon enabled faster convergence and maintained training stability across the full 33-trillion-token run.
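The orthogonalization step at the heart of Muon can be sketched with the classic cubic Newton-Schulz iteration, X ← 1.5·X − 0.5·X·Xᵀ·X. This is a simplified stand-in: Muon implementations typically use a tuned higher-order polynomial and run on accelerator tensors, and nothing below reflects DeepSeek's training code. The example matrix and step count are assumptions.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=20):
    """Approximate the nearest orthogonal matrix to a non-zero,
    full-rank square matrix g using the cubic Newton-Schulz iteration."""
    # Normalize by the Frobenius norm so all singular values are <= 1,
    # which keeps the iteration inside its convergence region.
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j]
              for j in range(len(x[0]))] for i in range(len(x))]
    return x


# Stand-in for a momentum/gradient matrix (values are arbitrary).
update = newton_schulz_orthogonalize([[3.0, 1.0], [0.0, 2.0]])
gram = matmul(update, transpose(update))
print([[round(v, 6) for v in row] for row in gram])  # close to the identity
```

The orthogonalized update keeps the directions of the original matrix but pushes all of its singular values toward 1, so no single direction dominates the step.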
On-Policy Distillation (Post-Training)
Rather than relying on reinforcement learning from human feedback (RLHF), V4’s post-training pipeline uses On-Policy Distillation: the student model learns from a stronger teacher model on the same distribution of outputs the student will encounter at deployment. This removes the dependence on a learned reward model, and with it reward-model errors, and produces more consistent reasoning behaviour than RL-based fine-tuning.
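One common way to formulate on-policy distillation, offered here as an assumption rather than as DeepSeek's documented objective, is to have the student generate a sequence, ask the teacher for its next-token distribution at each position, and minimize the per-token reverse KL divergence between the two. The probabilities below are made up for illustration.

```python
import math


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for one token position.

    Assumes the teacher assigns non-zero probability wherever the
    student does.
    """
    return sum(p * math.log(p / q)
               for p, q in zip(student_probs, teacher_probs) if p > 0)


def on_policy_distillation_loss(student_steps, teacher_steps):
    """Mean per-token reverse KL over a sequence the student sampled."""
    per_token = [reverse_kl(s, t)
                 for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


# Hypothetical distributions over a 3-token vocabulary at two positions
# of a student-generated sequence.
student = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]
teacher = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
print(f"loss: {on_policy_distillation_loss(student, teacher):.4f}")
```

The loss is zero only when the student matches the teacher on the student's own samples, and no separate reward model appears anywhere in the computation.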
Configurable Reasoning Depth
Developers can select the level of reasoning effort per request, trading latency for analytical depth. This is exposed as a parameter at inference time, making V4 practical for both low-latency applications (low reasoning) and complex tasks that benefit from extended chain-of-thought (high reasoning).
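A caller might choose the effort level from a latency budget, as in the sketch below. Everything in it is hypothetical: the article does not give the parameter's name or accepted values, so the `reasoning_effort` field, the `"low"`/`"high"` values, the model identifier, and the 2-second threshold are placeholders, not the real API.

```python
# Hypothetical threshold: requests that must answer faster than this
# use low reasoning effort.
LOW_LATENCY_CUTOFF_MS = 2000


def pick_reasoning_effort(latency_budget_ms):
    """Map a latency budget to a reasoning level (placeholder values)."""
    return "low" if latency_budget_ms < LOW_LATENCY_CUTOFF_MS else "high"


def build_request(prompt, latency_budget_ms):
    """Assemble a request payload as a plain dict.

    The field names here are assumptions for illustration and should be
    replaced with the names from the provider's API reference.
    """
    return {
        "model": "v4-pro-placeholder",  # hypothetical model identifier
        "prompt": prompt,
        "reasoning_effort": pick_reasoning_effort(latency_budget_ms),
    }


print(build_request("Autocomplete this line", latency_budget_ms=500))
print(build_request("Find the race condition in this module",
                    latency_budget_ms=60000))
```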
Performance
- LiveCodeBench: 93.5% — ahead of Gemini (91.7%) and Claude (88.8%)
- SWE-bench Verified: 80.6% — within 0.2 points of Claude Opus 4.6
- Long-context efficiency: 27% of V3.2’s FLOPs at 1M tokens
Significance
DeepSeek V4 marks a new high-water mark for open-source frontier models. Each V3-generation release closed the gap with proprietary models by a meaningful margin; V4 effectively eliminates it on the coding benchmarks reported above. The MIT license and competitive API pricing ($3.48/M output tokens) make it a credible default for production coding and reasoning workloads where cost and openness matter.
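The pricing gap is easy to quantify from the two figures quoted in this article. In the sketch below, the 50-million-token monthly workload is a hypothetical volume picked for illustration, and only output-token prices are considered because input prices are not given here.

```python
# Output-token prices quoted in this article, in USD per million tokens.
V4_PRICE_PER_M = 3.48
CLAUDE_PRICE_PER_M = 25.00


def output_cost_usd(output_tokens, price_per_million):
    """Cost of generating a given number of output tokens."""
    return output_tokens / 1_000_000 * price_per_million


# Hypothetical workload: 50 million output tokens per month.
monthly_tokens = 50_000_000
v4_cost = output_cost_usd(monthly_tokens, V4_PRICE_PER_M)
claude_cost = output_cost_usd(monthly_tokens, CLAUDE_PRICE_PER_M)

print(f"V4:     ${v4_cost:,.2f}")
print(f"Claude: ${claude_cost:,.2f}")
print(f"Price ratio: {CLAUDE_PRICE_PER_M / V4_PRICE_PER_M:.2f}x")
```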