DeepSeek-V4 series introduces two MoE models with million-token context support and efficiency gains
DeepSeek-V4-Pro (1.6T parameters, 49B activated) and DeepSeek-V4-Flash (284B parameters, 13B activated) are reported to cut inference costs at long context, with the Pro-Max mode reported as state of the art among open models on core tasks.
1 source · cross-referenced
- Two new Mixture-of-Experts models, DeepSeek-V4-Pro and DeepSeek-V4-Flash, support one-million-token contexts with improved efficiency.
- DeepSeek-V4-Pro-Max achieves state-of-the-art performance among open models in core tasks, per the preprint.
- The models incorporate a hybrid attention architecture, manifold-constrained hyper-connections, and the Muon optimizer.
- DeepSeek-V4-Pro is reported to cut single-token inference FLOPs by 73% and KV cache size by 90% relative to DeepSeek-V3.2 in million-token settings.
The DeepSeek-V4 series introduces two Mixture-of-Experts (MoE) language models: DeepSeek-V4-Pro with 1.6 trillion parameters (49 billion activated) and DeepSeek-V4-Flash with 284 billion parameters (13 billion activated). Both models support a context length of one million tokens, according to the arXiv preprint.
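The gap between total and activated parameters is what makes an MoE model cheaper to run than its size suggests. As a rough illustration, the sketch below computes the share of parameters active per token from the figures quoted above; the dictionary and function names are hypothetical, chosen only for this example.

```python
# Parameter counts as quoted in the article: (total, activated per token).
# The names MODELS and activated_fraction are illustrative, not from the preprint.
MODELS = {
    "DeepSeek-V4-Pro": (1.6e12, 49e9),
    "DeepSeek-V4-Flash": (284e9, 13e9),
}


def activated_fraction(total_params: float, activated_params: float) -> float:
    """Return the share of parameters that are active for a single token."""
    return activated_params / total_params


for name, (total, active) in MODELS.items():
    share = activated_fraction(total, active)
    print(f"{name}: {share:.1%} of parameters activated per token")
```

On these figures, roughly 3% of DeepSeek-V4-Pro's parameters and under 5% of DeepSeek-V4-Flash's are active for any given token.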
Architectural upgrades include a hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency, Manifold-Constrained Hyper-Connections (mHC) to enhance residual connections, and the Muon optimizer for faster convergence and training stability.
The models were pre-trained on more than 32 trillion diverse, high-quality tokens and then went through a post-training pipeline. DeepSeek-V4-Pro-Max, the maximum-reasoning-effort mode of DeepSeek-V4-Pro, is reported to set a new state of the art among open models on core tasks.
In million-token context settings, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2, enabling routine support for million-token contexts and making long-horizon tasks and test-time scaling more feasible.
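The "27% of FLOPs" and "10% of KV cache" figures are relative costs, while the headline numbers (73% and 90%) are reductions. A minimal sketch of the conversion, using only the ratios quoted above; the function names are hypothetical and nothing here is measured.

```python
# Relative costs of DeepSeek-V4-Pro versus DeepSeek-V3.2 at one-million-token
# context, as quoted in the article. These are reported figures, not measurements.
RELATIVE_COST = {
    "single-token inference FLOPs": 0.27,
    "KV cache": 0.10,
}


def reduction_from_ratio(ratio: float) -> float:
    """Convert a relative cost (new / old) into a fractional reduction."""
    return 1.0 - ratio


def savings_factor(ratio: float) -> float:
    """Return how many times smaller the new cost is than the old one."""
    return 1.0 / ratio


for metric, ratio in RELATIVE_COST.items():
    print(
        f"{metric}: {reduction_from_ratio(ratio):.0%} lower, "
        f"about {savings_factor(ratio):.1f}x smaller"
    )
```

Read this way, a 10% KV cache footprint means the cache is about ten times smaller, and 27% of the FLOPs means each token costs a little under a quarter of the compute it did before, relative to DeepSeek-V3.2.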
The model checkpoints are available via Hugging Face under the DeepSeek-V4 collection.