DeepSeek V4 Preview: Million-Token Context as an Efficiency Problem
DeepSeek V4 Preview is not a quiet model-card update. On April 24, 2026, DeepSeek announced two open-weight preview models, DeepSeek-V4-Pro and DeepSeek-V4-Flash, with a shared headline: one-million-token context across official DeepSeek services.
That number is easy to misread. The important claim is not merely that the context window is large. Large context windows are not useful if every request turns into a compute and memory event that only a lab can afford. The core claim in the technical report is that V4 changes the cost curve of long-context inference.
Paper overview
DeepSeek V4 Preview: Towards Highly Efficient Million-Token Context Intelligence
DeepSeek-AI
DeepSeek technical report and preview release · 2026
Paper / PDF →
DeepSeek presents V4-Pro and V4-Flash as preview MoE language models with one-million-token context support, hybrid CSA/HCA attention, mHC residual connections, Muon optimization, FP4-aware deployment work, and post-training modes for different reasoning budgets.
The release note gives the product surface: deepseek-v4-pro and deepseek-v4-flash are available through the API, both support 1M context and Thinking / Non-Thinking usage, and the old deepseek-chat and deepseek-reasoner names are scheduled to be fully retired after July 24, 2026, 15:59 UTC. The Hugging Face DeepSeek V4 collection is the open-weight distribution point.
This article treats the release as a preview, not a final verdict. The claims below are grounded in DeepSeek's release note and report unless marked otherwise. The benchmarks are useful, but most are official or internal evaluations, so independent replication still matters.
The Model Family
V4-Pro is the flagship: 1.6T total parameters, 49B active parameters. V4-Flash is the economical model: 284B total parameters, 13B active parameters. Both are MoE models, so active parameter count is a better signal of per-token compute than total parameter count.
This pairing makes the release more interesting than a single frontier checkpoint. Pro is the model DeepSeek positions for the hardest reasoning, agentic coding, knowledge, and long-context work. Flash is meant to preserve much of that behavior at a lower serving cost.
The architecture keeps the DeepSeek lineage: DeepSeekMoE for feed-forward layers and Multi-Token Prediction from the V3 family. The new pieces are the long-context attention design, the mHC residual-stream change, and the training and serving stack around them.
The Real Story: KV Cache Compression
Vanilla attention becomes punishing at extreme context because every new token must attend over an enormous prefix, and the service must keep a large key-value cache around. A 1M-token window therefore has three separate bottlenecks (a back-of-the-envelope estimate follows the list):
- Compute: the cost of reading and scoring prior context for each generated token.
- Memory: the accumulated KV cache that must stay available during generation.
- Serving reuse: the cost of repeated prefilling when many requests share long prefixes.
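To see why the memory bottleneck dominates, here is a back-of-the-envelope estimate. The layer count, head configuration, and precision below are illustrative assumptions, not V4's published configuration; the point is the order of magnitude for an uncompressed cache.

```ts
// Back-of-the-envelope KV-cache size for a dense-attention baseline.
// All shape numbers here are illustrative assumptions, not DeepSeek V4's
// published configuration.
function kvCacheBytes(opts: {
  tokens: number;        // prompt + generated tokens held in cache
  layers: number;        // transformer layers
  kvHeads: number;       // key/value heads (after any KV sharing)
  headDim: number;       // per-head dimension
  bytesPerValue: number; // 2 for FP16/BF16, 1 for FP8, 0.5 for FP4
}): number {
  const { tokens, layers, kvHeads, headDim, bytesPerValue } = opts;
  // Factor of 2 covers both the key and the value tensors.
  return 2 * tokens * layers * kvHeads * headDim * bytesPerValue;
}

// Hypothetical config: 64 layers, 8 KV heads of dim 128, BF16 cache.
const gib = kvCacheBytes({
  tokens: 1_000_000,
  layers: 64,
  kvHeads: 8,
  headDim: 128,
  bytesPerValue: 2,
}) / 1024 ** 3;

console.log(`~${gib.toFixed(1)} GiB of KV cache per 1M-token request`);
// ≈ 244 GiB before any compression — which is why the interesting claim
// is the cost curve, not the window size.
```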
DeepSeek's answer is hybrid attention. The report describes Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) as interleaved attention mechanisms. CSA compresses KV entries and selects sparse compressed blocks for attention. HCA compresses more aggressively. Both retain a sliding-window branch for local detail.
That design explains why the release note says 1M context is now the default across official DeepSeek services. The claim is not "we can technically fit a million tokens once." It is "the architecture and serving system make a million tokens routine enough to expose as a standard product capability."
CSA, HCA, and the Attention Trade
CSA and HCA are not just names for sparsity. They encode a specific trade: represent long history in compressed form, recover enough targeted access through sparse selection, and keep recent tokens uncompressed through a sliding window.
The practical intuition:
- CSA is the higher-fidelity compressed path. It compresses groups of KV entries, then uses an indexer to select relevant compressed entries.
- HCA is the stronger compression path. It reduces memory and compute further where the model can tolerate coarser historical access.
- Sliding-window attention preserves local detail, which matters because compression is most dangerous near the current generation point.
This is the right kind of compromise for agentic and research workflows. A coding agent with a huge transcript does not need full-resolution attention over every token from the first minute of the session. It often needs precise recent context plus recoverable access to old decisions, file summaries, logs, and plans.
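To make that trade concrete, here is a minimal selection sketch. It follows the CSA description in spirit only: the block structure, window size, top-k count, and indexer scoring below are assumptions for illustration, not DeepSeek's implementation.

```ts
// Conceptual sketch of hybrid attention selection: recent tokens get full
// (sliding-window) attention, older history is grouped into compressed
// blocks, and only the top-scoring blocks are attended.
interface CompressedBlock {
  startToken: number;   // first original token covered by the block
  endToken: number;     // last original token covered by the block
  summaryKey: number[]; // compressed key used by the indexer
}

function selectAttentionTargets(
  queryKey: number[],        // indexer query for the current token
  position: number,          // current generation position
  blocks: CompressedBlock[], // compressed history
  windowSize = 4096,         // assumed sliding-window length
  topKBlocks = 16,           // assumed number of compressed blocks to keep
) {
  // 1) Local branch: raw, uncompressed attention over recent tokens.
  const windowStart = Math.max(0, position - windowSize);

  // 2) Compressed branch: score each pre-window block and keep the top-k.
  const dot = (a: number[], b: number[]) =>
    a.reduce((sum, x, i) => sum + x * b[i], 0);

  const selectedBlocks = blocks
    .filter((b) => b.endToken < windowStart)
    .map((b) => ({ block: b, score: dot(queryKey, b.summaryKey) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topKBlocks)
    .map((s) => s.block);

  return { windowStart, windowEnd: position, selectedBlocks };
}
```

In the real system the selected blocks would feed a separate attention branch whose outputs are merged with the sliding-window branch; the sketch only shows the selection step.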
The report also discusses on-disk KV cache storage. For shared-prefix requests, compressed CSA/HCA KV entries can be stored and reused, while the larger sliding-window KV entries require separate strategies. That matters in production because long-context systems are usually bottlenecked by repeated prefill, not only by the final generation pass.
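The serving-side idea can be sketched as a cache keyed by the shared prefix. This is purely illustrative: the report describes on-disk reuse of compressed CSA/HCA entries, not this exact interface, and the hashing scheme and blob format below are assumptions.

```ts
// Sketch of shared-prefix reuse: requests that share a long system prompt or
// document look up previously computed compressed KV instead of re-prefilling.
import { createHash } from "node:crypto";

const prefixStore = new Map<string, Uint8Array>(); // hash -> compressed KV blob

function prefixKey(promptPrefix: string): string {
  return createHash("sha256").update(promptPrefix).digest("hex");
}

function loadOrPrefill(
  promptPrefix: string,
  prefill: (text: string) => Uint8Array, // the expensive prefill pass
): Uint8Array {
  const key = prefixKey(promptPrefix);
  const cached = prefixStore.get(key);
  if (cached) return cached;        // hit: no prefill compute at all
  const kv = prefill(promptPrefix); // miss: pay the prefill once
  prefixStore.set(key, kv);
  return kv;
}
```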
mHC and Muon: Not Just Attention
The report frames V4 as a combination of architecture, optimization, and infrastructure. Two upgrades stand out beyond attention.
First, mHC, or Manifold-Constrained Hyper-Connections, upgrades conventional residual connections. Deep residual stacks are not only a capacity problem; they are a signal-routing problem. mHC is DeepSeek's attempt to make the residual stream more stable and expressive as the model gets deeper.
Second, DeepSeek uses the Muon optimizer for most parameters, while retaining AdamW for selected modules such as embeddings, output heads, mHC modules, and RMSNorm weights. The stated goal is faster convergence and stability at V4 scale.
These details matter because they make V4 less like "V3.2 with a larger context window" and more like a systems-level redesign around long reasoning trajectories.
Training Scale and Post-Training Modes
DeepSeek reports pre-training budgets of roughly 32-33T tokens per model: Flash on 32T tokens and Pro on 33T tokens. The training sequence length is extended in stages through 16K, 64K, and 1M.
After pre-training, both models are post-trained into multiple reasoning modes.
The mode design is practical. Non-Think is for fast direct answers. High is the normal deliberate mode. Max is the capability-seeking mode, where DeepSeek changes prompts and training incentives to let the model spend a larger reasoning budget.
That also creates a measurement trap: comparing "V4" without specifying the mode is not precise. A production application using Flash Non-Think and a benchmark using Pro Max are effectively using different points on the same model family curve.
Benchmarks: Useful, but Not the Final Word
DeepSeek's report presents V4-Pro-Max as the strongest open model in several categories and competitive with leading closed models in selected reasoning, coding, long-context, and agentic evaluations. The most important long-context numbers are MRCR 1M and CorpusQA 1M, where Pro Max reports 83.5 and 62.0 respectively, ahead of Flash Max.
The grouped chart above uses DeepSeek report Table 6, not the YouTube screenshot, because those values are traceable to the technical report. It compares DeepSeek-V4-Pro-Max with Claude Opus 4.6 Max, GPT-5.4 xHigh, and Gemini-3.1-Pro High on benchmarks where the report lists all four values.
For agentic work, the gap between Pro and Flash is clearer. On Terminal Bench 2.0, Pro Max is reported at 67.9 while Flash Max is 56.9. On SWE Verified, the gap is smaller: 80.6 versus 79.0.
My read: Flash looks unusually strong for its active parameter budget, but Pro remains the safer default for tasks where failures are expensive, multi-step, or hard to detect.
What Changes for Developers
The API migration surface is intentionally small. DeepSeek says to keep the same base_url and update the model name:
```ts
const model = "deepseek-v4-pro"; // or "deepseek-v4-flash"
```

The compatibility claim is also broad: the release note says the API supports the OpenAI Chat Completions and Anthropic APIs. That makes V4 easy to test in existing toolchains, especially coding agents and retrieval-heavy systems.
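As a concrete starting point, here is a minimal sketch using the official OpenAI Node SDK against DeepSeek's OpenAI-compatible endpoint. The base URL and environment-variable name follow existing DeepSeek conventions and should be confirmed against the current docs before relying on them.

```ts
// Minimal smoke test through the OpenAI-compatible endpoint (run as an ES module).
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.deepseek.com",  // keep the existing base_url
  apiKey: process.env.DEEPSEEK_API_KEY, // assumed env-var name
});

const completion = await client.chat.completions.create({
  model: "deepseek-v4-pro", // or "deepseek-v4-flash"
  messages: [
    { role: "system", content: "You are a concise assistant." },
    { role: "user", content: "Summarize the key decisions in this session." },
  ],
});

console.log(completion.choices[0].message.content);
```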
The retirement note is more urgent. deepseek-chat and deepseek-reasoner are currently routed to V4-Flash non-thinking/thinking, but DeepSeek says they will become inaccessible after July 24, 2026, 15:59 UTC. If those names are hard-coded in an application, they should be migrated before that date.
Limitations
The release is still a preview. That should affect how teams evaluate it.
- The strongest claims are official DeepSeek claims, not independent third-party replications.
- Many benchmark results are from DeepSeek's internal evaluation framework.
- The current public story is text-first; this is not a multimodal release.
- One-million-token context does not remove retrieval, summarization, or memory design. It changes the trade space.
- Cost and latency should be measured on real prompts, because reasoning mode, output length, and cache reuse can dominate headline model pricing.
There is also an evaluation question around long context itself. Passing a 1M-token retrieval benchmark is not the same as reliably reasoning over a million tokens of messy documents, logs, code, and tool traces. V4 makes that workload more plausible. It does not make it solved.
Practical Assessment
DeepSeek V4 Preview is best understood as an efficiency release disguised as a frontier-model release. The parameter counts are large, but the architecture story is about making long contexts economically usable.
Use V4-Pro when the task is hard to verify, agentic, reasoning-heavy, or genuinely long-context: large codebase sessions, multi-document analysis, complex planning, and high-value automation.
Use V4-Flash when cost and latency dominate and the task can tolerate a smaller model: production assistants, long-context classification, structured extraction, routine coding help, and workflows where you can add validators around the model.
The preview status matters, but the direction is clear. If 2024 and 2025 were about making context windows larger, V4 is a bet that 2026 is about making those windows cheap enough to use every day.
Sources: DeepSeek release note, DeepSeek V4 technical report, DeepSeek V4 Hugging Face collection, and the supporting YouTube discussion.