DeepSeek-V4: a million-token context that agents can actually use

Hugging Face Blog · huggingface.co · 2026/04/24 09:00

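AI 3-line summary

DeepSeek-V4 has been released as an open model that can handle long contexts of up to one million tokens.
Beyond simply extending the window, its design reportedly targets retrieval and reasoning performance that agent workloads can put to practical use.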

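DeepSeek, the China-based developer of open-weight large language models, has announced DeepSeek-V4, its latest generation. Its defining feature is a context window in the million-token class, combined with an emphasis on long-context processing that is actually useful for agent workloads.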

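Many models have advertised very long contexts, but two problems have been raised repeatedly: the "lost in the middle" phenomenon, in which information in the middle of the context is dropped, and inference cost and latency that climb steeply as token counts grow. DeepSeek-V4 is described as reworking its attention mechanism and cache management to raise retrieval and reference accuracy within long inputs, aiming for behavior that holds up in agent processing involving tool calls.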

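As background, long context is an area of growing demand for code completion, whole-repository analysis, research agents that work across multiple documents, and conversational agents that need long-term memory. Google's Gemini 1.5 Pro brought one-to-two-million-token contexts into practical use, and Anthropic's Claude and OpenAI's GPT models have also extended their windows in stages, but open-weight options at this level have remained limited. Research is also advancing on long-context versions of the Qwen series and on hybrid methods that incorporate state-space models such as Mamba. Against that backdrop, DeepSeek-V4 is seen as an approach that continues the MoE line while extending long-context performance.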


🏠 Local LLM / Open Models · Key points of this article

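In addition, the release on Hugging Face matters for developers who want to run local inference or fine-tune the model. Inference stacks such as vLLM and SGLang are gaining support for MoE and long-context KV cache optimization, so the barrier to operating a million-token-class model in one's own environment is coming down. In real deployments, however, memory consumption and speed remain major constraints, and combining the model with quantization or context compression may be the practical choice.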

DeepSeek, the Chinese lab behind a fast-growing family of open-weight large language models, has unveiled DeepSeek-V4, its newest flagship release. The headline feature is a context window pushing into the million-token range, paired with a stated focus on making that long context actually useful for agentic workloads rather than serving as a benchmark trophy.

Long-context claims have become almost routine in the LLM space, but practitioners have repeatedly run into the same failure modes. Models tend to drop information buried in the middle of a prompt — the well-documented "lost in the middle" problem — and inference cost and latency balloon as token counts climb. According to DeepSeek's accompanying notes, V4 reworks the attention mechanism and key-value cache handling to improve retrieval and reference accuracy deep inside long inputs, with the explicit goal of supporting multi-step agent loops that involve tool calls, intermediate reasoning traces, and accumulated state.

The practical motivation is clear. Demand for genuinely usable long context has grown across several workloads: whole-repository code understanding and refactoring, research agents that traverse many documents, customer-facing assistants that need durable memory across sessions, and pipelines that ingest large logs or transcripts in a single pass. In each of these, the bottleneck has rarely been raw window size — it has been whether the model can reliably pick out the right span when the window is mostly full.

On the architecture side, DeepSeek-V4 appears to continue the Mixture-of-Experts direction the lab has pursued since V2 and V3, with sparse expert activation keeping per-token compute manageable even as parameter counts grow. Combined with the long-context optimizations, this positions V4 as an attempt to push the MoE recipe further into territory previously dominated by closed frontier models. Specific numbers around active parameters, expert routing, and benchmark scores will need independent verification as the community runs its own evaluations.
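To make "sparse expert activation" concrete, here is a minimal sketch of generic top-k expert routing for a single token. It is not DeepSeek's routing scheme, whose details the release notes summarized here do not give. The function names, the toy scaling experts, and the gate logits are all hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[Sequence[float]], List[float]]


def top_k_route(gate_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and return (index, weight) pairs.

    Weights are a softmax over the selected logits only, so they sum to 1.
    This is a generic illustration, not any specific model's router.
    """
    if not 0 < k <= len(gate_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Highest logit first; ties are broken by the lower expert index.
    order = sorted(range(len(gate_logits)), key=lambda i: (-gate_logits[i], i))
    chosen = order[:k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(gate_logits[i] for i in chosen)
    exps = [math.exp(gate_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_forward(
    token: Sequence[float],
    gate_logits: Sequence[float],
    experts: Sequence[Expert],
    k: int,
) -> List[float]:
    """Run only the k routed experts and mix their outputs by gate weight."""
    output = [0.0] * len(token)
    for index, weight in top_k_route(gate_logits, k):
        expert_out = experts[index](token)  # the other experts are never called
        output = [acc + weight * value for acc, value in zip(output, expert_out)]
    return output


def make_scaling_expert(scale: float) -> Expert:
    """Hypothetical toy expert: multiplies every feature by a constant."""
    return lambda x: [scale * value for value in x]


# Four toy experts, of which only two run for this token.
toy_experts = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
print(moe_forward([1.0, 2.0], [0.0, 2.0, 2.0, -1.0], toy_experts, k=2))  # [2.5, 5.0]
```

The point of the sketch is that per-token compute scales with k, not with the total number of experts, which is why parameter counts can grow without a proportional rise in inference cost.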

The competitive backdrop matters here. Google's Gemini 1.5 Pro put one-to-two-million-token contexts into production use, and both Anthropic's Claude family and OpenAI's GPT line have steadily extended their windows. Open-weight options at this scale, however, have been comparatively scarce. Alibaba's Qwen series has shipped long-context variants, and hybrid approaches incorporating state-space models such as Mamba continue to attract research interest as a way to escape the quadratic cost of standard attention. DeepSeek-V4 slots into that landscape as an MoE-first answer that prioritizes long-range fidelity over architectural novelty.

The choice to publish on Hugging Face carries real weight for developers who want to run or fine-tune the model themselves. Inference stacks such as vLLM and SGLang have been actively adding support for MoE routing and long-context KV cache optimizations, including paged attention, prefix caching, and disaggregated prefill. That trajectory has steadily lowered the barrier to operating million-token-class models outside hyperscaler APIs, though the engineering remains nontrivial.

In practice, memory footprint and throughput will likely be the limiting factors for most teams. Holding a million-token KV cache in GPU memory is expensive even with grouped-query or multi-head latent attention schemes, and prefill latency on prompts of that length can be measured in tens of seconds without careful batching. Quantization down to 8- or 4-bit weights, along with context-compression techniques such as summarization, retrieval pre-filtering, or learned token pruning, may turn out to be necessary companions rather than optional tweaks for cost-sensitive deployments.
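A back-of-the-envelope calculation shows why the KV cache dominates at this scale. The sketch below uses the standard sizing formula for multi-head or grouped-query attention: two tensors (keys and values) per layer, each of size KV heads × head dimension × sequence length. The layer count, head count, and head dimension are hypothetical placeholders, not DeepSeek-V4's published configuration.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # 2 bytes = 16-bit cache entries
    batch_size: int = 1,
) -> int:
    """Estimate KV cache size for standard or grouped-query attention."""
    # Factor of 2: one cached tensor for keys and one for values.
    return (
        2 * num_layers * num_kv_heads * head_dim
        * seq_len * bytes_per_value * batch_size
    )


def to_gib(num_bytes: int) -> float:
    """Convert bytes to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


# Hypothetical model shape, chosen only to illustrate the arithmetic.
ASSUMED_LAYERS = 60
ASSUMED_KV_HEADS = 8      # grouped-query attention with 8 KV heads
ASSUMED_HEAD_DIM = 128
CONTEXT_TOKENS = 1_000_000

fp16_cache = kv_cache_bytes(
    ASSUMED_LAYERS, ASSUMED_KV_HEADS, ASSUMED_HEAD_DIM, CONTEXT_TOKENS
)
int8_cache = kv_cache_bytes(
    ASSUMED_LAYERS, ASSUMED_KV_HEADS, ASSUMED_HEAD_DIM, CONTEXT_TOKENS,
    bytes_per_value=1,
)
print(f"16-bit cache: {to_gib(fp16_cache):.1f} GiB")
print(f" 8-bit cache: {to_gib(int8_cache):.1f} GiB")
```

Under these assumed dimensions a single million-token sequence needs about 246 GB of cache at 16-bit precision, before counting model weights. Note that `bytes_per_value` here models the precision of the cache entries, which is separate from weight quantization. Latent-attention schemes reduce the figure by caching a narrower per-token representation than full per-head keys and values.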

There are also open questions the release itself does not fully resolve. How V4 behaves on adversarial long-context evaluations such as needle-in-a-haystack variants with distractors, RULER, or agent-oriented benchmarks like SWE-bench and τ-bench will determine whether the agent-readiness pitch holds up. Licensing terms, training data disclosures, and tool-use fine-tunes will similarly shape adoption in regulated environments. For now, DeepSeek-V4 reads as a credible attempt to close the gap between nominal context length and the long-horizon reasoning that real agent systems demand — and, importantly, to do so with weights that anyone can download.
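For readers who want to probe the retrieval claim themselves, the following is a minimal sketch of a needle-in-a-haystack test with distractors. It is not the RULER harness or any official DeepSeek evaluation. The `ask_model` callable is a hypothetical stand-in for whatever inference client is in use, and the needle, distractor, and filler sentences are invented.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple


def build_haystack(
    needle: str,
    distractors: Sequence[str],
    filler: Sequence[str],
    total_sentences: int,
    depth: float,
    seed: int = 0,
) -> Tuple[List[str], int]:
    """Return (sentences, needle_index) with the needle at a relative depth.

    depth=0.0 puts the needle first, depth=1.0 puts it last. Distractors
    (near-miss facts) are scattered at seeded random positions.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0.0, 1.0]")
    if not filler:
        raise ValueError("filler must contain at least one sentence")
    if total_sentences < len(distractors) + 1:
        raise ValueError("total_sentences is too small for needle and distractors")
    rng = random.Random(seed)
    body = [rng.choice(filler) for _ in range(total_sentences - 1 - len(distractors))]
    for distractor in distractors:
        body.insert(rng.randrange(len(body) + 1), distractor)
    needle_index = round(depth * len(body))
    body.insert(needle_index, needle)
    return body, needle_index


def depth_sweep(
    ask_model: Callable[[str], str],  # hypothetical: prompt in, answer text out
    question: str,
    expected: str,
    depths: Sequence[float],
    needle: str,
    distractors: Sequence[str],
    filler: Sequence[str],
    total_sentences: int,
    seed: int = 0,
) -> Dict[float, bool]:
    """Ask the same question with the needle at each depth; record hits."""
    results = {}
    for depth in depths:
        sentences, _ = build_haystack(
            needle, distractors, filler, total_sentences, depth, seed
        )
        prompt = " ".join(sentences) + "\n\nQuestion: " + question
        answer = ask_model(prompt)
        # Crude scoring: case-insensitive substring match on the expected answer.
        results[depth] = expected.lower() in answer.lower()
    return results


# Invented example data.
NEEDLE = "The vault code for the Lisbon office is 7421."
DISTRACTORS = [
    "The vault code for the Madrid office is 1138.",
    "The vault code for the Porto office is 9056.",
]
FILLER = [
    "The quarterly report was filed on time.",
    "Maintenance is scheduled for the weekend.",
    "The cafeteria menu changes every Monday.",
]
```

A sweep such as `depth_sweep(client, "What is the vault code for the Lisbon office?", "7421", [0.0, 0.25, 0.5, 0.75, 1.0], NEEDLE, DISTRACTORS, FILLER, total_sentences=50_000)` would show whether accuracy dips when the needle sits mid-context, which is the "lost in the middle" pattern. Substring scoring is deliberately simple and would need tightening for answers that can appear by coincidence.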