Cursor 2.2 Adds Debug Mode and Multi-Agent Judging
Debug Mode, Plan Mode Improvements, Multi-Agent Judging, and Pinned Chats

Cursor Changelog · cursor.com · 2025/12/10 09:00

AI 3-line summary

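Cursor's latest update, 2.2, adds Debug Mode, in which the AI autonomously identifies bugs; an enhanced Plan Mode; Multi-Agent Judging, which compares and evaluates the outputs of multiple agents; and Pinned Chats, among other features.
The set of improvements aims to make agent-driven development more practical.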

English summary

Debug Mode helps you reproduce and fix the most tricky bugs.

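Cursor, the AI coding environment, has released version 2.2, introducing several features that push further into agent-driven development. The centerpieces are Debug Mode, in which the AI actively investigates the codebase to track down the cause of a defect, and Multi-Agent Judging, which compares the results of multiple agents.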

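Debug Mode is said to specialize in a workflow in which the user describes a symptom and the agent forms hypotheses, then repeatedly inserts logging and explores the code until it identifies the bug. Compared with the conventional chat format, the aim appears to be to make it easier to hand the iterative work of tracing a cause over to the AI.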

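The Plan Mode improvements refine the flow in which the agent drafts an approach before changing code and proceeds to implementation only after the user approves. It is positioned as a guardrail that keeps the agent from running out of control on large refactors and complex tasks.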

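Multi-Agent Judging has multiple models or agents carry out the same task and has a separate AI compare and score the results. It is notable as a case of building the LLM-as-a-judge technique, which Anthropic, OpenAI, and others have studied, into a shipping product. Pinned Chats is a UX improvement that keeps important conversations pinned in view, making it harder to lose context even on long-running projects.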

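As background, competition among agentic coding tools such as GitHub Copilot, Windsurf, Cline, and Claude Code is intensifying, and the emphasis is shifting from one-off code completion toward autonomous task completion. Cursor may be trying to differentiate itself at the agent orchestration layer while building on its foundation as a VS Code fork.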

Cursor has shipped version 2.2 of its AI-native code editor, bundling several upgrades aimed at making agent-driven development more practical for everyday engineering work. The headline additions are Debug Mode, refinements to Plan Mode, Multi-Agent Judging, and Pinned Chats.

Debug Mode reframes troubleshooting as an autonomous loop: the user describes a symptom, and the agent forms hypotheses, inserts logging, traces through the codebase, and iterates until it isolates the root cause. This shifts the tedious detective work of debugging — often the most time-consuming part of software maintenance — onto the model, rather than expecting developers to drive each step through chat prompts.
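
The loop described here can be illustrated with a toy sketch. This is not Cursor's implementation, which the changelog does not detail; the `Hypothesis` class, the `probe` callback, and `isolate_root_cause` are hypothetical names introduced only for illustration, and each probe stands in for "add logging, rerun, and inspect the evidence".

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Hypothesis:
    description: str
    # Hypothetical probe: stands in for instrumenting the code, rerunning it,
    # and returning True when the evidence supports this hypothesis.
    probe: Callable[[], bool]


def isolate_root_cause(hypotheses: list[Hypothesis], max_rounds: int = 10) -> Optional[str]:
    """Test hypotheses in order until a probe confirms one, or give up."""
    for round_no, hyp in enumerate(hypotheses):
        if round_no >= max_rounds:
            break
        if hyp.probe():
            return hyp.description
    return None


# Toy bug: average() fails on an empty list.
def average(values):
    return sum(values) / len(values)


def crashes_on(values) -> bool:
    """Reproduce the symptom for a given input."""
    try:
        average(values)
    except ZeroDivisionError:
        return True
    return False


hypotheses = [
    Hypothesis("negative numbers break the sum", lambda: crashes_on([-1, -2])),
    Hypothesis("empty input divides by zero", lambda: crashes_on([])),
]

root_cause = isolate_root_cause(hypotheses)
print(root_cause)
```

In a real agent the hypotheses would be generated and refined by the model between rounds; the fixed list here only keeps the sketch deterministic.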

Plan Mode, introduced in earlier releases, has been tightened so that the agent more reliably drafts an explicit plan before touching code, waits for user approval, and then executes against it. For larger refactors or multi-file changes, this acts as a guardrail against the common failure mode where agents charge ahead and produce sprawling, hard-to-review diffs.
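
As a rough illustration of the plan-then-approve flow, the sketch below gates execution behind an approval callback. It assumes nothing about Cursor's internals: `run_with_plan`, `approve`, and `execute_step` are hypothetical names, and the plan is just a list of strings.

```python
from typing import Callable


def run_with_plan(
    plan: list[str],
    approve: Callable[[list[str]], bool],
    execute_step: Callable[[str], None],
) -> bool:
    """Execute the plan only if the approval callback accepts it."""
    if not approve(plan):
        return False  # nothing is executed without approval
    for step in plan:
        execute_step(step)
    return True


plan = ["rename helper", "update call sites", "run tests"]
executed: list[str] = []

# A rejected plan leaves the (simulated) codebase untouched.
rejected = run_with_plan(plan, approve=lambda p: False, execute_step=executed.append)
after_reject = list(executed)

# An approved plan is executed step by step, in order.
accepted = run_with_plan(plan, approve=lambda p: len(p) <= 5, execute_step=executed.append)
```

The point of the design is that the approval check sits in front of every side effect, so a reviewer sees the whole plan before any diff exists.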

Multi-Agent Judging is arguably the most architecturally interesting addition. It lets users run the same task across multiple agents or models in parallel, then uses another model to compare and score the outputs. The pattern echoes the LLM-as-a-judge methodology widely studied by Anthropic, OpenAI, and academic groups, and it is increasingly being productised in tools like LangSmith and various eval frameworks. Surfacing it inside the editor itself could make diverge-then-converge workflows more accessible to mainstream developers.
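
The fan-out-then-judge pattern can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Cursor's mechanism: the agents are stub functions returning fixed strings, and `toy_judge` is a deterministic stand-in that runs a few checks where a real system would call a judging model. All names (`run_and_judge`, `toy_judge`, `agent_a`, `agent_b`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Agent = Callable[[str], str]
Judge = Callable[[str, str], float]


def run_and_judge(task: str, agents: dict[str, Agent], judge: Judge) -> tuple[str, dict[str, float]]:
    """Run every agent on the same task in parallel, score each output, pick the best."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(agent, task) for name, agent in agents.items()}
        outputs = {name: fut.result() for name, fut in futures.items()}
    scores = {name: judge(task, out) for name, out in outputs.items()}
    winner = max(scores, key=scores.get)
    return winner, scores


# Stub agents: each returns a candidate implementation as source text.
agents = {
    "agent_a": lambda task: "def add(a, b): return a - b",  # subtly wrong
    "agent_b": lambda task: "def add(a, b): return a + b",
}


def toy_judge(task: str, output: str) -> float:
    """Stand-in for a judging model: fraction of checks the candidate passes."""
    namespace: dict = {}
    exec(output, namespace)  # acceptable only for these trusted toy strings
    checks = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
    passed = sum(namespace["add"](*args) == expected for args, expected in checks)
    return passed / len(checks)


winner, scores = run_and_judge("write add(a, b)", agents, toy_judge)
print(winner, scores)
```

Note that the wrong candidate still passes one of the three checks, which hints at why the quality of the judge matters: a weak judge can score a subtly incorrect output close to a correct one.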

Pinned Chats is a smaller but welcome UX touch: important conversations can be kept visible across sessions, helping preserve context on long-running features or investigations where chat history would otherwise scroll away.

The broader context is a rapidly intensifying race among agentic coding tools. GitHub Copilot has been pushing its own agent mode, while Windsurf, Cline, Aider, and Anthropic's Claude Code each stake out different points on the autonomy-versus-control spectrum. Cursor, built on a VS Code fork, appears to be betting that its differentiation will come less from the underlying models — which it largely shares with competitors via API — and more from the orchestration layer: how plans are formed, how multiple agents collaborate or compete, and how their work is reviewed. Whether features like Multi-Agent Judging meaningfully improve output quality in production codebases, versus simply multiplying token spend, will likely depend on how well the judging model can detect subtle correctness issues, an area where current evaluations remain mixed.