IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST

Hugging Face Blog · huggingface.co · 2026/02/19 01:15

AI 3-line summary

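IBM Research and UC Berkeley have announced a diagnostic framework for analyzing AI agent failures on enterprise IT tasks.
By evaluating real-environment tasks with IT-Bench and classifying failure modes with MAST, it systematically identifies the causes of low success rates in the SRE, CISO, and FinOps domains.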

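A research team from IBM Research and UC Berkeley has published joint work for systematically diagnosing why enterprise AI agents fail in production environments. While expectations for generative AI agents are rising, success rates on real-world IT work remain low, and the study attempts to explain the causes structurally.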

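At its core is a benchmark called IT-Bench, which targets three practical domains: site reliability engineering (SRE), chief information security officer (CISO) work, and FinOps (cloud cost optimization). It reproduces tasks such as Kubernetes cluster incident response, compliance assessment, and cost analysis in environments that mimic real infrastructure, and verifies the actions an agent takes end to end. Reported success rates vary widely by domain, but all are said to remain far below the level of human experts.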

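To understand the breakdown of failures, the work uses MAST (Multi-Agent System Failure Taxonomy), proposed by the UC Berkeley side. MAST is a framework that sorts failures in multi-agent collaboration into several categories, including specification misunderstanding, inter-agent miscommunication, and missing verification, and it makes visible the breakdowns in intermediate processes that a simple correct/incorrect verdict cannot capture. What is new is that analyzing IT-Bench traces with MAST makes it possible to separate whether a failure stems from insufficient model capability, a tool-call design problem, or a defect in the orchestration layer.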

🏠 Local LLM / Open Models · Key points of this article

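As background, existing agent evaluation axes such as the Computer Use efforts from Anthropic and OpenAI, SWE-bench, and τ-bench have tended to focus on the success or failure of a single task. IBM's effort appears complementary in that it evaluates the long-horizon, multi-step workflows specific to enterprises and the real-world architectures in which multiple agents cooperate. It is also likely a stepping stone for IBM's own agent product strategy, including watsonx.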

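IT-Bench and MAST are openly available, so the research community can use them to diagnose weaknesses in their own models and agent frameworks (LangGraph, CrewAI, AutoGen, and others). With agent implementations proliferating, establishing a shared axis for classifying failures can be called an important step toward running reproducible improvement cycles.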

IBM Research and UC Berkeley have released a joint study aimed at systematically diagnosing why enterprise AI agents tend to fall short in production-like environments. As expectations for agentic AI continue to rise across the industry, real-world success rates on substantive IT tasks remain disappointingly low, and the work tries to explain the gap in structural terms rather than anecdotally.

The centerpiece is IT-Bench, a benchmark suite covering three concrete enterprise domains: Site Reliability Engineering (SRE), Chief Information Security Officer (CISO) workflows, and FinOps cloud cost optimization. Tasks include diagnosing failures in live Kubernetes clusters, evaluating compliance posture, and reasoning about cost anomalies. Crucially, IT-Bench evaluates agents end-to-end against realistic infrastructure rather than toy sandboxes. Reported success rates vary by domain but consistently sit well below what human experts would achieve, underscoring how far current agent stacks are from autonomous IT operations.
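To make the idea of per-domain success rates concrete, here is a minimal Python sketch of how end-to-end run outcomes might be aggregated by domain. The `RunResult` record, its field names, and the sample runs are all hypothetical illustrations; they are not IT-Bench's actual schema or reported results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One end-to-end agent run (hypothetical record, not IT-Bench's schema)."""
    domain: str    # e.g. "SRE", "CISO", "FinOps"
    task_id: str   # hypothetical task identifier
    success: bool  # whether the run met the task's success criterion


def success_rate_by_domain(results):
    """Return {domain: fraction of runs in that domain that succeeded}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        totals[r.domain] += 1
        if r.success:
            passes[r.domain] += 1
    return {domain: passes[domain] / totals[domain] for domain in totals}


# Made-up placeholder runs for illustration only; not reported numbers.
example_runs = [
    RunResult("SRE", "sre-1", True),
    RunResult("SRE", "sre-2", False),
    RunResult("CISO", "ciso-1", False),
    RunResult("FinOps", "finops-1", True),
]
rates = success_rate_by_domain(example_runs)
```

A single pass/fail flag per run is exactly the coarse signal that the failure taxonomy described next is meant to refine.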

To decompose those failures, the researchers leverage MAST (Multi-Agent System Failure Taxonomy), developed at UC Berkeley. MAST categorizes breakdowns in multi-agent collaboration into buckets such as specification misunderstanding, inter-agent miscommunication, missing verification, and premature termination. Rather than collapsing every run into a binary pass/fail, applying MAST to IT-Bench traces makes it possible to distinguish whether an agent failed because the underlying model lacked capability, because tool calls were poorly designed, or because the orchestration layer dropped context between steps.
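As a rough illustration of moving beyond binary pass/fail, the sketch below tallies how often each failure category appears across annotated traces. The four enum members mirror the buckets named above; the trace IDs, the annotation format (a set of labels per trace), and the `failure_profile` helper are assumptions made for this example, not part of MAST's published tooling.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    """Buckets named in the article; the full taxonomy may differ."""
    SPEC_MISUNDERSTANDING = "specification misunderstanding"
    INTER_AGENT_MISCOMMUNICATION = "inter-agent miscommunication"
    MISSING_VERIFICATION = "missing verification"
    PREMATURE_TERMINATION = "premature termination"


def failure_profile(annotated_traces):
    """Map each failure mode to the share of failed traces exhibiting it.

    annotated_traces: dict of trace_id -> set of FailureMode labels
    (hypothetical format). An empty set means the trace had no
    annotated failure and is excluded from the denominator.
    """
    failed = [modes for modes in annotated_traces.values() if modes]
    if not failed:
        return {}
    counts = Counter(mode for modes in failed for mode in modes)
    return {mode: counts[mode] / len(failed) for mode in FailureMode}


# Made-up annotations for illustration only.
traces = {
    "t1": {FailureMode.SPEC_MISUNDERSTANDING, FailureMode.MISSING_VERIFICATION},
    "t2": {FailureMode.MISSING_VERIFICATION},
    "t3": set(),  # run with no annotated failure
    "t4": {FailureMode.PREMATURE_TERMINATION},
}
profile = failure_profile(traces)
```

Because one trace can carry several labels, the shares need not sum to one; the profile shows which breakdowns dominate rather than assigning each run a single cause.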

This combination is notable because most existing agent benchmarks, including SWE-bench, τ-bench, and the various Computer Use evaluations from Anthropic and OpenAI, tend to focus on single-task outcomes or relatively narrow domains. Enterprise IT, by contrast, is characterized by long-horizon, multi-step workflows that often span several specialized agents and external systems. By pairing a domain-grounded benchmark with a structured failure taxonomy, IBM and Berkeley provide a complementary lens that may help frameworks such as LangGraph, CrewAI, or AutoGen identify where their abstractions actually break.

There is also a clear strategic dimension. IBM is heavily invested in agentic offerings around watsonx and its consulting practice, and a credible diagnostic methodology gives the company both a research narrative and a tool to differentiate enterprise-grade deployments from generic chatbot wrappers. It is reasonable to assume, though not explicitly stated, that internal product teams will use IT-Bench and MAST to harden their own agents before customer rollout.

For the broader community, the practical value lies in shared vocabulary. Agent development today suffers from inconsistent reporting: every new framework claims improvements, but failure modes are described in idiosyncratic ways. A common taxonomy like MAST, anchored to a reproducible benchmark like IT-Bench, makes it easier to compare interventions, whether those are better planning prompts, stronger tool schemas, or fundamentally different model backbones. If adopted widely, it could push agent research toward the kind of cumulative, error-driven progress that benchmarks like GLUE or MMLU once catalyzed for language models, although the complexity of multi-agent systems means progress is likely to be slower and noisier.
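One way a shared taxonomy could support such comparisons is by diffing failure-mode counts between a baseline agent and a modified one on the same task set. The sketch below assumes counts keyed by category name; the `compare_profiles` helper and the numbers are hypothetical and purely illustrative.

```python
def compare_profiles(baseline, candidate):
    """Return {mode: candidate_count - baseline_count} over all modes seen.

    Both arguments are dicts of failure-mode name -> number of runs showing
    that mode, assumed to come from the same set of tasks. Negative values
    mean the candidate exhibited that failure mode less often.
    """
    modes = sorted(set(baseline) | set(candidate))
    return {m: candidate.get(m, 0) - baseline.get(m, 0) for m in modes}


# Made-up counts for illustration only; not measured results.
baseline_counts = {"missing verification": 12, "premature termination": 5}
candidate_counts = {
    "missing verification": 4,
    "premature termination": 6,
    "inter-agent miscommunication": 1,
}
delta = compare_profiles(baseline_counts, candidate_counts)
```

Reading the deltas per category makes it visible when an intervention fixes one failure mode while introducing another, which a single aggregate success rate would hide.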