DeepMind proposes a cognitive framework for measuring progress toward AGI (Measuring progress toward AGI: A cognitive framework)

Google DeepMind Blog · deepmind.google · 2026/03/18 01:03

AI 3-line summary

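Google DeepMind has proposed a framework, grounded in cognitive science, for systematically evaluating progress toward artificial general intelligence (AGI).
It classifies the many facets of human intelligence into 10 domains and makes the capability gaps of current models visible, with the aim of providing a basis for setting research directions and for safety discussions.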

English summary

We’re introducing a framework to measure progress toward AGI, and launching a Kaggle hackathon to build the relevant evaluations.

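Google DeepMind has proposed a new cognitive framework for measuring progress toward artificial general intelligence (AGI) in a consistent way. It responds to a specific problem: the meaning of the term AGI varies widely among researchers, companies, and policymakers, which makes progress hard to assess.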

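The proposed framework classifies the abilities that make up human intelligence into 10 domains, drawing on findings from cognitive science. It refers to dimensions long studied in psychometrics, such as reasoning, knowledge, memory, perception, and language, and its design appears to be consistent with the foundations of human intelligence testing, such as CHC (Cattell-Horn-Carroll) theory. For each domain, it is meant to allow an evaluation of how far current frontier models have come and where large gaps remain.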

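Using this framework, DeepMind points out that current models are close to human level in language generation and broad knowledge recall, but have marked weaknesses in long-term memory, continual learning, and robust planning. The approach recasts AGI as a capability profile rather than a single benchmark score or a Turing-test-style overall judgment.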


✨ Gemini / Gemma · Key points of this article

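As background, in 2023 a DeepMind team including Shane Legg published the "Levels of AGI" paper, which classifies AGI into stages along two axes: performance level and generality. The new framework is an extension of that work, positioned as adding a finer-grained decomposition of cognitive abilities. It is also in step with a wider industry search for standardized evaluation criteria, including the five-level AGI definition that OpenAI is said to use internally and the ARC-AGI benchmark.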

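From a safety and governance perspective, making the level reached in each capability domain explicit could also give risk assessment and regulatory debate a concrete footing. On the other hand, using the structure of human cognition as the yardstick may undervalue AI's non-human capability profile (for example, parallel processing and very large context handling), so corrections may be needed as the discussion continues.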

Google DeepMind has put forward a cognitive framework intended to bring more rigor and consistency to how progress toward artificial general intelligence (AGI) is measured. The proposal addresses a persistent problem in the field: the term AGI means very different things to different researchers, companies, and policymakers, making it hard to assess where frontier models actually stand.

Rather than relying on a single benchmark score or a Turing-style pass/fail judgment, DeepMind's framework decomposes human intelligence into ten cognitive domains drawn from decades of work in cognitive science and psychometrics. The dimensions cover areas such as reasoning, knowledge, memory, perception, and language, echoing well-established models of human cognition like the Cattell-Horn-Carroll (CHC) theory that underpins many standardized intelligence tests. For each domain, the framework asks how close current systems come to competent human performance and where significant gaps remain.

Applying this lens, DeepMind argues that today's frontier models perform near human level in some areas — notably broad knowledge recall and fluent language generation — while showing pronounced weaknesses in others, including long-term memory, continual learning, and robust planning. The result is less a single AGI score than a capability profile, making it easier to see which research problems are still open.
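To make the contrast between a single score and a capability profile concrete, here is a minimal Python sketch. The domain labels are taken from the strengths and weaknesses named above; the numeric scores, the 0-to-1 scale, the 0.5 threshold, and the function names are all hypothetical placeholders chosen for illustration and are not drawn from DeepMind's framework or any published evaluation.

```python
from statistics import mean

# Hypothetical scores on a 0-to-1 scale, where 1.0 stands for
# "competent human performance". Illustrative placeholders only,
# not measurements of any real model.
HYPOTHETICAL_SCORES: dict[str, float] = {
    "language generation": 0.9,
    "knowledge recall": 0.9,
    "long-term memory": 0.3,
    "continual learning": 0.2,
    "planning": 0.4,
}


def single_score(scores: dict[str, float]) -> float:
    """Collapse the whole profile into one headline number (a plain average)."""
    return mean(scores.values())


def capability_gaps(scores: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Return the domains scoring below the threshold, weakest first."""
    below = [domain for domain, score in scores.items() if score < threshold]
    return sorted(below, key=lambda domain: scores[domain])


if __name__ == "__main__":
    # The average hides which domains are weak; the profile view names them.
    print(f"single score: {single_score(HYPOTHETICAL_SCORES):.2f}")
    print("gaps:", capability_gaps(HYPOTHETICAL_SCORES))
```

With these placeholder numbers the average lands a little above the midpoint, which on its own says nothing about where the model falls short, while the per-domain view lists continual learning, long-term memory, and planning as the open gaps.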

The framework builds on DeepMind's 2023 "Levels of AGI" paper, co-authored by Shane Legg, which proposed levels of AGI defined along two axes: performance and generality. The new framework can be read as a more fine-grained extension, layering a structured cognitive taxonomy onto that earlier classification. It also reflects a broader industry trend: OpenAI is reported to use an internal five-tier definition of AGI progress, while community efforts such as François Chollet's ARC-AGI benchmark push for evaluations that resist memorization and reward genuine generalization.

The framing has implications beyond research planning. Clearer, domain-level descriptions of capability could give regulators, safety teams, and standards bodies a more concrete vocabulary for discussing risks. Capability evaluations are already central to frontier-model safety policies at DeepMind, OpenAI, and Anthropic, and a shared cognitive map could help align those efforts. It may also help defuse some of the marketing-driven ambiguity around AGI claims, by forcing specific assertions about which abilities a system does or does not possess.

There are caveats worth noting. Anchoring AGI evaluation to human cognitive structure is intuitive but not neutral: it may underweight the distinctly non-human strengths of large models, such as massive parallelism, very long context windows, or the ability to operate across many languages and modalities simultaneously. It could also struggle to capture emergent agentic behaviors that do not map cleanly onto any single cognitive domain. DeepMind appears aware of these tensions, and the framework is likely to evolve as empirical evaluations accumulate.

For practitioners, the most immediate value may be methodological. Treating AGI as a multidimensional profile rather than a finish line encourages more honest reporting of model strengths and weaknesses, and makes it harder to overclaim based on a few headline benchmarks. Whether the wider community converges on this particular taxonomy or develops competing variants, the shift toward structured, cognitively grounded evaluation looks increasingly hard to avoid.