SIMA 2: An Agent that Plays, Reasons, and Learns With You in Virtual 3D Worlds
- Google DeepMind has announced SIMA 2, a general-purpose game-playing AI agent built on Gemini.
- Compared with its predecessor, it follows complex instructions and reasons more capably, and can now carry out tasks even in game environments it was never trained on, through self-directed learning.
English summary
- Introducing SIMA 2, a Gemini-powered AI agent that can think, understand, and take actions in interactive environments.
Google DeepMind has released SIMA 2, a generalist agent that plays, reasons, and learns alongside humans in 3D virtual environments. As the successor to the original SIMA, it embeds Gemini as its foundation model, marking a generational step from simple instruction-following toward higher-level goal understanding and autonomous action.
SIMA 2 is trained across a diverse set of 3D worlds, including commercial games and research environments, and acts through mouse-and-keyboard inputs just as a human player would, taking only on-screen video and natural-language instructions as input. The key difference from the first generation is the built-in Gemini reasoning layer, which lets the agent decompose abstract, multi-step instructions and execute them. According to DeepMind, success rates on complex tasks now approach human level, and the agent can also handle instructions given via images or emoji, and even in languages it was not explicitly trained on.
Even more notable is the self-improvement loop: Gemini generates tasks and reward signals in new environments, and SIMA 2 autonomously accumulates experience from them, allowing adaptation to untrained games without human demonstrations. The fact that it also operates in procedurally generated worlds from Genie suggests a link to DeepMind's stated "world model" research agenda.
For context, generalist-agent research has become an intense race toward real-world applications staged in games and GUIs: OpenAI's VPT (Minecraft), NVIDIA's Voyager, Anthropic's Computer Use, and Google's own Project Mariner. SIMA 2 may look game-specific, but it is very likely groundwork for robotics and embodied AI. Transfer to physical robots still faces gaps in perception and control, yet as research into generalist policies that bind vision to language, it is an important milestone. For now, it is available only as a limited research preview, and access is expected to expand gradually alongside safety evaluations.
Google DeepMind has introduced SIMA 2, the next iteration of its generalist agent for 3D virtual worlds. Where the original SIMA focused on following short natural-language instructions across commercial video games, SIMA 2 integrates Gemini as a reasoning backbone, turning the system from an instruction-follower into something closer to a collaborative game-playing partner that can plan, explain its actions, and learn on its own.
Like its predecessor, SIMA 2 perceives only the rendered pixels of a game and acts through the same keyboard-and-mouse interface a human would use. That constraint is deliberate: it forces the agent to generalize across engines and genres rather than relying on privileged APIs. The new version was trained across a broad portfolio of titles and research environments, and DeepMind reports significant gains on long-horizon, multi-step tasks, with performance on complex goals approaching human-level in evaluated games.
The Gemini integration unlocks several qualitative jumps. SIMA 2 can interpret abstract instructions, respond to sketches or emoji prompts, and follow directions in languages it was not explicitly trained on. It can also narrate its reasoning, which is useful both for user trust and for debugging agent behavior. Perhaps more importantly, DeepMind describes a self-improvement loop in which Gemini proposes tasks and reward signals in new environments, letting SIMA 2 bootstrap competence in unfamiliar games without fresh human demonstrations. The team also shows the agent operating inside procedurally generated worlds from Genie, hinting at a tighter coupling between DeepMind's world-model research and its embodied-agent line.
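The self-improvement loop described above can be sketched as a simple task-propose / attempt / score cycle. This is a hedged illustration, not DeepMind's implementation: `propose_task` and `score_trajectory` stand in for calls to a reasoning model like Gemini, and `rollout` stands in for the agent acting in the environment.

```python
import random

def propose_task(environment: str) -> str:
    """Stand-in for an LLM proposing a task suited to a new environment."""
    candidates = [f"collect wood in {environment}",
                  f"find shelter in {environment}"]
    return random.choice(candidates)

def rollout(task: str) -> list[str]:
    """Stand-in for the agent attempting the task; returns an action trace."""
    return ["move", "look", "interact"]

def score_trajectory(task: str, trajectory: list[str]) -> float:
    """Stand-in for an LLM-derived reward signal in [0.0, 1.0]."""
    return 1.0 if "interact" in trajectory else 0.0

def self_improvement_loop(environment: str, iterations: int) -> list[tuple]:
    """Collect (task, trajectory, reward) tuples without human demos."""
    dataset = []
    for _ in range(iterations):
        task = propose_task(environment)
        traj = rollout(task)
        reward = score_trajectory(task, traj)
        if reward > 0.5:               # keep only successful attempts
            dataset.append((task, traj, reward))
    return dataset                     # fed back into the agent's training

data = self_improvement_loop("a new 3D world", iterations=3)
print(len(data))
```

The design choice worth noting is that the reward signal itself is model-generated, which is what removes the need for fresh human demonstrations in each new game, and also what introduces the drift and reward-hacking risks discussed below.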
The broader context matters. Generalist agents that act through perception and motor-like outputs have become a crowded research frontier: OpenAI's VPT learned Minecraft from video, NVIDIA's Voyager used LLM-driven curricula in the same game, and Anthropic's Computer Use and Google's own Project Mariner extend the idea to desktop and browser interfaces. SIMA 2 sits somewhere between these efforts, leaning on rich 3D simulation as a training ground for skills DeepMind has openly described as relevant to robotics and embodied AI. Games are cheap, diverse, and safe; a policy that can take a vague human goal in an unfamiliar 3D world and decompose it into grounded actions is exactly the kind of capability that real-world robots will eventually need.
There are caveats. Pixel-to-action play in games still differs from physical manipulation, where sensing noise, contact dynamics, and safety constraints dominate. Self-generated rewards via an LLM can drift or reward-hack, and DeepMind's own reporting acknowledges that long-horizon reliability remains imperfect. The release is positioned as a limited research preview rather than a product, with access gated while the team studies safety and alignment behaviors.
Still, SIMA 2 is a notable marker. It suggests that combining a strong multimodal reasoner like Gemini with an action-capable policy and a steady supply of simulated worlds may be a viable recipe for agents that keep getting better with experience. Whether that recipe transfers cleanly out of the game window is the open question the next year of research is likely to answer.
The body text and summaries on this page were generated automatically by AI. Please verify accuracy against the original article (deepmind.google).