Gemini Robotics-ER 1.6: Powering real-world robotics tasks through enhanced embodied reasoning

Google DeepMind Blog · deepmind.google · 2026/04/14 00:52

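AI 3-line summary

Google DeepMind has announced the embodied reasoning model "Gemini Robotics-ER 1.6".
It strengthens spatial understanding and complex task planning, improving the accuracy of real-world robot manipulation.
Developers can use it through the Gemini API.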

English summary

Gemini Robotics-ER 1.6: Enhancing spatial reasoning and multi-view understanding for autonomous robotics.

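Google DeepMind has released "Gemini Robotics-ER 1.6", an embodied reasoning model for robots. The latest version improves spatial understanding and step-by-step planning on real-world robot tasks, and developers can access it via the Gemini API.

The model extends Gemini's general-purpose multimodal capabilities to the robotics domain: it estimates object positions and affordances (how an object can be manipulated) from images and generates multi-step action plans. Compared with the previous version, it is said to perform better on spatial-reasoning benchmarks, decompose long-horizon tasks more effectively, and make stronger safety-related judgments. In real robot control stacks, a two-layer design is common, in which the ER model acts as the "brain" handling high-level instructions while low-level control is delegated to a separate VLA (Vision-Language-Action) model or controller; this model appears to occupy the same position.

As background, Google DeepMind has this year successively released the cloud-based "Gemini Robotics", an on-device version, and the reasoning-focused "Gemini Robotics-ER", stepping up its push in general-purpose robot foundation models. Among competitors, Figure AI's Helix, Physical Intelligence's pi series, NVIDIA GR00T, and OpenAI's re-entry into robotics have followed in quick succession, and the race to deploy foundation models in the physical world has intensified further since 2024.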

✨ Gemini / Gemma · Key points of this article

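ER-style models sit within the line of VLM-plus-action-learning research that has drawn on open resources such as Apertus and Open X-Embodiment. Their distinguishing feature is that API access lets startups and researchers pair the model with their own hardware and evaluate it. For production use, safety measures, latency, and task generalization may remain open issues, and future evaluations will be worth watching.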

Google DeepMind has released Gemini Robotics-ER 1.6, the latest version of its embodied reasoning model designed to bring Gemini's multimodal capabilities to physical robots. The update, accessible to developers through the Gemini API, focuses on improving spatial understanding and multi-step task planning for real-world robotic deployments.

The ER (Embodied Reasoning) line extends Gemini's general-purpose vision and language abilities into the robotics domain. Given camera input, the model can localize objects, infer affordances such as where and how an item can be grasped or manipulated, and generate sequences of actions to accomplish a high-level goal. According to Google DeepMind, version 1.6 delivers measurable gains on spatial-reasoning benchmarks, handles longer-horizon tasks more reliably through better decomposition, and exhibits stronger judgment on safety-relevant decisions compared with its predecessor.
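To make the object-localization step concrete, here is a minimal sketch of turning a model's point output into pixel coordinates. The article does not describe the response format, so the JSON shape used here (a list of `{"point": [y, x], "label": ...}` objects with coordinates normalized to 0..1000) is an assumption, and `PixelPoint` and `to_pixel_points` are hypothetical names introduced only for illustration. No API call is made; the sample string stands in for a model reply.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PixelPoint:
    label: str
    x: int  # column in pixels
    y: int  # row in pixels


def to_pixel_points(response_text: str, width: int, height: int,
                    scale: int = 1000) -> list[PixelPoint]:
    """Convert normalized [y, x] points into pixel coordinates.

    Assumption (not from the article): the reply is a JSON list of
    {"point": [y, x], "label": str} with coordinates in 0..scale.
    """
    points = []
    for item in json.loads(response_text):
        y_norm, x_norm = item["point"]
        # Clamp so a value of exactly `scale` still lands inside the image.
        x = min(width - 1, max(0, round(x_norm / scale * width)))
        y = min(height - 1, max(0, round(y_norm / scale * height)))
        points.append(PixelPoint(item["label"], x, y))
    return points


# Hypothetical model reply for a 640x480 camera frame.
sample = ('[{"point": [500, 250], "label": "mug handle"}, '
          '{"point": [1000, 1000], "label": "table corner"}]')
for p in to_pixel_points(sample, width=640, height=480):
    print(p)
```

A downstream grasp planner would typically consume pixel coordinates like these together with depth data; that part depends on the specific robot and is left out.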

In typical robot control stacks built around these models, the ER component effectively functions as the high-level "brain," interpreting instructions, perceiving the scene, and planning steps, while a separate vision-language-action (VLA) policy or low-level controller translates those plans into motor commands. Gemini Robotics-ER 1.6 appears to be positioned squarely in that upper layer, intended to be paired with action models or conventional controllers tuned to specific hardware. This separation of concerns has become a common pattern as developers seek to combine the broad world knowledge of large multimodal models with the real-time precision required for physical actuation.
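The two-layer split described above can be sketched as a pair of interfaces. This is an illustrative outline only, not Google DeepMind's design: `Step`, `Planner`, `Controller`, `StubPlanner`, `LoggingController`, and `run_task` are all hypothetical names, and the stub planner uses a toy string rule where a real system would query a reasoning model.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Step:
    skill: str   # e.g. "pick" or "place" (hypothetical skill names)
    target: str  # object or location the skill applies to


class Planner(Protocol):
    def plan(self, instruction: str, scene: list[str]) -> list[Step]: ...


class Controller(Protocol):
    def execute(self, step: Step) -> bool: ...


class StubPlanner:
    """Stand-in for the upper layer (the high-level reasoning model)."""

    def plan(self, instruction: str, scene: list[str]) -> list[Step]:
        # Toy rule: pick each visible object named in the instruction,
        # then place it in the bin.
        steps: list[Step] = []
        for obj in scene:
            if obj in instruction:
                steps += [Step("pick", obj), Step("place", "bin")]
        return steps


class LoggingController:
    """Stand-in for the lower layer (a VLA policy or classical controller)."""

    def __init__(self) -> None:
        self.log: list[str] = []

    def execute(self, step: Step) -> bool:
        self.log.append(f"{step.skill}:{step.target}")
        return True


def run_task(planner: Planner, controller: Controller,
             instruction: str, scene: list[str]) -> bool:
    for step in planner.plan(instruction, scene):
        if not controller.execute(step):
            return False  # stop at the first failure so the planner can replan
    return True


controller = LoggingController()
ok = run_task(StubPlanner(), controller, "put the cup in the bin", ["cup", "plate"])
print(ok, controller.log)
```

Keeping the planner and controller behind separate interfaces is what lets a hardware-specific controller be swapped without touching the reasoning layer, which is the separation of concerns the paragraph describes.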

The release fits into a broader push by Google DeepMind across the robotics stack this year. The company has previously introduced a cloud-based Gemini Robotics model, an on-device variant aimed at lower-latency or offline use cases, and earlier iterations of the reasoning-focused Gemini Robotics-ER family. Taken together, the lineup signals an intent to offer a tiered set of foundation models that span from heavy-duty server inference to constrained embedded deployments.

The competitive landscape around general-purpose robot foundation models has intensified sharply since 2024. Figure AI has been promoting its Helix system, Physical Intelligence has iterated on its pi-series policies, NVIDIA continues to expand the GR00T humanoid platform, and OpenAI has reportedly re-entered robotics after winding down its earlier effort. Several of these projects converge on a similar architectural bet: a large multimodal or VLA model serving as the cognitive core, coupled with learned or classical low-level control. The race to bring such foundation models into physical environments has become one of the more visible frontiers in applied AI.

Research context for ER-style models draws on publicly available datasets and benchmarks such as Open X-Embodiment, which aggregates demonstrations across many robot embodiments, as well as open VLM efforts that have explored how language-and-vision pretraining transfers to embodied tasks. By exposing Gemini Robotics-ER 1.6 through an API, Google DeepMind effectively lowers the barrier for startups and academic groups to plug the model into their own hardware platforms and evaluate it on bespoke tasks, without needing to train a comparable backbone from scratch.

Practical deployment, however, is likely to remain non-trivial. Latency is a recurring concern when a cloud-hosted reasoning model sits in the perception-to-action loop, and safety guarantees are difficult to obtain from probabilistic planners operating in unstructured environments. Generalization across unseen objects, layouts, and embodiments also continues to be an open research problem; benchmark gains do not always translate cleanly to robust behavior on a physical robot. Google DeepMind highlights improvements in these areas, but independent evaluation by external developers will be important to gauge how much the new model improves behavior on real hardware.
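One common way to contain the latency risk is to give each planning call a deadline and fall back to a conservative action when it is missed. The sketch below shows that pattern with the standard library only; `plan_with_deadline`, the 50 ms budget, and the `hold_position` fallback are illustrative assumptions, and the two local functions merely simulate a fast and a slow remote call.

```python
import concurrent.futures
import time
from typing import Callable


def plan_with_deadline(plan_fn: Callable[[], list[str]],
                       deadline_s: float,
                       fallback: list[str]) -> list[str]:
    """Return plan_fn()'s result, or `fallback` if it misses the deadline."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(plan_fn)
    try:
        return future.result(timeout=deadline_s)
    except concurrent.futures.TimeoutError:
        return fallback
    finally:
        # Do not block the control loop waiting for a late reply.
        executor.shutdown(wait=False)


def fast_planner() -> list[str]:
    return ["pick", "place"]


def slow_planner() -> list[str]:
    time.sleep(0.3)  # simulates a slow round trip to a hosted model
    return ["pick", "place"]


SAFE_FALLBACK = ["hold_position"]  # hypothetical conservative action

print(plan_with_deadline(fast_planner, 0.05, SAFE_FALLBACK))
print(plan_with_deadline(slow_planner, 0.05, SAFE_FALLBACK))
```

A deadline does not make the planner itself safer; it only bounds how long the low-level loop waits, so the fallback action still has to be chosen with the specific robot and environment in mind.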

For now, Gemini Robotics-ER 1.6 represents an incremental but notable step in the company's robotics roadmap. Whether it becomes a default high-level reasoning layer for third-party robot builders may depend on how it compares in practice with rival foundation models, on pricing and rate limits via the API, and on how easily it integrates with the diverse VLA policies and control stacks already in use across the industry.