DeepMind announces D4RT, a new method that lets AI perceive the world in four dimensions · D4RT: Teaching AI to see the world in four dimensions

Google DeepMind Blog · deepmind.google · 2026/01/16 19:39 · 4mo ago · 📖 2 min

AI three-line summary

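Google DeepMind has announced D4RT (4D Radiance Tracing), a new method for understanding space and time jointly from video.
It reconstructs dynamic scenes from multi-view video with high accuracy and enables free-viewpoint playback.
It is positioned as next-generation research that extends NeRF-style techniques along the time axis.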

English summary

D4RT: Unified, efficient 4D reconstruction and tracking up to 300x faster than prior methods.

Google DeepMind has introduced D4RT (4D Radiance Tracing), a research effort aimed at teaching AI systems to perceive the world not only in three spatial dimensions but also along the axis of time. The goal is to move beyond static 3D reconstruction toward faithful modeling of dynamic scenes, where objects move, lighting changes, and viewpoints shift freely.

D4RT takes synchronized multi-view video as input and learns a spatio-temporal radiance field that captures the entire scene as a continuous 4D function. While classic NeRF (Neural Radiance Fields) approaches assume the world stands still, D4RT treats time as a first-class fourth dimension, enabling rendering from any viewpoint at any moment in the captured sequence. By coupling this representation with a ray-tracing-style, physics-aware formulation, the method appears better positioned to handle optically tricky phenomena such as reflections, specularities, and translucency than purely volumetric approaches.
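
To make the idea of a radiance field with time as a fourth input more concrete, here is a minimal sketch in plain Python. It is not D4RT's method, whose details this article does not give. It applies the standard NeRF-style volume rendering quadrature to a hand-written toy field, and the names `toy_field` and `render_ray`, the moving-sphere scene, and all constants are assumptions chosen for illustration.

```python
import math

def toy_field(x, y, z, t):
    """Hypothetical 4D field (an assumption, not a learned model).

    A soft sphere whose centre drifts along x as time t advances.
    Returns (density, rgb) for the query point at time t.
    """
    cx = 0.5 * t                                   # centre moves with time
    d2 = (x - cx) ** 2 + y ** 2 + z ** 2           # squared distance to centre
    density = 5.0 * math.exp(-d2 / 0.1)
    rgb = (1.0, 0.5 * (1.0 + math.sin(t)), 0.2)    # colour also varies with t
    return density, rgb

def render_ray(origin, direction, t, near=0.0, far=4.0, n_samples=64):
    """NeRF-style quadrature along one ray, evaluated at a fixed time t."""
    delta = (far - near) / n_samples
    transmittance = 1.0                            # light not yet absorbed
    color = [0.0, 0.0, 0.0]
    for i in range(n_samples):
        s = near + (i + 0.5) * delta               # midpoint of the i-th segment
        p = [o + s * d for o, d in zip(origin, direction)]
        sigma, rgb = toy_field(p[0], p[1], p[2], t)
        alpha = 1.0 - math.exp(-sigma * delta)     # opacity of this segment
        weight = transmittance * alpha
        for c in range(3):
            color[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha
    return color, 1.0 - transmittance              # colour, accumulated opacity

# Same camera ray, two different moments in the captured sequence.
ray_o, ray_d = (0.0, 0.0, -2.0), (0.0, 0.0, 1.0)
color_t0, opacity_t0 = render_ray(ray_o, ray_d, t=0.0)
color_t4, opacity_t4 = render_ray(ray_o, ray_d, t=4.0)
```

At `t=0.0` the ray passes through the sphere and accumulates high opacity; by `t=4.0` the sphere has moved away and the same ray sees almost nothing. In a learned system a neural network would take the place of `toy_field`, and rendering "from any viewpoint at any moment" amounts to choosing the ray and the value of `t`.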

The potential applications are broad. Free-viewpoint replays of sporting events and concerts, virtual production workflows in film and VFX, immersive VR and AR content, and even training data generation for robotics and autonomous driving could all benefit from the ability to faithfully replay a captured moment from arbitrary perspectives. Preserving dynamic scenes with people and objects in motion could also meaningfully raise the bar for digital twins and telepresence.

The wider context matters here. Over the past two years, 3D Gaussian Splatting has overtaken NeRF as the dominant approach for static scene reconstruction thanks to its real-time rendering speed, and several groups have proposed 4D Gaussian Splatting variants to capture motion. Companies such as Luma AI and Polycam have commercialized neural capture pipelines, while NVIDIA's Instant NGP popularized fast hash-grid encodings. Google itself has explored Immersive Light Fields for volumetric video and, on the generative side, video models like Veo. D4RT can be read as a complementary piece in that ecosystem, focused on observation-grounded 4D reconstruction rather than synthesis from prompts.

Several practical questions remain open. Multi-camera capture rigs are still expensive and operationally complex, and training high-fidelity radiance fields, particularly with ray-tracing components, can be computationally heavy. It is plausible that follow-up work will target sparser camera setups, monocular video, or distillation into faster representations suitable for on-device playback. If those constraints are addressed, methods in this family could become a standard substrate for how machines record, reason about, and re-experience the physical world.