Gemini / Gemma ⚠ Possibly outdated information

Accelerating on-device AI: A look at Arm and Google AI Edge optimization

Google Developers Blog · developers.googleblog.com · 2026/05/14 09:00 · 1mo ago · 📖 2 min

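AI 3-line summary

Google has worked with Arm to integrate Scalable Matrix Extension 2 (SME2) into the Google AI Edge stack and KleidiAI.
The integration uses the CPU as a matrix-compute accelerator, speeding up LLM and generative AI inference on mobile devices through LiteRT and MediaPipe.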

English summary

Google and Arm have integrated Arm's Scalable Matrix Extension 2 (SME2) with the Google AI Edge stack and KleidiAI, turning the CPU into a matrix-compute accelerator to speed up on-device LLM and generative AI inference via LiteRT and MediaPipe.

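Google is working with Arm, a major semiconductor IP vendor, to accelerate AI inference on mobile and edge devices. The collaboration centers on the Google AI Edge stack and aims to run LLMs and generative AI models efficiently on smartphones and IoT devices.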

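Google AI Edge consists of LiteRT (formerly TensorFlow Lite), a lightweight inference runtime; MediaPipe, for vision, audio, and text processing; and the LLM Inference API, which handles on-device LLM execution, among other components. These are optimized for Arm CPUs (the Cortex family), GPUs (Mali / Immortalis), and NPU-class accelerators, with the aim of improving inference latency and energy efficiency through quantization, kernel fusion, and the use of SIMD and SME extension instructions.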

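In recent years in particular, the Scalable Matrix Extension (SME) introduced in the Armv9-A generation, together with the Int8 matrix-multiply (i8mm) instructions that predate it, has made it easier for Transformer models, which rely heavily on matrix operations, to benefit. The lightweight Gemma models that Google provides, and on-device models such as Gemini Nano, also appear to be optimized with these hardware features in mind.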

✨ Gemini / Gemma · Key points of this article

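By way of background, relying solely on the cloud for generative AI amplifies latency, privacy, and cost problems, so interest in on-device execution is growing across the industry. Apple's Apple Intelligence and the NPU enhancements from Qualcomm and MediaTek belong to the same trend. Arm is integrating with major frameworks (PyTorch, ExecuTorch, MediaPipe, LiteRT, and others) through its KleidiAI library, and the collaboration with Google can be seen as part of that effort.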

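The practical benefit for developers is that they can draw out the performance of Arm chips through LiteRT or MediaPipe without specialized hardware knowledge. Going forward, abstraction layers for NPUs and on-device execution of larger LLMs may become the focus.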

Google has outlined how its ongoing collaboration with Arm is pushing on-device AI forward, focusing on the Google AI Edge stack as the connective tissue between models and the silicon that runs them. The partnership aims to make running LLMs and generative AI workloads on phones and embedded devices both faster and more energy efficient.

Google AI Edge brings together several components: LiteRT (the rebranded TensorFlow Lite runtime) for lightweight inference, MediaPipe for multimodal pipelines spanning vision, audio and text, and the LLM Inference API for running compact language models directly on device. By tuning these pieces for Arm's Cortex CPUs, Mali and Immortalis GPUs, and emerging NPU-class accelerators, the teams target gains through quantization, kernel fusion, and the use of advanced SIMD and matrix instructions.
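To make the quantization step mentioned above more concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is an illustration only: the function names `quantize_int8` and `dequantize_int8` are hypothetical, and the scheme shown (one scale per tensor, range clipped to ±127) is a common textbook choice, not necessarily what LiteRT or KleidiAI use for any given model.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: floats -> int8 codes plus one scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude to 127; use 1.0 for an empty or all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest code and clip into the symmetric int8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes: List[int], scale: float) -> List[float]:
    """Recover approximate float values from int8 codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.50, -1.27, 0.03, 0.90]  # hypothetical weights for illustration
    codes, scale = quantize_int8(weights)
    print("codes:", codes)
    print("restored:", dequantize_int8(codes, scale))
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half a quantization step per value.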

A key enabler is the Scalable Matrix Extension (SME, since extended by SME2) introduced with Armv9-A, alongside the older Int8 matrix-multiply (i8mm) instructions. Both are particularly well suited to the matrix multiplications that dominate Transformer-based models. Google's own on-device-friendly models, such as the Gemma family and Gemini Nano, appear to be designed with this kind of hardware acceleration in mind, although exact deployment details vary by device and partner.
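The kind of operation these matrix instructions accelerate can be sketched as an int8 matrix multiply that accumulates into a wider integer. The plain-Python model below is a sketch under stated assumptions: the name `matmul_int8_accumulate` is hypothetical, and the 2x8 by 8x2 tile in the example is chosen to resemble the tile shape of Arm's Int8 matrix multiply-accumulate instructions, not to describe any actual kernel in KleidiAI or LiteRT.

```python
from typing import List

Matrix = List[List[int]]

INT8_MIN, INT8_MAX = -128, 127


def matmul_int8_accumulate(a: Matrix, b: Matrix, acc: Matrix) -> Matrix:
    """Return acc + a @ b for int8 inputs, summing products in wide integers."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    if any(len(row) != cols for row in b):
        raise ValueError("b is not rectangular")
    if len(acc) != rows or any(len(row) != cols for row in acc):
        raise ValueError("accumulator shape must be rows(a) x cols(b)")
    for matrix in (a, b):
        if any(not INT8_MIN <= v <= INT8_MAX for row in matrix for v in row):
            raise ValueError("inputs must fit in int8")

    result = [list(row) for row in acc]  # copy so the caller's accumulator is untouched
    for i in range(rows):
        for j in range(cols):
            # A single int8 product can reach 16384, so the running sum
            # needs a wider type than int8 (hardware typically uses int32).
            result[i][j] += sum(a[i][k] * b[k][j] for k in range(inner))
    return result


if __name__ == "__main__":
    a = [[1, 2, 3, 4, 5, 6, 7, 8], [-1, -2, -3, -4, -5, -6, -7, -8]]  # 2 x 8 tile
    b = [[1, -1] for _ in range(8)]                                    # 8 x 2 tile
    print(matmul_int8_accumulate(a, b, [[0, 0], [0, 0]]))
```

Python integers never overflow, so this models the accumulator without wraparound; a hardware kernel would process many such tiles per instruction rather than looping element by element.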

The broader context matters. Pure cloud inference for generative AI faces growing pressure around latency, privacy and cost, pushing the industry toward hybrid and on-device execution. Apple's Apple Intelligence initiative and aggressive NPU roadmaps from Qualcomm and MediaTek reflect the same trend. Arm has been responding with KleidiAI, a set of optimized micro-kernels that plug into major frameworks including PyTorch, ExecuTorch, MediaPipe and LiteRT. The Google collaboration can be viewed as part of that wider effort to ensure that whatever model a developer chooses, it can take advantage of the underlying Arm hardware without bespoke engineering.

For developers, the practical takeaway is that performance improvements largely come for free through framework updates. Using LiteRT or MediaPipe, an app can benefit from the latest CPU and GPU optimizations without needing to hand-tune kernels. Looking ahead, the harder problems likely involve a clean abstraction layer for heterogeneous NPUs across vendors and the question of how large an LLM can realistically run on a flagship phone while preserving battery life and thermal headroom. Progress on both fronts will shape how quickly generative features migrate from the cloud to the edge.
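The question of how large a model fits on a phone starts with simple arithmetic on weight storage. The sketch below is a back-of-envelope estimate only: `weight_memory_gib` is a hypothetical helper, the 2-billion-parameter figure is an arbitrary example and not the size of any specific Gemma or Gemini Nano model, and the result ignores activations, the KV cache, and runtime overhead.

```python
def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    if num_params < 0 or bits_per_weight <= 0:
        raise ValueError("num_params must be >= 0 and bits_per_weight > 0")
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


if __name__ == "__main__":
    params = 2_000_000_000  # hypothetical model size, for illustration only
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: {weight_memory_gib(params, bits):.2f} GiB")
```

Halving the bits per weight halves the storage, which is why quantization is usually the first lever for fitting a model into a phone's memory budget.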