Gemma Scope 2: helping the AI safety community deepen understanding of complex language model behavior


AI 3-line summary
  • Google DeepMind has released Gemma Scope 2, a suite of sparse autoencoders for analyzing the internal workings of the Gemma family of language models.
  • It provides a foundation for the AI safety community to deepen interpretability research into complex model behavior.
English summary
  • With the release of Gemma Scope 2, open interpretability tools for language models are now available across the entire Gemma 3 family.

Google DeepMind has released Gemma Scope 2, a suite of sparse autoencoders (SAEs) for analyzing the internal representations of the Gemma family of language models. The release aims to give the AI safety community a foundation for interpreting increasingly complex language model behavior.

Gemma Scope is a collection of pretrained SAEs that decompose the activations obtained from each layer of a model into sparse feature vectors. The original Gemma Scope, released in 2024, provided hundreds of SAEs for the Gemma 2 model family and is credited with substantially lowering the barrier to mechanistic interpretability research. This second release appears to expand the covered models and layers and to offer larger, higher-quality SAEs, with the aim of enabling analysis of more advanced behaviors such as reasoning and multi-turn dialogue.

For background, SAEs have become a central technique in recent interpretability research, including Anthropic's "Towards Monosemanticity" and OpenAI's work applying SAEs to GPT-4. By separating concepts superimposed inside the model (superposition) into monosemantic features that humans can interpret, they make it easier to identify the internal circuits involved in specific behaviors such as deception, bias, or dangerous knowledge.

✨ Gemini · Key points of this article

DeepMind has previously published techniques such as JumpReLU SAEs, and Gemma Scope serves as the infrastructure that opens those implementations to researchers at large. Anthropic, OpenAI, EleutherAI, and other organizations are also building similar open SAE assets, making interpretability an increasingly active competitive area of safety research. The release of Gemma Scope 2 may help accelerate safety benchmarking, red-teaming, and alignment research.

Google DeepMind has released Gemma Scope 2, an expanded suite of sparse autoencoders (SAEs) trained on its open-weight Gemma language models. The release is positioned as infrastructure for the AI safety community, aiming to help researchers probe and understand increasingly complex model behaviors.

Sparse autoencoders decompose a model's internal activations into a large dictionary of sparse, ideally monosemantic features. By learning these features at various layers, researchers can isolate the concepts and circuits a model uses when generating output, ranging from simple lexical patterns to higher-level notions such as deception, sycophancy, or specific factual knowledge. The original Gemma Scope, released in 2024 alongside Gemma 2, made hundreds of pretrained SAEs publicly available and significantly lowered the barrier to entry for mechanistic interpretability work.
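As a toy sketch of the decomposition described above (the dimensions and random weights are illustrative placeholders, not actual Gemma Scope parameters), an SAE maps an activation vector into a much wider, non-negative feature vector and then reconstructs the activation from the few features that fire:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae = 16, 64  # toy sizes; real SAEs use far wider dictionaries

# Hypothetical pretrained parameters (random here for illustration).
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_encode(x):
    """Map a residual-stream activation to a sparse, non-negative feature vector."""
    return np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU zeroes inactive features

def sae_decode(f):
    """Reconstruct the original activation from the sparse features."""
    return f @ W_dec + b_dec

x = rng.normal(size=d_model)   # a single activation vector from one layer/token
f = sae_encode(x)              # sparse feature activations
x_hat = sae_decode(f)          # approximate reconstruction of x
active = np.flatnonzero(f)     # indices of the features that fired
```

In practice the encoder/decoder weights come from the released pretrained SAEs, and the `active` indices are the "features" researchers inspect and name.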

Gemma Scope 2 appears to extend that effort with broader model and layer coverage and likely improved SAE quality, supporting analysis of more sophisticated behaviors such as multi-step reasoning and instruction following. While the announcement frames the release primarily as a contribution to safety research, the artifacts are also relevant for general interpretability, model debugging, and steering experiments where researchers manipulate specific features to influence outputs.
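A steering experiment of the kind mentioned above can be sketched as follows. Everything here is hypothetical (a random decoder matrix and an arbitrary feature index stand in for a pretrained SAE's decoder and a feature chosen by inspection); the core idea is simply to add a feature's decoder direction back into the residual stream:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_sae = 16, 64
W_dec = rng.normal(0, 0.1, (d_sae, d_model))  # hypothetical SAE decoder weights

x = rng.normal(size=d_model)  # an activation at some layer and token position
feature_idx = 3               # a feature identified as encoding some concept
alpha = 4.0                   # steering strength; a negative sign suppresses

# Add the feature's decoder direction to the activation before passing it
# on to the rest of the model, amplifying that concept in the output.
steered = x + alpha * W_dec[feature_idx]
```

Real steering runs would hook this into the forward pass of the model and compare generations with and without the intervention.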

The broader context matters. Interpretability has become a focal point of frontier AI safety, with Anthropic's work on monosemanticity and dictionary learning in Claude-class models, OpenAI's recent paper on scaling SAEs to GPT-4, and community efforts from groups such as EleutherAI and Apollo Research. DeepMind itself has contributed methodological advances like JumpReLU SAEs, which improve the trade-off between sparsity and reconstruction fidelity. Gemma Scope can be seen as the productization of those research threads into a reusable public asset.
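For reference, the JumpReLU activation mentioned above replaces ReLU's threshold at zero with a learned threshold θ: pre-activations at or below θ are zeroed, while those above pass through unscaled, which increases sparsity without shrinking the surviving activations the way an L1 penalty does. A minimal sketch (scalar θ for simplicity; the published method learns per-feature thresholds):

```python
import numpy as np

def jumprelu(z, theta):
    """JumpReLU: pass z through unchanged where z > theta, else output 0.

    Unlike ReLU, small positive pre-activations below the learned
    threshold theta are zeroed, improving the sparsity/reconstruction
    trade-off for SAE features.
    """
    return np.where(z > theta, z, 0.0)

z = np.array([-1.0, 0.2, 0.8, 2.5])
out = jumprelu(z, theta=0.5)  # only 0.8 and 2.5 survive, unscaled
```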

For practitioners, the practical value is that training high-quality SAEs is computationally expensive, often requiring substantial GPU time per layer. By releasing pretrained SAEs over an open-weights model family, DeepMind makes it feasible for academic labs and independent researchers to study questions such as how specific features activate during jailbreak attempts, how reasoning chains are represented internally, or whether safety-relevant concepts can be reliably detected and suppressed.

There are caveats. SAEs remain an imperfect tool: features are not always cleanly monosemantic, and recent work has questioned how faithfully they capture a model's true computation rather than merely producing interpretable-looking artifacts. Findings from Gemma-based studies may also not transfer directly to closed frontier models with different training data and scale. Still, the open availability of Gemma Scope 2 is likely to accelerate empirical interpretability research and provide a shared substrate for comparing methods.

If the trajectory continues, public SAE suites tied to open-weight model releases may become a standard companion artifact, much as evaluation harnesses and tokenizers are today, embedding interpretability more deeply into the open-source AI stack.

  • Source: Google DeepMind Blog (T1)
  • Source Avg: ★ 2.0
  • Type: Blog
  • Importance: ★ Normal (top 89% in Gemini)
  • Half-life: ⏱️ Short-lived (news)
  • Lang: EN
  • Collected: 2026/05/16 21:00
Read the original article: deepmind.google

The body text and summaries on this page are automatically generated by AI. Please verify accuracy against the original article (deepmind.google).

