#mechanistic-interpretability

Entries page 1/1 · 3 total

Mon, Jun 1 · 1 entry

paper research 3w ago ·

arxiv-cs-lg

Measuring, Localizing, and Ablating Alignment Signatures in LLMs

Medium priority · paper/research · Papers / Benchmarks · Published Jun 1

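AI summary: A paper investigating where the "AI-like style" of aligned language models originates in their internal representations. It identifies which layers of the model hold the characteristic representation patterns introduced by post-training and proposes a method for selectively removing them.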

EN arXiv:2605.30526v1 Announce Type: new Abstract: Aligned language models often exhibit a recognizable AI-like style, yet its connection to post-training and internal representations remains poorly unde

#arxiv #paper #alignment +5

arxiv.org →


Wed, May 27 · 1 entry

paper research 4w ago ·

arxiv-cs-cl

Why LLMs Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations

Medium priority · paper/research · Papers / Benchmarks · Published May 27

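AI summary: A research paper that mechanistically analyzes why hallucinations arise when structured knowledge, such as graphs and tables, is linearized and fed to an LLM.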

EN arXiv:2605.26362v1 Announce Type: new Abstract: In many reasoning tasks, large language models (LLMs) rely on structured external knowledge, such as graphs and tables, which is typically linearized in

#arxiv #paper #hallucination +5

arxiv.org →


Fri, May 8 · 1 entry

NEW blog claude 1mo ago ·

youtube-anthropic

Translating Claude’s thoughts into language

Normal priority · New · technical post · Claude / Claude Code · Published May 8

AI summary: Anthropic has released a video on interpretability research that translates Claude's internal representations into human language. It presents work that visualizes what the model is "thinking" during inference, with the aim of improving AI transparency and safety.

#anthropic #youtube #interpretability +3

youtube.com →

