#deceptive-alignment — TECH Dashboard

paper research 2w ago ·

arxiv-cs-lg

LLMが「一貫して嘘をつく」ことを学習するとき：合成欺瞞の線形表現に関するマルチモデル研究 When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception

重要度 Medium Medium priority 重要度 Medium · 論文/研究 · Papers / Benchmarks Medium priority · paper/research · Papers / Benchmarks 公開 6月1日 Published Jun 1

AI要約 LLMが内部では正確な表現を保ちながら意図的に誤った出力を生成する「欺瞞的アライメント」を、複数モデルにわたって線形表現の観点から分析した研究。モデルが合成的な欺瞞をどのように学習・符号化するかを明らかにしようとしている。

EN arXiv:2605.30381v1 Announce Type: new Abstract: Deceptive alignment, in which models maintain accurate internal representations while deliberately producing false outputs, remains a central challenge

#arxiv #paper #ai-safety +5

arxiv.org →

fallback

#deceptive-alignment 1 total

Entries page 1/1 · 1 total

LLMが「一貫して嘘をつく」ことを学習するとき：合成欺瞞の線形表現に関するマルチモデル研究 When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception