#backdoor-detection — TECH Dashboard

paper research 1d ago

arxiv-cs-cl

Activation Differences Reveal Backdoors: A Comparison of SAE Architectures

AI summary: This paper proposes using Sparse Autoencoders (SAEs) to detect backdoors in language models. It analyzes the activation differences between clean and poisoned inputs and compares several SAE architectures on detection performance.
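As a rough illustration of the activation-difference idea in the summary, the sketch below encodes clean and poisoned activations with a toy ReLU encoder and ranks features by how much their mean activation shifts. Everything here is an assumption made for illustration, not the paper's method: the function names, the synthetic data, the hypothetical `trigger_feature`, and the square orthogonal encoder (real SAEs are typically overcomplete and trained, not random).

```python
import numpy as np

def sae_encode(acts, W_enc, b_enc):
    """Toy SAE encoder (assumed form): features = ReLU(acts @ W_enc + b_enc)."""
    return np.maximum(acts @ W_enc + b_enc, 0.0)

def rank_features_by_difference(clean_acts, poisoned_acts, W_enc, b_enc, top_k=5):
    """Rank features by the absolute shift in mean activation (poisoned minus clean)."""
    clean_feats = sae_encode(clean_acts, W_enc, b_enc)
    poisoned_feats = sae_encode(poisoned_acts, W_enc, b_enc)
    diff = poisoned_feats.mean(axis=0) - clean_feats.mean(axis=0)
    order = np.argsort(-np.abs(diff))[:top_k]
    return order, diff[order]

# Synthetic demo: no real model or trained SAE is involved.
rng = np.random.default_rng(0)
d_model, n_samples = 32, 200

# Assumption: a random orthogonal encoder, so each feature reads one direction.
W_enc, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
b_enc = np.zeros(d_model)

clean = rng.normal(size=(n_samples, d_model))

# Hypothetical backdoor trigger: shift activations along one feature's direction.
trigger_feature = 7
poisoned = rng.normal(size=(n_samples, d_model)) + 5.0 * W_enc[:, trigger_feature]

top_idx, top_diff = rank_features_by_difference(clean, poisoned, W_enc, b_enc)
print("top features:", top_idx)
print("mean activation shift:", np.round(top_diff, 3))
```

In this toy setup the shifted feature stands out because the encoder directions are orthogonal; with a trained, overcomplete SAE the shift would likely be spread across several correlated features.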

#arxiv #paper #sparse-autoencoder #backdoor-detection


