When RAG Backfires: Pitfalls of Retrieval Augmentation in Medical QA

Zenn LLM tag · zenn.dev · 2026/05/12 08:00 · 1d ago · 📖 2 min

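AI 3-line summary

An article pointing out that in medical question answering, RAG does not always improve accuracy and can in some cases degrade performance.
It attributes this to retrieval noise and interference with the model's own knowledge, and argues that the assumptions behind RAG design need to be revisited.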

English summary

An analysis showing that retrieval-augmented generation (RAG) can sometimes hurt rather than help medical question answering, due to noisy retrieved context interfering with the model's parametric knowledge.

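RAG (Retrieval-Augmented Generation) is widely adopted as an effective way to curb LLM hallucinations and keep knowledge up to date, but it has been pointed out that it is not always effective in medical QA. This article organizes cases where retrieval augmentation instead lowers answer accuracy and examines the causes.

When RAG is used for medical QA, the usual approach is to retrieve relevant literature from PubMed, textbooks, guidelines, and similar sources and pass it to the LLM as context. However, if the retrieved results match the question only partially, or contain outdated or contradictory information, they can steer the model away from an answer it already held correctly as parametric knowledge. In particular, for large GPT-4-class models, cases have been reported on medical exam benchmarks (MedQA, the medical subsets of MMLU, and others) where the model without retrieval outperforms the same model with RAG.

The factors cited are poor domain fit of the retriever, context fragmentation caused by chunking, and "context bias," in which the LLM gives excessive weight to the context it is given. When passages of low relevance slip in, the model may mistake them for legitimate evidence and distort its answer.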

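As related work, methods such as Self-RAG, Corrective RAG (CRAG), and Adaptive RAG have been proposed in recent years; they let the model itself judge whether retrieval is needed, or assess the reliability of retrieved results and select among them. For medicine specifically, there are also benchmarks and frameworks such as MedRAG and Almanac, and the performance differences arising from combinations of retriever, corpus, and generator model are beginning to be quantified.

As a practical implication, rather than leaving RAG "always on," it appears preferable to design systems that control whether and how much to retrieve according to the type and difficulty of the question. Especially in medicine, a high-risk domain, a multi-stage setup that combines layers for curating retrieval sources, re-ranking, and citation verification looks like a realistic compromise.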

Retrieval-Augmented Generation (RAG) has become a default recipe for grounding LLMs in up-to-date or domain-specific knowledge, yet evidence is accumulating that in medical question answering it can actually degrade accuracy. This article walks through cases where retrieval hurts and analyzes why.

A typical medical RAG pipeline pulls passages from sources like PubMed abstracts, clinical guidelines, or textbooks and injects them into the prompt. The implicit assumption is that more context is better. In practice, retrieval quality is uneven: passages may be only loosely relevant, outdated, or mutually contradictory. When that happens, the model can be pulled away from an answer it would have produced correctly from its parametric knowledge alone. Reports on benchmarks such as MedQA and the medical subsets of MMLU show that strong models such as GPT-4 sometimes score higher closed-book (without RAG) than with retrieved context.
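One way to make this kind of comparison concrete is to score the same questions with and without retrieval and count the answers that retrieval flipped from right to wrong. The sketch below is illustrative only: the `Result` record, the `compare` helper, and the four toy rows are made up for this example and are not taken from the article or from any benchmark.

```python
from dataclasses import dataclass


@dataclass
class Result:
    question_id: str
    gold: str         # correct option label
    closed_book: str  # model answer without retrieval
    with_rag: str     # model answer with retrieved context


def compare(results):
    """Compare closed-book and RAG accuracy and list per-question flips."""
    n = len(results)
    acc_closed = sum(r.closed_book == r.gold for r in results) / n
    acc_rag = sum(r.with_rag == r.gold for r in results) / n
    # Questions the model got right alone but wrong once context was added.
    hurt = [r.question_id for r in results
            if r.closed_book == r.gold and r.with_rag != r.gold]
    # Questions that retrieval actually fixed.
    helped = [r.question_id for r in results
              if r.closed_book != r.gold and r.with_rag == r.gold]
    return {"acc_closed": acc_closed, "acc_rag": acc_rag,
            "hurt": hurt, "helped": helped}


# Hypothetical toy data, not real benchmark results.
results = [
    Result("q1", "A", "A", "A"),
    Result("q2", "C", "C", "B"),  # retrieval flipped a correct answer
    Result("q3", "B", "D", "B"),  # retrieval fixed a wrong answer
    Result("q4", "D", "D", "A"),  # retrieval flipped a correct answer
]
report = compare(results)
print(report)
```

Looking at the `hurt` list rather than only the aggregate accuracy makes it easier to inspect which retrieved passages were responsible for a regression.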

Several mechanisms are at play. Retrievers trained on general corpora are not always well aligned with clinical phrasing, so top-k results can be off-topic. Chunking strategies fragment reasoning chains that span entire guideline sections. And LLMs exhibit a well-documented context bias, treating retrieved text as authoritative even when it is tangential, which can amplify rather than correct errors.
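The chunking problem can be illustrated with a toy example. The snippet below is a minimal sketch, assuming a naive fixed-size word chunker; the `chunk_words` and `chunks_containing_all` helpers and the placeholder "guideline" sentence are invented for illustration and do not represent real clinical guidance or any particular library's splitter.

```python
def chunk_words(text, size):
    """Split text into consecutive chunks of at most `size` words."""
    if size <= 0:
        raise ValueError("size must be positive")
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def chunks_containing_all(chunks, terms):
    """Return the chunks that contain every term (simple substring match)."""
    return [c for c in chunks if all(t in c for t in terms)]


# Placeholder sentence: a recommendation followed by its exception.
guideline = ("Drug X is recommended for condition Y unless the patient has "
             "renal impairment in which case it is contraindicated")

small = chunk_words(guideline, 7)
large = chunk_words(guideline, 32)

print(small[0])
# No small chunk holds both the recommendation and the exception.
print(chunks_containing_all(small, ["recommended", "contraindicated"]))
print(len(chunks_containing_all(large, ["recommended", "contraindicated"])))
```

With the small chunk size, a retriever can return the recommendation on its own, and the model never sees the exception that qualifies it.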

The broader research community has been responding with adaptive variants. Self-RAG lets the model decide when to retrieve and critique its own outputs. Corrective RAG (CRAG) introduces a lightweight evaluator that triggers web search or filtering when retrieved evidence looks weak. Adaptive RAG routes queries to different pipelines based on complexity. In the medical domain specifically, frameworks like MedRAG and Almanac have started to systematically benchmark combinations of retrievers, corpora, and generators, making it easier to see where retrieval helps and where it backfires.
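The idea of grading retrieved evidence before using it can be sketched roughly as follows. This is loosely inspired by the evaluator step described above and is not the CRAG implementation: the lexical-overlap score stands in for a learned evaluator, and the `grade` function, its action names, and the `upper`/`lower` thresholds are all assumptions made for illustration.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, passage):
    """Fraction of query tokens that also appear in the passage."""
    q = tokens(query)
    if not q:
        return 0.0
    return len(q & tokens(passage)) / len(q)


def grade(query, passages, upper=0.6, lower=0.3):
    """Decide what to do with retrieved passages (hypothetical thresholds)."""
    scored = [(overlap_score(query, p), p) for p in passages]
    best = max((s for s, _ in scored), default=0.0)
    if best >= upper:
        # Strong evidence: keep only the confidently relevant passages.
        return "use", [p for s, p in scored if s >= upper]
    if best < lower:
        # Weak evidence: drop it and fall back (other source or closed-book).
        return "fallback", []
    # Ambiguous evidence: keep passages that clear the lower bar.
    return "filter", [p for s, p in scored if s >= lower]


query = "first line treatment for condition Y"
relevant = "Condition Y: first line treatment is drug X."
partial = "Treatment options for condition Z"
off_topic = "History of the hospital founding"

print(grade(query, [relevant, off_topic]))
print(grade(query, [partial, off_topic]))
print(grade(query, [off_topic]))
```

In a real system the scoring function would more plausibly be a trained relevance model, and the thresholds would need tuning on held-out medical questions.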

The practical takeaway is that RAG should not be treated as an always-on safety net, particularly in high-stakes domains. A more defensible design likely involves gating retrieval on query type, carefully curating the source corpus, applying re-ranking, and adding a citation-verification layer before surfacing answers. For clinical applications, hybrid approaches that combine parametric medical knowledge with selective, high-precision retrieval appear to be the more promising direction, though the optimal balance is still an open empirical question.
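A gated pipeline with a citation check might look roughly like the sketch below. Everything here is a simplifying assumption: the keyword list in `RETRIEVAL_CUES`, the `"quote" [n]` citation format, and the injected `retrieve` and `generate` callables are hypothetical stand-ins, and the stub data is invented rather than clinical fact.

```python
import re

# Hypothetical cues suggesting the question needs up-to-date sources.
RETRIEVAL_CUES = ("latest", "guideline", "dose", "dosage",
                  "updated", "recommendation")


def needs_retrieval(question):
    """Crude gate: retrieve only when the question contains a cue word."""
    q = question.lower()
    return any(cue in q for cue in RETRIEVAL_CUES)


def verify_citations(answer, passages):
    """Return (quote, index) pairs whose quote is absent from passage [index]."""
    unsupported = []
    for quote, num in re.findall(r'"([^"]+)"\s*\[(\d+)\]', answer):
        idx = int(num)
        in_range = 1 <= idx <= len(passages)
        if not in_range or quote.lower() not in passages[idx - 1].lower():
            unsupported.append((quote, idx))
    return unsupported


def answer_question(question, retrieve, generate):
    """Gate retrieval, generate a draft, then flag unsupported citations."""
    passages = retrieve(question) if needs_retrieval(question) else []
    draft = generate(question, passages)
    return {"answer": draft,
            "passages_used": len(passages),
            "unsupported": verify_citations(draft, passages)}


# Stubs standing in for a real retriever and LLM call.
def fake_retrieve(question):
    return ["Drug X is first-line for condition Y."]


def fake_generate(question, passages):
    if not passages:
        return "closed-book answer"
    return '"Drug X is first-line" [1] and "safe in pregnancy" [1]'


gated = answer_question("What is the latest guideline for condition Y?",
                        fake_retrieve, fake_generate)
ungated = answer_question("Which nerve innervates the deltoid?",
                          fake_retrieve, fake_generate)
print(gated)
print(ungated)
```

The exact-substring check is deliberately strict. A production system would likely need fuzzier matching, and a human review step for any answer that carries unsupported claims.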