From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs

arXiv cs.CL · arxiv.org · 2026/05/12 13:00 · 2d ago · 📖 2 min

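AI 3-line summary

This paper proposes "combinatorial hardening," a method that raises evaluation difficulty from 0-order evaluation, where answer options are presented, to 2-order evaluation, where the model must judge the truth of several claims in combination.
It shows that state-of-the-art LLMs grasp individual facts yet err systematically on composite, compositional judgments.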

English summary

The paper introduces a combinatorial hardening methodology that escalates evaluation from 0-order multiple choice to 2-order compositional judgments, exposing systematic failures of frontier LLMs at composing facts even when they know individual pieces correctly.

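In the evaluation of large language models (LLMs), conventional benchmarks that ask the model to pick the correct answer from a set of options are approaching saturation, which makes it harder to measure a model's actual reasoning ability. This paper proposes "Combinatorial Hardening," which makes evaluation tasks harder in stages, and shows that frontier LLMs fail systematically on judgments that combine facts even when they handle the individual facts correctly.

The authors organize evaluation into three orders. 0-order is standard multiple choice, 1-order is a true/false judgment on a single proposition, and 2-order is a consistency judgment over a combination of several propositions. As the order rises, guessing and pattern matching are less likely to yield the correct answer, and the model has to actually integrate the relationships between propositions. Even within the same knowledge domain, models that score highly at 0-order are reported to lose substantial accuracy at 2-order.

This result once again highlights the gap between what an LLM "knows" and what it "can combine into a judgment." In recent years a series of evaluations designed to remove reliance on surface patterns has been proposed, including GSM-Symbolic, ARC-AGI, and BIG-Bench Hard, and this study appears to belong to the same lineage. In particular, combined true/false judgments may be an effective format for avoiding the "process-of-elimination bias" that arises in tasks where answer options are presented.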

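🔬 Research · Key points of this article

In practice, these compositional failures can become a reliability risk wherever agents or RAG systems integrate several facts to reach a conclusion. Behavior in which each individual fact is retrieved accurately but the final judgment that bundles them is wrong suggests the danger of relying on 0-order accuracy alone as an evaluation metric. On the training side as well, strengthening compositional reasoning through synthetic data and reinforcement learning is expected to become more important.

Note that the year notation in this paper's arXiv ID differs from the usual format, so the submission date and version are best confirmed against the original.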

Standard LLM benchmarks built around multiple-choice questions are showing clear signs of saturation, making it increasingly difficult to distinguish surface pattern matching from genuine reasoning. This paper proposes a methodology called combinatorial hardening, which systematically escalates evaluation difficulty and reveals that frontier models fail at compositional judgment even when they reliably know the underlying facts.

The authors organize evaluation into three orders. A 0-order task is the familiar multiple-choice format, where a model selects one answer from a candidate set. A 1-order task asks the model to judge the truth of a single proposition, removing the elimination heuristics that multiple-choice formats invite. A 2-order task presents combinations of propositions and asks the model to judge their joint consistency or truth, forcing actual integration across statements rather than isolated recall.
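The three orders can be illustrated with a minimal Python sketch. It is not the authors' construction: the class names `MCQuestion` and `Proposition`, the helper functions, the prompt wording, and the choice of "both statements true" as the 2-order label are all assumptions made for illustration, since this summary does not specify how the paper builds its items.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class MCQuestion:
    """A 0-order item: a stem, candidate options, and the index of the correct one."""
    stem: str
    options: tuple[str, ...]
    answer_index: int


@dataclass(frozen=True)
class Proposition:
    """A 1-order item: a single statement with a true/false label."""
    text: str
    is_true: bool


def to_first_order(q: MCQuestion) -> list[Proposition]:
    # Hypothetical reformulation: one standalone true/false statement per option.
    return [
        Proposition(f"{q.stem} {opt}", i == q.answer_index)
        for i, opt in enumerate(q.options)
    ]


def to_second_order(props: list[Proposition]) -> list[tuple[str, bool]]:
    # Hypothetical 2-order items: every unordered pair of propositions,
    # labeled true only when both members are true (an assumed joint criterion).
    items = []
    for a, b in combinations(props, 2):
        prompt = (
            f"Statement A: {a.text}\n"
            f"Statement B: {b.text}\n"
            "Are both statements true? Answer yes or no."
        )
        items.append((prompt, a.is_true and b.is_true))
    return items


# Toy question bank, invented for this example.
questions = [
    MCQuestion("The capital of France is", ("Paris.", "Lyon."), 0),
    MCQuestion("At sea level, water boils at", ("100 °C.", "50 °C."), 0),
]
props = [p for q in questions for p in to_first_order(q)]
items = to_second_order(props)
print(len(props), "first-order items,", len(items), "second-order items")
```

Pairing propositions across questions rather than within one question matters in this sketch: options from a single multiple-choice item are mutually exclusive, so within-question pairs would always be labeled false.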

The central empirical finding is a sharp degradation as order increases. Models that score highly on 0-order versions of a benchmark can collapse on the 2-order variant covering the same knowledge. This gap is not easily explained by missing facts, since the 1-order results indicate the atomic propositions are largely known. Instead, it points to a compositional weakness: frontier LLMs struggle to combine pieces of knowledge they individually possess into a coherent verdict.
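One way to check whether a 2-order drop exceeds what atomic errors alone would predict is to compare it with an independence baseline. The sketch below is not taken from the paper. It assumes that a joint item over k propositions is answered correctly only when all k atomic judgments are correct, and that those errors are independent. The function names and the numbers in the usage lines are hypothetical.

```python
def independence_baseline(atomic_accuracy: float, k: int = 2) -> float:
    """Expected joint accuracy if k atomic judgments err independently
    and the joint verdict requires all of them to be right (assumption)."""
    if not 0.0 <= atomic_accuracy <= 1.0:
        raise ValueError("atomic_accuracy must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return atomic_accuracy ** k


def compositional_gap(
    atomic_accuracy: float, observed_joint_accuracy: float, k: int = 2
) -> float:
    """Shortfall of observed joint accuracy relative to the independence baseline.
    A clearly positive value means the drop is larger than atomic errors alone predict."""
    return independence_baseline(atomic_accuracy, k) - observed_joint_accuracy


# Illustrative numbers only, not results from the paper.
baseline = independence_baseline(0.9, k=2)
gap = compositional_gap(0.9, 0.6, k=2)
print(f"baseline={baseline:.2f}, gap={gap:.2f}")
```

Under these assumptions, a model with 90% 1-order accuracy would be expected to reach about 81% on pairs, so an observed 60% would leave roughly 21 points unexplained by missing facts.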

This line of work fits into a broader ecosystem of evaluations designed to resist shortcut solutions, including GSM-Symbolic, ARC-AGI, BIG-Bench Hard, and various perturbation-based probes. A recurring theme across these efforts is that headline scores on legacy benchmarks overstate model competence because the formats themselves leak information. The combinatorial hardening approach is notable for being format-agnostic in spirit; it can in principle be applied on top of existing question banks by reformulating them as truth judgments over combinations.

The practical implications matter for anyone deploying LLMs in agentic or retrieval-augmented settings. Real workflows rarely involve picking among neatly enumerated answers. They involve gathering several facts, possibly from tools or documents, and arriving at a combined conclusion. If models retrieve correctly yet integrate poorly, downstream reliability suffers in ways that 0-order accuracy metrics will not catch. This may partly explain why agent benchmarks often look worse than chat benchmarks for the same underlying model.

On the training side, the results suggest that scaling pretraining alone is unlikely to close the compositional gap. Approaches that explicitly target multi-step verification, such as process reward models, self-consistency, deliberate chain-of-thought training, and synthetic data emphasizing compositional structure, are plausibly more effective levers. Reasoning-focused models such as OpenAI's o-series and similar systems have shown gains on related tasks, though it remains to be seen whether they fully overcome the 2-order penalty reported here.

Readers should note that the arXiv identifier formatting in the source appears unusual, so submission date and version details are best confirmed against the canonical record. Regardless of metadata particulars, the methodological contribution stands on its own: by moving the evaluation lens from selection to combinatorial judgment, the paper provides a sharper instrument for diagnosing where current frontier models actually break down.

#arxiv #paper #llm-evaluation #compositional-reasoning #benchmarks #frontier-models

Source: arXiv cs.CL T1
Source Avg: ★ 1.0
Type: Paper
Importance: ★ Normal (top 10% in Research)
Half-life: 🏛️ Long-term (architecture)
Lang: EN
Collected: 2026/05/13 07:55

Read the original article

arxiv.org

The body text and summaries on this page were generated automatically by AI. Please check the original article (arxiv.org) for accuracy.
