Tag timeline

#arxiv page 3/3

More updates grouped under the same keyword. Use this page to track related news and implementation topics across categories.

Total 77

Showing 17

Updated 1h ago

Wed, May 27 7 entries

paper research 3w ago ·

arxiv-cs-se

A Universal Cliff and a Design Fingerprint: Cross-Section Defect Detection Under LLM Orchestration

Medium priority · paper/research · Papers / Benchmarks · Published May 27

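AI summary: A study on detecting cross-section defects that arise when an LLM distributes work across multiple worker agents. It reports a design "fingerprint" pattern and the existence of a performance cliff.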

EN This paper investigates defect detection across the invisible orchestration layer of production LLM systems, identifying a universal performance cliff and a recurring design fingerprint in multi-agent architectures.

#arxiv #paper #llm +5

paper research 3w ago ·

arxiv-cs-se

RepoMirage: Probing Repository Context Reasoning in Code Agents with Perturbations

Medium priority · paper/research · Papers / Benchmarks · Published May 27

AI summary: A study that applies perturbations to test whether code agents truly understand context on repository-level benchmarks.

EN RepoMirage probes whether code agents genuinely reason about repository context or exploit shortcuts, using controlled perturbations on repository-level benchmarks.

#arxiv #paper #code-agents +4

paper research 3w ago ·

arxiv-cs-se

SetupX: Can LLM Agents Learn from Past Failures in Functionality-Correct Code Repository Setup?

Medium priority · paper/research · Papers / Benchmarks · Published May 27

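AI summary: A study that proposes SetupX, a benchmark for correctly configuring repository execution environments, and tests whether LLM agents can learn from past failures.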

EN SetupX is a benchmark studying whether LLM agents can learn from past failures to correctly configure execution environments for code repositories.

#arxiv #paper #llm-agents +4

paper research 3w ago ·

arxiv-cs-se

Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

Medium priority · paper/research · Papers / Benchmarks · Published May 27

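AI summary: A research paper proposing Verus-SpecGym, a benchmark environment for evaluating specification autoformalization, with the goal of guaranteeing the correctness of AI coding agents' output.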

EN Verus-SpecGym is a new agentic benchmark environment for evaluating how well AI agents can autoformalize software specifications, addressing correctness challenges in AI-generated code.

#agent #arxiv #paper +5

paper research 3w ago ·

arxiv-cs-se

Testing Agentic Workflows with Structural Coverage Criteria

Medium priority · paper/research · Papers / Benchmarks · Published May 27

AI summary: A research paper proposing a new testing method that exploits the workflow structure of multi-agent systems (agents, tools, delegation paths, and so on).

EN A research paper proposing structural coverage criteria for testing multi-agent workflows, leveraging explicit structures such as agents, tools, access rules, and delegation paths.

#agent #arxiv #benchmark +6

paper research 3w ago ·

arxiv-cs-se

FuzzPilot: Plateau-Triggered Recipe Validation for Structured Text Fuzzing

Medium priority · paper/research · Papers / Benchmarks · Published May 27

AI summary: FuzzPilot is a controller for AFL++ that snapshots the corpus when coverage plateaus, proposing a way to keep expensive reasoning off the mutation hot path.

EN FuzzPilot is an AFL++ controller that defers expensive reasoning to coverage-plateau events, snapshotting the corpus and validating mutation recipes without blocking the hot path.

#arxiv #paper #fuzzing +4

paper research 3w ago ·

arxiv-cs-se

TrajAudit: Automated Failure Diagnosis for Agentic Coding Systems

Medium priority · paper/research · Papers / Benchmarks · Published May 27

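AI summary: A research paper proposing TrajAudit, a framework that automatically diagnoses the causes of failure in agentic AI systems that perform tasks such as bug fixing.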

EN TrajAudit is a proposed framework for automated failure diagnosis in agentic coding systems such as AI-driven bug fixers, helping explain why tasks go wrong.

#agent #arxiv #paper +4

Tue, May 26 5 entries

paper research 3w ago ·

arxiv-cs-cl

Raon-Speech Technical Report

Medium priority · paper/research · Papers / Benchmarks · Published May 26

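AI summary: Technical report on Raon-Speech, a 9-billion-parameter speech language model supporting English and Korean. It achieves strong performance in speech understanding, answering, and generation.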

EN Raon-Speech is a top-performing 9B-parameter speech language model supporting English and Korean speech understanding, answering, and generation tasks.

#arxiv #paper #speech-language-model +4

paper research 3w ago ·

arxiv-cs-cl

Multi-Persona Debate System for Automated Scientific Hypothesis Generation

Medium priority · paper/research · Papers / Benchmarks · Published May 26

AI summary: A multi-persona debate framework, published on arXiv, that integrates fragmented knowledge to generate scientific hypotheses automatically.

EN A multi-persona debate system is proposed to automate scientific hypothesis generation by synthesizing fragmented knowledge into actionable research directions.

#arxiv #paper #hypothesis-generation +4

paper research 3w ago ·

arxiv-cs-ai

Confidence Calibration in Large Language Models

Medium priority · paper/research · Papers / Benchmarks · Published May 26

AI summary: A preregistered study of LLM confidence calibration across diverse tasks. It examines how well a model's stated confidence matches its actual accuracy.

EN A preregistered study investigates how well large language models calibrate their expressed confidence across diverse tasks, examining alignment between stated certainty and actual accuracy.

#arxiv #paper #calibration +4

paper research 3w ago ·

arxiv-cs-ai

How Much Thinking is Enough? Quantifying and Understanding Redundancy in LLM Reasoning

Medium priority · paper/research · Papers / Benchmarks · Published May 26

AI summary: A paper that quantifies the redundancy in long LLM chains of thought and studies ways to cut latency, GPU time, and energy cost.

EN A research paper quantifying redundancy in LLM chain-of-thought reasoning, aiming to reduce latency, GPU time, and energy costs without sacrificing accuracy.

#arxiv #paper #chain-of-thought +4

paper research 3w ago ·

arxiv-cs-ai

Toward Reliable Design of LLM-Enabled Agentic Workflows: Optimizing Latency-Reliability-Cost Tradeoffs

Medium priority · paper/research · Papers / Benchmarks · Published May 26

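AI summary: A research paper proposing a design method that optimizes the three-way tradeoff among latency, reliability, and cost in workflows where multiple LLM agents cooperate.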

EN A research paper proposing methods to optimize latency, reliability, and cost tradeoffs in agentic workflows composed of multiple interacting LLM-powered and conventional agents.

#agent #arxiv #paper +6

Mon, May 25 5 entries

paper research 3w ago ·

arxiv-cs-lg

Latent Cache Flow: Model-to-Model Communication Without Text

Medium priority · paper/research · Papers / Benchmarks · Published May 25

AI summary: Proposes a method in which LLM agents share KV caches directly instead of text, aiming to reduce latency and information loss.

EN A proposed method enabling LLM agents to communicate via shared KV caches rather than text, reducing autoregressive decoding latency and information loss between models.

#arxiv #paper #llm-agents +4

paper research 3w ago ·

arxiv-cs-lg

Reading Calibrated Uncertainty from Language Model Trajectories

Medium priority · paper/research · Papers / Benchmarks · Published May 25

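AI summary: A research paper proposing a trajectory-based calibration method for language model uncertainty quantification as an alternative to maximum softmax probability.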

EN A research paper proposing trajectory-based methods to extract calibrated uncertainty estimates from language models, moving beyond the default maximum softmax probability (MSP) baseline.

#arxiv #paper #uncertainty-quantification +4

paper research 3w ago ·

arxiv-cs-lg

From Residuals to Reasons: LLM-Guided Mechanism Inference from Tabular Data

Medium priority · paper/research · Papers / Benchmarks · Published May 25

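AI summary: Proposes a method that uses LLMs to infer causal mechanisms from statistical residuals in tabular data. The study aims to achieve both prediction and understanding.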

EN A new method uses LLMs to infer causal mechanisms from model residuals in tabular data, aiming to bridge predictive accuracy and scientific interpretability.

#arxiv #paper #llm +4

paper research 3w ago ·

arxiv-cs-lg

MARGIN: Runtime Confidence Calibration for Multi-Agent Foundation Model Coordination

Medium priority · paper/research · Papers / Benchmarks · Published May 25

AI summary: Proposes MARGIN, a method that calibrates at runtime how much a coordinator should trust each agent's response when multiple foundation model agents cooperate.

EN MARGIN proposes a runtime confidence calibration method for multi-agent deployments, helping a coordinator decide which foundation model agent's response to trust.

#agent #arxiv #paper +5

paper research 3w ago ·

arxiv-cs-lg

PACE: Two-Timescale Self-Evolution for Small Language Model Agents

Medium priority · paper/research · Papers / Benchmarks · Published May 25

AI summary: Proposes PACE, a two-timescale self-evolution framework that automatically tunes prompts and parsers so that small LM agents can run efficiently in production.

EN PACE introduces a two-timescale self-evolution framework that automates prompt and component tuning for small language model agents, reducing compute and human effort in production deployments.

#arxiv #paper #small-language-model +4
