LIVE · 05/12
Today 56
Total 246
Major 12
Active sources 12/51
Updated just now
Daily Summary

Today's Updates

Today 56 ▼ 81%
Yesterday 288
7-day 502
Last 7 days article counts
Date        Count
2026-05-06  34
2026-05-07  48
2026-05-08  26
2026-05-09  14
2026-05-10  36
2026-05-11  288
2026-05-12  56
Top stories 05/12 · 10 items
  1. cursor REL Cursor in Microsoft Teams · Cursor is now available in Microsoft Teams. [cursor-changelog]
  2. research Context-Augmented Code Generation: How Product Context Improves AI Coding Agent Decision Compliance by 49% · arXiv:2605.08112v1 Announce Type: new Abstract: AI coding agents powered by large language models can read codebases and produce functional code, but they routinely violate team-specific product decis [arxiv-cs-se]
  3. research Computer Use at the Edge of the Statistical Precipice · arXiv:2605.08261v1 Announce Type: new Abstract: Evaluating Computer Use Agents (CUAs) on interactive environments is fraught with methodological pitfalls that the field has yet to systematically addre [arxiv-cs-se]
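  4. claude The difference in "cost feel of AI coding assistance" between OpenAI Codex and Claude Code · Introduction: When you use both OpenAI Codex and Claude Code, a quite practical difference emerges, separate from the simple question of "which is smarter". That difference is the sense of cost. Cost here does not mean only the monthly fee. How much [qiita-claude]
  5. claude [Memo-1] Tried to Go Live an HTML file in VS Code, but it did not work. · Main topic: While learning by building a basic counter app in VS Code, I ran into the following problem. When I tried to display a file containing HTML with the Live Server feature, it no longer displayed. Note: because this is my first post and a rough scribble written as a memo, various [qiita-claude]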
  6. tech-news New and improved in Copilot Studio: Intelligent workflows, connected experiences and more [microsoft-source]
  7. tech-news Microsoft releases dataset covering electricity grid in 48 US states to aid power systems research [microsoft-source]
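  8. claude Building and Architecting a System for Collecting the Latest Knowledge with Autonomous AI Agents · This article covers building an "autonomous AI agent system" that automatically collects and summarizes the latest AI development knowledge from platforms such as X (formerly Twitter) and YouTube, from requirements definition through concrete architecture design and technology selection (SDKs and frame [zenn-claude]
  9. claude Claude Code Complete Introduction: A Thorough Guide to the New Normal of AI Coding, from Installation to Practical Use · Introduction: It lives in the terminal, reads the entire codebase, edits files, runs shell commands, and operates Git. The tool that comes closest to the feeling of "developing together with an AI engineer" is Claude Code, Anthropic's official CLI tool. Until now, [zenn-claude]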
  10. codex How ChatGPT adoption broadened in early 2026 · ChatGPT adoption surged in Q1 2026, with fastest growth among users over 35 and more balanced gender usage, signaling broader mainstream AI adoption. [openai-blog]
🔥 Today's Top 3 importance × recency
  1. CodeQL 2.25.3 adds Swift 6.3 support · github-changelog · 3d ago
  2. Cursor in Microsoft Teams · cursor-changelog · 7h ago
  3. Create repositories on the go with GitHub Mobile · github-changelog · 21h ago

Timeline 246 total · page 1/9

TODAY 30 entries
NEW paper research 3h ago · arxiv-cs-se

Context-Augmented Code Generation: How Product Context Improves AI Coding Agent Decision Compliance by 49%

EN arXiv:2605.08112v1 Announce Type: new Abstract: AI coding agents powered by large language models can read codebases and produce functional code, but they routinely violate team-specific product decis

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

Computer Use at the Edge of the Statistical Precipice

EN arXiv:2605.08261v1 Announce Type: new Abstract: Evaluating Computer Use Agents (CUAs) on interactive environments is fraught with methodological pitfalls that the field has yet to systematically addre

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

Execution Envelopes: A Shared Admission Contract for Backend AI Execution Requests

EN arXiv:2605.08267v1 Announce Type: new Abstract: Enterprise AI backends increasingly admit heterogeneous execution requests across model deployment, inference, evaluation, data movement, and agentic wo

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

Do not copy and paste! Rewriting strategies for code retrieval

EN arXiv:2605.08299v1 Announce Type: new Abstract: Embedding-based code retrieval often suffers when encoders overfit to surface syntax. Prior work mitigates this by using LLMs to rephrase queries and co

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

Mazocarta: A Seeded Procedural Deckbuilder for Instrumented Game Development

EN arXiv:2605.08319v1 Announce Type: new Abstract: Mazocarta is a seeded procedural tactical deckbuilder implemented in Rust, compiled to WebAssembly for browser play, and executable natively for simulat

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

What Software Engineering Looks Like to AI Agents? -- An Empirical Study of AI-Only Technical Discourse on MoltBook

EN arXiv:2605.08380v1 Announce Type: new Abstract: AI agents are increasingly framed as software-engineering teammates, yet most research studies them inside human-centered workflows. Little is known abo

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

A Dataset of Agentic AI Coding Tool Configurations

EN arXiv:2605.08435v1 Announce Type: new Abstract: Agentic AI coding tools such as Claude Code and OpenAI Codex execute multi-step coding tasks with limited human oversight. To steer these tools, develop

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation

EN arXiv:2605.08553v1 Announce Type: new Abstract: Large language models can generate useful code from natural language, but their outputs come without correctness guarantees. Verifiable code generation

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

EvidenT: An Evidence-Preserving Framework for Iterative System-Level Package Repair

EN arXiv:2605.08621v1 Announce Type: new Abstract: Frequent toolchain updates and growing ISA diversity have made system-level software package repair increasingly important. Diagnosing and repairing bui

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

Semantic Voting: Execution-Grounded Consensus for LLM Code Generation

EN arXiv:2605.08680v1 Announce Type: new Abstract: LLM code-generation pipelines often sample multiple candidates and select one final answer without access to a complete oracle. Existing pipelines mix t

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

A Learning Method for Symbolic Systems Using Large Language Models

EN arXiv:2605.08694v1 Announce Type: new Abstract: Automated theorem proving is essential for the formal verification of safety-critical systems. As the corpus of formal proofs grows, a natural paradigm

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

EN arXiv:2605.08717v1 Announce Type: new Abstract: Software engineering agents are increasingly deployed in evaluable engineering environments, yet post-failure recovery remains costly, manual, and ad ho

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

Using Semantic Distance to Estimate Uncertainty in LLM-Based Code Generation

EN arXiv:2605.09023v1 Announce Type: new Abstract: LLMs show strong performance in code generation, but their outputs lack correctness guarantees. Sample-based uncertainty estimators address this by gene

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

ParityFuzz: Finding Inconsistencies across Solidity Compilers via Fine-Grained Mutation and Differential Analysis

EN arXiv:2605.09051v1 Announce Type: new Abstract: The Solidity smart contract ecosystem has rapidly grown, leading to multiple compilers targeting different blockchain platforms or improving compilation

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

Evaluating LLM-Generated Code: A Benchmark and Developer Study

EN arXiv:2605.09059v1 Announce Type: new Abstract: Code generation is one of the tasks for which the use of Large Language Models is widely adopted and highly successful. Given this popularity, there are

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

Generating Complex Code Analyzers from Natural Language Questions

EN arXiv:2605.09304v1 Announce Type: new Abstract: Many software development tasks, such as implementing features and fixing bugs, begin with developers posing questions about a codebase. However, answer

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

Prediction Model of Motivators and Demotivators of Integrating Large Language Models in Software Engineering Education: An Empirical Study

EN arXiv:2605.09393v1 Announce Type: new Abstract: Context: Large Language Models (LLMs) are increasingly influencing software engineering practice and education. While prior studies examine their techni

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

MACAA: Belief-Revision Multi-Agent Reasoning for Open-World Code Authorship Verification

EN arXiv:2605.09421v1 Announce Type: new Abstract: Code authorship attribution (CAA) supports software forensics, plagiarism detection, and intellectual property protection. However, existing supervised

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

ConCovUp: Effective Agent-Based Test Driver Generation for Concurrency Testing

EN arXiv:2605.09573v1 Announce Type: new Abstract: Concurrency testing is essential to improve the reliability and security of multi-threaded programs. Dynamic analysis tools, such as TSan, depend on hig

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

Zoom, Don't Wander: Why Regional Search Outperforms Pareto Reasoning and Global Optimization in Budget-Constrained SBSE

EN arXiv:2605.09658v1 Announce Type: new Abstract: Traditional Search-Based Software Engineering (SBSE) assumes global search and full Pareto exploration are essential. We offer the following negative re

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

Trajectory Supervision for Continual Tool-Use Learning in LLMs

EN arXiv:2605.09734v1 Announce Type: new Abstract: Most language-model training data shows final artifacts, not the process that produced them. We study a tractable version of this question in tool use:

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

Evaluating Tool Cloning in Agentic-AI Ecosystems

EN arXiv:2605.09817v1 Announce Type: new Abstract: Agent tools are becoming a core interface through which LLM agents access external data, services, and execution environments. As these tools are distri

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

Deterministic vs. LLM-Controlled Orchestration for COBOL-to-Python Modernization

EN arXiv:2605.09894v1 Announce Type: new Abstract: Modernizing legacy COBOL systems remains difficult due to scarce expertise, large and long-lived codebases, and strict correctness requirements. Recent

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

Instruction Adherence in Coding Agent Configuration Files: A Factorial Study of Four File-Structure Variables

EN arXiv:2605.10039v1 Announce Type: new Abstract: Frontier coding agents read configuration files (CLAUDE.md, AGENTS.md, Cursor Rules) at session start and are expected to follow the conventions inside

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

MARGIN: Margin-Aware Regularized Geometry for Imbalanced Vulnerability Detection

EN arXiv:2605.10240v1 Announce Type: new Abstract: Software vulnerability detection is critical for ensuring software security and reliability. Despite recent advances in deep learning, real-world vulner

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

DREAMS: Modelling Support for Research into Engineering and Artistic Design

EN arXiv:2605.10382v1 Announce Type: new Abstract: Design Research Methodology (DRM) supports systematic design research through representations such as Reference Models and Impact Models. However, the p

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

VISOR: A Vision-Language Model-based Test Oracle for Testing Robot

EN arXiv:2605.10408v1 Announce Type: new Abstract: Testing robots requires assessing whether they perform their intended tasks correctly, dependably, and with high quality, a challenge known as the test

arxiv.org
NEW paper research 3h ago · arxiv-cs-se

CrackMeBench: Binary Reverse Engineering for Agents

EN arXiv:2605.10597v1 Announce Type: new Abstract: Benchmarks for coding agents increasingly measure source-level software repair, and cybersecurity benchmarks increasingly measure broad capture-the-flag

arxiv.org