#benchmark · p.2 — TECH Dashboard

🔥 HOT blog codex 6mo ago

openai-blog

Advancing science and math with GPT-5.2

High priority · technical post · OpenAI / Codex · Published Dec 11

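AI summary OpenAI has announced GPT-5.2. The model achieves top-tier results on major benchmarks such as GPQA Diamond and FrontierMath, substantially strengthening its reasoning ability in science and mathematics.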

EN GPT-5.2 is OpenAI’s strongest model yet for math and science, setting new state-of-the-art results on benchmarks like GPQA Diamond and FrontierMath. This post shows how those gains translate into real

#benchmark #openai #gpt-5.2 +7

openai.com →


NEW blog gemini 6mo ago

google-deepmind

FACTS Benchmark Suite: Systematically evaluating the factuality of large language models

Medium priority · technical post · Gemini / Gemma · Published Dec 9

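AI summary Google DeepMind has announced the FACTS Benchmark Suite, a set of benchmarks for systematically evaluating the factuality of LLMs. Alongside FACTS Grounding, which measures the factuality and grounding of long-form responses, the suite adds new evaluation axes, providing a framework for examining model hallucination from multiple angles.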

EN Systematically evaluating the factuality of large language models with the FACTS Benchmark Suite.

#benchmark #deepmind #google +4

deepmind.google →



Entries page 2/2 · 32 total

Advancing science and math with GPT-5.2

FACTS Benchmark Suite: Systematically evaluating the factuality of large language models