#reproducibility — TECH Dashboard

Entries page 1/1 · 5 total

YESTERDAY 1 entry

NEW blog claude 17h ago ·

zenn-claude

Does OpenMythos Reproduce Fable 5 / Mythos 5?

Medium priority · technical post · Claude / Claude Code · Published Jun 29

AI summary: This article empirically examines how faithfully OpenMythos, an open-source implementation, reproduces the behavior and capabilities of Fable 5 / Mythos 5, identifying performance gaps and reproduction limitations. The findings bear on the broader question of whether proprietary AI systems can be faithfully replicated through open-source efforts.

#claude #zenn #open-source +4

zenn.dev →


Fri, Jun 5 1 entry

paper research 3w ago ·

arxiv-cs-se

DeployBench: Benchmarking LLM Agents for Research Artifact Deployment

Medium priority · paper/research · Papers / Benchmarks · Published Jun 5

AI summary: DeployBench benchmarks LLM agents on getting research artifacts running from scratch. It measures environment-setup ability, a gap left by prior SE and ML research benchmarks, which assume pre-configured working environments.

#arxiv #paper #llm-agents +5

arxiv.org →


Tue, May 26 1 entry

blog local-llm 4w ago ·

zenn-llm

Reproducing Gemma 4's MMLU-Pro Score on NVIDIA B200: A Step-by-Step Guide

Medium priority · technical post · Local LLM / Open Models · Published May 26

AI summary: A practical step-by-step guide to locally reproducing Google Gemma 4 31B-IT's claimed ~85.2% MMLU-Pro score on NVIDIA B200 hardware using lm_eval, covering practical pitfalls beyond a single command.

#llm #open-model #zenn +6

zenn.dev →


Thu, May 7 1 entry

NEW blog local-llm 1mo ago ·

huggingface-blog

vLLM V0 to V1: Correctness Before Corrections in RL

Medium priority · technical post · Local LLM / Open Models · Published May 7

AI summary: ServiceNow AI examined numerical discrepancies and reproducibility issues that arose in RL training when migrating vLLM from V0 to V1, stressing the need to verify the correctness of logit computation and batching before applying corrections.

#huggingface #open-model #vllm +7

huggingface.co →


Wed, Feb 4 1 entry

NEW blog local-llm 4mo ago ·

huggingface-blog

Community Evals: Because we're done trusting black-box leaderboards over the community

Medium priority · technical post · Local LLM / Open Models · Published Feb 4

AI summary: Hugging Face launched Community Evals, a community-driven LLM evaluation platform that aims to build an open, transparent, and reproducible evaluation ecosystem as an alternative to opaque black-box leaderboards.

#huggingface #open-model #llm-evaluation +7

huggingface.co →

