Local LLM Comparison: gpt-oss vs DiffusionGemma vs Qwen3.5 (tok/s Is Not How Fast the Work Gets Done)

Zenn LLM tag · zenn.dev · 2026/07/03 19:00

AI 3-line summary

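The article compares gpt-oss, DiffusionGemma, and Qwen3.5 in a local environment and tests the claim that tok/s, a pure speed metric, cannot measure the quality of real work, offering a new perspective on model selection.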

English summary

A three-way local LLM shootout of gpt-oss, DiffusionGemma, and Qwen3.5 finds that tok/s is a misleading benchmark—practical output quality and task accuracy are equally important for real-world usefulness.

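When people compare large language models (LLMs) that run locally, the first thing many look at is tok/s, the number of tokens generated per second. But a model with a high tok/s figure is not necessarily one that "works fast." An article posted on Zenn runs three models locally (gpt-oss, DiffusionGemma, and Qwen3.5), compares them, and tests the claim that a speed metric alone cannot measure the quality of actual work.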

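The three models under comparison differ in character. gpt-oss is an open-weight model released by OpenAI, and Qwen3.5 belongs to the Qwen series developed by Alibaba. DiffusionGemma is described as being based on Google's Gemma while generating text with a diffusion approach. In contrast to the mainstream autoregressive approach, which predicts tokens one at a time in order, a diffusion model refines multiple tokens together in parallel. This difference in generation method makes a simple tok/s comparison difficult.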

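The article's main point is that a high tok/s is meaningless unless the quality of the completed task keeps up. For example, a reasoning-focused model consumes a large number of intermediate tokens before it reaches an answer, so even with a high apparent generation speed it can rank differently on total time to a correct answer or on accuracy. Conversely, a model that returns an accurate output in few tokens can be efficient in practice even if its tok/s looks weaker.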

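This perspective is instructive for practitioners choosing a local LLM. With the recent spread of tools such as Ollama, LM Studio, and llama.cpp, switching among multiple models on a personal PC has become easy. Advances in quantization (a technique for making models lighter) have also rapidly widened the options. That is exactly why users need to weigh quality, accuracy, and speed against their own use case rather than rely on a single benchmark number.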

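The article reports a test in one specific environment, and results may change depending on hardware and settings. Even so, as a warning against the pitfall of equating tok/s with "speed of work," it offers a new perspective on model selection.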

Local large language models have become practical enough that individual developers now routinely run them on consumer hardware, and a common question follows: which model is fastest? A recent Zenn blog post pushes back on that framing by comparing three locally deployable models—gpt-oss, DiffusionGemma, and Qwen3.5—and arguing that tokens per second (tok/s), the headline metric in most benchmarks, is a poor proxy for how quickly a model actually finishes useful work.

The premise is straightforward. Tok/s measures raw generation throughput: how many tokens a model can emit each second once it starts producing output. That number is easy to capture and easy to compare, which is exactly why it dominates leaderboards and marketing. But the author notes that throughput says little about whether the resulting text is correct, complete, or usable without revision. A model that streams 80 tokens per second yet produces code that fails to compile can be slower, in wall-clock terms, than one that emits 30 tokens per second and gets the answer right the first time.
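The 80 tok/s versus 30 tok/s comparison above can be made concrete with a little arithmetic. The sketch below is an illustration rather than the article's method: it assumes every attempt emits the same number of tokens and succeeds independently with a fixed probability, so the expected number of attempts is 1 / p. The function name, the 400-token answer length, and the success rates are all hypothetical.

```python
def expected_seconds_to_accepted_answer(tok_per_s, tokens_per_attempt, p_success):
    """Expected generation time until the first usable answer.

    Assumption: attempts are independent and equally long, so the number
    of attempts follows a geometric distribution with mean 1 / p_success.
    """
    if not 0 < p_success <= 1:
        raise ValueError("p_success must be in (0, 1]")
    seconds_per_attempt = tokens_per_attempt / tok_per_s
    return seconds_per_attempt / p_success


# Hypothetical numbers, not measurements from the article.
fast_but_flaky = expected_seconds_to_accepted_answer(80, 400, 0.3)
slow_but_right = expected_seconds_to_accepted_answer(30, 400, 1.0)

print(f"80 tok/s, 30% usable:  {fast_but_flaky:.1f} s expected")
print(f"30 tok/s, 100% usable: {slow_but_right:.1f} s expected")
```

Under these assumed numbers the slower model finishes first on average. The model ignores prompt processing and the human time spent checking each attempt, both of which would widen the gap further.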

The three models also illustrate why the comparison is not apples-to-apples. gpt-oss belongs to the family of open-weight reasoning-oriented models, which tend to spend many "thinking" tokens before delivering a final answer. Those intermediate tokens count toward tok/s and total generation time even though the user never reads them directly, so a high throughput figure can coexist with a long time-to-answer. Qwen3.5, from the widely used Qwen lineage, represents a conventional autoregressive decoder that generates one token at a time. DiffusionGemma is the most unusual of the three: it is described as a diffusion-based text model that refines an entire sequence in parallel across denoising steps rather than left-to-right. For diffusion models the very notion of tokens per second is ambiguous, because output does not arrive in the same sequential stream, which further undermines cross-architecture comparison on that single axis.
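The effect of hidden "thinking" tokens on time-to-answer can be sketched in a few lines. This is a simplified model under stated assumptions: decode speed is constant, prompt processing is ignored, and the token counts are invented for illustration. The `Run` class and its field names are hypothetical, not part of any inference library.

```python
from dataclasses import dataclass


@dataclass
class Run:
    tok_per_s: float       # raw decode throughput reported by the runtime
    thinking_tokens: int   # intermediate reasoning tokens the user does not read
    answer_tokens: int     # tokens in the final, visible answer

    @property
    def seconds_to_answer(self) -> float:
        # Every generated token costs time, whether or not it is shown.
        return (self.thinking_tokens + self.answer_tokens) / self.tok_per_s

    @property
    def answer_tok_per_s(self) -> float:
        # Throughput counted over useful output only.
        return self.answer_tokens / self.seconds_to_answer


# Hypothetical runs: a fast reasoning model versus a slower direct one.
reasoning = Run(tok_per_s=100.0, thinking_tokens=1500, answer_tokens=200)
direct = Run(tok_per_s=40.0, thinking_tokens=0, answer_tokens=200)

for name, run in (("reasoning", reasoning), ("direct", direct)):
    print(f"{name}: {run.seconds_to_answer:.1f} s to answer, "
          f"{run.answer_tok_per_s:.1f} answer tok/s")
```

With these made-up figures the model with 2.5 times the raw throughput takes more than three times as long to deliver the same 200-token answer. Whether the extra reasoning buys a better answer is a separate question that throughput cannot settle.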

The post suggests several metrics that track real productivity more closely than raw tok/s. Time to first token matters for interactive use, where perceived responsiveness shapes the experience. Total time to a correct, accepted answer—sometimes framed as "goodput" rather than throughput—captures the cost of retries and corrections. And task-level accuracy, measured on the kind of prompts a user actually issues, determines how often fast output has to be discarded. When reasoning models are involved, counting only the final answer tokens, or measuring end-to-end latency, can produce a very different ranking than counting every generated token.
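As a rough illustration of how these metrics might be captured, the sketch below times any iterable of streamed tokens and computes goodput from a count of accepted answers. It is not tied to a specific runtime: `time_stream`, `StreamTiming`, and `goodput` are hypothetical helper names, and the stream is assumed to yield one item per token as the model produces it.

```python
import time
from typing import Callable, Iterable, NamedTuple


class StreamTiming(NamedTuple):
    ttft_s: float    # time to first token
    total_s: float   # end-to-end latency for the whole response
    tokens: int      # number of streamed tokens observed


def time_stream(stream: Iterable[str],
                clock: Callable[[], float] = time.perf_counter) -> StreamTiming:
    """Consume a token stream and record when the first and last tokens arrive."""
    start = clock()
    first_arrival = None
    count = 0
    for _ in stream:
        now = clock()
        if first_arrival is None:
            first_arrival = now
        count += 1
    end = clock()
    ttft = first_arrival - start if first_arrival is not None else float("nan")
    return StreamTiming(ttft, end - start, count)


def goodput(accepted_answers: int, total_seconds: float) -> float:
    """Accepted answers per second, as opposed to tokens per second."""
    return accepted_answers / total_seconds


def fake_stream():
    # Stand-in for a real model: a slow first token, then two quick ones.
    time.sleep(0.05)
    yield "Hello"
    for token in (",", " world"):
        time.sleep(0.01)
        yield token


timing = time_stream(fake_stream())
print(timing)
```

The `clock` parameter is injectable so the helper can be tested deterministically. For a reasoning model, the same helper could be applied to the answer portion of the stream only, which is one way to obtain the alternative ranking the paragraph mentions.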

Some context helps explain why these distinctions surface now. Running models locally usually means going through inference stacks such as llama.cpp, Ollama, LM Studio, or vLLM, and applying quantization—reducing weight precision to formats like 4-bit GGUF—to fit models into limited VRAM. Quantization changes both speed and quality, so the same model can post very different tok/s and accuracy figures depending on how it is packaged. Hardware, context length, and batch size add further variance. All of this means a single benchmark number rarely transfers cleanly from one setup to another.
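To see why quantization decides whether a model fits at all, a back-of-the-envelope estimate of weight memory helps. The sketch below assumes that weight storage is simply parameter count times bits per weight; it ignores the KV cache, activations, and runtime overhead, and real quantized formats usually store extra scale metadata, so effective bits per weight run somewhat higher than the nominal figure. The 7-billion-parameter size is a hypothetical example, not one of the article's models.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in GiB."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 2**30


# Hypothetical 7-billion-parameter model at three nominal precisions.
N_PARAMS = 7e9
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: about {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Going from 16-bit to 4-bit cuts the weight footprint to a quarter, which is often the difference between a model that fits in VRAM and one that spills to system memory. That spill, in turn, can change tok/s far more than the model architecture does.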

The broader industry appears to be moving in the direction the article describes. Diffusion-based language models, including research systems and commercial efforts that emphasize parallel generation, are being promoted partly on speed, which makes fair comparison against autoregressive models a live issue. Meanwhile, the spread of open-weight releases—OpenAI's gpt-oss, Alibaba's Qwen, and Google's Gemma among them—has given local users genuine choice and, with it, the need for evaluation methods that reflect their own workloads rather than generic leaderboards.

The article stops short of crowning a winner, which is consistent with its argument: the "best" local model likely depends on the task, the tolerance for errors, and whether latency or throughput dominates the user's workflow. Its practical recommendation is to benchmark against representative prompts and to weigh output quality alongside speed, since a model that is fast but frequently wrong is likely to cost more total time than its tok/s figure suggests. For anyone selecting a local LLM, the piece is a useful reminder that the most quotable metric is not necessarily the most decision-relevant one.
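The recommendation to benchmark against representative prompts could look something like the sketch below. It is a minimal harness under stated assumptions, not the article's tooling: a "model" is any callable from prompt to text, each case pairs a prompt with a user-supplied acceptance check, and the two stand-in models are toy functions rather than real LLM calls. All names here are hypothetical.

```python
import time
from typing import Callable, Dict, Sequence, Tuple

# A case is a prompt plus a function that decides whether an output is acceptable.
Case = Tuple[str, Callable[[str], bool]]


def benchmark(model: Callable[[str], str],
              cases: Sequence[Case],
              clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Run every case once and report accuracy alongside time per correct answer."""
    if not cases:
        raise ValueError("at least one case is required")
    correct = 0
    start = clock()
    for prompt, is_acceptable in cases:
        if is_acceptable(model(prompt)):
            correct += 1
    elapsed = clock() - start
    return {
        "accuracy": correct / len(cases),
        "seconds": elapsed,
        "seconds_per_correct": elapsed / correct if correct else float("inf"),
    }


# Representative prompts for a hypothetical workload.
cases = [
    ("What is 2 + 2?", lambda out: out.strip() == "4"),
    ("Name the capital of France.", lambda out: "Paris" in out),
]


def quick_model(prompt: str) -> str:
    # Toy stand-in: answers instantly but always says "4".
    return "4"


def careful_model(prompt: str) -> str:
    # Toy stand-in: slower, but answers both prompts correctly.
    time.sleep(0.02)
    return "4" if "2 + 2" in prompt else "Paris"


for name, model in (("quick", quick_model), ("careful", careful_model)):
    print(name, benchmark(model, cases))
```

Reporting seconds per correct answer rather than raw speed puts a number on the article's point: a model that is fast but frequently wrong pays for its errors in total time. In real use the acceptance checks would be the expensive part to write, since they encode what "usable" means for a given workload.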