The free executor turned out to be the most expensive: why Opus + local Qwen came out as the highest-cost configuration on every task

Zenn LLM tag · zenn.dev · 2026/07/03 18:46

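AI 3-line summary

Adopting a free local Qwen as the executor raised the task failure rate, which increased calls to the Claude Opus API and, across 40 trials, made this the most expensive of all configurations tested, a paradoxical result.
The case prompts a rethink of how the choice of model pairing affects total cost.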

English summary

Pairing Claude Opus with a free local Qwen executor paradoxically yielded the highest costs across 40 trials, showing that low-capability executors can silently inflate API spending.

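The intuition that "free means cheap" does not always hold in agentic LLM setups. A test write-up posted on Zenn reports a paradoxical result: a configuration with Claude Opus as the planner, paired with a free, locally run Qwen as the executor, was the most expensive of all patterns tested across 40 trials.

Behind the test is the "planner-executor" style of workflow that has spread in recent years. The idea is to hand task decomposition and judgment to an expensive but capable model and route the simple execution steps to a cheap or free model, keeping overall spend down. Qwen, an open model that runs locally, incurs no API charges, so adopting it as the executor is expected to bring execution cost close to zero.

In practice, however, the executor's lack of capability pushed up the task failure rate. When the executor cannot carry out instructions correctly, Claude Opus, as the planner, has to reassess the situation and issue corrections or retry instructions. Each such redo adds API calls to Opus, and usage of the most expensive model appears to have ballooned as a result. The supposedly free local execution ended up triggering repeated calls to the billed upstream model.

The phenomenon shows that total cost has to be measured not by price per call but by the total call volume needed to complete a task. Even with a low-priced component in the pipeline, a higher overall failure rate means more work for the expensive component, and the result can end up costing more.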

🏠 Local LLM / Open Models · Key points of this article

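Qwen is a family of open models published by Alibaba, often cited as an option for local inference and cost optimization. Similar hybrid configurations have also been tried with high-performance models from OpenAI or Google at the top and lightweight open models below, and the "compatibility" between models and the design of their division of roles are becoming factors that determine outcomes.

The results rest on specific tasks and a limited 40 trials, and cannot be generalized to every situation. Adjusting the size of the executor model or the prompt design could produce a different trend. Still, the lesson that free or low-priced models should be evaluated as a whole, including the risk that failure-driven retries spill over into upstream cost, seems to apply to many production deployments.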

A recent write-up on Zenn explores a counterintuitive finding for anyone building agentic LLM systems: pairing a premium planning model with a free, locally hosted executor can end up costing more than a leaner setup. The author combined Anthropic's Claude Opus as an orchestrator with a locally run Qwen model acting as the task executor, and across 40 trials this configuration produced the highest total spend of any setup tested. The result matters because the "let a cheap or free model do the grunt work" pattern is widely assumed to reduce costs, and this case suggests that assumption can quietly break down.

The architecture in question is a common one. In many agentic pipelines, a capable model plans, decomposes work, and reviews results, while a smaller or cheaper model carries out individual steps. The appeal is obvious: the expensive model is invoked sparingly for high-level reasoning, and the bulk of token throughput is handled by something cheaper. When the executor runs locally on open weights such as Qwen, its marginal API cost is effectively zero, so on paper the combination looks like the frugal choice.

The problem, according to the report, is that cost is not driven by per-token price alone but by how often the system has to retry, re-plan, and re-verify. When a lower-capability executor fails a task or produces output that does not pass the orchestrator's checks, the workflow loops back to the expensive planner. Each failure appears to trigger additional Opus calls to diagnose the problem, adjust instructions, and re-issue the task. Those extra planning and review turns accumulate, and because Opus sits at the premium end of Anthropic's pricing, even a modest increase in call volume can dominate the bill. The free executor, in other words, appears to have shifted spending onto the most expensive component rather than removing it.
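The loop described above can be sketched as a small simulation. This is a toy model, not the report's setup: it assumes the planner is called once per attempt to issue or re-issue instructions, that each executor attempt succeeds independently with a fixed probability, and that retries are capped. The function name, the success rates, and the cap are all hypothetical; the trial count of 40 only echoes the report and does not reproduce its data.

```python
import random


def simulate_planner_calls(p_success, trials, max_attempts, seed=0):
    """Count paid planner calls and completed tasks over `trials` tasks.

    Assumed loop (hypothetical): one planner call per attempt, then the
    executor succeeds with probability `p_success`.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same counts
    planner_calls = 0
    completed = 0
    for _ in range(trials):
        for _ in range(max_attempts):
            planner_calls += 1  # plan or re-plan turn on the paid model
            if rng.random() < p_success:
                completed += 1
                break  # task done, no further planner turns needed
    return planner_calls, completed


# Hypothetical success rates for a strong and a weak executor.
strong = simulate_planner_calls(p_success=0.9, trials=40, max_attempts=5)
weak = simulate_planner_calls(p_success=0.3, trials=40, max_attempts=5)
print("strong executor (planner calls, completed):", strong)
print("weak executor   (planner calls, completed):", weak)
```

In this model the executor never appears on the bill, yet its success rate alone determines how many planner calls are made.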

The dynamics echo a broader lesson in cost optimization for LLM applications: total cost of ownership depends on the end-to-end success rate, not the sticker price of any single model. A capable executor that finishes a task in one pass may generate fewer overall tokens than a weak one that needs three attempts, each accompanied by supervisory overhead from the orchestrator. This is sometimes framed as the difference between unit cost and effective cost per completed task.
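The gap between unit cost and effective cost can be made concrete with a back-of-the-envelope calculation. The sketch below assumes independent attempts with unlimited retries, each attempt costing one planner turn plus one executor run, so the expected number of attempts is 1 / p_success. The prices and success rates are invented for illustration and are not taken from the report or from any price list.

```python
def cost_per_completed_task(p_success, planner_cost, executor_cost):
    """Expected spend per completed task under a simple retry model.

    Assumes every attempt costs `planner_cost + executor_cost` and succeeds
    independently with probability `p_success` (unlimited retries).
    """
    if not 0 < p_success <= 1:
        raise ValueError("p_success must be in (0, 1]")
    return (planner_cost + executor_cost) / p_success


# Hypothetical numbers, for illustration only.
free_weak = cost_per_completed_task(
    p_success=0.25, planner_cost=0.10, executor_cost=0.0
)
paid_strong = cost_per_completed_task(
    p_success=0.80, planner_cost=0.10, executor_cost=0.02
)
print(f"free but weak executor:   {free_weak:.2f} per completed task")
print(f"paid but strong executor: {paid_strong:.2f} per completed task")
```

With these made-up inputs the free executor costs 0.40 per completed task against 0.15 for the paid one, because the planner's price is paid four times per success instead of 1.25 times.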

Some background helps frame the trade-off. Qwen is Alibaba's family of open-weight models, available in a range of sizes and popular for local deployment because it can be run on consumer or workstation hardware without per-request fees. Claude Opus is the largest and most expensive tier in Anthropic's Claude line, positioned for complex reasoning. Local inference is not truly free either; it carries hardware, memory, and energy costs that a pure API accounting would miss, though those were apparently not the deciding factor here. The headline cost driver was the escalation in paid Opus usage.

There are several ways such a setup could be tuned, and the report implicitly points to them. Choosing a stronger local model, or a larger Qwen variant, might raise the executor's single-pass success rate enough to shorten the retry loop. Alternatively, using a lower-priced hosted model such as Claude Sonnet or Haiku for planning could lower the cost of each supervisory turn even if the number of turns stays high. Capping retries, adding cheaper validation steps before escalating, and matching executor capability to task difficulty are other common mitigations.
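Two of these mitigations, capping retries and validating cheaply before escalating, fit in a few lines of control flow. The sketch below is one possible shape, not the report's implementation. All four callables are hypothetical stand-ins: `execute` for the local executor, `cheap_check` for an inexpensive validator such as a schema check or unit test, and `escalate` for a paid planner call that rewrites the instructions.

```python
from typing import Callable, Optional


def run_with_guardrails(
    task: str,
    execute: Callable[[str], str],
    cheap_check: Callable[[str], bool],
    escalate: Callable[[str, str], str],
    max_retries: int = 2,
) -> Optional[str]:
    """Run `task`, calling the paid planner only after a cheap check fails."""
    instructions = task
    for attempt in range(max_retries + 1):
        output = execute(instructions)
        if cheap_check(output):
            return output  # accepted without a planner review
        if attempt < max_retries:
            # Paid re-plan: the planner sees the failed output and revises.
            instructions = escalate(instructions, output)
    return None  # retry budget exhausted; give up instead of looping


# Stub callables so the sketch runs without any model behind it.
planner_calls = {"count": 0}


def fake_execute(instructions: str) -> str:
    return "ok" if "hint" in instructions else "bad"


def fake_check(output: str) -> bool:
    return output == "ok"


def fake_escalate(instructions: str, output: str) -> str:
    planner_calls["count"] += 1
    return instructions + " hint"


result = run_with_guardrails("do x", fake_execute, fake_check, fake_escalate)
print(result, planner_calls["count"])
```

The cap bounds the worst-case planner spend per task at `max_retries` calls, at the price of leaving some tasks unfinished.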

The findings should be read with caution. Forty trials on a particular task mix is a small sample, and outcomes will vary with the prompts, the task types, the specific Qwen size used, and the pricing in effect at the time. The result does not show that local executors are generally uneconomical, only that a capability mismatch between planner and executor can invert the expected savings. It is also worth noting that the balance is likely to move as open-weight models improve, since a more reliable executor would reduce the failed handoffs that drove the cost.

As hybrid local-plus-cloud designs become more common, the practical takeaway is to measure cost per successful task across the whole pipeline rather than reasoning from per-token rates or the word "free." An executor with no invoice can still be the most expensive part of a system if it makes the premium model work harder.
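Measuring this in practice only requires logging spend and outcome per trial. The sketch below assumes a hypothetical record format (`TrialRecord` and its fields are not from the report) and computes total spend divided by successful trials for each configuration, so that failed trials still count toward the bill. The sample records are made up.

```python
from dataclasses import dataclass


@dataclass
class TrialRecord:
    config: str           # configuration label (illustrative)
    succeeded: bool       # whether the task was completed
    planner_cost: float   # paid API spend on the planner for this trial
    executor_cost: float  # 0.0 for a local model if hardware is not counted


def cost_per_success(records):
    """Return {config: total spend / number of successful trials}."""
    totals, wins = {}, {}
    for r in records:
        totals[r.config] = totals.get(r.config, 0.0) + r.planner_cost + r.executor_cost
        wins[r.config] = wins.get(r.config, 0) + int(r.succeeded)
    # A configuration with no successes has unbounded cost per success.
    return {
        c: (totals[c] / wins[c] if wins[c] else float("inf"))
        for c in totals
    }


# Made-up records, not data from the report.
log = [
    TrialRecord("planner+free-executor", False, 0.30, 0.0),
    TrialRecord("planner+free-executor", True, 0.50, 0.0),
    TrialRecord("planner+paid-executor", True, 0.10, 0.05),
    TrialRecord("planner+paid-executor", True, 0.10, 0.05),
]
for config, cost in cost_per_success(log).items():
    print(f"{config}: {cost:.2f} per successful task")
```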