DSG decouples search from the LLM: DoorDash cuts search costs by more than 90 percent
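- The prompt insists: "Return only the final answer, with no explanation."
- Yet the moment the model is given a web search tool, it starts narrating: "According to the search results, 200 West Street is 749 feet and 888 7th Avenue is 628 feet, so..."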
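Large language models (LLMs) equipped with search tools tend to generate long explanations nobody asked for, and the resulting cost has become a burden that cannot be ignored. DoorDash has reportedly cut its search-related costs by more than 90 percent with an approach called "DSG" that rethinks the problem at the structural level.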
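The starting point is simple. Even when the prompt insists, "Return only the final answer, with no explanation," the moment a web search tool is handed over, the model starts narrating at length: "According to the search results, 200 West Street is 749 feet and 888 7th Avenue is 628 feet, so..." Because it reads through the search results one by one and regenerates their content as prose, output tokens swell, and latency and cost rise with them. In agentic configurations, search and reasoning tend to become tightly coupled, and this tendency is expected to grow stronger.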
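The basic idea behind DSG is to decouple the search step from the LLM's generation process. Rather than having the model "narrate the results," data retrieval is left to an external search and retrieval process, and the LLM returns only a minimal answer based on pre-formatted information. By suppressing verbose explanations, the approach aims to cut the volume of output tokens itself and thereby reduce costs substantially. This division of labor between search and model calls can also be seen as an evolution of recent RAG (retrieval-augmented generation) practice.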
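The division of labor described above can be sketched in a few lines. This is a minimal illustration, not DoorDash's implementation: the in-memory `BUILDING_HEIGHTS_FT` index, the function names, and the `fake_llm` stand-in are all hypothetical, and a real system would call a search backend and a model API instead.

```python
from typing import Callable

# Hypothetical in-memory "search index"; a real system would query a search backend.
BUILDING_HEIGHTS_FT = {
    "200 West Street": 749,
    "888 7th Avenue": 628,
}


def retrieve_heights(names: list[str]) -> dict[str, int]:
    """Retrieval step: plain code, no LLM involved."""
    return {name: BUILDING_HEIGHTS_FT[name] for name in names}


def build_prompt(question: str, facts: dict[str, int]) -> str:
    """Give the model compact, pre-formatted facts instead of raw search pages."""
    fact_lines = "\n".join(f"{name}: {height} ft" for name, height in facts.items())
    return f"Facts:\n{fact_lines}\n\nQuestion: {question}\nReply with the answer only."


def answer(question: str, names: list[str], llm: Callable[[str], str]) -> str:
    """Retrieve first, then ask the model for a minimal answer."""
    facts = retrieve_heights(names)
    return llm(build_prompt(question, facts)).strip()


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call (assumption: the model returns a short string).
    return "200 West Street\n"
```

Because the model never sees raw search output, there is far less retrieved text for it to restate. In practice an output-token cap on the model call would add a second safeguard.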
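Behind this lies the growing standardization of tool connections. Mechanisms for connecting external data and tools to LLMs, starting with the Model Context Protocol (MCP) proposed by Anthropic, have spread widely, but a side effect has also surfaced: the more tools are added, the more verbose responses become and the more costs balloon. OpenAI and Google are likewise expanding function calling and agent-oriented features, and holding down token usage while preserving accuracy has become a shared challenge.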
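For a service like DoorDash that generates a large volume of search requests, a small per-request difference in tokens adds up to a large difference in total spend. A reduction of more than 90 percent is likely the effect not of prompt tuning alone but of an architectural shift that separates the responsibilities of generation and search. The same technique could plausibly be applied to cost optimization at other companies and is likely to draw attention as operational know-how.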
Large language models are increasingly expected to call external tools, and web or catalog search is one of the most common. Yet a recurring inefficiency undermines that pattern: even when explicitly told to return only a final answer, a model handed a search tool tends to narrate its reasoning, quoting intermediate findings before reaching a conclusion. A technique described as DSG, reportedly used by DoorDash, aims to address this by separating search from the language model, and the company says the change cut search costs by more than 90 percent.
The problem is easy to demonstrate. A prompt may insist, "return the final answer only, with no explanation," but the moment a web search capability is attached, the model begins to explain itself: "According to the search results, 200 West Street is 749 feet and 888 7th Avenue is 628 feet, so..." Every one of those words consumes tokens, and tokens cost money and add latency. When a system processes large search payloads on each query, the model effectively re-reads and re-summarizes substantial volumes of text, multiplying the bill across millions of requests.
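To make the arithmetic concrete, here is a small worked example. Every number in it is a made-up assumption chosen for illustration (token counts, request volume, and price per million output tokens); none comes from DoorDash or from any vendor's price list.

```python
def output_cost_usd(tokens_per_response: int, requests: int, usd_per_million_tokens: float) -> float:
    """Total output-token cost for a batch of requests."""
    return tokens_per_response * requests * usd_per_million_tokens / 1_000_000


# All values below are hypothetical.
REQUESTS = 10_000_000          # requests in the billing period
PRICE_PER_MILLION = 10.0       # USD per million output tokens
VERBOSE_TOKENS = 60            # a narrated answer that restates the search results
TERSE_TOKENS = 5               # the final answer only

verbose_cost = output_cost_usd(VERBOSE_TOKENS, REQUESTS, PRICE_PER_MILLION)
terse_cost = output_cost_usd(TERSE_TOKENS, REQUESTS, PRICE_PER_MILLION)
savings_fraction = 1 - terse_cost / verbose_cost
```

With these invented inputs, the narrated variant costs $6,000 and the terse variant $500 on output tokens alone. The example ignores input tokens, which matter at least as much when large search payloads are fed into the model.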
DSG appears to describe an approach that decouples the search-and-retrieval stage from generation, rather than asking a single model to both fetch and verbalize results in one expensive pass. In that design, search is handled by a lighter, more deterministic component, and the language model is invoked only for the parts that genuinely require natural-language synthesis. The intermediate reasoning that models like to spell out is suppressed or never sent through the costly generation path, so the model is not paid to repeat the contents of the search index back to itself. The reported reduction of more than 90 percent suggests that, at the scale of a platform like DoorDash, most search-related token spend was going to redundant narration and bulk text processing rather than to actual answer formation.
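One way to picture "invoke the model only when synthesis is needed" is a small router in front of the model. The sketch below is an assumption about how such a layer could look, not a description of DSG's internals; `CATALOG`, `try_deterministic_answer`, and `handle` are hypothetical names, and the keyword check is deliberately naive.

```python
from typing import Callable, Optional

# Hypothetical structured index; heights are in feet.
CATALOG = {"200 West Street": 749, "888 7th Avenue": 628}


def try_deterministic_answer(query: str) -> Optional[str]:
    """Answer simple 'taller' comparisons from structured data, with no model call."""
    lowered = query.lower()
    mentioned = [name for name in CATALOG if name.lower() in lowered]
    if "taller" in lowered and len(mentioned) == 2:
        return max(mentioned, key=CATALOG.__getitem__)
    return None  # Not answerable deterministically.


def handle(query: str, llm: Callable[[str], str]) -> tuple[str, bool]:
    """Return (answer, used_llm). The model is a fallback, not the default path."""
    direct = try_deterministic_answer(query)
    if direct is not None:
        return direct, False
    return llm(query), True
```

Tracking the `used_llm` flag across traffic would show what share of queries still reaches the expensive path.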
This sits within the broader shift toward agentic and tool-using systems. Retrieval-augmented generation, or RAG, became standard practice because it lets models ground answers in current data they were not trained on. Search tools extend that idea, letting an agent decide what to look up and when. But the same flexibility creates cost exposure: each tool call may return long documents, and an unconstrained model will fold them into verbose chains of thought. Engineering teams have learned that controlling what enters and exits the model is often more impactful than swapping in a larger model.
The Model Context Protocol, or MCP, is relevant context. MCP standardizes how models connect to tools, data sources, and search backends, making it easier to attach capabilities like web lookup. That convenience is precisely why cost discipline matters: when adding a search tool is trivial, the volume of tokens flowing through pricey models can grow quickly. Techniques that move retrieval out of the model loop, cache results, or restrict output to compact answers are complementary to MCP rather than competing with it. DSG, as described, looks like one such cost-control layer rather than a replacement for tool calling.
For context, others have pursued similar goals through different means, including structured tool outputs, constrained decoding, smaller specialized models for routing, and aggressive prompt engineering to limit explanatory text. None of these fully solves the tendency of models to "think out loud," which is why architectural separation can be more reliable than prompting alone. A prompt instruction can be ignored; an architecture that never feeds raw search text into the expensive generation step cannot drift in the same way.
The figures here come from a single company's reporting and should be read as a directional result rather than a universal benchmark. Savings of that magnitude likely reflect a workload where search was both frequent and the dominant cost driver. Even so, the underlying lesson is broadly applicable: as more products bolt search onto LLMs, the cheapest token is the one never generated. Teams evaluating agentic search would do well to measure how much of their spend is producing answers versus restating retrieved data, then decide whether decoupling search from generation is worth the added system complexity.
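A rough way to start that measurement is to count how many words of a model response already appear in the retrieved text. The sketch below is a crude proxy under stated assumptions: it compares lowercase alphanumeric words rather than real tokenizer tokens, and `restated_words` is a hypothetical helper name. Because a short correct answer also overlaps with the retrieved data, the absolute counts are more informative than the ratio.

```python
import re


def _words(text: str) -> list[str]:
    """Lowercase alphanumeric words; a stand-in for real tokenizer output."""
    return re.findall(r"[a-z0-9]+", text.lower())


def restated_words(response: str, retrieved: str) -> tuple[int, int]:
    """Return (restated, total): response words that also occur in the retrieved text."""
    response_words = _words(response)
    retrieved_vocab = set(_words(retrieved))
    restated = sum(1 for word in response_words if word in retrieved_vocab)
    return restated, len(response_words)
```

Comparing these counts between a narrated response and an answer-only response on the same queries gives a first estimate of how much output is restatement.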
The body text and summary on this page were generated automatically by AI. Please check the original article (qiita.com) for accuracy.