#inference-optimization — TECH Dashboard

NEW blog local-llm 1h ago ·

zenn-llm

Why Is Speculative Decoding Fast? Verifying It with a Toy Model

AI summary: This article uses a simple toy model to examine why speculative decoding speeds up LLM inference. It shows experimentally the effect of having a small draft model propose tokens ahead of time while a larger model verifies them in parallel, and explains the theoretical background.

#llm #zenn #speculative-decoding #inference-optimization

zenn.dev →
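The draft-then-verify loop described in the summary can be sketched as follows. This is a minimal illustration and not the article's own toy model: `target_next`, `draft_next`, the token rules, and the block size `k` are all hypothetical, and the "parallel" verification is only simulated by counting one target pass per round. It covers the greedy case only, where the speculative output matches plain greedy decoding token for token.

```python
def target_next(ctx):
    # Hypothetical "large" model: tokens cycle 0, 1, 2, 3, 4, 0, ...
    return (ctx[-1] + 1) % 5


def draft_next(ctx):
    # Hypothetical "small" model: same rule, but wrong whenever the
    # last token is 3 (it guesses 0 where the target would emit 4).
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 5


def greedy_decode(prompt, n):
    """Baseline: one target pass per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out, n


def speculative_decode(prompt, n, k=4):
    """Draft proposes k tokens; target checks them in one simulated pass."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(out)
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target scores all k + 1 positions. A real model would do
        #    this in a single batched forward pass, so count it once.
        passes += 1
        verified = [target_next(out + proposal[:i]) for i in range(k + 1)]
        # 3. Accept the longest prefix on which draft and target agree.
        accepted = 0
        while accepted < k and proposal[accepted] == verified[accepted]:
            accepted += 1
        out.extend(proposal[:accepted])
        # 4. Append the target's own token: a correction at the first
        #    mismatch, or a bonus token if every draft token was accepted.
        out.append(verified[accepted])
    return out[: len(prompt) + n], passes


if __name__ == "__main__":
    baseline, baseline_passes = greedy_decode([0], 20)
    fast, fast_passes = speculative_decode([0], 20, k=4)
    print(baseline == fast, baseline_passes, fast_passes)
```

Each round emits at least one target-approved token, so the number of target passes never exceeds the baseline. The saving grows with how often the draft agrees with the target.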

NEW paper research 7h ago ·

arxiv-cs-ai

Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

AI summary: To reduce Transformer inference cost, this paper proposes Stochastic KV Routing, a method for sharing the KV cache across layers. Stochastic routing lets each token dynamically reference the caches of different layers, which gives adaptive depth-wise sharing while maintaining model quality.

#arxiv #paper #kv-cache #transformer

arxiv.org →
