Why does GRPO collapse in long training runs? Qwen's "sequence-level" answer: GSPO

Zenn LLM tag · zenn.dev · 2026/06/01 21:28 · 2w ago · 📖 2 min

AI 3-line summary

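This is a deep-dive article from "AI Watch", which explains cutting-edge AI in Japanese, down to the technical details.
It is written from primary sources (arXiv 2507.18071 / the official Qwen blog).
Over the past year, reinforcement learning (RL) for reasoning models has "raised benchmarks by how many points …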

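In reinforcement learning (RL) training for reasoning models, GRPO (Group Relative Policy Optimization), popularized by DeepSeek, has been adopted by many projects for its simplicity. The Qwen team, however, observed that performance degrades and eventually collapses when training runs for a long time, analyzed the root cause, and proposed a new method, GSPO.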

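The core of GRPO's problem lies in "token-level" optimization. GRPO treats each token as an independent sample and updates the policy using rewards measured relative to the group. This design creates an imbalance in which the later tokens of a long sequence have a cumulatively larger influence on the gradient. In addition, because the clipped surrogate loss is applied at the token level, the semantic coherence of the sequence as a whole is easily lost, and the model learns "shortcuts" for maximizing reward, that is, reward hacking.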
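For reference, the token-level objective this paragraph describes can be written as below. This is a sketch in the notation of the GSPO paper (arXiv 2507.18071), with any KL term left out; the symbols $G$ (group size), $y_i$ (the $i$-th response sampled for query $x$), $r(x, y_i)$ (its scalar reward), and $\varepsilon$ (clipping range) do not appear in the text above and are introduced here for illustration.

```latex
\begin{aligned}
\mathcal{J}_{\text{GRPO}}(\theta) &= \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|}
  \min\Big( w_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(w_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_i \Big) \right], \\
w_{i,t}(\theta) &= \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})}, \qquad
\hat{A}_i = \frac{r(x, y_i) - \operatorname{mean}\big(\{r(x, y_j)\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{r(x, y_j)\}_{j=1}^{G}\big)}.
\end{aligned}
```

Every token of response $y_i$ shares the same advantage $\hat{A}_i$, but each token gets its own ratio $w_{i,t}$ and its own clipping decision.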

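GSPO tries to solve this by working at the "sequence level". It computes the importance ratio as the probability ratio of the whole sequence rather than of individual tokens, and applies clipping at the sequence level as well. This closes the loophole of manipulating only specific parts of a sequence to earn reward. In addition, KL regularization is controlled per sequence rather than as a token average, so deviation from the reference model stays gently constrained as training progresses.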
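The sequence-level ratio and the objective built on it can be sketched as follows, again in the notation of arXiv 2507.18071. The symbols $G$, $y_i$, $x$, $\hat{A}_i$ (group-normalized advantage of response $y_i$), and $\varepsilon$ are introduced here for illustration, and no KL term is shown.

```latex
\begin{aligned}
s_i(\theta) &= \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)} \right)^{1/|y_i|}
  = \exp\left( \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})} \right), \\
\mathcal{J}_{\text{GSPO}}(\theta) &= \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G}
  \min\Big( s_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_i \Big) \right].
\end{aligned}
```

The exponent $1/|y_i|$ length-normalizes the ratio, so $s_i$ is the geometric mean of the per-token ratios and responses of different lengths are clipped on a comparable scale.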

🏠 Local LLM / Open Models · Key points of this article

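The Qwen team reports GSPO's effectiveness in its arXiv paper (2507.18071) and on its official blog. On math, coding, and logical-reasoning benchmarks, GSPO is said to show more stable improvement than GRPO at the same number of training steps. It appears to hold its advantage particularly in the later stages of long training runs, where GRPO tends to stall or collapse.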

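Competition in strengthening reasoning models with RL is fierce. With OpenAI's o-series, DeepSeek-R1, and Qwen3-Thinking, each company is refining its own RL method. GRPO is widely used in the open-source community because it is simple to implement, but stability over long training runs is a problem many practitioners have run into. If GSPO is integrated into popular frameworks such as trl, verl, and OpenRLHF, it could become a practical option for fine-tuning reasoning models. The "granularity" of algorithm design, token versus sequence, is likely to become an important design axis in future RL research.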

Reinforcement learning has become the defining technique for training reasoning models, but one of its most popular algorithms — GRPO, or Group Relative Policy Optimization — carries a subtle flaw that tends to surface only after extended training runs. Researchers from the Qwen team at Alibaba identified this failure mode and published a fix: GSPO, Group Sequence Policy Optimization, described in arXiv preprint 2507.18071.

GRPO gained widespread adoption largely because of its elegance. Rather than relying on a separate critic network like PPO, it estimates a baseline reward by averaging returns across a group of sampled responses, then updates the policy using a clipped surrogate objective. The problem, as Qwen's analysis shows, lies in how GRPO treats tokens as independent samples. When importance ratios and clipping are applied at the token level, later tokens in a long sequence accumulate disproportionate gradient influence. The model can exploit this by manipulating specific token positions to maximize reward without improving the quality of the overall response — a textbook case of reward hacking.
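To make the mechanics concrete, here is a minimal pure-Python sketch of the two pieces described above: the group-relative advantage and the token-level clipped surrogate. The function names, the use of the population standard deviation, the `eps` stabilizer, and the default `clip_eps=0.2` are all illustrative assumptions, not details taken from the article or from any particular framework.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    Assumption: population std plus a small eps; real implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_level_clipped_objective(new_logps, old_logps, advantage, clip_eps=0.2):
    """Token-averaged min(w * A, clip(w) * A), with one ratio w per token.

    new_logps / old_logps are per-token log-probabilities of the same response
    under the updated and the old policy. clip_eps is a placeholder value.
    """
    terms = []
    for lp_new, lp_old in zip(new_logps, old_logps):
        w = math.exp(lp_new - lp_old)  # per-token importance ratio
        w_clipped = min(max(w, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(w * advantage, w_clipped * advantage))
    return sum(terms) / len(terms)


# Hypothetical group of four responses with binary rewards.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
objective = token_level_clipped_objective([-0.9, -2.0], [-1.0, -2.1], advantages[0])
```

Note that each token is clipped independently here: one token of a response can be clipped while its neighbors are not, which is the token-level behavior the Qwen analysis takes issue with.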

KL divergence control compounds the issue. Token-averaged KL penalties can appear well-behaved on the surface while the model drifts significantly from the reference policy in ways that aggregate statistics obscure. Over thousands of training steps, this drift can destabilize learning entirely, producing the performance collapse that practitioners have observed anecdotally across various GRPO-based projects.
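A toy calculation shows how a token-averaged statistic can hide a large sequence-level shift. This is only an arithmetic illustration with made-up numbers, not an analysis from the paper; the per-token values stand for hypothetical log-probability differences between the current and the reference policy on one sampled response.

```python
def token_mean_and_sequence_sum(log_ratios):
    """Return (mean, sum) of per-token log pi_theta - log pi_ref values."""
    total = sum(log_ratios)
    return total / len(log_ratios), total


# Hypothetical 1000-token response: 990 tokens unchanged, 10 tokens shifted a lot.
log_ratios = [0.0] * 990 + [2.0] * 10
token_mean, sequence_sum = token_mean_and_sequence_sum(log_ratios)
print(f"token-averaged: {token_mean:.3f}, sequence-level: {sequence_sum:.1f}")
# prints "token-averaged: 0.020, sequence-level: 20.0"
```

The averaged value looks negligible, while the summed log-ratio corresponds to a sequence likelihood that differs by a factor of about e^20.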

GSPO addresses both problems by shifting the unit of optimization from the token to the sequence. The importance ratio, the core quantity that PPO-style algorithms use to constrain how far the updated policy strays from the old one, is computed for the entire sequence: the per-token probability ratios between the new and old policies are multiplied together, length-normalized in the paper (giving a geometric mean over tokens), and the result is treated as a single scalar for clipping. Similarly, KL regularization is applied at the sequence level. The effect is that the model must improve full responses holistically rather than gaming individual token positions, and the reference-policy constraint remains meaningful throughout long training runs.
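A minimal sketch of the sequence-level ratio and its clipping follows, assuming per-token log-probabilities are already available. The length normalization follows the definition in arXiv 2507.18071; the function names are hypothetical, and `clip_eps` is left as a required argument because the article gives no value for it.

```python
import math


def sequence_importance_ratio(new_logps, old_logps):
    """Length-normalized sequence ratio: exp(mean of per-token log-ratios)."""
    log_ratios = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(log_ratios) / len(log_ratios))


def sequence_level_clipped_objective(new_logps, old_logps, advantage, clip_eps):
    """min(s * A, clip(s) * A) with a single ratio s for the whole response."""
    s = sequence_importance_ratio(new_logps, old_logps)
    s_clipped = min(max(s, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(s * advantage, s_clipped * advantage)


# Hypothetical three-token response whose tokens all became slightly more likely.
new_logps = [-1.0, -2.0, -0.5]
old_logps = [-1.1, -2.1, -0.6]
ratio = sequence_importance_ratio(new_logps, old_logps)  # about exp(0.1)
objective = sequence_level_clipped_objective(new_logps, old_logps, 1.0, clip_eps=0.05)
```

Because one scalar is clipped per response, a response is either inside the trust region as a whole or excluded from the gradient as a whole. Working in log space also avoids underflow when multiplying many small probabilities.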

In benchmarks spanning mathematical reasoning, coding, and logical inference, Qwen reports that GSPO maintains stable improvement curves where GRPO tends to plateau or degrade in later training stages. The gains appear most pronounced precisely in the extended-training regime — which is also the regime that matters most for frontier reasoning models, where teams routinely run RL for thousands of steps.

The broader context matters here. GRPO was introduced in DeepSeek's DeepSeekMath work and popularized by DeepSeek-R1 as a simpler alternative to PPO, which sparked a wave of open-source reasoning model projects built on frameworks like TRL, verl, and OpenRLHF. Many practitioners hit stability walls that were hard to diagnose. GSPO offers a theoretically motivated explanation for those failures and a conceptual drop-in replacement, though integration into mainstream frameworks will determine how quickly it sees adoption.

The question of granularity — whether to optimize at the token, step, or sequence level — is shaping up to be a central design axis in RL for language models. Process reward models already push in the direction of finer-grained credit assignment, while GSPO argues for coarser, sequence-level constraints on the policy update itself. Both directions reflect the field's growing understanding that naive token-level RL objectives can misalign with the goal of producing coherent, high-quality long-form reasoning. Qwen's contribution adds a concrete data point to that evolving picture.