Ulysses Sequence Parallelism: Training with Million-Token Contexts
- Ulysses Sequence Parallelism, introduced in a Hugging Face blog post, is a parallelization technique for training LLMs on long contexts.
- By splitting attention heads across GPUs it keeps communication volume low, making training at million-token context lengths practical.
Hugging Face has published a deep dive into Ulysses Sequence Parallelism, a parallelization strategy aimed at making it practical to train large language models on context windows that stretch into the millions of tokens. As frontier models push from hundreds of thousands to a million-plus tokens, activation memory and inter-GPU communication have become the dominant training bottlenecks, and sequence-level parallelism is emerging as a key answer.
Originally proposed by the Microsoft DeepSpeed team, Ulysses shards the input along the sequence dimension so each GPU holds only a slice of tokens. The clever part happens inside attention: an all-to-all collective transposes the tensors from a sequence-sharded layout into a head-sharded one, letting each GPU compute full-length attention for a subset of heads. A second all-to-all converts the result back to sequence sharding for the feed-forward layers. The net effect is that activation memory scales down roughly linearly with the degree of parallelism.
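To make the two collectives concrete, here is a minimal, forward-only PyTorch sketch of that pattern. It assumes a `[batch, seq_local, heads, head_dim]` tensor layout, an already-initialized process group, and sequence/head counts divisible by the group size; the helper names are illustrative rather than the blog's actual API, and a real implementation would wrap the collectives in an `autograd.Function` so gradients flow through them.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def _all_to_all(x: torch.Tensor, scatter_dim: int, gather_dim: int, group) -> torch.Tensor:
    """Scatter x along scatter_dim and gather along gather_dim across the group."""
    world = dist.get_world_size(group)
    chunks = [c.contiguous() for c in torch.chunk(x, world, dim=scatter_dim)]
    out = [torch.empty_like(chunks[0]) for _ in range(world)]
    dist.all_to_all(out, chunks, group=group)
    return torch.cat(out, dim=gather_dim)

def ulysses_attention(q, k, v, group):
    """q, k, v: [batch, seq_local, heads, head_dim], seq_local = seq / P."""
    # 1) First all-to-all: sequence-sharded -> head-sharded.
    #    Result: [batch, seq_full, heads / P, head_dim].
    q, k, v = (_all_to_all(t, scatter_dim=2, gather_dim=1, group=group) for t in (q, k, v))
    # 2) Ordinary full-sequence attention over the local head subset
    #    (SDPA expects [batch, heads, seq, head_dim]; causal mask assumed here).
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=True
    ).transpose(1, 2)
    # 3) Second all-to-all: head-sharded -> sequence-sharded, restoring
    #    [batch, seq_local, heads, head_dim] for the feed-forward layers.
    return _all_to_all(out, scatter_dim=1, gather_dim=2, group=group)
```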
Compared with Ring Attention or Megatron-style tensor parallelism, Ulysses has the appealing property that its communication volume depends on the hidden and head dimensions rather than on sequence length, so it scales gracefully as contexts grow. The downside is that the parallel degree is bounded by the number of attention heads, which can be limiting for models with grouped-query attention or relatively few KV heads. In practice it is often combined with other forms of parallelism to sidestep that ceiling.
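The head-count ceiling is easy to quantify. Under the assumption that every rank must own a whole number of both query and KV heads (no KV-head replication), the valid Ulysses degrees are just the common divisors of the two head counts; the helper below is an illustrative sketch, not code from the blog.

```python
def valid_ulysses_degrees(num_q_heads: int, num_kv_heads: int) -> list[int]:
    """Sequence-parallel degrees that leave every rank whole Q and KV heads."""
    return [d for d in range(1, num_kv_heads + 1)
            if num_q_heads % d == 0 and num_kv_heads % d == 0]

# A GQA model with 64 query heads but only 8 KV heads is capped at
# 8-way Ulysses unless KV heads are replicated across ranks:
print(valid_ulysses_degrees(64, 8))  # [1, 2, 4, 8]
```

This ceiling is why Ulysses is typically stacked with data or tensor parallelism once the cluster is wider than the KV-head count.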
In the Hugging Face implementation, Ulysses is composed with ZeRO and FSDP so that weights and optimizer states are sharded across data-parallel ranks while activations are sharded across the sequence dimension. The blog reports that this combination makes training at million-token contexts tractable on existing GPU clusters, opening the door to long-context supervised fine-tuning and RLHF for the open-source community rather than only for closed labs.
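The blog does not spell out the wiring, but one common way to compose the two axes with stock PyTorch is a 2D device mesh: FSDP shards weights and optimizer state along one dimension while the Ulysses all-to-alls run along the other. The snippet below is a sketch under those assumptions, with `build_model()` as a hypothetical placeholder; the actual Hugging Face integration may differ.

```python
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")

# Hypothetical layout for 8 GPUs: 2 data-parallel replicas x 4-way Ulysses.
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "sp"))

# The sequence-parallel group is what the attention all-to-alls run over;
# all ranks in one "sp" group see the same batch, split along the sequence dim.
sp_group = mesh.get_group("sp")

# FSDP shards weights and optimizer state across the data-parallel dimension.
model = FSDP(build_model(), device_mesh=mesh["dp"])  # build_model(): placeholder
```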
The broader context is that long sequences are quickly becoming table stakes. Gemini, Claude and recent Llama releases all advertise context windows in the hundreds of thousands or beyond, and inference-side innovations like FlashAttention, PagedAttention and KV-cache compression have set expectations that training tooling must catch up with. On the training side, Ulysses sits alongside Context Parallelism in NVIDIA's Megatron-LM and the Ring-Flash Attention variants explored in academic work; the right choice likely depends on head count, model architecture and interconnect topology.
For practitioners, the takeaway is that long-context training is no longer purely a hardware problem. By integrating Ulysses into its training stack, Hugging Face is signaling that the open ecosystem can credibly target million-token workloads, though achieving good throughput in practice will still hinge on careful tuning of parallelism degrees, attention kernels and dataset packing strategies. It remains to be seen how this approach compares with hybrid schemes in production-scale runs, but it appears to be a meaningful step toward democratizing very long-context model training.
The body text and summaries on this page are automatically generated by AI. Please verify accuracy against the original article (huggingface.co).