Speeding Up AI: Bringing Google Colossus to PyTorch via GCSFS and Rapid Bucket
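- Google has integrated GCSFS and Rapid Bucket for PyTorch, enabling direct access to its distributed file system, Colossus.
- This is expected to substantially speed up I/O for AI training and checkpointing and to reduce GPU/TPU idle time.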
English summary
- Google Cloud has introduced a high-performance integration that connects Rapid Storage directly to PyTorch via the fsspec interface to eliminate AI training bottlenecks.
- By utilizing Google’s Colossus
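Google has released a way for PyTorch users to make more direct use of Colossus, the distributed file system it has relied on internally for years. By combining GCSFS (the fsspec-based Python file-system interface for Google Cloud Storage) with the new Rapid Bucket, it aims to substantially ease storage I/O bottlenecks in AI training jobs.
Colossus is the exabyte-scale distributed storage foundation behind Google Search, YouTube, Borg, and other systems, and until now it has been used only indirectly as the backend of Google Cloud Storage (GCS). Rapid Bucket is a new bucket type that offers lower latency and higher throughput than conventional GCS buckets, and it is said to be optimized for AI workloads that read and write large numbers of small objects. When a bucket is accessed through GCSFS, PyTorch's DataLoader and checkpoint-saving code can treat it much like a local file system.
In large-scale model training, I/O is often what leaves GPUs and TPUs waiting, and checkpoint writes in particular can reach hundreds of gigabytes. If Rapid Bucket shortens that wait, it could translate directly into higher utilization of expensive accelerators.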
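In a similar direction, AWS offers S3 Express One Zone and Microsoft Azure offers Azure Managed Lustre, and the hyperscalers are competing to build out fast storage tiers for AI. On the PyTorch side, I/O optimization efforts such as TorchData and torch.distributed.checkpoint are also progressing, and Google's integration fits that trend.
Actual performance gains depend heavily on model size and access patterns, so it is advisable to benchmark before adopting it. The fact that Colossus, infrastructure that is normally invisible, has moved closer to AI developers can be seen as a sign that cloud storage and AI are becoming more tightly coupled.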
Google has unveiled a tighter integration between PyTorch and Colossus, the company's internal exabyte-scale distributed file system, by exposing it through GCSFS (the fsspec-based Python file-system interface for Google Cloud Storage) and a new offering called Rapid Bucket. The goal is to remove storage I/O as a bottleneck in modern AI training pipelines, where GPUs and TPUs frequently sit idle waiting for data or checkpoint writes to complete.
Colossus has long been the foundation beneath Google services such as Search, YouTube, and the Borg cluster manager, and it has historically powered Google Cloud Storage as an implementation detail. Rapid Bucket is positioned as a new bucket class that exposes more of Colossus's raw performance characteristics, offering lower latency and higher throughput than standard GCS buckets. This is particularly relevant for AI workloads, which tend to involve either massive sequential reads of training shards or bursty writes of multi-hundred-gigabyte model checkpoints.
By accessing a Rapid Bucket through GCSFS, PyTorch developers can read and write Colossus-backed objects through a familiar file-like interface rather than a bespoke client library. Standard constructs such as DataLoader, torch.save, and torch.distributed.checkpoint can work against fsspec-style paths, which lowers the barrier to adoption. For large language model training, where a single checkpoint can stall a multi-thousand-accelerator job for minutes, even modest improvements in write throughput translate directly into higher utilization of expensive hardware.
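As a rough illustration of why a file-like interface lowers the barrier, the sketch below writes a checkpoint through an injectable `opener`. It is not taken from the source article: the function names `save_checkpoint` and `load_checkpoint` are hypothetical, and `pickle` with the built-in `open` stands in for `torch.save` and a remote file system so the example runs anywhere.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path, opener=open, serializer=pickle.dump):
    """Write `state` to `path` via the file object that `opener` returns.

    `opener(path, "wb")` must return a context manager yielding a writable
    binary file; `serializer(state, f)` must write `state` into it.
    """
    with opener(path, "wb") as f:
        serializer(state, f)


def load_checkpoint(path, opener=open, deserializer=pickle.load):
    """Read a checkpoint back through the same kind of file-like interface."""
    with opener(path, "rb") as f:
        return deserializer(f)


# Local stand-in so the sketch runs without cloud credentials.
demo_path = os.path.join(tempfile.mkdtemp(), "ckpt.bin")
demo_state = {"step": 100, "weights": [0.1, 0.2, 0.3]}
save_checkpoint(demo_state, demo_path)
restored = load_checkpoint(demo_path)
print(restored["step"])
```

With `fsspec`, `gcsfs`, and PyTorch installed, the same function could in principle be called with `opener=fsspec.open`, `serializer=torch.save`, and a `gs://` path to a bucket of your own, since both `torch.save` and `pickle.dump` take `(obj, file)`. Whether a Rapid Bucket needs additional configuration is not covered here and should be checked against the official documentation.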
The move fits a broader industry pattern. AWS introduced S3 Express One Zone in November 2023 as a low-latency tier explicitly aimed at AI and analytics, and Microsoft offers Azure Managed Lustre alongside its standard Blob Storage for similar reasons. Hyperscalers appear to have concluded that object storage's traditional latency profile is a poor fit for the demands of frontier model training, and each is responding by carving out a higher-performance storage tier. Google's twist is that it leans on Colossus, arguably the most battle-tested of these systems, having scaled inside the company for well over a decade.
On the framework side, the PyTorch ecosystem has been steadily improving its own I/O story. TorchData, asynchronous checkpointing in torch.distributed.checkpoint, and projects like NVIDIA's DALI all attempt to overlap compute with data movement. A fast remote file system such as Rapid Bucket complements rather than replaces these efforts: the framework can pipeline reads more aggressively when the underlying store responds quickly and predictably.
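The idea of overlapping compute with data movement can be sketched with the standard library alone. The example below is a simplified stand-in, not how TorchData or DALI are implemented: a background thread reads ahead into a bounded queue while the main loop "trains". The names `prefetch` and `slow_reads` are hypothetical, and the sleeps merely simulate storage latency and a training step.

```python
import queue
import threading
import time


def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable`, reading up to `buffer_size` items ahead
    in a background thread so the consumer rarely waits on I/O.

    Simplification: errors raised inside the producer are not forwarded.
    """
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for item in iterable:
            q.put(item)  # blocks when the buffer is full
        q.put(sentinel)  # signal that the source is exhausted

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item


def slow_reads(n, delay=0.01):
    """Hypothetical stand-in for fetching n shards from remote storage."""
    for i in range(n):
        time.sleep(delay)
        yield i


results = []
for shard in prefetch(slow_reads(5)):
    time.sleep(0.01)  # stand-in for one training step
    results.append(shard * 2)
print(results)
```

The faster and more predictable the underlying store, the smaller the buffer needed to keep the consumer busy, which is the sense in which a fast storage tier complements framework-level pipelining.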
Actual speedups will depend heavily on model architecture, batch size, and access patterns, so teams should benchmark against their own workloads before migrating production pipelines. Pricing and regional availability of Rapid Bucket will also factor into the decision, as higher-performance storage tiers typically carry a premium. Still, the broader signal is clear: infrastructure that was once invisible to application developers, like Colossus, is increasingly being surfaced as a first-class primitive for AI builders, reflecting just how tightly storage and accelerator economics are now intertwined.
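A benchmark of the kind suggested above can start very small. The following sketch, which assumes nothing about Rapid Bucket's actual API, times one sequential write and one sequential read through an injectable `opener`. The function name `measure_throughput` is hypothetical, and the local temporary file is only a placeholder for a real storage target.

```python
import os
import tempfile
import time


def measure_throughput(path, size_bytes, opener=open, chunk_size=1 << 20):
    """Return (write_MBps, read_MBps, bytes_read) for one sequential write
    and one sequential read of `size_bytes` bytes at `path`."""
    chunk = os.urandom(min(chunk_size, size_bytes))

    t0 = time.perf_counter()
    written = 0
    with opener(path, "wb") as f:
        while written < size_bytes:
            n = min(len(chunk), size_bytes - written)
            f.write(chunk[:n])
            written += n
    write_s = max(time.perf_counter() - t0, 1e-9)  # avoid division by zero

    t0 = time.perf_counter()
    bytes_read = 0
    with opener(path, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            bytes_read += len(data)
    read_s = max(time.perf_counter() - t0, 1e-9)

    mb = size_bytes / 1e6
    return mb / write_s, mb / read_s, bytes_read


demo_path = os.path.join(tempfile.mkdtemp(), "bench.bin")
write_mbps, read_mbps, bytes_read = measure_throughput(demo_path, 4 * 1024 * 1024)
print(f"write {write_mbps:.1f} MB/s, read {read_mbps:.1f} MB/s")
```

Local numbers from this sketch mostly reflect the operating system's page cache, so a meaningful comparison would pass an fsspec-style opener and a path on each storage tier under test, use object sizes that match real checkpoints or shards, and repeat runs to capture variance.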
The body text and summary on this page were generated automatically by AI. Please check the original article (developers.googleblog.com) for accuracy.