IBM Releases Granite 4.0 3B Vision, a Lightweight Multimodal Model for Enterprise Documents · Granite 4.0 3B Vision: Compact Multimodal Intelligence for Enterprise Documents

Hugging Face Blog · huggingface.co · 2026/04/01 00:10

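AI 3-line summary

IBM has announced Granite 4.0 3B Vision, a lightweight multimodal model specialized for enterprise document processing.
Despite having only 3B parameters, it shows performance comparable to larger models on document understanding, OCR, and table and figure analysis, and it is released under Apache 2.0.
Its design is notably geared toward enterprise use.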

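IBM has released Granite 4.0 3B Vision, a lightweight multimodal model specialized for enterprise document processing. Despite its small 3B-parameter size, it is designed to reach practical accuracy on tasks specific to business documents, such as OCR, table and chart understanding, and layout analysis, and it is provided under the Apache 2.0 license.

The model appears to build on the Granite 4.0 text model family with an integrated vision encoder. It targets use cases that extract information from unstructured data such as invoices, contracts, reports, and scanned PDFs, and it is reported to post numbers competitive with open models of the same size or larger on document benchmarks such as DocVQA, ChartQA, and InfographicVQA. IBM says that, with enterprise adoption in mind, it emphasizes hallucination suppression, governance, and support for long and multi-page documents.

The compactness of a 3B-class model matters because it suits on-premises and edge deployments as well as self-hosting in regulated industries. GPU memory requirements are low, and combined with quantization the model may run on a single mid-range GPU or even with CPU inference. Distribution formats intended for major inference stacks such as watsonx.ai, Transformers, and vLLM are said to be available.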

🏠 Local LLM / Open Models · Key points of this article

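As background, competition among lightweight multimodal models is intensifying in document AI, with entrants such as Microsoft's Florence-2 and Phi-3.5-vision, Alibaba's Qwen2-VL, and Mistral's Pixtral. Enterprise documents in particular are hard to hand to closed APIs because of their layout complexity and confidentiality, so the positioning of Granite, which is openly licensed and tunable, is clear. IBM has long offered a lineup of small, purpose-specific models such as Granite Code, Granite Time Series, and Granite Guardian, and this release can be seen as an extension of that "specialized × small × Apache 2.0" strategy. The ease of integrating the model into RAG pipelines may also encourage practical adoption.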

IBM has released Granite 4.0 3B Vision, a compact multimodal model purpose-built for enterprise document understanding. With only three billion parameters, it targets practical accuracy on workloads like OCR, table and chart reasoning, and layout analysis, and ships under a permissive Apache 2.0 license.

The model extends IBM's Granite 4.0 text family by attaching a vision encoder, enabling it to ingest images of invoices, contracts, multi-page reports, scanned PDFs and infographics alongside text prompts. According to IBM, the model holds its own against larger open multimodal systems on document-centric benchmarks such as DocVQA, ChartQA and InfographicVQA. The training and tuning recipe appears focused on enterprise priorities: reducing hallucination on extracted fields, handling long or multi-page documents, and producing structured outputs that downstream pipelines can consume.
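One way to picture "structured outputs that downstream pipelines can consume" is a small validation step on the model's reply. The sketch below is an illustration only: the field names and the assumption that the model is prompted to answer in JSON are hypothetical and are not taken from IBM's documentation.

```python
import json
from typing import Any

# Hypothetical field names for an invoice-extraction prompt (an assumption,
# not a schema published by IBM).
REQUIRED_FIELDS = ("invoice_number", "total_amount", "currency")


def parse_extraction(raw_output: str) -> dict[str, Any]:
    """Parse a model's JSON reply and keep only the expected fields.

    Non-JSON output or missing fields become None rather than guesses,
    so downstream code can route the page to manual review.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {field: data.get(field) for field in REQUIRED_FIELDS}


def needs_review(record: dict[str, Any]) -> bool:
    """Flag a record when any expected field is absent."""
    return any(value is None for value in record.values())
```

Keeping absent fields as None, instead of filling them in, is one simple way a pipeline can avoid passing along hallucinated values.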

The small footprint is a deliberate design choice. A 3B vision-language model can run on a single mid-range GPU, and with quantization it may be feasible on commodity hardware or even CPU in some scenarios. That matters for regulated industries (finance, healthcare, the public sector), where sending documents to a hosted frontier API is often a non-starter. IBM positions the model for self-hosted deployment via watsonx.ai as well as the broader open ecosystem, including Hugging Face Transformers and likely vLLM-style serving stacks.
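The hardware claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes exactly 3 billion parameters (taken from the model name; the real count, including the vision encoder, may differ) and counts weights only, ignoring activations, the KV cache, and runtime overhead.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / (1024 ** 3)


if __name__ == "__main__":
    # 3e9 is an assumed round figure based on the "3B" in the model name.
    for label, bits in (("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)):
        print(f"{label}: {weight_memory_gib(3e9, bits):.2f} GiB")
```

Under these assumptions the weights come to roughly 5.6 GiB at 16-bit, 2.8 GiB at 8-bit, and 1.4 GiB at 4-bit, which is consistent with the article's point about mid-range GPUs and quantized deployment.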

The release lands in an increasingly crowded small multimodal segment. Microsoft's Phi-3.5-vision and Florence-2, Alibaba's Qwen2-VL, and Mistral's Pixtral have all pushed the frontier of what compact VLMs can do, and document AI specifically has long been contested by specialist systems like Donut, LayoutLMv3 and Nougat. Granite's differentiation is less about raw benchmark supremacy and more about the bundle: a permissive license, an enterprise-friendly governance story, and a coherent family that already includes Granite Code, Granite Time Series and Granite Guardian for safety filtering.

For practitioners, the most interesting integration point is retrieval-augmented generation over document corpora. A small VLM that can read a page image directly removes a brittle OCR-then-LLM step, and the cost profile makes it realistic to process millions of pages in batch. It is reasonable to expect community fine-tunes targeting specific verticals such as insurance claims or clinical forms to appear quickly, given the open weights. Whether Granite 4.0 3B Vision becomes a default choice will depend on real-world robustness on messy scans and non-English documents, areas where benchmark numbers historically tell only part of the story.
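The retrieval idea in the paragraph above can be sketched as follows. This is a toy under stated assumptions: `read_page` is a hypothetical callable standing in for a VLM that turns a page image into text, and bag-of-words cosine similarity is a stand-in for whatever embedding model a real RAG system would use.

```python
import math
import re
from collections import Counter
from typing import Callable


def tokenize(text: str) -> Counter:
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(
    page_images: dict[str, bytes],
    read_page: Callable[[bytes], str],  # hypothetical VLM wrapper
) -> dict[str, Counter]:
    """Read each page image directly (no separate OCR step) and index its text."""
    return {pid: tokenize(read_page(img)) for pid, img in page_images.items()}


def retrieve(index: dict[str, Counter], query: str, k: int = 1) -> list[str]:
    """Return the ids of the k pages most similar to the query."""
    q = tokenize(query)
    ranked = sorted(index, key=lambda pid: cosine(index[pid], q), reverse=True)
    return ranked[:k]
```

In a real deployment `read_page` would call the self-hosted model, and the retrieved pages would be passed back to it together with the user's question.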