Multimodal Embedding & Reranker Models with Sentence Transformers

Hugging Face Blog · huggingface.co · 2026/04/09 09:00

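AI 3-line summary

Sentence Transformers has been extended to handle images and multimodal input.
Models such as CLIP and SigLIP can be used through a common API, enabling embedding and reranking across text and images and making search and RAG systems easier to build.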

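Hugging Face has announced that it has extended the widely used Sentence Transformers library with multimodal support. The library, previously text-centric, can now handle image input natively, so search and reranking across text and images can be done through one simple, common API.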

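With the new functionality, representative vision-language models such as CLIP and SigLIP can be loaded directly through the SentenceTransformer class, and text and images can be encoded into the same embedding space. In addition, the CrossEncoder-style reranker reportedly now accepts multimodal input, so it can score combinations such as an image query against text candidates, or the reverse. On the training side, the existing contrastive learning utilities and loss functions can be applied to multimodal pairs, which makes domain-specific fine-tuning easier.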
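A minimal sketch of what a shared embedding space buys you: once an image and several texts are encoded as vectors, cross-modal retrieval reduces to ranking by cosine similarity. The helper names (`cosine`, `rank_candidates`, `embed_image_and_texts`) and the toy vectors are illustrative and not from the post. The checkpoint name "clip-ViT-B-32" and passing a PIL image to `encode()` follow the library's long-standing CLIP support and are an assumption about what the new release looks like, not a confirmed detail of it.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query_vec: Sequence[float],
                    candidate_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return candidate indices sorted from most to least similar."""
    scores = [cosine(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def embed_image_and_texts(image_path: str, texts: List[str]):
    """Encode one image and several texts with a CLIP checkpoint.

    Assumption: the model name and the PIL-image input come from the
    library's existing CLIP support, not from the post itself.
    """
    # Imported lazily so the ranking helpers work without these packages.
    from PIL import Image
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("clip-ViT-B-32")
    image_vec = model.encode(Image.open(image_path))
    text_vecs = model.encode(texts)
    return image_vec, text_vecs


# Toy vectors standing in for real embeddings (illustrative values only).
toy_image = [0.9, 0.1, 0.0]
toy_texts = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
order = rank_candidates(toy_image, toy_texts)
print(order)
```

With real embeddings, the output of `embed_image_and_texts` can be passed straight to `rank_candidates`, since the helpers only iterate over the vectors.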

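As background, the use of RAG (retrieval-augmented generation) and semantic search is expanding beyond documents to screenshots, charts, product images, and page images of PDFs. Approaches that "embed the page as an image", typified by ColPali and VisRAG, have drawn attention, and providers such as Jina, Voyage, and Cohere increasingly offer multimodal embedding APIs. Because Sentence Transformers has long been the de facto standard for open-source embedding development, its multimodal support may help push standardisation in this field.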

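In addition, the buildout of vision-language model support in the Transformers library appears to underpin this integration. For users, the practical value is likely to be that pipelines written for text can be extended to images with minimal changes. Another attraction is that existing best practices such as retrieval quality evaluation and hard-negative mining can be reused as-is.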
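For readers unfamiliar with hard-negative mining, the idea can be sketched in a few lines: for each query, pick the candidates that score highest but are not the labelled positive, and use them as difficult negatives during training. This is a plain-Python illustration, independent of whatever utilities the library itself provides. The function name `mine_hard_negatives`, its parameters, and the toy vectors are hypothetical, and the vectors are assumed to be unit-length so that the dot product equals cosine similarity.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mine_hard_negatives(query_vec: Sequence[float],
                        candidate_vecs: Sequence[Sequence[float]],
                        positive_idx: int,
                        top_k: int = 2) -> List[int]:
    """Return indices of the top_k most similar candidates, excluding the positive."""
    scored = [
        (dot(query_vec, cand), idx)
        for idx, cand in enumerate(candidate_vecs)
        if idx != positive_idx
    ]
    # Highest similarity first: these are the "hardest" wrong answers.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:top_k]]


# Toy unit vectors: the query could be a text embedding and the
# candidates image embeddings (or the reverse) in a shared space.
query = [1.0, 0.0]
candidates = [
    [1.0, 0.0],  # index 0: the labelled positive
    [0.8, 0.6],  # index 1: close to the query, a hard negative
    [0.0, 1.0],  # index 2: unrelated, an easy negative
    [0.6, 0.8],  # index 3: moderately close
]
hard = mine_hard_negatives(query, candidates, positive_idx=0, top_k=2)
print(hard)
```

Because the procedure only needs similarity scores, the same logic applies whether the pairs are text-text or text-image.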

Hugging Face has extended Sentence Transformers, one of the most widely used embedding libraries in the open-source ecosystem, with native support for multimodal models. The update lets developers encode both text and images through the same familiar API, enabling cross-modal retrieval and reranking with very little code change.

At the core of the release is the ability to load vision-language models such as CLIP and SigLIP directly through the SentenceTransformer class and project text and images into a shared embedding space. The CrossEncoder-style reranker interface has likewise been extended to accept multimodal inputs, so an image query can be scored against text candidates and vice versa. Existing contrastive training utilities and loss functions also carry over, meaning practitioners can fine-tune multimodal encoders on domain-specific pairs using the same training loops they already know from text-only work.
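To make the contrastive training idea concrete, here is a small plain-Python sketch of an in-batch-negatives objective: image i is paired with text i, and every other text in the batch serves as a negative. It is a simplified illustration of this family of losses and not the library's implementation; the post does not say which loss functions are used. The function name `in_batch_contrastive_loss`, the `scale` value of 20.0, and the toy vectors are assumptions chosen for the example.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_contrastive_loss(image_vecs: Sequence[Sequence[float]],
                              text_vecs: Sequence[Sequence[float]],
                              scale: float = 20.0) -> float:
    """Mean cross-entropy where text i is the positive for image i and
    all other texts in the batch act as negatives."""
    total = 0.0
    for i, img in enumerate(image_vecs):
        logits = [scale * cosine(img, txt) for txt in text_vecs]
        # Numerically stable log-sum-exp over this row of logits.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Negative log-probability assigned to the matching text.
        total += log_sum - logits[i]
    return total / len(image_vecs)


images = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled = in_batch_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

The loss is close to zero when each image already sits nearest its own caption and grows large when the pairs are mismatched, which is the signal fine-tuning uses to pull matching pairs together.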

The motivation is clear when looking at how retrieval workloads have evolved. Modern RAG pipelines increasingly need to index not just clean text but screenshots, charts, product photos, and rendered PDF pages. Approaches like ColPali and VisRAG, which treat document pages as images and rely on visual embeddings, have demonstrated that bypassing OCR can yield strong results on visually rich documents. Commercial providers including Jina, Voyage, and Cohere have responded with hosted multimodal embedding endpoints, and the lack of a robust open-source counterpart has been a notable gap.

Because Sentence Transformers is the de facto framework that many downstream tools — from LangChain and LlamaIndex integrations to vector database tutorials — build upon, formal multimodal support is likely to accelerate standardisation across the stack. Users should be able to swap a text encoder for a multimodal one with minimal refactoring, while still benefitting from established practices such as hard-negative mining, Matryoshka representation learning, and evaluation harnesses like MTEB.
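As a brief illustration of the Matryoshka idea mentioned above: such models are trained so that the leading dimensions of an embedding remain useful on their own, which lets you keep only a prefix of the vector and renormalise it to save storage and compute. The sketch below shows only that inference-time truncation step in plain Python. The function name `truncate_and_normalize` and the example vector are hypothetical, and the code says nothing about how the library implements this.

```python
import math
from typing import List, Sequence


def truncate_and_normalize(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components of `vec` and rescale to unit length."""
    if dim <= 0 or dim > len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalise an all-zero prefix")
    return [x / norm for x in head]


# Illustrative 4-dimensional embedding shortened to 2 dimensions.
full = [3.0, 4.0, 12.0, 0.5]
short = truncate_and_normalize(full, 2)
print(short)
```

Whether a truncated vector still retrieves well depends on the model having been trained with a Matryoshka-style objective; truncating an ordinary embedding this way generally loses more quality.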

The integration also appears to lean on the broader maturation of vision-language model support inside the Transformers library itself, which has steadily added unified processors and image handling over the last year. While the post focuses on currently supported architectures like CLIP and SigLIP, the design suggests room to plug in newer multimodal backbones as they appear, though the post does not commit to a specific roadmap. For teams building visual search, e-commerce ranking, or document understanding systems, the update lowers the barrier to experimenting with image-aware embeddings without leaving the Sentence Transformers ergonomics behind.