From Semantic Embeddings to Social Indicators: Validating the Proxy Presumption / The Proxy Presumption: From Semantic Embeddings to Valid Social Measures

arXiv cs.CL · arxiv.org · 2026/05/12 13:00 · 1d ago · 📖 2 min

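AI 3-line summary

A paper that critically examines the "proxy presumption": the practice of using semantic embeddings from LLMs as measurement instruments in social science.
It proposes procedures for validating whether embeddings correctly stand in for a construct, and catalogues the pitfalls of applying them to social measurement.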

English summary

This paper scrutinises the 'proxy presumption' that semantic embeddings from LLMs can serve as valid measures of social constructs, proposing validation procedures and highlighting pitfalls when embeddings are treated as substitutes for traditional social science measurements.

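A rapidly growing body of research uses semantic embeddings obtained from large language models (LLMs) as "measurements" of social constructs such as public opinion, ideology, and cultural values. This paper calls that trend the "proxy presumption" and critically examines its validity from the standpoint of quantitative social science.

The authors' concern is clear. Distances and clusterings in embedding space superficially yield continuous values that resemble survey scales, but it is not self-evident that they capture the construct the researcher intends. The authors point out that the traditional validation procedures of psychometrics, such as content validity, convergent validity, and discriminant validity, are rarely applied to embedding-based measurement.

The paper appears to catalogue the typical failure modes that arise when embeddings are used as social indicators. Examples include cultural bias in the model's training data leaking into the measurements; a lack of robustness, where small changes to prompts or preprocessing swing the results widely; and the inferential leap of assuming that correlation with external ground-truth data (such as survey responses) guarantees construct validity.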


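In related work, political science and economics have already accumulated studies that use word2vec, BERT, and more recently OpenAI's text-embedding models to estimate the ideological positions of legislators' speeches and to extract consumer preferences. At the same time, the stochastic parrots argument of Bender et al. and bias-detection research such as WEAT, which addresses the social stereotypes contained in embeddings, have been cited as warnings against such applications. This paper belongs to that lineage and, by presenting concrete validation procedures for applied researchers, appears to aim at a practical contribution that goes beyond mere critique.

The use of LLM-derived indicators as substitutes for policy evaluation tools or social surveys is likely to keep accelerating, but operational issues such as the incompatibility of measurements across model updates and the difficulty of ensuring reproducibility cannot be ignored. This study could serve as a starting point for asking what embeddings actually measure in a social-measurement context, rather than using them uncritically as "convenient numbers".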

Semantic embeddings extracted from large language models are increasingly treated as ready-made measurements of social constructs such as ideology, public opinion, cultural values, or consumer preferences. This paper interrogates that practice, which the authors label the 'proxy presumption', and asks whether the resulting numbers actually measure what researchers claim they measure.

The core argument is methodological rather than technical. Distances and clusters in embedding space superficially resemble the continuous scales produced by survey instruments, but their construct validity is rarely examined. Classical psychometric checks such as content validity, convergent and discriminant validity, and test-retest reliability are well established in the social sciences, yet embedding-based measures are often validated only by reporting a correlation with a single external benchmark. The authors argue this is insufficient evidence that an embedding captures the latent construct of interest rather than some correlated artefact of the training corpus.
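The convergent and discriminant checks mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's procedure: the function names, the toy numbers, and the pass criterion (the weakest convergent correlation must exceed the strongest discriminant one in absolute value) are all assumptions made for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def validity_report(embedding_score, same_construct, other_constructs):
    """Correlate an embedding-derived score with several criteria.

    same_construct: measures that should track the construct (convergent).
    other_constructs: measures that should NOT track it (discriminant).
    """
    convergent = {k: pearson(embedding_score, v) for k, v in same_construct.items()}
    discriminant = {k: pearson(embedding_score, v) for k, v in other_constructs.items()}
    # Assumed criterion: weakest convergent link beats strongest discriminant link.
    passes = min(abs(r) for r in convergent.values()) > max(
        abs(r) for r in discriminant.values()
    )
    return {"convergent": convergent, "discriminant": discriminant, "passes": passes}


# Hypothetical toy data for five documents (not taken from the paper).
embedding_score = [0.1, 0.4, 0.5, 0.8, 0.9]
report = validity_report(
    embedding_score,
    same_construct={"survey_scale": [1, 2, 3, 4, 5], "expert_coding": [1, 3, 2, 5, 4]},
    other_constructs={"text_length": [120, 80, 150, 90, 130]},
)
print(report)
```

Reporting several criteria side by side, rather than a single benchmark correlation, is the point of the exercise; where to set the pass threshold remains a judgment call for the researcher.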

The paper appears to catalogue typical failure modes for embedding-as-measurement workflows. These likely include sensitivity to prompt wording and preprocessing, instability across model versions, contamination by cultural and demographic biases baked into web-scale training data, and the inferential leap from predictive accuracy to construct validity. A model that predicts survey responses well may still be exploiting stylistic cues rather than the underlying attitude, which has serious implications when the measure is used downstream in causal or descriptive analyses.
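One way to probe the preprocessing sensitivity described above is to score the same texts under several preprocessing variants and look at how far each score moves. The sketch below is an assumption-laden illustration: `toy_embed` is a hypothetical hashed bag-of-tokens stand-in with no semantics (a real study would call the embedding model under examination), and the anchor phrase, variants, and texts are invented for the example.

```python
import re
from math import sqrt


def toy_embed(text, dim=31):
    """Placeholder for a real embedding model: a hashed bag of tokens.

    It carries no semantics; swap in the model under study.
    """
    vec = [0.0] * dim
    for token in text.split():
        vec[sum(ord(ch) for ch in token) % dim] += 1.0
    return vec


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_spread(texts, anchor, variants, embed=toy_embed):
    """For each text, the max-minus-min anchor similarity across variants."""
    spreads = []
    for text in texts:
        scores = [
            cosine(embed(prep(text)), embed(prep(anchor)))
            for prep in variants.values()
        ]
        spreads.append(max(scores) - min(scores))
    return spreads


# Assumed preprocessing variants a study might plausibly switch between.
variants = {
    "raw": lambda t: t,
    "lowercase": str.lower,
    "no_punct": lambda t: re.sub(r"[^\w\s]", "", t.lower()),
}
texts = ["Lower taxes help families.", "Taxes fund public schools!"]
spreads = score_spread(texts, anchor="lower taxes", variants=variants)
print(spreads)
```

A spread near zero suggests the score is stable under these choices; a large spread means the "measurement" partly reflects preprocessing rather than the text's content.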

The broader research context is worth noting. Political scientists have used word2vec and BERT to scale legislator speech, marketers exploit embeddings to map brand positioning, and a growing body of work in computational social science treats OpenAI or open-source embeddings as off-the-shelf instruments. Parallel critiques, from Bender and colleagues on stochastic parrots to bias diagnostics such as WEAT and SEAT, have flagged that embeddings encode social stereotypes and historical asymmetries. The present paper can be read as situating those concerns within a measurement-theoretic frame and translating them into actionable validation steps for applied researchers.

Practical implications extend beyond academic methodology. If embeddings are to feed into policy evaluation, market research, or content moderation tooling, then version drift becomes a measurement problem: an upgrade from one embedding model to another can silently change the scale, breaking comparability across studies or across time. Reproducibility is similarly fragile when proprietary endpoints are involved, since the underlying model may be retrained without notice. The authors seem to advocate for explicit validation protocols, transparent reporting of model provenance, and triangulation with traditional instruments rather than wholesale replacement.
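Version drift of the kind described above can be checked by re-scoring a fixed set of reference documents with both model versions and comparing the results. The following is a minimal sketch under stated assumptions: the scores are hypothetical, and the thresholds `min_rho` and `max_shift` are arbitrary illustrative defaults, not values from the paper.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rank correlation is undefined for constant scores")
    return cov / (sx * sy)


def drift_report(old_scores, new_scores, min_rho=0.95, max_shift=0.05):
    """Compare scores for the same documents under two model versions."""
    rho = spearman(old_scores, new_scores)
    shift = sum(new_scores) / len(new_scores) - sum(old_scores) / len(old_scores)
    return {
        "spearman": rho,       # did the ordering of documents survive?
        "mean_shift": shift,   # did the scale move?
        "comparable": rho >= min_rho and abs(shift) <= max_shift,
    }


# Hypothetical scores for five reference documents under two model versions.
old_scores = [0.10, 0.35, 0.50, 0.72, 0.90]
new_scores = [0.30, 0.42, 0.41, 0.80, 0.95]
drift = drift_report(old_scores, new_scores)
print(drift)
```

Separating rank agreement from mean shift is a deliberate choice: a model upgrade can preserve the ordering of documents while still moving the scale, and either change can break comparability across studies or over time.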

None of this implies that embeddings are unusable as social measures. The more constructive reading is that they should be treated like any other instrument: calibrated, validated against multiple criteria, and accompanied by an honest account of what they do and do not capture. As LLM-derived indicators proliferate in empirical research, frameworks of the kind proposed here may become a prerequisite for credible inference rather than an optional methodological refinement.

#arxiv #paper #embeddings #computational-social-science #measurement-validity #llm-evaluation

Source: arXiv cs.CL (T1)
Source Avg: ★ 1.0
Type: Paper
Importance: ★ Info (top 100% in Research)
Half-life: 🏛️ Long-term (architecture)
Lang: EN
Collected: 2026/05/13 07:55

Read the original article

arxiv.org

The body text and summaries on this page are automatically generated by AI. Please check the original article (arxiv.org) for accuracy.
