What is multimodal Pathos analysis and how does it improve emotional recognition in political speeches?

Multimodal Pathos analysis uses an LLM such as Gemini 2.5 Flash to process a speech's audio and transcript together, so that vocal tone is interpreted alongside textual semantics. In the study, Gemini's valence scores correlate strongly with the reference scores from TRUST-Pathos, a multi-agent scoring system (rho = +0.664, p < 0.001), whereas the acoustic-only model shows no significant correlation (rho = +0.097). The gap suggests that semantic understanding is central to detecting emotional appeal in political speech.
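To make the rho figures above concrete, here is a minimal sketch of how a rank correlation between two sets of scores can be computed. It assumes that rho denotes Spearman's rank correlation, which this summary does not state explicitly. The function names and the score lists are illustrative placeholders, not the study's code or data.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5

# Made-up placeholder scores for five speech segments (NOT data from the study).
reference_pathos = [0.2, 0.5, 0.9, 0.4, 0.7]
model_valence = [0.1, 0.6, 0.8, 0.3, 0.9]
rho = spearman_rho(reference_pathos, model_valence)
```

A value near +1 means the model orders segments almost the same way as the reference scores, while a value near 0 means the two orderings are essentially unrelated. For p-values, an established routine such as `scipy.stats.spearmanr` is a better choice than this sketch.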

Why do traditional acoustic emotion models fail in political contexts?

Traditional acoustic models rely only on pitch, speech rate, and volume, and discard the semantics of what is said. The study's systematic evaluation of the EMO-DB dataset also points to flaws in existing acoustic benchmarks: performative (acted) recordings, cultural bias, and incompatible emotion categories. Together these explain why such models fail in real-world political communication.

What are the implications of this research for affective computing and political communication?

The study indicates that semantic understanding outweighs acoustic cues in high-context domains such as politics and law. Future directions include adding facial expressions and gaze tracking as further modalities, and building datasets that better reflect real-world cultural diversity.

Beyond Acoustic Emotion Recognition: LLM-Based Multimodal Pathos Analysis of Political Speeches

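This paper examines whether acoustic emotion recognition can serve as a valid proxy for the Pathos (emotional appeal) dimension of political speeches. Taking the speech of Felix Banaszak, a member of the German Bundestag, as a case study, it compares three analysis modalities: the emotion2vec_plus_large model, which relies on acoustic features; the Gemini 2.5 Flash large language model, which combines audio and text; and TRUST-Pathos, a scoring system based on multi-agent collaboration. Gemini's valence scores show a significant, strong correlation with TRUST-Pathos (rho = +0.664), while the valence scores of the traditional acoustic model show no significant correlation. An evaluation of the EMO-DB dataset further reveals shortcomings in existing acoustic benchmarks: their performative nature, cultural bias, and category incompatibility. The study concludes that LLM-based multimodal analysis significantly outperforms a purely acoustic model at capturing semantically defined political emotion, offering a new paradigm for political communication and affective computing.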

Sources

arXiv