Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis of Political Speeches Using Large Language Models

This study investigates whether acoustic emotion recognition models can serve as effective proxies for Pathos (emotional appeal) in political speeches. Using a speech by Felix Banaszak, a member of the German Federal Parliament, as a case study, the research compares three analytical approaches: the emotion2vec_plus_large model, which relies on acoustic features alone; the Gemini 2.5 Flash large language model, which combines audio and text; and the TRUST-Pathos scoring system, which is based on multi-agent collaboration. Results show that Gemini's Valence scores correlate strongly and significantly with TRUST-Pathos (rho = +0.664, p < 0.001), while the acoustic model shows no significant correlation (rho = +0.097, p = 0.499). A systematic evaluation of the EMO-DB dataset further indicates that existing acoustic benchmarks are limited by their performative nature, cultural bias, and class incompatibility. For this case study, the findings indicate that multimodal analysis powered by large language models captures semantically defined political emotion far better than a single-modality acoustic model, pointing to a new paradigm for political communication research and affective computing.

Background and Context

The intersection of political communication and affective computing has long struggled with the precise quantification of "Pathos," defined as the capacity of a speaker to influence an audience through emotional appeal. Traditional methodologies in this domain have predominantly relied on acoustic feature extraction, utilizing metrics such as pitch, speech rate, and volume to infer emotional states. While these acoustic proxies offer a structured approach to emotion detection, they inherently ignore the deeper semantic layers of language, which are often the primary carriers of political intent and emotional nuance. This limitation becomes particularly acute in complex political contexts where the meaning of an utterance is inextricably linked to its linguistic content rather than its vocal delivery alone.

This study addresses this critical gap by proposing and validating a multimodal analysis framework powered by Large Language Models (LLMs). The core objective is to determine whether existing acoustic emotion recognition models can serve as effective proxy indicators for Pathos in political speeches, or if a paradigm shift toward semantic understanding is necessary. By introducing the TRUST multi-agent LLM pipeline as an operationalized benchmark for Pathos, the research seeks to answer a fundamental question: Can pure acoustic signals capture the emotional dimensions of political discourse as effectively as models that integrate text and audio? The findings challenge the prevailing assumption that acoustic features are sufficient for high-stakes emotional analysis, suggesting instead that semantic comprehension is indispensable for accurate political sentiment assessment.

To test these hypotheses, the research employs a three-part analytical framework. First, it uses emotion2vec_plus_large, a state-of-the-art acoustic speech emotion recognition model, which yields continuous arousal and valence values from the audio signal alone via a post-processing projection onto Russell's circumplex. This model stands for the strongest available unimodal acoustic analysis and has no access to textual context. Second, the study uses Gemini 2.5 Flash, a large language model that processes the audio and the transcribed text together. This multimodal input lets the model combine vocal tone with linguistic content when inferring emotion. Finally, the TRUST-Pathos scoring system, generated by a supervised ensemble of three advocate LLMs, serves as the reference benchmark and is treated as ground truth for Pathos. The multi-agent design is intended to make the scores more robust and to reduce the biases inherent in single-model assessments.
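The projection from categorical emotion scores to continuous valence and arousal can be illustrated with a minimal sketch. The source does not describe how the projection was implemented, so everything here is an assumption: the function name `project_to_circumplex`, the label set, and in particular the anchor coordinates, which are illustrative placements on Russell's circumplex rather than values taken from the study.

```python
# Hypothetical (valence, arousal) anchors on Russell's circumplex, each in [-1, 1].
# These coordinates are illustrative assumptions, not values from the study.
CIRCUMPLEX_ANCHORS = {
    "angry": (-0.6, 0.8),
    "fearful": (-0.7, 0.7),
    "happy": (0.8, 0.5),
    "neutral": (0.0, 0.0),
    "sad": (-0.7, -0.5),
    "surprised": (0.3, 0.8),
}


def project_to_circumplex(probs):
    """Return the probability-weighted (valence, arousal) pair for one clip.

    `probs` maps emotion labels to scores. Labels without an anchor
    (for example a catch-all "other" class) are ignored, and the remaining
    scores are renormalized so they sum to 1.
    """
    known = {label: p for label, p in probs.items() if label in CIRCUMPLEX_ANCHORS}
    total = sum(known.values())
    if total <= 0:
        # No usable evidence: fall back to the neutral origin.
        return (0.0, 0.0)
    valence = sum(p * CIRCUMPLEX_ANCHORS[label][0] for label, p in known.items()) / total
    arousal = sum(p * CIRCUMPLEX_ANCHORS[label][1] for label, p in known.items()) / total
    return (valence, arousal)


if __name__ == "__main__":
    # A clip scored as half neutral, half happy lands midway between the two anchors.
    print(project_to_circumplex({"neutral": 0.5, "happy": 0.5}))
```

A weighted average of this kind keeps the output continuous, but it depends entirely on where the anchors are placed, which is one reason the resulting valence values should be read as a derived proxy rather than a direct measurement.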

Deep Analysis

The models were validated empirically on a case study of a full speech delivered by Felix Banaszak, a member of the German Federal Parliament. The speech was segmented into 51 clips totaling 245 seconds, providing a realistic, high-context dataset for analysis. Agreement between each model's output and the TRUST-Pathos benchmark was measured with Spearman rank correlation coefficients. The results revealed a stark divergence between the unimodal acoustic model and the multimodal LLM. Specifically, the Valence scores generated by Gemini 2.5 Flash showed a strong and statistically significant positive correlation with the TRUST-Pathos benchmark (rho = +0.664, p < 0.001). This suggests that combining textual semantics with audio features lets the model track the nuanced emotional appeals characteristic of political rhetoric.
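For readers unfamiliar with the statistic, Spearman's rho is the Pearson correlation computed on ranks, with tied values sharing the mean of their rank positions. The following is a minimal standard-library sketch of that definition. The function names are illustrative, and the study's own tooling is not specified in the source.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (sd_x * sd_y)
```

In the setting described above, `x` would hold one model's per-clip scores (for example Gemini Valence) and `y` the TRUST-Pathos scores for the same 51 clips. Because only ranks enter the calculation, the two scales do not need to be comparable.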

In sharp contrast, the emotion2vec acoustic model showed a near-zero, statistically non-significant correlation with the benchmark (rho = +0.097, p = 0.499). In this case study, acoustic features alone therefore failed to track semantically defined political emotion. The acoustic model can detect basic vocal variation, but without linguistic context its scores did not separate emotionally charged political statements from neutral ones. This finding supports the hypothesis that in political communication the "what" is often more emotionally significant than the "how," which makes traditional acoustic proxies inadequate for deep affective analysis.
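One way to see why rho = +0.097 over 51 clips is indistinguishable from zero while rho = +0.664 is not is to put an approximate confidence interval around each coefficient. The sketch below uses Fisher's z-transform with standard error 1/sqrt(n - 3). That formula is derived for Pearson correlation and is only a rough approximation for Spearman's rho, and it is not claimed to be the test the study used to obtain its p-values. The function name `fisher_interval` is illustrative.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def fisher_interval(rho, n, confidence=0.95):
    """Approximate confidence interval for a correlation via Fisher's z-transform.

    Assumption: the Pearson-based standard error 1/sqrt(n - 3) is applied to a
    Spearman coefficient, which is a common but rough approximation.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = atanh(rho)                      # map rho onto an approximately normal scale
    standard_error = 1.0 / sqrt(n - 3)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Transform the interval endpoints back onto the correlation scale.
    return tanh(z - z_crit * standard_error), tanh(z + z_crit * standard_error)


# Coefficients and clip count as reported in the text above.
acoustic_low, acoustic_high = fisher_interval(0.097, 51)
gemini_low, gemini_high = fisher_interval(0.664, 51)

if __name__ == "__main__":
    print(f"emotion2vec: [{acoustic_low:.3f}, {acoustic_high:.3f}]")
    print(f"Gemini:      [{gemini_low:.3f}, {gemini_high:.3f}]")
```

Under these assumptions the interval for the acoustic model straddles zero, while the interval for Gemini lies entirely above it and does not overlap the acoustic interval, which matches the significance pattern reported in the text.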

Furthermore, the study conducted a systematic quality assessment of the EMO-DB (Berlin Emotional Speech Database), a standard benchmark used in acoustic emotion research. The evaluation revealed severe limitations within this dataset, including a heavy reliance on performative acting rather than natural emotional expression, significant cultural biases, and class incompatibility issues. These flaws in foundational datasets help explain why traditional acoustic models perform poorly in real-world political scenarios. The artificial nature of EMO-DB fails to replicate the complex, context-dependent emotional dynamics of genuine political discourse, leading to a generalization gap that acoustic models cannot bridge without semantic grounding.

Industry Impact

The implications of these findings extend beyond academic validation and point to a possible restructuring of how industries approach emotion detection in high-stakes environments. For the open-source community and developers of affective computing tools, the study challenges the dominance of acoustic-only paradigms. It suggests that in domains such as politics, law, and diplomacy, where meaning depends heavily on context, semantic understanding must take precedence over vocal analysis. Consequently, next-generation emotion analysis tools should integrate the reasoning capabilities of large language models rather than rely solely on acoustic features. This shift requires rethinking data pipelines, moving from isolated audio processing to integrated multimodal architectures that process text and sound together.

For industrial applications, particularly in political monitoring and public opinion analysis, the ability to accurately quantify Pathos is a critical asset. The superior performance of multimodal LLMs suggests that organizations can achieve far more reliable insights into public sentiment and political messaging by adopting these advanced frameworks. This could lead to more sophisticated tools for tracking political discourse, analyzing campaign strategies, and understanding voter sentiment. However, it also raises important considerations regarding the computational resources and data privacy requirements associated with processing large volumes of multimodal data, necessitating robust infrastructure and ethical guidelines.

Additionally, the critical evaluation of existing benchmarks like EMO-DB calls for a community-wide effort to construct more realistic and culturally diverse multimodal datasets. Current benchmarks often fail to represent the global diversity of political expression and emotional display, leading to biased models that perform well in controlled settings but fail in the wild. By advocating for datasets that reflect real-world complexity, the study pushes the field toward more equitable and practical solutions. This push for better data quality is essential for ensuring that affective computing tools are fair, accurate, and applicable across different cultural and political contexts.

Outlook

Looking forward, the success of the multimodal framework presented in this study lays the groundwork for even more sophisticated forms of emotional analysis. The integration of large language models with audio and text has proven effective, but the next logical step involves the inclusion of visual cues such as facial expressions and gaze tracking. Video-based multimodal analysis could provide an even richer understanding of political emotion, capturing non-verbal signals that complement vocal and linguistic content. This evolution promises to enhance the precision of affective computing in political monitoring, enabling analysts to detect subtle shifts in speaker confidence, sincerity, and emotional engagement that might be missed by audio-text models alone.

The broader impact of this research extends to the field of human-computer interaction (HCI). As AI systems become more integrated into social and political spheres, the ability to understand and respond to human emotion accurately becomes paramount. The paradigm shift from acoustic features to semantic understanding offers a template for developing AI systems that are not only technically proficient but also socially intelligent. These systems can engage in more nuanced interactions, providing better support in areas such as mental health, education, and customer service, where emotional intelligence is critical.

Finally, this study highlights the critical role of AI in social science research. By providing a robust method for quantifying emotional dimensions in political speech, it enables researchers to conduct large-scale, data-driven analyses of political communication. This can lead to new insights into the dynamics of political influence, the effectiveness of different rhetorical strategies, and the emotional drivers of public opinion. As the technology matures, the collaboration between computer scientists and social scientists will likely deepen, fostering a more comprehensive understanding of the complex interplay between language, emotion, and power in the digital age.