Mixture-of-Experts for Multimodal Emotion Recognition in Conversations
This paper proposes an MoE-based multimodal emotion recognition approach for conversations, simultaneously analyzing text, voice tone, and facial expressions. The MoE mechanism dynamically selects the most effective feature combinations for emotion detection.
Unlike traditional feature concatenation, MoE lets different expert networks focus on different modality combinations and emotion types. Some emotions are better expressed through tone (e.g., sarcasm), while others rely on facial expressions (e.g., surprise).
The approach achieves new SOTA results on multiple conversational emotion benchmarks, which is valuable for building more natural AI conversational systems that understand not just what is said, but how it is said.
Emotion recognition in conversations is a core HCI challenge. Human emotional expression is inherently multimodal, conveyed through text, tone, rhythm, facial expressions, and body language.
Method
The architecture uses three independent encoders: BERT for text, WavLM for audio, and a Vision Transformer for facial expressions. An MoE layer then dynamically fuses the features from all three modalities.
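As a rough sketch of this late-fusion layout, the snippet below abstracts each pretrained encoder as a stub function that returns a fixed-size embedding. The stub names, dimensions, and toy feature values are illustrative assumptions, not the paper's implementation; in the real system BERT, WavLM, and a Vision Transformer would produce the embeddings.

```python
# Illustrative sketch: each encoder stub stands in for a pretrained model.
# Real embeddings would be hundreds of dimensions; 4 is used for brevity.

def encode_text(utterance):
    # Placeholder for BERT: map an utterance to a 4-dim embedding.
    return [float(len(utterance)), 0.1, 0.2, 0.3]

def encode_audio(frames):
    # Placeholder for WavLM: summarize audio frames as a 4-dim embedding.
    mean = sum(frames) / len(frames)
    return [mean, mean * 0.5, 0.0, 1.0]

def encode_face(frames):
    # Placeholder for a Vision Transformer over face crops.
    mean = sum(frames) / len(frames)
    return [mean, 1.0 - mean, 0.5, 0.0]

def multimodal_features(utterance, audio_frames, face_frames):
    # Concatenate per-modality embeddings into one vector that the
    # MoE fusion layer can then route to experts.
    return (encode_text(utterance)
            + encode_audio(audio_frames)
            + encode_face(face_frames))
```

The concatenated vector is what the gating network scores when deciding which experts to activate.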
MoE Fusion
Eight expert networks learn different modality-combination patterns. A gating network dynamically selects the 2-3 most relevant experts per input. This is more efficient than plain feature concatenation, since different emotions manifest differently across modalities.
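A minimal top-k gating sketch, assuming the gating network is a simple linear scorer and each expert is a callable over the fused feature vector. All names, dimensions, and toy weights here are assumptions for illustration, not the paper's implementation (which would also typically add a load-balancing auxiliary loss during training):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_fuse(x, experts, gate_vectors, top_k=2):
    """Sparse MoE fusion: score all experts, keep the top_k,
    and return their softmax-weighted combined output."""
    # Gating logits: one dot-product score per expert.
    logits = [sum(g_i * x_i for g_i, x_i in zip(g, x)) for g in gate_vectors]
    # Indices of the top_k highest-scoring experts (sparse routing).
    chosen = sorted(range(len(experts)), key=lambda i: logits[i],
                    reverse=True)[:top_k]
    weights = softmax([logits[i] for i in chosen])
    # Weighted sum over only the selected experts' outputs.
    out = [0.0] * len(experts[chosen[0]](x))
    for w, i in zip(weights, chosen):
        for j, y_j in enumerate(experts[i](x)):
            out[j] += w * y_j
    return out, chosen

# 8 toy experts: expert i simply scales the input by (i + 1).
experts = [lambda x, k=i + 1: [k * v for v in x] for i in range(8)]
# Toy gate vectors; in practice these are learned parameters.
gates = [[0.1 * (i + 1), -0.1 * (i + 1)] for i in range(8)]

fused, selected = moe_fuse([1.0, 0.2], experts, gates, top_k=2)
```

Only the selected experts run on a given input, which is where the efficiency over dense fusion comes from.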
Results
New SOTA on IEMOCAP (+2.3% weighted F1) and MELD (+1.8%), with the largest gains on sarcasm and masked negative emotions.
Industry Trend Connection
This is a notable multimodal-AI advance in emotion understanding. For agentic AI, recognizing user emotions helps agents adjust their interaction strategies. The MoE architecture's dynamic expert selection also echoes the Self-Improving AI trend by enabling efficient inference.
In-Depth Analysis and Industry Outlook
From a broader perspective, this development reflects the accelerating trend of AI technology transitioning from laboratories to industrial applications. Industry analysts widely agree that 2026 will be a pivotal year for AI commercialization. On the technical front, large model inference efficiency continues to improve while deployment costs decline, enabling more SMEs to access advanced AI capabilities. On the market front, enterprise expectations for AI investment returns are shifting from long-term strategic value to short-term quantifiable gains.
However, the rapid proliferation of AI also brings new challenges: increasing complexity of data privacy protection, growing demands for AI decision transparency, and difficulties in cross-border AI governance coordination. Regulatory authorities across multiple countries are closely monitoring these developments, attempting to balance innovation promotion with risk prevention. For investors, identifying AI companies with truly sustainable competitive advantages has become increasingly critical as the market transitions from hype to value validation.
From a supply chain perspective, the upstream infrastructure layer is experiencing consolidation and restructuring, with leading companies expanding competitive barriers through vertical integration. The midstream platform layer sees a flourishing open-source ecosystem that lowers barriers to AI application development. The downstream application layer shows accelerating AI penetration across traditional industries including finance, healthcare, education, and manufacturing.