Measuring Lexical vs. Acoustic Emotion Cues Reliance in Audio LLMs
A comprehensive benchmark revealing that most audio-language models over-rely on text and miss critical prosodic cues—especially when words and tone conflict.
 
LISTEN is a benchmark that evaluates how well multimodal audio-language models understand and distinguish lexical and acoustic emotional cues in speech. It comprises four experiment types:
- Neutral-Text (3 variants): Emotion recognition with neutral transcriptions across modalities
- Emotion-Matched (3 variants): Lexical and acoustic cues convey the same emotion
- Emotion-Mismatched (3 variants): Lexical and acoustic cues convey conflicting emotions
- Paralinguistic (1 variant): Non-verbal vocalizations without lexical content
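The four experiment types and their variants can be written down as a small lookup table, which makes it easy to see that each model is scored on ten settings in total. This is a minimal sketch: the condition and variant names follow the leaderboard columns, but `CONDITIONS` and `evaluation_configs` are hypothetical names introduced here for illustration and are not part of the LISTEN code.

```python
# Hypothetical registry of LISTEN conditions and their input variants.
# Names mirror the leaderboard columns; the structure is an illustration only.
CONDITIONS = {
    "Neutral-Text": ["Text", "Audio", "Both"],
    "Emotion-Matched": ["Text", "Audio", "Both"],
    "Emotion-Mismatched": ["Text", "Audio", "Both"],
    "Paralinguistic": ["Audio"],
}


def evaluation_configs(conditions=CONDITIONS):
    """Return every (condition, variant) pair a model would be scored on."""
    return [
        (condition, variant)
        for condition, variants in conditions.items()
        for variant in variants
    ]


configs = evaluation_configs()
print(len(configs))  # 10 settings: 3 + 3 + 3 + 1
```

Iterating over `configs` gives one pass per leaderboard score column.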
Performance of state-of-the-art audio-language models on the LISTEN benchmark.
| Rank | Model | Type | Overall Average | Neutral-Text (Text) | Neutral-Text (Audio) | Neutral-Text (Both) | Emotion-Matched (Text) | Emotion-Matched (Audio) | Emotion-Matched (Both) | Emotion-Mismatched (Text) | Emotion-Mismatched (Audio) | Emotion-Mismatched (Both) | Paralinguistic (Audio) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
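The leaderboard lists an Overall Average next to the ten per-variant scores, but this page does not say how that average is aggregated. The sketch below assumes the simplest option, an unweighted mean over the ten columns; `overall_average` is a hypothetical helper, and the numbers are made up for illustration and are not real leaderboard results.

```python
from statistics import mean


def overall_average(scores):
    """Unweighted mean of per-variant scores (an assumed aggregation)."""
    if not scores:
        raise ValueError("no scores given")
    return mean(scores.values())


# Made-up scores for an imaginary model; NOT real LISTEN results.
example_scores = {
    "Neutral-Text (Text)": 20.0,
    "Neutral-Text (Audio)": 40.0,
    "Neutral-Text (Both)": 40.0,
    "Emotion-Matched (Text)": 60.0,
    "Emotion-Matched (Audio)": 60.0,
    "Emotion-Matched (Both)": 70.0,
    "Emotion-Mismatched (Text)": 50.0,
    "Emotion-Mismatched (Audio)": 30.0,
    "Emotion-Mismatched (Both)": 40.0,
    "Paralinguistic (Audio)": 50.0,
}

print(overall_average(example_scores))  # 46.0
```

If the benchmark instead weights conditions by sample count, the mean above would need per-condition weights.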
Detailed model performance across different modalities and experimental conditions
 
The comparison covers seven condition and input pairings: Neutral-Text (Text/Audio), Emotion-Matched (Text/Audio), Emotion-Mismatched (Text/Audio), and Paralinguistic (Audio). Gemini 2.5 Pro shows the most balanced performance across them, while Qwen3-Omni-30B excels in the Emotion-Matched conditions.
Neutral-Text
Task: Emotion recognition with neutral transcriptions
Variants:
- Text: Neutral text transcription only
- Audio: Audio with emotional prosody
- Text+Audio: Both modalities (neutral text + emotional audio)
Purpose: Assess whether models can recognize emotion from prosody when the text is neutral
Emotion-Matched
Task: Emotion recognition when lexical and acoustic cues agree
Variants:
- Text: Emotional text only
- Audio: Audio with matching emotional prosody
- Text+Audio: Both modalities with matching emotions
Purpose: Baseline performance when both modalities provide consistent emotional information
Emotion-Mismatched
Task: Emotion recognition when lexical and acoustic cues conflict
Variants:
- Text: Emotional text (conflicting with audio emotion)
- Audio: Audio with conflicting emotional prosody
- Text+Audio: Both modalities with conflicting emotions
Purpose: Test whether models rely more on lexical or acoustic cues when they conflict
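One way to quantify which cue a model relies on in the conflicting case is to check, for each Text+Audio sample, whether the prediction matches the emotion carried by the words, the emotion carried by the prosody, or neither. The following is a minimal sketch under that assumption; `cue_reliance` and the record layout are hypothetical and are not taken from the LISTEN code, and the sample predictions are invented.

```python
from collections import Counter


def cue_reliance(records):
    """Share of predictions following the lexical cue, the acoustic cue, or neither.

    Each record is a (predicted, text_emotion, audio_emotion) triple, where the
    text and audio emotions differ because the sample is a mismatched one.
    """
    counts = Counter()
    for predicted, text_emotion, audio_emotion in records:
        if text_emotion == audio_emotion:
            raise ValueError("mismatched samples must carry conflicting labels")
        if predicted == text_emotion:
            counts["lexical"] += 1
        elif predicted == audio_emotion:
            counts["acoustic"] += 1
        else:
            counts["neither"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: counts[key] / total for key in ("lexical", "acoustic", "neither")}


# Invented predictions for illustration only.
records = [
    ("happy", "happy", "angry"),    # follows the words
    ("happy", "happy", "sad"),      # follows the words
    ("angry", "happy", "angry"),    # follows the tone
    ("neutral", "sad", "happy"),    # follows neither cue
]
print(cue_reliance(records))
# {'lexical': 0.5, 'acoustic': 0.25, 'neither': 0.25}
```

A lexical share well above the acoustic share would indicate the text over-reliance this benchmark is designed to expose.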
Paralinguistic
Task: Emotion recognition from non-verbal vocalizations
Variants:
- Audio: Non-verbal sounds (laughter, sighs, gasps, etc.)
Purpose: Evaluate understanding of purely acoustic emotional cues without lexical content
Representative examples from each experimental condition showing the task format and model predictions
- SAMPLE_7c8b53fb
- SAMPLE_9b76ea7d
- SAMPLE_955399e0
- SAMPLE_c52e71d0
- MUStARD_PRO_1_7575_u_3B
- SAMPLE_54df39ff
- IEMOCAP_Session5_Ses05F_impro03_F006

If you use LISTEN in your research, please cite:
@misc{deli2025listen,
  title={LISTEN: Lexical vs. Acoustic Emotion Benchmark for Audio Language Models},
  author={Deli, Jingyi C.},
  year={2025},
  publisher={GitHub},
  howpublished={\url{https://github.com/DeliJingyiC/LISTEN}},
  note={Dataset available at: \url{https://huggingface.co/datasets/delijingyic/VibeCheck}}
}