Do Audio LLMs LISTEN or Transcribe?

Measuring Reliance on Lexical vs. Acoustic Emotion Cues in Audio LLMs

A comprehensive benchmark revealing that most audio-language models over-rely on text and miss critical prosodic cues—especially when words and tone conflict.

Overview

LISTEN is a novel benchmark designed to evaluate multimodal audio-language models on their ability to understand and distinguish between lexical and acoustic emotional cues in speech. The benchmark consists of four main experiment types:

1. Neutral-Text: Emotion recognition with neutral transcriptions across modalities (3 variants)
2. Emotion-Matched: Lexical and acoustic cues convey the same emotion (3 variants)
3. Emotion-Mismatched: Lexical and acoustic cues convey conflicting emotions (3 variants)
4. Paralinguistic: Non-verbal vocalizations without lexical content (1 variant)
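
To make the data layout concrete, a minimal loading sketch is shown below. The Hugging Face dataset ID is taken from the citation at the end of this page; the split name and the "condition" field used for grouping are assumptions and may differ from the released schema.

from collections import Counter
from datasets import load_dataset

# Dataset ID from the citation below; split and field names are assumed, not official.
ds = load_dataset("delijingyic/VibeCheck", split="test")

# Group samples by experimental condition to reproduce the four-way breakdown above.
print(Counter(example["condition"] for example in ds))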

Leaderboard

Performance of state-of-the-art audio-language models on the LISTEN benchmark.

Columns: Rank | Model | Type | Overall Average | Neutral-Text (Text / Audio / Both) | Emotion-Matched (Text / Audio / Both) | Emotion-Mismatched (Text / Audio / Both) | Paralinguistic (Audio)

Notes:

  • Overall Average: Mean accuracy across all audio and text+audio results from the four experimental conditions (7 modalities total, excluding text-only)
  • Weighted Accuracy: Accuracy weighted by class distribution
  • UAR: Unweighted Average Recall (mean of per-class recalls)
  • Macro F1: Unweighted mean of per-class F1 scores
  • Micro F1: F1 score calculated globally across all classes
  • Baseline Models: Uniform Guess and Majority Guess are not ranked with other models
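
For reference, a minimal sketch of these metrics using scikit-learn, assuming gold and predicted emotion labels as parallel lists of strings; here "Weighted Accuracy" is read as plain accuracy (which weights each class by its frequency), the usual speech-emotion-recognition convention.

from sklearn.metrics import accuracy_score, f1_score, recall_score

def listen_metrics(gold, pred):
    return {
        # Weighted Accuracy: plain accuracy, weighting classes by their frequency.
        "weighted_accuracy": accuracy_score(gold, pred),
        # UAR: unweighted average recall, i.e. the mean of per-class recalls.
        "uar": recall_score(gold, pred, average="macro"),
        # Macro F1: unweighted mean of per-class F1 scores.
        "macro_f1": f1_score(gold, pred, average="macro"),
        # Micro F1: F1 computed globally across all classes.
        "micro_f1": f1_score(gold, pred, average="micro"),
    }

# Overall Average: mean accuracy over the 7 audio and text+audio settings.
def overall_average(accuracy_per_setting):
    return sum(accuracy_per_setting.values()) / len(accuracy_per_setting)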

Performance Visualization

Detailed model performance across different modalities and experimental conditions

Radar chart comparing model performance across the seven modalities: Neutral-Text (Text/Audio), Emotion-Matched (Text/Audio), Emotion-Mismatched (Text/Audio), and Paralinguistic (Audio). Gemini 2.5 Pro demonstrates the most balanced performance, while Qwen3-Omni-30B excels in the Emotion-Matched conditions.

Experiment Details

1 Neutral-Text

Task: Emotion recognition with neutral transcriptions

Variants:

  • Text: Neutral text transcription only
  • Audio: Audio with emotional prosody
  • Text+Audio: Both modalities (neutral text + emotional audio)

Purpose: Assess whether models can recognize emotion from prosody when the text is neutral

2 Emotion-Matched

Task: Emotion recognition when lexical and acoustic cues agree

Variants:

  • Text: Emotional text only
  • Audio: Audio with matching emotional prosody
  • Text+Audio: Both modalities with matching emotions

Purpose: Baseline performance when both modalities provide consistent emotional information

3 Emotion-Mismatched

Task: Emotion recognition when lexical and acoustic cues conflict

Variants:

  • Text: Emotional text (conflicting with audio emotion)
  • Audio: Audio with conflicting emotional prosody
  • Text+Audio: Both modalities with conflicting emotions

Purpose: Test whether models rely more on lexical or acoustic cues when they conflict
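
One way to quantify this reliance, sketched below with hypothetical field names: in the mismatched Text+Audio setting, count how often the model's prediction matches the lexical emotion versus the acoustic emotion.

def cue_reliance(records):
    # Each record is assumed to carry the lexical emotion, the acoustic emotion,
    # and the model's prediction for the Text+Audio variant of a mismatched sample.
    follows_text = sum(r["prediction"] == r["text_emotion"] for r in records)
    follows_audio = sum(r["prediction"] == r["audio_emotion"] for r in records)
    n = len(records)
    return {
        "lexical_reliance": follows_text / n,
        "acoustic_reliance": follows_audio / n,
        "neither": (n - follows_text - follows_audio) / n,
    }

A model that over-relies on text will score high on lexical_reliance under this split.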

4 Paralinguistic

Task: Emotion recognition from non-verbal vocalizations

Variants:

  • Audio: Non-verbal sounds (laughter, sighs, gasps, etc.)

Purpose: Evaluate understanding of purely acoustic emotional cues without lexical content
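
Putting the four conditions together, a minimal evaluation loop might look like the sketch below; run_model and samples_by_setting are placeholders for whatever inference wrapper and data split you use, not part of the benchmark release.

CONDITIONS = {
    "neutral_text":       ["text", "audio", "text+audio"],
    "emotion_matched":    ["text", "audio", "text+audio"],
    "emotion_mismatched": ["text", "audio", "text+audio"],
    "paralinguistic":     ["audio"],
}

def evaluate(model, samples_by_setting, run_model):
    accuracy = {}
    for condition, modalities in CONDITIONS.items():
        for modality in modalities:
            samples = samples_by_setting[(condition, modality)]
            correct = sum(run_model(model, s, modality) == s["label"] for s in samples)
            accuracy[(condition, modality)] = correct / len(samples)
    # Overall Average: mean over the 7 settings that include audio (text-only excluded).
    audio_scores = [v for (cond, mod), v in accuracy.items() if mod != "text"]
    return accuracy, sum(audio_scores) / len(audio_scores)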

Condition Examples

Representative examples from each experimental condition showing the task format and model predictions

1 Neutral-Text (Text-only)

Sample ID: SAMPLE_7c8b53fb
Transcription: "Kids are talking by the door."
Prompt: Read the transcription and classify the emotion. Based on the content of this text, what emotion would the person likely be feeling?
Choices: A. anger | B. fear | C. disgust | D. neutral | E. sadness | F. surprise | G. calm | H. happiness
Ground Truth: neutral
Model Prediction: D (neutral) ✓
The model correctly identifies the statement's emotion (neutral) from lexical cues alone.

2 Emotion-Matched (Text-only)

Sample ID: SAMPLE_9b76ea7d
Transcription: "What the hell is this?"
Prompt: Read the transcription and classify the emotion. From the semantic content alone, what emotion is being expressed?
Choices: A. neutral | B. sadness | C. excitement | D. frustration | E. fear | F. disgust | G. happiness | H. anger | I. surprise
Ground Truth: frustration
Model Prediction: A (neutral) ✗
The model misclassifies an utterance that explicitly expresses frustration as neutral, illustrating overgeneralization across semantically related negative emotions.

3 Emotion-Mismatched (Text-only)

Sample ID: SAMPLE_955399e0
Transcription: "You're right, the party's fantastic. Please, tell me more. I haven't heard enough about it all week because hearing about that never gets old!"
Prompt: Read the transcription and classify the emotion. What emotion is conveyed by the words in this statement?
Choices: A. surprise | B. excitement | C. sadness | D. disgust | E. fear | F. neutral | G. anger | H. happiness | I. frustration | J. ridicule
Ground Truth (Explicit): excitement
Model Prediction: B (excitement) ✓
Note: Audio conveys ridicule (conflict)
Although the lexical content expresses excitement, the corresponding audio conveys ridicule, highlighting the designed lexical–prosodic conflict.

3 Emotion-Mismatched (Audio-only)

Sample ID: SAMPLE_c52e71d0
Audio: MUStARD_PRO_1_7575_u_3B
Prompt: Listen to the audio and classify the emotion. What emotion is communicated through the speaker's vocal prosody?
Choices: A. disgust | B. neutral | C. ridicule | D. frustration | E. sadness | F. anger | G. excitement | H. fear | I. surprise | J. happiness
Ground Truth (Implicit): anger
Model Prediction: G (excitement) ✗
The lexical content is superficially positive, but the prosody expresses irritation and anger. The model incorrectly predicts excitement, indicating difficulty in resolving sarcastic or contrastive vocal tone.

4 Paralinguistic (Audio-only)

Sample ID: SAMPLE_54df39ff
Audio: IEMOCAP_Session5_Ses05F_impro03_F006
Content: Nonverbal laughter
Prompt: Listen to the audio and classify the emotion. What emotional tone is conveyed by the literal meaning of this statement?
Choices: A. anger | B. happiness | C. fear | D. sadness | E. surprise | F. frustration | G. excitement | H. disgust | I. neutral
Ground Truth: excitement
Model Prediction: B (happiness) ✗
The utterance contains only nonverbal laughter. The model incorrectly classifies it as happiness, revealing challenges in distinguishing subtle affective intent from nonverbal vocalizations.
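
All examples share the same multiple-choice format, so scoring reduces to mapping the predicted option letter back to its emotion label and comparing it with the ground truth. A minimal sketch follows; the letter-extraction heuristic is an assumption, not the benchmark's official parser.

import re

def parse_choices(choice_string):
    # "A. anger | B. happiness | ..." -> {"A": "anger", "B": "happiness", ...}
    return {
        part.split(".", 1)[0].strip(): part.split(".", 1)[1].strip()
        for part in choice_string.split("|")
    }

def is_correct(choice_string, model_output, ground_truth):
    choices = parse_choices(choice_string)
    match = re.search(r"\b([A-J])\b", model_output)  # first standalone option letter
    predicted = choices.get(match.group(1)) if match else None
    return predicted == ground_truth

# The paralinguistic example above: the model answered B (happiness), gold is excitement.
choices = ("A. anger | B. happiness | C. fear | D. sadness | E. surprise | "
           "F. frustration | G. excitement | H. disgust | I. neutral")
print(is_correct(choices, "B", "excitement"))  # False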

Citation

If you use LISTEN in your research, please cite:

@misc{deli2025listen,
  title={LISTEN: Lexical vs. Acoustic Emotion Benchmark for Audio Language Models},
  author={Deli, Jingyi C.},
  year={2025},
  publisher={GitHub},
  howpublished={\url{https://github.com/DeliJingyiC/LISTEN}},
  note={Dataset available at: \url{https://huggingface.co/datasets/delijingyic/VibeCheck}}
}