Do Audio LLMs Really LISTEN, or Just Transcribe?

Measuring Lexical vs. Acoustic Emotion Cues Reliance in Audio LLMs

A comprehensive benchmark revealing that most audio-language models over-rely on text and miss critical acoustic cues, especially when words and tone conflict.

LISTEN Benchmark Illustration

Overview

LISTEN is a novel benchmark designed to evaluate multimodal audio-language models on their ability to understand and distinguish between lexical and acoustic emotional cues in speech. The benchmark consists of four main experiment types:

1 Neutral-Text: Emotion recognition with neutral transcriptions across modalities (3 variants)

2 Emotion-Matched: Lexical and acoustic cues convey the same emotion (3 variants)

3 Emotion-Mismatched: Lexical and acoustic cues convey conflicting emotions (3 variants)

4 Paralinguistic: Non-verbal vocalizations without lexical content (1 variant)
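
The experiment grid above can be written down as a small lookup table. The sketch below is illustrative only: the dictionary name, the variant labels, and the helper function are assumptions made for this example, not identifiers from the LISTEN codebase. It simply enumerates the ten condition/variant cells and separates out the ones that include audio.

```python
# Hypothetical layout of the benchmark grid; names are illustrative,
# not taken from the LISTEN release.
CONDITIONS = {
    "Neutral-Text": ["Text", "Audio", "Text+Audio"],
    "Emotion-Matched": ["Text", "Audio", "Text+Audio"],
    "Emotion-Mismatched": ["Text", "Audio", "Text+Audio"],
    "Paralinguistic": ["Audio"],
}


def all_cells(conditions):
    """Return every (condition, variant) pair in declaration order."""
    return [(cond, var) for cond, variants in conditions.items() for var in variants]


cells = all_cells(CONDITIONS)
# Cells that involve audio, i.e. everything except the text-only variants.
audio_cells = [cell for cell in cells if cell[1] != "Text"]

print(len(cells))        # 10
print(len(audio_cells))  # 7
```

Keeping the grid in one structure makes it easy to iterate over all cells when running a model or aggregating scores.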

Leaderboard

Performance of state-of-the-art audio-language models on the LISTEN benchmark.

Rank | Model | Type | Overall Average | Neutral-Text (Text) | Neutral-Text (Audio) | Neutral-Text (Both) | Emotion-Matched (Text) | Emotion-Matched (Audio) | Emotion-Matched (Both) | Emotion-Mismatched (Text) | Emotion-Mismatched (Audio) | Emotion-Mismatched (Both) | Paralinguistic (Audio)

Notes:

  • Overall Average: Mean accuracy across all audio and text+audio results from the four experimental conditions (7 modalities total, excluding text-only)
  • Weighted Accuracy: Accuracy weighted by class distribution
  • UAR: Unweighted Average Recall (mean of per-class recalls)
  • Macro F1: Unweighted mean of per-class F1 scores
  • Micro F1: F1 score calculated globally across all classes
  • Baseline Models: Uniform Guess and Majority Guess are not ranked with other models
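
The metric definitions in the notes can be made concrete with a few lines of plain Python. This is a minimal sketch using standard textbook definitions, not the benchmark's own evaluation script; all function names, the toy labels, and the `(condition, variant)` key format in `overall_average` are assumptions for illustration.

```python
from collections import Counter


def _counts(y_true, y_pred):
    """Per-class true positives, false positives and false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but the true class was different
            fn[t] += 1  # an instance of t was missed
    return tp, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of items whose prediction equals the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def uar(y_true, y_pred):
    """Unweighted Average Recall: mean recall over classes present in y_true."""
    tp, _, fn = _counts(y_true, y_pred)
    classes = sorted(set(y_true))
    return sum(tp[c] / (tp[c] + fn[c]) for c in classes) / len(classes)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in labels or predictions."""
    tp, fp, fn = _counts(y_true, y_pred)
    classes = sorted(set(y_true) | set(y_pred))
    f1s = [2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in classes]
    return sum(f1s) / len(f1s)


def micro_f1(y_true, y_pred):
    """F1 from counts pooled over all classes."""
    tp, fp, fn = _counts(y_true, y_pred)
    tp_sum, fp_sum, fn_sum = sum(tp.values()), sum(fp.values()), sum(fn.values())
    return 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)


def overall_average(cell_accuracy):
    """Mean accuracy over (condition, variant) cells, skipping text-only ones."""
    kept = [acc for (_, variant), acc in cell_accuracy.items() if variant != "Text"]
    return sum(kept) / len(kept)


# Toy labels, made up for this example only.
y_true = ["anger", "anger", "anger", "neutral", "happiness"]
y_pred = ["anger", "anger", "neutral", "neutral", "anger"]

print(round(accuracy(y_true, y_pred), 3))  # 0.6
print(round(uar(y_true, y_pred), 3))       # 0.556
print(round(macro_f1(y_true, y_pred), 3))  # 0.444
print(round(micro_f1(y_true, y_pred), 3))  # 0.6
```

For single-label classification, micro F1 coincides with plain accuracy, as the toy example shows. If "accuracy weighted by class distribution" is read as per-class recall weighted by class frequency (an assumption here), it also reduces to plain accuracy, whereas UAR gives every class equal weight regardless of how often it occurs.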

Performance Visualization

Detailed model performance across different modalities and experimental conditions

Detailed radar chart showing model performance across all modalities

Comprehensive comparison of model performance across seven modalities: Neutral-Text (Text/Audio), Emotion-Matched (Text/Audio), Emotion-Mismatched (Text/Audio), and Paralinguistic (Audio). Gemini 2.5 Pro demonstrates the most balanced performance, while Qwen3-Omni-30B excels in Emotion-Matched conditions.

Experiment Details

1 Neutral-Text

Task: Emotion recognition with neutral transcriptions

Variants:

  • Text: Neutral text transcription only
  • Audio: Audio with emotional prosody
  • Text+Audio: Both modalities (neutral text + emotional audio)

Purpose: Assess if models can recognize emotion from prosody when text is neutral

2 Emotion-Matched

Task: Emotion recognition when lexical and acoustic cues agree

Variants:

  • Text: Emotional text only
  • Audio: Audio with matching emotional prosody
  • Text+Audio: Both modalities with matching emotions

Purpose: Baseline performance when both modalities provide consistent emotional information

3 Emotion-Mismatched

Task: Emotion recognition when lexical and acoustic cues conflict

Variants:

  • Text: Emotional text (conflicting with audio emotion)
  • Audio: Audio with conflicting emotional prosody
  • Text+Audio: Both modalities with conflicting emotions

Purpose: Test whether models rely more on lexical or acoustic cues when they conflict
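
One way to quantify this reliance, sketched here under stated assumptions, is to count how often a prediction on a mismatched item agrees with the label carried by the words versus the label carried by the tone. The record keys (`explicit` for the lexical label, `implicit` for the acoustic label, `prediction`) and the function name are hypothetical, and this is not necessarily the measure used by the benchmark authors.

```python
def cue_reliance(records):
    """Share of predictions that follow the acoustic label, the lexical
    label, or neither, on items where the two labels differ.

    Each record is a dict with hypothetical keys:
      'explicit'   - emotion conveyed by the words
      'implicit'   - emotion conveyed by the voice
      'prediction' - emotion chosen by the model
    """
    records = list(records)
    if not records:
        raise ValueError("need at least one record")
    acoustic = lexical = 0
    for r in records:
        if r["explicit"] == r["implicit"]:
            raise ValueError("expected conflicting labels on every item")
        if r["prediction"] == r["implicit"]:
            acoustic += 1
        elif r["prediction"] == r["explicit"]:
            lexical += 1
    n = len(records)
    return {
        "acoustic": acoustic / n,
        "lexical": lexical / n,
        "other": (n - acoustic - lexical) / n,
    }


# Toy records, made up for illustration.
toy = [
    {"explicit": "excitement", "implicit": "anger", "prediction": "excitement"},
    {"explicit": "happiness", "implicit": "sadness", "prediction": "sadness"},
    {"explicit": "neutral", "implicit": "anger", "prediction": "fear"},
    {"explicit": "excitement", "implicit": "ridicule", "prediction": "excitement"},
]
print(cue_reliance(toy))
```

A model that mostly transcribes would show a high `lexical` share on the Text+Audio variant; a model that listens would shift that mass toward `acoustic`.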

4 Paralinguistic

Task: Emotion recognition from non-verbal vocalizations

Variants:

  • Audio: Non-verbal sounds (laughter, sighs, gasps, etc.)

Purpose: Evaluate understanding of purely acoustic emotional cues without lexical content

Condition Examples

Representative examples from each experimental condition showing the task format and model predictions

1 Neutral-Text (Text-only)

Sample ID: SAMPLE_7c8b53fb
Transcription: "It's eleven o'clock."
Prompt: Read the transcription and classify the emotion. Based on the content of this text, what emotion would the person likely be feeling?
Choices: A. anger | B. fear | C. disgust | D. neutral | E. sadness | F. surprise | G. calm | H. happiness
Ground Truth: neutral
Model Prediction: D (neutral) ✓
The model correctly identifies the statement's emotion (neutral) from lexical cues alone.

2 Emotion-Matched (Text-only)

Sample ID: SAMPLE_9b76ea7d
Transcription: "You are cruising for a bruising. You are in so much trouble."
Prompt: Read the transcription and classify the emotion. From the semantic content alone, what emotion is being expressed?
Choices: A. neutral | B. sadness | C. excitement | D. frustration | E. fear | F. disgust | G. happiness | H. anger | I. surprise
Ground Truth: anger
Model Prediction: A (neutral) ✗
The model misclassifies an explicitly angry utterance as neutral, showing that overtly threatening wording can still be missed from lexical cues alone.

3 Emotion-Mismatched (Text-only)

Sample ID: SAMPLE_955399e0
Transcription: "You're right, the party's fantastic. Please, tell me more. I haven't heard enough about it all week because hearing about that never gets old!"
Prompt: Read the transcription and classify the emotion. What emotion is conveyed by the words in this statement?
Choices: A. surprise | B. excitement | C. sadness | D. disgust | E. fear | F. neutral | G. anger | H. happiness | I. frustration | J. ridicule
Ground Truth (Explicit): excitement
Model Prediction: B (excitement) ✓

4 Neutral-Text (Audio-only)

Sample ID: SAMPLE_7c8b53fb
Audio: CREMAD_train_0333
Prompt: Listen to the audio and classify the emotion. What emotion is communicated through the speaker's vocal prosody?
Choices: A. surprise | B. sadness | C. fear | D. anger | E. calm | F. happiness | G. neutral | H. disgust
Ground Truth: anger
Model Prediction: D (anger) ✓

5 Emotion-Matched (Audio-only)

Sample ID: SAMPLE_dd0f6e9d
Audio: IEMOCAP_Session5_Ses05M_script01_1b_F030
Prompt: Listen to the audio and classify the emotion. What emotion is communicated through the speaker's vocal prosody?
Choices: A. frustration | B. anger | C. neutral | D. excitement | E. happiness | F. surprise | G. disgust | H. fear | I. sadness
Ground Truth: anger
Model Prediction: B (anger) ✓

6 Emotion-Mismatched (Audio-only)

Sample ID: SAMPLE_c52e71d0
Audio: MUStARD_PRO_1_7575_u_3B
Prompt: Listen to the audio and classify the emotion. What emotion is communicated through the speaker's vocal prosody?
Choices: A. disgust | B. neutral | C. ridicule | D. frustration | E. sadness | F. anger | G. excitement | H. fear | I. surprise | J. happiness
Ground Truth (Implicit): anger
Model Prediction: G (excitement) ✗

7 Neutral-Text (Text+Audio)

Sample ID: SAMPLE_7c8b53fb
Audio: CREMAD_train_0333
Transcription: "It's eleven o'clock."
Prompt: Listen to the audio and read the transcription, then classify the emotion. What emotion does the speaker convey through their tone?
Choices: A. surprise | B. sadness | C. fear | D. anger | E. calm | F. happiness | G. neutral | H. disgust
Ground Truth: anger
Model Prediction: D (anger) ✓

8 Emotion-Matched (Text+Audio)

Sample ID: SAMPLE_dd0f6e9d
Audio: IEMOCAP_Session5_Ses05M_script01_1b_F030
Transcription: "You are cruising for a bruising. You are in so much trouble."
Prompt: Listen to the audio and classify the emotion. What emotion is communicated through the speaker's vocal prosody?
Choices: A. frustration | B. anger | C. neutral | D. excitement | E. happiness | F. surprise | G. disgust | H. fear | I. sadness
Ground Truth: anger
Model Prediction: B (anger) ✓

9 Emotion-Mismatched (Text+Audio)

Sample ID: SAMPLE_c52e71d0
Audio: MUStARD_PRO_1_7575_u_3B
Transcription: "You're right, the party's fantastic. Please, tell me more. I haven't heard enough about it all week because hearing about that never gets old!"
Prompt: Listen to the audio and classify the emotion. What emotion is communicated through the speaker's vocal prosody?
Choices: A. disgust | B. neutral | C. ridicule | D. frustration | E. sadness | F. anger | G. excitement | H. fear | I. surprise | J. happiness
Ground Truth (Implicit): anger
Model Prediction: G (excitement) ✗

10 Paralinguistic (Audio-only)

Sample ID: SAMPLE_54df39ff
Audio: IEMOCAP_Ses01F_script01_3_F012
Content: Nonverbal laughter
Prompt: Listen to the audio and classify the emotion. What emotional tone is conveyed by the literal meaning of this statement?
Choices: A. anger | B. happiness | C. fear | D. sadness | E. surprise | F. frustration | G. excitement | H. disgust | I. neutral
Ground Truth: happiness
Model Prediction: B (happiness) ✓
The utterance contains only nonverbal laughter. The model correctly classifies it as happiness.
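
The examples above share a lettered multiple-choice format, with the option order varying from item to item. A minimal sketch of how such an item could be scored follows. It assumes the model's answer has already been reduced to a single letter; the function names are hypothetical and not part of any LISTEN tooling.

```python
def parse_choices(line):
    """Turn 'A. anger | B. fear' into {'A': 'anger', 'B': 'fear'}."""
    choices = {}
    for part in line.split("|"):
        letter, _, label = part.strip().partition(".")
        choices[letter.strip()] = label.strip()
    return choices


def score(choices_line, predicted_letter, ground_truth):
    """Map the predicted letter to its label and compare with the ground truth.

    Returns (label, is_correct); label is None if the letter is not an option.
    """
    choices = parse_choices(choices_line)
    label = choices.get(predicted_letter.strip().upper())
    return label, label == ground_truth


# Choice strings copied from examples 1 and 6 above.
neutral_text = "A. anger | B. fear | C. disgust | D. neutral | E. sadness | F. surprise | G. calm | H. happiness"
mismatched = ("A. disgust | B. neutral | C. ridicule | D. frustration | E. sadness | "
              "F. anger | G. excitement | H. fear | I. surprise | J. happiness")

print(score(neutral_text, "D", "neutral"))  # ('neutral', True)
print(score(mismatched, "G", "anger"))      # ('excitement', False)
```

Because the letter-to-label mapping changes between items, scoring has to go through the per-item choice list rather than a fixed letter key.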

Citation

If you use LISTEN in your research, please cite:

@misc{chen2025audiollmsreallylisten,
  title={Do Audio LLMs Really LISTEN, or Just Transcribe? Measuring Lexical vs. Acoustic Emotion Cues Reliance},
  author={Jingyi Chen and Zhimeng Guo and Jiyun Chun and Pichao Wang and Andrew Perrault and Micha Elsner},
  year={2025},
  eprint={2510.10444},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.10444},
}