LISTEN: Lexical vs. Acoustic Emotion Benchmark

Overview

LISTEN is a benchmark that evaluates how well multimodal audio-language models understand and distinguish lexical and acoustic emotional cues in speech. It comprises four experiment types:

Neutral-Text

Emotion recognition with neutral transcriptions across modalities

3 variants

Emotion-Matched

Lexical and acoustic cues convey the same emotion

3 variants

Emotion-Mismatched

Lexical and acoustic cues convey conflicting emotions

3 variants

Paralinguistic

Non-verbal vocalizations without lexical content

1 variant

Leaderboard

Performance of state-of-the-art audio-language models on the LISTEN benchmark.

Rank	Model	Type	Overall Average	Neutral-Text (Text)	Neutral-Text (Audio)	Neutral-Text (Both)	Emotion-Matched (Text)	Emotion-Matched (Audio)	Emotion-Matched (Both)	Emotion-Mismatched (Text)	Emotion-Mismatched (Audio)	Emotion-Mismatched (Both)	Paralinguistic (Audio)

Notes:

Overall Average: Mean accuracy over the Audio and Text+Audio results from the four experimental conditions (7 variants in total; Text-only results are excluded)
Weighted Accuracy: Accuracy weighted by class distribution
UAR: Unweighted Average Recall (mean of per-class recalls)
Macro F1: Unweighted mean of per-class F1 scores
Micro F1: F1 score calculated globally across all classes
Baseline Models: Uniform Guess and Majority Guess are not ranked with other models
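
The metric definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's evaluation code: the function name `classification_metrics` is hypothetical, and Weighted Accuracy is assumed here to reduce to plain accuracy, since the notes give no formula for it.

```python
from collections import Counter

def classification_metrics(y_true, y_pred):
    """Return accuracy, UAR, macro F1, and micro F1 for single-label predictions."""
    labels = sorted(set(y_true))  # classes present in the ground truth
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    recalls, f1s = [], []
    for c in labels:
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        denom = precision + recall
        recalls.append(recall)
        f1s.append(2 * precision * recall / denom if denom else 0.0)

    total_tp = sum(tp.values())
    micro_denom = 2 * total_tp + sum(fp.values()) + sum(fn.values())
    return {
        # Assumption: "Weighted Accuracy" is treated as plain accuracy.
        "accuracy": total_tp / len(y_true) if y_true else 0.0,
        "uar": sum(recalls) / len(labels) if labels else 0.0,
        "macro_f1": sum(f1s) / len(labels) if labels else 0.0,
        "micro_f1": 2 * total_tp / micro_denom if micro_denom else 0.0,
    }
```

For single-label classification, micro F1 computed this way equals plain accuracy, so the two columns differ only if the evaluation restricts or reweights classes in some way not described here.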

Performance Visualization

Detailed model performance across different modalities and experimental conditions

Detailed radar chart showing model performance across all modalities

Comprehensive comparison of model performance across seven modalities: Neutral-Text (Text/Audio), Emotion-Matched (Text/Audio), Emotion-Mismatched (Text/Audio), and Paralinguistic (Audio). Gemini 2.5 Pro demonstrates the most balanced performance, while Qwen3-Omni-30B excels in Emotion-Matched conditions.

Experiment Details

1 Neutral-Text

Task: Emotion recognition with neutral transcriptions

Variants:

Text: Neutral text transcription only
Audio: Audio with emotional prosody
Text+Audio: Both modalities (neutral text + emotional audio)

Purpose: Assess if models can recognize emotion from prosody when text is neutral

2 Emotion-Matched

Task: Emotion recognition when lexical and acoustic cues agree

Variants:

Text: Emotional text only
Audio: Audio with matching emotional prosody
Text+Audio: Both modalities with matching emotions

Purpose: Baseline performance when both modalities provide consistent emotional information

3 Emotion-Mismatched

Task: Emotion recognition when lexical and acoustic cues conflict

Variants:

Text: Emotional text (conflicting with audio emotion)
Audio: Audio with conflicting emotional prosody
Text+Audio: Both modalities with conflicting emotions

Purpose: Test whether models rely more on lexical or acoustic cues when they conflict
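
One way to quantify which cue a model follows is to count how often its prediction matches the lexical label versus the acoustic label. The sketch below is a hypothetical illustration, not the metric used by the benchmark authors; it assumes every mismatched sample carries both a lexical (explicit) label and an acoustic (implicit) label, and the name `cue_reliance` is made up for this example.

```python
def cue_reliance(predictions, lexical_labels, acoustic_labels):
    """Fraction of mismatched samples whose prediction follows each cue."""
    counts = {"lexical": 0, "acoustic": 0, "neither": 0}
    for pred, lex, aco in zip(predictions, lexical_labels, acoustic_labels):
        if lex == aco:
            # By construction, a mismatched sample has conflicting cues.
            raise ValueError("mismatched samples need differing cue labels")
        if pred == lex:
            counts["lexical"] += 1
        elif pred == aco:
            counts["acoustic"] += 1
        else:
            counts["neither"] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}
```

Under this scheme, a prediction of excitement for an utterance whose words convey excitement but whose prosody conveys anger counts toward the lexical share.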

4 Paralinguistic

Task: Emotion recognition from non-verbal vocalizations

Variants:

Audio: Non-verbal sounds (laughter, sighs, gasps, etc.)

Purpose: Evaluate understanding of purely acoustic emotional cues without lexical content
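
The four conditions and their variants can be written down as a small lookup table, which also makes the Overall Average from the leaderboard notes easy to reproduce. The following is a sketch under stated assumptions: the names `CONDITIONS`, `scored_variants`, and `overall_average` are hypothetical, and `scores` is assumed to map each (condition, variant) pair to an accuracy value.

```python
# Conditions and input variants as listed in the Experiment Details.
CONDITIONS = {
    "Neutral-Text": ["Text", "Audio", "Text+Audio"],
    "Emotion-Matched": ["Text", "Audio", "Text+Audio"],
    "Emotion-Mismatched": ["Text", "Audio", "Text+Audio"],
    "Paralinguistic": ["Audio"],
}

def scored_variants(conditions=CONDITIONS):
    """(condition, variant) pairs that count toward the Overall Average."""
    # Text-only variants are excluded, per the leaderboard notes.
    return [(c, v) for c, vs in conditions.items() for v in vs if v != "Text"]

def overall_average(scores):
    """Mean accuracy over the scored variants.

    `scores` maps (condition, variant) -> accuracy (hypothetical input format).
    """
    keys = scored_variants()
    missing = [k for k in keys if k not in scores]
    if missing:
        raise KeyError(f"missing scores for: {missing}")
    return sum(scores[k] for k in keys) / len(keys)
```

With this table, `len(scored_variants())` is 7 out of 10 variants in total, which matches the count given in the leaderboard notes.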

Condition Examples

Representative examples from each experimental condition showing the task format and model predictions

Neutral-Text (Text-only)

Sample ID: SAMPLE_7c8b53fb

Transcription: "It's eleven o'clock."

Prompt: Read the transcription and classify the emotion. Based on the content of this text, what emotion would the person likely be feeling?

Ground Truth: neutral

Model Prediction: D (neutral) ✓

The model correctly identifies the statement's emotion (neutral) from lexical cues alone.

Emotion-Matched (Text-only)

Sample ID: SAMPLE_9b76ea7d

Transcription: "You are cruising for a bruising. You are in so much trouble."

Prompt: Read the transcription and classify the emotion. From the semantic content alone, what emotion is being expressed?

Ground Truth: anger

Model Prediction: A (neutral) ✗

The model misclassifies an explicitly angry utterance as neutral, failing to pick up the threat expressed in the wording.

Emotion-Mismatched (Text-only)

Sample ID: SAMPLE_955399e0

Transcription: "You're right, the party's fantastic. Please, tell me more. I haven't heard enough about it all week because hearing about that never gets old!"

Prompt: Read the transcription and classify the emotion. What emotion is conveyed by the words in this statement?

Ground Truth (Explicit): excitement

Model Prediction: B (excitement) ✓

Neutral-Text (Audio-only)

Sample ID: SAMPLE_7c8b53fb

Audio: CREMAD_train_0333

Prompt: Listen to the audio and classify the emotion. What emotion is communicated through the speaker's vocal prosody?

Ground Truth: anger

Model Prediction: D (anger) ✓

Emotion-Matched (Audio-only)

Sample ID: SAMPLE_dd0f6e9d

Audio: IEMOCAP_Session5_Ses05M_script01_1b_F030

Prompt: Listen to the audio and classify the emotion. What emotion is communicated through the speaker's vocal prosody?

Ground Truth: anger

Model Prediction: B (anger) ✓

Emotion-Mismatched (Audio-only)

Sample ID: SAMPLE_c52e71d0

Audio: MUStARD_PRO_1_7575_u_3B

Prompt: Listen to the audio and classify the emotion. What emotion is communicated through the speaker's vocal prosody?

Ground Truth (Implicit): anger

Model Prediction: G (excitement) ✗

Neutral-Text (Text+Audio)

Sample ID: SAMPLE_7c8b53fb

Audio: CREMAD_train_0333

Transcription: "It's eleven o'clock."

Prompt: Listen to the audio and read the transcription, then classify the emotion. What emotion does the speaker convey through their tone?

Ground Truth: anger

Model Prediction: D (anger) ✓

Emotion-Matched (Text+Audio)

Sample ID: SAMPLE_dd0f6e9d

Audio: IEMOCAP_Session5_Ses05M_script01_1b_F030

Transcription: "You are cruising for a bruising. You are in so much trouble."

Prompt: Listen to the audio and classify the emotion. What emotion is communicated through the speaker's vocal prosody?

Ground Truth: anger

Model Prediction: B (anger) ✓

Emotion-Mismatched (Text+Audio)

Sample ID: SAMPLE_c52e71d0

Audio: MUStARD_PRO_1_7575_u_3B

Transcription: "You're right, the party's fantastic. Please, tell me more. I haven't heard enough about it all week because hearing about that never gets old!"

Prompt: Listen to the audio and classify the emotion. What emotion is communicated through the speaker's vocal prosody?

Ground Truth (Implicit): anger

Model Prediction: G (excitement) ✗

Paralinguistic (Audio-only)

Sample ID: SAMPLE_54df39ff

Audio: IEMOCAP_Ses01F_script01_3_F012

Content: Nonverbal laughter

Prompt: Listen to the audio and classify the emotion. What emotional tone is conveyed by the literal meaning of this statement?

Ground Truth: happiness

Model Prediction: B (happiness) ✓

The utterance contains only nonverbal laughter. The model correctly classifies it as happiness.
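
In the examples above, the same label appears under different letters (for instance, anger as both B and D), which suggests that answer options are presented as lettered choices in varying order. The sketch below shows one way such a prompt could be built and a letter response mapped back to a label. It is an assumption-laden illustration: the option layout, the helper names `build_prompt` and `parse_prediction`, and the short label list are all hypothetical, since the page does not show the full option set or the exact prompt template.

```python
import random
import re
import string

def build_prompt(instruction, labels, seed=0):
    """Return the prompt text and a letter -> label mapping with shuffled options."""
    rng = random.Random(seed)  # seeded so the option order is reproducible
    shuffled = list(labels)
    rng.shuffle(shuffled)
    letter_to_label = dict(zip(string.ascii_uppercase, shuffled))
    options = "\n".join(f"{k}. {v}" for k, v in letter_to_label.items())
    prompt = f"{instruction}\nOptions:\n{options}\nAnswer with a single letter."
    return prompt, letter_to_label

def parse_prediction(response, letter_to_label):
    """Map a response that starts with an option letter to its label, else None."""
    # Accepts forms such as "D", "(D)", "D.", or "D (anger)".
    match = re.match(r"^\s*\(?([A-Z])\)?(?=[\s.:,]|$)", response)
    if match is None:
        return None
    return letter_to_label.get(match.group(1))

# Hypothetical label subset taken from the examples on this page.
labels = ["neutral", "anger", "excitement", "happiness"]
prompt, mapping = build_prompt(
    "Listen to the audio and classify the emotion.", labels, seed=0
)
```

Requiring the letter at the start of the response keeps a free-text answer such as "Anger" from being misread as option A; a real evaluation harness may need more lenient parsing.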

Citation

If you use LISTEN in your research, please cite:

@misc{chen2025audiollmsreallylisten,
  title={Do Audio LLMs Really LISTEN, or Just Transcribe? Measuring Lexical vs. Acoustic Emotion Cues Reliance},
  author={Jingyi Chen and Zhimeng Guo and Jiyun Chun and Pichao Wang and Andrew Perrault and Micha Elsner},
  year={2025},
  eprint={2510.10444},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.10444},
}