LISTEN: Lexical vs. Acoustic Emotion Benchmark

Overview

LISTEN is a novel benchmark designed to evaluate multimodal audio-language models on their ability to understand and distinguish between lexical and acoustic emotional cues in speech. The benchmark consists of four main experiment types:

1

Neutral-Text

Emotion recognition with neutral transcriptions across modalities

3 variants

2

Emotion-Matched

Lexical and acoustic cues convey the same emotion

3 variants

3

Emotion-Mismatched

Lexical and acoustic cues convey conflicting emotions

3 variants

4

Paralinguistic

Non-verbal vocalizations without lexical content

1 variant

Leaderboard

Performance of state-of-the-art audio-language models on the LISTEN benchmark. Click on column headers to sort.

Rank	Model	Type	Overall Average	Neutral-Text (Text)	Neutral-Text (Audio)	Neutral-Text (Both)	Emotion-Matched (Text)	Emotion-Matched (Audio)	Emotion-Matched (Both)	Emotion-Mismatched (Text)	Emotion-Mismatched (Audio)	Emotion-Mismatched (Both)	Paralinguistic (Audio)

Notes:

Overall Average: Mean accuracy across all audio and text+audio results from the four experimental conditions (7 modalities total, excluding text-only)
Weighted Accuracy: Accuracy weighted by class distribution
UAR: Unweighted Average Recall (mean of per-class recalls)
Macro F1: Unweighted mean of per-class F1 scores
Micro F1: F1 score calculated globally across all classes
Baseline Models: Uniform Guess and Majority Guess are not ranked with other models

Performance Visualization

Detailed model performance across different modalities and experimental conditions

Detailed radar chart showing model performance across all modalities

Comprehensive comparison of model performance across seven modalities: Neutral-Text (Text/Audio), Emotion-Matched (Text/Audio), Emotion-Mismatched (Text/Audio), and Paralinguistic (Audio). Gemini 2.5 Pro demonstrates the most balanced performance, while Qwen3-Omni-30B excels in Emotion-Matched conditions.

Experiment Details

1 Neutral-Text

Task: Emotion recognition with neutral transcriptions

Variants:

Text: Neutral text transcription only
Audio: Audio with emotional prosody
Text+Audio: Both modalities (neutral text + emotional audio)

Purpose: Assess if models can recognize emotion from prosody when text is neutral

2 Emotion-Matched

Task: Emotion recognition when lexical and acoustic cues agree

Variants:

Text: Emotional text only
Audio: Audio with matching emotional prosody
Text+Audio: Both modalities with matching emotions

Purpose: Baseline performance when both modalities provide consistent emotional information

3 Emotion-Mismatched

Task: Emotion recognition when lexical and acoustic cues conflict

Variants:

Text: Emotional text (conflicting with audio emotion)
Audio: Audio with conflicting emotional prosody
Text+Audio: Both modalities with conflicting emotions

Purpose: Test whether models rely more on lexical or acoustic cues when they conflict

4 Paralinguistic

Task: Emotion recognition from non-verbal vocalizations

Variants:

Audio: Non-verbal sounds (laughter, sighs, gasps, etc.)

Purpose: Evaluate understanding of purely acoustic emotional cues without lexical content

Condition Examples

Representative examples from each experimental condition showing the task format and model predictions

1

Neutral-Text (Text-only)

Sample ID: SAMPLE_7c8b53fb

Transcription: "Kids are talking by the door."

Prompt: Read the transcription and classify the emotion. Based on the content of this text, what emotion would the person likely be feeling?

Ground Truth: neutral

Model Prediction: D (neutral) ✓

The model correctly identifies the statement's emotion (neutral) from lexical cues alone.

2

Emotion-Matched (Text-only)

Sample ID: SAMPLE_9b76ea7d

Transcription: "What the hell is this?"

Prompt: Read the transcription and classify the emotion. From the semantic content alone, what emotion is being expressed?

Ground Truth: frustration

Model Prediction: D (neutral) ✗

The model misclassified an explicitly frustration utterance as neutral, illustrating overgeneralization across semantically related negative emotions.

3

Emotion-Mismatched (Text-only)

Sample ID: SAMPLE_955399e0

Transcription: "You're right, the party's fantastic. Please, tell me more. I haven't heard enough about it all week because hearing about that never gets old!"

Prompt: Read the transcription and classify the emotion. What emotion is conveyed by the words in this statement?

Ground Truth (Explicit): excitement

Model Prediction: B (excitement) ✓

Note: Audio conveys ridicule (conflict)

Although the lexical content expresses excitement, the corresponding audio conveys ridicule, highlighting the designed lexical–prosodic conflict.

3

Emotion-Mismatched (Audio-only)

Sample ID: SAMPLE_c52e71d0

Audio: MUStARD_PRO_1_7575_u_3B

Prompt: Listen to the audio and classify the emotion. What emotion is communicated through the speaker's vocal prosody?

Ground Truth (Implicit): anger

Model Prediction: G (excitement) ✗

The lexical content is superficially positive, but the prosody expresses irritation and anger. The model incorrectly predicts excitement, indicating difficulty in resolving sarcastic or contrastive vocal tone.

4

Paralinguistic (Audio-only)

Sample ID: SAMPLE_54df39ff

Audio: IEMOCAP_Session5_Ses05F_impro03_F006

Content: Nonverbal laughter

Prompt: Listen to the audio and classify the emotion. What emotional tone is conveyed by the literal meaning of this statement?

Ground Truth: excitement

Model Prediction: B (happiness) ✗

The utterance contains only nonverbal laughter. The model incorrectly classifies it as happiness, revealing challenges in distinguishing subtle affective intent from nonverbal vocalizations.

Citation

If you use LISTEN in your research, please cite:

@misc{deli2025listen,
  title={LISTEN: Lexical vs. Acoustic Emotion Benchmark for Audio Language Models},
  author={Deli, Jingyi C.},
  year={2025},
  publisher={GitHub},
  howpublished={\url{https://github.com/DeliJingyiC/LISTEN}},
  note={Dataset available at: \url{https://huggingface.co/datasets/delijingyic/VibeCheck}}
}

Do Audio LLMs LISTENTranscribe?

Overview

Neutral-Text

Emotion-Matched

Emotion-Mismatched

Paralinguistic

Can You Beat the AI?

Leaderboard

Notes:

Performance Visualization

Experiment Details

1 Neutral-Text

2 Emotion-Matched

3 Emotion-Mismatched

4 Paralinguistic

Condition Examples

Neutral-Text (Text-only)

Emotion-Matched (Text-only)

Emotion-Mismatched (Text-only)

Emotion-Mismatched (Audio-only)

Paralinguistic (Audio-only)

Citation