Scaling Instance-level Emotional Speech Conversion

Amazon Prime Video, *The Ohio State University
Under review 2024

Abstract

Instance-level speech emotion conversion is the task of transferring the emotion of a reference speech sample to another speech sample while retaining the content and voice identity of the input. However, state-of-the-art (SOTA) performance on this task is limited by the scarcity of emotional speech data and annotations, in terms of emotion categories, number of speakers, and variety of speech content. In this paper, we propose a three-stage training approach to overcome these barriers: pretraining on a large-scale synthetic dataset, followed by supervised finetuning on a small amount of real annotated emotional speech data, and finally an online reinforcement learning step to enforce emotion similarity. To build the synthetic dataset, we use a text-to-speech model (MeloTTS) to generate 7000 speech samples in nine emotion categories for a reference voice, and a voice conversion model (OpenVoice) to convert these samples to a population of 250 speakers. Experimental results show that our three-stage training approach improves content and speaker identity preservation while achieving emotion accuracy comparable to SOTA methods.
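The synthetic dataset construction described in the abstract can be pictured as a two-step loop: synthesize each prompt once in the reference voice, then convert that clip to each target speaker. The sketch below is only an illustration under stated assumptions, not the authors' pipeline. `synthesize` and `convert_voice` are hypothetical callables standing in for wrappers around MeloTTS and OpenVoice (their real APIs are not shown here), `SyntheticClip` and `build_synthetic_corpus` are names invented for this example, and converting every clip to every speaker is an assumption the abstract does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Sequence


@dataclass(frozen=True)
class SyntheticClip:
    """One synthetic training example (hypothetical record layout)."""
    text: str
    emotion: str
    speaker_id: str
    audio: bytes


def build_synthetic_corpus(
    prompts: Iterable[tuple[str, str]],            # (text, emotion) pairs
    speaker_ids: Sequence[str],                    # target speaker population
    synthesize: Callable[[str, str], bytes],       # hypothetical TTS wrapper
    convert_voice: Callable[[bytes, str], bytes],  # hypothetical voice-conversion wrapper
) -> Iterator[SyntheticClip]:
    """Yield one clip per (prompt, speaker) pair.

    Assumption: each reference-voice clip is converted to every speaker.
    """
    for text, emotion in prompts:
        # Step 1: emotional speech in the single reference voice.
        reference_audio = synthesize(text, emotion)
        for speaker_id in speaker_ids:
            # Step 2: keep content and emotion, swap the voice identity.
            converted = convert_voice(reference_audio, speaker_id)
            yield SyntheticClip(text, emotion, speaker_id, converted)
```

Passing the two models in as callables keeps the sketch independent of any particular library version, and lets the loop be exercised with stub functions before real models are plugged in.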


(1) The specific angry emotional tone and expression are extracted from the reference audio (shown in red, top left). (2) The emotional expression of the input audio (shown in blue, bottom left) is modified to mimic the angry tone and expression extracted from the reference audio, while maintaining the input audio's speaker voice and content.

Demo Audios

Each demo presents four clips: Input Audio, Reference Audio, Our Output Audio, and AINN Output Audio.

Neutral to Surprise

Neutral to Happy

Neutral to Angry

Neutral to Sad

BibTeX

@article{chen2024scaling,
  title={Scaling Instance-level Emotional Speech Conversion},
  author={Chen, Jingyi and Huynh, Cong Phuoc and Sadoughi, Najmeh and Yanamandra, Abhishek and Jain, Abhinav and Ke, Zemian and Liu, Zhu and Bhat, Vimal},
  note={Under review},
  year={2024}
}