Scaling Instance-level Emotional Speech Conversion

Amazon Prime Video, *The Ohio State University
Under review, 2024

Abstract

Instance-level speech emotion conversion is the task of transferring the emotion of a reference speech sample to an input utterance while retaining the content and voice identity of the input. However, state-of-the-art (SOTA) performance on this task is bounded by the scarcity of emotional speech data and annotations, including the range of emotion categories, the number of speakers, and the variety of speech content. In this paper, we propose a three-stage training approach to overcome these barriers: pretraining on a large-scale synthetic dataset, supervised finetuning on a small amount of real annotated emotional speech data, and an online reinforcement learning step that enforces emotion similarity. To build the synthetic dataset, we use a text-to-speech model (MeloTTS) to generate 7000 speech samples in nine emotion categories for a reference voice, and a voice conversion model (OpenVoice) to convert these samples to a population of 250 speakers. Experimental results show that our three-stage training approach improves content and speaker identity preservation while achieving emotion accuracy comparable to SOTA methods.
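The synthetic-data step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: `tts` and `convert` are hypothetical callables standing in for the text-to-speech and voice conversion models (no real MeloTTS or OpenVoice API is called), the emotion labels are placeholders, and both the round-robin emotion assignment and the conversion of every base sample to every speaker are assumptions made only for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass(frozen=True)
class SyntheticClip:
    sample_id: int    # index of the base TTS sample
    emotion: str      # emotion label of the base sample
    speaker_id: int   # target speaker after voice conversion
    audio: object     # whatever the hypothetical backends return


def build_synthetic_dataset(
    texts: List[str],
    emotions: List[str],
    num_speakers: int,
    tts: Callable[[str, str], object],
    convert: Callable[[object, int], object],
) -> Iterator[SyntheticClip]:
    """Yield one clip per (base sample, target speaker) pair."""
    for sample_id, text in enumerate(texts):
        # Assumption: emotions are assigned round-robin over the base samples.
        emotion = emotions[sample_id % len(emotions)]
        # Emotional speech in the single reference voice.
        base_audio = tts(text, emotion)
        for speaker_id in range(num_speakers):
            # Assumption: every base sample is converted to every speaker.
            yield SyntheticClip(
                sample_id, emotion, speaker_id, convert(base_audio, speaker_id)
            )


if __name__ == "__main__":
    # Stand-in backends so the sketch runs without any speech model installed.
    fake_tts = lambda text, emotion: f"{emotion}:{text}"
    fake_convert = lambda audio, speaker_id: f"spk{speaker_id}|{audio}"
    emotions = [f"emotion_{i}" for i in range(9)]  # placeholder labels
    clips = list(
        build_synthetic_dataset(["hello", "world"], emotions, 3, fake_tts, fake_convert)
    )
    print(len(clips))  # 2 base samples x 3 speakers = 6
```

If every one of the 7000 base samples were converted to all 250 speakers, this loop would yield 7000 × 250 = 1,750,000 clips; the abstract does not state the final dataset size, so that figure holds only under the sketch's assumption.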

Demo Audios

Each conversion below is demonstrated with four clips: Input Audio, Reference Audio, Our Output Audio, and AINN Output Audio.

Neutral to Surprise

Neutral to Happy

Neutral to Angry

Neutral to Sad

@article{chen2024scaling,
  title={Scaling Instance-level Emotional Speech Conversion},
  author={Chen, Jingyi and Huynh, Cong Phuoc and Sadoughi, Najmeh and Yanamandra, Abhishek and Jain, Abhinav and Ke, Zemian and Liu, Zhu and Bhat, Vimal},
  note={Under review},
  year={2024}
}
