Instance-level speech emotion conversion is the task of transferring the emotion of a reference speech sample to an input utterance while preserving the input's content and speaker identity. However, state-of-the-art (SOTA) performance on this task is bounded by the scarcity of annotated emotional speech data, in terms of emotion categories, number of speakers, and variety of speech content. In this paper, we propose to overcome these barriers with a three-stage approach: pretraining on a large-scale synthetic dataset, supervised finetuning on a small amount of real annotated emotional speech, and an online reinforcement learning step that enforces emotion similarity. To build the synthetic dataset, we use a text-to-speech model (MeloTTS) to generate 7,000 speech samples across nine emotion categories in a single reference voice, then apply a voice conversion model (OpenVoice) to transfer these samples to a population of 250 speakers. Experimental results show that our three-stage training approach improves content and speaker-identity preservation while achieving emotion accuracy comparable to SOTA methods.
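The synthetic-data pipeline could be scripted roughly as follows. This is a minimal sketch, assuming the public MeloTTS and OpenVoice (v2) Python APIs; exact signatures may differ across releases, the `load_scripts` helper and all file paths are hypothetical, and the nine emotion labels shown are illustrative placeholders, not the categories used in the paper.

```python
# Sketch: generate emotional speech in one reference voice with MeloTTS,
# then convert it to many target speakers with OpenVoice.
import os

from melo.api import TTS                      # MeloTTS text-to-speech
from openvoice import se_extractor            # OpenVoice speaker-embedding extractor
from openvoice.api import ToneColorConverter  # OpenVoice tone-color (voice) converter

# Illustrative emotion set; the paper uses nine categories but does not list them here.
EMOTIONS = ["neutral", "happy", "sad", "angry", "surprised",
            "fearful", "disgusted", "calm", "excited"]

# Stage 1 of data generation: synthesize samples per emotion in a reference voice.
tts = TTS(language="EN", device="cuda:0")
speaker_id = tts.hps.data.spk2id["EN-US"]
for emotion in EMOTIONS:
    for i, text in enumerate(load_scripts(emotion)):  # hypothetical text-loading helper
        tts.tts_to_file(text, speaker_id,
                        f"ref_voice/{emotion}_{i}.wav", speed=1.0)

# Stage 2 of data generation: convert each sample to a population of target speakers.
converter = ToneColorConverter("checkpoints/converter/config.json", device="cuda:0")
converter.load_ckpt("checkpoints/converter/checkpoint.pth")

for spk_wav in os.listdir("target_speakers"):  # e.g. 250 speaker reference clips
    tgt_se, _ = se_extractor.get_se(f"target_speakers/{spk_wav}", converter, vad=True)
    for wav in os.listdir("ref_voice"):
        src_se, _ = se_extractor.get_se(f"ref_voice/{wav}", converter, vad=True)
        converter.convert(audio_src_path=f"ref_voice/{wav}",
                          src_se=src_se, tgt_se=tgt_se,
                          output_path=f"synthetic/{spk_wav}_{wav}")
```

Since the source speaker embedding is the same for every reference-voice file, in practice it could be extracted once and cached rather than recomputed inside the inner loop.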
[Audio demos: four example sets, each comparing Input Audio, Reference Audio, Our Output Audio, and AINN Output Audio.]
@article{chen2024scaling,
  title={Scaling Instance-level Emotional Speech Conversion},
  author={Chen, Jingyi and Huynh, Cong Phuoc and Sadoughi, Najmeh and Yanamandra, Abhishek and Jain, Abhinav and Ke, Zemian and Liu, Zhu and Bhat, Vimal},
  note={Under review},
  year={2024}
}