Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback

1Department of Linguistics, The Ohio State University
2Department of Computer Science and Engineering, The Ohio State University
3Amazon

🎵 Audio Demos

DLPO significantly improves speech quality and naturalness in TTS diffusion models. Listen to the difference!

Abstract

Diffusion models produce high-fidelity speech, but their long denoising process makes them inefficient for real-time use, and they struggle to model intonation and rhythm. To address these issues, we propose Diffusion Loss-Guided Policy Optimization (DLPO), an RLHF framework for TTS diffusion models.

DLPO integrates the original training loss into the reward function, preserving generative capabilities while reducing inefficiencies. Using naturalness scores as feedback, DLPO aligns reward optimization with the diffusion model's structure, improving speech quality.

We evaluate DLPO on WaveGrad 2, a non-autoregressive diffusion-based TTS model. Results show significant improvements in objective metrics (UTMOS 3.65, NISQA 4.02) and subjective evaluations, with DLPO audio preferred 67% of the time. These findings demonstrate DLPO's potential for efficient, high-quality diffusion TTS in real-time, resource-limited settings.

Method Overview

🔧 Diffusion Loss-Guided Policy Optimization (DLPO)

Unlike existing RLHF methods that rely solely on external rewards or KL regularization, DLPO directly integrates the diffusion model's original training loss into the reward function. This design serves two purposes:

🎯 Preserve Model Capabilities

By aligning the reward function with the original diffusion training objective, DLPO ensures the model maintains its ability to generate high-quality speech while adapting to human feedback.

🛡️ Prevent Overfitting

The original diffusion loss acts as a stabilizing regularizer, balancing external reward optimization with preservation of the model's probabilistic structure.

[Figure: DLPO overview. Input text ("Hello world") is synthesized by the WaveGrad 2 TTS model through the diffusion process x_T → x_0. The generated audio is scored by the UTMOS reward model, giving a naturalness reward r(x_0, c), while the original diffusion loss ‖ε̃(x_t, t) − ε_θ(x_t, c, t)‖² preserves model capabilities and prevents overfitting. Both terms enter the DLPO objective, and the policy is updated iteratively: θ ← θ + ∇_θ J_DLPO(θ).]

DLPO objective function: 𝔼[−α·r(x_0, c) − β·‖ε̃(x_t, t) − ε_θ(x_t, c, t)‖²], where α is the reward weight, β is the diffusion-loss weight, and r(x_0, c) is the UTMOS naturalness score.
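For concreteness, here is a minimal PyTorch-style sketch of the recipe described above: an external naturalness reward combined with the original diffusion loss as a regularizer. It is written under the usual convention of minimizing a loss (so the reward enters with a negative sign and the diffusion term with a positive one), and it omits the policy-gradient machinery that carries the reward signal back through the denoising steps; it is an illustration, not the paper's implementation, and the weights alpha/beta and all tensor shapes are placeholders.

```python
# Illustrative sketch only: reward term + original diffusion loss as a regularizer.
# Not the paper's code; shapes and tensors below are dummies.
import torch

def dlpo_style_loss(reward: torch.Tensor,    # r(x_0, c), e.g. UTMOS scores, shape (B,)
                    eps_true: torch.Tensor,  # injected noise eps_tilde, shape (B, T)
                    eps_pred: torch.Tensor,  # denoiser output eps_theta(x_t, c, t), shape (B, T)
                    alpha: float = 1.0,      # reward weight
                    beta: float = 1.0) -> torch.Tensor:
    """Combined loss to minimize: -alpha * r(x_0, c) + beta * ||eps_tilde - eps_theta||^2."""
    diffusion_loss = torch.mean((eps_true - eps_pred) ** 2)
    return -alpha * reward.mean() + beta * diffusion_loss

# Dummy usage with a batch of 4 one-second (16 kHz) noise targets.
reward = torch.rand(4)                                # stand-in for UTMOS scores
eps_true = torch.randn(4, 16000)
eps_pred = torch.randn(4, 16000, requires_grad=True)  # would come from the WaveGrad 2 denoiser
loss = dlpo_style_loss(reward, eps_true, eps_pred)
loss.backward()  # gradients flow through the diffusion term; the reward term
                 # additionally needs the RL surrogate (per-step log-probs) omitted here
```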

Experimental Results

📈 Comparison of RL Fine-tuning Methods

Method                 | UTMOS ↑ | NISQA ↑ | WER ↓
Ground Truth           | 4.20    | 4.37    | 0.99%
WaveGrad 2R (Baseline) | 2.90    | 3.74    | 1.5%
RWR                    | 2.18    | 3.00    | 8.9%
DDPO                   | 2.69    | 2.96    | 2.1%
DPOK                   | 3.18    | 3.76    | 1.1%
KLinR                  | 3.02    | 3.73    | 1.3%
DLPO (Ours)            | 3.65    | 4.02    | 1.2%
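As a rough guide to how objective metrics like those in the table can be reproduced, the sketch below computes WER by transcribing the synthesized audio with an ASR model and scoring it against the input text. The toolchain here (openai-whisper and jiwer) is an assumption, not necessarily what the paper used, and predict_utmos / predict_nisqa are hypothetical wrappers around the pretrained UTMOS and NISQA MOS predictors.

```python
# Sketch of objective evaluation; the ASR/MOS tooling here is assumed,
# not necessarily the toolchain used in the paper.
import jiwer    # pip install jiwer
import whisper  # pip install openai-whisper

def word_error_rate(reference_text: str, wav_path: str, asr_model) -> float:
    """Transcribe the generated audio and compute WER against the input text."""
    hypothesis = asr_model.transcribe(wav_path)["text"]
    return jiwer.wer(reference_text.lower(), hypothesis.lower())

def naturalness_scores(wav_path: str) -> dict:
    """UTMOS and NISQA come from their pretrained MOS predictors;
    predict_utmos / predict_nisqa are hypothetical wrappers around those checkpoints."""
    return {"UTMOS": predict_utmos(wav_path), "NISQA": predict_nisqa(wav_path)}

if __name__ == "__main__":
    asr = whisper.load_model("base")
    print(word_error_rate("Hello world", "dlpo_sample.wav", asr))
```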

🎧 Human Evaluation Results

  • 67% of listeners preferred DLPO-generated audio
  • 14% preferred baseline WaveGrad 2R
  • 19% rated as about the same
  • Statistical significance: p < 10⁻¹⁶ (binomial test; see the sketch below)
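The reported significance can be checked with an exact binomial test on the raw preference counts (excluding "about the same" ratings). The page gives only percentages, so the sketch below takes the counts as inputs and assumes a two-sided test against a 50/50 null.

```python
# Significance check for the pairwise preference test (DLPO vs. WaveGrad 2R).
from scipy.stats import binomtest

def preference_pvalue(n_prefer_dlpo: int, n_prefer_baseline: int) -> float:
    """Two-sided exact binomial test on the win counts, null hypothesis p = 0.5."""
    n_total = n_prefer_dlpo + n_prefer_baseline
    return binomtest(n_prefer_dlpo, n_total, p=0.5, alternative="two-sided").pvalue

# Usage: pass the raw win counts from the listening test (not given on this page),
# e.g. preference_pvalue(n_dlpo_wins, n_baseline_wins).
```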

⚡ Key Improvements

  • +26% relative improvement in UTMOS
  • +7.5% relative improvement in NISQA
  • 20% relative reduction in word error rate (WER); see the quick check of these figures below
  • Maintains computational efficiency
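These percentages are relative changes with respect to the WaveGrad 2R baseline in the results table; a quick arithmetic check:

```python
# Recompute the relative changes from the results table (baseline: WaveGrad 2R).
def rel_change(new: float, old: float) -> float:
    """Percentage change from old to new."""
    return 100.0 * (new - old) / old

print(f"UTMOS: {rel_change(3.65, 2.90):+.1f}%")  # +25.9%, reported as +26%
print(f"NISQA: {rel_change(4.02, 3.74):+.1f}%")  # +7.5%
print(f"WER:   {rel_change(1.2, 1.5):+.1f}%")    # -20.0%
```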

Related Work & Discussion

This work addresses the unique challenges of fine-tuning TTS diffusion models using reinforcement learning techniques. While existing methods like RWR and DDPO struggle with the temporal and acoustic demands of TTS, DLPO provides a tailored solution.

By integrating the diffusion model's original training loss into the reward function, DLPO stabilizes training, prevents overfitting, and enables task-specific adaptations. This approach demonstrates the importance of leveraging task-specific regularization to address the complexities of sequential data generation.

Our findings establish DLPO as a robust framework for advancing diffusion-based TTS synthesis and set a foundation for broader applications in resource-constrained and real-time scenarios.

BibTeX

@inproceedings{dlpo2025,
  title={Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback},
  author={Jingyi Chen and Ju Seung Byun and Micha Elsner and Pichao Wang and Andrew Perrault},
  booktitle={Interspeech},
  year={2025}
}