Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback

1Department of Linguistics, The Ohio State University
2Department of Computer Science and Engineering, The Ohio State University
3Amazon

🎵 Audio Demos

DLPO significantly improves speech quality and naturalness in TTS diffusion models. Listen to the difference!

Abstract

Diffusion models produce high-fidelity speech, but their long denoising process makes them inefficient for real-time use, and they struggle to model intonation and rhythm. To address these issues, we propose Diffusion Loss-Guided Policy Optimization (DLPO), an RLHF framework for TTS diffusion models.

DLPO integrates the original training loss into the reward function, preserving generative capabilities while reducing inefficiencies. Using naturalness scores as feedback, DLPO aligns reward optimization with the diffusion model's structure, improving speech quality.

We evaluate DLPO on WaveGrad 2, a non-autoregressive diffusion-based TTS model. Results show significant improvements in objective metrics (UTMOS 3.65, NISQA 4.02) and subjective evaluations, with DLPO audio preferred 67% of the time. These findings demonstrate DLPO's potential for efficient, high-quality diffusion TTS in real-time, resource-limited settings.

Method Overview

🔧 Diffusion Loss-Guided Policy Optimization (DLPO)

Unlike existing RLHF methods that rely solely on external rewards or KL regularization, DLPO directly integrates the diffusion model's original training loss into the reward function. This design serves two purposes:

🎯 Preserve Model Capabilities

By aligning the reward function with the original diffusion training objective, DLPO ensures the model maintains its ability to generate high-quality speech while adapting to human feedback.

🛡️ Prevent Overfitting

The original diffusion loss acts as a stabilizing regularizer, balancing external reward optimization with preservation of the model's probabilistic structure.

[Figure: DLPO overview. Input text ("Hello world") is synthesized by the WaveGrad 2 TTS model through the diffusion process x_T → x_0. The generated audio is scored by the UTMOS reward model, giving a naturalness reward r(x_0, c), while the original diffusion loss ‖ε̃(x_t, t) − ε_θ(x_t, c, t)‖² preserves model capabilities and prevents overfitting. Both terms enter the DLPO objective, and the policy is updated iteratively: θ ← θ + ∇_θ J_DLPO(θ).]

DLPO objective function: 𝔼[−α·r(x_0, c) − β·‖ε̃(x_t, t) − ε_θ(x_t, c, t)‖²], where α is the reward weight, β is the diffusion-loss weight, and r(x_0, c) is the UTMOS naturalness score.
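For concreteness, here is a minimal PyTorch-style sketch of the recipe described above: an external naturalness reward combined with the original diffusion loss as a regularizer. It is written under the usual convention of minimizing a loss (so the reward enters with a negative sign and the diffusion term with a positive one), and it omits the policy-gradient machinery that carries the reward signal back through the denoising steps; it is an illustration, not the paper's implementation, and the weights alpha/beta and all tensor shapes are placeholders.

```python
# Illustrative sketch only: reward term + original diffusion loss as a regularizer.
# Not the paper's code; shapes and tensors below are dummies.
import torch

def dlpo_style_loss(reward: torch.Tensor,    # r(x_0, c), e.g. UTMOS scores, shape (B,)
                    eps_true: torch.Tensor,  # injected noise eps_tilde, shape (B, T)
                    eps_pred: torch.Tensor,  # denoiser output eps_theta(x_t, c, t), shape (B, T)
                    alpha: float = 1.0,      # reward weight
                    beta: float = 1.0) -> torch.Tensor:
    """Combined loss to minimize: -alpha * r(x_0, c) + beta * ||eps_tilde - eps_theta||^2."""
    diffusion_loss = torch.mean((eps_true - eps_pred) ** 2)
    return -alpha * reward.mean() + beta * diffusion_loss

# Dummy usage with a batch of 4 one-second (16 kHz) noise targets.
reward = torch.rand(4)                                # stand-in for UTMOS scores
eps_true = torch.randn(4, 16000)
eps_pred = torch.randn(4, 16000, requires_grad=True)  # would come from the WaveGrad 2 denoiser
loss = dlpo_style_loss(reward, eps_true, eps_pred)
loss.backward()  # gradients flow through the diffusion term; the reward term
                 # additionally needs the RL surrogate (per-step log-probs) omitted here
```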

Experimental Results

📈 Comparison of RL Fine-tuning Methods

Method                 | UTMOS ↑ | NISQA ↑ | WER ↓
Ground Truth           | 4.20    | 4.37    | 0.99%
WaveGrad 2R (Baseline) | 2.90    | 3.74    | 1.5%
RWR                    | 2.18    | 3.00    | 8.9%
DDPO                   | 2.69    | 2.96    | 2.1%
DPOK                   | 3.18    | 3.76    | 1.1%
KLinR                  | 3.02    | 3.73    | 1.3%
DLPO (Ours)            | 3.65    | 4.02    | 1.2%
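As a rough guide to how objective metrics like those in the table can be reproduced, the sketch below computes WER by transcribing the synthesized audio with an ASR model and scoring it against the input text. The toolchain here (openai-whisper and jiwer) is an assumption, not necessarily what the paper used, and predict_utmos / predict_nisqa are hypothetical wrappers around the pretrained UTMOS and NISQA MOS predictors.

```python
# Sketch of objective evaluation; the ASR/MOS tooling here is assumed,
# not necessarily the toolchain used in the paper.
import jiwer    # pip install jiwer
import whisper  # pip install openai-whisper

def word_error_rate(reference_text: str, wav_path: str, asr_model) -> float:
    """Transcribe the generated audio and compute WER against the input text."""
    hypothesis = asr_model.transcribe(wav_path)["text"]
    return jiwer.wer(reference_text.lower(), hypothesis.lower())

def naturalness_scores(wav_path: str) -> dict:
    """UTMOS and NISQA come from their pretrained MOS predictors;
    predict_utmos / predict_nisqa are hypothetical wrappers around those checkpoints."""
    return {"UTMOS": predict_utmos(wav_path), "NISQA": predict_nisqa(wav_path)}

if __name__ == "__main__":
    asr = whisper.load_model("base")
    print(word_error_rate("Hello world", "dlpo_sample.wav", asr))
```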

🎧 Human Evaluation Results

  • 67% of listeners preferred DLPO-generated audio
  • 14% preferred baseline WaveGrad 2R
  • 19% rated as about the same
  • Statistical significance: p < 10⁻¹⁶ (binomial test; see the sketch below)
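The reported significance can be checked with an exact binomial test on the raw preference counts (excluding "about the same" ratings). The page gives only percentages, so the sketch below takes the counts as inputs and assumes a two-sided test against a 50/50 null.

```python
# Significance check for the pairwise preference test (DLPO vs. WaveGrad 2R).
from scipy.stats import binomtest

def preference_pvalue(n_prefer_dlpo: int, n_prefer_baseline: int) -> float:
    """Two-sided exact binomial test on the win counts, null hypothesis p = 0.5."""
    n_total = n_prefer_dlpo + n_prefer_baseline
    return binomtest(n_prefer_dlpo, n_total, p=0.5, alternative="two-sided").pvalue

# Usage: pass the raw win counts from the listening test (not given on this page),
# e.g. preference_pvalue(n_dlpo_wins, n_baseline_wins).
```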

⚡ Key Improvements

  • +26% relative improvement in UTMOS
  • +7.5% relative improvement in NISQA
  • 20% relative reduction in word error rate (WER); see the quick check of these figures below
  • Maintains computational efficiency
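These percentages are relative changes with respect to the WaveGrad 2R baseline in the results table; a quick arithmetic check:

```python
# Recompute the relative changes from the results table (baseline: WaveGrad 2R).
def rel_change(new: float, old: float) -> float:
    """Percentage change from old to new."""
    return 100.0 * (new - old) / old

print(f"UTMOS: {rel_change(3.65, 2.90):+.1f}%")  # +25.9%, reported as +26%
print(f"NISQA: {rel_change(4.02, 3.74):+.1f}%")  # +7.5%
print(f"WER:   {rel_change(1.2, 1.5):+.1f}%")    # -20.0%
```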

Related Work & Discussion

This work addresses the unique challenges of fine-tuning TTS diffusion models using reinforcement learning techniques. While existing methods like RWR and DDPO struggle with the temporal and acoustic demands of TTS, DLPO provides a tailored solution.

By integrating the diffusion model's original training loss into the reward function, DLPO stabilizes training, prevents overfitting, and enables task-specific adaptations. This approach demonstrates the importance of leveraging task-specific regularization to address the complexities of sequential data generation.

Our findings establish DLPO as a robust framework for advancing diffusion-based TTS synthesis and set a foundation for broader applications in resource-constrained and real-time scenarios.

BibTeX

@inproceedings{dlpo2025,
  title={Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback},
  author={Jingyi Chen and Ju Seung Byun and Micha Elsner and Pichao Wang and Andrew Perrault},
  booktitle={Interspeech},
  year={2025}
}