Unlike existing RLHF methods, which rely solely on external rewards or KL regularization, DLPO directly integrates the diffusion model's original training loss into the reward function. This design serves two purposes:
🎯 Preserve Model Capabilities
By keeping the reward aligned with the original diffusion training objective, DLPO helps the model retain its ability to generate high-quality speech while adapting to human feedback.
🛡️ Prevent Overfitting
The original diffusion loss acts as a stabilizing regularizer, balancing external reward optimization with preservation of the model's probabilistic structure.
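Concretely, this integration can be pictured as reward shaping: each generated sample's external reward is penalized by that sample's own diffusion (noise-prediction) loss before it enters the policy-gradient update. The sketch below is a minimal illustration under assumed details, not the official DLPO implementation; the names `dlpo_style_loss`, `beta`, and `log_probs`, the DDPM-style noise-prediction loss, and the REINFORCE-style surrogate are all illustrative choices.

```python
# Minimal sketch (assumptions, not the official DLPO code): a DDPM-style
# noise-prediction model, a scalar reward per sample, and a REINFORCE-style
# surrogate over the sampled denoising trajectory.
import torch
import torch.nn.functional as F


def dlpo_style_loss(
    pred_noise: torch.Tensor,       # eps_theta(x_t, t): model's noise prediction, [B, ...]
    true_noise: torch.Tensor,       # noise added in the forward process, same shape
    log_probs: torch.Tensor,        # sum_t log p_theta(x_{t-1} | x_t) over the trajectory, [B]
    external_reward: torch.Tensor,  # scalar human-feedback reward per sample, [B]
    beta: float = 0.1,              # weight on the original diffusion loss (assumed value)
) -> torch.Tensor:
    # Original diffusion training objective: per-sample noise-prediction MSE.
    diffusion_loss = F.mse_loss(pred_noise, true_noise, reduction="none")
    diffusion_loss = diffusion_loss.flatten(1).mean(dim=1)  # [B]

    # Shaped reward: external feedback minus the model's own training loss,
    # so reward maximization is pulled back toward the pretrained objective.
    shaped_reward = external_reward - beta * diffusion_loss  # [B]

    # REINFORCE-style surrogate: maximize the shaped reward by weighting the
    # trajectory log-probabilities with it (reward treated as a constant).
    return -(shaped_reward.detach() * log_probs).mean()


# Toy usage with random tensors standing in for a batch of 4 spectrograms.
if __name__ == "__main__":
    B = 4
    pred = torch.randn(B, 80, 100)
    true = torch.randn(B, 80, 100)
    logp = torch.randn(B, requires_grad=True)  # would come from the denoising network
    reward = torch.rand(B)                     # would come from a reward model
    loss = dlpo_style_loss(pred, true, logp, reward, beta=0.1)
    loss.backward()
    print(float(loss))
```

In practice, `log_probs` would come from the per-step Gaussian transitions of the fine-tuned denoising network, so the shaped reward steers the same parameters that the original diffusion loss regularizes.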