Jingyi Chen | The Ohio State University

About Me

I am a Ph.D. Candidate in Computational Linguistics at The Ohio State University, specializing in speech synthesis, multimodal large language models, and reinforcement learning for audio. I work under the supervision of Dr. Micha Elsner and Dr. Andrew Perrault, with committee members Dr. Eric Fosler-Lussier and Dr. Cynthia Clopper. My research focuses on advancing speech synthesis through reinforcement learning and diffusion models, developing speech emotion conversion systems, and creating benchmarks for evaluating multimodal LLMs on emotional speech understanding.

Previously, I completed my M.S. in Computer Science & Engineering at OSU. I have industry experience as an Applied Scientist Intern at Amazon, including at Amazon DEX AI (Summer 2025), where I built LLM-based ranking systems for product recommendations, and at Amazon Prime Video (Summer 2024), where I developed production-ready speech emotion transfer systems.

Research Interests

Speech Synthesis: Text-to-speech systems, diffusion models, emotional speech generation, GANs for speech representation learning
Multimodal Large Language Models: Speech-text cooperation, instruction tuning, semantic-emotion disentanglement
Reinforcement Learning for Audio: RLHF, reward-based optimization, model fine-tuning

News

[Early 2026] Paper “Do Audio LLMs Really LISTEN, or Just Transcribe?” accepted to EACL 2026.
[Oct. 2025] Released LISTEN benchmark for evaluating lexical vs. acoustic cue reliance in audio LLMs.
[Aug. 2025] Completed internship at Amazon DEX AI, where I built LLM-based ranking systems for low-consideration purchases.
[Aug. 2025] Started new research project on Social-Emotional Speech Dialogue Benchmark for Multimodal LLMs.
[May 2025] Paper “Reinforcement Learning for Fine-tuning Text-to-speech Diffusion Models” accepted to Interspeech 2025 (Oral Presentation).
[Jan. 2025] Released comprehensive emotion transfer dataset with 27K audio samples and published project page.
[Aug. 2024] Completed internship at Amazon Prime Video, where I delivered a speech-to-speech emotion transfer model to production.
[Aug. 2023] Paper “Exploring How Generative Adversarial Networks Learn Phonological Representations” accepted to ACL 2023 and received an Area Chair Award.

Publications

EACL

Do Audio LLMs Really LISTEN, or Just Transcribe? Measuring Lexical vs. Acoustic Emotion Cues Reliance

Jingyi Chen, Zhimeng Guo, Jiyun Chun, Pichao Wang, Andrew Perrault, Micha Elsner

European Chapter of the Association for Computational Linguistics (EACL), 2026.

@inproceedings{chen2026listen,
  title={Do Audio LLMs Really LISTEN, or Just Transcribe? Measuring Lexical vs. Acoustic Emotion Cues Reliance},
  author={Jingyi Chen and Zhimeng Guo and Jiyun Chun and Pichao Wang and Andrew Perrault and Micha Elsner},
  booktitle={Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL)},
  year={2026},
}

Interspeech

Reinforcement Learning for Fine-tuning Text-to-speech Diffusion Models

Jingyi Chen, Ju-Seung Byun, Micha Elsner, Pichao Wang, Andrew Perrault

Interspeech, 2025.

Oral Presentation

@misc{chen2025finetuningtexttospeechdiffusionmodels,
  title={Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback},
  author={Jingyi Chen and Ju-Seung Byun and Micha Elsner and Pichao Wang and Andrew Perrault},
  year={2025},
  eprint={2508.03123},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2508.03123},
}

ACL

Exploring How Generative Adversarial Networks Learn Phonological Representations

Jingyi Chen, Micha Elsner

Annual Meeting of the Association for Computational Linguistics (ACL), 2023.

Area Chair Awards, Oral Presentation

@misc{chen2023exploringgenerativeadversarialnetworks,
  title={Exploring How Generative Adversarial Networks Learn Phonological Representations},
  author={Jingyi Chen and Micha Elsner},
  year={2023},
  eprint={2305.12501},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2305.12501},
}

TTIC Workshop

A Curriculum Learning Paradigm for Speech Emotion Transfer

Jingyi Chen, Pichao Wang, Andrew Perrault, Micha Elsner

TTIC Speech & Audio Foundation Models Workshop, 2025.

@inproceedings{chen2025emotion,
  title={A Curriculum Learning Paradigm for Speech Emotion Transfer},
  author={Chen, Jingyi and Wang, Pichao and Perrault, Andrew and Elsner, Micha},
  booktitle={TTIC Speech \& Audio Foundation Models Workshop},
  year={2025}
}

Memory retrieval as pressure towards chunking in morphological inflection

Micha Elsner, Jingyi Chen, Andrea Sims

Computational Linguistics, 2025.

Journal Article

@article{elsner2025memory,
  title={Memory retrieval as pressure towards chunking in morphological inflection},
  author={Elsner, Micha and Chen, Jingyi and Sims, Andrea},
  journal={Computational Linguistics},
  year={2025}
}

Services

Conference Reviewers

Blog

A collection of technical notes on speech, language, and machine learning.

Autoregressive Models for Speech

EnCodec: High-Fidelity Neural Audio Codec with Streaming and Variable Bitrate — Encoder/decoder architecture, RVQ, MS-STFT discriminator, loss balancer, streaming mode, ablation results.
Codec-based TTS Pipeline: RVQ, Semantic Tokens, and Acoustic Tokens — RVQ mechanics, delay pattern, semantic vs. acoustic tokens, codebook collapse, exposure bias, EnCodec vs. DAC vs. Mimi.