基本信息
- 标题: "CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model"
- 作者:
- 01 Zhen Ye,
- 02 Wei Xue,
- 03 Xu Tan,
- 04 Jie Chen,
- 05 Qifeng Liu,
- 06 Yike Guo
- 链接:
- ArXiv v4
- Publication ACM MM2023
- Github
- Demo
- 文件:
展开原文
Denoising Diffusion Probabilistic Models (DDPMs) have shown promising performance for speech synthesis. However, a large number of iterative steps are required to achieve high sample quality, which restricts the inference speed. Maintaining sample quality while increasing sampling speed has become a challenging task. In this paper, we propose a Consistency Model-based Speech synthesis method, CoMoSpeech, which achieve speech synthesis through a single diffusion sampling step while achieving high audio quality. The consistency constraint is applied to distill a consistency model from a well-designed diffusion-based teacher model, which ultimately yields superior performances in the distilled CoMoSpeech. Our experiments show that by generating audio recordings by a single sampling step, the CoMoSpeech achieves an inference speed more than 150 times faster than real-time on a single NVIDIA A100 GPU, which is comparable to FastSpeech2, making diffusion-sampling based speech synthesis truly practical. Meanwhile, objective and subjective evaluations on text-to-speech and singing voice synthesis show that the proposed teacher models yield the best audio quality, and the one-step sampling based CoMoSpeech achieves the best inference speed with better or comparable audio quality to other conventional multi-step diffusion model baselines. Audio samples are available at this https URL.
去噪扩散概率模型 (Denoising Diffusion Probabilistic Models, DDPMs) 在语音合成方面展示了有前景的性能. 然而, 为了实现高样本质量需要大量的迭代步数, 这限制了推理速度. 保持样本质量的同时提高采样速度已经成为一个具有挑战性的任务.
本文提出了基于一致性模型的语音合成方法 CoMoSpeech, 它通过单步扩散采样实现语音合成, 同时保持高音质. 从一个精心设计的基于扩散的教师模型中应用一致性约束提取出一致性模型, 最终在蒸馏的 CoMoSpeech 中获得卓越的性能.
我们的实验表明通过单词采样步骤来生成音频, CoMoSpeech 在单个 Nvidia A100 GPU 上实现了推理速度超过 150 倍的实时速度, 这与 FastSpeech2 相当, 使得基于扩散的语音合成真正具备实际应用价值. 同时, 在文本转语音和歌声合成方面的客观和主观评估表明所提出的教师模型产生了最佳音频质量, 且基于单步采样的 CoMoSpeech 获得了与其他多步扩散模型基线相当或更好的音频质量且具有最佳的推理速度.
音频示例可在此处 获得.