
Commit fe2b2a8

Update Sec.02
1 parent f639912 commit fe2b2a8

File tree

1 file changed: +51 -4 lines changed

  • Surveys/2024.12.09_Towards_Controllable_Speech_Synthesis_in_the_Era_of_LLM_23P

Surveys/2024.12.09_Towards_Controllable_Speech_Synthesis_in_the_Era_of_LLM_23P/Sec.02.md

Lines changed: 51 additions & 4 deletions
@@ -394,11 +394,23 @@ VITS combines non-autoregressive synthesis with stochastic latent variable modeling to achieve real-time, high-quality speech synthesis

## E·Acoustic Feature Representations: 声学特征表示

<details>
<summary>Expand original</summary>

In TTS, the choice of acoustic feature representations impacts the model's flexibility, quality, expressiveness, and controllability.
This subsection investigates continuous representations and discrete tokens as shown in Fig.02, along with their pros and cons for TTS applications.

</details>
<br>

In TTS, the choice of acoustic feature representation affects the model's flexibility, quality, expressiveness, and controllability.
This subsection examines the continuous representations and discrete tokens shown in Fig.02, along with their advantages and disadvantages for TTS applications.

### Continuous Representations: 连续表示

<details>
<summary>Expand original</summary>

Continuous representations (e.g., mel-spectrograms) of intermediate acoustic features use a continuous feature space to represent speech signals.
These representations often involve acoustic features that capture frequency, pitch, and other characteristics without discretizing the signal.
The advantages of continuous features are:
@@ -409,17 +421,52 @@ The advantages of continuous features are:
GAN-based ([HiFi-GAN [116]](../../Models/Vocoder/2020.10.12_HiFi-GAN.md); [Parallel WaveGAN [144]](../../Models/Vocoder/2019.10.25_Parallel_WaveGAN.md); [MelGAN [145]](../../Models/Vocoder/2019.10.08_MelGAN.md)) and diffusion-based methods ([FastDiff [147]](../../Models/Vocoder/2022.04.21_FastDiff.md); [DiffWave [148]](../../Models/Vocoder/2020.09.21_DiffWave.md)) often utilize continuous feature representations, i.e., mel-spectrograms.
However, continuous representations are typically more computationally demanding and require larger models and memory, especially in high-resolution audio synthesis.

</details>
<br>

Continuous representations of intermediate acoustic features (e.g., mel-spectrograms) use a continuous feature space to represent speech signals.
These representations often capture frequency, pitch, and other characteristics of the signal without discretizing it.

The advantages of continuous features are:
1) Continuous representations preserve fine-grained detail, enabling more expressive and natural-sounding speech synthesis.
2) Because continuous representations inherently capture variations in intonation, pitch, and stress, they are well suited to prosody control and emotional TTS.
3) Continuous representations are more robust to information loss and avoid quantization distortion, yielding smoother, less distorted audio.

GAN-based ([HiFi-GAN [116]](../../Models/Vocoder/2020.10.12_HiFi-GAN.md); [Parallel WaveGAN [144]](../../Models/Vocoder/2019.10.25_Parallel_WaveGAN.md); [MelGAN [145]](../../Models/Vocoder/2019.10.08_MelGAN.md)) and diffusion-based ([FastDiff [147]](../../Models/Vocoder/2022.04.21_FastDiff.md); [DiffWave [148]](../../Models/Vocoder/2020.09.21_DiffWave.md)) models commonly use continuous feature representations, i.e., mel-spectrograms.
However, continuous representations typically demand more computation and require larger models and more memory, especially in high-resolution audio synthesis.
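
As a concrete illustration of continuous acoustic features, the sketch below extracts a log-mel-spectrogram with librosa. This is a minimal example under assumed settings, not the feature pipeline of any specific model above: the file path, sampling rate, and frame parameters (1024-point FFT, 256-sample hop, 80 mel bands) are illustrative choices commonly seen in TTS front ends.

```python
import librosa
import numpy as np

# Load a waveform at a typical TTS sampling rate (path and rate are illustrative).
wav, sr = librosa.load("sample.wav", sr=22050)

# Continuous acoustic features: an 80-band mel-spectrogram.
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)

# Log compression, as commonly applied before a neural vocoder (e.g., HiFi-GAN-style models).
log_mel = librosa.power_to_db(mel, ref=np.max)

print(log_mel.shape)  # (80, num_frames): a dense, real-valued representation
```

Each frame is a real-valued vector, which is what preserves the fine prosodic detail discussed above but also keeps the memory and compute cost relatively high.
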
### Discrete Tokens: 离散 Tokens

<details>
<summary>Expand original</summary>

In discrete token-based TTS, the intermediate acoustic features (e.g., quantized units or phoneme-like tokens) are discrete values, similar to words or phonemes in languages.
These are often produced using quantization techniques or learned embeddings, such as [HuBERT [166]](../../Models/SpeechRepresentation/2021.06.14_HuBERT.md) and [SoundStream [168]](../../Models/SpeechCodec/2021.07.07_SoundStream.md).
The advantages of discrete tokens are:
1) Discrete tokens can encode phonemes or sub-word units, making them concise and less computationally demanding to handle.
2) Discrete tokens often allow TTS systems to require fewer samples to learn and generalize, as the representations are compact and simplified.
3) Using discrete tokens simplifies cross-modal TTS applications like voice cloning or translation-based TTS, as they map well to text-like representations such as LLM tokens.

LLM-based ([MaskGCT [78]](../../Models/SpeechLM/2024.09.01_MaskGCT.md); [VoxInstruct [103]](../../Models/SpeechLM/2024.08.28_VoxInstruct.md); [InstructTTS [105]](../../Models/Acoustic/2023.01.31_InstructTTS.md); [ControlSpeech [106]](../../Models/SpeechLM/2024.06.03_ControlSpeech.md)) and zero-shot TTS methods ([CosyVoice [17]](../../Models/SpeechLM/2024.07.07_CosyVoice.md); [MaskGCT [78]](../../Models/SpeechLM/2024.09.01_MaskGCT.md); [NaturalSpeech3 [87]](../../Models/Diffusion/2024.03.05_NaturalSpeech3.md)) often adopt discrete tokens as their acoustic features.
However, discrete representation learning may result in information loss or lack the nuanced details that can be captured in continuous representations.

Table.04 and Table.03 summarize the types of acoustic features of representative methods.
Table.02 summarizes popular open-source speech quantization methods.

</details>
<br>

In discrete token-based TTS, the intermediate acoustic features (e.g., quantized units or phoneme-like tokens) are discrete values, analogous to words or phonemes in a language.
They are usually produced with quantization techniques or learned embeddings, such as [HuBERT [166]](../../Models/SpeechRepresentation/2021.06.14_HuBERT.md) and [SoundStream [168]](../../Models/SpeechCodec/2021.07.07_SoundStream.md).

The advantages of discrete tokens are:
1) Discrete tokens can encode phonemes or sub-word units, making them concise and computationally cheap to handle.
2) Discrete tokens often let TTS systems learn and generalize from fewer samples, because the representations are compact and simplified.
3) Using discrete tokens simplifies cross-modal TTS applications such as voice cloning or translation-based TTS, since they map well onto text-like representations like LLM tokens.

LLM-based methods ([MaskGCT [78]](../../Models/SpeechLM/2024.09.01_MaskGCT.md); [VoxInstruct [103]](../../Models/SpeechLM/2024.08.28_VoxInstruct.md); [InstructTTS [105]](../../Models/Acoustic/2023.01.31_InstructTTS.md); [ControlSpeech [106]](../../Models/SpeechLM/2024.06.03_ControlSpeech.md)) and zero-shot TTS methods ([CosyVoice [17]](../../Models/SpeechLM/2024.07.07_CosyVoice.md); [MaskGCT [78]](../../Models/SpeechLM/2024.09.01_MaskGCT.md); [NaturalSpeech3 [87]](../../Models/Diffusion/2024.03.05_NaturalSpeech3.md)) often adopt discrete tokens as their acoustic features.

However, discrete representation learning may cause information loss or miss the nuanced details that continuous representations can capture.

Table.04 and Table.03 summarize the types of acoustic features of representative methods.

Table.02 summarizes popular open-source speech quantization methods. #TODO CSV
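
To make discrete acoustic tokens concrete, the sketch below quantizes continuous feature frames against a codebook by nearest-neighbor lookup, in the spirit of the vector quantization used in codecs such as SoundStream or the k-means units derived from HuBERT features. It is a simplified, self-contained illustration: the frames and the codebook are random stand-ins, and all names and sizes are assumptions rather than any model's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Continuous acoustic frames (e.g., 80-dim mel or SSL features); random stand-ins here.
frames = rng.normal(size=(100, 80))      # (num_frames, feature_dim)

# A codebook of 1024 entries; in a real codec this is learned (e.g., k-means or VQ-VAE training).
codebook = rng.normal(size=(1024, 80))   # (codebook_size, feature_dim)

# Nearest-neighbor quantization: each frame becomes the index of its closest codeword.
dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
tokens = dists.argmin(axis=-1)           # (num_frames,) integer tokens, LLM-friendly

# Looking the codewords back up shows where quantization loss enters.
quantized = codebook[tokens]

print(tokens[:10])                         # discrete token sequence
print(np.mean((frames - quantized) ** 2))  # reconstruction error, i.e., the information loss noted above
```

In practice, neural codecs stack several residual codebooks (residual vector quantization) to reduce this error, and the resulting token sequences are what LLM-based and zero-shot TTS systems model autoregressively or with masked prediction.
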
