However, continuous representations are typically more computationally demanding and require larger models and memory, especially in high-resolution audio synthesis.
In discrete token-based TTS, the intermediate acoustic features (e.g., quantized units or phoneme-like tokens) are discrete values, similar to words or phonemes in languages.
- These are often produced using quantization techniques or learned embeddings, such as HuBERT~\cite{hsu2021hubert} and SoundStream~\cite{zeghidour2021soundstream}.
+ These are often produced using quantization techniques or learned embeddings, such as [HuBERT [166]](../../Models/SpeechRepresentation/2021.06.14_HuBERT.md) and [SoundStream [168]](../../Models/SpeechCodec/2021.07.07_SoundStream.md).
The advantages of discrete tokens are:
1) Discrete tokens can encode phonemes or sub-word units, making them concise and less computationally demanding to handle.
2) Discrete tokens often allow TTS systems to require fewer samples to learn and generalize, as the representations are compact and simplified.
3) Using discrete tokens simplifies cross-modal TTS applications like voice cloning or translation-based TTS, as they map well to text-like representations such as LLM tokens.

- LLM-based~\cite{wang2024maskgct,zhou2024voxinstruct,ji2024controlspeech,[InstructTTS [105]](../../Models/Acoustic/2023.01.31_InstructTTS.md)} and zero-shot TTS methods~\cite{[CosyVoice [17]](../../Models/SpeechLM/2024.07.07_CosyVoice.md); [MaskGCT]wang2024maskgct,ju2024naturalspeech3} often adopt discrete tokens as their acoustic features.
+ LLM-based ([MaskGCT [78]](../../Models/SpeechLM/2024.09.01_MaskGCT.md); [VoxInstruct [103]](../../Models/SpeechLM/2024.08.28_VoxInstruct.md); [InstructTTS [105]](../../Models/Acoustic/2023.01.31_InstructTTS.md); [ControlSpeech [106]](../../Models/SpeechLM/2024.06.03_ControlSpeech.md)) and zero-shot TTS methods ([CosyVoice [17]](../../Models/SpeechLM/2024.07.07_CosyVoice.md); [MaskGCT [78]](../../Models/SpeechLM/2024.09.01_MaskGCT.md); [NaturalSpeech3 [87]](../../Models/Diffusion/2024.03.05_NaturalSpeech3.md)) often adopt discrete tokens as their acoustic features.
However, discrete representation learning may result in information loss or lack the nuanced details that can be captured in continuous representations.

- Table~\ref{tab:sec5_controllable_methods_ar} and~\ref{tab:sec5_controllable_methods_nar} summarize the types of acoustic features of representative methods.
- Table \ref{tab:sec2_quantization} summarizes popular open-source speech quantization methods.
+ Table.04 and Table.03 summarize the types of acoustic features of representative methods.
+ Table.02 summarizes popular open-source speech quantization methods.
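
The passage in this diff says that discrete tokens are often produced by quantization and that this may lose information compared with continuous features. As a rough illustration only, the toy sketch below maps continuous frames to the index of the nearest codebook vector and then measures the reconstruction error. The codebook, the frame values, and the function names `quantize` and `dequantize` are hypothetical; this does not reproduce how HuBERT or SoundStream actually derive their tokens.

```python
import math


def quantize(frames, codebook):
    """Map each continuous frame to the index of its nearest codebook vector."""
    tokens = []
    for frame in frames:
        distances = [math.dist(frame, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens, codebook):
    """Replace each token ID with its codebook vector (a lossy reconstruction)."""
    return [codebook[t] for t in tokens]


# Hypothetical 2-D codebook and frames; real systems use learned,
# much higher-dimensional codebooks.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.8)]

tokens = quantize(frames, codebook)
reconstructed = dequantize(tokens, codebook)

# Nonzero errors show the detail discarded by quantization.
errors = [math.dist(f, r) for f, r in zip(frames, reconstructed)]

print("tokens:", tokens)
print("reconstruction errors:", errors)
```

Each frame collapses to a single integer, which is what makes the representation compact and text-like. The nonzero reconstruction errors are a small-scale picture of the information loss mentioned above.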