
Commit 1df320b

Update Transformer TTS
1 parent 3091c47 commit 1df320b

File tree

1 file changed: +45 -26 lines


Models/Acoustic/2018.09.19_TransformerTTS.md

## Abstract: 摘要

<table><tr><td width="50%">

Although end-to-end neural text-to-speech (TTS) methods (such as **Tacotron2**) are proposed and achieve state-of-the-art performance, they still suffer from two problems:
1. low efficiency during training and inference;
2. hard to model long dependency using current recurrent neural networks (RNNs).

Inspired by the success of Transformer network in neural machine translation (NMT), in this paper, we introduce and adapt the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in **Tacotron2**.
With the help of multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves training efficiency.
Meanwhile, any two inputs at different times are connected directly by a self-attention mechanism, which solves the long range dependency problem effectively.
Using phoneme sequences as input, our ***Transformer TTS*** network generates mel spectrograms, followed by a **WaveNet** vocoder to output the final audio results.
Experiments are conducted to test the efficiency and performance of our new network.
For the efficiency, our ***Transformer TTS*** network can speed up the training about 4.25 times faster compared with **Tacotron2**.
For the performance, rigorous human tests show that our proposed model achieves state-of-the-art performance (outperforms **Tacotron2** with a gap of 0.048) and is very close to human quality (4.39 vs 4.44 in MOS).

</td><td>

尽管端到端神经文本转语音方法 (如 **Tacotron2**) 被提出并达到 SoTA 性能, 但它们仍存在两个问题:
1. 训练和推理的低效率;
2. 使用当前的循环神经网络难以建模长期依赖.

受到 Transformer 网络在神经机器翻译 (Neural Machine Translation, NMT) 中的成功的启发, 我们在本文中引入并修改多头注意力机制以替换 RNN 结构和 **Tacotron2** 中的原始注意力机制.
借助多头自注意力, 编码器和解码器的隐藏状态可以并行构造, 提高了训练效率.
同时, 在不同时点的任意两个输入通过自注意力机制直接连接, 有效解决了长期依赖问题.
使用音素序列作为输入, 我们的 ***Transformer TTS*** 网络生成梅尔频谱图, 后跟 **WaveNet** 声码器来输出最终音频结果.
我们构造了实验用于测试我们新网络的效率和性能.
- 效率方面, 我们的 ***Transformer TTS*** 网络的训练速度比 **Tacotron2** 快约 4.25 倍.
- 性能方面, 严格的人工测试表明, 我们提出的模型达到 SoTA 性能 (以 0.048 的差距超过 **Tacotron2**), 且非常接近人类质量 (MOS 4.39 vs 4.44).

</td></tr></table>
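
As the abstract notes, the recurrent encoder and decoder of **Tacotron2** are replaced with multi-head self-attention, so the hidden states for an entire phoneme sequence are computed in one parallel pass rather than step by step. Below is a minimal PyTorch sketch of such an encoder block; the class name and hyperparameters are illustrative assumptions, not the paper's exact configuration (which also includes positional encodings, an encoder pre-net, and other details).

```python
import torch
import torch.nn as nn

class SelfAttentionEncoderBlock(nn.Module):
    """Minimal sketch of one encoder block: multi-head self-attention + feed-forward.
    Hyperparameters (d_model=256, n_heads=4, d_ff=1024) are illustrative only."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, d_ff: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Every position attends to every other position in a single step,
        # so hidden states for the whole sequence are built in parallel
        # (unlike an RNN, which must walk the sequence step by step).
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ffn(x))

# Toy usage: a batch of 2 phoneme sequences of length 50, already embedded.
phoneme_states = torch.randn(2, 50, 256)
encoder_out = SelfAttentionEncoderBlock()(phoneme_states)
print(encoder_out.shape)  # torch.Size([2, 50, 256])
```

In **Tacotron2**, the corresponding hidden states come from a bidirectional LSTM that walks the phoneme sequence one step at a time, which is exactly the training-efficiency bottleneck this replacement targets.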

## 1·Introduction: 引言

<table><tr><td width="50%">

</td></tr></table>

## 2·Related Works: 相关工作

<table><tr><td width="50%">

</td></tr></table>

## 3·Methodology: 方法

![](Images/2018.09.19_TransformerTTS_Fig.03.jpg)

<table><tr><td width="50%">

</td></tr></table>

## 4·Experiments: 实验

<table><tr><td width="50%">

</td></tr></table>

## 5·Results: 结果

<table><tr><td width="50%">

</td></tr></table>

## 6·Conclusions: 结论

<table><tr><td width="50%">

We propose a neural TTS model based on **Tacotron2** and **Transformer**, and make some modifications to adapt **Transformer** to the neural TTS task.
Our model generates audio samples whose quality is very close to human recordings, and it enables parallel training and learning of long-distance dependencies, so that training is sped up and the audio prosody is much smoother.
We find that batch size is crucial for training stability, and that more layers can refine the detail of the generated mel spectrograms, especially in high-frequency regions, thus improving model performance.

</td><td>

</td></tr>
<tr><td>

Even though **Transformer** has enabled parallel training, autoregressive models still suffer from two problems: slow inference and exploration bias.
Slow inference is due to the dependency on previous frames when inferring the current frame, so that inference is sequential, while exploration bias comes from autoregressive error accumulation.
We may solve them both at once by building a non-autoregressive model, which is also our current research in progress.

</td><td>

</td></tr></table>
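
The slow-inference point above comes from the fact that each mel frame can only be predicted after all previous frames have been generated. The loop below is a rough, hypothetical sketch of that autoregressive decoding process; `decoder_step`, `memory`, and `stop_threshold` are assumed names for illustration, not the paper's API.

```python
import torch

@torch.no_grad()
def autoregressive_mel_inference(decoder_step, memory, n_mels=80, max_frames=1000, stop_threshold=0.5):
    """Generate mel frames one at a time; each step conditions on all previous frames.

    decoder_step(memory, prev_frames) -> (next_frame of shape [n_mels], stop_prob)
    is a hypothetical callable wrapping the decoder; `memory` is the encoder output.
    """
    frames = [torch.zeros(n_mels)]  # <GO> frame
    for _ in range(max_frames):
        next_frame, stop_prob = decoder_step(memory, torch.stack(frames))
        frames.append(next_frame)
        if stop_prob > stop_threshold:  # stop token predicted
            break
    # Frames are produced strictly sequentially, which is why inference is slow
    # even though training (with teacher forcing) runs in parallel.
    return torch.stack(frames[1:])
```

A non-autoregressive model, as the authors suggest, would predict all frames in a single pass, removing both the sequential loop and the accumulation of the model's own prediction errors.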
