
Commit 1df320b

Update Transformer TTS
1 parent 3091c47 commit 1df320b

File tree

1 file changed: +45 -26 lines


Models/Acoustic/2018.09.19_TransformerTTS.md

## Abstract: 摘要

<table><tr><td width="50%">

Although end-to-end neural text-to-speech (TTS) methods (such as **Tacotron2**) are proposed and achieve state-of-the-art performance, they still suffer from two problems:
1. low efficiency during training and inference;
2. hard to model long dependency using current recurrent neural networks (RNNs).

Inspired by the success of Transformer network in neural machine translation (NMT), in this paper, we introduce and adapt the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in **Tacotron2**.
With the help of multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves training efficiency.
Meanwhile, any two inputs at different times are connected directly by a self-attention mechanism, which solves the long range dependency problem effectively.
Using phoneme sequences as input, our ***Transformer TTS*** network generates mel spectrograms, followed by a **WaveNet** vocoder to output the final audio results.
Experiments are conducted to test the efficiency and performance of our new network.
For the efficiency, our ***Transformer TTS*** network can speed up the training about 4.25 times faster compared with **Tacotron2**.
For the performance, rigorous human tests show that our proposed model achieves state-of-the-art performance (outperforms **Tacotron2** with a gap of 0.048) and is very close to human quality (4.39 vs 4.44 in MOS).

</td><td>

尽管端到端神经文本转语音方法 (如 **Tacotron2**) 被提出并达到 SoTA 性能, 但它们仍存在两个问题:
1. 训练和推理的低效率;
2. 使用当前的循环神经网络难以建模长期依赖.

受到 Transformer 网络在神经机器翻译 (Neural Machine Translation, NMT) 中的成功的启发, 我们在本文中引入并修改多头注意力机制以替换 RNN 结构和 **Tacotron2** 中的原始注意力机制.
借助多头自注意力, 编码器和解码器的隐藏状态可以并行构造, 提高了训练效率.
同时, 在不同时点的任意两个输入通过自注意力机制直接连接, 有效解决了长期依赖问题.
使用音素序列作为输入, 我们的 ***Transformer TTS*** 网络生成梅尔频谱图, 后跟 **WaveNet** 声码器来输出最终音频结果.
我们构造了实验用于测试我们新网络的效率和性能.
- 效率方面, 我们的 ***Transformer TTS*** 网络的训练速度比 **Tacotron2** 快约 4.25 倍.
- 性能方面, 严格的人工测试表明, 我们提出的模型达到 SoTA 性能 (以 0.048 的差距超过 **Tacotron2**), 且非常接近人类质量 (MOS 4.39 vs 4.44).

</td></tr></table>
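
As the abstract notes, the recurrent encoder and decoder of **Tacotron2** are replaced with multi-head self-attention, so the hidden states for an entire phoneme sequence are computed in one parallel pass rather than step by step. Below is a minimal PyTorch sketch of such an encoder block; the class name and hyperparameters are illustrative assumptions, not the paper's exact configuration (which also includes positional encodings, an encoder pre-net, and other details).

```python
import torch
import torch.nn as nn

class SelfAttentionEncoderBlock(nn.Module):
    """Minimal sketch of one encoder block: multi-head self-attention + feed-forward.
    Hyperparameters (d_model=256, n_heads=4, d_ff=1024) are illustrative only."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, d_ff: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Every position attends to every other position in a single step,
        # so hidden states for the whole sequence are built in parallel
        # (unlike an RNN, which must walk the sequence step by step).
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ffn(x))

# Toy usage: a batch of 2 phoneme sequences of length 50, already embedded.
phoneme_states = torch.randn(2, 50, 256)
encoder_out = SelfAttentionEncoderBlock()(phoneme_states)
print(encoder_out.shape)  # torch.Size([2, 50, 256])
```

In **Tacotron2**, the corresponding hidden states come from a bidirectional LSTM that walks the phoneme sequence one step at a time, which is exactly the training-efficiency bottleneck this replacement targets.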

## 1·Introduction: 引言

<table><tr><td width="50%">

</td></tr></table>

## 2·Related Works: 相关工作

<table><tr><td width="50%">

</td></tr></table>

## 3·Methodology: 方法

![](Images/2018.09.19_TransformerTTS_Fig.03.jpg)

<table><tr><td width="50%">

</td></tr></table>

## 4·Experiments: 实验

<table><tr><td width="50%">

</td></tr></table>

## 5·Results: 结果

<table><tr><td width="50%">

</td></tr></table>

## 6·Conclusions: 结论

<table><tr><td width="50%">

We propose a neural TTS model based on **Tacotron2** and **Transformer**, and make some modifications to adapt **Transformer** to the neural TTS task.
Our model generates audio samples whose quality is very close to human recordings, and it enables parallel training and learning of long-distance dependencies, so that training is sped up and the audio prosody is much smoother.
We find that batch size is crucial for training stability, and that more layers can refine the detail of the generated mel spectrograms, especially in high-frequency regions, thus improving model performance.

</td><td>

</td></tr>
<tr><td>

Even though **Transformer** has enabled parallel training, autoregressive models still suffer from two problems: slow inference and exploration bias.
Slow inference is due to the dependency on previous frames when inferring the current frame, so that inference is sequential, while exploration bias comes from autoregressive error accumulation.
We may solve them both at once by building a non-autoregressive model, which is also our current research in progress.

</td><td>

</td></tr></table>
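
The slow-inference point above comes from the fact that each mel frame can only be predicted after all previous frames have been generated. The loop below is a rough, hypothetical sketch of that autoregressive decoding process; `decoder_step`, `memory`, and `stop_threshold` are assumed names for illustration, not the paper's API.

```python
import torch

@torch.no_grad()
def autoregressive_mel_inference(decoder_step, memory, n_mels=80, max_frames=1000, stop_threshold=0.5):
    """Generate mel frames one at a time; each step conditions on all previous frames.

    decoder_step(memory, prev_frames) -> (next_frame of shape [n_mels], stop_prob)
    is a hypothetical callable wrapping the decoder; `memory` is the encoder output.
    """
    frames = [torch.zeros(n_mels)]  # <GO> frame
    for _ in range(max_frames):
        next_frame, stop_prob = decoder_step(memory, torch.stack(frames))
        frames.append(next_frame)
        if stop_prob > stop_threshold:  # stop token predicted
            break
    # Frames are produced strictly sequentially, which is why inference is slow
    # even though training (with teacher forcing) runs in parallel.
    return torch.stack(frames[1:])
```

A non-autoregressive model, as the authors suggest, would predict all frames in a single pass, removing both the sequential loop and the accumulation of the model's own prediction errors.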
