Models/Acoustic/2018.09.19_TransformerTTS.md (+45 -26 lines)
@@ -23,9 +23,7 @@
## Abstract: 摘要
-<table>
-<tr>
-<td width="50%">
+<table><tr><td width="50%">
Although end-to-end neural text-to-speech (TTS) methods (such as **Tacotron2**) have been proposed and achieve state-of-the-art performance, they still suffer from two problems:
1. low efficiency during training and inference;
@@ -34,55 +32,76 @@ Although end-to-end neural text-to-speech (TTS) methods (such as **Tacotron2**)
Inspired by the success of the Transformer network in neural machine translation (NMT), in this paper we introduce and adapt the multi-head attention mechanism to replace both the RNN structures and the original attention mechanism in **Tacotron2**.
With the help of multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves training efficiency.
Meanwhile, any two inputs at different times are connected directly by the self-attention mechanism, which solves the long-range dependency problem effectively.
-Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, followed by a WaveNet vocoder to output the final audio results.
+Using phoneme sequences as input, our ***Transformer TTS*** network generates mel spectrograms, followed by a **WaveNet** vocoder to output the final audio results.
Experiments are conducted to test the efficiency and performance of our new network.
-For the efficiency, our Transformer TTS network can speed up the training about 4.25 times faster compared with **Tacotron2**.
+For efficiency, our ***Transformer TTS*** network speeds up training about 4.25 times compared with **Tacotron2**.
For performance, rigorous human tests show that our proposed model achieves state-of-the-art performance (outperforming **Tacotron2** by a 0.048 gap) and is very close to human quality (4.39 vs. 4.44 in MOS).
-</td>
-<td>
+</td><td>

-</td>
-</tr>
-</table>
+尽管端到端神经文本转语音方法 (如 **Tacotron2**) 被提出并达到 SoTA 性能, 但它们仍存在两个问题:
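To make the parallelism claim above concrete, here is a minimal sketch of the multi-head self-attention that replaces **Tacotron2**'s RNNs. This is not the paper's implementation; `d_model`, `num_heads`, and the class name are illustrative. The whole sequence is processed in batched matrix multiplies, and the time-by-time score matrix gives every pair of positions a direct connection, regardless of distance.

```python
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    """Sketch of the attention block that replaces Tacotron2's RNNs."""

    def __init__(self, d_model: int = 256, num_heads: int = 4):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused Q/K/V projections
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model), the full sequence at once, no recurrence.
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (batch, heads, time, d_k).
        q, k, v = (z.view(b, t, self.h, self.d_k).transpose(1, 2) for z in (q, k, v))
        # (time x time) scores: a direct path between ANY two positions,
        # which is what removes the RNN's long-range dependency bottleneck.
        scores = (q @ k.transpose(-2, -1)) / self.d_k ** 0.5
        ctx = torch.softmax(scores, dim=-1) @ v
        # Merge heads back to (batch, time, d_model).
        return self.out(ctx.transpose(1, 2).reshape(b, t, self.h * self.d_k))
```

Because nothing in `forward` depends on the previous time step's output, encoder and decoder hidden states can be computed for all positions simultaneously during training.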
-We propose a neural TTS model based on **Tacotron2** and Transformer, and make some modification to adapt Transformer to neural TTS task.
+We propose a neural TTS model based on **Tacotron2** and **Transformer**, and make some modifications to adapt **Transformer** to the neural TTS task.
Our model generates audio samples whose quality is very close to human recordings, and it enables parallel training and long-distance dependency modeling, so that training is sped up and the audio prosody is much smoother.
We find that batch size is crucial for training stability, and that more layers can refine the detail of generated mel spectrograms, especially in high-frequency regions, thus improving model performance.
-</td>
-<td>
+</td><td>

-</td>
-</tr>
-<tr>
-<td>
+</td></tr>
+<tr><td>
-Even thought Transformer has enabled parallel training, autoregressive model still suffers from two problems, which are slow inference and exploration bias.
+Even though **Transformer** has enabled parallel training, autoregressive models still suffer from two problems: slow inference and exploration bias.
Slow inference is due to the dependency on previous frames when inferring the current frame, so that inference is sequential, while exploration bias comes from autoregressive error accumulation.
We may solve both problems at once by building a non-autoregressive model, which is also our ongoing research.
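As a rough illustration of why inference stays sequential, below is a hedged sketch of an autoregressive decoding loop. `decoder_step`, its signature, and all shapes are hypothetical placeholders, not the paper's API. Each mel frame is conditioned on every previously generated frame, so the loop cannot be parallelized, and any error in one frame is fed back as input for the next, which is the accumulation behind exploration bias.

```python
import torch


@torch.no_grad()
def autoregressive_inference(decoder_step, text_memory, max_frames=1000, n_mels=80):
    """Generate mel frames one at a time; purely illustrative."""
    frames = [torch.zeros(1, 1, n_mels)]  # all-zero <GO> start frame
    for _ in range(max_frames):  # strictly sequential: frame t needs frames < t
        prev = torch.cat(frames, dim=1)  # everything generated so far
        frame, stop = decoder_step(text_memory, prev)  # hypothetical step fn
        frames.append(frame)  # errors here are re-fed as future inputs
        if stop:  # a predicted stop token ends generation
            break
    return torch.cat(frames[1:], dim=1)  # (1, T, n_mels) mel spectrogram
```

A non-autoregressive model would predict all frames in a single pass, removing both the sequential loop and the feedback path through which errors accumulate.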