Matcha-TTS

Basic Information
  • Title: "Matcha-TTS: A Fast TTS Architecture with Conditional Flow Matching"
  • Authors:
    • 01 Shivam Mehta
    • 02 Ruibo Tu
    • 03 Jonas Beskow
    • 04 Eva Szekely
    • 05 Gustav Eje Henter
  • Links:
  • Files:

Abstract

We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest model on long utterances, and attains the highest mean opinion score in a listening test. Please see https://shivammehta25.github.io/Matcha-TTS/ for audio examples, code, and pre-trained models.

1. Introduction

Diffusion probabilistic models (DPMs) (cf. NCSN) are currently setting new standards in deep generative modelling on continuous-valued data-generation tasks such as image synthesis (Guided_Diffusion; LDM), motion synthesis \cite{alexanderson2023listen,mehta2023diff}, and speech synthesis (WaveGrad; WaveGrad2; Grad-TTS; Diff-TTS; DiffWave) - the topic of this paper. DPMs define a diffusion process which transforms the data (a.k.a. target) distribution to a prior (a.k.a. source) distribution, e.g., a Gaussian. They then learn a sampling process that reverses the diffusion process. The two processes can be formulated as forward- and reverse-time stochastic differential equations (SDEs) \cite{song2021score}. Solving a reverse-time SDE initial value problem generates samples from the learnt data distribution. Furthermore, each reverse-time SDE has a corresponding ordinary differential equation (ODE), called the probability flow ODE \cite{song2021score,albergo2022building}, which describes (and samples from) the exact same distribution as the SDE. The probability flow ODE is a deterministic process for turning source samples into data samples, similar to continuous-time normalising flows (CNF) \cite{chen2018neural}, but without the need to backpropagate through expensive ODE solvers or approximate the reverse ODE using adjoint variables \cite{chen2018neural}.

The SDE formulation of DPMs is trained by approximating the score function (the gradients of the log probability density) of the data distribution \cite{song2021score}. The training objective takes the form of a mean squared error (MSE) which can be derived from an Evidence Lower BOund (ELBO) on the likelihood. This is fast and simple and, unlike typical normalizing flow models, does not impose any restrictions on model architecture. But whilst they allow efficient training without numerical SDE/ODE solvers, DPMs suffer from slow synthesis speed, since each sample requires numerous iterations (steps), computed in sequence, to accurately solve the SDE. Each such step requires that an entire neural network be evaluated. This slow synthesis speed has long been the main practical issue with DPMs.

This paper introduces Matcha-TTS, a probabilistic and non-autoregressive, fast-to-sample-from TTS acoustic model based on continuous normalizing flows. (We call our approach Matcha-TTS because it uses flow matching for TTS, and because the name sounds similar to "matcha tea", which some people prefer over Taco(tron)s.) There are two main innovations:

  • First, we propose an improved encoder-decoder TTS architecture that uses a combination of 1D CNNs and Transformers in the decoder. This reduces memory consumption and is fast to evaluate, improving synthesis speed.
  • Second, we train these models using optimal-transport conditional flow matching (OT-CFM) \cite{lipman2023flow}, which is a new method to learn ODEs that sample from a data distribution. Compared to conventional CNFs and score-matching probability flow ODEs, OT-CFM defines simpler paths from source to target, enabling accurate synthesis in fewer steps than DPMs.

Experimental results demonstrate that both innovations accelerate synthesis, reducing the trade-off between speed and synthesis quality. Despite being fast and lightweight, Matcha-TTS learns to speak and align without requiring an external aligner. Compared to strong pre-trained baseline models, Matcha-TTS achieves fast synthesis with better naturalness ratings. Audio examples and code are provided at https://shivammehta25.github.io/Matcha-TTS/.

2. Related Work

2.1. Recent Encoder-Decoder TTS Architectures

DPMs have been applied to numerous speech-synthesis tasks with impressive results, including waveform generation (WaveGrad; DiffWave) and end-to-end TTS (WaveGrad2). Diff-TTS was first to apply DPMs for acoustic modelling. Shortly after, Grad-TTS conceptualised the diffusion process as an SDE. Although these models, and descendants like Fast Grad-TTS, are non-autoregressive, TorToiSe-TTS demonstrated DPMs in an autoregressive TTS model with quantised latents.

The above models -- like many modern TTS acoustic models -- use an encoder-decoder architecture with Transformer blocks in the encoder. Many models, e.g., [FastSpeech](../Acoustic/2019.05.22_FastSpeech.md) and FastSpeech2, use sinusoidal position embeddings for positional dependencies. This has been found to generalise poorly to long sequences; cf. AliBi. Glow-TTS, VITS, and Grad-TTS instead use relative positional embeddings \cite{shaw2018self}. Unfortunately, these treat inputs outside a short context window as a "bag of words", often resulting in unnatural prosody. LinearSpeech instead employed the rotational position embeddings (RoPE) of RoFormer, which have computational and memory advantages over relative embeddings and generalise to longer distances (wennberg2021case; AliBi). Matcha-TTS thus uses Transformers with RoPE in the encoder, reducing RAM use compared to Grad-TTS. We believe ours is the first SDE or ODE-based TTS method to use RoPE.
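To make the position-embedding discussion concrete, here is a minimal PyTorch sketch of rotary position embeddings in the "rotate-half" style. The channel pairing, scaling, and function name are illustrative assumptions; RoFormer and the Matcha-TTS encoder may implement the rotation differently.

```python
import torch

def rotary_embed(x, base=10000.0):
    """Apply rotary position embeddings (RoPE) to a sequence of vectors.

    x: tensor of shape (seq_len, dim), dim must be even.
    Channel pairs are rotated by an angle proportional to the token
    position, so relative offsets appear as phase differences in the
    attention dot products.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair rotation frequencies, as in RoFormer.
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)      # (half,)
    angles = torch.arange(seq_len, dtype=x.dtype)[:, None] * freqs   # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Pair-wise 2D rotation ("rotate-half" variant).
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Rotate queries and keys before computing attention scores:
q, k = torch.randn(100, 64), torch.randn(100, 64)
scores = rotary_embed(q) @ rotary_embed(k).T
```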

Modern TTS architectures also differ in terms of decoder network design. The normalising-flow based methods Glow-TTS and OverFlow use dilated 1D-convolutions. DPM-based methods like Diff-TTS and ProDiff likewise use 1D convolutions to synthesise mel spectrograms. Grad-TTS, in contrast, uses a U-Net with 2D-convolutions. This treats mel spectrograms as images and implicitly assumes translation invariance in both time and frequency. However, speech mel-spectra are not fully translation-invariant along the frequency axis, and 2D decoders generally require more memory as they introduce an extra dimension to the tensors. Meanwhile, non-probabilistic models like FastSpeech 1 and 2 have demonstrated that decoders with (1D) Transformers can learn long-range dependencies and fast, parallel synthesis. Matcha-TTS also uses Transformers in the decoder, but in a 1D U-Net design inspired by the 2D U-Nets in the Stable Diffusion image-generation model (LDM).

Whilst some TTS systems, e.g., [FastSpeech](../Acoustic/2019.05.22_FastSpeech.md), rely on externally-supplied alignments, most systems are capable of learning to speak and align at the same time, although it has been found to be important to encourage or enforce monotonic alignments \cite{watts2019where,mehta2022neuralhmm} for fast and effective training. One mechanism for this is monotonic alignment search (MAS), used by, e.g., Glow-TTS and VITS. Grad-TTS, in particular, uses a MAS-based mechanism, which they term the prior loss, to quickly learn to align input symbols with output frames. These alignments are also used to train a deterministic duration predictor by minimizing the MSE in the log domain. Matcha-TTS uses these same methods for alignment and duration modelling. Finally, Matcha-TTS differs by using snake beta activations from BigVGAN in all decoder feedforward layers.
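As an illustration of the duration-modelling objective described above, here is a minimal sketch of a log-domain MSE duration loss, assuming MAS has already produced integer frame counts per phone. The masking, clamping, and function signature are assumptions for the sketch, not the repository's actual code.

```python
import torch

def duration_loss(log_dur_pred, durations_mas, text_lengths):
    """MSE between predicted log-durations and MAS-derived durations.

    log_dur_pred : (batch, max_text_len) predictor output in the log domain.
    durations_mas: (batch, max_text_len) integer frame counts from MAS.
    text_lengths : (batch,) number of valid phones per utterance.
    """
    # Mask out padded phone positions.
    positions = torch.arange(log_dur_pred.size(1), device=log_dur_pred.device)
    mask = (positions[None, :] < text_lengths[:, None]).float()
    target = torch.log(durations_mas.float().clamp(min=1))
    return ((log_dur_pred - target) ** 2 * mask).sum() / mask.sum()
```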

2.2. Flow Matching and TTS

Currently, some of the highest-quality TTS systems either utilize DPMs (Grad-TTS; TorToiSe-TTS) or discrete-time normalizing flows (VITS; OverFlow), with continuous-time flows being less explored. Lipman et al.\ \cite{lipman2023flow} recently introduced a framework for synthesis using ODEs that unifies and extends probability flow ODEs and CNFs. They were then able to present an efficient approach to learn ODEs for synthesis, using a simple vector-field regression loss called conditional flow matching (CFM), as an alternative to learning score functions for DPMs or using numerical ODE solvers at training time like classic CNFs \cite{chen2018neural}. Crucially, by leveraging ideas from optimal transport, CFM can be set up to yield ODEs that have simple vector fields that change little during the process of mapping samples from the source distribution onto the data distribution, since it essentially just transports probability mass along straight lines. This technique is called OT-CFM; rectified flows \cite{liu2023flow} represent concurrent work with a similar idea. The simple paths mean that the ODE can be solved accurately using few discretization steps, i.e., accurate model samples can be drawn with fewer neural-network evaluations than DPMs, enabling much faster synthesis for the same quality.

CFM is a new technique that differs from earlier approaches to speed up SDE/ODE-based TTS, which most often were based on distillation (e.g., ProDiff; Fast Grad-TTS; CoMoSpeech). Prior to Matcha-TTS, the only public preprint on CFM-based acoustic modelling was the Voicebox model from Meta. Voicebox (VB) is a system that performs various text-guided speech-infilling tasks based on large-scale training data, with its English variant (VB-En) being trained on 60k hours of proprietary data. VB differs substantially from Matcha-TTS: VB performs TTS, denoising, and text-guided acoustic infilling trained using a combination of masking and CFM, whereas Matcha-TTS is a pure TTS model trained solely using OT-CFM. VB uses convolutional positional encoding with AliBi self-attention bias, whilst our text encoder uses RoPE. In contrast to VB, we train on standard data and make code and checkpoints publicly available. VB-En has 330M parameters, 18 times more than the Matcha-TTS model in our experiments. Also, VB uses external alignments for training whereas Matcha-TTS learns to speak without them.

3. Methodology

We now outline flow-matching training (in Section 3.1) and then (in Section 3.2) give details on our proposed TTS architecture.

3.1. Optimal-Transport Conditional Flow Matching (OT-CFM)

We here give a high-level overview of flow matching, first introducing the probability-density path generated by a vector field and then leading into the OT-CFM objective used in our proposed method. Notation and definitions mainly follow \cite{lipman2023flow}.

Let $x$ denote an observation in the data space $\mathbb{R}^d$, sampled from a complicated, unknown data distribution $q(x)$. A probability density path is a time-dependent probability density function, $p_t: [0,1]\times \mathbb{R}^d \rightarrow \mathbb{R}_{>0}$. One way to generate samples from the data distribution $q$ is to construct a probability density path $p_t$, where $t \in [0,1]$ and $p_0(x) = \mathcal{N}(x; \boldsymbol{0},\boldsymbol{I})$ is a prior distribution, such that $p_1(x)$ approximates the data distribution $q(x)$. For example, CNFs first define a vector field $v_t: [0, 1] \times \mathbb{R}^d \rightarrow \mathbb{R}^d$, which generates the flow $\phi_t: [0, 1] \times \mathbb{R}^d \rightarrow \mathbb{R}^d$ through the ODE

$$ \dfrac{\text{d}}{\text{d}t}\phi_t(x) = v_t(\phi_t(x)) \qquad\phi_0(x)=x $$

This generates the path $p_t$ as the marginal probability distribution of the data points. We can sample from the approximated data distribution $p_1$ by solving the initial value problem defined by the ODE above.
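For intuition, here is a minimal sketch of drawing a sample by solving this initial value problem with a fixed-step Euler solver. The solver choice, step count, and the v_net(x, t, cond) signature are illustrative assumptions rather than the paper's implementation.

```python
import torch

@torch.no_grad()
def sample_ode(v_net, x0, cond, n_steps=10):
    """Draw a sample by integrating dx/dt = v(x, t | cond) from t=0 to t=1.

    v_net  : neural vector field taking (x, t, cond) and returning dx/dt.
    x0     : sample from the prior, e.g. standard Gaussian noise.
    n_steps: number of Euler steps; fewer steps mean faster synthesis.
    """
    x = x0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.size(0),), i * dt, device=x.device)
        x = x + dt * v_net(x, t, cond)   # explicit Euler update
    return x
```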

Suppose there exists a known vector field $u_t$ that generates a probability path $p_t$ from $p_0$ to $p_1\approx q$. The flow matching loss is $$ \mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t, p_t(x)}\Vert u_t(x) - v_t(x; \theta) \Vert^2 $$

where $t\sim\mathbb{U}[0,1]$ and $v_t(x; \theta)$ is a neural network with parameters $\theta$. Nevertheless, flow matching is intractable in practice because it is non-trivial to get access to the vector field $u_t$ and the target probability $p_t$. Therefore, conditional flow matching instead considers $$ \mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t, q(x_1), p_t(x\vert x_1)}\Vert u_t(x\vert x_1) - v_t(x; \theta) \Vert^2 $$

This replaces the intractable marginal probability densities and the vector field with conditional probability densities and conditional vector fields. Crucially, these are in general tractable and have closed-form solutions, and one can furthermore show that $\mathcal{L}_{\mathrm{CFM}}(\theta)$ and $\mathcal{L}_{\mathrm{FM}}(\theta)$ have identical gradients with respect to $\theta$ \cite{lipman2023flow}.
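As a concrete example of such a tractable choice (this follows the Gaussian conditional paths of \cite{lipman2023flow}; it is not spelled out in the summary above): taking $p_t(x\vert x_1) = \mathcal{N}(x; \mu_t(x_1), \sigma_t(x_1)^2\boldsymbol{I})$ yields the closed-form conditional vector field

$$ u_t(x\vert x_1) = \dfrac{\sigma_t'(x_1)}{\sigma_t(x_1)}\big(x - \mu_t(x_1)\big) + \mu_t'(x_1) $$

so the regression target in $\mathcal{L}_{\mathrm{CFM}}$ can be evaluated exactly for every sampled $(t, x_1, x)$.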

Matcha-TTS is trained using optimal-transport conditional flow matching (OT-CFM) \cite{lipman2023flow}, which is a CFM variant with particularly simple gradients. The OT-CFM loss function can be written $$ \mathcal{L}(\theta) = \mathbb{E}_{t, q(x_1), p_0(x_0)} \Vert u^{\mathrm{OT}}_t(\phi^{\mathrm{OT}}_t(x_0)\vert x_1) - v_t(\phi^{\mathrm{OT}}_t(x_0) \vert \boldsymbol{\mu}; \theta) \Vert^2 $$

defining $\phi^{\mathrm{OT}}_t(x_0) = (1 - (1-\sigma_{\mathrm{min}})t)x_0 + t x_1$ as the flow from $x_0$ to $x_1$, where each datum $x_1$ is matched to a random sample $x_0\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I})$ as in \cite{lipman2023flow}. Its gradient vector field -- whose expected value is the target for the learning -- is then $u^{\mathrm{OT}}_t(\phi^{\mathrm{OT}}_t(x_0)\vert x_1) = x_1 - (1-\sigma_{\mathrm{min}})x_0$, which is linear, time-invariant, and only depends on $x_0$ and $x_1$. These properties enable easier and faster training, faster generation, and better performance compared to DPMs.

In the case of Matcha-TTS, $x_1$ are acoustic frames and $\boldsymbol{\mu}$ are the conditional mean values of those frames, predicted from text using the architecture described in the next section. $\sigma_{\mathrm{min}}$ is a hyperparameter with a small value (1e-4 in our experiments).
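Putting the pieces together, here is a minimal sketch of one OT-CFM training step. The tensor shapes (batch, n_mels, frames) and the v_net(x, t, mu) signature are assumptions for illustration, not the repository's exact interface.

```python
import torch

def ot_cfm_loss(v_net, x1, mu, sigma_min=1e-4):
    """One OT-CFM training objective evaluation (sketch).

    x1 : (batch, n_mels, frames) target acoustic frames.
    mu : conditioning tensor of the same shape, the text-predicted means.
    """
    b = x1.size(0)
    t = torch.rand(b, device=x1.device).view(b, 1, 1)   # t ~ U[0, 1]
    x0 = torch.randn_like(x1)                            # x0 ~ N(0, I)
    # Straight-line (optimal-transport) conditional flow from x0 to x1.
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1
    # Linear, time-invariant target vector field.
    u = x1 - (1 - sigma_min) * x0
    v = v_net(xt, t.view(b), mu)                         # predicted vector field
    return torch.mean((v - u) ** 2)
```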

3.2. Proposed Architecture

Matcha-TTS is a non-autoregressive encoder-decoder architecture for neural TTS. An overview of the architecture is provided in the paper. The text encoder and duration predictor architectures follow (Glow-TTS; Grad-TTS), but use the rotational position embeddings (RoPE) of RoFormer instead of relative ones. Alignment and duration-model training use MAS and the prior loss $\mathcal{L}_{\mathrm{enc}}$ as described in Grad-TTS. The predicted durations, rounded up, are used to upsample (duplicate) the vectors output by the encoder to obtain $\boldsymbol{\mu}$, the predicted average acoustic features (e.g., mel-spectrogram) given the text and the chosen durations. This mean is used to condition the decoder that predicts the vector field $v_t(\phi^{\mathrm{OT}}_t(x_0) \vert \boldsymbol{\mu}; \theta)$ used for synthesis, but is not used as the mean for the initial noise samples $x_0$ (unlike Grad-TTS).
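The duration-based upsampling can be sketched as simple duplication of encoder output vectors. The helper name and the use of repeat_interleave are illustrative assumptions, not the repository's exact code.

```python
import torch

def upsample_by_durations(enc_out, durations):
    """Duplicate each encoder vector according to its (rounded-up) duration.

    enc_out  : (text_len, channels) encoder outputs for one utterance.
    durations: (text_len,) integer frame counts per phone.
    Returns mu with shape (sum(durations), channels).
    """
    return torch.repeat_interleave(enc_out, durations, dim=0)

# mu conditions the decoder's vector field; the initial noise x0 is still
# drawn from a standard Gaussian rather than from N(mu, I) as in Grad-TTS.
```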

The Matcha-TTS decoder architecture, illustrated in the paper and inspired by LDM, is a U-Net containing 1D convolutional residual blocks to downsample and upsample the inputs, with the flow-matching step $t\in[0,1]$ embedded as in Grad-TTS. Each residual block is followed by a Transformer block, whose feed-forward nets use snake beta activations (BigVGAN). These Transformers do not use any position embeddings, since between-phone positional information has already been baked in by the encoder, and the convolution and downsampling operations interpolate these between frames within the same phone and distinguish their relative positions from each other. This decoder network is significantly faster to evaluate and consumes less memory than the 2D convolution-only U-Net used by Grad-TTS.
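A sketch of the snake-beta activation used in the decoder's feed-forward nets, assuming a BigVGAN-style parameterization with per-channel log-scale parameters; initialization and numerical-stability details may differ in the actual implementation.

```python
import torch
import torch.nn as nn

class SnakeBeta(nn.Module):
    """Snake-beta activation (sketch): x + (1/beta) * sin^2(alpha * x).

    alpha controls the frequency and beta the magnitude of the periodic
    term; both are learned per channel, following BigVGAN.
    """
    def __init__(self, channels, eps=1e-8):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(channels))  # log-scale params
        self.log_beta = nn.Parameter(torch.zeros(channels))
        self.eps = eps

    def forward(self, x):                      # x: (batch, frames, channels)
        alpha = torch.exp(self.log_alpha)
        beta = torch.exp(self.log_beta)
        return x + (1.0 / (beta + self.eps)) * torch.sin(alpha * x) ** 2
```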

4. Experiments

5. Results

6. Conclusions

We have introduced Matcha-TTS, a fast, probabilistic, and high-quality ODE-based TTS acoustic model trained using conditional flow matching. The approach is non-autoregressive, memory efficient, and jointly learns to speak and align. Compared to three strong pre-trained baselines, Matcha-TTS provides superior speech naturalness and can match the speed of the fastest model on long utterances. Our experiments show that both the new architecture and the new training contribute to these improvements.

Compelling future work includes making the model multi-speaker, adding probabilistic duration modelling, and applications to challenging, diverse data such as spontaneous speech \cite{szekely2019spontaneous}.