## 3·Methodology: 方法
### Problem Formulation: Token-wise Flow Matching for AR Model <br> 问题形式化: 自回归模型的 Token 级流匹配
<table><tr><td width="50%">
Following **MELLE**'s autoregressive language modeling framework for mel-spectrogram prediction, we reformulate zero-shot TTS through a hierarchical flow-matching mechanism at each prediction step.
Each mel-spectrogram frame $\bm{x}^i \in \mathbb{R}^D$ (where $D$ denotes the mel-band dimension) is treated as a continuous token, generated sequentially through an autoregressive process.
Given an input text sequence $\bm{y} = [y^0, \ldots, y^{N-1}]$, speech prompt $\bm{\widehat{x}}$, and previously generated tokens $\bm{x}^{<i} = [\bm{x}^0, \ldots, \bm{x}^{i-1}]$, the model predicts the current token $\bm{x}^i$ by integrating language model guidance into the flow-matching paradigm.
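The per-step generation loop described above can be sketched in a minimal, toy form: an autoregressive model summarizes the text and the previously generated continuous tokens into a conditioning vector, and a flow-matching head integrates an ODE from Gaussian noise at $t=0$ to the next mel frame at $t=1$. Everything here (`lm_condition`, `velocity`, the frame count, dimensions) is a hypothetical placeholder for illustration, not FELLE's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8       # mel-band dimension (toy size; real systems use e.g. 80 or 100)
STEPS = 16  # Euler integration steps for the flow ODE

def lm_condition(prev_tokens, text_emb):
    """Toy stand-in for the AR language model: fuses the text embedding
    and previously generated mel frames x^{<i} into a conditioning
    vector. A real model would be a Transformer decoder."""
    history = np.mean(prev_tokens, axis=0) if prev_tokens else np.zeros(D)
    return 0.5 * (text_emb + history)

def velocity(x_t, t, cond):
    """Toy velocity field v(x_t, t | cond). Under the linear path
    x_t = (1 - t) x_0 + t x_1 the regression target is x_1 - x_0; here
    we pretend the network has learned to point toward cond."""
    return cond - x_t

def sample_frame(cond):
    """Generate one continuous token x^i by Euler-integrating the ODE
    dx/dt = v(x, t | cond) from noise (t = 0) to data (t = 1)."""
    x = rng.standard_normal(D)  # x_0 ~ N(0, I)
    dt = 1.0 / STEPS
    for k in range(STEPS):
        x = x + dt * velocity(x, k * dt, cond)
    return x

# Autoregressive loop: each token is sampled conditioned on y and x^{<i}.
text_emb = rng.standard_normal(D)
frames = []
for i in range(4):
    cond = lm_condition(frames, text_emb)
    frames.append(sample_frame(cond))

mel = np.stack(frames)  # (num_frames, D) toy mel-spectrogram
print(mel.shape)        # → (4, 8)
```

The sketch makes the two-level structure explicit: the outer loop is ordinary AR decoding over continuous tokens, while the inner ODE integration replaces the usual categorical softmax over a discrete codebook.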
The joint distribution is decomposed autoregressively as: