
Commit 8b2f66f

Commit message: Update the FELLE method (更新 FELLE 方法)
1 parent d45cb4c · commit 8b2f66f

File tree: 1 file changed (+28, -7 lines)

Models/SpeechLM/ST2S/2025.02.16_FELLE.md

Lines changed: 28 additions & 7 deletions
@@ -171,19 +171,21 @@ where time $t$ is uniformly sampled from $\mathcal{U}[0,1]$, data points $x_1$ a

## 3·Methodology: 方法

- ### Problem Formulation: Token-wise Flow Matching for AR Model
+ ### Problem Formulation: Token-wise Flow Matching for AR Model <br> 问题形式化: 自回归模型的 Token 级流匹配

<table><tr><td width="50%">

Following **MELLE**'s autoregressive language modeling framework for mel-spectrogram prediction, we reformulate zero-shot TTS through a hierarchical flow-matching mechanism at each prediction step.
Each mel-spectrogram frame $\bm{x}^i \in \mathbb{R}^D$ (where $D$ denotes the mel-band dimension) is treated as a continuous token, generated sequentially through an autoregressive process.
Given an input text sequence $\bm{y} = [y^0, \ldots, y^{N-1}]$, speech prompt $\bm{\widehat{x}}$, and previously generated tokens $\bm{x}^{<i} = [\bm{x}^0, \ldots, \bm{x}^{i-1}]$, the model predicts the current token $\bm{x}^i$ by integrating language model guidance into the flow-matching paradigm.
+
The joint distribution is decomposed autoregressively as:

$$
\begin{aligned}
- p(\bm{X} \! \mid\!\bm{y})\!
- &= \prod_{i=0}^{L-1} p(\bm{x}^i \mid \bm{x}^{<i}, \bm{y}, \bm{\widehat{x}}) \\
- &=\! \prod_{i=0}^{L-1} p_{\theta_\text{FM}}(\bm{x}^i \mid \bm{z}^i), \bm{z}^i\!=\!f_{\theta_\text{LM}}(\bm{x}^{<i}, \bm{y},\bm{\widehat{x}} \notag) .
+ p(\bm{X} \! \mid\!\bm{y})\!
+ &= \prod_{i=0}^{L-1} p(\bm{x}^i \mid \bm{x}^{<i}, \bm{y}, \bm{\widehat{x}}) \\
+ &= \prod_{i=0}^{L-1} p_{\theta_\text{FM}}(\bm{x}^i \mid \bm{z}^i),\\
+ \bm{z}^i&=f_{\theta_\text{LM}}(\bm{x}^{<i}, \bm{y},\bm{\widehat{x}}).
\end{aligned}
$$
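To make the factorization in the updated equation concrete, here is a minimal sketch (assuming PyTorch; this is not FELLE's released code) of the token-wise loop it implies: the language model $f_{\theta_\text{LM}}$ produces a hidden state $\bm{z}^i$ from the text, the speech prompt, and previously generated frames, and a flow-matching sampler draws $\bm{x}^i \sim p_{\theta_\text{FM}}(\cdot \mid \bm{z}^i)$ by integrating a learned velocity field from $t=0$ to $t=1$. The class and function names (`VelocityField`, `sample_frame`, `generate`, `DummyLM`) and all sizes are illustrative assumptions.

```python
# Illustrative sketch of token-wise AR generation with per-token flow matching.
# Names, shapes, and the Euler integrator are assumptions, not FELLE's implementation.
import torch
import torch.nn as nn


class VelocityField(nn.Module):
    """Predicts dx/dt for one mel frame, conditioned on time t and the LM state z."""

    def __init__(self, mel_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, mel_dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x_t, t, z], dim=-1))


@torch.no_grad()
def sample_frame(v_field: VelocityField, z_i: torch.Tensor, mel_dim: int, steps: int = 8) -> torch.Tensor:
    """Draw x^i ~ p_FM(. | z^i): Euler-integrate dx/dt = v(x, t, z^i) from t=0 to t=1."""
    x = torch.randn(z_i.shape[0], mel_dim)            # prior sample at t = 0
    dt = 1.0 / steps
    for k in range(steps):
        t = torch.full((z_i.shape[0], 1), k * dt)
        x = x + dt * v_field(x, t, z_i)               # one Euler step along the flow
    return x


def generate(lm: nn.Module, v_field: VelocityField, text: torch.Tensor, prompt: torch.Tensor,
             num_frames: int, mel_dim: int) -> torch.Tensor:
    """Token-wise loop: z^i = f_LM(x^{<i}, y, x_hat), then x^i ~ p_FM(. | z^i)."""
    frames: list[torch.Tensor] = []
    for _ in range(num_frames):
        prev = torch.stack(frames, dim=1) if frames else torch.zeros(text.shape[0], 0, mel_dim)
        z_i = lm(text, prompt, prev)                   # hidden state for the current step
        frames.append(sample_frame(v_field, z_i, mel_dim))
    return torch.stack(frames, dim=1)                  # (batch, L, mel_dim)


if __name__ == "__main__":
    mel_dim, cond_dim = 80, 32

    class DummyLM(nn.Module):
        """Stand-in for f_LM: mixes text/prompt encodings with a summary of previous frames."""

        def __init__(self):
            super().__init__()
            self.proj = nn.Linear(2 * cond_dim + mel_dim, cond_dim)

        def forward(self, text, prompt, prev):
            prev_summary = prev.mean(dim=1) if prev.shape[1] > 0 else torch.zeros(text.shape[0], mel_dim)
            return self.proj(torch.cat([text, prompt, prev_summary], dim=-1))

    text = torch.randn(2, cond_dim)    # pretend text encoding
    prompt = torch.randn(2, cond_dim)  # pretend speech-prompt encoding
    mel = generate(DummyLM(), VelocityField(mel_dim, cond_dim), text, prompt, num_frames=5, mel_dim=mel_dim)
    print(mel.shape)                   # torch.Size([2, 5, 80])
```

The Euler integrator and single-hidden-layer velocity net are the simplest possible stand-ins; FELLE's coarse-to-fine flow-matching module (see the heading further down) would take the place of `sample_frame`.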

@@ -192,9 +194,28 @@ The language model $f_{\theta_\text{LM}}(\cdot)$ generates hidden state $\bm{z}^

</td><td>

+ 遵循 **MELLE** 用于梅尔频谱预测的自回归语言建模框架, 我们通过在每个预测步采用分层流匹配机制重新构建零样本文本转语音.
+ 每个梅尔频谱帧 $\bm{x}^i \in \mathbb{R}^D$ ($D$ 表示梅尔频带维度) 视为一个连续的 Token, 通过自回归过程顺序生成.
+ 给定输入文本序列 $\bm{y} = [y^0, \ldots, y^{N-1}]$, 语音提示 $\bm{\widehat{x}}$, 以及之前生成的 Token $\bm{x}^{<i} = [\bm{x}^0, \ldots, \bm{x}^{i-1}]$, 模型通过将语言模型引导融合到流匹配范式中以预测当前 Token $\bm{x}^i$.
+
+ 联合分布被自回归地分解为:
+
+ $$
+ \begin{aligned}
+ p(\bm{X} \! \mid\!\bm{y})\!
+ &= \prod_{i=0}^{L-1} p(\bm{x}^i \mid \bm{x}^{<i}, \bm{y}, \bm{\widehat{x}}) \\
+ &= \prod_{i=0}^{L-1} p_{\theta_\text{FM}}(\bm{x}^i \mid \bm{z}^i),\\
+ \bm{z}^i&=f_{\theta_\text{LM}}(\bm{x}^{<i}, \bm{y},\bm{\widehat{x}}).
+ \end{aligned}
+ $$
+
+ - $\bm{X} = [\bm{x}^0, \ldots, \bm{x}^{L-1}] \in \mathbb{R}^{L \times D}$ 表示完整的梅尔频谱序列, $L$ 表示梅尔频谱帧总数.
+ - 语言模型 $f_{\theta_\text{LM}}(\cdot)$ 生成隐藏状态 $\bm{z}^i$, 它捕获语言内容和声学上下文,
+ - $p_{\theta_\text{FM}}(\cdot \mid \bm{z}^i)$ 表示流匹配模块, 它将先验分布转换为以 $\bm{z}^i$ 为条件的条件分布.

</td></tr></table>

- ### Architecture
+ ### Architecture: 架构

<table><tr><td width="50%">
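On the training side, the flow-matching factor $p_{\theta_\text{FM}}(\bm{x}^i \mid \bm{z}^i)$ is typically learned with a conditional flow-matching regression loss. The sketch below shows a generic per-token training step under the common linear-interpolation path $x_t = (1-t)\,x_0 + t\,x_1$ with target velocity $x_1 - x_0$ and $t \sim \mathcal{U}[0,1]$ (matching the hunk context above); `v_theta`, the tensor shapes, and the optimizer settings are assumptions made for illustration, not FELLE's actual objective.

```python
# Generic conditional flow-matching training step for one token x^i given LM state z^i.
# The interpolation path, network, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

mel_dim, cond_dim, batch = 80, 32, 16

# velocity predictor v_theta(x_t, t, z), playing the role of the flow-matching module
v_theta = nn.Sequential(nn.Linear(mel_dim + cond_dim + 1, 256), nn.SiLU(), nn.Linear(256, mel_dim))
optimizer = torch.optim.AdamW(v_theta.parameters(), lr=1e-4)

# stand-ins for one training batch: target mel frames x_1 and LM hidden states z^i
x_1 = torch.randn(batch, mel_dim)   # ground-truth mel frame x^i
z = torch.randn(batch, cond_dim)    # z^i = f_LM(x^{<i}, y, x_hat), precomputed elsewhere

t = torch.rand(batch, 1)            # t ~ U[0, 1]
x_0 = torch.randn(batch, mel_dim)   # sample from the prior
x_t = (1 - t) * x_0 + t * x_1       # point on the straight-line path from x_0 to x_1
target_v = x_1 - x_0                # constant velocity of that path

pred_v = v_theta(torch.cat([x_t, t, z], dim=-1))
loss = torch.mean((pred_v - target_v) ** 2)

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"per-token flow-matching loss: {loss.item():.4f}")
```

At inference, the same velocity network is what the sampler in the earlier sketch integrates; any additional spectrogram losses or coarse-to-fine conditioning used in the paper are outside this excerpt.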

@@ -213,7 +234,7 @@ At each timestep, the framework relies on the previous mel-spectrogram distribut

</td></tr></table>

- ### Autoregressive Language Model
+ ### Autoregressive Language Model: 自回归语言模型

<table><tr><td width="50%">

@@ -226,7 +247,7 @@ The output at each time step subsequently serves as a conditioning input for the

</td></tr></table>

- ### Coarse-to-Fine Flow Matching
+ ### Coarse-to-Fine Flow Matching: 从粗到细的流匹配

<table><tr><td width="50%">
