
Commit 8b2f66f

Commit message: Update the FELLE method (更新 FELLE 方法)
1 parent d45cb4c · commit 8b2f66f

File tree: 1 file changed (+28, -7 lines)

Models/SpeechLM/ST2S/2025.02.16_FELLE.md

Lines changed: 28 additions & 7 deletions
@@ -171,19 +171,21 @@ where time $t$ is uniformly sampled from $\mathcal{U}[0,1]$, data points $x_1$ a

## 3·Methodology: 方法

- ### Problem Formulation: Token-wise Flow Matching for AR Model
+ ### Problem Formulation: Token-wise Flow Matching for AR Model <br> 问题形式化: 自回归模型的 Token 级流匹配

<table><tr><td width="50%">

Following **MELLE**'s autoregressive language modeling framework for mel-spectrogram prediction, we reformulate zero-shot TTS through a hierarchical flow-matching mechanism at each prediction step.
Each mel-spectrogram frame $\bm{x}^i \in \mathbb{R}^D$ (where $D$ denotes the mel-band dimension) is treated as a continuous token, generated sequentially through an autoregressive process.
Given an input text sequence $\bm{y} = [y^0, \ldots, y^{N-1}]$, speech prompt $\bm{\widehat{x}}$, and previously generated tokens $\bm{x}^{<i} = [\bm{x}^0, \ldots, \bm{x}^{i-1}]$, the model predicts the current token $\bm{x}^i$ by integrating language model guidance into the flow-matching paradigm.
+
The joint distribution is decomposed autoregressively as:

$$
\begin{aligned}
- p(\bm{X} \! \mid\!\bm{y})\!
- &= \prod_{i=0}^{L-1} p(\bm{x}^i \mid \bm{x}^{<i}, \bm{y}, \bm{\widehat{x}}) \\
- &=\! \prod_{i=0}^{L-1} p_{\theta_\text{FM}}(\bm{x}^i \mid \bm{z}^i), \bm{z}^i\!=\!f_{\theta_\text{LM}}(\bm{x}^{<i}, \bm{y},\bm{\widehat{x}} \notag) .
+ p(\bm{X} \! \mid\!\bm{y})\!
+ &= \prod_{i=0}^{L-1} p(\bm{x}^i \mid \bm{x}^{<i}, \bm{y}, \bm{\widehat{x}}) \\
+ &= \prod_{i=0}^{L-1} p_{\theta_\text{FM}}(\bm{x}^i \mid \bm{z}^i),\\
+ \bm{z}^i&=f_{\theta_\text{LM}}(\bm{x}^{<i}, \bm{y},\bm{\widehat{x}}).
\end{aligned}
$$
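To make the factorization in the updated equation concrete, here is a minimal sketch (assuming PyTorch; this is not FELLE's released code) of the token-wise loop it implies: the language model $f_{\theta_\text{LM}}$ produces a hidden state $\bm{z}^i$ from the text, the speech prompt, and previously generated frames, and a flow-matching sampler draws $\bm{x}^i \sim p_{\theta_\text{FM}}(\cdot \mid \bm{z}^i)$ by integrating a learned velocity field from $t=0$ to $t=1$. The class and function names (`VelocityField`, `sample_frame`, `generate`, `DummyLM`) and all sizes are illustrative assumptions.

```python
# Illustrative sketch of token-wise AR generation with per-token flow matching.
# Names, shapes, and the Euler integrator are assumptions, not FELLE's implementation.
import torch
import torch.nn as nn


class VelocityField(nn.Module):
    """Predicts dx/dt for one mel frame, conditioned on time t and the LM state z."""

    def __init__(self, mel_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, mel_dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x_t, t, z], dim=-1))


@torch.no_grad()
def sample_frame(v_field: VelocityField, z_i: torch.Tensor, mel_dim: int, steps: int = 8) -> torch.Tensor:
    """Draw x^i ~ p_FM(. | z^i): Euler-integrate dx/dt = v(x, t, z^i) from t=0 to t=1."""
    x = torch.randn(z_i.shape[0], mel_dim)            # prior sample at t = 0
    dt = 1.0 / steps
    for k in range(steps):
        t = torch.full((z_i.shape[0], 1), k * dt)
        x = x + dt * v_field(x, t, z_i)               # one Euler step along the flow
    return x


def generate(lm: nn.Module, v_field: VelocityField, text: torch.Tensor, prompt: torch.Tensor,
             num_frames: int, mel_dim: int) -> torch.Tensor:
    """Token-wise loop: z^i = f_LM(x^{<i}, y, x_hat), then x^i ~ p_FM(. | z^i)."""
    frames: list[torch.Tensor] = []
    for _ in range(num_frames):
        prev = torch.stack(frames, dim=1) if frames else torch.zeros(text.shape[0], 0, mel_dim)
        z_i = lm(text, prompt, prev)                   # hidden state for the current step
        frames.append(sample_frame(v_field, z_i, mel_dim))
    return torch.stack(frames, dim=1)                  # (batch, L, mel_dim)


if __name__ == "__main__":
    mel_dim, cond_dim = 80, 32

    class DummyLM(nn.Module):
        """Stand-in for f_LM: mixes text/prompt encodings with a summary of previous frames."""

        def __init__(self):
            super().__init__()
            self.proj = nn.Linear(2 * cond_dim + mel_dim, cond_dim)

        def forward(self, text, prompt, prev):
            prev_summary = prev.mean(dim=1) if prev.shape[1] > 0 else torch.zeros(text.shape[0], mel_dim)
            return self.proj(torch.cat([text, prompt, prev_summary], dim=-1))

    text = torch.randn(2, cond_dim)    # pretend text encoding
    prompt = torch.randn(2, cond_dim)  # pretend speech-prompt encoding
    mel = generate(DummyLM(), VelocityField(mel_dim, cond_dim), text, prompt, num_frames=5, mel_dim=mel_dim)
    print(mel.shape)                   # torch.Size([2, 5, 80])
```

The Euler integrator and single-hidden-layer velocity net are the simplest possible stand-ins; FELLE's coarse-to-fine flow-matching module (see the heading further down) would take the place of `sample_frame`.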

@@ -192,9 +194,28 @@ The language model $f_{\theta_\text{LM}}(\cdot)$ generates hidden state $\bm{z}^

</td><td>

+ 遵循 **MELLE** 用于梅尔频谱预测的自回归语言建模框架, 我们通过在每个预测步采用分层流匹配机制重新构建零样本文本转语音.
+ 每个梅尔频谱帧 $\bm{x}^i \in \mathbb{R}^D$ ($D$ 表示梅尔频带维度) 视为一个连续的 Token, 通过自回归过程顺序生成.
+ 给定输入文本序列 $\bm{y} = [y^0, \ldots, y^{N-1}]$, 语音提示 $\bm{\widehat{x}}$, 以及之前生成的 Token $\bm{x}^{<i} = [\bm{x}^0, \ldots, \bm{x}^{i-1}]$, 模型通过将语言模型引导融合到流匹配范式中以预测当前 Token $\bm{x}^i$.
+
+ 联合分布被自回归地分解为:
+
+ $$
+ \begin{aligned}
+ p(\bm{X} \! \mid\!\bm{y})\!
+ &= \prod_{i=0}^{L-1} p(\bm{x}^i \mid \bm{x}^{<i}, \bm{y}, \bm{\widehat{x}}) \\
+ &= \prod_{i=0}^{L-1} p_{\theta_\text{FM}}(\bm{x}^i \mid \bm{z}^i),\\
+ \bm{z}^i&=f_{\theta_\text{LM}}(\bm{x}^{<i}, \bm{y},\bm{\widehat{x}}).
+ \end{aligned}
+ $$
+
+ - $\bm{X} = [\bm{x}^0, \ldots, \bm{x}^{L-1}] \in \mathbb{R}^{L \times D}$ 表示完整的梅尔频谱序列, $L$ 表示梅尔频谱帧总数.
+ - 语言模型 $f_{\theta_\text{LM}}(\cdot)$ 生成隐藏状态 $\bm{z}^i$, 它捕获语言内容和声学上下文,
+ - $p_{\theta_\text{FM}}(\cdot \mid \bm{z}^i)$ 表示流匹配模块, 它将先验分布转换为以 $\bm{z}^i$ 为条件的条件分布.

</td></tr></table>

- ### Architecture
+ ### Architecture: 架构

<table><tr><td width="50%">
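On the training side, the flow-matching factor $p_{\theta_\text{FM}}(\bm{x}^i \mid \bm{z}^i)$ is typically learned with a conditional flow-matching regression loss. The sketch below shows a generic per-token training step under the common linear-interpolation path $x_t = (1-t)\,x_0 + t\,x_1$ with target velocity $x_1 - x_0$ and $t \sim \mathcal{U}[0,1]$ (matching the hunk context above); `v_theta`, the tensor shapes, and the optimizer settings are assumptions made for illustration, not FELLE's actual objective.

```python
# Generic conditional flow-matching training step for one token x^i given LM state z^i.
# The interpolation path, network, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

mel_dim, cond_dim, batch = 80, 32, 16

# velocity predictor v_theta(x_t, t, z), playing the role of the flow-matching module
v_theta = nn.Sequential(nn.Linear(mel_dim + cond_dim + 1, 256), nn.SiLU(), nn.Linear(256, mel_dim))
optimizer = torch.optim.AdamW(v_theta.parameters(), lr=1e-4)

# stand-ins for one training batch: target mel frames x_1 and LM hidden states z^i
x_1 = torch.randn(batch, mel_dim)   # ground-truth mel frame x^i
z = torch.randn(batch, cond_dim)    # z^i = f_LM(x^{<i}, y, x_hat), precomputed elsewhere

t = torch.rand(batch, 1)            # t ~ U[0, 1]
x_0 = torch.randn(batch, mel_dim)   # sample from the prior
x_t = (1 - t) * x_0 + t * x_1       # point on the straight-line path from x_0 to x_1
target_v = x_1 - x_0                # constant velocity of that path

pred_v = v_theta(torch.cat([x_t, t, z], dim=-1))
loss = torch.mean((pred_v - target_v) ** 2)

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"per-token flow-matching loss: {loss.item():.4f}")
```

At inference, the same velocity network is what the sampler in the earlier sketch integrates; any additional spectrogram losses or coarse-to-fine conditioning used in the paper are outside this excerpt.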

@@ -213,7 +234,7 @@ At each timestep, the framework relies on the previous mel-spectrogram distribut

</td></tr></table>

- ### Autoregressive Language Model
+ ### Autoregressive Language Model: 自回归语言模型

<table><tr><td width="50%">

@@ -226,7 +247,7 @@ The output at each time step subsequently serves as a conditioning input for the

</td></tr></table>

- ### Coarse-to-Fine Flow Matching
+ ### Coarse-to-Fine Flow Matching: 从粗到细的流匹配

<table><tr><td width="50%">
