Surveys/2024.02.20__Survey__Towards_Audio_Language_Modeling_(5P).md
14 additions & 14 deletions
@@ -27,7 +27,7 @@
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 Neural audio codecs were initially introduced to compress audio data into compact codes to reduce transmission latency.
 Researchers recently discovered the potential of codecs as suitable tokenizers for converting continuous audio into discrete codes, which can be employed to develop audio language models (LMs).
@@ -50,7 +50,7 @@ The paper aims to provide a thorough and systematic overview of the neural audio
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 Neural audio codec models were first introduced to compress audio for efficient data transmission.
 The encoder converts the audio into codec codes, which are then transmitted.
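The round trip described above (compress to discrete codes, transmit, reconstruct) can be sketched in a few lines of PyTorch. Everything here, the module name, layer sizes, and the single-codebook quantizer, is an illustrative stand-in rather than any surveyed model's actual architecture:

```python
import torch
import torch.nn as nn

class TinyCodec(nn.Module):
    """Minimal encoder -> quantizer -> decoder round trip (illustrative only)."""

    def __init__(self, hidden=64, codebook_size=1024):
        super().__init__()
        # Strided 1-D convolutions downsample the waveform into latent frames.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=7, stride=2, padding=3), nn.ELU(),
            nn.Conv1d(hidden, hidden, kernel_size=7, stride=2, padding=3),
        )
        # A single VQ codebook; real codecs stack several (residual VQ).
        self.codebook = nn.Embedding(codebook_size, hidden)
        # Transposed convolutions upsample quantized latents back to audio.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(hidden, hidden, kernel_size=8, stride=2, padding=3), nn.ELU(),
            nn.ConvTranspose1d(hidden, 1, kernel_size=8, stride=2, padding=3),
        )

    def encode(self, wav):                            # wav: (batch, 1, samples)
        z = self.encoder(wav).transpose(1, 2)         # (batch, frames, hidden)
        # Nearest codeword per frame: these integer ids are the "codec codes"
        # that get transmitted in place of the raw audio.
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))
        return dists.argmin(dim=-1)                   # (batch, frames)

    def decode(self, codes):                          # receiver side
        z = self.codebook(codes).transpose(1, 2)      # (batch, hidden, frames)
        return self.decoder(z)                        # (batch, 1, samples)

codec = TinyCodec()
wav = torch.randn(1, 1, 16000)        # one second of audio at 16 kHz
codes = codec.encode(wav)             # compact discrete codes, shape (1, 4000)
recon = codec.decode(codes)           # reconstruction, shape (1, 1, 16000)
```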
@@ -141,7 +141,7 @@ Through this comprehensive review, we aim to offer the community insights into t
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 Codec models aim to compress and decompress speech signals efficiently.
 Traditional codecs are developed based on psycho-acoustics and speech synthesis [21], [22].
@@ -188,7 +188,7 @@ $n_c$ represents the codebook number, `SR` represents the Sample Rate, and `BPS`
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 [SoundStream (2021)](../Models/SpeechCodec/2021.07.07_SoundStream.md) stands as one of the pioneering implementations of neural codec models, embodying a classic neural codec architecture comprising encoder, quantizer, and decoder modules.
 It utilizes the streaming [SEANets (2020)](../Models/_Basis/2020.09.04_SEANet.md) as its encoder and decoder.
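SoundStream's quantizer module popularized residual vector quantization (RVQ), in which each stage quantizes the residual left over by the stages before it. A minimal sketch under assumed sizes (8 stages, 1024 codewords each, 128-dimensional latents):

```python
import torch

def residual_vq(z, codebooks):
    """Residual VQ sketch: each stage quantizes what the previous stages missed.

    z:          (frames, dim) latent vectors from the encoder
    codebooks:  list of (codebook_size, dim) tensors, one per stage
    Returns the integer codes per stage and the quantized approximation of z.
    """
    residual = z
    quantized = torch.zeros_like(z)
    codes = []
    for cb in codebooks:
        # Nearest codeword to the *residual*, not to z itself.
        idx = torch.cdist(residual, cb).argmin(dim=-1)   # (frames,)
        chosen = cb[idx]                                  # (frames, dim)
        quantized = quantized + chosen
        residual = residual - chosen
        codes.append(idx)
    return codes, quantized

# Illustrative numbers: 8 stages of 1024 codewords over 128-dim latents.
torch.manual_seed(0)
codebooks = [torch.randn(1024, 128) for _ in range(8)]
z = torch.randn(50, 128)                                  # 50 latent frames
codes, z_hat = residual_vq(z, codebooks)
print(len(codes), z_hat.shape)                            # 8 stages, (50, 128)
```

Because each stage refines the previous one, trailing codebooks can be dropped at inference to trade reconstruction quality for bit rate.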
@@ -309,7 +309,7 @@ Meanwhile, it also finds that incorporating semantic information in the codec to
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 We compare several techniques proposed by these codecs in [Tab.02](#Tab.02).
 The abbreviations `A-F` represent the different codec models.
@@ -346,7 +346,7 @@ Tab.02: Comparison between codec implementation strategies.
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 The design of discriminators constitutes a pivotal element within codec models.
 [Encodec](../Models/SpeechCodec/2022.10.24_EnCodec.md) initially introduces the **Multi-Scale-STFT Discriminator (MS-STFTD)**.
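An MS-STFTD judges realness on complex spectrograms computed at several STFT resolutions. The sketch below shows that shape of discriminator; the convolution stack and the three FFT sizes are illustrative assumptions, not EnCodec's actual configuration:

```python
import torch
import torch.nn as nn

class MSSTFTDiscriminator(nn.Module):
    """Sketch of a multi-scale STFT discriminator in the spirit of EnCodec's
    MS-STFTD: one small conv net per STFT resolution (layer sizes assumed)."""

    def __init__(self, n_ffts=(512, 1024, 2048)):
        super().__init__()
        self.n_ffts = n_ffts
        self.nets = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(32, 1, kernel_size=3, padding=1),
            )
            for _ in n_ffts
        )

    def forward(self, wav):                           # wav: (batch, samples)
        scores = []
        for n_fft, net in zip(self.n_ffts, self.nets):
            # Complex STFT at this resolution (window rebuilt per call for brevity).
            spec = torch.stft(wav, n_fft, hop_length=n_fft // 4,
                              window=torch.hann_window(n_fft), return_complex=True)
            # Stack real/imaginary parts as 2 channels: (batch, 2, freq, time).
            x = torch.stack([spec.real, spec.imag], dim=1)
            scores.append(net(x))                     # one realness map per scale
        return scores
```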
@@ -380,7 +380,7 @@ To address this, they propose the application of a **Multi-Scale Multi-Band STFT
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 [SpeechTokenizer](../Models/SpeechCodec/2023.08.31_SpeechTokenizer.md) utilizes semantic tokens from [HuBERT L9](../Models/SpeechRepresentation/2021.06.14_HuBERT.md) as a teacher for the RVQ process.
 This guidance enables the disentanglement of content information into the first layer of the tokenizer, while paralinguistic information is retained in subsequent layers.
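Such teacher guidance can be implemented as a distillation loss that pulls the first RVQ layer's output toward the frozen teacher features. The sketch below uses a frame-wise cosine objective and a learned projection; both are assumptions for illustration, and SpeechTokenizer's exact loss may differ:

```python
import torch
import torch.nn.functional as F

def semantic_distill_loss(rvq_first_layer, teacher_feats, proj):
    """Hedged sketch of teacher-guided RVQ: pull the first quantizer layer's
    output toward semantic teacher features (e.g. HuBERT layer 9), so content
    lands in layer 1 and later layers are left to carry paralinguistic detail.

    rvq_first_layer: (batch, frames, d_codec)   quantized output of RVQ layer 1
    teacher_feats:   (batch, frames, d_teacher) frozen teacher representations
    proj:            linear map from codec space to teacher space (assumed)
    """
    student = proj(rvq_first_layer)
    # Maximise cosine similarity frame by frame (minimise 1 - cos).
    return (1 - F.cosine_similarity(student, teacher_feats, dim=-1)).mean()

proj = torch.nn.Linear(256, 768)                    # illustrative dimensions
loss = semantic_distill_loss(torch.randn(2, 100, 256),
                             torch.randn(2, 100, 768), proj)
```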
@@ -419,11 +419,11 @@ They demonstrate that using multiple residual groups achieves good reconstructio
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 We compare the codebook number, training data, sampling rate, and bit rate per second in [Tab.01](#Tab.01).
 From the training data perspective, [SpeechTokenizer (2023)](../Models/SpeechCodec/2023.08.31_SpeechTokenizer.md), [AudioDec (2023)](../Models/SpeechCodec/2023.05.26_AudioDec.md), and [FunCodec (2023)](../Models/SpeechCodec/2023.09.14_FunCodec.md) utilize only English speech datasets.
-[AcademiCodec/HiFi-Codec (2023)](../Models/SpeechCodec/2023.05.04_HiFi-Codec.md) incorporates bilingual speech datasets, including [AISHELL](../../Datasets/2017.09.16_AISHELL-1.md) for Chinese and [LibriTTS](../../Datasets/2019.04.05_LibriTTS.md) and [VCTK](../../Datasets/2012.08.00_VCTK.md) for English.
+[AcademiCodec/HiFi-Codec (2023)](../Models/SpeechCodec/2023.05.04_HiFi-Codec.md) incorporates bilingual speech datasets, including [AISHELL](../Datasets/2017.09.16_AISHELL-1.md) for Chinese and [LibriTTS](../Datasets/2019.04.05_LibriTTS.md) and [VCTK](../Datasets/2012.08.00_VCTK.md) for English.
 Both [DAC (2023)](../Models/SpeechCodec/2023.06.11_Descript-Audio-Codec.md) and [Encodec (2022)](../Models/SpeechCodec/2022.10.24_EnCodec.md) encompass diverse modality data, including speech, music, and audio, in the training data.
 
 </td>
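The bit rate in Tab.01 follows directly from the quantizer configuration: frames per second (the sample rate divided by the encoder hop length) times $n_c$ codebooks times $\log_2$(codebook size) bits per code. A worked example with assumed, EnCodec-like numbers:

```python
import math

def bitrate_bps(n_c, codebook_size, sample_rate, hop_length):
    """Bits per second of an RVQ codec: codebooks x bits-per-code x frames-per-second."""
    frames_per_second = sample_rate / hop_length
    bits_per_frame = n_c * math.log2(codebook_size)
    return frames_per_second * bits_per_frame

# Assumed configuration: 24 kHz audio, 320-sample hop,
# 8 residual codebooks of 1024 entries each.
print(bitrate_bps(n_c=8, codebook_size=1024, sample_rate=24_000, hop_length=320))
# -> 6000.0 bits/s, i.e. 6 kbps
```

This is also why RVQ codecs can scale their bit rate down proportionally by dropping trailing codebooks.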
@@ -432,7 +432,7 @@ Both [DAC (2023)](../Models/SpeechCodec/2023.06.11_Descript-Audio-Codec.md) and
@@ -443,7 +443,7 @@ Both [DAC (2023)](../Models/SpeechCodec/2023.06.11_Descript-Audio-Codec.md) and
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 As shown in [Fig.02](#Fig.02), the process of neural codec-based audio language modeling begins by converting context information, such as text and MIDI, into context codes, while simultaneously encoding the audio into codec codes.
 These context and codec codes are then employed in the language modeling phase to generate the desired target codec code sequence.
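In code form, the language modeling phase of Fig.02 is an ordinary autoregressive loop over a shared discrete vocabulary: the context codes form the prompt, and the model emits target codec codes one at a time. A sketch with an assumed stand-in `lm` callable, not any specific surveyed model:

```python
import torch

@torch.no_grad()
def generate_codec_codes(lm, context_codes, max_frames, eos_id):
    """Autoregressive sketch of the modelling phase in Fig.02.

    `lm` is any stand-in callable mapping a (1, seq_len) token tensor to
    (1, seq_len, vocab) next-token logits.
    `context_codes` are the text/MIDI-derived context codes used as the prompt.
    """
    seq = list(context_codes)
    target = []
    for _ in range(max_frames):
        logits = lm(torch.tensor([seq]))[0, -1]       # logits for the next code
        nxt = torch.distributions.Categorical(logits=logits).sample().item()
        if nxt == eos_id:
            break
        seq.append(nxt)
        target.append(nxt)
    return target     # the target codec code sequence, later decoded to audio
```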
@@ -478,7 +478,7 @@ Fig.02: Codec-Based Language Modeling.
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 [AudioLM (2022)](../Models/SpeechLM/2022.09.07_AudioLM.md) is the pioneering model in introducing codec codes for language modeling, utilizing a hierarchical approach that encompasses two distinct stages.
 The first stage generates semantic tokens using a self-supervised [W2V-BERT (2021)](../Models/SpeechRepresentation/2021.08.07_W2V-BERT.md) model.
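The hierarchy can be read as two chained generators, where the second conditions on the output of the first. A structural sketch with assumed `.generate` interfaces, for illustration rather than taken from the AudioLM paper:

```python
def hierarchical_generate(semantic_lm, acoustic_lm, semantic_prompt):
    """Two-stage hierarchy as described above (interfaces assumed).

    Stage 1 extends the semantic token stream (content, long-term structure);
    stage 2 renders it into codec codes that a codec decoder turns into audio.
    """
    semantic_tokens = semantic_lm.generate(semantic_prompt)
    codec_codes = acoustic_lm.generate(condition=semantic_tokens)
    return codec_codes
```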
@@ -633,7 +633,7 @@ With the development of these powerful speech LMs, researchers have begun to exp
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 In [Tab.03](#Tab.03), we compare the inputs, outputs, and downstream tasks of different codec-based language models.
 We also summarize the downstream tasks conducted by the different codec-based language models:
@@ -698,7 +698,7 @@ Tab.03: Codec-Based Language Models Comparison.
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 The paper fills a research gap by reviewing neural codec models and the LMs built upon them.
 We hope the comprehensive review and comparisons can inspire future research to boost the development of neural codec models and codec-based LMs.