
Commit 2d8859a

Modify width (修改宽度)

1 parent ee3834c commit 2d8859a

File tree

1 file changed: +14 −14 lines changed

Surveys/2024.02.20__Survey__Towards_Audio_Language_Modeling_(5P).md

Lines changed: 14 additions & 14 deletions
```diff
@@ -27,7 +27,7 @@

 <table>
 <tr>
-<td>
+<td width="50%">

 Neural audio codecs are initially introduced to compress audio data into compact codes to reduce transmission latency.
 Researchers recently discovered the potential of codecs as suitable tokenizers for converting continuous audio into discrete codes, which can be employed to develop audio language models (LMs).
```
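Every hunk in this commit makes the same one-line change: the opening `<td>` of a two-column bilingual table gains an explicit `width="50%"`. For orientation, here is a minimal sketch of the layout pattern the file uses after this commit; the cell contents are placeholders, and the Chinese-translation cell is inferred from the hunks around original lines 432-438:

```html
<table>
<tr>
<td width="50%">

English paragraph of the survey...

</td>
<td width="50%">

对应的中文翻译...

</td>
</tr>
</table>
```

Fixing both cells at 50% keeps the English and Chinese columns the same width regardless of how long either paragraph runs.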
```diff
@@ -50,7 +50,7 @@ The paper aims to provide a thorough and systematic overview of the neural audio

 <table>
 <tr>
-<td>
+<td width="50%">

 Neural audio codec models were first introduced to compress audio for efficient data transmission.
 The encoder converts the audio into codec codes, which are then transmitted.
```
```diff
@@ -141,7 +141,7 @@ Through this comprehensive review, we aim to offer the community insights into t

 <table>
 <tr>
-<td>
+<td width="50%">

 Codec models aim to compress and decompress speech signals efficiently.
 Traditional codecs are developed based on psycho-acoustics and speech synthesis [21], [22].
```
```diff
@@ -188,7 +188,7 @@ $n_c$ represents the codebook number, `SR` represents the Sample Rate, and `BPS`

 <table>
 <tr>
-<td>
+<td width="50%">

 [SoundStream (2021)](../Models/SpeechCodec/2021.07.07_SoundStream.md) stands as one of the pioneering implementations of neural codec models, embodying a classic neural codec architecture comprising encoder, quantizer, and decoder modules.
 It utilizes the streaming [SEANets (2020)](../Models/_Basis/2020.09.04_SEANet.md) as its encoder and decoder.
```
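The encoder-quantizer-decoder pattern this hunk mentions is easiest to see in the quantizer. Below is a minimal numpy sketch of residual vector quantization (RVQ), the scheme SoundStream popularized; the codebook sizes and contents are invented for illustration, and this is not SoundStream's actual code:

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Quantize one latent frame with residual VQ.

    frame:     (d,) latent vector from the encoder.
    codebooks: list of n_c arrays, each (K, d) -- one codebook per stage.
    Returns the chosen code indices and the final residual.
    """
    residual = frame.copy()
    indices = []
    for cb in codebooks:
        # Pick the codeword nearest to the current residual.
        k = np.argmin(((cb - residual) ** 2).sum(axis=1))
        indices.append(int(k))
        # Later stages quantize what this stage missed.
        residual = residual - cb[k]
    return indices, residual

def rvq_decode(indices, codebooks):
    """Reconstruct the latent frame by summing the chosen codewords."""
    return sum(cb[k] for cb, k in zip(codebooks, indices))

# Toy usage: n_c = 4 codebooks of K = 256 entries over d = 8 dims.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 8)) for _ in range(4)]
frame = rng.normal(size=8)
idx, _ = rvq_encode(frame, codebooks)
approx = rvq_decode(idx, codebooks)
```

Each added codebook refines the error left by the previous stages, which is why bit rate grows with the codebook number $n_c$ compared in Tab.01.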
```diff
@@ -309,7 +309,7 @@ Meanwhile, it also finds that incorporating semantic information in the codec to

 <table>
 <tr>
-<td>
+<td width="50%">

 We compare several techniques proposed by these codecs in [Tab.02](#Tab.02).
 The abbreviation `A-F` represents different codec models.
```
```diff
@@ -346,7 +346,7 @@ Tab.02: Comparison between codec implementation strategy.

 <table>
 <tr>
-<td>
+<td width="50%">

 The design of discriminators constitutes a pivotal element within codec models.
 [Encodec](../Models/SpeechCodec/2022.10.24_EnCodec.md) initially introduces the **Multi-Scale-STFT Discriminator (MS-STFTD)**.
```
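As a rough illustration of the "multi-scale" part of MS-STFTD, the sketch below computes STFTs of one waveform at several resolutions; in the actual discriminator each scale feeds its own small convolutional sub-network, which is omitted here, and the window sizes are illustrative rather than Encodec's exact values:

```python
import numpy as np

def stft(x, n_fft, hop):
    """Plain magnitude STFT with a Hann window."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multi_scale_stft(x, scales=((2048, 512), (1024, 256), (512, 128))):
    """One spectrogram per (n_fft, hop) scale; each scale would be
    scored by its own sub-discriminator in an MS-STFTD-style setup."""
    return [stft(x, n_fft, hop) for n_fft, hop in scales]

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of A4 at 16 kHz
specs = multi_scale_stft(x)
print([s.shape for s in specs])
```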
```diff
@@ -380,7 +380,7 @@ To address this, they propose the application of a **Multi-Scale Multi-Band STFT

 <table>
 <tr>
-<td>
+<td width="50%">

 [SpeechTokenizer](../Models/SpeechCodec/2023.08.31_SpeechTokenizer.md) utilizes semantic tokens from [HuBERT L9](../Models/SpeechRepresentation/2021.06.14_HuBERT.md) as a teacher for the RVQ process.
 This guidance enables the disentanglement of content information into the first layer of the tokenizer, while paralinguistic information is retained in subsequent layers.
```
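A rough sketch of the semantic-guidance idea this hunk describes: pull the first RVQ stage's output toward a frozen teacher representation (HuBERT layer 9 in SpeechTokenizer) so that content lands in layer 1 and later layers keep the remainder. The shapes, cosine-distance form, and names below are a hedged illustration, not the paper's code:

```python
import numpy as np

def cosine_distill_loss(first_layer_out, teacher_feats):
    """1 - cosine similarity, averaged over frames.

    first_layer_out: (T, d) decoded output of RVQ stage 1.
    teacher_feats:   (T, d) frozen HuBERT-L9 features (projected to d).
    """
    a = first_layer_out / np.linalg.norm(first_layer_out, axis=-1, keepdims=True)
    b = teacher_feats / np.linalg.norm(teacher_feats, axis=-1, keepdims=True)
    return float(np.mean(1.0 - (a * b).sum(axis=-1)))
```

Because only stage 1 is pulled toward the teacher, the residual stages are left free to encode speaker and prosody, which is the disentanglement the hunk mentions.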
```diff
@@ -419,11 +419,11 @@ They demonstrate that using multiple residual groups achieves good reconstructio

 <table>
 <tr>
-<td>
+<td width="50%">

 We compare the codebook number, training data, sampling rate, and bit rate per second in [Tab.01](#Tab.01).
 From the training data perspective, [SpeechTokenizer (2023)](../Models/SpeechCodec/2023.08.31_SpeechTokenizer.md), [AudioDec (2023)](../Models/SpeechCodec/2023.05.26_AudioDec.md), and [FunCodec (2023)](../Models/SpeechCodec/2023.09.14_FunCodec.md) utilize only English speech dataset.
-[AcademiCodec/HiFi-Codec (2023)](../Models/SpeechCodec/2023.05.04_HiFi-Codec.md) incorporates bilingual speech datasets, including [AISHELL](../../Datasets/2017.09.16_AISHELL-1.md) for Chinese and [LibriTTS](../../Datasets/2019.04.05_LibriTTS.md) and [VCTK](../../Datasets/2012.08.00_VCTK.md) for English.
+[AcademiCodec/HiFi-Codec (2023)](../Models/SpeechCodec/2023.05.04_HiFi-Codec.md) incorporates bilingual speech datasets, including [AISHELL](../Datasets/2017.09.16_AISHELL-1.md) for Chinese and [LibriTTS](../Datasets/2019.04.05_LibriTTS.md) and [VCTK](../Datasets/2012.08.00_VCTK.md) for English.
 Both [DAC (2023)](../Models/SpeechCodec/2023.06.11_Descript-Audio-Codec.md), and [Encodec (2022)](../Models/SpeechCodec/2022.10.24_EnCodec.md) encompass diverse modality data, including speech, music, and audio, in the training data.

 </td>
```
```diff
@@ -432,7 +432,7 @@ Both [DAC (2023)](../Models/SpeechCodec/2023.06.11_Descript-Audio-Codec.md), and
 我们在[表 01](#Tab.01) 中比较了码本数量, 训练数据, 采样率, 和每秒比特率.
 从训练数据的角度看:
 - [SpeechTokenizer (2023)](../Models/SpeechCodec/2023.08.31_SpeechTokenizer.md), [AudioDec (2023)](../Models/SpeechCodec/2023.05.26_AudioDec.md), 和 [FunCodec (2023)](../Models/SpeechCodec/2023.09.14_FunCodec.md) 只使用了英语语音数据集.
-- [AcademiCodec/HiFi-Codec (2023)](../Models/SpeechCodec/2023.05.04_HiFi-Codec.md) 包含了双语语音数据集, 包括 [AISHELL](../../Datasets/2017.09.16_AISHELL-1.md) 用于中文, [LibriTTS](../../Datasets/2019.04.05_LibriTTS.md) 和 [VCTK](../../Datasets/2012.08.00_VCTK.md) 用于英语.
+- [AcademiCodec/HiFi-Codec (2023)](../Models/SpeechCodec/2023.05.04_HiFi-Codec.md) 包含了双语语音数据集, 包括 [AISHELL](../Datasets/2017.09.16_AISHELL-1.md) 用于中文, [LibriTTS](../Datasets/2019.04.05_LibriTTS.md) 和 [VCTK](../Datasets/2012.08.00_VCTK.md) 用于英语.
 - [DAC (2023)](../Models/SpeechCodec/2023.06.11_Descript-Audio-Codec.md) 和 [Encodec (2022)](../Models/SpeechCodec/2022.10.24_EnCodec.md) 在训练数据中包含了多种模态数据, 包括语音, 音乐, 以及音频.

 </td>
```
```diff
@@ -443,7 +443,7 @@ Both [DAC (2023)](../Models/SpeechCodec/2023.06.11_Descript-Audio-Codec.md), and

 <table>
 <tr>
-<td>
+<td width="50%">

 As shown in [Fig.02](#Fig.02), the process of neural codec-based audio language modeling begins by converting context information, such as text and MIDI, into context codes, while simultaneously encoding the audio into codec codes.
 These context and codec codes are then employed in the language modeling phase to generate the desired target codec code sequence.
```
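At the token level, the pipeline this hunk describes is ordinary autoregressive language modeling over a concatenated sequence. A minimal hedged sketch of how such a sequence might be assembled; the separator id and vocabulary layout are invented for illustration:

```python
# Hypothetical token-id layout: context tokens (e.g., from text or MIDI)
# and codec tokens share one vocabulary, separated by a marker token.
SEP = 0  # assumed separator id

def build_lm_sequence(context_codes, codec_codes):
    """Concatenate context and codec codes; an autoregressive LM is
    trained with next-token prediction over the resulting sequence."""
    return context_codes + [SEP] + codec_codes

seq = build_lm_sequence([17, 42, 256], [1024, 1031, 1999])
# At inference, the LM generates target codec codes after SEP, and the
# neural codec's decoder turns those codes back into a waveform.
```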
```diff
@@ -478,7 +478,7 @@ Fig.02: Codec-Based Language Modeling.

 <table>
 <tr>
-<td>
+<td width="50%">

 [AudioLM (2022)](../Models/SpeechLM/2022.09.07_AudioLM.md) is the pioneering model in introducing codec codes for language modeling, utilizing a hierarchical approach that encompasses two distinct stages.
 The first stage generates semantic tokens using a self-supervised [W2V-BERT (2021)](../Models/SpeechRepresentation/2021.08.07_W2V-BERT.md) model.
```
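AudioLM's hierarchy, reduced to its control flow: stage one predicts semantic tokens, stage two predicts acoustic (codec) tokens conditioned on them. A schematic sketch with placeholder model objects, not AudioLM's real API:

```python
def hierarchical_generate(prompt_audio, semantic_lm, acoustic_lm,
                          semantic_tokenizer, codec_decoder):
    """Two-stage generation in the AudioLM style (schematic; all four
    model objects are assumed interfaces, not a published library)."""
    # Stage 1: continue the semantic-token stream (W2V-BERT derived).
    semantic_prompt = semantic_tokenizer(prompt_audio)
    semantic_tokens = semantic_lm.generate(semantic_prompt)

    # Stage 2: generate codec tokens conditioned on the semantic plan.
    acoustic_tokens = acoustic_lm.generate(condition=semantic_tokens)

    # Decode the codec tokens back to audio with the neural codec.
    return codec_decoder(acoustic_tokens)
```

Splitting generation this way lets the semantic stage fix long-range content before the acoustic stage fills in fine detail.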
```diff
@@ -633,7 +633,7 @@ With the development of these powerful speech LMs, researchers have begun to exp

 <table>
 <tr>
-<td>
+<td width="50%">

 In [Tab.03](#Tab.03), we compare the inputs, outputs, and downstream tasks of different codec-based language models.
 We also summarize that the downstream tasks conducted by different codec-based language models:
```
```diff
@@ -698,7 +698,7 @@ Tab.03: Codec-Based Language Models Comparison.

 <table>
 <tr>
-<td>
+<td width="50%">

 The paper fills the research blank to review the neural codec models and LMs built upon them.
 We hope the comprehensive review and comparisons can inspire future research works to boost the development of neural codec models and codec-based LMs.
```
