Surveys/2024.02.20__Survey__Towards_Audio_Language_Modeling_(5P).md
14 additions & 14 deletions
@@ -27,7 +27,7 @@
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 Neural audio codecs were initially introduced to compress audio data into compact codes to reduce transmission latency.
 Researchers recently discovered the potential of codecs as suitable tokenizers for converting continuous audio into discrete codes, which can be employed to develop audio language models (LMs).
@@ -50,7 +50,7 @@ The paper aims to provide a thorough and systematic overview of the neural audio
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 Neural audio codec models were first introduced to compress audio for efficient data transmission.
 The encoder converts the audio into codec codes, which are then transmitted.
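The round trip described above (compress to discrete codes, transmit, reconstruct) can be sketched in a few lines of PyTorch. Everything here, the module name, layer sizes, and the single-codebook quantizer, is an illustrative stand-in rather than any surveyed model's actual architecture:

```python
import torch
import torch.nn as nn

class TinyCodec(nn.Module):
    """Minimal encoder -> quantizer -> decoder round trip (illustrative only)."""

    def __init__(self, hidden=64, codebook_size=1024):
        super().__init__()
        # Strided 1-D convolutions downsample the waveform into latent frames.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=7, stride=2, padding=3), nn.ELU(),
            nn.Conv1d(hidden, hidden, kernel_size=7, stride=2, padding=3),
        )
        # A single VQ codebook; real codecs stack several (residual VQ).
        self.codebook = nn.Embedding(codebook_size, hidden)
        # Transposed convolutions upsample quantized latents back to audio.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(hidden, hidden, kernel_size=8, stride=2, padding=3), nn.ELU(),
            nn.ConvTranspose1d(hidden, 1, kernel_size=8, stride=2, padding=3),
        )

    def encode(self, wav):                            # wav: (batch, 1, samples)
        z = self.encoder(wav).transpose(1, 2)         # (batch, frames, hidden)
        # Nearest codeword per frame: these integer ids are the "codec codes"
        # that get transmitted in place of the raw audio.
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))
        return dists.argmin(dim=-1)                   # (batch, frames)

    def decode(self, codes):                          # receiver side
        z = self.codebook(codes).transpose(1, 2)      # (batch, hidden, frames)
        return self.decoder(z)                        # (batch, 1, samples)

codec = TinyCodec()
wav = torch.randn(1, 1, 16000)        # one second of audio at 16 kHz
codes = codec.encode(wav)             # compact discrete codes, shape (1, 4000)
recon = codec.decode(codes)           # reconstruction, shape (1, 1, 16000)
```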
@@ -141,7 +141,7 @@ Through this comprehensive review, we aim to offer the community insights into t
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 Codec models aim to compress and decompress speech signals efficiently.
 Traditional codecs are developed based on psycho-acoustics and speech synthesis [21], [22].
@@ -188,7 +188,7 @@ $n_c$ represents the codebook number, `SR` represents the Sample Rate, and `BPS`
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 [SoundStream (2021)](../Models/SpeechCodec/2021.07.07_SoundStream.md) stands as one of the pioneering implementations of neural codec models, embodying a classic neural codec architecture comprising encoder, quantizer, and decoder modules.
 It utilizes the streaming [SEANets (2020)](../Models/_Basis/2020.09.04_SEANet.md) as its encoder and decoder.
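SoundStream's quantizer module popularized residual vector quantization (RVQ), in which each stage quantizes the residual left over by the stages before it. A minimal sketch under assumed sizes (8 stages, 1024 codewords each, 128-dimensional latents):

```python
import torch

def residual_vq(z, codebooks):
    """Residual VQ sketch: each stage quantizes what the previous stages missed.

    z:          (frames, dim) latent vectors from the encoder
    codebooks:  list of (codebook_size, dim) tensors, one per stage
    Returns the integer codes per stage and the quantized approximation of z.
    """
    residual = z
    quantized = torch.zeros_like(z)
    codes = []
    for cb in codebooks:
        # Nearest codeword to the *residual*, not to z itself.
        idx = torch.cdist(residual, cb).argmin(dim=-1)   # (frames,)
        chosen = cb[idx]                                  # (frames, dim)
        quantized = quantized + chosen
        residual = residual - chosen
        codes.append(idx)
    return codes, quantized

# Illustrative numbers: 8 stages of 1024 codewords over 128-dim latents.
torch.manual_seed(0)
codebooks = [torch.randn(1024, 128) for _ in range(8)]
z = torch.randn(50, 128)                                  # 50 latent frames
codes, z_hat = residual_vq(z, codebooks)
print(len(codes), z_hat.shape)                            # 8 stages, (50, 128)
```

Because each stage refines the previous one, trailing codebooks can be dropped at inference to trade reconstruction quality for bit rate.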
@@ -309,7 +309,7 @@ Meanwhile, it also finds that incorporating semantic information in the codec to
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 We compare several techniques proposed by these codecs in [Tab.02](#Tab.02).
 The abbreviations `A-F` represent the different codec models.
@@ -346,7 +346,7 @@ Tab.02: Comparison between codec implementation strategies.
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 The design of discriminators constitutes a pivotal element within codec models.
 [Encodec](../Models/SpeechCodec/2022.10.24_EnCodec.md) initially introduces the **Multi-Scale-STFT Discriminator (MS-STFTD)**.
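An MS-STFTD judges realness on complex spectrograms computed at several STFT resolutions. The sketch below shows that shape of discriminator; the convolution stack and the three FFT sizes are illustrative assumptions, not EnCodec's actual configuration:

```python
import torch
import torch.nn as nn

class MSSTFTDiscriminator(nn.Module):
    """Sketch of a multi-scale STFT discriminator in the spirit of EnCodec's
    MS-STFTD: one small conv net per STFT resolution (layer sizes assumed)."""

    def __init__(self, n_ffts=(512, 1024, 2048)):
        super().__init__()
        self.n_ffts = n_ffts
        self.nets = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(32, 1, kernel_size=3, padding=1),
            )
            for _ in n_ffts
        )

    def forward(self, wav):                           # wav: (batch, samples)
        scores = []
        for n_fft, net in zip(self.n_ffts, self.nets):
            # Complex STFT at this resolution (window rebuilt per call for brevity).
            spec = torch.stft(wav, n_fft, hop_length=n_fft // 4,
                              window=torch.hann_window(n_fft), return_complex=True)
            # Stack real/imaginary parts as 2 channels: (batch, 2, freq, time).
            x = torch.stack([spec.real, spec.imag], dim=1)
            scores.append(net(x))                     # one realness map per scale
        return scores
```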
@@ -380,7 +380,7 @@ To address this, they propose the application of a **Multi-Scale Multi-Band STFT
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 [SpeechTokenizer](../Models/SpeechCodec/2023.08.31_SpeechTokenizer.md) utilizes semantic tokens from [HuBERT L9](../Models/SpeechRepresentation/2021.06.14_HuBERT.md) as a teacher for the RVQ process.
 This guidance enables the disentanglement of content information into the first layer of the tokenizer, while paralinguistic information is retained in subsequent layers.
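Such teacher guidance can be implemented as a distillation loss that pulls the first RVQ layer's output toward the frozen teacher features. The sketch below uses a frame-wise cosine objective and a learned projection; both are assumptions for illustration, and SpeechTokenizer's exact loss may differ:

```python
import torch
import torch.nn.functional as F

def semantic_distill_loss(rvq_first_layer, teacher_feats, proj):
    """Hedged sketch of teacher-guided RVQ: pull the first quantizer layer's
    output toward semantic teacher features (e.g. HuBERT layer 9), so content
    lands in layer 1 and later layers are left to carry paralinguistic detail.

    rvq_first_layer: (batch, frames, d_codec)   quantized output of RVQ layer 1
    teacher_feats:   (batch, frames, d_teacher) frozen teacher representations
    proj:            linear map from codec space to teacher space (assumed)
    """
    student = proj(rvq_first_layer)
    # Maximise cosine similarity frame by frame (minimise 1 - cos).
    return (1 - F.cosine_similarity(student, teacher_feats, dim=-1)).mean()

proj = torch.nn.Linear(256, 768)                    # illustrative dimensions
loss = semantic_distill_loss(torch.randn(2, 100, 256),
                             torch.randn(2, 100, 768), proj)
```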
@@ -419,11 +419,11 @@ They demonstrate that using multiple residual groups achieves good reconstructio
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 We compare the codebook number, training data, sampling rate, and bit rate per second in [Tab.01](#Tab.01).
 From the training data perspective, [SpeechTokenizer (2023)](../Models/SpeechCodec/2023.08.31_SpeechTokenizer.md), [AudioDec (2023)](../Models/SpeechCodec/2023.05.26_AudioDec.md), and [FunCodec (2023)](../Models/SpeechCodec/2023.09.14_FunCodec.md) utilize only English speech datasets.
-[AcademiCodec/HiFi-Codec (2023)](../Models/SpeechCodec/2023.05.04_HiFi-Codec.md) incorporates bilingual speech datasets, including [AISHELL](../../Datasets/2017.09.16_AISHELL-1.md) for Chinese and [LibriTTS](../../Datasets/2019.04.05_LibriTTS.md) and [VCTK](../../Datasets/2012.08.00_VCTK.md) for English.
+[AcademiCodec/HiFi-Codec (2023)](../Models/SpeechCodec/2023.05.04_HiFi-Codec.md) incorporates bilingual speech datasets, including [AISHELL](../Datasets/2017.09.16_AISHELL-1.md) for Chinese and [LibriTTS](../Datasets/2019.04.05_LibriTTS.md) and [VCTK](../Datasets/2012.08.00_VCTK.md) for English.
 Both [DAC (2023)](../Models/SpeechCodec/2023.06.11_Descript-Audio-Codec.md) and [Encodec (2022)](../Models/SpeechCodec/2022.10.24_EnCodec.md) encompass diverse modality data, including speech, music, and audio, in the training data.
 
 </td>
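The bit rate in Tab.01 follows directly from the quantizer configuration: frames per second (the sample rate divided by the encoder hop length) times $n_c$ codebooks times $\log_2$(codebook size) bits per code. A worked example with assumed, EnCodec-like numbers:

```python
import math

def bitrate_bps(n_c, codebook_size, sample_rate, hop_length):
    """Bits per second of an RVQ codec: codebooks x bits-per-code x frames-per-second."""
    frames_per_second = sample_rate / hop_length
    bits_per_frame = n_c * math.log2(codebook_size)
    return frames_per_second * bits_per_frame

# Assumed configuration: 24 kHz audio, 320-sample hop,
# 8 residual codebooks of 1024 entries each.
print(bitrate_bps(n_c=8, codebook_size=1024, sample_rate=24_000, hop_length=320))
# -> 6000.0 bits/s, i.e. 6 kbps
```

This is also why RVQ codecs can scale their bit rate down proportionally by dropping trailing codebooks.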
@@ -432,7 +432,7 @@ Both [DAC (2023)](../Models/SpeechCodec/2023.06.11_Descript-Audio-Codec.md) and
@@ -443,7 +443,7 @@ Both [DAC (2023)](../Models/SpeechCodec/2023.06.11_Descript-Audio-Codec.md) and
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 As shown in [Fig.02](#Fig.02), the process of neural codec-based audio language modeling begins by converting context information, such as text and MIDI, into context codes, while simultaneously encoding the audio into codec codes.
 These context and codec codes are then employed in the language modeling phase to generate the desired target codec code sequence.
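In code form, the language modeling phase of Fig.02 is an ordinary autoregressive loop over a shared discrete vocabulary: the context codes form the prompt, and the model emits target codec codes one at a time. A sketch with an assumed stand-in `lm` callable, not any specific surveyed model:

```python
import torch

@torch.no_grad()
def generate_codec_codes(lm, context_codes, max_frames, eos_id):
    """Autoregressive sketch of the modelling phase in Fig.02.

    `lm` is any stand-in callable mapping a (1, seq_len) token tensor to
    (1, seq_len, vocab) next-token logits.
    `context_codes` are the text/MIDI-derived context codes used as the prompt.
    """
    seq = list(context_codes)
    target = []
    for _ in range(max_frames):
        logits = lm(torch.tensor([seq]))[0, -1]       # logits for the next code
        nxt = torch.distributions.Categorical(logits=logits).sample().item()
        if nxt == eos_id:
            break
        seq.append(nxt)
        target.append(nxt)
    return target     # the target codec code sequence, later decoded to audio
```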
@@ -478,7 +478,7 @@ Fig.02: Codec-Based Language Modeling.
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 [AudioLM (2022)](../Models/SpeechLM/2022.09.07_AudioLM.md) is the pioneering model in introducing codec codes for language modeling, utilizing a hierarchical approach that encompasses two distinct stages.
 The first stage generates semantic tokens using a self-supervised [W2V-BERT (2021)](../Models/SpeechRepresentation/2021.08.07_W2V-BERT.md) model.
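The hierarchy can be read as two chained generators, where the second conditions on the output of the first. A structural sketch with assumed `.generate` interfaces, for illustration rather than taken from the AudioLM paper:

```python
def hierarchical_generate(semantic_lm, acoustic_lm, semantic_prompt):
    """Two-stage hierarchy as described above (interfaces assumed).

    Stage 1 extends the semantic token stream (content, long-term structure);
    stage 2 renders it into codec codes that a codec decoder turns into audio.
    """
    semantic_tokens = semantic_lm.generate(semantic_prompt)
    codec_codes = acoustic_lm.generate(condition=semantic_tokens)
    return codec_codes
```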
@@ -633,7 +633,7 @@ With the development of these powerful speech LMs, researchers have begun to exp
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 In [Tab.03](#Tab.03), we compare the inputs, outputs, and downstream tasks of different codec-based language models.
 We also summarize the downstream tasks conducted by the different codec-based language models:
@@ -698,7 +698,7 @@ Tab.03: Codec-Based Language Models Comparison.
 
 <table>
 <tr>
-<td>
+<td width="50%">
 
 The paper fills a research gap by reviewing neural codec models and the LMs built upon them.
 We hope the comprehensive review and comparisons can inspire future research to boost the development of neural codec models and codec-based LMs.