Models/SpeechCodec/2021.07.07_SoundStream.md
Lines changed: 3 additions & 3 deletions
@@ -84,7 +84,7 @@ To this end, one (or more) discriminators are trained jointly, with the goal of
Both the encoder and the decoder only use causal convolutions, so the overall architectural latency of the model is determined solely by the temporal resampling ratio between the original time-domain waveform and the embeddings.
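To put a rough number on the latency claim above (a back-of-envelope sketch; the 24 kHz sample rate and 320× overall stride are illustrative assumptions, not stated in this hunk):

```python
# Architectural latency of a causal encoder/decoder equals the temporal
# resampling ratio between waveform samples and embeddings.
# Sample rate and per-block strides below are illustrative assumptions.
sample_rate_hz = 24_000
total_stride = 2 * 4 * 5 * 8                      # product = 320 samples per embedding

frame_rate_hz = sample_rate_hz / total_stride     # embeddings per second
latency_ms = 1000 * total_stride / sample_rate_hz

print(f"frame rate: {frame_rate_hz:.1f} Hz, architectural latency: {latency_ms:.1f} ms")
# -> frame rate: 75.0 Hz, architectural latency: 13.3 ms
```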
In summary, we make the following key contributions:
- We propose ***SoundStream***, a neural audio codec in which all the constituent components (encoder, decoder and quantizer) are trained end-to-end with a mix of reconstruction and adversarial losses to achieve superior audio quality.
- - We introduce a new residual vector quantizer, and investigate the rate-distortion-complexity trade-off simplied by its design.
+ - We introduce a new residual vector quantizer, and investigate the rate-distortion-complexity trade-off simplified by its design.
In addition, we propose a novel “quantizer dropout” technique for training the residual vector quantizer, which enables a single model to handle different bitrates.
- We demonstrate that learning the encoder brings a very significant coding efficiency improvement, with respect to a solution that adopts mel-spectrogram features.
- We demonstrate by means of subjective quality metrics that ***SoundStream*** outperforms both Opus and EVS over a wide range of bitrates.
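The two quantizer-related contributions in the list above (the residual vector quantizer and the "quantizer dropout" trick) can be made concrete with a minimal NumPy sketch; the codebook sizes, dimensions, and nearest-neighbour search here are illustrative assumptions, not the implementation behind this note:

```python
import numpy as np

def rvq_encode(x, codebooks, n_q=None):
    """Residual vector quantization of one embedding x (shape [D]).

    codebooks: list of [N, D] arrays. n_q: how many quantizers to apply;
    quantizer dropout samples this at random during training so a single
    model can later be run at several bitrates.
    """
    if n_q is None:
        n_q = len(codebooks)
    quantized = np.zeros_like(x)
    codes = []
    for cb in codebooks[:n_q]:
        residual = x - quantized                       # what is still unexplained
        idx = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        codes.append(idx)
        quantized = quantized + cb[idx]                # refine the reconstruction
    return codes, quantized

# Toy usage with assumed sizes: 4 quantizers, 16-entry codebooks, 8-dim embeddings.
rng = np.random.default_rng(0)
D, N, NQ = 8, 16, 4
codebooks = [rng.normal(size=(N, D)) for _ in range(NQ)]
x = rng.normal(size=D)

n_q = int(rng.integers(1, NQ + 1))                     # quantizer dropout: use only the first n_q
codes, x_hat = rvq_encode(x, codebooks, n_q)
print(n_q, codes, float(np.linalg.norm(x - x_hat)))
```

At inference the same codebooks can be truncated to any n_q, which is what allows one trained model to serve multiple target bitrates.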
@@ -589,8 +589,8 @@ Instead, decreasing the capacity of the decoder has a more significant impact on
This is aligned with recent findings in the field of neural image compression [67], which also adopt a lighter encoder and a heavier decoder.
**Vector Quantizer Depth and Codebook Size**:
- The number of bits used to encode a single frame is equal to Nqlog2N, where Nq denotes the number of quantizers and N the codebook size.
- Hence, it is possible to achieve the same target bitrate for different combinations of Nqand N.
+ The number of bits used to encode a single frame is equal to Nq log2N, where Nq denotes the number of quantizers and N the codebook size.
+ Hence, it is possible to achieve the same target bitrate for different combinations of Nq and N.
Table II shows three configurations, all operating at 6 kbps.
As expected, using fewer vector quantizers, each with a larger codebook, achieves the highest coding efficiency at the cost of higher computational complexity.
Remarkably, using a sequence of 80 1-bit quantizers leads only to a modest quality degradation.
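As a quick sanity check on the corrected bits-per-frame formula (Nq log2 N) and the trade-off discussed above, the snippet below evaluates a few (Nq, N) combinations that all yield 80 bits per frame; the 75 Hz embedding rate and the first two combinations are assumptions chosen for illustration, while the last matches the 80 one-bit quantizers mentioned above:

```python
import math

frame_rate_hz = 75  # assumed embedding rate, chosen so 80 bits/frame lands near 6 kbps

# (Nq, N): number of residual quantizers and per-quantizer codebook size.
for n_q, n in [(8, 1024), (16, 32), (80, 2)]:
    bits_per_frame = n_q * math.log2(n)               # Nq * log2(N)
    bitrate_kbps = bits_per_frame * frame_rate_hz / 1000
    print(f"Nq={n_q:>2}, N={n:>4}: {bits_per_frame:.0f} bits/frame -> {bitrate_kbps:.1f} kbps")
# All three combinations print 80 bits/frame -> 6.0 kbps.
```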