
GPT-SoVITS-v3v4-features (New Features)

RVC-Boss edited this page Apr 20, 2025 · 5 revisions

1-v1/v2/v3/v4情况对比 (v3/v4 compared with v1/v2)

| | 语种支持(可互相跨语种合成) | GPT训练集时长 | SoVITS训练集时长 | 推理速度 | 参数量 | 功能 |
|---|---|---|---|---|---|---|
| v1 | 中日英 | 约2k小时 | 约2k小时 | baseline | 90M+77M | baseline |
| v2 | 中日英韩粤 | 约2.5k小时 | vq encoder约2k小时(v1冻结),一共5k小时 | 翻倍 | 90M+77M | 新增语速调节、无参考文本模式、更好的混合语种切分 |
| v3 | 中日英韩粤 | 约7k小时 | vq encoder约2k小时(v1冻结),一共7k小时 | 约等于v2 | 330M+77M | 大幅增加zero shot相似度;情绪表达、微调性能提升 |
| v4 | 同上 | 同上 | 同上 | 同上 | 同上 | 修复了v3非整数倍上采样可能导致的电音问题,原生输出48k音频防闷 |
| | Language Support (cross-language synthesis) | GPT Training Data | SoVITS Training Data | Inference Speed | Parameters | Features |
|---|---|---|---|---|---|---|
| v1 | Chinese, Japanese, English | ~2k hours | ~2k hours | baseline | 90M+77M | baseline |
| v2 | Chinese, Japanese, English, Korean, Cantonese | ~2.5k hours | vq encoder ~2k hours (frozen from v1), 5k hours in total | 2x v1 | 90M+77M | Added speed control, reference-free mode, better mixed-language slicing |
| v3 | Chinese, Japanese, English, Korean, Cantonese | ~7k hours | vq encoder ~2k hours (frozen from v1), 7k hours in total | ~same as v2 | 330M+77M | Significantly improved zero-shot similarity; better emotional expression and fine-tuning performance |
| v4 | same as v3 | same as v3 | same as v3 | same as v3 | same as v3 | Fixes the metallic-artifact issue caused by v3's non-integer upsampling; natively outputs 48k audio to avoid muffled sound |

v3v4比v2 (v3/v4 compared with v2)

(1)音色相似度更像,需要更少训练集来逼近本人(不训练直接使用底模的模式下音色相似性提升更大)

The timbre similarity is higher, requiring less training data to approximate the target speaker (the gain in timbre similarity is even larger when using the base model directly without fine-tuning).

(2)GPT合成更稳定,重复漏字(根据测试集实验指标)更少,也更容易跑出丰富情感

The GPT model is more stable, with fewer repetitions and omissions (per test-set metrics), and it more readily produces speech with rich emotional expression.

(3)比v2更忠实于参考音频。微调场景下,v2比v3更受训练集整体平均影响,然后带一些参考音频的引导。

Compared to v2, v3 is more faithful to the reference audio. In fine-tuning scenarios, v2 is more influenced by the overall average of the training set, with some guidance from the reference audio.

如果你的训练集质量比较糟糕,也许“更受训练集整体平均影响”的v2vits版本更适合你。

If your training dataset is of poor quality, the v2 (VITS) version, which is "more influenced by the overall average of the training dataset," might be more suitable for you.

(4)v4修复了v3非整数倍上采样可能导致的电音问题,原生输出48k音频防闷(而v3原生输出只有24k)。作者认为v4是v3的平替,更多还需测试。

Version 4 fixes the issue of metallic artifacts in Version 3 caused by non-integer multiple upsampling, and natively outputs 48k audio to prevent muffled sound (whereas Version 3 only natively outputs 24k audio). The author considers Version 4 a direct replacement for Version 3, though further testing is still needed.

2-SeedTTS ZeroShot TTS eval testset CN

| | WER | SIM |
|---|---|---|
| v1 | 0.025 | 0.526 |
| v2 | 0.017 | 0.549 |
| v3 (8 steps) | 0.014 | 0.702 |
| v4 (8 steps) | 0.013 | 0.735 |
| GT | 0.013 | 0.750 |

这是啥:字节豆包团队发的SeedTTS论文工作给的中文测试集benchmark。GT指测试集原始说话人的真实说话语音。

What is this?

This is the Chinese test-set benchmark from the SeedTTS paper released by the ByteDance Seed team. GT refers to the real speech of the original speakers in the test set.

我咋测的:用的https://github.com/BytedanceSpeech/seed-tts-eval 这里面官方给的相似度模型和ASR模型跑的

How did I test it?

I used the official similarity model and ASR model provided in https://github.com/BytedanceSpeech/seed-tts-eval to run the evaluation.

这个benchmark有啥用:测合成发音和目标发音的匹配度(WER),和音色相似度(SIM)。测不了自然性、情感丰富性和音质,前2个要人工打分。同时测试集的音色也有局限性。分数仅供参考,如果需要准确的结论请你用自己的实际场景训练集去微调和测试集去测试。论文里的分数、业界的情报和媒体宣传都是浮云,请永远相信,只有你自己的测试结论才是最真实的。

What’s the purpose of this benchmark?

It measures how well the synthesized pronunciation matches the target text (WER) and the timbre similarity between synthesized and reference speech (SIM). It cannot evaluate naturalness, emotional richness, or audio quality; the first two require human scoring. The timbre coverage of the test set is also limited. These scores are for reference only: if you need reliable conclusions, fine-tune on training data from your own real-world scenario and evaluate on your own test set. Scores in papers, industry intel, and media hype are all just noise; always trust your own test results as the only reliable truth.
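As a rough illustration of the WER metric: it is the edit distance between the ASR transcript of the synthesized audio and the ground-truth text, divided by the reference length. The sketch below shows only this arithmetic; the actual evaluation uses the ASR and speaker-similarity models from seed-tts-eval.

```python
# Illustrative WER (edit-distance) computation, not the official eval pipeline.
def wer(ref: list, hyp: list) -> float:
    """Word error rate: (substitutions + deletions + insertions) / len(ref)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# For Chinese, the rate is typically computed per character (i.e. CER):
print(wer(list("今天天气不错"), list("今天天汽不错")))  # 1 substitution / 6 chars
```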

GT指标和"WER"这个英文都是从SeedTTS论文https://arxiv.org/pdf/2406.02430v1 的table10里抄的。

The "GT" metric and the term "WER" are taken from Table 10 in the SeedTTS paper: https://arxiv.org/pdf/2406.02430v1.

不同版本gpt-sovits合成的SeedTTS中文测试集结果放在百度网盘:

The SeedTTS Chinese test-set results synthesized by different versions of GPT-SoVITS have been uploaded to Baidu Netdisk:

https://pan.baidu.com/s/1Fd5xjVzVa2LhI-b-FSxo8w?pwd=yp6g 提取码: yp6g

3-技术变更点 (Technical Updates)

(1)训练集增加至7k小时 (MOS分音质过滤、标点停顿校验)

The training dataset has been expanded to 7,000 hours (with MOS-based audio quality filtering and punctuation pause verification).

只使用7k小时优选训练集,更大的想象空间留给各位看官们发挥~

Only a curated 7k-hour training set was used; the larger possibilities are left for readers to explore.
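A minimal sketch of what MOS-based quality filtering might look like. The threshold, record layout, and MOS predictor are all hypothetical; the project's actual data pipeline is not public.

```python
# Hypothetical sketch of MOS-based quality filtering: keep only clips whose
# predicted MOS (e.g. from a quality-prediction model) clears a threshold.
MOS_THRESHOLD = 3.5  # hypothetical cutoff on the 1-5 MOS scale

clips = [
    {"path": "a.wav", "mos": 4.2},
    {"path": "b.wav", "mos": 2.9},  # dropped: below threshold
    {"path": "c.wav", "mos": 3.8},
]

kept = [c for c in clips if c["mos"] >= MOS_THRESHOLD]
print([c["path"] for c in kept])  # ['a.wav', 'c.wav']
```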

(2)s2结构变更为:shortcut Conditional Flow Matching Diffusion Transformers (shortcut-CFM-DiT)

The S2 architecture has been modified to shortcut-CFM-DiT.

由于s2占整体延时比例太低,s2变复杂对于整体耗时影响不大。

Since the proportion of S2 in the overall latency is minimal, increasing the complexity of S2 has little impact on the total processing time.

音质最佳:采样步数32

Best Audio Quality: Sampling steps set to 32.

速度快:4/8步 (zero shot这档配置没啥瑕疵,少量样本微调可能需要提升步数)

Faster: 4/8 steps (the zero-shot configuration shows no noticeable flaws at this setting; fine-tuning with few samples may require more steps).
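The step count trades speed for quality because a flow-matching sampler integrates an ODE dx/dt = v(x, t) with that many Euler steps. A toy sketch under illustrative assumptions (the real model predicts the velocity with the DiT; here a simple closed-form field stands in):

```python
# Toy illustration of the steps/quality tradeoff in an Euler ODE sampler.
# The velocity field below is illustrative, not the model's.
def euler_sample(v_field, x, steps):
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * v_field(x, i * dt)  # one Euler step
    return x

v = lambda x, t: 2.0 * t            # toy field; exact solution x(1) = x(0) + 1
print(euler_sample(v, 0.0, 8))      # 0.875   (8 steps: coarse)
print(euler_sample(v, 0.0, 32))     # 0.96875 (32 steps: closer to the exact 1.0)
```

More steps shrink the integration error, which is one way to see why 32 steps gives the best audio quality while 4/8 steps are faster but coarser.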

s2原理的变更(基于参考音频扩散补全)导致音色相似度大幅提升。

The change in S2's underlying approach (diffusion-based completion conditioned on the reference audio) yields a large improvement in timbre similarity.
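Conditional flow matching trains the network to predict the velocity along a straight path from noise to the target features, conditioned on the reference audio. A minimal sketch of the training target only (illustrative; the actual model, conditioning, and shortcut variant are more involved):

```python
# Minimal sketch of the conditional-flow-matching training target: a point on
# the straight path from noise x0 to data x1 at time t, and the velocity the
# network is trained to predict there (x1 - x0). Illustrative, not the model.
def cfm_target(x0, x1, t):
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]  # interpolated point
    v_target = [b - a for a, b in zip(x0, x1)]           # constant velocity
    return x_t, v_target

noise = [0.0, 0.0]
mel = [1.0, -1.0]            # stand-in for target mel-spectrogram values
x_t, v = cfm_target(noise, mel, t=0.5)
print(x_t, v)                # [0.5, -0.5] [1.0, -1.0]
```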

由于没用端到端合成,v3使用了开源的24k的BigVGANv2参数从mel谱得到波形,

Since synthesis is not end-to-end, v3 uses the open-source 24k BigVGANv2 weights to generate waveforms from mel-spectrograms.

但是代价是,如果想用开源的vocoder,只能遵从他们的hop参数,导致适配SSL的hop必须进行非整数倍上采样,造成了低采样步数+小样本(指100h以下)大量微调情况下可能出现电音。因此v4是作者自己训练的一版声码器,同时顺手将输出采样率从v3的24k提升到48k,不再需要后置超分网络防闷。

The cost, however, is that using an open-source vocoder requires adhering to its fixed hop size, forcing non-integer upsampling when adapting SSL's hop size. This can introduce metallic artifacts when fine-tuning extensively with small datasets (under 100 hours) and low sampling steps.

To resolve this, v4 uses a vocoder trained by the author, which also raises the output sample rate from v3's 24k to 48k, removing the need for a post-hoc super-resolution network to avoid muffled sound.

(3)s1结构不变,更新了一版参数

The S1 architecture remains unchanged, with the parameters updated.