
v2.5.0: Universal multilingual support, new decoder backbone and RoPE in attention encoders

@yqzhishen released this 01 Apr 14:43 · 2 commits to main since this release

Universal multilingual support (#238)

The dictionary and phoneme system has been fully refactored. The repository now supports defining multiple dictionaries (one per language) and merging phonemes that are shared across languages into a single unit. This comes with a breaking change in the dataset configuration:

Old:
dictionary: dictionaries/opencpop-extension.txt
raw_data_dir:
  - data/xxx1/raw
  - data/xxx2/raw
speakers:
  - speaker1
  - speaker2
spk_ids: [0, 1]
test_prefixes:
  - '0:wav1'
  - '0:wav2'
  - '1:wav1'
  - '1:wav2'

New:
dictionaries:  # multiple languages and dictionaries
  zh: dictionaries/opencpop-extension.txt
  ja: dictionaries/japanese_dict_full.txt
  en: dictionaries/ds_cmudict-07b.txt
extra_phonemes: []
merged_phoneme_groups:
  - [zh/i, ja/i, en/iy]
  - [zh/s, ja/s, en/s]
datasets:  # define all raw datasets
  - raw_data_dir: data/xxx1/raw  # equivalent to former raw_data_dir
    speaker: speaker1  # equivalent to former speakers
    spk_id: 0
    language: zh
    test_prefixes:  # similar to former test_prefixes
      - wav1
      - wav2
  - raw_data_dir: data/xxx2/raw
    speaker: speaker2
    spk_id: 1
    language: ja
    test_prefixes:
      - wav1
      - wav2

Read the documentation for a more detailed explanation.

New decoder backbone: LYNXNet (#200, #218, #225, #228)

The new backbone shows better performance on acoustic models. The way to define the model backbone has also changed:

Old:
backbone_type: 'wavenet'
residual_layers: 20
residual_channels: 512
dilation_cycle_length: 4

New:
# LYNXNet (default)
backbone_type: 'lynxnet'
backbone_args:
  num_channels: 1024
  num_layers: 6
  kernel_size: 31
  dropout_rate: 0.0
  strong_cond: true

# WaveNet
backbone_type: 'wavenet'
backbone_args:
  num_channels: 512
  num_layers: 20
  dilation_cycle_length: 4

RoPE in attention encoder (#234)

Rotary Position Embedding (RoPE) is now implemented in the FastSpeech2 attention encoders to improve their quality and reduce parameter count.

# encoder with RoPE
enc_ffn_kernel_size: 3
use_rope: true

# encoder without RoPE
enc_ffn_kernel_size: 9
use_rope: false
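
For readers unfamiliar with the technique, here is a minimal sketch of what rotary position embedding does to the query/key projections inside an attention layer (a generic illustration with assumed tensor shapes and names, not the code used in this repository):

import torch

def apply_rope(x, base=10000.0):
    # x: (batch, seq_len, dim), dim assumed even.
    _, seq_len, dim = x.shape
    half = dim // 2
    # One rotation frequency per feature pair, scaled by position.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotating each (x1, x2) pair encodes relative positions directly in the
    # q·k dot products, with no learned position table, which is where the
    # parameter saving comes from.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Typically applied to queries and keys right before scaled dot-product attention:
# q, k = apply_rope(q), apply_rope(k)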

Other improvements, changes and bug fixes

  • Support MiniNSF and noise injection in the NSF-HiFiGAN vocoder
  • Improve inference speed of the old NSF module
  • A missing note_glide is now treated as none instead of raising an error
  • Add R^2 score metrics for variance parameters on TensorBoard (a brief sketch of the metric follows this list)
  • Bugfix: unexpectedly high CPU load during preprocessing
  • Bugfix: f0_min and f0_max have no effect on the parselmouth pitch extractor
  • Bugfix: configurations are not passed correctly to the pitch predictor
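
For reference, R^2 here is the standard coefficient of determination; a minimal sketch of how such a score is computed (illustrative only, not the repository's metric code):

import torch

def r2_score(pred, target):
    # Coefficient of determination: 1 - residual variance / total variance.
    # 1.0 is a perfect fit; 0.0 is no better than predicting the mean.
    ss_res = ((target - pred) ** 2).sum()
    ss_tot = ((target - target.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot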

Some changes may not be listed above. See full change log: v2.4.0...v2.5.0