Sound demos for "A Neural Vocoder with Hierarchical Generation of Amplitude and Phase Spectra for Statistical Parametric Speech Synthesis"

Paper: arXiv

Most experiments were conducted on the female speaker slt and the male speaker bdl from the CMU-ARCTIC databases. Train/Valid/Test: 1000/66/66 utterances for each speaker.

Some experiments were conducted on a Chinese female corpus, using a smaller training set of 1.9 hours (CN-S) and a larger training set of 19.5 hours (CN-L). Train/Valid/Test: 800/100/100 utterances for CN-S and 13134/100/100 utterances for CN-L. The validation and test sets are the same for CN-S and CN-L.

 

 

Experiment I: Comparison among HiNet Vocoder and Existing Vocoders

Comparison among speech generated by STRAIGHT, WaveNet, WaveRNN and HiNet

A. Natural acoustic features as input for speaker slt.

Natural STRAIGHT WaveNet WaveRNN HiNet
Example 1
Example 2
Example 3
Example 4
Example 5

B. Natural acoustic features as input for speaker bdl.

Natural STRAIGHT WaveNet WaveRNN HiNet
Example 1
Example 2
Example 3
Example 4
Example 5

C. Predicted acoustic features as input for speaker slt.

STRAIGHT WaveNet WaveRNN HiNet
Example 1
Example 2
Example 3
Example 4
Example 5

D. Predicted acoustic features as input for speaker bdl.

STRAIGHT WaveNet WaveRNN HiNet
Example 1
Example 2
Example 3
Example 4
Example 5



Experiment II: Comparison between HiNet and NSF Vocoders

Comparison among speech generated by HiNet, NSF, HiNet-S and HiNet-S-GAN

A. Natural acoustic features as input for speaker slt.

Natural HiNet NSF HiNet-S HiNet-S-GAN
Example 1
Example 2
Example 3
Example 4
Example 5

B. Natural acoustic features as input for speaker bdl.

Natural HiNet NSF HiNet-S HiNet-S-GAN
Example 1
Example 2
Example 3
Example 4
Example 5

 

Experiment III: Experiments in Discussions

1. Impact of the amount of training data on the HiNet vocoder

Comparison between speech generated by HiNet-S-GAN for CN-S and CN-L

Natural acoustic features as input.

Natural HiNet-S-GAN (CN-S) HiNet-S-GAN (CN-L)
Example 1
Example 2
Example 3
Example 4
Example 5

 

2. Comparison between the GAN-based ASP and the conventional ASP

Comparison between speech generated by HiNet-S-GAN and STR-ASP+PSP-S-GAN for slt, CN-S and CN-L

A. Natural acoustic features as input for speaker slt.

Natural HiNet-S-GAN STR-ASP+PSP-S-GAN
Example 1
Example 2
Example 3
Example 4
Example 5

B. Natural acoustic features as input for CN-S.

Natural HiNet-S-GAN STR-ASP+PSP-S-GAN
Example 1
Example 2
Example 3
Example 4
Example 5

C. Natural acoustic features as input for CN-L.

Natural HiNet-S-GAN STR-ASP+PSP-S-GAN
Example 1
Example 2
Example 3
Example 4
Example 5

 

3. Effects of GMN

Comparison between speech generated by HiNet-S and HiNet-S-woGMN

Natural acoustic features as input for speaker slt.

Natural HiNet-S HiNet-S-woGMN
Example 1
Example 2
Example 3
Example 4
Example 5

 

4. Effects of pre-calculated initial phase

Comparison between speech generated by HiNet-woPCIP and HiNet

Natural acoustic features as input for speaker slt.

Natural HiNet-woPCIP HiNet
Example 1
Example 2
Example 3
Example 4
Example 5

 

5. Effects of components in loss functions

Comparison among speech generated by HiNet-L1, HiNet-L2, HiNet-L3 and HiNet

Natural acoustic features as input for speaker slt.

Natural HiNet-L1 HiNet-L2 HiNet-L3 HiNet
Example 1
Example 2
Example 3
Example 4
Example 5