Sound demos for "A Neural Vocoder with Hierarchical Generation of Amplitude and Phase Spectra for Statistical Parametric Speech Synthesis"

Paper: arXiv

Most experiments were conducted on the female speaker slt and the male speaker bdl from the CMU-ARCTIC database. Train/Valid/Test: 1000/66/66 utterances for each speaker.

Some experiments were conducted on a Chinese female corpus with 1.9 hours of training utterances (CN-S) and 19.5 hours of training utterances (CN-L). Train/Valid/Test: 800/100/100 for CN-S and 13134/100/100 for CN-L. The validation and test sets for CN-S and CN-L are the same.

Experiment I: Comparison among HiNet Vocoder and Existing Vocoders

Comparison among speech generated by STRAIGHT, WaveNet, WaveRNN and HiNet

A. Natural acoustic features as input for speaker slt.

Natural	STRAIGHT	WaveNet	WaveRNN	HiNet
Example 1

Example 2

Example 3

Example 4

Example 5

B. Natural acoustic features as input for speaker bdl.

Natural	STRAIGHT	WaveNet	WaveRNN	HiNet
Example 1

Example 2

Example 3

Example 4

Example 5

C. Predicted acoustic features as input for speaker slt.

STRAIGHT	WaveNet	WaveRNN	HiNet
Example 1

Example 2

Example 3

Example 4

Example 5

D. Predicted acoustic features as input for speaker bdl.

STRAIGHT	WaveNet	WaveRNN	HiNet
Example 1

Example 2

Example 3

Example 4

Example 5

Experiment II: Comparison between HiNet and NSF Vocoders

Comparison among speech generated by HiNet, NSF, HiNet-S and HiNet-S-GAN

A. Natural acoustic features as input for speaker slt.

Natural	HiNet	NSF	HiNet-S	HiNet-S-GAN
Example 1

Example 2

Example 3

Example 4

Example 5

B. Natural acoustic features as input for speaker bdl.

Natural	HiNet	NSF	HiNet-S	HiNet-S-GAN
Example 1

Example 2

Example 3

Example 4

Example 5

Experiment III: Experiments in Discussions

1. Impact of the amount of training data on the HiNet vocoder

Comparison between speech generated by HiNet-S-GAN for CN-S and CN-L

Natural acoustic features as input.

Natural	HiNet-S-GAN (CN-S)	HiNet-S-GAN (CN-L)
Example 1

Example 2

Example 3

Example 4

Example 5

2. Comparison between the GAN-based ASP and the conventional one

Comparison between speech generated by HiNet-S-GAN and STR-ASP+PSP-S-GAN for slt, CN-S and CN-L

A. Natural acoustic features as input for speaker slt.

Natural	HiNet-S-GAN	STR-ASP+PSP-S-GAN
Example 1

Example 2

Example 3

Example 4

Example 5

B. Natural acoustic features as input for CN-S.

Natural	HiNet-S-GAN	STR-ASP+PSP-S-GAN
Example 1

Example 2

Example 3

Example 4

Example 5

C. Natural acoustic features as input for CN-L.

Natural	HiNet-S-GAN	STR-ASP+PSP-S-GAN
Example 1

Example 2

Example 3

Example 4

Example 5

3. Effects of GMN

Comparison between speech generated by HiNet-S and HiNet-S-woGMN

Natural acoustic features as input for speaker slt.

Natural	HiNet-S	HiNet-S-woGMN
Example 1

Example 2

Example 3

Example 4

Example 5

4. Effects of pre-calculated initial phase

Comparison between speech generated by HiNet-woPCIP and HiNet

Natural acoustic features as input for speaker slt.

Natural	HiNet-woPCIP	HiNet
Example 1

Example 2

Example 3

Example 4

Example 5

5. Effects of components in loss functions

Comparison among speech generated by HiNet-L1, HiNet-L2, HiNet-L3 and HiNet

Natural acoustic features as input for speaker slt.

Natural	HiNet-L1	HiNet-L2	HiNet-L3	HiNet
Example 1

Example 2

Example 3

Example 4

Example 5