LEARNING LATENT REPRESENTATIONS FOR STYLE CONTROL AND TRANSFER IN END-TO-END SPEECH SYNTHESIS

Authors: Ya-Jie Zhang, Shifeng Pan, Lei He, Zhen-Hua Ling

Abstract: In this paper, we introduce the Variational Autoencoder (VAE) into an end-to-end speech synthesis model to learn latent representations of speaking styles in an unsupervised manner. The style representation learned through the VAE shows good properties such as disentangling, scaling, and combination, which make style control easy. Style transfer can be achieved in this framework by first inferring the style representation through the recognition network of the VAE, then feeding it into the TTS network to guide the style of the synthesized speech. To avoid Kullback-Leibler (KL) divergence collapse in training, several techniques are adopted. Finally, the proposed model performs well on style control and outperforms the Global Style Token (GST) model in ABX preference tests on style transfer.
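The abstract does not say which techniques are used against KL collapse. As a minimal sketch of one common option, the snippet below computes the KL term between a diagonal Gaussian posterior and a standard normal prior and scales it with a linear annealing weight. The schedule, the `warmup_steps` value, and all function names are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, I) ), summed over the latent dimensions."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=-1)

def kl_weight(step, warmup_steps=10000):
    """Hypothetical linear annealing: 0 at step 0, rising to 1 at warmup_steps."""
    return min(1.0, step / warmup_steps)

def vae_loss(recon_loss, mu, logvar, step):
    """Reconstruction loss plus the annealed, batch-averaged KL term."""
    return recon_loss + kl_weight(step) * np.mean(gaussian_kl(mu, logvar))
```

Keeping the KL weight small early in training lets the decoder learn to use the latent variable before the prior-matching term starts pulling the posterior toward the prior.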

Paper accepted by IEEE ICASSP 2019

Paper link:

Speech Demo

Style control

1. Interpolation

1. " He had many years ago received such a description. "
reference audio | generated audio
reference A
interpolated z
interpolated z
reference B
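The interpolation demo can be read as moving along a straight line between the latent vectors inferred from reference A and reference B. The sketch below assumes plain linear interpolation with evenly spaced weights; the function name and the toy two-dimensional vectors are hypothetical, and the actual latent vectors would come from the recognition network.

```python
import numpy as np

def interpolate_latents(z_a, z_b, num_points=4):
    """Return num_points latent vectors evenly spaced from z_a to z_b, inclusive."""
    z_a = np.asarray(z_a, dtype=float)
    z_b = np.asarray(z_b, dtype=float)
    alphas = np.linspace(0.0, 1.0, num_points)
    return np.stack([(1.0 - a) * z_a + a * z_b for a in alphas])

# Toy stand-ins for the latents of reference A and reference B.
z_a = np.array([1.0, -2.0])
z_b = np.array([4.0, 1.0])
path = interpolate_latents(z_a, z_b, num_points=4)
```

With `num_points=4`, the first and last rows match the two references and the two middle rows play the role of the interpolated z entries listed above.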

2. Disentangled factors

1. " I suppose I may as well stay the night , "
pitch height
local pitch variation
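One way to probe disentangled factors is to sweep a single latent dimension while holding the others fixed, then listen for which attribute changes. The sketch below shows only the sweep. Which dimension controls pitch height or local pitch variation is not stated on this page, so the index and values used here are purely illustrative.

```python
import numpy as np

def traverse_dimension(z, dim, values):
    """Copy z once per value, overwriting only coordinate `dim` in each copy."""
    z = np.asarray(z, dtype=float)
    out = np.tile(z, (len(values), 1))
    out[:, dim] = values
    return out

# Hypothetical example: sweep dimension 1 of a 3-dimensional latent.
sweep = traverse_dimension(np.zeros(3), dim=1, values=[-2.0, 0.0, 2.0])
```

Each row of `sweep` would be fed to the synthesizer with the same input text, so any audible difference is attributable to the swept dimension.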

3. Combination

1. " I suppose I may as well stay the night , "
z=A | z=B | z=A+B
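The z=A+B setting suggests that two latent settings are combined by element-wise addition. The sketch below assumes exactly that, with optional scale factors to reflect the scaling property mentioned in the abstract. The vectors, the dimensions they occupy, and the function name are hypothetical.

```python
import numpy as np

def combine_latents(z_a, z_b, scale_a=1.0, scale_b=1.0):
    """Element-wise sum of two latent vectors, each with an optional scale."""
    return scale_a * np.asarray(z_a, dtype=float) + scale_b * np.asarray(z_b, dtype=float)

# Toy latents: A changes only dimension 0, B changes only dimension 2.
z_a = np.array([1.5, 0.0, 0.0])
z_b = np.array([0.0, 0.0, -1.0])
z_ab = combine_latents(z_a, z_b)
```

If the two settings act on different disentangled dimensions, the sum carries both changes at once, which is what the third sample in the row would demonstrate.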

Style transfer

1. Parallel transfer

1. " Didn't you know this was the twenty-eighth of August ? "
reference audio | GST baseline model | proposed model
2. She was at a loss for an answer .
reference audio | GST baseline model | proposed model
3. " I s'pose it's because I am Ojo the Unlucky that everyone who tries to help me gets into trouble . "
reference audio | GST baseline model | proposed model
4. " When thou hast shown me a little love , thou mockest me ! "
reference audio | GST baseline model | proposed model
5. said the Queen of the Corn-market , in an indifferently grateful tone .
reference audio | GST baseline model | proposed model

2. Non-parallel transfer

1. reference text : " I love you . Good-by - because I love you . "
target text : " She could not compose herself - Mr. Woodhouse would be alarmed - she had better go ; "
reference audio | GST baseline model | proposed model
2. reference text : " Oh , what lovely trees ! "
target text : said the Queen of the Corn-market , in an indifferently grateful tone .
reference audio | GST baseline model | proposed model
3. reference text : " Find some one who is real wicked , and stay with him till he repents . In that way you can do some good in the world . "
target text : He called again : the valleys and farthest hills resounded as when the sailors invoked the lost Hylas on the Mysian shore ; but no sheep .
reference audio | GST baseline model | proposed model
4. reference text : " Whenever you say the word I'm ready to thrash any amount of reason into him that he's able to hold . "
target text : Oh , certainly not ; Lucy would stop with her cousin . Oh , no !
reference audio | GST baseline model | proposed model
5. reference text : " is the only place in Oz where a yellow butterfly can be found . "
target text : His shoulders were broad and strong , his hands were very strong .
reference audio | GST baseline model | proposed model
6. reference text : " Thy mother is yonder woman with the scarlet letter , "
target text : Now it was finished - that is to say the design ; she must stitch it together .
reference audio | GST baseline model | proposed model
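The transfer flow described in the abstract (infer a style representation from the reference audio with the recognition network, then feed it into the TTS network) can be sketched with toy stand-ins. Everything below is an assumption made for illustration: the layer sizes, the mean-pooling recognition network, the random weights in place of trained ones, the use of the posterior mean at inference, and the choice to add the projected style vector to the text encoder states. None of it is confirmed by this page.

```python
import numpy as np

rng = np.random.default_rng(0)
N_MELS, Z_DIM, ENC_DIM = 80, 32, 256  # hypothetical sizes

# Random matrices standing in for trained weights.
W_mu = rng.normal(scale=0.1, size=(N_MELS, Z_DIM))
W_logvar = rng.normal(scale=0.1, size=(N_MELS, Z_DIM))
W_style = rng.normal(scale=0.1, size=(Z_DIM, ENC_DIM))

def recognition_network(mel):
    """Toy recognition network: mean-pool frames, then project to (mu, logvar)."""
    pooled = mel.mean(axis=0)
    return pooled @ W_mu, pooled @ W_logvar

def condition_text_states(text_states, z):
    """Add the projected style vector to every text encoder state (an assumption)."""
    return text_states + z @ W_style  # broadcasts over the token axis

reference_mel = rng.normal(size=(120, N_MELS))  # reference audio: frames x mel bins
text_states = rng.normal(size=(40, ENC_DIM))    # target text: tokens x encoder dim

mu, logvar = recognition_network(reference_mel)
z = mu  # use the posterior mean at inference instead of sampling
conditioned = condition_text_states(text_states, z)
```

Because the style vector depends only on the reference audio, the same sketch covers both cases above: in parallel transfer the target text matches the reference text, and in non-parallel transfer it does not.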