PhonemeVec

Email: wsmzzz@mail.ustc.edu.cn, lipchen@ustc.edu.cn, yjhu@iflytek.com, zhling@ustc.edu.cn

Abstract

Fine-grained prosody representations have recently attracted growing attention as a way to address the one-to-many problem in text-to-speech synthesis (TTS). This paper investigates phoneme-level prosody representations that incorporate contextual information. The prosody representation is extracted from the low-band Mel-spectrum. A data2vec structure serves as the core of the prosody model, performing self-supervised learning and exploiting contextual prosody information. The embedding generated by the data2vec module, referred to as PhonemeVec, is a phoneme-level representation that captures contextual prosody attributes. PhonemeVec is then integrated into FastSpeech2, where it supervises the prosody modeling of the text encoder. Experiments on the Blizzard Challenge 2019 dataset show that integrating PhonemeVec yields more natural synthesized speech. Additionally, objective evaluations confirm that PhonemeVec reduces the distortion between the generated speech and the original recordings in terms of duration and F0.
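As a rough illustration of the input side described above, the sketch below keeps only the lowest Mel channels of each frame and groups the frames into per-phoneme segments. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the number of low-band bins, and the use of per-phoneme frame counts (for example from a forced aligner) are all hypothetical and are not specified in the abstract.

```python
from typing import List

Frame = List[float]


def low_band(mel: List[Frame], n_low_bins: int) -> List[Frame]:
    """Keep only the lowest n_low_bins Mel channels of every frame.

    n_low_bins is a hypothetical parameter; the abstract does not state
    how many channels make up the low band.
    """
    return [frame[:n_low_bins] for frame in mel]


def split_by_phoneme(frames: List[Frame], durations: List[int]) -> List[List[Frame]]:
    """Group frames into per-phoneme segments.

    durations[i] is the number of frames assigned to phoneme i
    (assumed to come from an external alignment).
    """
    if sum(durations) != len(frames):
        raise ValueError("durations must sum to the number of frames")
    segments, start = [], 0
    for d in durations:
        segments.append(frames[start:start + d])
        start += d
    return segments


# Toy input: 5 frames x 4 Mel bins, two phonemes lasting 2 and 3 frames.
mel = [[float(t * 10 + b) for b in range(4)] for t in range(5)]
durations = [2, 3]
segments = split_by_phoneme(low_band(mel, n_low_bins=2), durations)
```

Each entry of `segments` would then be the low-band input for one phoneme; how these segments are encoded and passed to the data2vec module is beyond what the abstract specifies.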

Comparison of PV-FS and other TTS models on the BC2019 dataset

In-domain demo