Audio samples from "PATNET: A PHONEME-LEVEL AUTOREGRESSIVE TRANSFORMER NETWORK FOR SPEECH SYNTHESIS"

Shiming Wang, Liping Chen, Yajun Hu, Zhenhua Ling*

Email: wsmzzz@mail.ustc.edu.cn, lipchen@ustc.edu.cn, yjhu@iflytek.com, zhling@ustc.edu.cn

Abstract

Fine-grained prosody representations have recently attracted growing attention as a way to address the one-to-many problem in text-to-speech (TTS) synthesis. This paper investigates the representation of phoneme-level prosody attributes by incorporating contextual information. The prosody representation is derived from the low-frequency band of the Mel-spectrum. A data2vec structure serves as the core of the prosody model: it is trained with self-supervised learning and exploits contextual prosody information. The embedding generated by the data2vec module, referred to as PhonemeVec, is a phoneme-level representation that captures contextual prosody attributes. PhonemeVec is then integrated into FastSpeech2, where it supervises the prosody modeling of the text encoder. Experiments on the Blizzard Challenge 2019 dataset show that integrating PhonemeVec yields more natural synthesized speech. In addition, objective evaluations confirm that PhonemeVec reduces the distortion between generated speech and the original recordings in terms of duration and F0.
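As a rough illustration of what an objective comparison of duration and F0 can look like, here is a minimal sketch. It is not the evaluation code behind the results above. The metric choice (root-mean-square error), the function names `duration_rmse` and `f0_rmse`, and the convention that unvoiced frames are marked with F0 = 0 are all assumptions made for this example. The F0 contours are also assumed to be time-aligned already.

```python
import math


def duration_rmse(pred_durations, ref_durations):
    """RMSE between per-phoneme durations (same unit for both, e.g. frames).

    Hypothetical helper: assumes both sequences describe the same phonemes
    in the same order.
    """
    if len(pred_durations) != len(ref_durations):
        raise ValueError("duration sequences must align phoneme by phoneme")
    if not pred_durations:
        raise ValueError("duration sequences must not be empty")
    squared = [(p - r) ** 2 for p, r in zip(pred_durations, ref_durations)]
    return math.sqrt(sum(squared) / len(squared))


def f0_rmse(pred_f0, ref_f0):
    """RMSE in Hz over frames that are voiced in both contours.

    Hypothetical helper: assumes the contours are already time-aligned and
    that unvoiced frames are stored as 0. Extra trailing frames in the
    longer contour are ignored.
    """
    voiced = [(p, r) for p, r in zip(pred_f0, ref_f0) if p > 0 and r > 0]
    if not voiced:
        raise ValueError("no frames are voiced in both contours")
    return math.sqrt(sum((p - r) ** 2 for p, r in voiced) / len(voiced))
```

Lower values on both measures mean the synthesized prosody is closer to the recording. In practice the two contours usually need an alignment step first (for example, synthesis with ground-truth durations or dynamic time warping), which this sketch leaves out.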


Comparison of PV-FS with other TTS models on the BC2019 dataset

In-domain demo

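PV-FS | PL-FS | PnG-FS | GT-Mel
1: 频繁的休市就是股市里面,投机和赌博心理的刹车片嘛。 (Frequent market closures are the brake pads on the speculative and gambling mindset in the stock market.)
2: 去年的这个时候,我们上线了一个产品,叫“每天听本书”。 (This time last year, we launched a product called "Listen to a Book Every Day".)
3: 对,思考的结果也许夹杂了更多的偏见独断蛮不讲理呀。 (Right, the results of thinking may well be mixed with even more prejudice, dogmatism, and unreasonableness.)
4: 这下你就理解啦,为什么我总是冷眼旁观那些,骂骂咧咧的评论家,不管他们的文章是不是好看。 (Now you understand why I always look on coldly at those cursing critics, whether or not their articles make good reading.)
5: 结果当然是没人来了。 (The result, of course, was that nobody came.)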

Out-of-domain demo

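PV-FS | PL-FS | PnG-FS
1: 消防员为了大家赴汤蹈火,自己能够这样回报,虽然说这样一点车费可能真的不算什么。 (Firefighters go through fire and water for everyone, and this is a way I can repay them, even though a small fare like this may really not amount to much.)
2: 山西省万荣县的物理老师王振,前几天他挂着吊瓶还坚持同学们上课。 (Wang Zhen, a physics teacher in Wanrong County, Shanxi Province, insisted on teaching his students a few days ago even while on an IV drip.)
3: 督导当地对标两不愁三保障脱贫标准,找问题补短板,努力如期实现脱贫任务。 (Supervise and guide local authorities to benchmark against the "two no-worries and three guarantees" poverty-alleviation standard, identify problems and shore up weak points, and strive to complete the poverty-alleviation task on schedule.)
4: 乌某和朱某,分别因为涉嫌危险驾驶罪和伪证罪,被移送人民检察院审查。 (Wu and Zhu were transferred to the People's Procuratorate for review on suspicion of dangerous driving and perjury, respectively.)
5: 医生说每一个孩子的情况都不一样,所以说针对他们的推拿手法都是不一样的。 (The doctor said that every child's condition is different, so the tuina massage techniques used for each of them are different as well.)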