Speech Sample Page for Audiobook Speech Synthesis Conditioned by Cross-Sentence Context-Aware Word Embeddings

Audio samples from “Audiobook Speech Synthesis Conditioned by Cross-Sentence Context-Aware Word Embeddings”

Authors: Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Yusuke Ijima, Ryo Masumura, Hiroshi Saruwatari

Abstract:
This paper proposes an audiobook speech synthesis method that considers a wider range of contexts than a sentence level. The style of the audiobook speech depends not only on the current sentence to be synthesized but also on its neighboring sentences. Therefore, unlike conventional text-to-speech synthesis for isolated sentences, it is necessary to consider the context of the neighboring sentences. Our method utilizes cross-sentence context-aware word embedding, which is obtained by inputting the neighboring and current sentences into BERT. The speech synthesis model, Tacotron2, is conditioned by this word embedding in addition to the current sentence. Experimental results show that taking neighboring sentences into account significantly improves synthetic speech quality.

Paper link: here

Outline of compared models:

The proposed models are based on Tacotron2, a widely studied sequence-to-sequence TTS model. They are conditioned on cross-sentence context-aware word embeddings, which are obtained by inputting multiple sentences to BERT. We prepared two models, and each is trained both with and without fine-tuning of BERT.

SingleSentence

SingleSentence takes only the current sentence as input to BERT. Hence, this model does not utilize cross-sentence context-aware word embeddings.

MultiSentences

MultiSentences takes the current sentence and 2 previous sentences as input. This enables the model to utilize cross-sentence context-aware word embeddings.
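The difference between the two input configurations can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: `build_bert_input` and `toy_tokenize` are hypothetical names, the character-level tokenizer stands in for BERT's real subword tokenizer, and wrapping the concatenated text in a single `[CLS]` ... `[SEP]` pair is an assumption about how the sentences are joined. The returned span marks which positions belong to the current sentence, i.e. the positions whose BERT outputs would condition the synthesis model.

```python
from typing import List, Tuple


def toy_tokenize(sentence: str) -> List[str]:
    # Placeholder tokenizer (assumption): one token per character.
    # A real system would use the subword tokenizer of its BERT model.
    return list(sentence)


def build_bert_input(
    sentences: List[str], index: int, num_prev: int = 2
) -> Tuple[List[str], Tuple[int, int]]:
    """Concatenate up to `num_prev` previous sentences with the current one.

    Returns the token sequence and the [start, end) span of the current
    sentence inside that sequence.
    """
    if not 0 <= index < len(sentences):
        raise IndexError("index out of range")
    first_context = max(0, index - num_prev)
    tokens = ["[CLS]"]
    # Previous sentences come first so that the current one can attend to them.
    for previous in sentences[first_context:index]:
        tokens.extend(toy_tokenize(previous))
    start = len(tokens)
    tokens.extend(toy_tokenize(sentences[index]))
    end = len(tokens)
    tokens.append("[SEP]")
    return tokens, (start, end)


# SingleSentence corresponds to num_prev=0, MultiSentences to num_prev=2.
story = ["ab", "cd", "ef", "gh"]
single_tokens, single_span = build_bert_input(story, 3, num_prev=0)
multi_tokens, multi_span = build_bert_input(story, 3, num_prev=2)
```

In both configurations the span covers the same current-sentence tokens; only the amount of preceding context visible to BERT changes.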

Speech Samples

For each sample, we provide the following:

Tacotron2
SingleSentence
SingleSentence (finetuned)
MultiSentences
MultiSentences (finetuned)

Models marked with "(finetuned)" are trained with fine-tuning of BERT.

「このあいだは、チョコレートにおせんべい、アイスクリームもおちてたね。」
“The other day I found chocolate, rice crackers and ice cream on the ground.”
Ground Truth | Tacotron2 | SingleSentence | SingleSentence (finetuned) | MultiSentences | MultiSentences (finetuned)
かえるくんは、ひるねをじゃまされて、はらをたてました。
The frog was upset because he was prevented from having a nap.
Ground Truth | Tacotron2 | SingleSentence | SingleSentence (finetuned) | MultiSentences | MultiSentences (finetuned)
ユサユサユサッグラグラグラッ
(Onomatopoeia; untranslatable)
Ground Truth | Tacotron2 | SingleSentence | SingleSentence (finetuned) | MultiSentences | MultiSentences (finetuned)
ぽかぽかといいてんきになったので、ありくんだちは、はっぱのふねで、スーイスーイユーラユーラといけでたのしくあそんでいました。
The weather was warm and sunny, so Arikun and his friends were having fun playing in the water on a leafy boat.
Ground Truth | Tacotron2 | SingleSentence | SingleSentence (finetuned) | MultiSentences | MultiSentences (finetuned)
「たすけてー！」
“Help me!”
Ground Truth | Tacotron2 | SingleSentence | SingleSentence (finetuned) | MultiSentences | MultiSentences (finetuned)

Effect of modifying the previous sentences

For each sample, the current sentence to be synthesized is shown in italics, and the modification is shown in bold.

Original
Model input: ありたちが、ゾロゾロゾロゾロえさをさがしてあるいています。いちばんまえのありくんがいいました。「このあいだは、チョコレートにおせんべい、アイスクリームもおちてたね。」
English translation: A group of ants were walking around, looking for food. The foremost ant said, “The other day I found chocolate, rice crackers and ice cream on the ground.”

Quietly
Model input: ありたちが、ゾロゾロゾロゾロえさをさがしてあるいています。いちばんまえのありくんが小声でいいました。「このあいだは、チョコレートにおせんべい、アイスクリームもおちてたね。」
English translation: A group of ants were walking around, looking for food. The foremost ant said quietly, “The other day I found chocolate, rice crackers and ice cream on the ground.”

Loudly
Model input: ありたちが、ゾロゾロゾロゾロえさをさがしてあるいています。いちばんまえのありくんが大声でいいました。「このあいだは、チョコレートにおせんべい、アイスクリームもおちてたね。」
English translation: A group of ants were walking around, looking for food. The foremost ant said loudly, “The other day I found chocolate, rice crackers and ice cream on the ground.”
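
The three model inputs above differ only in one adverb inserted into the preceding sentence, while the sentence to be synthesized stays identical. A minimal sketch of how such variants could be assembled is shown below; the function and constant names are hypothetical and chosen for illustration, and the strings are copied from the examples on this page.

```python
from typing import Dict

# First context sentence, shared by all variants.
CONTEXT = "ありたちが、ゾロゾロゾロゾロえさをさがしてあるいています。"
# Second context sentence, split around the point where the adverb goes.
SPEAKER = "いちばんまえのありくんが"
VERB = "いいました。"
# Current sentence to be synthesized; it is never modified.
CURRENT = "「このあいだは、チョコレートにおせんべい、アイスクリームもおちてたね。」"

# Adverb inserted before the verb: none, "quietly", or "loudly".
MODIFIERS: Dict[str, str] = {
    "Original": "",
    "Quietly": "小声で",
    "Loudly": "大声で",
}


def make_model_inputs() -> Dict[str, str]:
    """Build one model input per variant by editing only the context."""
    return {
        name: CONTEXT + SPEAKER + adverb + VERB + CURRENT
        for name, adverb in MODIFIERS.items()
    }


model_inputs = make_model_inputs()
```

Because the current sentence is the same string in every variant, any difference in the synthesized audio can only come from the modified context.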