Audio samples from “Audiobook Speech Synthesis Conditioned by Cross-Sentence Context-Aware Word Embeddings”

Authors: Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Yusuke Ijima, Ryo Masumura, Hiroshi Saruwatari

Abstract:
This paper proposes an audiobook speech synthesis method that considers a wider range of contexts than a sentence level. The style of the audiobook speech depends not only on the current sentence to be synthesized but also on its neighboring sentences. Therefore, unlike conventional text-to-speech synthesis for isolated sentences, it is necessary to consider the context of the neighboring sentences. Our method utilizes cross-sentence context-aware word embedding, which is obtained by inputting the neighboring and current sentences into BERT. The speech synthesis model, Tacotron2, is conditioned by this word embedding in addition to the current sentence. Experimental results show that taking neighboring sentences into account significantly improves synthetic speech quality.

Paper link: here

Outline of compared models:

The proposed models are based on Tacotron2, a widely studied sequence-to-sequence TTS model. Our proposed model uses cross-sentence context-aware word embeddings, which are obtained by inputting multiple sentences into BERT. We prepared two models, and each model was trained both with and without fine-tuning of BERT.

SingleSentence

SingleSentence takes only the current sentence as input to BERT. Hence, this model does not use cross-sentence context-aware word embeddings.

MultiSentences

MultiSentences takes the current sentence and 2 previous sentences as input. This enables the model to utilize cross-sentence context-aware word embeddings.
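
The difference between the two input schemes can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that the sentences are joined with [SEP] tokens into a single BERT input and that only the embeddings at the positions of the current sentence are passed on to Tacotron2. The names toy_tokenize and build_bert_input are hypothetical, and the whitespace tokenizer stands in for a real BERT subword tokenizer.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def toy_tokenize(sentence: str) -> List[str]:
    """Hypothetical stand-in for a BERT subword tokenizer (splits on whitespace)."""
    return sentence.split()


def build_bert_input(
    sentences: List[str], index: int, n_previous: int
) -> Tuple[List[str], slice]:
    """Build one BERT input for sentence `index` plus up to `n_previous` earlier sentences.

    Returns the full token sequence and the slice that covers the tokens of
    the current sentence. Assumption: sentences are separated by [SEP].
    """
    start = max(0, index - n_previous)
    tokens = [CLS]
    # Previous sentences come first so that BERT can attend to them.
    for previous in sentences[start:index]:
        tokens += toy_tokenize(previous) + [SEP]
    current_start = len(tokens)
    tokens += toy_tokenize(sentences[index])
    current_end = len(tokens)
    tokens.append(SEP)
    return tokens, slice(current_start, current_end)


story = [
    "the frog was napping",
    "a loud noise woke him",
    "the frog was upset",
]

# SingleSentence: the current sentence only.
single_tokens, single_span = build_bert_input(story, index=2, n_previous=0)

# MultiSentences: the current sentence and the 2 previous sentences.
multi_tokens, multi_span = build_bert_input(story, index=2, n_previous=2)
```

In both cases the slice selects the same words of the current sentence. Under the assumptions above, only the MultiSentences input lets BERT compute those word embeddings with the previous sentences in view.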

Speech Samples

For each sample, we provide audio from the following systems, where models marked "(finetuned)" were trained with fine-tuning of BERT.

  1. 「このあいだは、チョコレートにおせんべい、アイスクリームもおちてたね。」
    “The other day, I found chocolate, rice crackers, and ice cream on the ground.”

    Ground Truth | Tacotron2 | SingleSentence | SingleSentence (finetuned) | MultiSentences | MultiSentences (finetuned)
  2. かえるくんは、ひるねをじゃまされて、はらをたてました。
    The frog was upset because he was prevented from having a nap.

    Ground Truth | Tacotron2 | SingleSentence | SingleSentence (finetuned) | MultiSentences | MultiSentences (finetuned)
  3. ユサユサユサッ グラグラグラッ
    (Onomatopoeia; untranslatable.)

    Ground Truth | Tacotron2 | SingleSentence | SingleSentence (finetuned) | MultiSentences | MultiSentences (finetuned)
  4. ぽかぽかといいてんきになったので、ありくんだちは、はっぱのふねで、スーイスーイユーラユーラといけでたのしくあそんでいました。
    The weather had turned warm and sunny, so Arikun and his friends were happily playing on the pond in a leaf boat.

    Ground Truth | Tacotron2 | SingleSentence | SingleSentence (finetuned) | MultiSentences | MultiSentences (finetuned)
  5. 「たすけてー!」
    “Help me!”

    Ground Truth | Tacotron2 | SingleSentence | SingleSentence (finetuned) | MultiSentences | MultiSentences (finetuned)

Effect of modifying the previous sentences

For each sample, the current sentence to be synthesized is shown in italics, and the modification is shown in bold.