Speech samples for “Predicting VQVAE-based Character Acting Style from Quotation-Annotated Text for Audiobook Speech Synthesis”
- Wataru Nakata
- Tomoki Koriyama
- Yuki Saito
- Shinnosuke Takamichi
- Ijima Yusuke
- Ryo Masumura
- Hiroshi Saruwatari
We have prepared six models for comparison:
- FS2 (w/o BERT): Ordinary FastSpeech2
- FS2: FastSpeech2 conditioned on cross-sentence context from RoBERTa
- FS2-ResCNN: FS2 conditioned on ResCNN features from ground-truth speech
- FS2-ResCNN-VQ: FS2 conditioned on vector-quantized ResCNN features from ground-truth speech
- FS2-character: FS2 conditioned on fictional character embeddings
- FS2-all: Proposed model
For details about each model, please refer to our paper.
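As a rough illustration of what "vector-quantized" features mean in the FS2-ResCNN-VQ entry above, the sketch below maps a continuous feature vector to its nearest entry in a small codebook. This is a generic nearest-neighbour quantizer, not the implementation used in the paper: the function name `quantize`, the two-dimensional codebook, and the feature values are all hypothetical and chosen only for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def quantize(feature, codebook):
    """Return (index, code vector) of the codebook entry nearest to `feature`."""
    index = min(range(len(codebook)), key=lambda i: dist(feature, codebook[i]))
    return index, codebook[index]


# Hypothetical 2-D codebook with three entries (a real system would learn these).
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 0.5)]

# A made-up continuous feature standing in for a ResCNN output.
feature = (0.8, 1.1)

index, code = quantize(feature, codebook)
print(index, code)
```

The discrete index is what makes the style representation predictable as a classification target, while the code vector is what a synthesizer would be conditioned on.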
Speech in dialogues
For each sentence, we present the name of the corresponding character under “Character name”, along with the speech generated by each model. Note that the models shown in red take ground-truth speech as input during inference. They are included only as a reference, and comparing them with the other models is not appropriate.
|Character name|Ground truth|FS2 (w/o BERT)|FS2|FS2-ResCNN|FS2-ResCNN-VQ|FS2-character|FS2-all|