Predicting VQVAE-based Character Acting Style from Quotation-Annotated Text for Audiobook Speech Synthesis

Speech samples for “Predicting VQVAE-based Character Acting Style from Quotation-Annotated Text for Audiobook Speech Synthesis”

Authors:

Presentation slide

Speech samples

We have prepared six models for comparison:

FS2 (w/o BERT): Ordinary FastSpeech2
FS2: FastSpeech2 conditioned on cross-sentence context from RoBERTa
FS2-ResCNN: FS2 conditioned on ResCNN features from ground-truth speech
FS2-ResCNN-VQ: FS2 conditioned on vector-quantized ResCNN features from ground-truth speech
FS2-character: FS2 conditioned on fictional character embeddings
FS2-all: Proposed model

For details about each model, please refer to our paper.
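
As a rough illustration of what "vector quantized" means for FS2-ResCNN-VQ, the sketch below maps a continuous feature vector to its nearest entry in a codebook. It is not the implementation used in the paper: the function names, the codebook, and all values are hypothetical toy examples, and nearest-neighbor lookup by squared Euclidean distance is assumed here as the quantization rule.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    feature: Sequence[float], codebook: Sequence[Sequence[float]]
) -> Tuple[int, List[float]]:
    """Return the index and a copy of the codebook entry nearest to `feature`."""
    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(feature, codebook[i]),
    )
    return index, list(codebook[index])


# Hypothetical 2-dimensional codebook with three entries (toy values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# A continuous feature vector is replaced by its nearest codebook entry.
index, quantized = quantize([0.9, 1.2], codebook)
```

In this toy setting, every input feature is represented by one of only three discrete codes, which is the property that makes such features feasible to predict from text rather than extract from speech.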

Speech in dialogues

For each sentence, we present the corresponding character under “Character name”, together with the speech generated by each model. Note that the models shown in red (FS2-ResCNN and FS2-ResCNN-VQ) take ground-truth speech as input during inference. They are shown for reference only, and comparing them directly with the other models is not appropriate.

Character name	Ground truth	FS2 (w/o BERT)	FS2	FS2-ResCNN	FS2-ResCNN-VQ	FS2-character	FS2-all
Narration
Narration
Ant
Ant girl
Ant