PF-STAR: emotional speech synthesis presentation

1
PF-STAR: emotional speech synthesis
Istituto di Scienze e Tecnologie della Cognizione,
Sezione di Padova Fonetica e Dialettologia, CNR
2
Analysis of emotive speech: audio
Emotion categories: disgust (D), surprise (SU), neutral (N),
anger (A), joy (J), fear (F), sadness (SA)
Recordings: /aba/, /ava/, /mamma/

Cue extraction and analysis
Intensity, duration, pitch, pitch range,
formants.
The mean F0 of the stressed vowel and
the F0mid values are strongly correlated.
Voice quality correlates: shimmer, jitter, HNR, Hammarberg index,
spectral flatness, spectral energy distribution.
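
Two of the voice quality correlates listed above, jitter and shimmer, can be sketched in a few lines. This is a minimal sketch that assumes the common "local" definition (mean absolute difference between consecutive cycles, divided by the mean). The slides do not say which variant was used, and the function names are illustrative only.

```python
def _relative_perturbation(values):
    """Mean absolute difference of consecutive values, divided by their mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def local_jitter(periods):
    """Cycle-to-cycle variation of glottal period durations (assumed 'local' variant)."""
    return _relative_perturbation(periods)


def local_shimmer(peak_amplitudes):
    """Cycle-to-cycle variation of per-cycle peak amplitudes (assumed 'local' variant)."""
    return _relative_perturbation(peak_amplitudes)


# Hypothetical period durations in seconds for four consecutive cycles.
periods = [0.0100, 0.0102, 0.0099, 0.0101]
print(f"jitter: {local_jitter(periods):.4f}")
```

A perfectly periodic signal yields zero for both measures. Harsh or creaky voices would be expected to give larger values.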

F0mean (global and for stressed vowel), F0mid,
and F0range for /aba/
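
The F0 measures named in the caption above could be computed from a frame-wise F0 contour roughly as follows. The slides do not define F0mid or F0range. This sketch assumes F0mid is the F0 value at the temporal midpoint of the stressed vowel and F0range is the maximum minus the minimum over voiced frames. The function name and the frame-index convention are hypothetical.

```python
def f0_summary(f0_hz, vowel_start, vowel_end):
    """Summarise an F0 contour sampled at a fixed frame rate.

    f0_hz: F0 values in Hz, with 0 marking unvoiced frames.
    vowel_start, vowel_end: frame indices of the stressed vowel (end exclusive).
    """
    voiced = [f for f in f0_hz if f > 0]
    vowel = [f for f in f0_hz[vowel_start:vowel_end] if f > 0]
    if not voiced or not vowel:
        raise ValueError("no voiced frames")
    return {
        "f0_mean_global": sum(voiced) / len(voiced),
        "f0_mean_vowel": sum(vowel) / len(vowel),
        "f0_mid": vowel[len(vowel) // 2],       # assumed: value at vowel midpoint
        "f0_range": max(voiced) - min(voiced),  # assumed: max minus min, in Hz
    }
```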
3
Analysis of emotive speech: voice quality
Voice quality characterization:
Anger: harsh voice (/a/)
Disgust: creaky voice (/a/)
Joy, Fear, Surprise: breathy voice
Discriminant analysis classification scores: 60/70 for stressed and
unstressed vowels
Best scores: Fear, Anger
Worst score: Surprise
VOQUAL 2003 paper: "Emotions and Voice Quality:
Experiments with Sinusoidal Modeling"
4
Processing of emotive speech
Neutral-to-emotive transformation based on
sinusoidal modeling
[Audio examples: Neutral; Disgust, Disgust (PsTs), Target Disgust; Sadness, Sadness (PsTs), Target Sadness]

Results
Time-stretch and (formant-preserving) pitch
shift alone cannot account for the principal
emotion-related cues.
Spectral conversion can account for some of the
emotion cues.
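
The two operations named in the results above, time-stretch and formant-preserving pitch shift, can be sketched on sinusoidal-model frames. This is a rough sketch, not the presenters' implementation. It assumes each frame is a list of (frequency, amplitude) partials sorted by strictly increasing frequency, approximates the spectral envelope by linear interpolation between partial amplitudes, and ignores phase and partial tracking. All function names are hypothetical.

```python
def interp(x, xs, ys):
    """Piecewise-linear interpolation over increasing xs, clamped at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + t * (ys[i] - ys[i - 1])


def pitch_shift(frame, ratio):
    """Scale partial frequencies by `ratio`; read new amplitudes off the original
    envelope so formant positions stay where they were."""
    freqs = [f for f, _ in frame]
    amps = [a for _, a in frame]
    return [(f * ratio, interp(f * ratio, freqs, amps)) for f in freqs]


def time_stretch(frames, factor):
    """Change duration by `factor` by repeating or skipping analysis frames."""
    n_out = max(1, round(len(frames) * factor))
    last = len(frames) - 1
    return [frames[min(last, int(i / factor))] for i in range(n_out)]


frame = [(100.0, 1.0), (200.0, 0.5), (300.0, 0.25)]
print(pitch_shift(frame, 2.0))
```

Because both operations only move partials in time and frequency under a fixed envelope, they leave voice quality cues untouched. That fits the slide's finding that they cannot account for the principal emotion-related cues on their own.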

5
Processing of emotive speech
Neutral-to-emotive transformation based on
sinusoidal modeling
[Audio example grid. Rows: Neutral, PsTs, PsTsSc, Target. Columns: anger, disgust, joy, fear, surprise, sadness]
6
SI voice processing for TTS systems
Processing of emotive speech: results
Emotive synthesis based on FESTIVAL/MBROLA (male
voice)
[Audio example grid. Rows: Neutral, PsTs, PsTsVQtr, Target. Columns: anger, disgust, joy, fear, surprise, sadness]
7
E-TTS audio examples
8
Mark-Up Languages for E-TTS

Hierarchical description of emotive voice

High level: emotive tags (e.g., <anger>, <joy>, <fear>)
Medium level: phonetic voice description (e.g., <modal>, <soft>, <pressed>)
Low level: acoustic description (e.g., <spectral tilt>, <shimmer>, <jitter>)
Definition of speaker-independent rules to
control voice quality within a text-to-speech
synthesizer.
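
The three-level tag hierarchy above can be illustrated with a small parser that labels each tag with its level. This is only a sketch: the tag names come from the slide, but the `<utt>` wrapper, the underscore in `spectral_tilt` (a space is not valid in an XML element name), and the `describe` function are assumptions for illustration, not the project's actual markup language.

```python
import xml.etree.ElementTree as ET

# Tag names taken from the slide; the level assignment mirrors its hierarchy.
TAG_LEVELS = {
    "anger": "high", "joy": "high", "fear": "high",
    "modal": "medium", "soft": "medium", "pressed": "medium",
    "spectral_tilt": "low", "shimmer": "low", "jitter": "low",
}


def describe(markup):
    """Return (tag, level, text) triples for known tags, in document order."""
    root = ET.fromstring(markup)
    out = []
    for elem in root.iter():
        level = TAG_LEVELS.get(elem.tag)
        if level is not None:
            out.append((elem.tag, level, (elem.text or "").strip()))
    return out


# Hypothetical input; <utt> is an assumed wrapper element.
example = "<utt><anger>Leave now.</anger> <soft>Please.</soft></utt>"
for tag, level, text in describe(example):
    print(f"{level:>6}  <{tag}>  {text}")
```

A rule set of the kind the slide describes would presumably expand each high-level tag into medium-level and then low-level settings. Those mappings are not given in the slides, so they are left out here.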
