Title: PF-STAR: emotional speech synthesis
1. PF-STAR: emotional speech synthesis
Istituto di Scienze e Tecnologie della Cognizione, Sezione di Padova "Fonetica e Dialettologia", CNR
2. Analysis of emotive speech: audio
Emotions: disgust (D), surprise (SU), neutral (N), anger (A), joy (J), fear (F), sadness (SA)
Recordings: /aba/, /ava/, /mamma/
- Cues extraction and analysis
- Intensity, duration, pitch, pitch range, formants.
- F0 mean on the stressed vowel and F0mid values are strongly correlated.
- Shimmer, jitter, HNR, Hammarberg's index, spectral flatness, spectral energy distributions: voice quality correlates.
Figure: F0mean (global and for stressed vowel), F0mid, and F0range for /aba/
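As an illustration of the perturbation and pitch cues listed on this slide, here is a minimal sketch. It assumes the common "local" definitions of jitter and shimmer (mean absolute difference between consecutive cycles, divided by the mean), which are not necessarily the exact measures used in the study. The function names and the input format (cycle periods in seconds, per-cycle peak amplitudes) are assumptions made for the example, and F0 range is taken here simply as max minus min in Hz.

```python
def _local_perturbation(values):
    """Mean absolute difference of consecutive values, relative to the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def local_jitter(periods):
    """Cycle-to-cycle variation of the glottal period (assumed 'local' definition)."""
    return _local_perturbation(periods)


def local_shimmer(amplitudes):
    """Cycle-to-cycle variation of the peak amplitude (assumed 'local' definition)."""
    return _local_perturbation(amplitudes)


def f0_stats(periods):
    """F0 mean and range in Hz, derived from the cycle periods in seconds."""
    f0 = [1.0 / p for p in periods]
    return {"mean": sum(f0) / len(f0), "range": max(f0) - min(f0)}
```

A perfectly periodic signal gives zero jitter and zero shimmer; larger values indicate a more irregular voice.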
3. Analysis of emotive speech: voice quality
Voice quality characterization
- Anger: harsh voice (/a/)
- Disgust: creaky voice (/a/)
- Joy, Fear, Surprise: breathy voice
Discriminant analysis
- Classification scores: 60/70 for stressed and unstressed vowel
- Best score: Fear, Anger
- Worst score: Surprise
VOQUAL 2003 paper: "Emotions and Voice Quality: Experiments with Sinusoidal Modeling"
4. Processing of emotive speech
Neutral-to-emotive transformation based on sinusoidal modeling
Audio examples: Neutral; Disgust (PsTs) and Target Disgust; Sadness (PsTs) and Target Sadness
- Results
- Time-stretch and (formant-preserving) pitch shift alone cannot account for the principal emotion-related cues.
- Spectral conversion can account for some of the emotion cues.
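To make the two basic operations concrete, the following is a minimal sketch of time-stretch and formant-preserving pitch shift applied to sinusoidal-model parameters. It is not the authors' algorithm. It assumes each analysis frame is a list of (frequency in Hz, amplitude) partials sorted by frequency, approximates the spectral envelope by linear interpolation through the partial peaks, and omits phase handling and resynthesis. The label PsTs is read here as "pitch shift plus time stretch", which is an interpretation of the slide. All function names are hypothetical.

```python
from bisect import bisect_left


def envelope_at(frame, freq):
    """Amplitude envelope at `freq`, linearly interpolated between partial peaks."""
    freqs = [f for f, _ in frame]
    amps = [a for _, a in frame]
    if freq <= freqs[0]:
        return amps[0]
    if freq >= freqs[-1]:
        return amps[-1]
    i = bisect_left(freqs, freq)
    w = (freq - freqs[i - 1]) / (freqs[i] - freqs[i - 1])
    return amps[i - 1] + w * (amps[i] - amps[i - 1])


def pitch_shift(frame, factor):
    """Scale partial frequencies; re-read amplitudes from the original envelope
    so that the formant structure stays in place."""
    return [(f * factor, envelope_at(frame, f * factor)) for f, _ in frame]


def time_stretch(frames, factor):
    """Change duration by repeating or skipping frames (factor > 1 lengthens)."""
    if not frames:
        return []
    n_out = max(1, round(len(frames) * factor))
    last = len(frames) - 1
    return [frames[min(last, int(i / factor))] for i in range(n_out)]
```

Because the amplitudes are re-sampled from the unshifted envelope, raising the pitch moves the harmonics without moving the spectral peaks. A spectral conversion step, which the slide reports as necessary for some emotion cues, would additionally modify that envelope.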
5. Processing of emotive speech
Neutral-to-emotive transformation based on sinusoidal modeling
Audio examples (rows: Neutral, PsTs, PsTsSc, Target; columns: anger, disgust, joy, fear, surprise, sadness)
6. SI voice processing for TTS systems
Processing of emotive speech: results
Emotive synthesis based on FESTIVAL/MBROLA (Male Voice)
Audio examples (rows: Neutral, PsTs, PsTsVQtr, Target; columns: Anger, Disgust, Joy, Fear, Surprise, Sadness)
7. E-TTS Audio Examples
8. Mark-Up Languages for E-TTS
- Hierarchic description of emotive voice
- High level: emotive tag (e.g., <anger>, <joy>, <fear>, etc.)
- Medium level: phonetic voice description (e.g., <modal>, <soft>, <pressed>, etc.)
- Low level: acoustic description (e.g., <spectral tilt>, <shimmer>, <jitter>, etc.)
- Definition of speaker-independent rules to control voice quality within a text-to-speech synthesizer.
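The three-level hierarchy can be pictured as two lookup steps, from an emotive tag to a voice description and from there to acoustic controls. The sketch below is only an illustration of that idea. The rule tables, the function name `resolve`, and the use of scale factors are hypothetical. The anger-to-harsh pairing is borrowed from the voice quality analysis earlier in the talk, and every acoustic value is left at the placeholder 1.0 because the slides give no numbers.

```python
# High level -> medium level (hypothetical rule; pairing taken from the
# voice quality characterization slide).
EMOTION_RULES = {"anger": "harsh"}

# Medium level -> low level. Values are placeholders, NOT measured data:
# 1.0 stands for "scale factor not specified in the source".
VOICE_RULES = {
    "harsh": {"spectral_tilt": 1.0, "shimmer": 1.0, "jitter": 1.0},
}


def resolve(emotion, emotion_rules, voice_rules):
    """Expand a high-level emotive tag into its medium- and low-level descriptions."""
    voice = emotion_rules[emotion]
    return {
        "high": emotion,
        "medium": voice,
        "low": dict(voice_rules[voice]),  # copy so callers cannot edit the rules
    }


description = resolve("anger", EMOTION_RULES, VOICE_RULES)
```

Keeping the two tables separate mirrors the goal stated on the slide: the low-level rules can be defined once, independently of the speaker, and reused by any emotive tag that maps to the same voice description.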