1. A Statistical Approach to Emotional Prosody Generation
- Zeynep Inanoglu
- Machine Intelligence Laboratory
- CU Engineering Department
- Supervisor Prof. Steve Young
2. Agenda
- Previous Toshiba Update
- A Review of Emotional Speech Synthesis
- Motivation for Proposed Approach
- Proposed Approach: Intonation Generation from Syllable HMMs
- Intonation Models and Training
- Recognition Performance of Intonation Units
- Intonation Synthesis from HMMs
- MLLR based Intonation Adaptation
- Perceptual Tests
- Summary and Future Direction
3. Previous Toshiba Update: A Brief Review
- Emotion Recognition
- Demonstrated work on HMM-based emotion detection in voicemail messages (Emotive Alert).
- Reported the set of acoustic features that maximizes classification accuracy for each emotion type, identified using the sequential forward floating selection algorithm.
- Expressive Speech Synthesis
- Demonstrated the importance of prosody in emotional expression through copy-synthesis of emotional prosody onto neutral utterances.
- Suggested linguistically descriptive intonation units (accents, boundary tones) for prosody modelling.
4. A Review of Emotional Speech Synthesis
- The importance of prosody in emotional expression has been confirmed (Banse & Scherer, 1996; Mozziconacci, 1998).
- The available prosody rules are mainly defined for global parameters (mean pitch, pitch range, speaking rate, declination).
- The interaction of linguistic units and emotion is largely untested (Banziger, 2005).
- Strategies for emotional synthesis vary with the type of synthesizer:
- Formant synthesis allows control over various segmental and prosodic parameters. Emotional prosody rules extracted from the literature are applied by modifying neutral synthesizer parameters (Cahn, 1990; Burkhardt, 2000; Murray & Arnott, 1995).
- Diphone synthesis allows prosody control by defining target contours and durations based on emotional prosody rules (Schroeder, 2004; Burkhardt, 2005).
- Unit-selection synthesis provides minimal parametric flexibility. Attempts at emotional expression involve recording entire unit databases for each emotion and selecting units from the appropriate database at run time (Iida et al., 2003).
- HMM synthesis allows spectral and prosodic control at the segmental level, and provides a statistical framework for modelling emotions (Tsuzuki et al., 2004).
5. A Review of Emotional Speech Synthesis

[Figure: methods plotted along two axes, METHOD (rule-based vs. statistical) and GRANULARITY (segmental, global, intonational syllable/phrase)]
- Unit-selection synthesis: very good quality, but not scalable; replicating the unit database per emotion takes too much effort.
- HMM synthesis: statistical, but too granular for prosody modelling.
- Formant / diphone synthesis: rule-based; only as good as the hand-crafted rules; poor-to-medium baseline quality.
- Statistical modelling at the intonational (syllable/phrase) granularity is unexplored.
6. Motivation for Proposed Approach [1]
- We propose a generative model of prosody.
- We envision evaluating this prosodic model in a variety of synthesis contexts through signal manipulation schemes such as TD-PSOLA.
- Statistical:
- Rule-based systems are only as good as their hand-crafted rules. Why not learn the rules from data?
- Success of HMM methods in speech synthesis.
- Syllable-based:
- Pitch movements are most relevant at the syllable or intonational-phrase level. However, the effects of emotion on contour shapes and linguistic units are largely unexplored.
- Linguistic units of intonation:
- The coupling of emotion and linguistic phenomena has not been investigated.
[1] This work will be published in the Proceedings of ACII, October 2005, Beijing.
7. Overview
- Current focus is on pitch modelling only, using syllable-based intonation models.
- Step 1: Train intonation models on neutral speech data.
- Step 2: Generate intonation contours from the HMMs.
- Step 3: Adapt the models given a small amount of emotion data (MLLR, mean pitch).
- Step 4: Transplant the contour onto an utterance.

[Figure: system diagram — neutral speech data trains context-sensitive syllable HMMs; MLLR adaptation with emotion data yields emotion HMMs; syllable labels (e.g. "1 1.5 c / 1.5 1.9 a / 1.9 2.3 c / 2.3 2.5 rb"), syllable boundaries, and phonetic labels drive F0 generation; the synthesized contour is transplanted with TD-PSOLA]
8. Intonation Models and Training
- Basic models:
- Seven basic models: A (accent), C (unstressed), RB (rising boundary), FB (falling boundary), ARB, AFB, SIL.
- Context-sensitive models:
- Tri-unit models (preceding and following intonation unit).
- Full-context models (position of syllable in intonational phrase; forward counts of accents and boundary tones in the IP; position of vowel in syllable; number of phones in the syllable).
- Decision-tree-based parameter tying was performed for the context-sensitive models.
- Data: Boston Radio Corpus.
- Features: normalized raw F0 and energy values as well as their differentials.
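The full-context scheme above can be illustrated with a small label-building sketch. This is a hypothetical format (the slide lists the context factors but not the exact label syntax); the function name and field names are assumptions, and the vowel-position factor is omitted for brevity:

```python
def full_context_label(syllables, i, phones_per_syll):
    """Build a hypothetical full-context label for syllable i, combining the
    context factors listed above: tri-unit context (previous/next unit),
    position of the syllable in the intonational phrase, forward counts of
    accents and boundary tones, and the number of phones in the syllable.

    syllables: basic unit labels for one intonational phrase,
               e.g. ['c', 'a', 'c', 'rb']
    """
    prev_u = syllables[i - 1] if i > 0 else 'sil'
    next_u = syllables[i + 1] if i < len(syllables) - 1 else 'sil'
    rest = syllables[i + 1:]
    # Forward counts: accents and boundary tones still to come in the IP.
    accents_ahead = sum(u in ('a', 'arb', 'afb') for u in rest)
    boundaries_ahead = sum(u in ('rb', 'fb', 'arb', 'afb') for u in rest)
    return (f"{prev_u}-{syllables[i]}+{next_u}"
            f"/pos:{i + 1}of{len(syllables)}"
            f"/acc:{accents_ahead}/bnd:{boundaries_ahead}"
            f"/nphones:{phones_per_syll[i]}")
```

For the phrase "c a c rb", the accented second syllable would get the label "c-a+c/pos:2of4/acc:0/bnd:1/nphones:3" (with three phones in that syllable); decision-tree tying can then ask yes/no questions about any of these fields.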
9. Recognition Results
- Evaluation of the models was performed in a recognition framework to assess how well the models represent intonation units and to quantify the benefits of incorporating context.
- A held-out test set was used for predicting intonation sequences.
- Basic models were tested with varying numbers of mixture components, and the accuracy rates were compared with those of the full-context models.
Basic label set (7 models, 3 emitting states, N mixture components):
  N = 1:  Corr 53.26  Acc 44.52
  N = 2:  Corr 53.36  Acc 45.48
  N = 4:  Corr 54.65  Acc 46.31
  N = 10: Corr 59.58  Acc 50.40
Full-context label set with decision-tree-based tying:
  Corr 64.02  Acc 55.88
10. Intonation Synthesis from HMMs
- The goal is to generate an optimal sequence of observations directly from the syllable HMMs, given the intonation models.
- The optimal state sequence Q_max is predetermined by basic duration models, so the parameter generation problem becomes O_max = argmax_O P(O | Q_max, lambda).
- Without further constraints, the solution is simply the sequence of mean vectors for the state sequence Q_max.
- We used the cepstral parameter generation algorithm of the HTS system for interpolated F0 generation (Tokuda et al., 1995).
- Differential F0 features (Delta-f and Delta-Delta-f) are used as constraints in contour generation; maximization is done over the static parameters only.
11. Intonation Synthesis from HMMs
- A single observation vector consists of static and dynamic features: o_t = [f_t, Delta-f_t, Delta-Delta-f_t].
- The dynamic features are computed from neighbouring static values by fixed window functions, e.g. Delta-f_t = (f_{t+1} - f_{t-1}) / 2.
- This relationship can be expressed in matrix form as O = WF, where O is the sequence of full feature vectors, F is the sequence of static features only, and W is the matrix form of the window functions. The maximization problem then becomes F_max = argmax_F P(WF | Q_max, lambda).
- The solution is a set of linear equations that can be solved in a time-recursive manner (Tokuda et al., 1995).
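The parameter generation step above can be sketched in a few lines of NumPy. This is a minimal sketch under simplifying assumptions: one F0 stream with a single delta window, diagonal covariances, and a direct linear solve instead of the time-recursive algorithm of Tokuda et al.; the function name `mlpg` is ours, not from the slides:

```python
import numpy as np

def mlpg(means, variances):
    """Maximum-likelihood parameter generation (after Tokuda et al., 1995),
    simplified to one static stream (F0) plus one delta stream.

    means, variances: (T, 2) arrays of per-frame [static, delta] Gaussian
    parameters read off the predetermined state sequence Q_max.
    Returns the static contour F (length T) maximizing P(WF | Q, lambda),
    i.e. the solution of (W' S^-1 W) F = W' S^-1 M.
    """
    T = means.shape[0]
    # W stacks a static window (identity row) and a delta window per frame:
    # Delta-f_t = 0.5 * (f_{t+1} - f_{t-1}), with clamped edges.
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                      # static row
        lo, hi = max(t - 1, 0), min(t + 1, T - 1)
        W[2 * t + 1, hi] += 0.5                # delta row
        W[2 * t + 1, lo] -= 0.5
    M = means.reshape(-1)                      # interleaved [f, delta-f] targets
    Sinv = 1.0 / variances.reshape(-1)         # diagonal covariances assumed
    A = W.T @ (Sinv[:, None] * W)
    b = W.T @ (Sinv * M)
    return np.linalg.solve(A, b)
```

Fed a stepwise sequence of state means (e.g. five frames at 100 Hz followed by five at 120 Hz, with zero delta means), the delta constraints smooth the step into a continuous contour instead of reproducing the piecewise-constant mean sequence.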
12. Intonation Synthesis from HMMs

[Figure: the system diagram from the Overview slide, highlighting the F0 generation path from syllable labels, syllable boundaries, and phonetic labels through the context-sensitive HMMs to the synthesized contour]
13. Perceptual Effects of Intonation Units

[Figure: pitch contours generated for different intonation label sequences (e.g. "a a c c" vs. "c c", and "a a c a" vs. "c fb"), illustrating the perceptual effect of accent and boundary placement]
14. Pitch Contour Samples
- Generated neutral contours transplanted onto unseen utterances.

[Figure: original vs. synthesized contours for the tri-unit and full-context models]
15. MLLR Adaptation to Emotional Speech
- Maximum Likelihood Linear Regression (MLLR) adaptation computes a set of linear transformations for the mean and variance parameters of a continuous HMM.
- The number of transforms is determined by a regression tree and a threshold on what counts as enough adaptation data.
- Adaptation data comes from the Emotional Prosody Corpus, which consists of four-syllable phrases spoken in a variety of emotions.
- Happy and sad speech were chosen for this experiment.

[Figure: a seven-node regression tree; when a node has insufficient data, the transformation from its parent node is used]
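The mean update at a single regression-tree node can be sketched as follows. This is a simplified sketch, not the full MLLR derivation: one global regression class and unit variances are assumed, under which the ML estimate of the affine transform mu' = A mu + b reduces to occupancy-weighted least squares; all function names are ours:

```python
import numpy as np

def mllr_mean_transform(neutral_means, adapt_stats):
    """Estimate a single global MLLR-style mean transform mu' = A mu + b.

    neutral_means: (S, D) state means of the neutral HMMs.
    adapt_stats: per-state (gamma, obs_sum) pairs, where gamma is the total
        state occupancy in the adaptation data and obs_sum the summed
        adaptation observations aligned to that state.
    Returns W of shape (D, D+1) such that mu' = W @ [1, mu].
    """
    S, D = neutral_means.shape
    X = np.hstack([np.ones((S, 1)), neutral_means])   # extended means [1, mu]
    G = np.zeros((D + 1, D + 1))
    K = np.zeros((D + 1, D))
    for s, (gamma, obs_sum) in enumerate(adapt_stats):
        G += gamma * np.outer(X[s], X[s])             # accumulate normal equations
        K += np.outer(X[s], obs_sum)                  # obs_sum already carries gamma
    return np.linalg.solve(G, K).T                    # (D, D+1)

def adapt_means(neutral_means, W):
    """Apply the transform to every neutral state mean."""
    X = np.hstack([np.ones((neutral_means.shape[0], 1)), neutral_means])
    return X @ W.T
```

With enough occupied states, the transform shifts and rotates the whole neutral mean space toward the emotional data, which is what lets a small amount of happy or sad speech adapt every context-sensitive model at once.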
16. MLLR Adaptation to Happy and Sad Data

[Figure: neutral, sad, and happy contours generated for the same syllable label sequences (c ... c arb ... c), showing the effect of adaptation]
17. Perceptual Tests
- Test 1: How natural are the neutral contours?
- Ten listeners were asked to rate utterances in terms of naturalness of intonation.
- Some utterances were unmodified and others had synthetic contours.
- A t-test (p < 0.05) on the data showed that the distributions of ratings for the two hidden groups overlap sufficiently, i.e. there is no significant difference in terms of quality.
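The comparison in Test 1 can be sketched with a two-sample t-statistic. The rating arrays below are hypothetical (the slide reports only the outcome), and the slide does not say which t-test variant was used; this sketch implements Welch's unequal-variance form:

```python
import numpy as np

def welch_t(a, b):
    """Welch's two-sample t-statistic and degrees of freedom, for comparing
    naturalness ratings of unmodified vs. synthetic-contour utterances."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

A small |t| relative to the critical value at the resulting degrees of freedom is what "the two hidden groups overlap sufficiently" amounts to: the mean ratings cannot be distinguished at the p < 0.05 level.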
18. Perceptual Tests
- Test 2: Does adaptation work?
- The goal is to find out whether the adapted models produce contours that people perceive as more emotional than the neutral contours.
- Given pairs of utterances, 14 listeners were asked to identify the happier/sadder one.
19. Perceptual Tests
- Utterances with sad contours were identified 80% of the time; this was significant (p < 0.01).
- Listeners formed a bimodal distribution in their ability to detect happy utterances. Overall, only 46% of the happy intonation was identified as happier than neutral. (The "smiling voice" is notoriously difficult in the literature.)
- Happy models worked better with utterances containing more accents and rising boundaries: the organization of labels matters!
20. Summary and Future Direction
- A statistical approach to prosody generation was proposed, with an initial focus on F0 contours.
- The results of the perceptual tests were encouraging and yielded guidelines for future direction:
- Bypass the use of perceptual labels; use lexical stress information as a prior in automatic labelling of corpora.
- Investigate the role of emotion in accent frequency to build a "language model" of emotion.
- Duration modelling: evaluate the HSMM framework as well as duration adaptation using vowel-specific conversion functions.
- Voice source modelling: treat LF parameters as part of prosody.
- Investigate the use of graphical models to allow hierarchical constraints on generated parameters.
- Incorporate the framework into one or more TTS systems.