1. A Statistical Approach to Emotional Prosody Generation
- Zeynep Inanoglu
- Machine Intelligence Laboratory
- CU Engineering Department
- Supervisor Prof. Steve Young
2. Agenda
- Previous Toshiba Update
- A Review of Emotional Speech Synthesis
- Motivation for Proposed Approach
- Proposed Approach: Intonation Generation from Syllable HMMs
- Intonation Models and Training
- Recognition Performance of Intonation Units
- Intonation Synthesis from HMMs
- MLLR based Intonation Adaptation
- Perceptual Tests
- Summary and Future Direction
3. Previous Toshiba Update: A Brief Review
- Emotion Recognition
- Demonstrated work on HMM-based emotion detection in voicemail messages (Emotive Alert).
- Reported the set of acoustic features that maximizes classification accuracy for each emotion type, identified using the sequential forward floating selection algorithm.
- Expressive Speech Synthesis
- Demonstrated the importance of prosody in emotional expression through copy-synthesis of emotional prosody onto neutral utterances.
- Suggested linguistically descriptive intonation units (accents, boundary tones) for prosody modelling.
4. A Review of Emotional Speech Synthesis
- The importance of prosody in emotional expression has been confirmed (Banse & Scherer, 1996; Mozziconacci, 1998).
- The available prosody rules are mainly defined for global parameters (mean pitch, pitch range, speaking rate, declination).
- The interaction of linguistic units and emotion is largely untested (Banziger, 2005).
- Strategies for emotional synthesis vary with the type of synthesizer:
- Formant synthesis allows control over various segmental and prosodic parameters. Emotional prosody rules extracted from the literature are applied by modifying neutral synthesizer parameters (Cahn, 1990; Burkhardt, 2000; Murray & Arnott, 1995).
- Diphone synthesis allows prosody control by defining target contours and durations based on emotional prosody rules (Schroeder, 2004; Burkhardt, 2005).
- Unit-selection synthesis provides minimal parametric flexibility. Attempts at emotional expression involve recording entire unit databases for each emotion and selecting units from the appropriate database at run time (Iida et al., 2003).
- HMM synthesis allows spectral and prosodic control at the segmental level, and provides a statistical framework for modelling emotions (Tsuzuki et al., 2004).
5. A Review of Emotional Speech Synthesis

[Figure: methods plotted along two axes, METHOD (rule-based vs. statistical) and GRANULARITY (segmental, global, intonational syllable/phrase)]
- Unit-selection synthesis: very good quality, but not scalable; replicating the unit database per emotion takes too much effort.
- HMM synthesis: statistical, but too granular for prosody modelling.
- Formant / diphone synthesis: rule-based; only as good as the hand-crafted rules; poor-to-medium baseline quality.
- Statistical modelling at the intonational (syllable/phrase) granularity is unexplored.
6. Motivation for Proposed Approach [1]
- We propose a generative model of prosody.
- We envision evaluating this prosodic model in a variety of synthesis contexts through signal manipulation schemes such as TD-PSOLA.
- Statistical:
- Rule-based systems are only as good as their hand-crafted rules. Why not learn the rules from data?
- Success of HMM methods in speech synthesis.
- Syllable-based:
- Pitch movements are most relevant at the syllable or intonational-phrase level. However, the effects of emotion on contour shapes and linguistic units are largely unexplored.
- Linguistic units of intonation:
- The coupling of emotion and linguistic phenomena has not been investigated.
[1] This work will be published in the Proceedings of ACII, October 2005, Beijing.
7. Overview
- Current focus is on pitch modelling only, using syllable-based intonation models.
- Step 1: Train intonation models on neutral speech data.
- Step 2: Generate intonation contours from the HMMs.
- Step 3: Adapt the models given a small amount of emotion data (MLLR, mean pitch).
- Step 4: Transplant the contour onto an utterance.

[Figure: system diagram — neutral speech data trains context-sensitive syllable HMMs; MLLR adaptation with emotion data yields emotion HMMs; syllable labels (e.g. "1 1.5 c / 1.5 1.9 a / 1.9 2.3 c / 2.3 2.5 rb"), syllable boundaries, and phonetic labels drive F0 generation; the synthesized contour is transplanted with TD-PSOLA]
8. Intonation Models and Training
- Basic models:
- Seven basic models: A (accent), C (unstressed), RB (rising boundary), FB (falling boundary), ARB, AFB, SIL.
- Context-sensitive models:
- Tri-unit models (preceding and following intonation unit).
- Full-context models (position of syllable in intonational phrase; forward counts of accents and boundary tones in the IP; position of vowel in syllable; number of phones in the syllable).
- Decision-tree-based parameter tying was performed for the context-sensitive models.
- Data: Boston Radio Corpus.
- Features: normalized raw F0 and energy values as well as their differentials.
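The full-context scheme above can be illustrated with a small label-building sketch. This is a hypothetical format (the slide lists the context factors but not the exact label syntax); the function name and field names are assumptions, and the vowel-position factor is omitted for brevity:

```python
def full_context_label(syllables, i, phones_per_syll):
    """Build a hypothetical full-context label for syllable i, combining the
    context factors listed above: tri-unit context (previous/next unit),
    position of the syllable in the intonational phrase, forward counts of
    accents and boundary tones, and the number of phones in the syllable.

    syllables: basic unit labels for one intonational phrase,
               e.g. ['c', 'a', 'c', 'rb']
    """
    prev_u = syllables[i - 1] if i > 0 else 'sil'
    next_u = syllables[i + 1] if i < len(syllables) - 1 else 'sil'
    rest = syllables[i + 1:]
    # Forward counts: accents and boundary tones still to come in the IP.
    accents_ahead = sum(u in ('a', 'arb', 'afb') for u in rest)
    boundaries_ahead = sum(u in ('rb', 'fb', 'arb', 'afb') for u in rest)
    return (f"{prev_u}-{syllables[i]}+{next_u}"
            f"/pos:{i + 1}of{len(syllables)}"
            f"/acc:{accents_ahead}/bnd:{boundaries_ahead}"
            f"/nphones:{phones_per_syll[i]}")
```

For the phrase "c a c rb", the accented second syllable would get the label "c-a+c/pos:2of4/acc:0/bnd:1/nphones:3" (with three phones in that syllable); decision-tree tying can then ask yes/no questions about any of these fields.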
9. Recognition Results
- Evaluation of the models was performed in a recognition framework to assess how well the models represent intonation units and to quantify the benefits of incorporating context.
- A held-out test set was used for predicting intonation sequences.
- Basic models were tested with varying numbers of mixture components, and the accuracy rates were compared with those of the full-context models.
Basic label set (7 models, 3 emitting states, N mixture components):
  N = 1:  Corr 53.26  Acc 44.52
  N = 2:  Corr 53.36  Acc 45.48
  N = 4:  Corr 54.65  Acc 46.31
  N = 10: Corr 59.58  Acc 50.40
Full-context label set with decision-tree-based tying:
  Corr 64.02  Acc 55.88
10. Intonation Synthesis from HMMs
- The goal is to generate an optimal sequence of observations directly from the syllable HMMs, given the intonation models.
- The optimal state sequence Q_max is predetermined by basic duration models, so the parameter generation problem becomes O_max = argmax_O P(O | Q_max, lambda).
- Without further constraints, the solution is simply the sequence of mean vectors for the state sequence Q_max.
- We used the cepstral parameter generation algorithm of the HTS system for interpolated F0 generation (Tokuda et al., 1995).
- Differential F0 features (Delta-f and Delta-Delta-f) are used as constraints in contour generation; maximization is done over the static parameters only.
11. Intonation Synthesis from HMMs
- A single observation vector consists of static and dynamic features: o_t = [f_t, Delta-f_t, Delta-Delta-f_t].
- The dynamic features are computed from neighbouring static values by fixed window functions, e.g. Delta-f_t = (f_{t+1} - f_{t-1}) / 2.
- This relationship can be expressed in matrix form as O = WF, where O is the sequence of full feature vectors, F is the sequence of static features only, and W is the matrix form of the window functions. The maximization problem then becomes F_max = argmax_F P(WF | Q_max, lambda).
- The solution is a set of linear equations that can be solved in a time-recursive manner (Tokuda et al., 1995).
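The parameter generation step above can be sketched in a few lines of NumPy. This is a minimal sketch under simplifying assumptions: one F0 stream with a single delta window, diagonal covariances, and a direct linear solve instead of the time-recursive algorithm of Tokuda et al.; the function name `mlpg` is ours, not from the slides:

```python
import numpy as np

def mlpg(means, variances):
    """Maximum-likelihood parameter generation (after Tokuda et al., 1995),
    simplified to one static stream (F0) plus one delta stream.

    means, variances: (T, 2) arrays of per-frame [static, delta] Gaussian
    parameters read off the predetermined state sequence Q_max.
    Returns the static contour F (length T) maximizing P(WF | Q, lambda),
    i.e. the solution of (W' S^-1 W) F = W' S^-1 M.
    """
    T = means.shape[0]
    # W stacks a static window (identity row) and a delta window per frame:
    # Delta-f_t = 0.5 * (f_{t+1} - f_{t-1}), with clamped edges.
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                      # static row
        lo, hi = max(t - 1, 0), min(t + 1, T - 1)
        W[2 * t + 1, hi] += 0.5                # delta row
        W[2 * t + 1, lo] -= 0.5
    M = means.reshape(-1)                      # interleaved [f, delta-f] targets
    Sinv = 1.0 / variances.reshape(-1)         # diagonal covariances assumed
    A = W.T @ (Sinv[:, None] * W)
    b = W.T @ (Sinv * M)
    return np.linalg.solve(A, b)
```

Fed a stepwise sequence of state means (e.g. five frames at 100 Hz followed by five at 120 Hz, with zero delta means), the delta constraints smooth the step into a continuous contour instead of reproducing the piecewise-constant mean sequence.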
12. Intonation Synthesis from HMMs

[Figure: the system diagram from the Overview slide, highlighting the F0 generation path from syllable labels, syllable boundaries, and phonetic labels through the context-sensitive HMMs to the synthesized contour]
13. Perceptual Effects of Intonation Units

[Figure: pitch contours generated for different intonation label sequences (e.g. "a a c c" vs. "c c", and "a a c a" vs. "c fb"), illustrating the perceptual effect of accent and boundary placement]
14. Pitch Contour Samples
- Generated neutral contours transplanted onto unseen utterances.

[Figure: original vs. synthesized contours for the tri-unit and full-context models]
15. MLLR Adaptation to Emotional Speech
- Maximum Likelihood Linear Regression (MLLR) adaptation computes a set of linear transformations for the mean and variance parameters of a continuous HMM.
- The number of transforms is determined by a regression tree and a threshold on what counts as enough adaptation data.
- Adaptation data comes from the Emotional Prosody Corpus, which consists of four-syllable phrases spoken in a variety of emotions.
- Happy and sad speech were chosen for this experiment.

[Figure: a seven-node regression tree; when a node has insufficient data, the transformation from its parent node is used]
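The mean update at a single regression-tree node can be sketched as follows. This is a simplified sketch, not the full MLLR derivation: one global regression class and unit variances are assumed, under which the ML estimate of the affine transform mu' = A mu + b reduces to occupancy-weighted least squares; all function names are ours:

```python
import numpy as np

def mllr_mean_transform(neutral_means, adapt_stats):
    """Estimate a single global MLLR-style mean transform mu' = A mu + b.

    neutral_means: (S, D) state means of the neutral HMMs.
    adapt_stats: per-state (gamma, obs_sum) pairs, where gamma is the total
        state occupancy in the adaptation data and obs_sum the summed
        adaptation observations aligned to that state.
    Returns W of shape (D, D+1) such that mu' = W @ [1, mu].
    """
    S, D = neutral_means.shape
    X = np.hstack([np.ones((S, 1)), neutral_means])   # extended means [1, mu]
    G = np.zeros((D + 1, D + 1))
    K = np.zeros((D + 1, D))
    for s, (gamma, obs_sum) in enumerate(adapt_stats):
        G += gamma * np.outer(X[s], X[s])             # accumulate normal equations
        K += np.outer(X[s], obs_sum)                  # obs_sum already carries gamma
    return np.linalg.solve(G, K).T                    # (D, D+1)

def adapt_means(neutral_means, W):
    """Apply the transform to every neutral state mean."""
    X = np.hstack([np.ones((neutral_means.shape[0], 1)), neutral_means])
    return X @ W.T
```

With enough occupied states, the transform shifts and rotates the whole neutral mean space toward the emotional data, which is what lets a small amount of happy or sad speech adapt every context-sensitive model at once.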
16. MLLR Adaptation to Happy and Sad Data

[Figure: neutral, sad, and happy contours generated for the same syllable label sequences (c ... c arb ... c), showing the effect of adaptation]
17. Perceptual Tests
- Test 1: How natural are the neutral contours?
- Ten listeners were asked to rate utterances in terms of naturalness of intonation.
- Some utterances were unmodified and others had synthetic contours.
- A t-test (p < 0.05) on the data showed that the distributions of ratings for the two hidden groups overlap sufficiently, i.e. there is no significant difference in terms of quality.
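The comparison in Test 1 can be sketched with a two-sample t-statistic. The rating arrays below are hypothetical (the slide reports only the outcome), and the slide does not say which t-test variant was used; this sketch implements Welch's unequal-variance form:

```python
import numpy as np

def welch_t(a, b):
    """Welch's two-sample t-statistic and degrees of freedom, for comparing
    naturalness ratings of unmodified vs. synthetic-contour utterances."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

A small |t| relative to the critical value at the resulting degrees of freedom is what "the two hidden groups overlap sufficiently" amounts to: the mean ratings cannot be distinguished at the p < 0.05 level.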
18. Perceptual Tests
- Test 2: Does adaptation work?
- The goal is to find out whether the adapted models produce contours that people perceive as more emotional than the neutral contours.
- Given pairs of utterances, 14 listeners were asked to identify the happier/sadder one.
19. Perceptual Tests
- Utterances with sad contours were identified 80% of the time; this was significant (p < 0.01).
- Listeners formed a bimodal distribution in their ability to detect happy utterances. Overall, only 46% of the happy intonation was identified as happier than neutral. (The "smiling voice" is notoriously difficult in the literature.)
- Happy models worked better with utterances containing more accents and rising boundaries: the organization of labels matters!
20. Summary and Future Direction
- A statistical approach to prosody generation was proposed, with an initial focus on F0 contours.
- The results of the perceptual tests were encouraging and yielded guidelines for future direction:
- Bypass the use of perceptual labels; use lexical stress information as a prior in automatic labelling of corpora.
- Investigate the role of emotion in accent frequency to build a "language model" of emotion.
- Duration modelling: evaluate the HSMM framework as well as duration adaptation using vowel-specific conversion functions.
- Voice source modelling: treat LF parameters as part of prosody.
- Investigate the use of graphical models to allow hierarchical constraints on generated parameters.
- Incorporate the framework into one or more TTS systems.