Zeynep Inanoglu - PowerPoint PPT Presentation

About This Presentation
Title:

Zeynep Inanoglu

Description:

A Statistical Approach To Emotional Prosody Generation. Zeynep Inanoglu, Machine Intelligence Laboratory, CU Engineering Department. Supervisor: Prof. Steve Young.

Slides: 21
Provided by: Zeyn8
Transcript and Presenter's Notes

1
A Statistical Approach To Emotional Prosody Generation
  • Zeynep Inanoglu
  • Machine Intelligence Laboratory
  • CU Engineering Department
  • Supervisor: Prof. Steve Young

2
Agenda
  • Previous Toshiba Update
  • A Review of Emotional Speech Synthesis
  • Motivation for Proposed Approach
  • Proposed Approach: Intonation Generation from
    Syllable HMMs
  • Intonation Models and Training
  • Recognition Performance of Intonation Units
  • Intonation Synthesis from HMMs
  • MLLR-based Intonation Adaptation
  • Perceptual Tests
  • Summary and Future Direction

3
Previous Toshiba Update: A Brief Review
  • Emotion Recognition
  • Demonstrated work on HMM-based emotion detection
    in voicemail messages (Emotive Alert).
  • Reported the set of acoustic features that
    maximizes classification accuracy for each emotion
    type, identified using the sequential forward
    floating selection algorithm.
  • Expressive Speech Synthesis
  • Demonstrated the importance of prosody in
    emotional expression through copy-synthesis of
    emotional prosody onto neutral utterances.
  • Suggested linguistically descriptive intonation
    units (accents, boundary tones) for prosody
    modelling.

4
A Review of Emotional Synthesis
  • The importance of prosody in emotional expression
    has been confirmed. (Banse & Scherer, 1996;
    Mozziconacci, 1998)
  • The available prosody rules are mainly defined
    for global parameters. (mean pitch, pitch range,
    speaking rate, declination)
  • Interaction of linguistic units and emotion is
    largely untested. (Banziger, 2005)
  • Strategies for emotional synthesis vary based on
    the type of synthesizer.
  • Formant Synthesis: allows control over various
    segmental and prosodic parameters. Emotional
    prosody rules extracted from the literature are
    applied by modifying neutral synthesizer
    parameters. (Cahn, 1990; Burkhardt, 2000;
    Murray & Arnott, 1995)
  • Diphone Synthesis: allows prosody control by
    defining target contours and durations based on
    emotional prosody rules. (Schroeder, 2004;
    Burkhardt, 2005)
  • Unit-Selection Synthesis: provides minimal
    parametric flexibility. Attempts at emotional
    expression involve recording entire unit
    databases for each emotion and selecting units
    from the appropriate database at run time. (Iida
    et al., 2003)
  • HMM Synthesis: allows spectral and prosodic
    control at the segmental level. Provides a
    statistical framework for modelling emotions.
    (Tsuzuki et al., 2004)

5
A Review of Emotional Synthesis
[Chart: synthesis approaches arranged by METHOD and GRANULARITY]
  • Unit-Selection Synthesis: very good quality, but
    not scalable (too much effort).
  • HMM Synthesis: statistical, but too granular for
    prosody modelling.
  • Formant Synthesis / Diphone Synthesis: only as
    good as hand-crafted rules; poor to medium
    baseline quality.
  • METHOD labels: Rule-Based, Unit Replication,
    Statistical.
  • GRANULARITY labels: Segmental, Intonational
    (syllable/phrase), Global.
  • One region of the chart is marked Unexplored.
6
Motivation For Proposed Approach [1]
  • We propose a generative model of prosody.
  • We envision evaluating this prosodic model in a
    variety of synthesis contexts through signal
    manipulation schemes such as TD-PSOLA.
  • Statistical
  • Rule-based systems are only as good as their
    hand-crafted rules. Why not learn rules from
    data?
  • Success of HMM methods in speech synthesis.
  • Syllable-based
  • Pitch movements are most relevant on the syllable
    or intonational phrase level. However, the
    effects of emotion on contour shapes and
    linguistic units are largely unexplored.
  • Linguistic Units of Intonation
  • Coupling of emotion and linguistic phenomena has
    not been investigated.

[1] This work will be published in the Proceedings
of ACII, October 2005, Beijing.
7
Overview
  • Step 1: Train intonation models on neutral data.
  • Step 2: Generate intonation contours from HMMs.
  • Step 3: Adapt models given a small amount of
    emotion data.
  • Step 4: Transplant the contour onto an utterance.
  • Current focus is on pitch modelling only.
  • Syllable-based intonation models.

[Diagram components: Neutral Speech Data, Training,
Context Sensitive HMMs, Emotion Data, MLLR, Emotion
HMMs, Mean Pitch, Syllable Labels, Syllable
Boundaries, Phonetic Labels, F0 Generation,
Synthesized Contour, TD-PSOLA]

Example syllable labels from the diagram:
    1   1.5 c
    1.5 1.9 a
    1.9 2.3 c
    2.3 2.5 rb
8
Intonation Models and Training
  • Basic Models
  • Seven basic models: A (accent), C (unstressed),
    RB (rising boundary), FB (falling boundary), ARB,
    AFB, SIL
  • Context-Sensitive models
  • Tri-unit models (preceding and following
    intonation unit)
  • Full-context models (position of syllable in
    intonational phrase, forward counts of accents,
    boundary tones in IP, position of vowel in
    syllable, number of phones in the syllable)
  • Decision tree-based parameter tying was performed
    for context-sensitive models.
  • Data: Boston Radio Corpus.
  • Features: Normalized raw F0 and energy values as
    well as their differentials.
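
The feature bullet above can be illustrated with a minimal sketch. The slides do not say which normalization or differential windows were used, so the z-score normalization, the two-frame difference and the function names below are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def z_normalize(values):
    """Scale a track to zero mean and unit variance (assumed normalization)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def delta(track):
    """First differential: half the difference of the neighbouring frames.
    Edge frames reuse the nearest available value (assumed edge handling)."""
    n = len(track)
    return [0.5 * (track[min(t + 1, n - 1)] - track[max(t - 1, 0)])
            for t in range(n)]


def observation_vectors(f0, energy):
    """Per frame: [f0, d f0, dd f0, energy, d energy, dd energy]."""
    streams = []
    for raw in (f0, energy):
        static = z_normalize(raw)
        first = delta(static)
        second = delta(first)
        streams.extend([static, first, second])
    return [list(frame) for frame in zip(*streams)]
```

Each syllable would then contribute a sequence of such six-dimensional vectors to HMM training.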

9
Recognition Results
  • Evaluation of models was performed in a
    recognition framework to assess how well the
    models represent intonation units and to quantify
    the benefits of incorporating context.
  • A held-out test set was used for predicting
    intonation sequences
  • Basic models were tested with a varying number
    of mixture components. This was compared with
    the accuracy rates of full-context models.

Label set                                            Mixtures   Corr    Acc
Basic Label Set (7 models, 3 emitting states)        N = 1      53.26   44.52
Basic Label Set (7 models, 3 emitting states)        N = 2      53.36   45.48
Basic Label Set (7 models, 3 emitting states)        N = 4      54.65   46.31
Basic Label Set (7 models, 3 emitting states)        N = 10     59.58   50.40
Full Context Label Set, Decision Tree Based Tying    -          64.02   55.88
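
The slides do not define Corr and Acc. A minimal sketch, assuming they follow the usual HTK-style definitions (Corr ignores insertions, Acc penalizes them), shows how the two figures relate. The function name and the counts in the usage note are hypothetical.

```python
def corr_and_acc(n_labels, deletions, substitutions, insertions):
    """Return (percent correct, percent accuracy) for a recognized label sequence.

    Assumed definitions:
      hits = N - D - S
      Corr = 100 * hits / N
      Acc  = 100 * (hits - I) / N
    """
    hits = n_labels - deletions - substitutions
    corr = 100.0 * hits / n_labels
    acc = 100.0 * (hits - insertions) / n_labels
    return corr, acc
```

For example, `corr_and_acc(1000, 100, 250, 90)` uses made-up counts; under these definitions the gap between Corr and Acc in each table row would correspond to the insertion rate.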
10
Intonation Synthesis From HMM
  • The goal is to generate an optimal sequence of
    observations directly from syllable HMMs given
    the intonation models.
  • The optimal state sequence $Q_{max}$ is
    predetermined by basic duration models. So the
    parameter generation problem becomes
        $O_{max} = \arg\max_{O} P(O \mid Q_{max}, \lambda)$
  • The solution is the sequence of mean vectors for
    the state sequence $Q_{max}$.
  • We used the cepstral parameter generation
    algorithm of the HTS system for interpolated F0
    generation. (Tokuda et al., 1995)
  • Differential F0 features (Δf and ΔΔf) are used as
    constraints in contour generation. Maximization
    is done for static parameters only.

11
Intonation Synthesis From HMM
  • A single observation vector consists of static
    and dynamic features:
        $o_t = [f_t, \Delta f_t, \Delta\Delta f_t]^T$
  • The relationship between the static and dynamic
    features is as follows:
        $\Delta f_t = \sum_{\tau} w^{(1)}(\tau) f_{t+\tau}$
        $\Delta\Delta f_t = \sum_{\tau} w^{(2)}(\tau) f_{t+\tau}$
  • This relationship can be expressed in matrix form
    $O = WF$, where O is the sequence of full
    feature vectors and F is the sequence of static
    features only. W is the matrix form of the window
    functions. The maximization problem then becomes
        $F_{max} = \arg\max_{F} P(WF \mid Q_{max}, \lambda)$
  • The solution is a set of equations that can be
    solved in a time-recursive manner. (Tokuda et al.,
    1995)
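
The generation step can be sketched in a few lines, under several assumptions not taken from the slides: single-Gaussian states with diagonal covariance, the window coefficients in `WINDOWS`, edge frames replicated at the boundaries, and a direct solve of the normal equations in place of the time-recursive solution the slides cite. All names are hypothetical.

```python
# Assumed window coefficients (offset -> weight); the slides do not give them.
WINDOWS = [
    {0: 1.0},                    # static
    {-1: -0.5, 1: 0.5},          # delta
    {-1: 1.0, 0: -2.0, 1: 1.0},  # delta-delta
]


def build_window_matrix(num_frames):
    """Return W (3 rows per frame) such that O = W F."""
    rows = []
    for t in range(num_frames):
        for window in WINDOWS:
            row = [0.0] * num_frames
            for offset, coeff in window.items():
                tau = min(max(t + offset, 0), num_frames - 1)  # replicate edges
                row[tau] += coeff
            rows.append(row)
    return rows


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def generate_contour(means, variances):
    """means, variances: per-frame [static, delta, delta-delta] Gaussian
    parameters read off the predetermined state sequence.
    Returns the static F0 track F that maximizes the likelihood of W F."""
    num_frames = len(means)
    w = build_window_matrix(num_frames)
    mu = [m for frame in means for m in frame]
    prec = [1.0 / v for frame in variances for v in frame]
    rows = range(len(w))
    # Normal equations: (W' P W) F = W' P mu, with P the diagonal precisions.
    lhs = [[sum(w[k][i] * prec[k] * w[k][j] for k in rows)
            for j in range(num_frames)] for i in range(num_frames)]
    rhs = [sum(w[k][i] * prec[k] * mu[k] for k in rows)
           for i in range(num_frames)]
    return solve(lhs, rhs)
```

Because the delta rows couple neighbouring frames, the generated track is no longer the piecewise-constant sequence of state means; the dynamic-feature statistics pull adjacent frames toward a smooth contour.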

12
Intonation Synthesis From HMM
[Diagram repeated from the Overview slide, without
the TD-PSOLA stage]
13
Perceptual Effects of Intonation Units
[Examples with syllable label sequences:
a a c c c c; a a c a c fb]
14
Pitch Contour Samples
  • Generated Neutral Contours Transplanted on Unseen
    Utterances

[Plots: original vs. synthesized contours, for
tri-unit and full-context models]
15
MLLR Adaptation to Emotional Speech
  • Maximum Likelihood Linear Regression (MLLR)
    adaptation computes a set of linear
    transformations for the mean and variance
    parameters of a continuous HMM.
  • The number of transforms is based on a
    regression tree and a threshold for what is
    considered enough adaptation data.
  • Adaptation data came from the Emotional Prosody
    Corpus, which consists of four-syllable phrases
    in a variety of emotions.
  • Happy and sad speech were chosen for this
    experiment.

[Diagram: regression tree with nodes 1 to 7. Where a
node does not have enough data, the transformation
from its parent node is used.]
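
A minimal sketch of the parent-node fallback and of applying a mean transform, assuming a binary tree layout for nodes 1 to 7 and a frame-count threshold. The tree shape, counts, threshold and function names are hypothetical, and only the mean transform (new mean = A·mean + b) is shown, not the variance transform.

```python
# Hypothetical tree layout: child -> parent (node 1 is the root).
PARENTS = {2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}

# Hypothetical amount of adaptation data (frames) seen at each node.
FRAME_COUNTS = {1: 2000, 2: 950, 3: 1050, 4: 50, 5: 900, 6: 40, 7: 1010}


def pick_node(node, parents, frame_counts, threshold):
    """Walk up the regression tree until a node has enough adaptation data.
    The root is assumed to always carry a (global) transform."""
    while frame_counts.get(node, 0) < threshold and node in parents:
        node = parents[node]
    return node


def adapt_mean(mean, transform):
    """Apply an MLLR mean transform (A, b): new_mean = A @ mean + b."""
    a, b = transform
    return [sum(a_ij * m for a_ij, m in zip(row, mean)) + b_i
            for row, b_i in zip(a, b)]
```

With a threshold of 100 frames, `pick_node(4, PARENTS, FRAME_COUNTS, 100)` falls back from leaf 4 to its parent, while leaf 5 keeps its own transform.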
16
MLLR Adaptation To Happy & Sad Data
[Plots: Neutral, Sad and Happy contours; syllable
labels shown: c, c, c, arb, c, c]
17
Perceptual Tests
  • Test 1: How natural are neutral contours?
  • Ten listeners were asked to rate utterances in
    terms of naturalness of intonation.
  • Some utterances were unmodified and others had
    synthetic contours.
  • A t-test (p < 0.05) on the data showed that the
    distributions of ratings for the two hidden
    groups overlap sufficiently, i.e. there is no
    significant difference in terms of quality.

18
Perceptual Tests
  • Test 2: Does adaptation work?
  • The goal is to find out if adapted models produce
    contours that people perceive to be more
    emotional than the neutral contours.
  • Given pairs of utterances, 14 listeners were
    asked to identify the happier/sadder one.

19
Perceptual Tests
  • Utterances with sad contours were identified 80%
    of the time. This was significant. (p < 0.01)
  • Listeners formed a bimodal distribution in their
    ability to detect happy utterances. Overall only
    46% of the happy intonation was identified as
    happier than neutral. (Smiling voice is infamous
    in the literature)
  • Happy models worked better with utterances with
    more accents and rising boundaries: the
    organization of labels matters!
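
The slides do not say which significance test was applied to the forced-choice results. One common option for a two-alternative task, sketched here purely as an assumption, is a one-sided exact binomial test against the 50% chance rate. The function name is hypothetical, and any trial counts passed in would have to come from the actual experiment.

```python
from math import comb


def binomial_p_value(successes, trials, chance=0.5):
    """One-sided exact p-value: P(X >= successes) when every response
    is a coin flip with the given chance rate."""
    return sum(comb(trials, k) * chance ** k * (1 - chance) ** (trials - k)
               for k in range(successes, trials + 1))
```

A small p-value would indicate that listeners picked the adapted contour more often than chance alone would explain.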

20
Summary and Future Direction
  • A statistical approach to prosody generation was
    proposed with an initial focus on F0 contours.
  • The results of the perceptual tests were
    encouraging and yielded guidelines for future
    direction:
  • Bypass the use of perceptual labels. Use lexical
    stress information as a prior in automatic
    labelling of corpora.
  • Investigate the role of emotion on accent
    frequency to come up with a Language Model of
    emotion.
  • Duration Modelling: Evaluate the HSMM framework
    as well as duration adaptation by using
    vowel-specific conversion functions.
  • Voice Source Modelling: Treat LF parameters as
    part of prosody.
  • Investigate the use of graphical models for
    allowing hierarchical constraints on generated
    parameters.
  • Incorporate the framework into one or more TTS
    systems.