1
Data-Driven Prosody and Voice Quality Generation
for Emotional Speech
  • Zeynep Inanoglu & Steve Young
  • Machine Intelligence Lab
  • CUED

2
Project Goal Restated
  • Given a neutral utterance and all the known
    features that can be extracted from a TTS
    framework, convert the utterance into a specified
    emotion with zero degradation in quality.
  • Particularly useful for unit-selection
    synthesizers with good-quality neutral output.
  • What can be modified?
  • Prosody (F0, duration, prominence structure)
  • Voice quality (spectral features relating to the
    voice source and filter)
  • Method: data-driven learning and generation
    (decision trees, HMMs)
  • This presentation addresses both the issues of
    prosody generation and voice quality
    modification.

3
Data Specifics: Female Speaker
  • Parallel data used: 595 utterances of each
    emotion.
  • 545 utterances were used for training the
    different modules; 50 were set aside as the test
    set.
  • Features extracted (summarized in the sketch
    after this list):
  • Text-based features
  • Phone identity, lexical stress, syllable
    position, word length, word position, part of
    speech.
  • Perceptual features (Intonation Units)
  • ToBI-based syllable labels alh, ah, al, c
    represent three accent types and one symbol (c)
    for unaccented syllables.
  • Automatically extracted by Prosodizer.
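
The extracted features can be pictured as one record per syllable; the field names below are illustrative placeholders, not the actual Prosodizer output format.

from dataclasses import dataclass

@dataclass
class SyllableFeatures:
    # Text-based features (hypothetical field names)
    phones: list             # phone identities in the syllable
    lexical_stress: int      # lexical stress marker
    syllable_position: int   # position of the syllable within the word
    word_length: int         # word length in syllables
    word_position: int       # position of the word in the utterance
    part_of_speech: str      # POS tag of the containing word
    # Perceptual feature (intonation unit)
    intonation_unit: str     # ToBI-based label: "alh", "ah", "al" or "c"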

4
Step 1: Convert Intonation Unit Sequence
Diagram: neutral intonation sequence (alh c c c ah c al c) + text-based
features + target emotion -> Sequence Conversion -> converted sequence
(alh c alh c ah c alh c)
  • ASSUMPTION: Each emotion has an intrinsic pattern
    in its intonation unit sequence (e.g. surprised
    utterances use alh much more frequently than
    neutral).
  • For each input neutral unit and its context, we
    want to find the corresponding unit in a target
    emotion.
  • Training: decision trees trained on parallel
    data.
  • This is similar to sliding a decision tree along
    the utterance and generating an output at each
    syllable, as sketched below.
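
A minimal sketch of this sliding-tree conversion, assuming scikit-learn; the context window, the numeric encoding of the text-based features (the hypothetical lists in text_feats) and the tree settings are illustrative, not the exact configuration used in the project.

from sklearn.tree import DecisionTreeClassifier

LABELS = ["alh", "ah", "al", "c"]          # intonation units

def context_features(units, feats, i):
    """Encode the current unit, its neighbours and the (already numeric)
    text-based features of syllable i as one input vector."""
    enc = lambda u: LABELS.index(u) if u in LABELS else -1
    prev_u = units[i - 1] if i > 0 else "c"
    next_u = units[i + 1] if i + 1 < len(units) else "c"
    return [enc(prev_u), enc(units[i]), enc(next_u)] + list(feats[i])

def train_sequence_converter(neutral_seqs, text_feats, target_seqs):
    """One tree per target emotion, trained on parallel data:
    input = neutral unit in context, output = target-emotion unit."""
    X, y = [], []
    for units, feats, targets in zip(neutral_seqs, text_feats, target_seqs):
        for i in range(len(units)):
            X.append(context_features(units, feats, i))
            y.append(targets[i])
    return DecisionTreeClassifier(min_samples_leaf=5).fit(X, y)

def convert_sequence(tree, units, feats):
    """Slide along the neutral utterance, predicting one target unit
    per syllable."""
    return [tree.predict([context_features(units, feats, i)])[0]
            for i in range(len(units))]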

5
Step 1: Sequence Conversion Results
  • Sequence prediction accuracy is computed on the
    test data as the percentage of units that match
    between the converted and target sequences
    (100 minus the substitution error); a small sketch
    follows after the table.
  • As a benchmark, the unchanged neutral sequence is
    also compared to the target sequence.
  • Sequence conversion improves accuracy for happy,
    angry and surprised and doesn't change the results
    for sad.

Sequence Prediction Accuracy (%)
Emotion      Neutral Sequence   Converted Sequence
happy             61.90               66.52
sad               60.81               60.04
angry             60.26               72.45
surprised         53.23               61.03
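
The accuracy figures above correspond to a straightforward per-syllable match count on the parallel test sequences; a small sketch:

def sequence_accuracy(predicted, target):
    """Percentage of intonation units matching the target sequence
    (100 minus the substitution error), assuming equal-length
    parallel sequences."""
    matches = sum(p == t for p, t in zip(predicted, target))
    return 100.0 * matches / len(target)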
6
Step 2: Intonation Model Training
Diagram: converted intonation sequence (alh c alh c) + text-based features
-> Intonation Models -> F0 contour
  • Each syllable's intonation is modelled as a
    three-state left-to-right HMM.
  • Each model is highly context-sensitive.
  • An example model name:
    alh-lex@1-wpos@1-spos@3-pofs@3-word_len@3-syltype@2
  • Syllable models were trained on interpolated F0
    and energy values derived from the laryngograph
    signal, together with their first- and
    second-order differentials
    (F0, E, DF0, DDF0, DE, DDE), as sketched below.
  • Decision tree-based parameter tying was
    performed.
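
The six-dimensional observation vectors (F0, E, DF0, DDF0, DE, DDE) can be assembled as in the generic sketch below; the actual delta windowing used in training is not specified on the slide, so np.gradient stands in for it.

import numpy as np

def observation_vectors(f0, energy):
    """Stack interpolated F0 and energy with simple first- and
    second-order differentials: (F0, E, DF0, DDF0, DE, DDE)."""
    f0 = np.asarray(f0, dtype=float)
    energy = np.asarray(energy, dtype=float)
    d = np.gradient            # central-difference approximation of the delta
    return np.column_stack([f0, energy,
                            d(f0), d(d(f0)),
                            d(energy), d(d(energy))])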

7
Step 2: Model-Based Intonation Generation
  • The goal is to generate an optimal sequence of F0
    values directly from the context-sensitive syllable
    HMMs, given the intonation unit sequence.
  • This results in a sequence of mean state values.
  • The cepstral parameter generation algorithm of the
    HTS system is used for interpolated F0 generation
    (Tokuda et al., 1995).
  • Differential F0 features are used as constraints in
    contour generation, which results in smoother
    contours, as sketched below.
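
The generation idea (Tokuda et al., 1995) is to solve for the static F0 trajectory that is jointly most likely under the static and delta state distributions. A minimal dense-matrix sketch under that assumption, with second-order deltas omitted and the state-level means and variances already expanded to one value per frame (the real algorithm exploits band structure and handles all streams):

import numpy as np

def generate_f0(mu_static, mu_delta, var_static, var_delta):
    """Solve (W' U^-1 W) c = W' U^-1 mu for the static contour c, where W
    stacks the identity (static stream) on a central-difference matrix
    (delta stream) and U is the diagonal matrix of variances."""
    T = len(mu_static)
    D = np.zeros((T, T))
    for t in range(T):                       # delta[t] ~ (c[t+1] - c[t-1]) / 2
        D[t, min(t + 1, T - 1)] += 0.5
        D[t, max(t - 1, 0)] -= 0.5
    W = np.vstack([np.eye(T), D])
    mu = np.concatenate([mu_static, mu_delta])
    u_inv = np.diag(1.0 / np.concatenate([var_static, var_delta]))
    A = W.T @ u_inv @ W
    b = W.T @ u_inv @ mu
    return np.linalg.solve(A, b)             # smoothed F0 contour

Because the delta constraints couple neighbouring frames, the solution no longer jumps between state means, which is the smoothing effect mentioned above.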

8
Step 2: Model-Based Intonation Generation
  • It is difficult to obtain an objective measure for
    F0 comparison that is perceptually relevant.
  • A simple approach is to align the target and the
    model-generated contours so that they both have N
    pitch points and to measure the RMS error per
    utterance; a sketch follows after the table below.
  • The average RMS error for all generated test
    contours is given below. The first row gives the
    error between the neutral contour and the target
    contour as a benchmark.

RMSE in Hz              Surprised   Angry   Happy   Sad
RMSE (neutral, target)     93.7     138.8    68.8   34.8
RMSE (gen, target)         64.7      68.3    65.6   27.2
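
A sketch of the RMSE measure described above, assuming both contours are linearly resampled to the same number of pitch points:

import numpy as np

def contour_rmse(f0_a, f0_b, n_points=100):
    """Resample both contours to n_points and return the RMS error in Hz."""
    def resample(f0):
        f0 = np.asarray(f0, dtype=float)
        return np.interp(np.linspace(0.0, 1.0, n_points),
                         np.linspace(0.0, 1.0, len(f0)), f0)
    a, b = resample(f0_a), resample(f0_b)
    return float(np.sqrt(np.mean((a - b) ** 2)))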
9
Step 3: Duration Tree Training
Diagram: converted intonation sequence (alh c alh c) + text-based features
-> Duration Trees -> duration tier
  • A decision tree was built for each voiced broad
    class: vowels, nasals, glides and fricatives.
  • All text-based features and intonation units were
    used to build the trees.
  • The features that were most significant varied
    with emotion and phone class.
  • For each test utterance a duration tier was
    constructed by taking the ratio of the predicted
    duration to the neutral duration, as sketched
    below.
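
A minimal sketch of the duration tier construction; the broad-class lookup, the per-phone featurize() helper and the phone dictionary keys are hypothetical, and the trees are assumed to be regression trees predicting a duration.

def duration_tier(phones, neutral_durations, trees, featurize):
    """One scaling factor per phone: predicted emotional duration divided
    by the observed neutral duration. `trees` maps a voiced broad class
    ("vowel", "nasal", "glide", "fricative") to a trained tree."""
    tier = []
    for phone, neutral_dur in zip(phones, neutral_durations):
        tree = trees.get(phone["broad_class"])
        if tree is None:                      # unvoiced phone: leave unchanged
            tier.append(1.0)
            continue
        predicted = tree.predict([featurize(phone)])[0]
        tier.append(predicted / neutral_dur)
    return tier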

10
Step 3: Duration Trees - Evaluation
  • Assume a Poisson distribution at the leaves of the
    decision tree: P(d) = λ^d e^(-λ) / d!,
    where λ is the mean of the leaf node
    (the predicted duration).
  • Measure the performance of the duration trees by
    using them as a classifier.
  • How likely is it for happy durations to be
    generated by the neutral/happy/sad trees?
  • We want each diagonal entry to be the maximum
    (most likely) value in its row; a sketch of the
    scoring follows after the table.

Mean Log Likelihood of Test Data (rows: actual durations, columns: decision tree)
Actual Duration    Neutral    Happy      Sad    Surprised    Angry
Neutral             -9.28    -10.18   -11.14     -11.01     -11.49
Happy              -11.52    -10.12   -12.01     -11.41     -11.87
Sad                -14.35    -13.04   -12.05     -14.90     -15.29
Surprised          -12.13    -10.86   -13.13     -10.23     -11.39
Angry              -12.82    -11.17   -13.37     -11.53     -10.95
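
The scoring behind the table can be sketched as below: log P(d) = d·log λ − λ − log d! for a Poisson leaf with mean λ, averaged over the test data; how each observation is routed to its leaf mean is left as a hypothetical lookup.

import math

def poisson_log_likelihood(d, lam):
    """log P(d | leaf) for a Poisson leaf with mean lam (duration d as an
    integer count, e.g. frames)."""
    return d * math.log(lam) - lam - math.lgamma(d + 1)

def mean_log_likelihood(durations, leaf_means):
    """Average log-likelihood of observed durations under one emotion's
    trees; leaf_means[i] is the mean of the leaf reached by observation i."""
    return sum(poisson_log_likelihood(d, lam)
               for d, lam in zip(durations, leaf_means)) / len(durations)

A well-matched tree set then gives the highest (least negative) mean log-likelihood, which is what the diagonal of the table shows.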
11
Overview of Run-Time System
Diagram: neutral sequence (alh c c c ah c al c) + text-based features +
target emotion -> Sequence Conversion -> converted sequence
(alh c alh c ah c alh c); the converted sequence, text-based features and
target emotion then drive the Intonation Models (F0 contour) and the
Duration Trees (duration tier), and both are applied to the neutral signal
with TD-PSOLA (Praat). A pseudo-pipeline sketch follows below.
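
Read as code, the run-time flow amounts to composing the three modules and handing the result to TD-PSOLA; the sketch below is only an illustration of that composition, with every component passed in as a hypothetical callable.

def convert_utterance(neutral_wav, text_feats, neutral_units,
                      convert_sequence, generate_contour,
                      predict_duration_tier, apply_td_psola):
    """End-to-end prosodic conversion of one neutral utterance; each
    callable argument stands for one module described on earlier slides."""
    target_units = convert_sequence(neutral_units, text_feats)    # Step 1
    f0_contour = generate_contour(target_units, text_feats)       # Step 2
    dur_tier = predict_duration_tier(target_units, text_feats)    # Step 3
    return apply_td_psola(neutral_wav, f0_contour, dur_tier)      # TD-PSOLA (Praat)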
12
Prosodic Conversion - Samples
Audio samples: Neutral, Happy, Sad, Surprised, Angry
13
Experiments With Voice Quality
  • Analysis of long-term average spectra for vowels.
  • Pitch-synchronous analysis (single pitch-period
    frames).
  • Total power in each frame is normalized to a
    constant value; a sketch of this analysis follows
    after the figure below.
  • Anger has significantly more energy in the
    1550-2500 Hz band and less in 0-800 Hz.
  • Sadness has a sharper spectral tilt and more
    low-frequency energy.
  • Happy and surprised follow similar spectral
    patterns.

Figure: long-term average spectra for the vowel /ae/
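
A sketch of the spectral analysis described above: single pitch-period frames, each normalized to constant total power, averaged into a long-term spectrum whose band energies can then be compared across emotions. NumPy-based and illustrative; the pitch-synchronous frame extraction itself is not shown.

import numpy as np

def long_term_average_spectrum(frames, n_fft=1024):
    """Average power spectrum over single pitch-period frames, after
    normalizing each frame's total power to a constant value."""
    spectra = []
    for frame in frames:
        frame = np.asarray(frame, dtype=float)
        frame = frame / (np.sqrt(np.sum(frame ** 2)) + 1e-12)   # unit energy
        spectra.append(np.abs(np.fft.rfft(frame, n_fft)) ** 2)
    return np.mean(spectra, axis=0)

def band_energy(ltas, sample_rate, lo_hz, hi_hz, n_fft=1024):
    """Summed LTAS energy in a band, e.g. 1550-2500 Hz vs 0-800 Hz."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    return float(np.sum(ltas[(freqs >= lo_hz) & (freqs < hi_hz)]))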
14
Upcoming work
  • More experiments with voice quality modification:
  • Decision tree-based filter bank generation
    approach.
  • Combine voice quality processing with prosody
    generation.
  • Application of the techniques to the MMJ (male)
    corpus; performance comparison across gender.
  • Perceptual study:
  • Acquire recognition scores across emotions,
    gender and feature set.
  • Miscellaneous: application of the FSP and MMJ
    models to a new speaker/language.