Title: Data-Driven Prosody and Voice Quality Generation for Emotional Speech
1. Data-Driven Prosody and Voice Quality Generation for Emotional Speech
- Zeynep Inanoglu, Steve Young
- Machine Intelligence Lab
- CUED
2. Project Goal Restated
- Given a neutral utterance and all the known features that can be extracted from a TTS framework, convert the utterance into a specified emotion with zero degradation in quality.
- Particularly useful for unit-selection synthesizers with good-quality neutral output.
- What can be modified?
  - Prosody (F0, duration, prominence structure)
  - Voice quality (spectral features relating to the voice source/filter)
- Method: data-driven learning and generation (decision trees, HMMs).
- This presentation addresses both prosody generation and voice quality modification.
3. Data Specifics (Female Speaker)
- Parallel data: 595 utterances of each emotion.
- 545 utterances used for training the different modules; 50 set aside as the test set.
- Features extracted:
  - Text-based features: phone identity, lexical stress, syllable position, word length, word position, part of speech.
  - Perceptual features (intonation units): ToBI-based syllable labels alh, ah, al, c represent three accent types and one symbol for unaccented syllables; automatically extracted by Prosodizer.
4. Step 1: Convert the Intonation Unit Sequence
[Diagram: a neutral sequence "alh c c c ah c al c", text-based features and the target emotion enter Sequence Conversion, which outputs "alh c alh c ah c alh c"]
- ASSUMPTION: each emotion has an intrinsic pattern in its intonation unit sequence (e.g. surprised utterances use alh much more frequently than neutral).
- For each input neutral unit and its context, we want to find the corresponding unit in the target emotion.
- Training: decision trees trained on parallel data.
- This is similar to sliding a decision tree along the utterance and generating an output at each syllable.
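The per-syllable conversion above can be sketched as follows. A context-window lookup table learned from parallel data stands in for the paper's decision trees, and the one-neighbour context window and toy sequences are illustrative assumptions, not the authors' exact feature set:

```python
# Minimal sketch of per-syllable intonation-unit conversion. A context ->
# unit lookup table stands in for the slide's decision trees; the context
# window (previous, current, next unit) is an illustrative assumption.
from collections import Counter, defaultdict

def context(units, i):
    """Previous, current and next neutral unit around syllable i."""
    prev_u = units[i - 1] if i > 0 else "<s>"
    next_u = units[i + 1] if i < len(units) - 1 else "</s>"
    return (prev_u, units[i], next_u)

def train(parallel_pairs):
    """Learn a context -> most-frequent-target-unit map from parallel data."""
    counts = defaultdict(Counter)
    for neutral_seq, target_seq in parallel_pairs:
        for i, tgt in enumerate(target_seq):
            counts[context(neutral_seq, i)][tgt] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def convert(model, neutral_seq):
    """Slide along the utterance, emitting one output unit per syllable;
    unseen contexts fall back to the unchanged neutral unit."""
    return [model.get(context(neutral_seq, i), u)
            for i, u in enumerate(neutral_seq)]

# Toy parallel pair (from the slide): surprised uses "alh" more often.
neutral = ["alh", "c", "c", "c", "ah", "c", "al", "c"]
target  = ["alh", "c", "alh", "c", "ah", "c", "alh", "c"]
model = train([(neutral, target)])
print(convert(model, neutral))
```

A real decision tree would additionally branch on the text-based features listed on slide 3, which is what lets it generalize beyond seen unit contexts.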
5. Step 1: Sequence Conversion Results
- Sequence prediction accuracy is computed on the test data by measuring the number of units that match between the converted and target sequences (substitution error).
- As a benchmark, the unchanged neutral sequence is also compared to the target sequence.
- Sequence conversion improves accuracy for happy, angry and surprised, and barely changes the results for sad.
Sequence Prediction Accuracy (%)
Emotion     Neutral Sequence   Converted Sequence
happy       61.90              66.52
sad         60.81              60.04
angry       60.26              72.45
surprised   53.23              61.03
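The accuracy measure from the slide can be sketched directly; the toy sequences below are invented for illustration:

```python
# Sketch of the slide's metric: percentage of units in a predicted
# sequence that match the target sequence. Parallel sequences have equal
# length, so only substitutions occur. Toy sequences are illustrative.
def sequence_accuracy(predicted, target):
    assert len(predicted) == len(target)
    matches = sum(p == t for p, t in zip(predicted, target))
    return 100.0 * matches / len(target)

neutral   = ["alh", "c", "c",   "c", "ah", "c", "al",  "c"]
converted = ["alh", "c", "alh", "c", "ah", "c", "alh", "c"]
target    = ["alh", "c", "alh", "c", "ah", "c", "alh", "c"]
print(sequence_accuracy(neutral, target))    # benchmark: 75.0
print(sequence_accuracy(converted, target))  # converted:  100.0
```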
6. Step 2: Intonation Model Training
[Diagram: text-based features and the unit sequence "alh c alh c" enter the intonation models, which output an F0 contour]
- Each syllable's intonation is modelled as a three-state left-to-right HMM.
- Each model is highly context sensitive; an example model name: alhlex_at_1wpos_at_1spos_at_3pofs_at_3word_len_at_3syltype_at_2
- Syllable models are trained on interpolated F0 and energy values derived from the laryngograph signal, plus first- and second-order differentials: (F0, E, DF0, DDF0, DE, DDE).
- Decision tree-based parameter tying was performed.
7. Step 2: Model-Based Intonation Generation
- The goal is to generate an optimal sequence of F0 values directly from the context-sensitive syllable HMMs, given the intonation unit sequence.
- This yields a sequence of mean state values.
- The cepstral parameter generation algorithm of the HTS system is used for interpolated F0 generation (Tokuda et al., 1995).
- Differential F0 features are used as constraints in contour generation, resulting in smoother contours.
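The effect of the delta constraints can be illustrated with a toy least-squares version of trajectory generation. This is a simplified sketch in the spirit of Tokuda et al. (1995), not the HTS implementation: variances are fixed to 1, only first-order deltas are used, and the state means are invented:

```python
# Toy trajectory generation with delta constraints: find the contour c
# minimizing ||c - mu_static||^2 + ||delta(c) - mu_delta||^2, where
# delta(c)[t] = 0.5 * (c[t+1] - c[t-1]). Unit variances assumed; the
# step-like state means below are invented for illustration.
import numpy as np

def generate_f0(static_mu, delta_mu):
    T = len(static_mu)
    W_static = np.eye(T)                      # static F0 targets
    W_delta = np.zeros((T, T))                # first-difference window
    for t in range(1, T - 1):
        W_delta[t, t - 1], W_delta[t, t + 1] = -0.5, 0.5
    W = np.vstack([W_static, W_delta])
    mu = np.concatenate([static_mu, delta_mu])
    c, *_ = np.linalg.lstsq(W, mu, rcond=None)  # least-squares solve
    return c

# Two-state step in the mean F0; asking for delta ~ 0 smooths the jump.
static = np.array([200.0, 200.0, 200.0, 260.0, 260.0, 260.0])
delta = np.zeros(6)
print(np.round(generate_f0(static, delta), 1))
```

Without the delta rows the solution would simply reproduce the step in the state means; with them, the generated contour trades off fit to the static means against smoothness, which is the "smoother contours" effect noted above.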
8. Step 2: Model-Based Intonation Generation
- It is difficult to obtain an objective measure for F0 comparison that is perceptually relevant.
- A simple approach is to align the target and model-generated contours so that they both have N pitch points, and measure the RMS error per utterance.
- The average RMS error over all generated test contours is given below. The first row gives the error between the neutral and target contours as a benchmark.
RMSE (Hz)               Surprised   Angry   Happy   Sad
RMSE (neutral, target)     93.7     138.8    68.8   34.8
RMSE (gen, target)         64.7      68.3    65.6   27.2
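The alignment-then-RMSE evaluation can be sketched as below; the resampling by linear interpolation is one reasonable reading of "align so that both have N pitch points", and the contours are invented:

```python
# Sketch of the slide's evaluation: resample both contours to N pitch
# points by linear interpolation, then take the RMS error. Contours and
# N are illustrative.
import math

def resample(contour, n):
    """Linearly interpolate a contour onto n evenly spaced points."""
    m = len(contour)
    out = []
    for i in range(n):
        x = i * (m - 1) / (n - 1)
        lo = int(x)
        hi = min(lo + 1, m - 1)
        frac = x - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out

def rms_error(generated, target, n=100):
    g, t = resample(generated, n), resample(target, n)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(g, t)) / n)

gen    = [200.0, 220.0, 250.0, 230.0]        # toy generated contour (Hz)
target = [205.0, 215.0, 245.0, 240.0, 235.0]  # toy target contour (Hz)
print(round(rms_error(gen, target), 1))
```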
9. Step 3: Duration Tree Training
[Diagram: text-based features and the unit sequence "alh c alh c" enter the duration trees, which output a duration tier]
- A decision tree was built for each voiced broad class: vowels, nasals, glides and fricatives.
- All text-based features and intonation units were used to build the trees.
- The most significant features varied with emotion and phone class.
- For each test utterance, a duration tier was constructed by taking the ratio of predicted duration to neutral duration.
10. Step 3: Duration Trees - Evaluation
- Assume a Poisson distribution at the leaves of the decision tree: P(d) = λ^d e^(-λ) / d!, where λ is the mean of the leaf node (the predicted duration).
- Measure the performance of the duration trees by using them as a classifier: how likely are, say, happy durations to have been generated by the neutral/happy/sad trees?
- We want the diagonal entry to be the maximum (most likely, i.e. least negative log-likelihood) of each row.
Mean Log-Likelihood of Test Data (rows: actual durations; columns: decision tree)
                  Neutral   Happy    Sad      Surprised  Angry
Neutral            -9.28   -10.18   -11.14    -11.01    -11.49
Happy             -11.52   -10.12   -12.01    -11.41    -11.87
Sad               -14.35   -13.04   -12.05    -14.90    -15.29
Surprised         -12.13   -10.86   -13.13    -10.23    -11.39
Angry             -12.82   -11.17   -13.37    -11.53    -10.95
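The Poisson-based classification can be sketched as follows. The leaf means are invented stand-ins for looking up each phone in the corresponding emotion's duration tree:

```python
# Sketch of the evaluation: score observed (frame-count) durations under
# the Poisson leaf distributions of each emotion's trees, then classify
# by the highest mean log-likelihood. Leaf means below are invented.
import math

def poisson_loglik(d, lam):
    """log P(d; lambda) = d*log(lambda) - lambda - log(d!)"""
    return d * math.log(lam) - lam - math.lgamma(d + 1)

def mean_loglik(durations, leaf_means):
    return sum(poisson_loglik(d, lam)
               for d, lam in zip(durations, leaf_means)) / len(durations)

# Observed durations of a "sad" test utterance, and the leaf means each
# emotion's trees would predict for those phones:
observed = [12, 18, 9]
leaf_means = {
    "neutral": [8, 12, 6],
    "sad":     [12, 17, 9],
    "happy":   [7, 10, 5],
}
scores = {emo: mean_loglik(observed, mus) for emo, mus in leaf_means.items()}
print(max(scores, key=scores.get))  # the "sad" trees score highest here
```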
11. Overview of the Run-Time System
[Diagram: the neutral sequence, text-based features and target emotion enter Sequence Conversion, producing "alh c alh c ah c alh c"; this sequence, the text-based features and the target emotion then drive the intonation models (yielding the F0 contour) and the duration trees (yielding the duration tier); both outputs feed TD-PSOLA resynthesis in Praat]
12. Prosodic Conversion - Samples
[Audio samples: Neutral, Happy, Sad, Surprised, Angry]
13. Experiments With Voice Quality
- Analysis of long-term average spectra (LTAS) for vowels.
- Pitch-synchronous analysis (single-pitch-period frames).
- Total power in each frame is normalized to a constant value.
- Anger has significantly more energy in the 1550-2500 Hz band and less in 0-800 Hz.
- Sadness has a steeper spectral tilt and more low-frequency energy.
- Happy and surprised follow similar spectral patterns.
[Figure: long-term average spectra for /ae/]
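The band-energy comparison above can be sketched on a normalized spectrum. The spectra here are invented power-per-bin lists (10 Hz bins, 0-3000 Hz); only the two bands match the slide:

```python
# Sketch of comparing band energies in a frame-normalized spectrum, as
# in the slide's LTAS analysis. Spectra are invented; bands follow the
# slide (0-800 Hz vs 1550-2500 Hz).
def normalize(power):
    """Scale total frame power to 1, as on the slide."""
    s = sum(power)
    return [p / s for p in power]

def band_energy(power, bin_hz, lo, hi):
    """Sum normalized power over [lo, hi) Hz."""
    return sum(p for i, p in enumerate(power) if lo <= i * bin_hz < hi)

# Toy spectra: "angry" boosts energy above 800 Hz; "sad" tilts steeply
# toward low frequencies.
angry = normalize([1.0 if i * 10 < 800 else 3.0 for i in range(300)])
sad   = normalize([3.0 / (1 + 0.05 * i) for i in range(300)])
for name, spec in [("angry", angry), ("sad", sad)]:
    low = band_energy(spec, 10, 0, 800)
    mid = band_energy(spec, 10, 1550, 2500)
    print(name, round(low, 3), round(mid, 3))
```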
14. Upcoming Work
- More experiments with voice quality modification: a decision tree-based filter bank generation approach.
- Combine voice quality processing with prosody generation.
- Application of the techniques to the MMJ (male) corpus, with performance comparison across gender.
- Perceptual study: acquire recognition scores across emotions, gender and feature sets.
- Miscellaneous: application of the FSP and MMJ models to a new speaker/language.