Title: Data-Driven Prosody and Voice Quality Generation for Emotional Speech
1. Data-Driven Prosody and Voice Quality Generation for Emotional Speech
- Zeynep Inanoglu, Steve Young
- Machine Intelligence Lab
- CUED
2. Project Goal Restated
- Given a neutral utterance and all the known features that can be extracted from a TTS framework, convert the utterance into a specified emotion with zero degradation in quality.
- Particularly useful for unit-selection synthesizers with good-quality neutral output.
- What can be modified?
  - Prosody (F0, duration, prominence structure)
  - Voice quality (spectral features relating to the voice source/filter)
- Method: data-driven learning and generation (decision trees, HMMs).
- This presentation addresses both prosody generation and voice quality modification.
3. Data Specifics (Female Speaker)
- Parallel data: 595 utterances of each emotion.
- 545 utterances used for training the different modules; 50 set aside as the test set.
- Features extracted:
  - Text-based features: phone identity, lexical stress, syllable position, word length, word position, part of speech.
  - Perceptual features (intonation units): ToBI-based syllable labels alh, ah, al, c represent three accent types and one symbol for unaccented syllables; automatically extracted by Prosodizer.
4. Step 1: Convert the Intonation Unit Sequence
[Diagram: a neutral sequence "alh c c c ah c al c", text-based features and the target emotion enter Sequence Conversion, which outputs "alh c alh c ah c alh c"]
- ASSUMPTION: each emotion has an intrinsic pattern in its intonation unit sequence (e.g. surprised utterances use alh much more frequently than neutral).
- For each input neutral unit and its context, we want to find the corresponding unit in the target emotion.
- Training: decision trees trained on parallel data.
- This is similar to sliding a decision tree along the utterance and generating an output at each syllable.
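The per-syllable conversion above can be sketched as follows. A context-window lookup table learned from parallel data stands in for the paper's decision trees, and the one-neighbour context window and toy sequences are illustrative assumptions, not the authors' exact feature set:

```python
# Minimal sketch of per-syllable intonation-unit conversion. A context ->
# unit lookup table stands in for the slide's decision trees; the context
# window (previous, current, next unit) is an illustrative assumption.
from collections import Counter, defaultdict

def context(units, i):
    """Previous, current and next neutral unit around syllable i."""
    prev_u = units[i - 1] if i > 0 else "<s>"
    next_u = units[i + 1] if i < len(units) - 1 else "</s>"
    return (prev_u, units[i], next_u)

def train(parallel_pairs):
    """Learn a context -> most-frequent-target-unit map from parallel data."""
    counts = defaultdict(Counter)
    for neutral_seq, target_seq in parallel_pairs:
        for i, tgt in enumerate(target_seq):
            counts[context(neutral_seq, i)][tgt] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def convert(model, neutral_seq):
    """Slide along the utterance, emitting one output unit per syllable;
    unseen contexts fall back to the unchanged neutral unit."""
    return [model.get(context(neutral_seq, i), u)
            for i, u in enumerate(neutral_seq)]

# Toy parallel pair (from the slide): surprised uses "alh" more often.
neutral = ["alh", "c", "c", "c", "ah", "c", "al", "c"]
target  = ["alh", "c", "alh", "c", "ah", "c", "alh", "c"]
model = train([(neutral, target)])
print(convert(model, neutral))
```

A real decision tree would additionally branch on the text-based features listed on slide 3, which is what lets it generalize beyond seen unit contexts.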
5. Step 1: Sequence Conversion Results
- Sequence prediction accuracy is computed on the test data by measuring the number of units that match between the converted and target sequences (substitution error).
- As a benchmark, the unchanged neutral sequence is also compared to the target sequence.
- Sequence conversion improves accuracy for happy, angry and surprised, and barely changes the results for sad.
Sequence Prediction Accuracy (%)
Emotion     Neutral Sequence   Converted Sequence
happy       61.90              66.52
sad         60.81              60.04
angry       60.26              72.45
surprised   53.23              61.03
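The accuracy measure from the slide can be sketched directly; the toy sequences below are invented for illustration:

```python
# Sketch of the slide's metric: percentage of units in a predicted
# sequence that match the target sequence. Parallel sequences have equal
# length, so only substitutions occur. Toy sequences are illustrative.
def sequence_accuracy(predicted, target):
    assert len(predicted) == len(target)
    matches = sum(p == t for p, t in zip(predicted, target))
    return 100.0 * matches / len(target)

neutral   = ["alh", "c", "c",   "c", "ah", "c", "al",  "c"]
converted = ["alh", "c", "alh", "c", "ah", "c", "alh", "c"]
target    = ["alh", "c", "alh", "c", "ah", "c", "alh", "c"]
print(sequence_accuracy(neutral, target))    # benchmark: 75.0
print(sequence_accuracy(converted, target))  # converted:  100.0
```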
6. Step 2: Intonation Model Training
[Diagram: text-based features and the unit sequence "alh c alh c" enter the intonation models, which output an F0 contour]
- Each syllable's intonation is modelled as a three-state left-to-right HMM.
- Each model is highly context sensitive; an example model name: alhlex_at_1wpos_at_1spos_at_3pofs_at_3word_len_at_3syltype_at_2
- Syllable models are trained on interpolated F0 and energy values derived from the laryngograph signal, plus first- and second-order differentials: (F0, E, DF0, DDF0, DE, DDE).
- Decision tree-based parameter tying was performed.
7. Step 2: Model-Based Intonation Generation
- The goal is to generate an optimal sequence of F0 values directly from the context-sensitive syllable HMMs, given the intonation unit sequence.
- This yields a sequence of mean state values.
- The cepstral parameter generation algorithm of the HTS system is used for interpolated F0 generation (Tokuda et al., 1995).
- Differential F0 features are used as constraints in contour generation, resulting in smoother contours.
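The effect of the delta constraints can be illustrated with a toy least-squares version of trajectory generation. This is a simplified sketch in the spirit of Tokuda et al. (1995), not the HTS implementation: variances are fixed to 1, only first-order deltas are used, and the state means are invented:

```python
# Toy trajectory generation with delta constraints: find the contour c
# minimizing ||c - mu_static||^2 + ||delta(c) - mu_delta||^2, where
# delta(c)[t] = 0.5 * (c[t+1] - c[t-1]). Unit variances assumed; the
# step-like state means below are invented for illustration.
import numpy as np

def generate_f0(static_mu, delta_mu):
    T = len(static_mu)
    W_static = np.eye(T)                      # static F0 targets
    W_delta = np.zeros((T, T))                # first-difference window
    for t in range(1, T - 1):
        W_delta[t, t - 1], W_delta[t, t + 1] = -0.5, 0.5
    W = np.vstack([W_static, W_delta])
    mu = np.concatenate([static_mu, delta_mu])
    c, *_ = np.linalg.lstsq(W, mu, rcond=None)  # least-squares solve
    return c

# Two-state step in the mean F0; asking for delta ~ 0 smooths the jump.
static = np.array([200.0, 200.0, 200.0, 260.0, 260.0, 260.0])
delta = np.zeros(6)
print(np.round(generate_f0(static, delta), 1))
```

Without the delta rows the solution would simply reproduce the step in the state means; with them, the generated contour trades off fit to the static means against smoothness, which is the "smoother contours" effect noted above.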
8. Step 2: Model-Based Intonation Generation
- It is difficult to obtain an objective measure for F0 comparison that is perceptually relevant.
- A simple approach is to align the target and model-generated contours so that they both have N pitch points, and measure the RMS error per utterance.
- The average RMS error over all generated test contours is given below. The first row gives the error between the neutral and target contours as a benchmark.
RMSE (Hz)               Surprised   Angry   Happy   Sad
RMSE (neutral, target)     93.7     138.8    68.8   34.8
RMSE (gen, target)         64.7      68.3    65.6   27.2
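The alignment-then-RMSE evaluation can be sketched as below; the resampling by linear interpolation is one reasonable reading of "align so that both have N pitch points", and the contours are invented:

```python
# Sketch of the slide's evaluation: resample both contours to N pitch
# points by linear interpolation, then take the RMS error. Contours and
# N are illustrative.
import math

def resample(contour, n):
    """Linearly interpolate a contour onto n evenly spaced points."""
    m = len(contour)
    out = []
    for i in range(n):
        x = i * (m - 1) / (n - 1)
        lo = int(x)
        hi = min(lo + 1, m - 1)
        frac = x - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out

def rms_error(generated, target, n=100):
    g, t = resample(generated, n), resample(target, n)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(g, t)) / n)

gen    = [200.0, 220.0, 250.0, 230.0]        # toy generated contour (Hz)
target = [205.0, 215.0, 245.0, 240.0, 235.0]  # toy target contour (Hz)
print(round(rms_error(gen, target), 1))
```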
9. Step 3: Duration Tree Training
[Diagram: text-based features and the unit sequence "alh c alh c" enter the duration trees, which output a duration tier]
- A decision tree was built for each voiced broad class: vowels, nasals, glides and fricatives.
- All text-based features and intonation units were used to build the trees.
- The most significant features varied with emotion and phone class.
- For each test utterance, a duration tier was constructed by taking the ratio of predicted duration to neutral duration.
10. Step 3: Duration Trees - Evaluation
- Assume a Poisson distribution at the leaves of the decision tree: P(d) = λ^d e^(-λ) / d!, where λ is the mean of the leaf node (the predicted duration).
- Measure the performance of the duration trees by using them as a classifier: how likely are, say, happy durations to have been generated by the neutral/happy/sad trees?
- We want the diagonal entry to be the maximum (most likely, i.e. least negative log-likelihood) of each row.
Mean Log-Likelihood of Test Data (rows: actual durations; columns: decision tree)
                  Neutral   Happy    Sad      Surprised  Angry
Neutral            -9.28   -10.18   -11.14    -11.01    -11.49
Happy             -11.52   -10.12   -12.01    -11.41    -11.87
Sad               -14.35   -13.04   -12.05    -14.90    -15.29
Surprised         -12.13   -10.86   -13.13    -10.23    -11.39
Angry             -12.82   -11.17   -13.37    -11.53    -10.95
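The Poisson-based classification can be sketched as follows. The leaf means are invented stand-ins for looking up each phone in the corresponding emotion's duration tree:

```python
# Sketch of the evaluation: score observed (frame-count) durations under
# the Poisson leaf distributions of each emotion's trees, then classify
# by the highest mean log-likelihood. Leaf means below are invented.
import math

def poisson_loglik(d, lam):
    """log P(d; lambda) = d*log(lambda) - lambda - log(d!)"""
    return d * math.log(lam) - lam - math.lgamma(d + 1)

def mean_loglik(durations, leaf_means):
    return sum(poisson_loglik(d, lam)
               for d, lam in zip(durations, leaf_means)) / len(durations)

# Observed durations of a "sad" test utterance, and the leaf means each
# emotion's trees would predict for those phones:
observed = [12, 18, 9]
leaf_means = {
    "neutral": [8, 12, 6],
    "sad":     [12, 17, 9],
    "happy":   [7, 10, 5],
}
scores = {emo: mean_loglik(observed, mus) for emo, mus in leaf_means.items()}
print(max(scores, key=scores.get))  # the "sad" trees score highest here
```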
11. Overview of the Run-Time System
[Diagram: the neutral sequence, text-based features and target emotion enter Sequence Conversion, producing "alh c alh c ah c alh c"; this sequence, the text-based features and the target emotion then drive the intonation models (yielding the F0 contour) and the duration trees (yielding the duration tier); both outputs feed TD-PSOLA resynthesis in Praat]
12. Prosodic Conversion - Samples
[Audio samples: Neutral, Happy, Sad, Surprised, Angry]
13. Experiments With Voice Quality
- Analysis of long-term average spectra (LTAS) for vowels.
- Pitch-synchronous analysis (single-pitch-period frames).
- Total power in each frame is normalized to a constant value.
- Anger has significantly more energy in the 1550-2500 Hz band and less in 0-800 Hz.
- Sadness has a steeper spectral tilt and more low-frequency energy.
- Happy and surprised follow similar spectral patterns.
[Figure: long-term average spectra for /ae/]
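The band-energy comparison above can be sketched on a normalized spectrum. The spectra here are invented power-per-bin lists (10 Hz bins, 0-3000 Hz); only the two bands match the slide:

```python
# Sketch of comparing band energies in a frame-normalized spectrum, as
# in the slide's LTAS analysis. Spectra are invented; bands follow the
# slide (0-800 Hz vs 1550-2500 Hz).
def normalize(power):
    """Scale total frame power to 1, as on the slide."""
    s = sum(power)
    return [p / s for p in power]

def band_energy(power, bin_hz, lo, hi):
    """Sum normalized power over [lo, hi) Hz."""
    return sum(p for i, p in enumerate(power) if lo <= i * bin_hz < hi)

# Toy spectra: "angry" boosts energy above 800 Hz; "sad" tilts steeply
# toward low frequencies.
angry = normalize([1.0 if i * 10 < 800 else 3.0 for i in range(300)])
sad   = normalize([3.0 / (1 + 0.05 * i) for i in range(300)])
for name, spec in [("angry", angry), ("sad", sad)]:
    low = band_energy(spec, 10, 0, 800)
    mid = band_energy(spec, 10, 1550, 2500)
    print(name, round(low, 3), round(mid, 3))
```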
14. Upcoming Work
- More experiments with voice quality modification: a decision tree-based filter bank generation approach.
- Combine voice quality processing with prosody generation.
- Application of the techniques to the MMJ (male) corpus, with performance comparison across gender.
- Perceptual study: acquire recognition scores across emotions, gender and feature sets.
- Miscellaneous: application of the FSP and MMJ models to a new speaker/language.