Emotional Speech

Slides: 51

Provided by: jacksonl6

Learn more at: http://www1.cs.columbia.edu

Transcript and Presenter's Notes

1
Emotional Speech

Guest Lecturer Jackson Liscombe
CS 4706
Julia Hirschberg
4/20/05

2
Assumptions (1)

Prosody is
pitch: fundamental frequency (f0)
loudness: energy (rms)
duration: speaking rate, hesitation
Prosody carries meaning
given/new
focus
discourse structure
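
The two signal-level measures named on this slide, energy (rms) and f0, can be sketched as follows. This is a minimal illustration on a synthetic sine frame, assuming a plain autocorrelation peak picker for f0. The function names, sampling rate, and f0 search range are assumptions made for the example, not the tools used in the lecture.

```python
import math

def rms_energy(frame):
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def estimate_f0(frame, sample_rate, f0_min=50.0, f0_max=500.0):
    """Estimate f0 (Hz) as the lag with the largest autocorrelation.

    The search range (f0_min, f0_max) is an assumed default.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

SAMPLE_RATE = 8000  # Hz, assumed for the example
# Synthetic test frame: a 200 Hz sine with amplitude 0.5 (50 ms long).
frame = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(400)]
energy = rms_energy(frame)
f0 = estimate_f0(frame, SAMPLE_RATE)
```

Real speech needs voicing detection and protection against octave errors, which this sketch leaves out.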

3
Assumptions (2)

Text to Speech Synthesis (TTS)
formant-based
concatenative / unit selection
Articulatory
Machine learning techniques
predefined set of features
learn rules on a training corpus
apply rules to unseen data
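
The three machine-learning steps listed on this slide (fix a feature set, learn rules on a training corpus, apply them to unseen data) can be illustrated with a deliberately tiny rule learner. This is a sketch under stated assumptions: the single feature (mean f0 in Hz), the label, and every data value are made up for illustration, and the one-threshold rule is a stand-in rather than any specific learner.

```python
def learn_threshold_rule(samples):
    """Learn the rule 'feature > threshold => positive' from (value, label) pairs.

    Tries each midpoint between neighbouring feature values and keeps
    the threshold with the fewest training errors.
    """
    values = sorted(v for v, _ in samples)
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum((v > threshold) != label for v, label in samples)

    return min(candidates, key=errors)

def apply_rule(threshold, value):
    """Apply the learned rule to one unseen feature value."""
    return value > threshold

# Hypothetical training corpus: (mean f0 in Hz, is_angry). Invented values.
training = [(110.0, False), (125.0, False), (140.0, False),
            (190.0, True), (210.0, True), (230.0, True)]
threshold = learn_threshold_rule(training)

# Unseen data: the rule is applied without relearning.
predictions = [apply_rule(threshold, v) for v in (120.0, 220.0)]
```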

4
Outline

Why do we care about emotional speech?
Emotional Speech Defined
Perception Studies
Production Studies
Lauren Wilcox on voice quality

5
Emotion. What is it Good For?

Spoken Dialogue Systems
customer-care centers
task planning
tutorial systems
automated agents
Approaching Artificial Intelligence

6
Emotion. Why is it hard?

Colloquial def. ≠ Technical def.
Emotions are non-exclusive
Human consensus low

7
Study I Consensus

Liscombe et al. 2003
User study to classify emotional speech tokens
Semantically neutral (dates and numbers)
10 emotions
confident, encouraging, friendly, happy,
interested
angry, anxious, bored, frustrated, sad
Example

8
Study I Consensus
9
Study I Consensus

Emotions are heavily correlated
Emotions are non-exclusive
Are emotion labels appropriate?
activation
valency

10
Perception of Emotional Speech

Machine learning to predict emotional states in
human speech
Common Features
prosody
lexical items
voice quality

11
Acted Speech

1990s - present
Aubergé, Campbell, Cowie, Douglas-Cowie,
Hirschberg, Liscombe, Mozziconacci, Oudeyer,
Pereira, Roach, Scherer, Schröder, Tato, Yuan,
Zetterholm,

12
Study II Acted Speech

4 actors
10 emotions
Binary decision trees (RIPPER)
Accuracy ranged from 70-80%
Prosody indicative of anger, happy, sad
Voice quality indicative of anxious, bored

13
Emotional Speech in Spoken Dialogue Systems

Batliner, Huber, Fischer, Spilker, Nöth (2003)
Verbmobil (Wizard of Oz scenarios)
Ang, Dhillon, Krupski, Shriberg, Stolcke (2002)
DARPA Communicator
Lee, Narayanan (2004)
Speechworks call-center
Prosodic, Lexical, and Discourse-level features

14
Study III Call-center

AT&T's How May I Help You system
Predict anger and frustration

15
Study III Call-center
That amount is incorrect.
16
Study III Call-center
17
Study III Call-center
18
Study III Call-center

Feature sets
Prosodic (f0, rms, speaking rate)
Discourse (turn number, dialog act)
Lexical (words)
Contextual (dialogue history)

19
Study III Call-center
20
Study IV Tutorial

Physics tutorial system
Detect student uncertainty
Examples

21
Production of Emotional Speech
22
TTS: Where are we now?

Natural-sounding speech for some utterances
Where there is a good match between input and database
Still hard to vary prosodic features and retain
naturalness
Yes-no questions: Do you want to fly first class?
Context-dependent variation still hard to infer
from text and hard to realize naturally

Appropriate contours from text
Emphasis, de-emphasis to convey focus, given/new
distinction: I own a cat. Or, rather, my cat
owns me.
Variation in pitch range, rate, pausal duration
to convey topic structure
Characteristics of emotional speech little
understood, so hard to convey a voice that
sounds friendly, sympathetic, authoritative.
How to mimic real voices?

24
Examples of Emotional Synthesis
http://emosamples.syntheticspeech.de/
25
The Role of Voice Quality in Communicating
Emotion, Mood, and Attitude
L. Wilcox Overview of Speech Communication paper
for COMS4706

Christer Gobl, Ailbhe Ni Chasaide
Some slide content borrowed from
an online voice quality tutorial by K. Marasek
Experimental Phonetics Group
at the Institute of Natural Language Processing
University of Stuttgart, Germany

26
Voice Quality

The characteristic auditory coloring of one's
voice
Derived from a variety of laryngeal and
supralaryngeal features
Present throughout one's speech.
The natural and distinctive tone of speech sounds
produced by a particular person yields a
particular voice (Trask 1996).
This paper focuses on harsh voice, tense voice,
modal voice, breathy voice, whispery voice,
creaky voice, and lax-creaky voice and the role
of these voice qualities in affective expression.
The larynx is used to transform an airstream into
audible sounds.
This process is central to perceived voice
quality.
Most people in linguistics view voice qualities
in terms of one quality in contrast with another.
Phonemic voice quality has a contrastive
function in the phonological system of a
language.

27
Experiment

-Subjects are asked to listen to synthesized
utterances.
-Utterances were synthesized with seven different
voice qualities.
-Subjects were asked to identify pairs of
opposing affective attributes

28
Motivation for experiment

Many vocal expressions signal affect: pitch
variables, speech rate, pausing structure,
duration of accented/unaccented syllables. These
are easier to measure than voice quality
Voice quality is said to play a fundamental role
in affective communication but few empirical
studies seek to understand voice source
correlates.
Some natural voice qualities said to map to
affect and therefore assist in characterizing
emotion in speech (based on phonetic
observations)

29
Motivation for Experiment

-Different researchers have found varied mappings
in their own empirical studies. Further study
could confirm some previous findings
Laver 80, Scherer 86, Laukkanen 96
Breathy: intimacy
Whispery: confidentiality, secrecy
Harsh voice: anger
Tense voice: anger, joy, fear
Lax voice: sadness
But not all agree
Murray, Arnott (93)
Breathy: anger, happiness
Modal to tense: sadness

30
Motivation for Experiment

-Some findings conclude that glottal source
contributes to the perception of valence as well
as vocal effort (Laukkanen 97).
-Synthesis might be an ideal tool for examining
how individual features of a signal contribute to
the perception of affect.
-Previous work has generated emotive synthetic
speech through manipulation of voice quality
parameters (Cahn 90; Murray, Arnott 95) but
the synthesizers used (DECtalk) didn't offer full
control of these parameters
-Voice quality might signal strong as well as
milder emotional states and speaker attitude

31
Different speech source behaviors generate
different voice qualities. Larynx adjusts in
different ways to create different phonatory
gestures, features

Laver (80) defines three which are considered in
this paper
Adductive tension
(interarytenoid muscles adduct the arytenoid
cartilages)
Medial compression
(adductive force on vocal processes; adjustment
of ligamental glottis)
Longitudinal tension
(tension of vocal folds)
Recall scary glottis animation
Diagram: online voice quality tutorial by
K. Marasek, Experimental Phonetics Group at the
Institute of Natural Language Processing,
University of Stuttgart, Germany

32
Modal voice

neutral mode
muscular adjustments are moderate
vibration of the vocal folds is periodic with
full closing of glottis, so no audible friction
noises are produced when air flows through the
glottis.
frequency of vibration and loudness are in the
low to mid range for conversational speech

33
Tense voice: voiced phonation

Very strong tension of the vocal folds, very high
tension in the vocal tract leads to harsh voice
quality.

34
Whispery voice: voiceless phonation

Very low adductive tension
Medial compression moderately high
Longitudinal tension moderately high
Little or no vocal fold vibration
(produced through turbulence generated by
the friction of the air in and above the larynx,
which produces frication)

35
Creaky voice: voiced phonation

vocal folds vibrate at a very low frequency
vibration is somewhat irregular, vibrating mass
is heavier because of low tension (only the
ligamental part of glottis vibrates)
The vocal folds are strongly adducted
longitudinal tension is weak
Moderately high medial compression
Vocal folds thicken and create an unusually
thick and slack structure.

36
Lax - creaky

Despite definition of creaky voice quality,
creaky voice is found to have high glottal
tension at times, and low tension at others
A different creaky quality, lax-creaky, was created
in the experiment as separate from creaky.
Lax-creaky: breathy voice settings with reduced
aspiration noise and added creakiness for the
experiment.

37
Breathy voice: voiced phonation

Tension is low
minimal adductive tension,
weak medial compression
medium longitudinal tension of the vocal folds
folds do not come together completely leading to
frication

38
Voice quality estimation is difficult

If estimated with respect to a controlled neutral
quality, how is that controlled quality known to
be truly neutral? One must match the natural
laryngeal behavior to the neutral model of
behavior.
How adequate are the models of vocal fold
movements for the description of real phonation?
The established relationships between a produced
acoustical signal and the voice source are
complex and, since we are only able to observe the
behavior of voicing indirectly, prone to error.
Otherwise need direct source signal obtained by
invasive techniques (ouch), and invasion might
interfere with signal.

39
Voice quality estimation

Inverse filtering approach
Speech production: source signal shaped by vocal
tract filter response
Inverse filtering cancels the effects of the
vocal tract; the resulting signal is an estimate
of the source (an ill-posed problem)
(popular approaches are automatic, based on
linear predictive analysis, but do worse for
non-modal (colorful) qualities)
Still need to measure the inversely filtered
signal
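
The idea of inverse filtering can be sketched with a toy source-filter model. The example below assumes the vocal tract is a single two-pole resonance whose coefficients are known exactly, so the inverse filter recovers the source almost perfectly. In practice the filter has to be estimated from the speech itself (for example by linear predictive analysis, as noted on this slide), which is where the ill-posedness comes in. All names and numeric values are assumptions made for illustration.

```python
import math

def resonator_coeffs(freq_hz, radius, sample_rate):
    """Denominator coefficients (a1, a2) of a two-pole resonance 1 / A(z)."""
    a1 = -2.0 * radius * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    a2 = radius * radius
    return a1, a2

def vocal_tract_filter(source, a1, a2):
    """All-pole filtering: y[n] = x[n] - a1*y[n-1] - a2*y[n-2]."""
    out = []
    for n, x in enumerate(source):
        y1 = out[n - 1] if n >= 1 else 0.0
        y2 = out[n - 2] if n >= 2 else 0.0
        out.append(x - a1 * y1 - a2 * y2)
    return out

def inverse_filter(speech, a1, a2):
    """Apply A(z) to cancel the resonance: x[n] = y[n] + a1*y[n-1] + a2*y[n-2]."""
    est = []
    for n, y in enumerate(speech):
        y1 = speech[n - 1] if n >= 1 else 0.0
        y2 = speech[n - 2] if n >= 2 else 0.0
        est.append(y + a1 * y1 + a2 * y2)
    return est

SAMPLE_RATE = 8000  # Hz, assumed
# Toy source: one pulse every 80 samples (100 Hz), standing in for glottal excitation.
source = [1.0 if n % 80 == 0 else 0.0 for n in range(400)]
a1, a2 = resonator_coeffs(500.0, 0.95, SAMPLE_RATE)  # hypothetical formant
speech = vocal_tract_filter(source, a1, a2)
recovered = inverse_filter(speech, a1, a2)
max_error = max(abs(s - r) for s, r in zip(source, recovered))
```

With a mis-estimated filter, `recovered` would keep some formant ripple, which is one reason the inversely filtered signal still has to be measured and interpreted with care.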

40
Example
41
Experiment

-Subjects are asked to listen to synthesized
utterances.
-Utterances were synthesized with seven different
voice qualities.
-Subjects were asked to identify pairs of
opposing affective attributes

42
Experiment - details

Natural utterance recorded in an anechoic chamber
("anechoic" = "without echo"): a high-quality
recording of the Swedish utterance ja adjo
(semantically neutral statement), heard by
non-Swedish-speaking native speakers of Irish
English. The recording was digitized at a high
sampling frequency and high resolution (16-bit)
and prepared for analysis

43
Experiment- details

Recorded utterance analyzed and parameterized.
The popular LF (Liljencrants-Fant) model of
differentiated glottal flow (Fant et al., 1995)
was used to match the measured glottal waveform
with a theoretical model of the voice source.
Using LF, a waveform is described by a set of
mathematical functions that model a given segment
of the waveform. The following parameters were
used in the experiment
EE: excitation strength
RA: normalized value of TA, the time constant of
the exponential curve that describes the "rounding
of the corner" of the waveform between t4 and t3;
TA divided by t0 (reflects the amount of residual
airflow after the main excitation, prior to max
glottal closure)
RG: measure of glottal frequency as determined
by the opening branch of the glottal pulse
(normalized to fundamental frequency)
RK: measure of glottal pulse skew, defined by
the relative durations of the opening and closing
branches of the glottal pulse.
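
As a worked example, the three normalized parameters can be computed from timing points of one glottal cycle. The sketch below uses the formulation commonly given for the LF model (RA = ta/t0, RG = t0/(2·tp), RK = (te − tp)/tp). The names tp (time of peak flow), te (time of main excitation), and ta (effective return-phase duration) are an assumed notation that differs from the t3/t4 labels on the slide, and the numeric values are hypothetical.

```python
def lf_ratios(t0, tp, te, ta):
    """Normalized LF parameters from timing points (all in seconds).

    t0: pitch period; tp: time of peak glottal flow (end of the opening
    branch); te: time of main excitation; ta: return-phase time constant.
    """
    if not (0.0 < tp < te < t0) or ta <= 0.0:
        raise ValueError("expected 0 < tp < te < t0 and ta > 0")
    return {
        "RA": ta / t0,          # return phase relative to the period
        "RG": t0 / (2.0 * tp),  # glottal frequency 1/(2*tp) relative to f0 = 1/t0
        "RK": (te - tp) / tp,   # closing branch relative to opening branch (skew)
    }

# Hypothetical cycle at f0 = 125 Hz (t0 = 8 ms); values invented for illustration.
params = lf_ratios(t0=0.008, tp=0.0032, te=0.0044, ta=0.00024)
```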

44
Experiment - details

Utterance resynthesized with modal voice quality
(moderate tension) using a formant synthesizer
(KLSYN88a, Sensimetrics Corp., Boston) allowing
control of source and filter parameters and
different variations of each
Once synthesized with modal voice, the modal
stimulus is reproduced six times, each time with a
different non-modal voice quality (tense,
breathy, whispery, creaky, harsh, lax-creaky).
This is done by adjusting parameters such as
- fundamental frequency
- Open Quotient (OQ) (ratio of the time in which
the vocal folds are open to the whole pitch
period duration)
- Speed Quotient (also called skewness or rk)
(ratio of rise and fall time of the glottal flow)
- more, adjusted differently to create different
voice qualities
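
The open quotient and speed quotient defined on this slide can be measured directly on one period of a glottal flow waveform. The sketch below assumes an idealized triangular pulse that is exactly zero while the glottis is closed. The pulse shape, its length, and the function name are assumptions for illustration and are not the synthesizer's own source model.

```python
def open_and_speed_quotient(flow):
    """Return (OQ, SQ) for one pitch period of glottal flow samples.

    Assumes flow is exactly 0.0 during the closed phase and has a single
    open phase with one peak.
    """
    open_samples = [i for i, v in enumerate(flow) if v > 0.0]
    if not open_samples:
        raise ValueError("no open phase found")
    first, last = open_samples[0], open_samples[-1]
    peak = max(open_samples, key=lambda i: flow[i])
    # Rise: from the last closed sample before the pulse up to the peak.
    rise = peak - (first - 1)
    # Fall: from the peak down to the first closed sample after the pulse.
    fall = (last + 1) - peak
    open_quotient = (rise + fall) / len(flow)
    speed_quotient = rise / fall
    return open_quotient, speed_quotient

# Hypothetical 100-sample period: 40 samples rising, 20 falling, 40 closed.
pulse = [i / 40 if i <= 40 else (60 - i) / 20 if i < 60 else 0.0
         for i in range(100)]
oq, sq = open_and_speed_quotient(pulse)
```

Here the glottis is open for 60% of the period and the flow rises twice as long as it falls. Shifting those two proportions is one way a synthesizer setting could move toward a different voice quality.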

45
Experiment - details

Perception tests constructed with each of the
stimuli and given to subjects
8 short subtests with 10 randomly chosen
stimuli were given to subjects. Interval between
sets: 7 secs
Within each set of stimuli: 4 sec interval
Subjects respond to the affective content of the
stimuli on a scale of 1 to 7 (opposite terms on
either side); responses elicited for one
particular pair of opposite affective attributes
(bored vs. interested, friendly vs. hostile, sad
vs. happy, intimate vs. formal, timid vs.
confident, afraid vs. unafraid)
12 subjects participated: 6 male, 6 female
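
Responses of this kind are typically summarized per voice quality and attribute pair. The sketch below averages 1-to-7 ratings. The ratings themselves are invented for illustration, as is the assumption that 1 marks the first term of a pair and 7 the second. It does not reproduce the study's data or its statistical analysis.

```python
from collections import defaultdict

def mean_ratings(responses):
    """Average 1-7 ratings per (voice quality, attribute pair)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of ratings, count]
    for quality, pair, rating in responses:
        if not 1 <= rating <= 7:
            raise ValueError("rating must be on the 1-7 scale")
        totals[(quality, pair)][0] += rating
        totals[(quality, pair)][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

# Hypothetical responses: (voice quality, attribute pair, rating). Invented values.
responses = [
    ("tense", "bored vs. interested", 6),
    ("tense", "bored vs. interested", 5),
    ("breathy", "bored vs. interested", 2),
    ("breathy", "bored vs. interested", 3),
]
means = mean_ratings(responses)
```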

46
Results
47
(No Transcript)
48
Results

Voice quality and subject variable were
statistically highly significant
Differences between individual qualities were
statistically significant
Most readily perceived
Relaxation and stress
Highly perceived
Anger, boredom, intimacy, contentment, formality
(aside from anger, these could be categorized as
states, moods, attitudes, so consistent with
experiment goal)
Least well perceived
Unafraid, afraid, friendly, happy, sad
Milder states better signaled than strong emotion

49
Results

Notice the modal stimulus is not perceived as
totally neutral
Similar response patterns occurred with
breathy/whispery and tense/harsh
Lax-creaky vs creaky does show significant
differences
Results and their comparison to previous
findings
Lax-creaky: lower arousal, activation
Whispery: timid, afraid
Tense: high arousal/activation (confident,
interested, happy, angry)
Breathy, whispery, creaky, and more so
lax-creaky: relaxed, content, intimate, friendly,
sad, bored
Lax-creaky, more so than whispery, effectively
signaled intimacy
And lax-creaky, more so than breathy, signaled
sadness
Linking of breathy voice to anger and
happiness was not supported
A shift from modal to tense elicited happy affect
(rather than sad as proposed by Murray/Arnott
99)
Tense voice is shown to link to anger and joy
(Scherer 86)
As one moves from the high to the low activation
stimuli set, cross-subject variability increases

50
Some pros and cons of this study

Showed that voice quality alone can evoke
differences in speaker affect
But when comparing only synthesized voices, isn't
it a question of which is relatively more
colorful?
voice qualities are multi-colored and each map
to a variety of affective expression
(expressions are in some cases related, in others
unrelated)
traditional view that voice quality conveys
valence of emotion but not activation is
challenged (for affective states with negative
valence, activation still differentiates them and
is detected with voice quality alone)
Hard to know to what degree naturally occurring
phenomena match the model and the model matches
the synthesis, and which level to look at to
improve or criticize when hearing final synthesis.
Aside from a phonetic system, subjects might
associate voice qualities depending on personal
situations, events, etc (could whispery sound
sinister?)
When only deciding between 2 extremes, subjects
might have difficulty trying not to listen for
the purpose of choosing one or another (?)
- but same data reduction occurred, so beginning
natural utterance not exact copy