1 Multimodal Expressive Embodied Conversational Agents
Catherine Pelachaud
Elisabetta Bevacqua, Nicolas Ech Chafai (FT), Maurizio Mancini, Magalie Ochs (FT), Christopher Peters, Radek Niewiadomski
2 ECAs Capabilities
- Anthropomorphic autonomous figures
- New form of human-machine interaction
- Study of human communication and human-human interaction
- ECAs ought to be endowed with dialogic and expressive capabilities
- Perception: an ECA must be able to pay attention to and perceive the user and the context she is placed in.
3 ECAs Capabilities
- Interaction
  - speaker and addressee emit signals
  - speaker perceives feedback from addressee
  - speaker may decide to adapt to the addressee's feedback
  - consider social context
- Generation: expressive, synchronized visual and acoustic behaviors
  - produce expressive behaviors
  - words, voice, intonation
  - gaze, facial expression, gesture
  - body movements, body posture
4 Synchrony Tool: BEAT
- Cassell et al., MIT Media Lab
- Decomposition of text into theme and rheme
- Linked to WordNet
- Computation of
- intonation
- gaze
- gesture
5 Virtual Training Environments: MRE (J. Gratch, L. Johnson, S. Marsella, USC)
6 Interactive System
- Real estate agent
- Gesture synchronized with speech and intonation
- Small talk
- Dialog partner
7 MAX (S. Kopp, U. of Bielefeld)
- Gesture understanding and imitation
8 Gilbert and George at the Bank (UPenn, 1994)
10 Greta
11 Problem to Be Solved
- Human communication is endowed with three devices to express communicative intention
  - Words and formulas
  - Intonation and paralinguistics
  - Facial expression, gaze, gesture, body movement, posture
- Problem: for any communicative act, the speaker has to decide
  - Which nonverbal behaviors to show
  - How to execute them
12 Verbal and Nonverbal Communication
- Suppose I want to advise a friend to take her umbrella because it is raining.
- Which signals do I use?
  - Verbal signal: use of a syntactically complex sentence
    - "Take your umbrella because it is raining"
  - Verbal + nonverbal signals
    - "Take your umbrella" + pointing at the window to show the rain by a gesture or by gaze
13 Multimodal Signals
- The whole body communicates by using
  - Verbal acts (words and sentences)
  - Prosody, intonation (nonverbal vocal signals)
  - Gesture (hand and arm movements)
  - Facial action (smile, frown)
  - Gaze (eye and head movements)
  - Body orientation and posture (trunk and leg movements)
- All these systems of signals have to cooperate in expressing the overall meaning of the communicative act.
14 Multimodal Signals
- Accompany the flow of speech
- Synchronized at the verbal level
- Punctuate accented phonemic segments and pauses
- Substitute for word(s)
- Emphasize what is being said
- Regulate the exchange of speaking turns
15 Synchronization
- There exists an isomorphism between patterns of speech, intonation and facial actions
- Different levels of synchrony
  - Phoneme level (blink)
  - Word level (eyebrow)
  - Phrase level (hand gesture)
- Interactional synchrony: synchrony between speaker and addressee
16 Taxonomy of Communicative Functions (I. Poggi)
- The speaker may provide three broad types of information:
  - Information about the world: deictic, iconic (adjectival)
  - Information about the speaker's mind
    - belief (certainty, adjectival)
    - goal (performative, rheme/theme, turn-system, belief relation)
    - emotion
    - meta-cognitive
  - Information about the speaker's identity (sex, culture, age)
17 Multimodal Signals (Isabella Poggi)
- Multimodal signals are characterized by their placement with respect to the linguistic utterance and their significance in transmitting information. E.g.:
  - a raised eyebrow may signal surprise, emphasis, a question mark, a suggestion
  - a smile may express happiness, be a polite greeting, be a backchannel signal
- Two pieces of information are needed to characterize multimodal signals:
  - their meaning
  - their visual action
18 Lexicon (meaning, signal)
- Expression meaning
  - deictic: this, that, here, there
  - adjectival: small, difficult
  - certainty: certain, uncertain
  - performative: greet, request
  - topic comment: emphasis
  - belief relation: contrast
  - turn allocation: take/give turn
  - affective: anger, fear, happy-for, sorry-for, envy, relief, …
- Expression signal
  - Deictic: gaze direction
  - Certainty: Certain = palm up open hand; Uncertain = raised eyebrow
  - Adjectival: small = small eye aperture
  - Belief relation: Contrast = raised eyebrow
  - Performative: Suggest = small raised eyebrow, head aside; Assert = horizontal ring
  - Emotion: Sorry-for = head aside, inner eyebrow up; Joy = raising fist up
  - Emphasis: raised eyebrows, head nod, beat
19 Representation Language
- Affective Presentation Markup Language (APML)
  - describes the communicative functions
  - works at the meaning level, not the signal level

  <APML>
    <turn-allocation type="take turn">
      <performative type="greet">Good Morning, Angela.</performative>
      <affective type="happy">It is so
        <topic-comment type="comment">wonderful</topic-comment>
        to see you again.</affective>
      <certainty type="certain">I was
        <topic-comment type="comment">sure</topic-comment>
        we would do so, one day!</certainty>
    </turn-allocation>
  </APML>
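Since APML is plain XML at the meaning level, a first processing step is simply to walk the tree and collect the communicative functions in document order. A minimal sketch with Python's standard XML parser, using the tag and attribute names from the example above (the function name is hypothetical):

```python
import xml.etree.ElementTree as ET

APML_EXAMPLE = """
<APML>
  <turn-allocation type="take turn">
    <performative type="greet">Good Morning, Angela.</performative>
    <affective type="happy">It is so
      <topic-comment type="comment">wonderful</topic-comment>
      to see you again.</affective>
    <certainty type="certain">I was
      <topic-comment type="comment">sure</topic-comment>
      we would do so, one day!</certainty>
  </turn-allocation>
</APML>
"""

def communicative_functions(apml_text):
    """List (function, meaning) pairs in document order, skipping untyped tags."""
    root = ET.fromstring(apml_text)
    return [(el.tag, el.get("type")) for el in root.iter() if el.get("type")]
```

`communicative_functions(APML_EXAMPLE)` yields `("turn-allocation", "take turn")` first, then the nested performative, affective, topic-comment and certainty tags.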
20 Facial Description Language
- Facial expressions defined as (meaning, signal) pairs stored in a library
- Hierarchical set of classes
- Facial basis (FB) class: basic facial movement
- An FB may be represented as a set of MPEG-4 compliant FAPs or, recursively, as a combination of other FBs using the '+' operator:
  - FB = {fap3 = v1, …, fap69 = vk}
  - FB' = c1*FB1 + c2*FB2
  - where c1 and c2 are constants and FB1 and FB2 can be
    - previously defined FBs
    - FBs of the form {fap3 = v1, …, fap69 = vk}
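The recursive combination above can be sketched as a weighted sum over sparse FAP maps; a minimal Python illustration, with FAP names and values chosen for the example rather than taken from the source:

```python
def fb(**faps):
    """A facial basis (FB): a sparse map from FAP name to displacement value."""
    return dict(faps)

def combine(c1, fb1, c2, fb2):
    """FB' = c1*FB1 + c2*FB2: scale and add two facial bases, FAP by FAP."""
    result = {}
    for name in set(fb1) | set(fb2):
        result[name] = c1 * fb1.get(name, 0.0) + c2 * fb2.get(name, 0.0)
    return result

# Illustrative FAP values, not from the source.
small_frown = fb(fap33=-20, fap34=-20)
left_raise = fb(fap35=60)
mix = combine(1.0, small_frown, 0.5, left_raise)
```

Because the result is again a FAP map, `combine` can be applied recursively, matching the FB/FB' definition.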
21 Facial Basis Class
- Examples of facial basis classes
  - Eyebrow: small_frown, left_raise, right_raise
  - Eyelid: upper_lid_raise
  - Mouth: left_corner_stretch, left_corner_raise
22 Facial Displays
- Every facial display (FD) is made up of one or more FBs
  - FD = FB1 + FB2 + FB3 + … + FBn
  - surprise = raise_eyebrow + raise_lid + open_mouth
  - worried = (surprise * 0.7) + sadness
23 Facial Displays
- Probabilistic mapping between the tags and signals
  - E.g. happy_for: (smile * 0.5, 0.3), (smile, 0.25), (smile * 2 + raised_eyebrow, 0.35), (nothing, 0.1)
- Definition of a function class for the (meaning, signal) association
- Class: communicative function
  - Certainty
  - Adjectival
  - Performative
  - Affective
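A probabilistic lexicon entry like happy_for can be sketched as a list of weighted alternatives from which one signal is drawn per tag instantiation. The signal strings and weights below are illustrative:

```python
import random

# One probabilistic lexicon entry: (signal, probability) alternatives.
HAPPY_FOR = [
    ("smile*0.5", 0.30),
    ("smile", 0.25),
    ("smile*2 + raised_eyebrow", 0.35),
    ("nothing", 0.10),
]

def pick_signal(entry, rng=random):
    """Draw one signal alternative according to its probability weight."""
    signals = [s for s, _ in entry]
    weights = [w for _, w in entry]
    return rng.choices(signals, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible for testing.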
24 Facial Temporal Course
25 Gestural Lexicon
- Certainty
  - Certain: palm up open hand
  - Uncertain: showing empty hands while lowering forearms
- Belief-relation
  - List of items of the same class: numbering on fingers
  - Temporal relation: fist with extended thumb moves back and forth behind one's shoulder
- Turn-taking
  - Hold the floor: raise hand, palm toward hearer
- Performative
  - Assert: horizontal ring
  - Reproach: extended index, palm to left, rotating up and down at the wrist
- Emphasis: beat
26 Gesture Specification Language
- Scripting language for hand-arm gestures, based on formational parameters (Stokoe)
- Hand shape specified using HamNoSys (Prillwitz et al.)
- Arm position: concentric squares in front of the agent (McNeill)
- Wrist orientation: palm and finger base orientation
- Gestures are defined by a sequence of timed key poses (gesture frames)
- Gestures are broken down temporally into distinct (optional) phases
  - Gesture phases: preparation, stroke, hold, retraction
- Change of formational components over time
27 Gesture Specification Example: Certain
28 Gesture Temporal Course
[Diagram: gesture phases over time — rest position → preparation → stroke (start to end) → retraction → rest position]
29 ECA Architecture
30 ECA Architecture
- Input to the system: APML-annotated text
- Output of the system: animation files and a WAV file for the audio
- The system
  - interprets APML-tagged dialogs, i.e. all communicative functions
  - looks up in a library the mapping between the meaning (specified by the XML tag) and signals
  - decides which signals to convey on which modalities
  - synchronizes the signals with speech at different levels (word, phoneme or utterance)
31 Behavioral Engine
32 Modules
- APML Parser: XML parser
- TTS: Festival manages the speech synthesis and gives us the list of phonemes and phoneme durations
- Expr2Signal Converter: given a communicative function and its meaning, returns the list of facial signals
- Conflicts Resolver: resolves the conflicts that may happen when more than one facial signal should be activated on the same facial parts
- Face Generator: converts the facial signals into MPEG-4 FAP values
- Viseme Generator: converts each phoneme, given by Festival, into a set of FAPs
- MPEG-4 FAP Decoder: an MPEG-4 compliant facial animation engine
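The Conflicts Resolver step can be sketched as follows: each requested signal names the facial part it occupies, and when two signals compete for the same part, the one attached to the higher-priority communicative function wins. The part names and the priority ordering are illustrative assumptions, not from the source:

```python
# Illustrative priority ranking of communicative functions.
PRIORITY = {"affective": 3, "performative": 2, "emphasis": 1}

def resolve_conflicts(signals):
    """signals: list of (function, part, action). Return one action per part,
    keeping the action of the highest-ranked function for each facial part."""
    chosen = {}
    for function, part, action in signals:
        rank = PRIORITY.get(function, 0)
        if part not in chosen or rank > chosen[part][0]:
            chosen[part] = (rank, function, action)
    return {part: action for part, (_, _, action) in chosen.items()}
```

For example, an emphasis eyebrow raise scheduled together with an affective frown on the same eyebrows resolves to the frown.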
33 TTS: Festival
- Drives the synchronization of facial expressions
- Synchronization implemented at the word level
- Timing of a facial expression connected to the text embedded between the markers
- Use of Festival's tree structure to compute expression durations
34 Expr2Signal Converter
- Instantiation of APML tags: meaning of a given communicative function
- Converts markers into facial signals
- Uses a library containing the lexicon of (meaning, facial expression) pairs
35 Gaze Model
- Based on Isabella Poggi's model of communicative functions
- The model predicts what the value of gaze should be in order to convey a given meaning in a given conversational context.
- For example, if the agent wants to emphasize a given word, the model will output that the agent should gaze at her conversant.
36 Gaze Model
- Very deterministic behavior model: to every communicative function associated with a meaning corresponds the same signal (with probabilistic changes)
- Event-driven model: only when a communicative function is specified are the associated signals computed
- Only when a communicative function is specified may the corresponding behavior vary
37 Gaze Model
- Several drawbacks, as there is no temporal consideration
  - No consideration of past and current gaze behavior to compute the new one
  - No consideration of how long the current gaze state of the speaker (S) and listener (L) has lasted
38 Gaze Algorithm
- Two steps
  - Communicative prediction
    - Apply the communicative function model to compute the gaze behavior so as to convey a given meaning, for S and L
  - Statistical prediction
    - The communicative gaze model is probabilistically modified by a statistical model defined with constraints:
      - what the communicative gaze behavior of S and L is
      - in which gaze behavior S and L were
      - the duration of the current state of S and L
39 Temporal Gaze Parameters
- Gaze behaviors depend on the communicative functions, the general purpose of the conversation (persuasive discourse, teaching...), personality, cultural background, social relations...
- A very, indeed too, complex model
- We propose parameters that control the overall gaze behavior:
  - T_S1,L1_max: maximum duration the mutual gaze state may remain active
  - T_S1_max: maximum duration of gaze state S1
  - T_L1_max: maximum duration of gaze state L1
  - T_S0_max: maximum duration of gaze state S0
  - T_L0_max: maximum duration of gaze state L0
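The duration constraints above can be sketched as a simple state update: the communicatively predicted gaze state is kept unless the current state has exceeded its maximum duration, in which case it is forced to switch. State labels follow the slides (S/L for speaker/listener; reading 1 as gazing at the partner and 0 as gaze aversion is an assumption), and the duration values are illustrative:

```python
# Illustrative maximum durations (seconds) per gaze state.
T_MAX = {"S1": 2.0, "S0": 4.0, "L1": 3.0, "L0": 3.0}

def next_gaze_state(predicted, current, elapsed):
    """Keep the communicatively predicted state unless the current state
    has lasted at least T_max, in which case force a switch (S1 <-> S0,
    L1 <-> L0)."""
    if predicted == current and elapsed >= T_MAX[current]:
        return current[0] + ("0" if current.endswith("1") else "1")
    return predicted
```

So a speaker predicted to keep gazing at the listener (S1) is forced to avert gaze (S0) once the state has lasted longer than T_S1_max.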
40 Mutual Gaze
41 Gaze Aversion
42 Gesture Planner
- Adaptive instantiation
  - Preparation and retraction phase adjustments
  - Transition key and rest gesture insertion
- Joint-chain follow-through
  - Forward time shifting of child joints in time
- Stroke of gesture on stressed word
- Stroke expansion
  - During the planning phase, identify rheme clauses with closely repeated emphases/pitch accents
  - Indicate secondary accents by repeating the stroke of the primary gesture with decreasing amplitude
43 Gesture Planner
- Determination of gesture
  - Look up in dictionary
- Selection of gesture
  - Gestures associated with the most deeply embedded tags have priority (except beat): adjectival, deictic
- Duration of gesture
  - Coarticulation between successive gestures close in time
  - Hold for gestures belonging to tags higher up the hierarchy (e.g. performative, belief-relation)
  - Otherwise go to rest position
44 Behavior Expressivity
- Behavior is related to (Wallbott, 1998)
  - the quality of the mental state (e.g. emotion) it refers to
  - the quantity (somehow linked to the intensity factor of the mental state)
- Behaviors encode
  - content information (the What is being communicated)
  - expressive information (the How it is communicated)
- Behavior expressivity refers to the manner of execution of the behavior
45 Expressivity Dimensions
- Spatial: amplitude of movement
- Temporal: duration of movement
- Power: dynamic property of movement
- Fluidity: smoothness and continuity of movement
- Repetitiveness: tendency to rhythmic repeats
- Overall Activation: quantity of movement across modalities
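The six dimensions form a compact parameter record attached to an agent or a single behavior. A minimal sketch, where the [-1, 1] range with 0 as neutral and the concrete amplitude-scaling formula are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Expressivity:
    """The six expressivity dimensions, each assumed in [-1, 1], 0 = neutral."""
    overall_activation: float = 0.0
    spatial: float = 0.0
    temporal: float = 0.0
    fluidity: float = 0.0
    power: float = 0.0
    repetitiveness: float = 0.0

    def scale_amplitude(self, base_amplitude):
        """Spatial dimension: expand (+1) or condense (-1) gesture amplitude."""
        return base_amplitude * (1.0 + 0.5 * self.spatial)

# The "vigorous" setting from the multiple-modality example slides.
vigorous = Expressivity(overall_activation=1, spatial=1, temporal=1,
                        fluidity=1, power=0, repetitiveness=1)
```

With this mapping, a vigorous agent performs the same gesture 50% larger than a neutral one.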
46 Overall Activation
- Threshold filter on atomic behaviors during APML tag matching
- Determines the number of nonverbal signals to be executed
47 Spatial Parameter
- Amplitude of movement controlled through asymmetric scaling of the reach space that is used to find IK goal positions
- Expand or condense the entire space in front of the agent
48 Temporal Parameter
- Determines the speed of the arm movement of a gesture's meaning-carrying stroke phase
- Modifies the speed of the stroke
[Plot: stroke shift / velocity control of a beat gesture — Y position of wrist w.r.t. shoulder (cm) over frames]
49 Fluidity
- Continuity control of TCB interpolation splines and gesture-to-gesture coarticulation
- Continuity of arm trajectory paths
- Controls the velocity profiles of an action
[Plot: X position of wrist w.r.t. shoulder (cm) over frames]
50 Power
- Tension and bias control of TCB splines
- Overshoot reduction
- Acceleration and deceleration of limbs
- Hand shape control for gestures that do not need hand configuration to convey their meaning (beats)
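Both the fluidity and power slides rely on Kochanek-Bartels (TCB) splines, whose tension, continuity and bias parameters shape the tangents at each key pose. A sketch of the standard TCB tangent formulas for scalar keys (the mapping of fluidity to continuity and power to tension/bias is described above; the function itself is generic):

```python
def tcb_tangents(p_prev, p, p_next, tension=0.0, continuity=0.0, bias=0.0):
    """Kochanek-Bartels incoming/outgoing tangents at key pose p.

    Higher tension tightens the curve (less overshoot); continuity controls
    how smoothly adjacent segments join; bias shifts overshoot to before or
    after the key. With all three at 0 this reduces to a Catmull-Rom tangent.
    """
    d_in = p - p_prev
    d_out = p_next - p
    incoming = (0.5 * (1 - tension) * (1 + bias) * (1 - continuity) * d_in
                + 0.5 * (1 - tension) * (1 - bias) * (1 + continuity) * d_out)
    outgoing = (0.5 * (1 - tension) * (1 + bias) * (1 + continuity) * d_in
                + 0.5 * (1 - tension) * (1 - bias) * (1 - continuity) * d_out)
    return incoming, outgoing
```

At tension = 1 the tangents vanish, giving sharp, forceful direction changes at each key; at continuity = 0, bias = 0, tension = 0 the tangent is the smooth Catmull-Rom average (p_next - p_prev) / 2.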
51 Repetitiveness
- Technique of stroke expansion: consecutive emphases are realized gesturally by repeating the stroke of the first gesture
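Combined with the gesture planner's rule that secondary accents repeat the primary stroke with decreasing amplitude, stroke expansion can be sketched as a geometric decay over the accents of a rheme clause; the decay factor is an illustrative choice:

```python
def expand_stroke(primary_amplitude, n_accents, decay=0.6):
    """Stroke expansion sketch: one stroke amplitude per accent, the first
    being the primary gesture's stroke, later ones decayed repeats."""
    return [primary_amplitude * decay ** i for i in range(n_accents)]
```

For three closely repeated accents this yields a full-size primary stroke followed by two progressively smaller repeats.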
52 Multiple Modality Example: Abrupt
- Overall Activation 0.6, Spatial 0, Temporal 1, Fluidity -1, Power 1, Repetition -1
53 Multiple Modality Example: Vigorous
- Overall Activation 1, Spatial 1, Temporal 1, Fluidity 1, Power 0, Repetition 1
54 Evaluation of Expressive Gesture
- (H1) The chosen implementation for mapping single dimensions of expressivity onto animation parameters is appropriate: a change in a single dimension can be recognized and correctly attributed by users.
- (H2) Combining parameters in such a way that they reflect a given communicative intent results in a more believable overall impression of the agent.
- 106 subjects, from 17 to 26 years old
55 Perceptual Test Studies
- Evaluation of the adequacy of the implementation of each parameter
  - Check whether subjects could perceive and distinguish the six different expressivity parameters and indicate their direction of change.
  - Result: good recognition for the spatial and temporal parameters; lower recognition for the fluidity and power parameters, as they are inter-dependent.
- Evaluation task: does setting appropriate values for the expressivity parameters create behaviors that are judged as exhibiting the corresponding expressivity?
  - 3 different types of behaviors: abrupt, sluggish, vigorous
  - Users prefer the coherent performance for vigorous and abrupt
56 Interaction
- Interaction: two or more parties exchange messages.
- Interaction is by no means a one-way communication channel between parties.
- Within an interaction, parties take turns playing the roles of speaker and addressee.
57 Interaction
- Speaker and addressee adapt their behaviors to each other
  - The speaker monitors the addressee's attention and interest in what he has to say
  - The addressee selects feedback behaviors to show the speaker that he is paying attention
58 Interaction
- Speaker
  - It is pointless for a speaker to engage in an act of communication if the addressee does not pay, or intend to pay, attention
  - It is important for the speaker to assess the addressee's engagement:
    - when starting an interaction: assess the possibility of engagement in the interaction (establish phase)
    - while the interaction is going on: check that engagement is lasting and sustaining the conversation (maintain phase)
59 Interaction
- Addressee
  - attention: pay attention to the signals produced by the speaker so as to perceive, process and memorize them
  - perception of signals
  - comprehension: understand the meaning attached to signals
  - internal reaction: the comprehension of the meaning may create a cognitive and emotional reaction
  - decision: whether or not to communicate the internal reaction
  - generation: display behaviors
60 Backchannel
- Types of backchannels (I. Poggi)
  - attention
  - comprehension
  - belief
  - interest
  - agreement
  - positive/negative
  - any combination of the above: pay attention but not understand, understand but not believe, etc.
61 Backchannel
- Depending on the type of speech act they respond to, a signal will be interpreted as a backchannel or not.
  - backchannel: a signal of agreement/disagreement that follows the expression of opinions, evaluations, planning
  - not a backchannel: a signal of comprehension/incomprehension after an explicit question ("Did you understand?")
62 Backchannel
- Polysemy of backchannel signals
  - a signal may provide different types of information
  - a frown: negative feedback for understanding, believing and agreeing
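The two points above, interpretation depending on the preceding speech act, and one signal carrying several feedback meanings at once, can be sketched together. All speech-act labels and signal-to-meaning mappings below are illustrative assumptions:

```python
# Speech acts after which a signal counts as a backchannel (illustrative).
BACKCHANNEL_AFTER = {"opinion", "evaluation", "planning"}

# Polysemy: one signal maps to a set of feedback meanings (illustrative).
SIGNAL_MEANINGS = {
    "frown": {"not-understanding", "not-believing", "disagreement"},
    "nod": {"attention", "agreement"},
}

def interpret(signal, preceding_act):
    """Return the set of feedback meanings, or None when the signal is an
    answer to an explicit question rather than a backchannel."""
    if preceding_act == "explicit-question":
        return None
    if preceding_act in BACKCHANNEL_AFTER:
        return SIGNAL_MEANINGS.get(signal, set())
    return set()
```

A frown after an opinion is read as multi-valued negative feedback, while the same frown after "Did you understand?" is an answer, not a backchannel.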
63 Backchannel Signals: Gaze
- Gaze
  - shows the direction of attention
  - informs on the level of engagement or on the intention to maintain engagement
  - indicates the degree of intimacy
- but also
  - monitors the gaze behavior of others to establish their intention to engage or remain engaged
  - shared attention: situations involving mutual gaze at each other or mutual gaze at a same object
64 Backchannel Modelling
- Reactive model
  - generates instinctive feedback without reasoning
  - simple backchannel or mimicry
  - spontaneous, sincere
- Cognitive model
  - conscious decision to provide a backchannel to provoke a particular effect on the speaker or to reach a specific goal
  - deliberate, possibly pretended
  - it can shift to automatic (e.g. when listening to a bore)
65 Backchannel Demo
66 A Reactive Backchannel
- Currently, our model is reactive in nature
- Dependent on perception
  - The speaker interprets the addressee's behavior
  - The speaker generates or alters its own behavior
- Our focus: interest and attention at a signal level (not at a cognitive level)
67 Organization of the Communication: Attraction of Attention
- Communicative agents: the agents provide information to the user, and should guarantee that the user pays attention
- Animation expressivity: principle of staging, so that a single idea is clearly expressed at each instant of time
- Animation specificity: animators' creativity, no realism constraints for animators
- What types of gesture properties could guarantee users' attention?
France Telecom
68 Organization of the Communication: Attraction of Attention
- Corpus: videos from traditional animation that illustrate different types of conversational interaction
- The modulations of gesture expressivity over time play a role in managing communication, thus serving as a pragmatic tool
69 Emotion
- Emotions are elicited by the evaluation of events, objects, actions
- Integration of emotions in a dialog system (Artimis, FT)
- Identify under which circumstances a dialog agent should express emotions
70 Emotion
- BDI representation
- Based on the OCC model's appraisal variables (Ortony et al. 1988)
  - Desirability/Undesirability: achievement or threatening of the agent's goals
  - Degree of realization: degree of certainty of the goal's achievement
  - Probability of an event: probability of feasibility of an event
  - Agency: the agent who is the actor of the event
71 Emotion
- Complex emotions
  - superposition of two emotions: the evaluation of an event can happen under different angles
  - masking an emotion by another one: consideration of the social context
  - e.g. deception masked by joy
72 Video: Masking of Deception by Joy
73 Conclusion
- Creation of a virtual agent able to
  - communicate nonverbally
  - show emotions
  - use expressive gestures
  - perceive and be attentive
  - maintain attention
- Two studies on expressivity
  - from manual annotation of a video corpus
  - from mimicry of movement analysis