Title: Hierarchical Hidden Markov Models
1. Université Catholique de Louvain, Faculté des Sciences Appliquées
Laboratoire de Télécommunications et Télédétection (TELE)
- Hierarchical Hidden Markov Models
- A common framework for complex recognition and planning under uncertainty
- Facial Animation (F-X Fanard)
- Emotion Recognition (Olivier Martin)
- Gesture Recognition (Kosta Gaitanis)
2. Outline
- HMM & DBN: what is what, and why?
- HHMM: a common framework
- Applications
  - Kosta: Gesture Recognition
  - F-X: Facial Animation
  - Olivier: Emotion Recognition
3. What are Bayesian Networks?
- A Bayesian Network is a graph:
  - a set of nodes (stochastic variables),
  - a set of directed links (causal relationships).
- The graph has no cycles (DAG).
- Each node has a conditional probability table that quantifies the effect that the parents have on the node (causality).
- P(O1, …, ON) = Πi P(Oi | Parents(Oi))
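The chain-rule factorization above can be sketched on a toy network; the three-node DAG A → B → C and all its probability tables below are invented for illustration:

```python
# Hypothetical three-node Bayesian network A -> B -> C with made-up CPTs.
p_a = {True: 0.3, False: 0.7}                     # P(A)
p_b_given_a = {True: {True: 0.8, False: 0.2},     # P(B | A)
               False: {True: 0.1, False: 0.9}}
p_c_given_b = {True: {True: 0.5, False: 0.5},     # P(C | B)
               False: {True: 0.05, False: 0.95}}

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(a) * P(b | a) * P(c | b): each node conditioned on its parents."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# A valid factorization sums to 1 over all 8 joint assignments.
total = sum(joint(a, b, c) for a in (True, False)
            for b in (True, False) for c in (True, False))
print(total)  # 1.0
```

The point of the factorization is the table sizes: three small conditional tables instead of one table over all 2³ joint assignments.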
4. An example
5. Bayesian Network Classifiers
- U = {O1, …, ON, S}
- The Oi are the observation variables.
- S is the state variable.
- Goal: infer P(S | O1, …, ON) using Bayes' rule.
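The posterior inference can be sketched for the naive-Bayes case, where the observations are assumed independent given S; the class names and all numbers below are invented:

```python
import math

def posterior(priors, likelihoods):
    """P(S | O1..ON) ∝ P(S) * Π_i P(Oi | S), normalized over the states S."""
    unnorm = {s: priors[s] * math.prod(likelihoods[s]) for s in priors}
    z = sum(unnorm.values())          # normalizing constant from Bayes' rule
    return {s: p / z for s, p in unnorm.items()}

# Hypothetical two-state classifier with two observed features.
post = posterior({"walk": 0.6, "jump": 0.4},            # P(S)
                 {"walk": [0.7, 0.2], "jump": [0.1, 0.8]})  # P(Oi | S) per state
print(post["walk"] > post["jump"])  # True
```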
6. What is a Dynamic Bayesian Network?
- For each time slice, a Bayesian network.
[Figure: two consecutive time slices, each a Bayesian network over observations O1–O6, linked through the state nodes St and St+1.]
7. Hidden Markov Model
- The simplest Dynamic Bayesian Network.
- Defined by λ = (A, B, π):
  - A = {ajk}: state transition probabilities, ajk = P(St = k | St-1 = j)
  - B = {bj(k)}: observation probabilities, bj(k) = P(Ot = vk | St = j)
  - π = {πj}: initial state distribution, πj = P(S1 = j)
[Figure: state chain St-1 → St → St+1 (N possible states), each state emitting an observation Ot-1, Ot, Ot+1 (M possible values).]
8. HMM: 3 problems
- The evaluation problem (analysis: forward/backward)
  - Given an HMM λ and a sequence of observations O, what is the probability that the observations are generated by the model, P(O | λ)?
- The decoding problem (analysis: Viterbi)
  - Given an HMM λ and a sequence of observations O, what is the most likely state sequence in the model that produced the observations?
- The learning problem (synthesis: Baum-Welch)
  - Given an HMM λ and a sequence of observations O, how should we adjust the model parameters λ in order to maximize P(O | λ)?
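The evaluation problem is solved by the forward algorithm in time linear in the sequence length. A minimal NumPy sketch, with a toy two-state, two-symbol model whose parameters are invented:

```python
import numpy as np

def forward_likelihood(A, B, pi, obs):
    """Evaluation problem: P(O | lambda) via the forward algorithm.
    A[j, k] = P(St=k | St-1=j), B[j, v] = P(Ot=v | St=j), pi[j] = P(S1=j)."""
    alpha = pi * B[:, obs[0]]           # initialization: alpha_1(j) = pi_j * b_j(o1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # induction: predict through A, weight by likelihood
    return alpha.sum()                  # termination: sum over final states

# Toy parameters, invented for illustration.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
p = forward_likelihood(A, B, pi, [0, 1, 0])
```

Summing `forward_likelihood` over every possible observation sequence of a fixed length gives exactly 1, which is a handy sanity check on an implementation.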
9. An example
HMM: P(S | O1, …, O6); BN: P(S | O1, O2, O5) · P(S | O4) · P(S | O3, O6)
- If every observation variable may take 10 different values:
  - HMM: 10^6 values
  - BN: 10^3 + 10 + 10^2 = 1110 values
  - Naive BN: 60 values
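The three table-size counts above follow directly from the factorizations; as a quick check:

```python
# Table-size comparison from the slide: 6 observation variables, 10 values each.
V, N = 10, 6                    # values per observation, number of observations
hmm_table = V ** N              # one entry per joint observation: 10^6
bn_table = V**3 + V + V**2      # clusters {O1,O2,O5}, {O4}, {O3,O6}: 1000 + 10 + 100
naive_bn = N * V                # one 1-D table per observation: 6 * 10
print(hmm_table, bn_table, naive_bn)  # 1000000 1110 60
```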
10. Why Dynamic Bayesian Networks?
- Handle temporal aspects (like HMMs).
- Handle dependencies within variables, allowing for efficient inference mechanisms.
- Trade-off between precision and computational time for real-time applications: hybrid inference methods (Rao-Blackwellised Particle Filters).
11. Why Dynamic Bayesian Networks?
- Handle Multimodal Fusion & Fission, simply by adding/subtracting edges between vertices.
- Intuitive and comprehensive representation (unlike NN, SVM, …).
- Take all scenarios into account (unlike rule-based systems).
- No need for much learning: learning is used to refine a priori information about the model (unlike NN, SVM, etc.).
12. Smoothed learning
- Too few samples for unbiased learning?
- → Compute the a posteriori probability by machine learning, taking a priori information into account intelligently (learning coefficient µ).
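One way to realize this idea is to blend the a priori distribution with the empirical frequencies using the coefficient µ; the linear blending formula below is an assumption for illustration, since the slide only names the coefficient:

```python
def smoothed_estimate(prior, counts, mu):
    """Blend prior and maximum-likelihood estimates:
    posterior ≈ mu * prior + (1 - mu) * empirical frequency.
    A high mu leans on the prior when the sample counts are small."""
    total = sum(counts)
    empirical = [c / total for c in counts]
    return [mu * p + (1 - mu) * e for p, e in zip(prior, empirical)]

# Only 3 observed samples, all in the first bin -> trust the prior more.
est = smoothed_estimate(prior=[0.5, 0.5], counts=[3, 0], mu=0.8)
print(est)  # [0.6, 0.4]
```

Without smoothing, the maximum-likelihood estimate here would assign probability 0 to the second state, a typical symptom of learning from too few samples.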
13. References
- Dynamic Bayesian Networks
  - Friedman, "Bayesian Network Classifiers"
  - http://www.cs.uu.nl/docs/vakken/pn/bayesclass.pdf
- Hidden Markov Models
  - Rabiner, "A tutorial on Hidden Markov Models and selected applications in Speech Recognition"
  - www.ai.mit.edu/courses/6.867-f02/papers/rabiner.pdf
14. HHMM: A common framework
- Kosta Gaitanis, UCL
- Louvain-la-Neuve, Belgium
15. Outline
- HHMM: extending the classical HMM
- Inference
  - Brief presentation of classical methods
  - Linear-time inference in the HHMM
- Multiple actors: extending the HHMM
16. From the HMM to the Hierarchical HMM
- An HMM generates symbols.
- Some problems have an inherently hierarchical structure.
- We want to exploit and model this hierarchical structure.
- Solution: an HMM that generates another HMM!
- Advantage: the observations are correlated at higher levels → longer periods of time.
17. DBN representation of a HHMM
[Figure: DBN with a state node at each level k, termination nodes, and noisy observation nodes; the lowest level is a simple HMM.]
- 2 types of inference: bottom-up (recognition) and top-down (planning).
18. Example: moving in an airport
- The same model can be used for planning AND for recognition!
19. Inference in the HHMM
- Exact inference (Pearl, junction tree, loopy BP, …)
  - Exponential complexity w.r.t. the number of arcs.
- Approximate inference
  - Distribution approximations
    - → non-Gaussian
  - Sampling methods (FP, MCMC, …)
    - → poor precision with a large number of nodes.
- Hybrid inference (RBPF, …)
  - Good trade-off between complexity and precision.
  - Intelligent sampling can reduce network connectivity → exact inference becomes possible for the rest of the network.
20. Linear-Time Inference in the HHMM
[Figure: the original HHMM, and the HHMM after sampling the termination nodes.]
- Rao-Blackwellised Particle Filter:
  - A particle filter samples the termination nodes and the horizontal transitions.
  - Exact inference calculates the belief states of the other variables using Bayes' rule → linear complexity (tree structure).
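The split between sampled and exactly-updated variables can be illustrated with a toy Rao-Blackwellised step. Everything below is invented for illustration: a two-state sub-HMM whose termination bit F_t is sampled per particle, while the state belief is then updated in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (all numbers invented): transition if the sub-HMM continues,
# re-entry distribution if the termination bit fires, and an observation model.
A_stay = np.array([[0.9, 0.1], [0.1, 0.9]])   # sub-HMM transition when F_t = 0
A_reset = np.array([[0.5, 0.5], [0.5, 0.5]])  # re-entry distribution when F_t = 1
B = np.array([[0.8, 0.2], [0.3, 0.7]])        # P(O_t | S_t)
p_term = 0.2                                   # P(F_t = 1)

def rbpf_step(beliefs, weights, obs):
    """One RBPF step: sample F_t per particle, exact Bayes update on the rest."""
    new_beliefs, new_weights = [], []
    for b, w in zip(beliefs, weights):
        f = rng.random() < p_term                 # particle filter: sample F_t
        pred = b @ (A_reset if f else A_stay)     # exact prediction given the sample
        lik = pred * B[:, obs]                    # weight by the observation model
        new_weights.append(w * lik.sum())         # particle weight = evidence term
        new_beliefs.append(lik / lik.sum())       # exact posterior belief per particle
    z = sum(new_weights)
    return new_beliefs, [w / z for w in new_weights]

beliefs = [np.array([0.5, 0.5])] * 100
weights = [1 / 100] * 100
beliefs, weights = rbpf_step(beliefs, weights, obs=0)
```

Because only the termination/transition variables are sampled, each particle carries a small exact belief state instead of a full joint sample, which is what keeps the per-step cost linear.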
21. Taking into account multiple actors
- An actor can be:
  - a point (edge of mouth, edge of body, …),
  - a part of the body (mouth, hand, leg, …),
  - a group of parts of the body (legs, hands, upper/lower face),
  - a person,
  - an object.
- In general, anything that can be observed is considered an actor.
- Conditional dependence between actors at different levels of the hierarchy.
- If each actor needs a separate HHMM → exponential complexity, oops!
22. Extending the HHMM for multiple actors
- Idea: the dependence between actors is modeled only at one level.
- Complexity stays linear because the substructures are independent (tree structure).
- A coordination node models structures such as AND / OR.
23. Applications
- Any problem that has:
  - noisy data,
  - correlated observations,
  - a natural hierarchical decomposition,
- and needs:
  - dynamic recognition (bottom-up) or planning (top-down),
  - flexible modeling of actions,
  - learning.
24. Gesture Recognition using the HHMM
- Kosta Gaitanis, UCL
- Louvain-la-Neuve, Belgium
25. Applications: Gesture Recognition
- Goal: what is this man doing?
  - Walking, jumping, sitting, taking an object, …
- Data acquisition: Natural Gesture (Alterface)
  - Positions and speeds of 5 crucial body points (head, hands, feet).
26. Hierarchical Decomposition
- Natural hierarchical decomposition of the body.
- The variables are independent at lower levels but dependent at higher levels.
- Only one object at the top of the hierarchy.
27. Modelling an action
28. Modelling Actions in a HHMM
29. Learning
- Higher levels can be modelled easily using a priori knowledge.
- Lower levels are more complex:
  - Production states
    - From verbally stated actions to point movements.
  - Observation error model
    - Create a model for the errors made during data acquisition.
    - Inversions (left/right hand, head/hand, leg/hand, …)
    - Self-occlusions
30. Observation Error Model
- Self-occlusion:
  - Bayesian Networks can still infer with missing data.
  - Naive missing-data inference.
  - Inference with estimated data.
31. Application-Specific Model
32. MPEG-4 Facial Animation
- Fanard François-Xavier, UCL
- Louvain-la-Neuve, Belgium
33. Facial Definition Parameters (FDPs)
- A set of 84 feature points placed on the face.
- Divided into 10 subgroups.
- Help to:
  - customize an animation according to personal characteristics,
  - reproduce the bone-structure topography of a particular face, on which a specific texture may be applied (the skin, the eyes, the beard, …).
- Sent once per animation session.
34. Facial Animation Parameter Units (FAPUs)
- Independence between emotions/animations and facial models.
- Fractions of distances between key facial features (iris-to-iris distance, …).
- Computed on a neutral face.
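The FAPU computation can be sketched numerically; the keypoint coordinates below are invented, and the division by 1024 reflects the usual MPEG-4 convention (the slide itself only gives the iris-to-iris example):

```python
def distance(p, q):
    """Euclidean distance between two 2-D feature points."""
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

# Hypothetical neutral-face keypoints, in pixels.
left_iris, right_iris = (120, 140), (200, 140)

# MPEG-4 defines most FAPUs as a key distance divided by 1024,
# e.g. ES (eye separation) from the iris-to-iris distance.
ES = distance(left_iris, right_iris) / 1024

# A FAP displacement of, say, 50 FAPU-units then maps to 50 * ES pixels,
# independently of the particular face model -- the point of the FAPUs.
print(50 * ES)
```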
35. Facial Animation Parameters (FAPs)
- Correspond to facial muscle actions.
- Reproduce basic facial actions (expressions, emotions, articulation).
- 68 FAPs divided into 2 levels:
  - High level (2):
    - the visemes: mouth movements during elocution (predefined),
    - the expressions (neutral, anger, disgust, joy, sadness, fear, surprise).
[Figure: example expressions — Neutral, Joy, Surprise, Anger.]
36. Facial Animation Parameters (FAPs)
- Correspond to facial muscle actions.
- Reproduce basic facial actions (expressions, emotions, articulation).
- 68 FAPs divided into 2 levels:
  - High level (2)
  - Low level (66)
    - raise_b_midlip, stretch_l_cornerlip, raise_b_lip_lm, …
- Surprise, for example: open_jaw, raise_b_midlip, stretch_l_cornerlip, stretch_r_cornerlip, raise_b_lip_lm, raise_b_lip_rm, close_t_l_eyelid, close_t_r_eyelid, close_b_l_eyelid, close_b_r_eyelid, raise_l_i_eyebrow, raise_r_i_eyebrow, raise_l_m_eyebrow, raise_r_m_eyebrow, raise_l_o_eyebrow, raise_r_o_eyebrow, squeeze_l_eyebrow, squeeze_r_eyebrow, stretch_l_cornerlip_o, stretch_r_cornerlip_o.
37. Facial Animation Parameters (FAPs)
- Correspond to facial muscle actions.
- Reproduce basic facial actions (expressions, emotions, articulation).
- 68 FAPs divided into 2 levels:
  - High level (2):
    - the visemes: mouth movements during elocution (21 predefined),
    - the expressions (neutral, anger, disgust, joy, sadness, fear, surprise).
  - Low level (66):
    - raise_b_midlip, stretch_l_cornerlip, raise_b_lip_lm, …
    - Applicable to most of the FDPs.
- Transmitted continuously during model animation.
- Mesh deformations.
38. Requirements for facial animation
- The animation requires a large set of different emotions:
  - noise introduction.
- The animation needs to be realistic:
  - handling of temporal relations.
- The possibility of structuring the FDP subgroups:
  - introduction of hierarchy (AND/OR).
- Creating a facial expression is a complex task:
  - model learning.
→ Hierarchical Hidden Markov Model (HHMM)
39. HHMM for facial animation
[Figure: HHMM over levels 4 down to 0, from EXPRESSION(t) and EXPRESSION(t+1) at the top to the observations at level 0; nodes 3.2, 3.4 and 3.6 shown at an intermediate level.]
40. HHMM for facial animation
[Figure: the same hierarchy as slide 39, with a speech node added alongside EXPRESSION(t) and EXPRESSION(t+1) at the top level.]
41. Towards Multimodal Emotion Recognition using Bayesian Networks
- Olivier Martin, UCL
- Louvain-la-Neuve, Belgium
42. Goal
- Recognize the user's emotional state using a combination of facial, vocal and gestural information.
- (Use the recognised emotional state to understand the user's interactions in interactive applications.)
43. Outline
- Facial, vocal & gestural modalities for emotion recognition
- Multimodal Fusion & Fission
- Collaborations inside SIMILAR
44. System Overview: Facial Layer (feature extraction by the team of Alice Caplier, LIS-INPG, Grenoble, France)
[Figure: arrays of (x, y) feature-point coordinates feeding three Bayesian networks, which output the lips state, eye state and eyebrow state.]
45. System Overview: Vocal Layer (feature extraction by the team of Thierry Dutoit at Multitel, Mons, Belgium)
[Figure: for each of the signals Energy, Speaking Rate, Pitch and Noise, the statistics µ, s², min and max feed a Bayesian network that outputs vocal information — examples: "Angry?", "Stressed?".]
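The per-signal summary shown in the figure (µ, s², min, max) is straightforward to compute; the pitch contour below is invented for illustration:

```python
import statistics

def summarize(samples):
    """Return the (mean, variance, min, max) summary used as evidence for the vocal BN."""
    return {"mean": statistics.fmean(samples),
            "variance": statistics.pvariance(samples),  # population variance s²
            "min": min(samples),
            "max": max(samples)}

# Hypothetical F0 (pitch) contour in Hz over one utterance.
pitch = [180.0, 210.0, 195.0, 240.0]
print(summarize(pitch))
```

The same summary would be computed per utterance for energy, speaking rate and noise, giving the four (µ, s², min, max) blocks that feed the Bayesian network.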
46. System Overview: Gestural Layer (real-time tracking of gestural features provided by Alterface, Belgium)
[Figure: position & speed features feed a Bayesian network that outputs gestural information (examples).]
47. Multimodal Fusion: A Second-Stage Module
[Figure: lips states, left/right eyebrow states, left/right eye states, vocal info and gestural info feed a Bayesian network that outputs an emotional belief vector. N states → N variables!]
48. Examples of Multimodal Fission in Multimodal Emotion Recognition
- Speech detection: decrease the mouth's influence when the user is speaking, and turn on the vocal modality's influence.
- Hand tracking: detect occlusions, turn off the influence of occluded features (and find the semantic meaning of the occlusion).
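The fission logic above can be sketched as evidence selection; the feature names and values below are invented, and turning a modality fully off is a simplification of "decreasing its influence":

```python
def select_evidence(features, speaking, occluded):
    """Pick which observations feed the fusion BN, given the fission signals.
    features: dict name -> value; occluded: set of feature names hidden by a hand."""
    evidence = {k: v for k, v in features.items() if k not in occluded}
    if speaking:
        evidence.pop("mouth", None)   # mouth shape reflects speech, not emotion
    else:
        evidence.pop("voice", None)   # vocal modality off when the user is silent
    return evidence

obs = {"mouth": 0.4, "eyebrows": 0.7, "voice": 0.9}
print(select_evidence(obs, speaking=True, occluded=set()))
# {'eyebrows': 0.7, 'voice': 0.9}
```

In the Bayesian-network view, dropping a feature is just removing its edge (or leaving its node unobserved), which is exactly the add/subtract-edges flexibility advertised for DBNs earlier in the deck.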
49. Multimodal fission: an example
- P(Ck | Ck-1, Ak), where Ak = {F1,k, F2,k, G1,k, V1,k}
- Observations:
  - Fi: mouth corners
  - G1: a gestural feature
  - V1: a vocal feature
[Figure: three cases — the user has his hand in front of the right mouth corner, the user is not speaking, the user is speaking — each shown as a chain over Ck-1, Ck, Ck+1 with the active subset of F1, F2, G1, V1 attached.]
50. Using 5 distances
51. Learning
- Learning using Anthony's way to simulate disgust.
- P(Di | disgust) and P(Di | neutral)
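Once P(Di | disgust) and P(Di | neutral) are learned, classification compares the likelihoods of the five measured distances under each class. The Gaussian form of the likelihoods and all parameters below are assumptions for illustration:

```python
import math

def gauss(x, mu, sigma):
    """Gaussian density, the assumed form of P(Di | class)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical learned (mu, sigma) per distance, for each class.
disgust_params = [(0.8, 0.1)] * 5
neutral_params = [(0.5, 0.1)] * 5

def log_likelihood(distances, params):
    """Sum of log P(Di | class), assuming the distances are independent given the class."""
    return sum(math.log(gauss(d, mu, s)) for d, (mu, s) in zip(distances, params))

distances = [0.75, 0.82, 0.78, 0.80, 0.77]   # invented measurement of the 5 distances
is_disgust = (log_likelihood(distances, disgust_params)
              > log_likelihood(distances, neutral_params))
print(is_disgust)  # True
```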
52. Results
- 0% disgust recognition for Bert.
- Bert is not a very good actor! And most of the people claiming to be doing emotion recognition are NOT doing it.
53. Results (2)
- 96% disgust recognition for Alex.
- When asked to show disgust, Alex activates the same muscles as Anthony.
- → My system works!
54. Collaborations inside SIMILAR
- Thanks to:
  - LIS-INPG, Grenoble, France
    - facial feature extraction
  - Multitel, Mons, Belgium
    - vocal feature extraction
  - Alterface, Louvain-la-Neuve, Belgium
    - gestural feature extraction