Title: Dynamic Bayesian Networks for Multimodal Interaction
1. Dynamic Bayesian Networks for Multimodal Interaction
Tony Jebara, Machine Learning Lab, Columbia University
Joint work with A. Howard and N. Gu
2. Outline
- Introduction: Multi-Modal and Multi-Person
- Bayesian Networks and the Junction Tree Algorithm
- Maximum Likelihood and Expectation Maximization
- Dynamic Bayesian Networks (HMMs, Kalman Filters)
- Hidden ARMA Models
- Maximum Conditional Likelihood and Conditional EM
- Two-Person Visual Interaction (Gesture Games)
- Input-Output Hidden Markov Models
- Audio-Visual Interaction (Conversation)
- Intractable DBNs, Minimum Free Energy, Generalized EM
- Dynamical System Trees
- Multi-Person Visual Interaction (Football Plays)
- Haptic-Visual Modeling (Surgical Drills)
- Ongoing Directions
3. Introduction
- Simplest dynamical systems (a single Markovian process): the Hidden Markov Model and the Kalman Filter
- But multi-modal data (audio, video and haptics) have
  - processes at different time scales
  - processes at different amplitude scales
  - processes with different noise characteristics
- Also, multi-person data (multi-limb, two-person, group) are
  - weakly coupled
  - conditionally dependent
- Dangerous to slam all the time data into one single series
- Find new ways to zipper multiple interacting processes together
4. Bayesian Networks
- Also called graphical models
- Marry graph theory and statistics
- A directed graph efficiently encodes a large p(x1,...,xN) as a product of conditionals of each node given its parents (sketched below)
- Avoids storing a huge hypercube over all variables x1,...,xN
- Here, xi is discrete (multinomial) or continuous (Gaussian)
- Split BNs over sets of hidden XH and observed XV variables
- Three basic operations for BNs:
  1) Infer marginals/conditionals of hidden variables (JTA)
  2) Compute likelihood of data (JTA)
  3) Maximize likelihood of the data (EM)
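To make the factorization concrete, here is a minimal Python sketch (a toy binary chain with made-up tables, not from the talk): the joint over all variables is computed as a product of per-node conditional tables rather than stored as one hypercube.

import numpy as np

# Chain x1 -> x2 -> x3, each variable binary (multinomial case).
p_x1 = np.array([0.6, 0.4])                # p(x1)
p_x2_x1 = np.array([[0.7, 0.3],            # p(x2 | x1), rows index x1
                    [0.2, 0.8]])
p_x3_x2 = np.array([[0.9, 0.1],            # p(x3 | x2), rows index x2
                    [0.5, 0.5]])

def joint(x1, x2, x3):
    # Product of conditionals of each node given its parents.
    return p_x1[x1] * p_x2_x1[x1, x2] * p_x3_x2[x2, x3]

print(joint(0, 1, 1))   # 0.6 * 0.3 * 0.5 = 0.09

Storing 2 + 4 + 4 numbers replaces the full 2^3 table here, and the saving grows exponentially with the number of variables.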
5. Bayes Nets to Junction Trees
- Workhorse of BNs is the Junction Tree Algorithm
  1) Bayes Net
  2) Moral Graph
  3) Triangulated Graph
  4) Junction Tree
6. Junction Tree Algorithm
- The JTA sends messages from cliques through separators (these are just tables or potential functions)
- Ensures that the various tables in the junction tree graph agree/are consistent over shared variables (via marginals)
- Send a message from V to W, then a message from W to V; then the cliques agree
7. Junction Tree Algorithm
- On trees, JTA is guaranteed to converge: 1) Init, 2) Collect, 3) Distribute (one message pass is sketched below)
- Ends with potentials as marginals or conditionals of hidden variables given data: p(Xh1 | Xv), p(Xh2 | Xv), p(Xh1, Xh2 | Xv)
- And the likelihood p(Xv) is the potential normalizer
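A minimal sketch of one JTA message pass over two toy cliques (shapes and potentials assumed, not the talk's code): clique V covers (a, b), clique W covers (b, c), and the separator covers the shared variable b.

import numpy as np

phi_V = np.random.rand(2, 2)    # potential table over (a, b)
phi_W = np.random.rand(2, 2)    # potential table over (b, c)
phi_S = np.ones(2)              # separator potential over b

# Message from V to W: marginalize V onto the separator, rescale W.
new_S = phi_V.sum(axis=0)                     # sum out a
phi_W = phi_W * (new_S / phi_S)[:, None]
phi_S = new_S

# Message from W to V: the return pass.
new_S = phi_W.sum(axis=1)                     # sum out c
phi_V = phi_V * (new_S / phi_S)[None, :]
phi_S = new_S

# Now the cliques agree: their marginals over the shared b match.
assert np.allclose(phi_V.sum(axis=0), phi_W.sum(axis=1))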
8. Maximum Likelihood with EM
- We wish to maximize the likelihood over the parameters theta for learning
- EM instead iteratively maximizes a lower bound L(q, theta) on the log-likelihood (a concrete instance is sketched below)
- E-step: maximize the bound over the hidden-variable distribution q(z)
- M-step: maximize the bound over the parameters theta
[Figure: the lower bound L(q, theta) touching the log-likelihood at the current parameters]
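As a concrete instance of these two steps, here is a minimal EM sketch for a toy 1-D mixture of two Gaussians (an assumed example, not the talk's model):

import numpy as np
from scipy.stats import norm

x = np.concatenate([np.random.randn(100) - 2, np.random.randn(100) + 2])
pi, mu, sd = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E-step: posterior q(z) over the hidden component given theta.
    lik = pi * norm.pdf(x[:, None], mu, sd)          # shape (N, 2)
    q = lik / lik.sum(axis=1, keepdims=True)
    # M-step: re-maximize the bound over theta with q held fixed.
    Nk = q.sum(axis=0)
    pi = Nk / len(x)
    mu = (q * x[:, None]).sum(axis=0) / Nk
    sd = np.sqrt((q * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

Each iteration tightens the lower bound at the current theta and then climbs it, so the log-likelihood never decreases.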
9. Dynamic Bayes Nets
- Dynamic Bayesian Networks are BNs unrolled in time
- Simplest and most classical examples are the Linear Dynamical System (Kalman filter) and the Hidden Markov Model
- Each consists of a state transition model and an emission model (the HMM case is sketched below)
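For the HMM case, a minimal sketch (toy tables assumed) of the state transition model, the emission model, and the forward recursion that computes the data likelihood on this chain:

import numpy as np

A = np.array([[0.9, 0.1], [0.2, 0.8]])    # transition p(s_t | s_{t-1})
B = np.array([[0.8, 0.2], [0.3, 0.7]])    # emission p(x_t | s_t)
prior = np.array([0.5, 0.5])

def likelihood(x):
    # Forward pass: propagate through A, then weight by the emission.
    alpha = prior * B[:, x[0]]
    for xt in x[1:]:
        alpha = (alpha @ A) * B[:, xt]
    return alpha.sum()                    # p(x_1, ..., x_T)

print(likelihood([0, 0, 1, 0]))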
10. Two-Person Interaction
- Learn from two interacting people (person Y and person X) to mimic the interaction via a simulated person Y
- One hidden Markov model for each user: no coupling!
- One time series for both users: too rigid!
- Learn from two users to get p(y|x)
- Interact with a single user via p(y|x)
11. DBN: Hidden ARMA Model
- Learn to imitate a behavior by watching a teacher exhibit it, e.g. unsupervised observation of two-agent interaction, e.g. tracked lip motion
- Discover correlations between past action and subsequent reaction
- Estimate p(Y | past X, past Y)
12. DBN: Hidden ARMA Model
- Focus on predicting person Y from the past of both X and Y
- Have multiple linear models from the past to the future
- Use a window for the moving average (compressed with PCA)
- But select among the linear models using a hidden switch S (nonlinear)
- Here, we show only a 2nd-order moving average: predict the next Y given the past two Ys, the past two Xs, the current X, and a random choice of ARMA linear model (sketched below)
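A minimal sketch of that prediction step (hypothetical dimensions and weights): a hidden switch S selects one of several linear maps from the windowed past to the next Y.

import numpy as np

S, D = 3, 2                          # 3 linear models, 2-D observations
W = np.random.randn(S, D, 5 * D)     # one linear ARMA map per switch state

def predict_y(s, y_prev1, y_prev2, x_now, x_prev1, x_prev2):
    # 2nd-order regressor: past two Ys, past two Xs, current X.
    past = np.concatenate([y_prev1, y_prev2, x_prev1, x_prev2, x_now])
    return W[s] @ past                # linear model chosen by hidden s

# In the full model s is hidden, so the prediction averages the
# branches under p(s | past); here we evaluate a single branch.
y_next = predict_y(0, *[np.zeros(D)] * 5)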
13. Hidden ARMA Features
- Model skin color as a mixture of RGB Gaussians
- Track each person as a mixture of spatial Gaussians
- But we want to predict only Y from X: be discriminative
- Use maximum conditional likelihood (CEM)
14. Conditional EM
- Only need a conditional? Then maximize the conditional likelihood
- EM: divide and conquer
- CEM: discriminative divide and conquer
15Conditional EM
CEM p(yx)
CEM vs. EM p(cx,y) CEM accuracy 100 EM
accuracy 51
EM p(yx)
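A minimal sketch of what CEM maximizes, for a toy mixture over joint (x, y) pairs (an assumed model, not the talk's): the conditional log-likelihood log p(y|x) = log p(x,y) - log p(x), so components are only rewarded for improving the predictive mapping.

import numpy as np
from scipy.stats import multivariate_normal as mvn

def conditional_ll(data, pi, means, covs):
    # data rows are (x, y) pairs; x is component 0, y is component 1.
    joint = sum(pi[k] * mvn.pdf(data, means[k], covs[k])
                for k in range(len(pi)))                  # p(x, y)
    marg = sum(pi[k] * mvn.pdf(data[:, 0], means[k][0], covs[k][0, 0])
               for k in range(len(pi)))                   # p(x)
    return np.sum(np.log(joint) - np.log(marg))           # log p(y | x)

CEM iterates bound-and-maximize steps on this objective just as EM does on the joint likelihood; only the objective changes.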
16. Conditional EM for hidden ARMA
- Estimate the prediction discriminatively/conditionally: p(future | past)
- 2 users gesture to each other for a few minutes
- Model: mixture of 25 Gaussians, STM, T=120, Dims 2215
- RMS prediction error:
  Nearest Neighbor    1.57 RMS
  Constant Velocity   0.85 RMS
  Hidden ARMA         0.64 RMS
17Hidden ARMA on Gesture
SCARE
WAVE
CLAP
18. DBN: Input-Output HMM
- Similarly, learn a person's response to audio and video stimuli to predict Y (or agent A) from X (or world W)
- Wearable collects audio and video for A, W:
  - Sony Picturebook laptop
  - 2 cameras (7 Hz) (USB, analog)
  - 2 microphones (USB, analog)
  - 100 Megs per hour (10/Gig)
19. DBN: Input-Output HMM
- Consider simulating the agent given the world
- A hidden Markov model on its own is insufficient since it does not distinguish between the input role the world has and the output we need to generate
- Instead, form an input-output HMM (sketched below)
- One IOHMM predicts the agent's audio using all 3 past channels
- One IOHMM predicts the agent's video
- Use CEM to learn the IOHMM discriminatively
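A minimal sketch (toy tables assumed) of the input-output HMM idea: the transition and emission tables are conditioned on the observed input channel u (the world), and the forward pass scores the output channel y (the agent) given the input.

import numpy as np

A = np.random.rand(2, 3, 3); A /= A.sum(-1, keepdims=True)  # p(s'|s,u)
B = np.random.rand(2, 3, 4); B /= B.sum(-1, keepdims=True)  # p(y|s,u)
prior = np.ones(3) / 3

def cond_likelihood(u, y):
    # Forward pass for p(y_1..y_T | u_1..u_T), driven by the input u.
    alpha = prior * B[u[0], :, y[0]]
    for ut, yt in zip(u[1:], y[1:]):
        alpha = (alpha @ A[ut]) * B[ut, :, yt]
    return alpha.sum()

print(cond_likelihood(u=[0, 1, 1, 0], y=[2, 0, 3, 1]))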
20. Input-Output HMM Data
- Video: histogram lighting correction; RGB mixture of Gaussians to detect skin; face = 2000 pixels at 7 Hz (X, Y, Intensity)
- Audio: Hamming window, FFT, equalization (sketched below); spectrograms at 60 Hz; 200 bands (Amplitude, Frequency)
- Very noisy data set!
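A minimal sketch (assumed frame and hop sizes) of the audio feature pipeline named above, i.e. Hamming window then FFT to get per-band amplitudes:

import numpy as np

def spectrogram(signal, frame=512, hop=256):
    window = np.hamming(frame)
    frames = [signal[i:i + frame] * window
              for i in range(0, len(signal) - frame, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))    # amplitude per band

audio = np.random.randn(16000)     # stand-in for one second of audio
spec = spectrogram(audio)          # (num_frames, frame // 2 + 1)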
21. Video Representation
- Principal Components Analysis assumes linear vectors in a Euclidean space: images, spectrograms and time series become vectors
- But vectorization is bad, nonlinear: images are collections of (X,Y,I) pixel tuples; spectrograms are collections of (A,F) tuples
- Therefore: Corresponded Principal Components Analysis (CPCA), where the M are soft permutation matrices
22Video Representation
Original PCA CPCA
2000 XYI Pixels Compress to 20 dims
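A minimal sketch of the baseline PCA compression (assumed sizes): vectorize each frame's 2000 (X,Y,I) tuples and keep 20 coefficients. CPCA additionally estimates the soft permutation matrices M that put the tuples in correspondence before compressing; that step is omitted here.

import numpy as np

frames = np.random.rand(500, 2000 * 3)     # 500 vectorized face frames
mean = frames.mean(axis=0)
U, s, Vt = np.linalg.svd(frames - mean, full_matrices=False)
coeffs = (frames - mean) @ Vt[:20].T       # compress to 20 dims
recon = coeffs @ Vt[:20] + mean            # map back to pixel space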
23. Input-Output HMM
- Estimate the hidden trellis from partial data
- For agent and world: 1 loudness scalar, 20 spectrogram coefficients, 20 face coefficients
24. Input-Output HMM with CEM
- Conditionally model p(Agent Audio | World Audio, World Video) and p(Agent Video | World Audio, World Video)
- Don't care how well we can model world audio and video, just as long as we can map it to agent audio or agent video
- Avoids temporal scale problems too (Video 5 Hz, Audio 60 Hz)
- Audio IOHMM: CEM, 60-state, 82-dim HMM, diagonal Gaussian emissions, 90,000 samples train / 36,000 test
25Input-Output HMM with CEM
TRAINING TESTING
EM (red) CEM (blue) Audio 99.61
100.58 Video -122.46 -121.26
Joint Likelihood
Conditional Likelihood
RESYNTHESIS
Spectrograms from eigenspace KD-Tree on Video
Coefficients to closest image in
training (point-cloud too confusing)
26. Input-Output HMM Results
[Video: resynthesis results on train and test data]
27. Intractable Dynamic Bayes Nets
- Factorial Hidden Markov Model: interaction through the output
- Coupled Hidden Markov Model: interaction through the hidden states
28. Intractable DBNs: Generalized EM
- As before, we use a bound on the likelihood
- But the best q over hidden variables, the one that minimizes the KL divergence, is intractable!
- Thus, restrict q to only explore factorized distributions
- EM still converges under partial E-steps and partial M-steps (a toy coordinate update is sketched below)
[Figure: generalized EM maximizes the bound L(q, theta), i.e. minimizes the free energy, beneath the log-likelihood l(theta)]
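A minimal sketch of the restricted E-step on a toy model with two hidden variables (an assumed 3x3 log-potential table): q is constrained to the factorized form q(z1)q(z2), and each factor is updated in turn, which can only decrease the KL divergence to the true posterior.

import numpy as np

def normalize(p):
    return p / p.sum()

logp = np.random.randn(3, 3)     # log p(z1, z2, x) for the observed x

q1 = np.ones(3) / 3
q2 = np.ones(3) / 3
for _ in range(20):
    # Partial E-step on q1: expected log-potential under q2.
    q1 = normalize(np.exp(logp @ q2))
    # Partial E-step on q2: expected log-potential under q1.
    q2 = normalize(np.exp(logp.T @ q1))
    # (A partial M-step on theta would slot in between sweeps.)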
29. Intractable DBNs: Variational EM
- Now, the q distributions are limited to be chains
- Tractable as an iterative method
- Also known as variational EM or structured mean field
- Applies to the Factorial Hidden Markov Model and the Coupled Hidden Markov Model
30. Dynamical System Trees
- How to handle more people and a hierarchy of coupling?
- DSTs consider hierarchical coupling, e.g. at a university: staff and students → department → school → university
- Interaction through an aggregated community state
- Internal nodes are states; leaf nodes are emissions; any subtree is also a DST (the structure is sketched below)
- DST above unrolled over 2 time steps
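A minimal sketch (hypothetical classes, not the released code) of the tree structure just described: internal nodes carry aggregated community states, leaves carry emissions, and any subtree is itself a DST.

from dataclasses import dataclass, field
from typing import List

@dataclass
class DSTNode:
    children: List["DSTNode"] = field(default_factory=list)
    state: object = None       # internal node: aggregated community state
    emission: object = None    # leaf node: observed time series value

# Two teams of players aggregated into one game state (as in DST2):
team_a = DSTNode(children=[DSTNode(emission="player1"),
                           DSTNode(emission="player2")])
team_b = DSTNode(children=[DSTNode(emission="player3"),
                           DSTNode(emission="player4")])
game = DSTNode(children=[team_a, team_b])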
31. Dynamical System Trees
- Also apply the generalization of EM and do variational structured mean field for the q distribution
- Becomes formulaic for any DST topology!
- Code available at http://www.cs.columbia.edu/jebara/dst
32. DSTs and Generalized EM
- Structured mean field: use a tractable distribution Q to approximate P
- Alternate between introducing variational parameters and running inference
- Find min KL(Q || P)
33. DSTs for American Football
- Initial frame of a typical play
- Trajectories of players
34. DSTs for American Football
- 20 time series of two types of plays (wham and digs)
- Likelihood ratio of the models used as a classifier
- DST1 puts all players into 1 game state; DST2 combines players into two teams and then into a game state
35. DSTs for Gene Networks
- Time series of the cell cycle: hundreds of gene expression levels over time
- Use a given hierarchical clustering
- A DST with the hierarchical clustering structure gives the best test likelihood
36. Robotic Surgery, Haptics and Video
- da Vinci laparoscopic robot, used in hundreds of hospitals
- Surgeon works on a console; the robot mimics the movement on the (local) patient
- Captures all actuator/robot data as a 300 Hz time series
- Multi-channel video from cameras inside the patient
37. Robotic Surgery, Haptics and Video
38. Robotic Surgery, Haptics and Video
- 64-dimensional time series @ 300 Hz: console and actuator parameters
- Expert vs. novice suturing
39. Robotic Surgical Drills: Results
- Compress haptic and video data with PCA to 60 dims
- Collected data from novices and experts and built several DBNs (IOHMMs, DSTs, etc.) of expert and novice for 3 different drills (6 models total)
- Preliminary results: Minefield, Russian Roulette, Suture
40. Conclusion
- Dynamic Bayesian networks are a natural upgrade to HMMs
- Relevant for structured, multi-modal and multi-person temporal data
- Several examples of dynamic Bayesian networks for audio, video and haptic channels; for single, two-person and multi-person activity
- DBNs: HMMs, Kalman filters, hidden ARMA, input-output HMMs
- Use maximum likelihood (EM) or maximum conditional likelihood (CEM)
- Intractable DBNs: switched Kalman filters, dynamical system trees
- Use minimum free energy (GEM) and structured mean field
- Examples of applications:
  - gesture interaction (gesture games)
  - audio-video interaction (social conversation)
  - multi-person game playing (American football)
  - haptic-video interaction (robotic laparoscopy)
- Funding provided in part by the National Science Foundation, the Central Intelligence Agency, Alphastar and Microsoft.