Title: Dynamic Bayesian Networks for Multimodal Interaction
1. Dynamic Bayesian Networks for Multimodal Interaction
Tony Jebara, Machine Learning Lab, Columbia University
Joint work with A. Howard and N. Gu
2. Outline
- Introduction: Multi-Modal and Multi-Person
- Bayesian Networks and the Junction Tree Algorithm
- Maximum Likelihood and Expectation Maximization
- Dynamic Bayesian Networks (HMMs, Kalman Filters)
- Hidden ARMA Models
- Maximum Conditional Likelihood and Conditional EM
- Two-Person Visual Interaction (Gesture Games)
- Input-Output Hidden Markov Models
- Audio-Visual Interaction (Conversation)
- Intractable DBNs, Minimum Free Energy, Generalized EM
- Dynamical System Trees
- Multi-Person Visual Interaction (Football Plays)
- Haptic-Visual Modeling (Surgical Drills)
- Ongoing Directions
3. Introduction
- Simplest dynamical systems (a single Markovian process): the Hidden Markov Model and the Kalman Filter
- But multi-modal data (audio, video and haptics) have
  - processes at different time scales
  - processes at different amplitude scales
  - processes with different noise characteristics
- Also, multi-person data (multi-limb, two-person, group) are
  - weakly coupled
  - conditionally dependent
- Dangerous to slam all the time data into one single series
- Find new ways to zipper multiple interacting processes together
4. Bayesian Networks
- Also called graphical models
- Marry graph theory and statistics
- A directed graph efficiently encodes a large p(x1,...,xN) as a product of conditionals of each node given its parents (sketched below)
- Avoids storing a huge hypercube over all variables x1,...,xN
- Here, xi is discrete (multinomial) or continuous (Gaussian)
- Split BNs over sets of hidden XH and observed XV variables
- Three basic operations for BNs:
  1) Infer marginals/conditionals of hidden variables (JTA)
  2) Compute likelihood of data (JTA)
  3) Maximize likelihood of the data (EM)
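To make the factorization concrete, here is a minimal Python sketch (a toy binary chain with made-up tables, not from the talk): the joint over all variables is computed as a product of per-node conditional tables rather than stored as one hypercube.

import numpy as np

# Chain x1 -> x2 -> x3, each variable binary (multinomial case).
p_x1 = np.array([0.6, 0.4])                # p(x1)
p_x2_x1 = np.array([[0.7, 0.3],            # p(x2 | x1), rows index x1
                    [0.2, 0.8]])
p_x3_x2 = np.array([[0.9, 0.1],            # p(x3 | x2), rows index x2
                    [0.5, 0.5]])

def joint(x1, x2, x3):
    # Product of conditionals of each node given its parents.
    return p_x1[x1] * p_x2_x1[x1, x2] * p_x3_x2[x2, x3]

print(joint(0, 1, 1))   # 0.6 * 0.3 * 0.5 = 0.09

Storing 2 + 4 + 4 numbers replaces the full 2^3 table here, and the saving grows exponentially with the number of variables.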
5. Bayes Nets to Junction Trees
- Workhorse of BNs is the Junction Tree Algorithm
  1) Bayes Net
  2) Moral Graph
  3) Triangulated Graph
  4) Junction Tree
6. Junction Tree Algorithm
- The JTA sends messages from cliques through separators (these are just tables or potential functions)
- Ensures that the various tables in the junction tree graph agree/are consistent over shared variables (via marginals)
- Send a message from V to W, then a message from W to V; then the cliques agree
7. Junction Tree Algorithm
- On trees, JTA is guaranteed to converge: 1) Init, 2) Collect, 3) Distribute (one message pass is sketched below)
- Ends with potentials as marginals or conditionals of hidden variables given data: p(Xh1 | Xv), p(Xh2 | Xv), p(Xh1, Xh2 | Xv)
- And the likelihood p(Xv) is the potential normalizer
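A minimal sketch of one JTA message pass over two toy cliques (shapes and potentials assumed, not the talk's code): clique V covers (a, b), clique W covers (b, c), and the separator covers the shared variable b.

import numpy as np

phi_V = np.random.rand(2, 2)    # potential table over (a, b)
phi_W = np.random.rand(2, 2)    # potential table over (b, c)
phi_S = np.ones(2)              # separator potential over b

# Message from V to W: marginalize V onto the separator, rescale W.
new_S = phi_V.sum(axis=0)                     # sum out a
phi_W = phi_W * (new_S / phi_S)[:, None]
phi_S = new_S

# Message from W to V: the return pass.
new_S = phi_W.sum(axis=1)                     # sum out c
phi_V = phi_V * (new_S / phi_S)[None, :]
phi_S = new_S

# Now the cliques agree: their marginals over the shared b match.
assert np.allclose(phi_V.sum(axis=0), phi_W.sum(axis=1))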
8. Maximum Likelihood with EM
- We wish to maximize the likelihood over the parameters theta for learning
- EM instead iteratively maximizes a lower bound L(q, theta) on the log-likelihood (a concrete instance is sketched below)
- E-step: maximize the bound over the hidden-variable distribution q(z)
- M-step: maximize the bound over the parameters theta
[Figure: the lower bound L(q, theta) touching the log-likelihood at the current parameters]
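As a concrete instance of these two steps, here is a minimal EM sketch for a toy 1-D mixture of two Gaussians (an assumed example, not the talk's model):

import numpy as np
from scipy.stats import norm

x = np.concatenate([np.random.randn(100) - 2, np.random.randn(100) + 2])
pi, mu, sd = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E-step: posterior q(z) over the hidden component given theta.
    lik = pi * norm.pdf(x[:, None], mu, sd)          # shape (N, 2)
    q = lik / lik.sum(axis=1, keepdims=True)
    # M-step: re-maximize the bound over theta with q held fixed.
    Nk = q.sum(axis=0)
    pi = Nk / len(x)
    mu = (q * x[:, None]).sum(axis=0) / Nk
    sd = np.sqrt((q * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

Each iteration tightens the lower bound at the current theta and then climbs it, so the log-likelihood never decreases.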
9. Dynamic Bayes Nets
- Dynamic Bayesian Networks are BNs unrolled in time
- Simplest and most classical examples are the Linear Dynamical System (Kalman filter) and the Hidden Markov Model
- Each consists of a state transition model and an emission model (the HMM case is sketched below)
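For the HMM case, a minimal sketch (toy tables assumed) of the state transition model, the emission model, and the forward recursion that computes the data likelihood on this chain:

import numpy as np

A = np.array([[0.9, 0.1], [0.2, 0.8]])    # transition p(s_t | s_{t-1})
B = np.array([[0.8, 0.2], [0.3, 0.7]])    # emission p(x_t | s_t)
prior = np.array([0.5, 0.5])

def likelihood(x):
    # Forward pass: propagate through A, then weight by the emission.
    alpha = prior * B[:, x[0]]
    for xt in x[1:]:
        alpha = (alpha @ A) * B[:, xt]
    return alpha.sum()                    # p(x_1, ..., x_T)

print(likelihood([0, 0, 1, 0]))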
10. Two-Person Interaction
- Learn from two interacting people (person Y and person X) to mimic the interaction via a simulated person Y
- One hidden Markov model for each user: no coupling!
- One time series for both users: too rigid!
- Learn from two users to get p(y|x)
- Interact with a single user via p(y|x)
11. DBN: Hidden ARMA Model
- Learn to imitate a behavior by watching a teacher exhibit it, e.g. unsupervised observation of two-agent interaction, e.g. tracked lip motion
- Discover correlations between past action and subsequent reaction
- Estimate p(Y | past X, past Y)
12. DBN: Hidden ARMA Model
- Focus on predicting person Y from the past of both X and Y
- Have multiple linear models from the past to the future
- Use a window for the moving average (compressed with PCA)
- But select among the linear models using a hidden switch S (nonlinear)
- Here, we show only a 2nd-order moving average: predict the next Y given the past two Ys, the past two Xs, the current X, and a random choice of ARMA linear model (sketched below)
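A minimal sketch of that prediction step (hypothetical dimensions and weights): a hidden switch S selects one of several linear maps from the windowed past to the next Y.

import numpy as np

S, D = 3, 2                          # 3 linear models, 2-D observations
W = np.random.randn(S, D, 5 * D)     # one linear ARMA map per switch state

def predict_y(s, y_prev1, y_prev2, x_now, x_prev1, x_prev2):
    # 2nd-order regressor: past two Ys, past two Xs, current X.
    past = np.concatenate([y_prev1, y_prev2, x_prev1, x_prev2, x_now])
    return W[s] @ past                # linear model chosen by hidden s

# In the full model s is hidden, so the prediction averages the
# branches under p(s | past); here we evaluate a single branch.
y_next = predict_y(0, *[np.zeros(D)] * 5)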
13. Hidden ARMA Features
- Model skin color as a mixture of RGB Gaussians
- Track each person as a mixture of spatial Gaussians
- But we want to predict only Y from X: be discriminative
- Use maximum conditional likelihood (CEM)
14. Conditional EM
- Only need a conditional? Then maximize the conditional likelihood
- EM: divide and conquer
- CEM: discriminative divide and conquer
15Conditional EM
CEM p(yx)
CEM vs. EM p(cx,y) CEM accuracy 100 EM
accuracy 51
EM p(yx)
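A minimal sketch of what CEM maximizes, for a toy mixture over joint (x, y) pairs (an assumed model, not the talk's): the conditional log-likelihood log p(y|x) = log p(x,y) - log p(x), so components are only rewarded for improving the predictive mapping.

import numpy as np
from scipy.stats import multivariate_normal as mvn

def conditional_ll(data, pi, means, covs):
    # data rows are (x, y) pairs; x is component 0, y is component 1.
    joint = sum(pi[k] * mvn.pdf(data, means[k], covs[k])
                for k in range(len(pi)))                  # p(x, y)
    marg = sum(pi[k] * mvn.pdf(data[:, 0], means[k][0], covs[k][0, 0])
               for k in range(len(pi)))                   # p(x)
    return np.sum(np.log(joint) - np.log(marg))           # log p(y | x)

CEM iterates bound-and-maximize steps on this objective just as EM does on the joint likelihood; only the objective changes.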
16. Conditional EM for hidden ARMA
- Estimate the prediction discriminatively/conditionally: p(future | past)
- 2 users gesture to each other for a few minutes
- Model: mixture of 25 Gaussians, STM, T=120, Dims 2215
- RMS prediction error:
  Nearest Neighbor    1.57 RMS
  Constant Velocity   0.85 RMS
  Hidden ARMA         0.64 RMS
17Hidden ARMA on Gesture
SCARE
WAVE
CLAP
18. DBN: Input-Output HMM
- Similarly, learn a person's response to audio and video stimuli to predict Y (or agent A) from X (or world W)
- Wearable collects audio and video for A, W:
  - Sony Picturebook laptop
  - 2 cameras (7 Hz) (USB, analog)
  - 2 microphones (USB, analog)
  - 100 Megs per hour (10/Gig)
19. DBN: Input-Output HMM
- Consider simulating the agent given the world
- A hidden Markov model on its own is insufficient since it does not distinguish between the input role the world has and the output we need to generate
- Instead, form an input-output HMM (sketched below)
- One IOHMM predicts the agent's audio using all 3 past channels
- One IOHMM predicts the agent's video
- Use CEM to learn the IOHMM discriminatively
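A minimal sketch (toy tables assumed) of the input-output HMM idea: the transition and emission tables are conditioned on the observed input channel u (the world), and the forward pass scores the output channel y (the agent) given the input.

import numpy as np

A = np.random.rand(2, 3, 3); A /= A.sum(-1, keepdims=True)  # p(s'|s,u)
B = np.random.rand(2, 3, 4); B /= B.sum(-1, keepdims=True)  # p(y|s,u)
prior = np.ones(3) / 3

def cond_likelihood(u, y):
    # Forward pass for p(y_1..y_T | u_1..u_T), driven by the input u.
    alpha = prior * B[u[0], :, y[0]]
    for ut, yt in zip(u[1:], y[1:]):
        alpha = (alpha @ A[ut]) * B[ut, :, yt]
    return alpha.sum()

print(cond_likelihood(u=[0, 1, 1, 0], y=[2, 0, 3, 1]))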
20. Input-Output HMM Data
- Video: histogram lighting correction; RGB mixture of Gaussians to detect skin; face = 2000 pixels at 7 Hz (X, Y, Intensity)
- Audio: Hamming window, FFT, equalization (sketched below); spectrograms at 60 Hz; 200 bands (Amplitude, Frequency)
- Very noisy data set!
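A minimal sketch (assumed frame and hop sizes) of the audio feature pipeline named above, i.e. Hamming window then FFT to get per-band amplitudes:

import numpy as np

def spectrogram(signal, frame=512, hop=256):
    window = np.hamming(frame)
    frames = [signal[i:i + frame] * window
              for i in range(0, len(signal) - frame, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))    # amplitude per band

audio = np.random.randn(16000)     # stand-in for one second of audio
spec = spectrogram(audio)          # (num_frames, frame // 2 + 1)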
21. Video Representation
- Principal Components Analysis assumes linear vectors in a Euclidean space: images, spectrograms and time series become vectors
- But vectorization is bad, nonlinear: images are collections of (X,Y,I) pixel tuples; spectrograms are collections of (A,F) tuples
- Therefore: Corresponded Principal Components Analysis (CPCA), where the M are soft permutation matrices
22Video Representation
Original PCA CPCA
2000 XYI Pixels Compress to 20 dims
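A minimal sketch of the baseline PCA compression (assumed sizes): vectorize each frame's 2000 (X,Y,I) tuples and keep 20 coefficients. CPCA additionally estimates the soft permutation matrices M that put the tuples in correspondence before compressing; that step is omitted here.

import numpy as np

frames = np.random.rand(500, 2000 * 3)     # 500 vectorized face frames
mean = frames.mean(axis=0)
U, s, Vt = np.linalg.svd(frames - mean, full_matrices=False)
coeffs = (frames - mean) @ Vt[:20].T       # compress to 20 dims
recon = coeffs @ Vt[:20] + mean            # map back to pixel space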
23. Input-Output HMM
- Estimate the hidden trellis from partial data
- For agent and world: 1 loudness scalar, 20 spectrogram coefficients, 20 face coefficients
24. Input-Output HMM with CEM
- Conditionally model p(Agent Audio | World Audio, World Video) and p(Agent Video | World Audio, World Video)
- Don't care how well we can model world audio and video, just as long as we can map it to agent audio or agent video
- Avoids temporal scale problems too (Video 5 Hz, Audio 60 Hz)
- Audio IOHMM: CEM, 60-state, 82-dim HMM, diagonal Gaussian emissions, 90,000 samples train / 36,000 test
25Input-Output HMM with CEM
TRAINING TESTING
EM (red) CEM (blue) Audio 99.61
100.58 Video -122.46 -121.26
Joint Likelihood
Conditional Likelihood
RESYNTHESIS
Spectrograms from eigenspace KD-Tree on Video
Coefficients to closest image in
training (point-cloud too confusing)
26. Input-Output HMM Results
[Video: resynthesis results on train and test data]
27. Intractable Dynamic Bayes Nets
- Factorial Hidden Markov Model: interaction through the output
- Coupled Hidden Markov Model: interaction through the hidden states
28. Intractable DBNs: Generalized EM
- As before, we use a bound on the likelihood
- But the best q over hidden variables, the one that minimizes the KL divergence, is intractable!
- Thus, restrict q to only explore factorized distributions
- EM still converges under partial E-steps and partial M-steps (a toy coordinate update is sketched below)
[Figure: generalized EM maximizes the bound L(q, theta), i.e. minimizes the free energy, beneath the log-likelihood l(theta)]
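A minimal sketch of the restricted E-step on a toy model with two hidden variables (an assumed 3x3 log-potential table): q is constrained to the factorized form q(z1)q(z2), and each factor is updated in turn, which can only decrease the KL divergence to the true posterior.

import numpy as np

def normalize(p):
    return p / p.sum()

logp = np.random.randn(3, 3)     # log p(z1, z2, x) for the observed x

q1 = np.ones(3) / 3
q2 = np.ones(3) / 3
for _ in range(20):
    # Partial E-step on q1: expected log-potential under q2.
    q1 = normalize(np.exp(logp @ q2))
    # Partial E-step on q2: expected log-potential under q1.
    q2 = normalize(np.exp(logp.T @ q1))
    # (A partial M-step on theta would slot in between sweeps.)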
29. Intractable DBNs: Variational EM
- Now, the q distributions are limited to be chains
- Tractable as an iterative method
- Also known as variational EM or structured mean field
- Applies to the Factorial Hidden Markov Model and the Coupled Hidden Markov Model
30. Dynamical System Trees
- How to handle more people and a hierarchy of coupling?
- DSTs consider hierarchical coupling, e.g. at a university: staff and students → department → school → university
- Interaction through an aggregated community state
- Internal nodes are states; leaf nodes are emissions; any subtree is also a DST (the structure is sketched below)
- DST above unrolled over 2 time steps
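A minimal sketch (hypothetical classes, not the released code) of the tree structure just described: internal nodes carry aggregated community states, leaves carry emissions, and any subtree is itself a DST.

from dataclasses import dataclass, field
from typing import List

@dataclass
class DSTNode:
    children: List["DSTNode"] = field(default_factory=list)
    state: object = None       # internal node: aggregated community state
    emission: object = None    # leaf node: observed time series value

# Two teams of players aggregated into one game state (as in DST2):
team_a = DSTNode(children=[DSTNode(emission="player1"),
                           DSTNode(emission="player2")])
team_b = DSTNode(children=[DSTNode(emission="player3"),
                           DSTNode(emission="player4")])
game = DSTNode(children=[team_a, team_b])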
31. Dynamical System Trees
- Also apply the generalization of EM and do variational structured mean field for the q distribution
- Becomes formulaic for any DST topology!
- Code available at http://www.cs.columbia.edu/jebara/dst
32. DSTs and Generalized EM
- Structured mean field: use a tractable distribution Q to approximate P
- Alternate between introducing variational parameters and running inference
- Find min KL(Q || P)
33. DSTs for American Football
- Initial frame of a typical play
- Trajectories of players
34. DSTs for American Football
- 20 time series of two types of plays (wham and digs)
- Likelihood ratio of the models used as a classifier
- DST1 puts all players into 1 game state; DST2 combines players into two teams and then into a game state
35. DSTs for Gene Networks
- Time series of the cell cycle: hundreds of gene expression levels over time
- Use a given hierarchical clustering
- A DST with the hierarchical clustering structure gives the best test likelihood
36. Robotic Surgery, Haptics and Video
- da Vinci laparoscopic robot, used in hundreds of hospitals
- Surgeon works on a console; the robot mimics the movement on the (local) patient
- Captures all actuator/robot data as a 300 Hz time series
- Multi-channel video from cameras inside the patient
37. Robotic Surgery, Haptics and Video
38. Robotic Surgery, Haptics and Video
- 64-dimensional time series @ 300 Hz: console and actuator parameters
- Expert vs. novice suturing
39. Robotic Surgical Drills: Results
- Compress haptic and video data with PCA to 60 dims
- Collected data from novices and experts and built several DBNs (IOHMMs, DSTs, etc.) of expert and novice for 3 different drills (6 models total)
- Preliminary results: Minefield, Russian Roulette, Suture
40. Conclusion
- Dynamic Bayesian networks are a natural upgrade to HMMs
- Relevant for structured, multi-modal and multi-person temporal data
- Several examples of dynamic Bayesian networks for audio, video and haptic channels; for single, two-person and multi-person activity
- DBNs: HMMs, Kalman filters, hidden ARMA, input-output HMMs
- Use maximum likelihood (EM) or maximum conditional likelihood (CEM)
- Intractable DBNs: switched Kalman filters, dynamical system trees
- Use minimum free energy (GEM) and structured mean field
- Examples of applications:
  - gesture interaction (gesture games)
  - audio-video interaction (social conversation)
  - multi-person game playing (American football)
  - haptic-video interaction (robotic laparoscopy)
- Funding provided in part by the National Science Foundation, the Central Intelligence Agency, Alphastar and Microsoft.