1
Dynamic Bayesian Networks for Multimodal Interaction
Tony Jebara, Machine Learning Lab, Columbia University
Joint work with A. Howard and N. Gu
2
Outline
  • Introduction: Multi-Modal and Multi-Person
  • Bayesian Networks and the Junction Tree Algorithm
  • Maximum Likelihood and Expectation Maximization
  • Dynamic Bayesian Networks (HMMs, Kalman Filters)
  • Hidden ARMA Models
  • Maximum Conditional Likelihood and Conditional EM
  • Two-Person Visual Interaction (Gesture Games)
  • Input-Output Hidden Markov Models
  • Audio-Visual Interaction (Conversation)
  • Intractable DBNs, Minimum Free Energy,
    Generalized EM
  • Dynamical System Trees
  • Multi-Person Visual Interaction (Football Plays)
  • Haptic-Visual Modeling (Surgical Drills)
  • Ongoing Directions

3
Introduction
  • Simplest dynamical systems (a single Markovian process):
  • Hidden Markov Model and Kalman Filter
  • But multi-modal data (audio, video and haptics) have:
  • Different time-scale processes
  • Different amplitude-scale processes
  • Different noise characteristics
  • Also, multi-person data (multi-limb, two-person, group) are:
  • Weakly coupled
  • Conditionally dependent
  • Dangerous to slam all the data into one single time series
  • Find new ways to zipper multiple interacting processes together

4
Bayesian Networks
  • Also called Graphical Models
  • Marry graph theory and statistics
  • Directed graph which efficiently encodes a large p(x1,…,xN) as a
    product of conditionals of each node given its parents (see the
    sketch below)
  • Avoids storing a huge hypercube over all variables x1,…,xN
  • Here, each xi is discrete (multinomial) or continuous (Gaussian)
  • Split BNs over sets of hidden XH and observed XV variables
  • Three basic operations for BNs:
  • 1) Infer marginals/conditionals of hidden variables (JTA)
  • 2) Compute likelihood of data (JTA)
  • 3) Maximize likelihood of the data (EM)
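A toy sketch of this factorization (the rain/sprinkler/wet-grass network and its tables are invented for illustration, not from the talk):

```python
import numpy as np

# Hypothetical 3-node network: Rain -> WetGrass <- Sprinkler.
# Joint p(r, s, w) = p(r) p(s) p(w | r, s): three small tables
# instead of one 2x2x2 hypercube (the savings grow with N).
p_r = np.array([0.8, 0.2])                      # p(rain)
p_s = np.array([0.9, 0.1])                      # p(sprinkler)
p_w_rs = np.zeros((2, 2, 2))                    # p(wet | rain, sprinkler)
p_w_rs[0, 0] = [0.99, 0.01]
p_w_rs[0, 1] = [0.10, 0.90]
p_w_rs[1, 0] = [0.15, 0.85]
p_w_rs[1, 1] = [0.02, 0.98]

def joint(r, s, w):
    """Product of each node's conditional given its parents."""
    return p_r[r] * p_s[s] * p_w_rs[r, s, w]

# Sanity check: the factored joint sums to one.
total = sum(joint(r, s, w) for r in (0, 1) for s in (0, 1) for w in (0, 1))
print(total)  # 1.0
```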

5
Bayes Nets to Junction Trees
  • The workhorse of BNs is the Junction Tree Algorithm

1) Bayes Net
2) Moral Graph
3) Triangulated Graph
4) Junction Tree
6
Junction Tree Algorithm
  • The JTA sends messages from cliques through separators (these are
    just tables or potential functions)
  • Ensures that the various tables in the junction tree agree/are
    consistent over shared variables (via their marginals)

(Flow chart: send a message from V to W, then a message from W to V;
after both passes the cliques agree on the shared separator.)
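A minimal numeric sketch of one such exchange, assuming two invented cliques V = {A,B} and W = {B,C} with separator S = {B}:

```python
import numpy as np

# Two clique tables over invented binary variables.
psi_V = np.array([[0.3, 0.2], [0.1, 0.4]])   # psi_V(a, b)
psi_W = np.array([[0.5, 0.5], [0.9, 0.1]])   # psi_W(b, c)
phi_S = np.ones(2)                            # separator potential over B

# Message V -> W: marginalize V's table onto the separator,
# then rescale W by the ratio of new to old separator potentials.
phi_S_new = psi_V.sum(axis=0)                 # sum over A
psi_W = psi_W * (phi_S_new / phi_S)[:, None]
phi_S = phi_S_new

# Message W -> V (the return pass):
phi_S_new = psi_W.sum(axis=1)                 # sum over C
psi_V = psi_V * (phi_S_new / phi_S)[None, :]
phi_S = phi_S_new

# Both cliques now agree on the marginal of the shared variable B.
print(psi_V.sum(axis=0), psi_W.sum(axis=1))
```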
7
Junction Tree Algorithm
  • On trees, JTA is guaranteed: 1) Init, 2) Collect, 3) Distribute

Ends with potentials as marginals or conditionals of hidden variables
given data: p(Xh1|Xv), p(Xh2|Xv), p(Xh1,Xh2|Xv). The likelihood p(Xv)
is the potential normalizer.
8
Maximum Likelihood with EM
  • We wish to maximize the likelihood over parameters θ for learning
  • EM instead iteratively maximizes a lower bound on the log-likelihood:
    log p(Xv | θ) ≥ L(q, θ) = Σz q(z) log [ p(Xv, z | θ) / q(z) ]
  • E-step: set q(z) = p(z | Xv, θ), tightening the bound
  • M-step: maximize L(q, θ) over θ with q(z) held fixed

(Figure: the bound L(q, θ) alternately tightened over q(z) and raised
over θ against the log-likelihood.)
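A minimal EM sketch under simplifying assumptions (a two-component, unit-variance 1-D Gaussian mixture; data and initialization invented):

```python
import numpy as np

x = np.array([-2.1, -1.8, -0.9, 1.1, 1.9, 2.3])
mu = np.array([-1.0, 1.0])      # component means (theta)
pi = np.array([0.5, 0.5])       # mixing weights

for _ in range(50):
    # E-step: q(z) = p(z | x, theta), the posterior responsibilities.
    logp = -0.5 * (x[:, None] - mu[None, :]) ** 2 + np.log(pi)
    q = np.exp(logp - logp.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    # M-step: maximize the bound over theta given q.
    pi = q.mean(axis=0)
    mu = (q * x[:, None]).sum(axis=0) / q.sum(axis=0)

print(mu, pi)  # means settle near the two clusters
```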
9
Dynamic Bayes Nets
  • Dynamic Bayesian Networks are BNs unrolled in time
  • Simplest and most classical examples are:

(Figures: the Hidden Markov Model and the Linear Dynamical System,
each built from a state transition model and an emission model.)
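For the HMM case, a minimal sketch of the forward pass that computes the data likelihood (all tables invented):

```python
import numpy as np

A = np.array([[0.9, 0.1], [0.2, 0.8]])   # state transition model
B = np.array([[0.7, 0.3], [0.1, 0.9]])   # emission model p(x | s)
prior = np.array([0.5, 0.5])
obs = [0, 0, 1, 1, 1]

alpha = prior * B[:, obs[0]]
for x in obs[1:]:
    alpha = (alpha @ A) * B[:, x]         # propagate, then weight by emission

print(alpha.sum())  # likelihood p(x_1..x_T)
```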
10
Two-Person Interaction
  • Learn from two interacting people (person Y and person X) to mimic
    the interaction via a simulated person Y
  • One hidden Markov model for each user: no coupling!
  • One time series for both users: too rigid!

Interact with a single user via p(y|x)
Learn from two users to get p(y|x)
11
DBN: Hidden ARMA Model
Learn to imitate a behavior by watching a teacher exhibit it. E.g.,
unsupervised observation of 2-agent interaction; e.g., track lip
motion. Discover correlations between past action and subsequent
reaction. Estimate p(Y | past X, past Y).

(Figure: interacting time series X and Y.)
12
DBN: Hidden ARMA Model
  • Focus on predicting person Y from the past of both X and Y
  • Have multiple linear models mapping the past to the future
  • Use a window for the moving average (compressed with PCA)
  • But select among them using a switch S (nonlinear)
  • Here, we show only a 2nd-order moving average to predict the next Y
    given the past two Ys, the past two Xs, the current X, and a random
    choice of ARMA linear model (a sketch follows below)
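A minimal sketch of that prediction step, with invented placeholder weights standing in for learned ARMA models:

```python
import numpy as np

# The discrete switch s picks one linear model mapping the windowed
# past to the next y; the weight matrices here are invented.
rng = np.random.default_rng(0)
W = [rng.standard_normal((2, 10)) for _ in range(3)]  # 3 ARMA models

def predict_y(s, y1, y2, x0, x1, x2):
    """Next y given switch s, past two ys, current x and past two xs."""
    z = np.concatenate([y1, y2, x0, x1, x2])  # window over the past
    return W[s] @ z                            # chosen linear model

y_next = predict_y(s=1, y1=np.ones(2), y2=np.zeros(2),
                   x0=np.ones(2), x1=np.ones(2), x2=np.zeros(2))
print(y_next.shape)  # (2,)
```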

13
Hidden ARMA Features
  • Model skin color as a mixture of RGB Gaussians
  • Track the person as a mixture of spatial Gaussians
  • But we want to predict only Y from X: be discriminative
  • Use maximum conditional likelihood (CEM)

14
Conditional EM
  • Only need a conditional?
  • Then maximize conditional likelihood (see the identity below)

EM: divide & conquer
CEM: discriminative divide & conquer
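For reference, the conditional likelihood that CEM climbs is the joint likelihood minus the input marginal (a standard identity; m ranges over the hidden mixture components):

```latex
l_c(\theta) = \sum_i \log p(y_i \mid x_i, \theta)
            = \sum_i \Big[ \log \sum_m p(x_i, y_i, m \mid \theta)
                         - \log \sum_m p(x_i, m \mid \theta) \Big]
```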
15
Conditional EM
(Figures: fits of CEM p(y|x) vs. EM p(y|x).)
CEM vs. EM on p(c|x,y): CEM accuracy 100%, EM accuracy 51%
16
Conditional EM for hidden ARMA
Estimate the prediction discriminatively/conditionally: p(future | past)
2 users gesture to each other for a few minutes.
Model: mix of 25 Gaussians, STM T=120, Dims 2215

Nearest Neighbor: 1.57 RMS
Constant Velocity: 0.85 RMS
Hidden ARMA: 0.64 RMS
17
Hidden ARMA on Gesture
(Video panels: SCARE, WAVE, CLAP gestures.)
18
DBN: Input-Output HMM
  • Similarly, learn a person's response to audio-video stimuli to
    predict Y (or agent A) from X (or world W)
  • A wearable collects audio and video for A, W:

- Sony Picturebook laptop
- 2 cameras (7 Hz) (USB + analog)
- 2 microphones (USB + analog)
- 100 Megs per hour ($10/Gig)
19
DBN: Input-Output HMM
  • Consider simulating the agent given the world
  • A hidden Markov model on its own is insufficient since it does not
    distinguish between the input role the world has and the output we
    need to generate
  • Instead, form an input-output HMM (see the sketch below)
  • One IOHMM predicts the agent's audio using all 3 past channels
  • One IOHMM predicts the agent's video
  • Use CEM to learn the IOHMM discriminatively
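A minimal sketch of the IOHMM's conditional scoring, where the transition table is indexed by the current input (all tables invented):

```python
import numpy as np

A = np.array([[[0.9, 0.1], [0.2, 0.8]],      # A[u] = p(s_t | s_{t-1}, u)
              [[0.3, 0.7], [0.6, 0.4]]])
B = np.array([[0.8, 0.2], [0.1, 0.9]])       # p(y | s)
prior = np.array([0.5, 0.5])

def cond_likelihood(inputs, outputs):
    """p(y_1..y_T | u_1..u_T): score outputs conditioned on inputs."""
    alpha = prior * B[:, outputs[0]]
    for u, y in zip(inputs[1:], outputs[1:]):
        alpha = (alpha @ A[u]) * B[:, y]      # input-dependent transition
    return alpha.sum()

print(cond_likelihood([0, 1, 1, 0], [0, 0, 1, 1]))
```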

20
Input-Output HMM Data
Video:
- Histogram lighting correction
- RGB mixture of Gaussians to detect skin
- Face: 2000 pixels at 7 Hz (X, Y, Intensity)
Audio:
- Hamming window, FFT, equalization
- Spectrograms at 60 Hz
- 200 bands (Amplitude, Frequency)
Very noisy data set!
21
Video Representation
- Principal Components Analysis: linear vectors in Euclidean space
- Images, spectrograms, time series as vectors
- But vectorization is bad, nonlinear
- Images are collections of (X,Y,I) pixel tuples
- Spectrograms are collections of (A,F) tuples, therefore...
- Corresponded Principal Components Analysis (CPCA)

(Figure: the M are soft permutation matrices aligning tuples before
projection.)
22
Video Representation
(Figure: Original vs. PCA vs. CPCA reconstructions; 2000 XYI pixels
compressed to 20 dims.)
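A minimal plain-PCA sketch of the 20-dim compression (random stand-in frames; note CPCA additionally optimizes the soft correspondences M before projecting, which this omits):

```python
import numpy as np

# Vectorized frames: 2000 (X,Y,I) pixel tuples -> 6000 dims each.
frames = np.random.default_rng(0).standard_normal((500, 6000))
mean = frames.mean(axis=0)
U, S, Vt = np.linalg.svd(frames - mean, full_matrices=False)

coeffs = (frames - mean) @ Vt[:20].T      # 20 coefficients per frame
recon = coeffs @ Vt[:20] + mean           # back to pixel space
print(coeffs.shape, recon.shape)          # (500, 20) (500, 6000)
```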
23
Input-Output HMM
Estimate the hidden trellis from partial data.
For agent and world: 1 loudness scalar, 20 spectrogram coefficients,
20 face coefficients.
24
Input-Output HMM with CEM
Conditionally model p(Agent Audio | World Audio, World Video) and
p(Agent Video | World Audio, World Video). We don't care how well we
can model the world's audio and video, just as long as we can map it
to agent audio or agent video. This avoids temporal scale problems
too (Video 5 Hz, Audio 60 Hz).

Audio IOHMM: CEM, 60-state 82-dim HMM, diagonal Gaussian emissions,
90,000 samples train / 36,000 test
25
Input-Output HMM with CEM
(Plots: joint and conditional likelihood over training and testing.)
EM (red) vs. CEM (blue): Audio 99.61 vs. 100.58; Video -122.46 vs.
-121.26

RESYNTHESIS: spectrograms from the eigenspace; a KD-tree on video
coefficients picks the closest training image (the raw point cloud is
too confusing)
26
Input-Output HMM Results
(Videos: resynthesis results on Train and Test sequences.)
27
Intractable Dynamic Bayes Nets
Interaction through output: Factorial Hidden Markov Model
Interaction through hidden states: Coupled Hidden Markov Model
28
Intractable DBNs: Generalized EM
  • As before, we use a bound on the likelihood
  • But the best q over the hidden variables, the one that minimizes
    the KL divergence, is intractable!
  • Thus, restrict q to only explore factorized distributions
  • EM still converges under partial E-steps and partial M-steps

(Figure: the bound L(q, θ) sitting under the log-likelihood l(θ),
raised by partial updates.)
29
Intractable DBNs: Variational EM
  • Now, the q distributions are limited to be chains
  • Tractable as an iterative method (a toy sketch follows below)
  • Also known as variational EM or structured mean field

(Figures: chain-structured q distributions for the Factorial Hidden
Markov Model and the Coupled Hidden Markov Model.)
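A toy sketch of the mean-field idea in its simplest fully factorized form, for two coupled discrete variables rather than chains (the joint table is invented):

```python
import numpy as np

# Approximate a coupled joint p(z1, z2) by a factorized q(z1) q(z2)
# via coordinate ascent on the free-energy bound.
p = np.array([[0.30, 0.10], [0.05, 0.55]])   # p(z1, z2)
logp = np.log(p)
q1 = np.array([0.5, 0.5])
q2 = np.array([0.5, 0.5])

for _ in range(100):
    # Each update sets log q_i proportional to E_{q_j}[log p(z1, z2)].
    q1 = np.exp(logp @ q2);  q1 /= q1.sum()
    q2 = np.exp(q1 @ logp);  q2 /= q2.sum()

print(q1, q2)  # factorized approximation to the coupled joint
```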
30
Dynamical System Trees
  • How to handle more people and a hierarchy of coupling?
  • DSTs consider coupling such as university staff:
  • students -> department -> school -> university

Interaction through an aggregated community state. Internal nodes are
states; leaf nodes are emissions. Any subtree is also a DST. (Figure:
the DST above unrolled over 2 time steps.)
31
Dynamical System Trees
  • Also apply the generalization of EM and do variational structured
    mean field for the q distribution
  • Becomes formulaic for any DST topology!
  • Code available at http://www.cs.columbia.edu/jebara/dst

32
DSTs and Generalized EM
(Figure: alternating steps of inference and introducing variational
parameters.)
Structured Mean Field: use a tractable distribution Q to approximate
P; introduce variational parameters; find min KL(Q||P)
33
DSTs for American Football
(Figures: initial frame of a typical play; trajectories of players.)
34
DSTs for American Football
20 time series of two types of plays (wham and digs). The likelihood
ratio of the two models is used as a classifier (see the sketch
below). DST1 puts all players into one game state; DST2 combines
players into two teams and then into the game.
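A minimal sketch of that likelihood-ratio rule; the loglik scorers below are hypothetical stand-ins for the two trained DSTs:

```python
import numpy as np

def classify(series, loglik_wham, loglik_dig):
    """Return 'wham' if the wham model explains the series better."""
    ratio = loglik_wham(series) - loglik_dig(series)
    return "wham" if ratio > 0 else "dig"

# Stand-in Gaussian scorers, for illustration only:
loglik_wham = lambda s: -0.5 * np.sum((s - 1.0) ** 2)
loglik_dig = lambda s: -0.5 * np.sum((s + 1.0) ** 2)
print(classify(np.ones(10), loglik_wham, loglik_dig))  # 'wham'
```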
35
DSTs for Gene Networks
  • Time series of the cell cycle
  • Hundreds of gene expression levels over time
  • Use a given hierarchical clustering
  • A DST with the hierarchical clustering structure gives the best
    test likelihood

36
Robotic Surgery, Haptics & Video
  • DaVinci Laparoscopic Robot
  • Used in hundreds of hospitals
  • Surgeon works at a console; the robot mimics the movement on the
    (local) patient
  • Captures all actuator/robot data as a 300 Hz time series
  • Multi-channel video from cameras inside the patient

37
Robotic Surgery, Haptics & Video
38
Robotic Surgery, Haptics & Video
64-dimensional time series @ 300 Hz: console and actuator parameters

(Videos: Expert vs. Novice suturing.)
39
Robotic Surgical Drills Results
  • Compress haptic and video data with PCA to 60 dims
  • Collected data from novices and experts and built several DBNs
    (IOHMMs, DSTs, etc.) of expert and novice for 3 different drills
    (6 models total)
  • Preliminary results:

(Results for the three drills: Minefield, Russian Roulette, Suture.)

40
Conclusion
  • Dynamic Bayesian networks are a natural upgrade to HMMs.
  • Relevant for structured, multi-modal and multi-person temporal data.
  • Several examples of dynamic Bayesian networks for:
  • audio, video and haptic channels
  • single, two-person and multi-person activity.
  • DBNs: HMMs, Kalman filters, hidden ARMA, input-output HMMs.
  • Use max likelihood (EM) or max conditional likelihood (CEM).
  • Intractable DBNs: switched Kalman filters, dynamical system trees.
  • Use min free energy (GEM) and structured mean field.
  • Examples of applications:
  • gesture interaction (gesture games)
  • audio-video interaction (social conversation)
  • multi-person game playing (American football)
  • haptic-video interaction (robotic laparoscopy).
  • Funding provided in part by the National Science Foundation, the
    Central Intelligence Agency, Alphastar and Microsoft.