Title: Audio-Visual Speaker Detection Using Boosted Bayesian Networks
1. Audio-Visual Speaker Detection Using Boosted Bayesian Networks
- Jim Rehg
- College of Computing
- Georgia Institute of Technology
- Joint work with Vladimir Pavlovic, Ashutosh Garg, Tanzeem Choudhury, and Kevin Murphy
2. Smart Kiosk (DEC/Compaq)
- Initiate interaction
- Tailor content
- Kiosk in the user's world: context must be inferred
- Communicate focus of attention
- Arbitrate requests
3. Lessons From Cybersmith
- Content-authoring and remote maintenance were critical to successful deployment.
- Content is king, but it is also costly and time-consuming.
- Interactive entertainment was far more popular than information (possibly limited by the site).
- Most appealing to teenagers and older people.
- Many people want to talk to the kiosk.
4. Speaker Detection for Kiosk
Rehg, Murphy, and Fieguth, CVPR 99
- Basic question for the kiosk: Is anyone speaking to me?
- Speakers face the kiosk, move their lips, and generate speech.
- Visual features: frontal faces, mouth motion energy
- Audio features: acoustic energy
- Bayesian network models
  - Encode coupling between noisy sensors
  - Represent context from the kiosk interface
5. Frontal Face Subnet
[Figure: Frontal face subnet. Hidden states: Frontal, Visible, Skin. Sensor outputs: neural-net (CMU) face detector and color-based skin detector.]
6. Speech Production Subnet
[Figure: Speech production subnet with a mouth motion sensor.]
7. Speaker Detection Network
[Figure: Full speaker detection network. Hidden states include Speaker, Near kiosk, Facing display, Talking, Speaking, Speech, Frontal, Visible, Sound, Skin, Texture, and Mouth; Kiosk I/O provides context. Sensor outputs: neural-net (CMU) face detector, texture-based face detector, color-based skin detector, mouth motion energy, and audio energy.]
- CPTs are learned from fully-labeled data (a sketch of this step follows below).
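Since every node in the network is observed in the training data, each CPT can be estimated by counting. Below is a minimal Python sketch of that step; the variable names, the Laplace smoothing constant, and the random toy data are illustrative assumptions, not the deck's actual implementation.

```python
import numpy as np

def learn_cpt(child, parents, data, card, alpha=1.0):
    """Estimate P(child | parents) from fully-labeled discrete data by
    counting, with Laplace pseudo-counts alpha for smoothing.

    data: dict mapping variable name -> integer array of shape (N,)
    card: dict mapping variable name -> number of discrete states
    Returns an array of shape (card[p1], ..., card[pk], card[child]).
    """
    shape = [card[p] for p in parents] + [card[child]]
    counts = np.full(shape, alpha)
    for n in range(len(data[child])):
        idx = tuple(data[p][n] for p in parents) + (data[child][n],)
        counts[idx] += 1
    return counts / counts.sum(axis=-1, keepdims=True)

# Hypothetical toy family: binary Speaking node with parents Frontal, Speech.
rng = np.random.default_rng(0)
data = {v: rng.integers(0, 2, size=200) for v in ("frontal", "speech", "speaking")}
cpt = learn_cpt("speaking", ["frontal", "speech"], data,
                card={"frontal": 2, "speech": 2, "speaking": 2})
print(cpt[1, 1])  # P(speaking | frontal=1, speech=1)
```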
8. Speaker Detection Results
[Figure: Ground truth and estimated speaker state; experimental setup with the subject, a secondary person, and the kiosk.]
9. Observations
- Sporadic errors should be improved through the use of a dynamic model.
- Specification of hidden states is inherently ambiguous
  - Hard to evaluate subnet performance
- Experiment degenerated into a spontaneous 3-way conversation
  - Hard to determine the true speaker state
- Turn-taking provides powerful context.
10. Dynamic Bayesian Networks
Garg, Pavlovic, and Rehg, FG 00
- Add temporal dependencies between hidden states over time.
- Within each time slice, the static Bayes net is replicated.
[Figure: Two-slice DBN; the Speaking, Frontal, and Speech nodes at time t-1 are linked to their counterparts at time t.]
11. Learning DBN Models
- This is a discrete HMM with a factored output density.
- Since the network is fully-labeled, exact learning is feasible: the transition matrix and emission CPTs reduce to normalized counts (a filtering sketch for this model follows below).
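A minimal filtering sketch for this model, assuming a two-state hidden variable and two binary sensor streams; the numbers and the `forward_filter` helper are hypothetical, but the factored emission product and the forward recursion follow the standard HMM construction, and with fully-labeled sequences `pi`, `A`, and `emis` would come directly from normalized counts.

```python
import numpy as np

def forward_filter(obs, pi, A, emis):
    """Forward pass for a discrete HMM whose output density factors into
    independent observation streams given the hidden state.

    obs:  list of T tuples, one discrete symbol per stream
    pi:   (S,) initial state distribution
    A:    (S, S) transition matrix, A[i, j] = P(s_t = j | s_{t-1} = i)
    emis: list of (S, V_k) emission CPTs, one per stream
    Returns filtered posteriors, shape (T, S).
    """
    alpha = np.zeros((len(obs), len(pi)))
    for t, symbols in enumerate(obs):
        lik = np.ones(len(pi))
        for k, o in enumerate(symbols):   # factored output density
            lik *= emis[k][:, o]
        pred = pi if t == 0 else alpha[t - 1] @ A
        alpha[t] = pred * lik
        alpha[t] /= alpha[t].sum()
    return alpha

# Hypothetical 2-state example (not speaking / speaking) with two binary
# sensor streams (audio energy, mouth motion energy).
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
emis = [np.array([[0.8, 0.2], [0.3, 0.7]]),   # audio stream CPT
        np.array([[0.7, 0.3], [0.4, 0.6]])]   # mouth-motion stream CPT
print(forward_filter([(1, 1), (1, 0), (0, 0)], pi, A, emis))
```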
12. Experimental Results
[Figure: Ground truth compared with estimates from the static BN, dynamic BN, and duration-density (DD) DBN.]
13. Duration Density DBN
- Added a duration model to the temporal arcs.
- The learned duration distribution deviates from the exponential (geometric) decay that a standard HMM's fixed self-transition probability implies; the toy sketch below illustrates the difference.
- The additional complexity of inference is a barrier.
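To see why the explicit duration model matters, compare the geometric decay implied by a fixed self-transition probability with a duration histogram estimated from labeled segments. The self-transition value and segment lengths below are made up for illustration.

```python
import numpy as np

p = 0.9                                  # assumed self-transition probability
d = np.arange(1, 31)
geometric = (1 - p) * p ** (d - 1)       # P(duration = d) under a plain HMM

# Duration histogram from labeled state segments (toy frame counts).
segments = np.array([8, 9, 10, 10, 11, 12, 12, 13, 14, 15])
hist = np.bincount(segments, minlength=31)[1:31].astype(float)
hist /= hist.sum()

print("mode of geometric model:", d[np.argmax(geometric)])  # always d = 1
print("mode of learned model:  ", d[np.argmax(hist)])       # peaked away from 1
```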
14. Experimental Results
15. Parameter Learning
- Review the steps in parameter learning for the factored HMM model.
- What is going wrong here?
16. Bayesian Networks for Classification
- Advantages
  - Ease of representing knowledge and constraints from the task domain
  - Efficient algorithms for inference and learning
  - Generality
- Disadvantages
  - Unsupervised learning may yield suboptimal classifier performance
17. Issues with Bayes Net Classifiers
- Show an example of naïve Bayes vs. richer feature structure (structure matters).
- Factorization of the density for parameter estimation.
- The standard learner does not maximize the conditional likelihood (see the numeric illustration below).
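A small numeric illustration of the last point, on assumed toy data: when a feature is duplicated (so the naïve Bayes factorization is wrong), the parameters that maximize the joint likelihood, i.e. the usual counting estimates, are not the parameters that maximize the conditional likelihood of the label.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2000
y = rng.integers(0, 2, N)
# One informative binary feature, duplicated so the naive Bayes
# independence assumption is violated.
x = (rng.random(N) < np.where(y == 1, 0.8, 0.3)).astype(int)
X = np.stack([x, x], axis=1)

def joint_ll(q1, q0):
    """Joint log-likelihood of (X, y) under naive Bayes with
    P(x_k=1 | y=1) = q1, P(x_k=1 | y=0) = q0, uniform class prior."""
    px = np.where(y[:, None] == 1, q1, q0)
    return (np.log(np.where(X == 1, px, 1 - px)).sum(1) + np.log(0.5)).sum()

def cond_ll(q1, q0):
    """Conditional log-likelihood of y given X under the same model."""
    l1 = np.where(X == 1, q1, 1 - q1).prod(1)
    l0 = np.where(X == 1, q0, 1 - q0).prod(1)
    post1 = l1 / (l1 + l0)
    return np.log(np.where(y == 1, post1, 1 - post1)).sum()

qs = np.linspace(0.05, 0.95, 91)
grid = [(joint_ll(a, b), cond_ll(a, b), a, b) for a in qs for b in qs]
print("joint-LL optimum (ML counts):", max(grid)[2:])                      # ~ (0.80, 0.30)
print("conditional-LL optimum:      ", max(grid, key=lambda t: t[1])[2:])  # less extreme
```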
18. Boosted Parameter Learning
- Apply AdaBoost to Bayes net parameter learning (a sketch follows below).
- Improved classifier performance.
- Costs only a constant multiple of the base training cost.
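A sketch of the idea, assuming a naïve Bayes weak learner over binary features: AdaBoost reweights the training examples, and the weighted ML fit just replaces raw counts with weighted sums, so each round costs roughly one ordinary training pass (the constant-multiple claim above). Function names and the toy usage are hypothetical.

```python
import numpy as np

def weighted_nb_fit(X, y, w, alpha=1.0):
    """Weighted-ML naive Bayes for binary features/labels: the example
    weights from boosting turn CPT counts into weighted sums."""
    params, prior = {}, np.zeros(2)
    for c in (0, 1):
        wc = w[y == c]
        params[c] = ((wc[:, None] * X[y == c]).sum(0) + alpha) / (wc.sum() + 2 * alpha)
        prior[c] = wc.sum()
    return prior / prior.sum(), params

def nb_predict(X, prior, params):
    ll = np.zeros((len(X), 2))
    for c in (0, 1):
        p = params[c]
        ll[:, c] = np.log(prior[c]) + (X * np.log(p) + (1 - X) * np.log(1 - p)).sum(1)
    return ll.argmax(1)

def adaboost_nb(X, y, rounds=10):
    """AdaBoost.M1 with weighted-ML naive Bayes as the weak learner."""
    w = np.ones(len(y)) / len(y)
    ensemble = []
    for _ in range(rounds):
        prior, params = weighted_nb_fit(X, y, w)
        pred = nb_predict(X, prior, params)
        err = w[pred != y].sum()
        if err >= 0.5:
            break
        beta = max(err, 1e-10) / (1 - err)
        ensemble.append((np.log(1 / beta), prior, params))
        if err == 0:
            break
        w *= np.where(pred == y, beta, 1.0)   # down-weight correct examples
        w /= w.sum()
    return ensemble

def ensemble_predict(X, ensemble):
    votes = np.zeros((len(X), 2))
    for weight, prior, params in ensemble:
        votes[np.arange(len(X)), nb_predict(X, prior, params)] += weight
    return votes.argmax(1)

# Toy usage on synthetic binary data.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, (500, 4))
y = (X.sum(1) >= 2).astype(int)
print("train accuracy:", (ensemble_predict(X, adaboost_nb(X, y, 25)) == y).mean())
```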
19. Genie Casino Experiment
- Single-player blackjack game vs. multiple agents.
- Game context governs speech with the dealer.
- Record frequency-encoded game state simultaneously with the audio-visual input.
20. Speaker Detection Results
Garg, Pavlovic, Rehg, and Huang, CVPR 00
21. Comparison
22. Overall Results
23. Boosting Analysis
- Post-hoc analysis of boosting as adjusting the counts in the histograms associated with the measurement CPTs and the transition matrix.
24. Structure Learning
- Describe the basic problem.
- K2 algorithm (sketched below)
  - Closed-form expression for marginalizing over the probability of the parameters
  - Node ordering and a limited number of parents for the greedy heuristic
- Draw graphs showing the effect of structure learning and the MCMC version on static BN examples.
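A compact sketch of K2 under exactly those two restrictions, using the Cooper-Herskovits closed-form family score with a uniform parameter prior. Here `data` is a dict of integer arrays, `card` maps each variable to its number of states, and the `max_parents` limit and helper names are assumptions of the sketch.

```python
import numpy as np
from itertools import product
from scipy.special import gammaln

def k2_score(child, parents, data, card):
    """Log Cooper-Herskovits score of one family: the CPT parameters are
    marginalized out in closed form under a uniform Dirichlet prior."""
    r = card[child]
    score = 0.0
    for config in product(*(range(card[p]) for p in parents)):
        mask = np.ones(len(data[child]), bool)
        for p, v in zip(parents, config):
            mask &= data[p] == v
        counts = np.bincount(data[child][mask], minlength=r)
        score += gammaln(r) - gammaln(counts.sum() + r)
        score += gammaln(counts + 1).sum()
    return score

def k2(order, data, card, max_parents=2):
    """Greedy K2: respecting the fixed node ordering, repeatedly add the
    single parent that most improves the family score, else stop."""
    parents = {v: [] for v in order}
    for i, v in enumerate(order):
        best = k2_score(v, parents[v], data, card)
        while len(parents[v]) < max_parents:
            candidates = [c for c in order[:i] if c not in parents[v]]
            if not candidates:
                break
            s, c = max((k2_score(v, parents[v] + [c], data, card), c)
                       for c in candidates)
            if s <= best:
                break
            best = s
            parents[v].append(c)
    return parents
```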
25. Boosted Structure Learning
Joint work with Tanzeem Choudhury
- Structure learner: K2 plus MCMC search over node orderings (Friedman and Koller 00); a sketch of the search follows below.
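One way to realize that search, sketched as a Metropolis chain whose proposal swaps two positions in the ordering. The `score_fn` argument is an assumption: here it could be, say, the total family score of the network the K2 sketch above recovers under an ordering, whereas Friedman and Koller score an ordering by summing over parent sets.

```python
import numpy as np

def mcmc_orderings(variables, score_fn, steps=500, seed=0):
    """Metropolis search over node orderings: propose swapping two
    positions, accept with probability min(1, exp(score' - score)).
    The swap proposal is symmetric, so no Hastings correction is needed."""
    rng = np.random.default_rng(seed)
    order = list(variables)
    s = score_fn(order)
    best, best_s = list(order), s
    for _ in range(steps):
        i, j = rng.choice(len(order), size=2, replace=False)
        proposal = list(order)
        proposal[i], proposal[j] = proposal[j], proposal[i]
        s_new = score_fn(proposal)
        if np.log(rng.random()) < s_new - s:      # MH accept/reject
            order, s = proposal, s_new
            if s > best_s:
                best, best_s = list(order), s
    return best, best_s

# Hypothetical usage with the k2/k2_score sketch from the previous slide:
# score = lambda o: sum(k2_score(v, pa, data, card)
#                       for v, pa in k2(o, data, card).items())
# best_order, _ = mcmc_orderings(list(card), score)
```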
26. Experimental Results
27. Conclusions and Future Work
- Dynamic Bayesian network models are a powerful framework for cue fusion.
- Boosting can be used to construct a supervised DBN learner (for both parameters and structure).
- Future work
  - Use a richer set of audio-visual cues
  - Explore feature selection given labeled state data
  - Expand the context model
  - Integrate with a speech recognizer