Title: Audio-Visual Speaker Detection Using Boosted Bayesian Networks
1. Audio-Visual Speaker Detection Using Boosted Bayesian Networks
- Jim Rehg
- College of Computing
- Georgia Institute of Technology
- Joint work with Vladimir Pavlovic, Ashutosh Garg, Tanzeem Choudhury, and Kevin Murphy
2. Smart Kiosk (DEC/Compaq)
- Initiate interaction
- Tailor content
- Kiosk in the user's world: context must be inferred
- Communicate focus of attention
- Arbitrate requests
3. Lessons From Cybersmith
- Content-authoring and remote maintenance were critical to successful deployment.
- Content is king, but it is also costly and time-consuming.
- Interactive entertainment was far more popular than information (possibly limited by the site).
- Most appealing to teenagers and older people.
- Many people want to talk to the kiosk.
4. Speaker Detection for Kiosk
Rehg, Murphy, and Fieguth, CVPR 99
- Basic question for the kiosk: Is anyone speaking to me?
- Speakers face the kiosk, move their lips, and generate speech.
- Visual features: frontal faces, mouth motion energy
- Audio features: acoustic energy
- Bayesian network models
  - Encode coupling between noisy sensors
  - Represent context from the kiosk interface
5. Frontal Face Subnet
[Figure: Frontal face subnet. Hidden states: Frontal, Visible, Skin. Sensor outputs: neural-net (CMU) face detector and color-based skin detector.]
6. Speech Production Subnet
[Figure: Speech production subnet with a mouth motion sensor.]
7. Speaker Detection Network
[Figure: Full speaker detection network. Hidden states include Speaker, Near kiosk, Facing display, Talking, Speaking, Speech, Frontal, Visible, Sound, Skin, Texture, and Mouth; Kiosk I/O provides context. Sensor outputs: neural-net (CMU) face detector, texture-based face detector, color-based skin detector, mouth motion energy, and audio energy.]
- CPTs are learned from fully-labeled data (a sketch of this step follows below).
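Since every node in the network is observed in the training data, each CPT can be estimated by counting. Below is a minimal Python sketch of that step; the variable names, the Laplace smoothing constant, and the random toy data are illustrative assumptions, not the deck's actual implementation.

```python
import numpy as np

def learn_cpt(child, parents, data, card, alpha=1.0):
    """Estimate P(child | parents) from fully-labeled discrete data by
    counting, with Laplace pseudo-counts alpha for smoothing.

    data: dict mapping variable name -> integer array of shape (N,)
    card: dict mapping variable name -> number of discrete states
    Returns an array of shape (card[p1], ..., card[pk], card[child]).
    """
    shape = [card[p] for p in parents] + [card[child]]
    counts = np.full(shape, alpha)
    for n in range(len(data[child])):
        idx = tuple(data[p][n] for p in parents) + (data[child][n],)
        counts[idx] += 1
    return counts / counts.sum(axis=-1, keepdims=True)

# Hypothetical toy family: binary Speaking node with parents Frontal, Speech.
rng = np.random.default_rng(0)
data = {v: rng.integers(0, 2, size=200) for v in ("frontal", "speech", "speaking")}
cpt = learn_cpt("speaking", ["frontal", "speech"], data,
                card={"frontal": 2, "speech": 2, "speaking": 2})
print(cpt[1, 1])  # P(speaking | frontal=1, speech=1)
```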
8. Speaker Detection Results
[Figure: Ground truth and estimated speaker state; experimental setup with the subject, a secondary person, and the kiosk.]
9. Observations
- Sporadic errors should be improved through the use of a dynamic model.
- Specification of hidden states is inherently ambiguous
  - Hard to evaluate subnet performance
- Experiment degenerated into a spontaneous 3-way conversation
  - Hard to determine the true speaker state
- Turn-taking provides powerful context.
10. Dynamic Bayesian Networks
Garg, Pavlovic, and Rehg, FG 00
- Add temporal dependencies between hidden states over time.
- Within each time slice, the static Bayes net is replicated.
[Figure: Two-slice DBN; the Speaking, Frontal, and Speech nodes at time t-1 are linked to their counterparts at time t.]
11. Learning DBN Models
- This is a discrete HMM with a factored output density.
- Since the network is fully-labeled, exact learning is feasible: the transition matrix and emission CPTs reduce to normalized counts (a filtering sketch for this model follows below).
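A minimal filtering sketch for this model, assuming a two-state hidden variable and two binary sensor streams; the numbers and the `forward_filter` helper are hypothetical, but the factored emission product and the forward recursion follow the standard HMM construction, and with fully-labeled sequences `pi`, `A`, and `emis` would come directly from normalized counts.

```python
import numpy as np

def forward_filter(obs, pi, A, emis):
    """Forward pass for a discrete HMM whose output density factors into
    independent observation streams given the hidden state.

    obs:  list of T tuples, one discrete symbol per stream
    pi:   (S,) initial state distribution
    A:    (S, S) transition matrix, A[i, j] = P(s_t = j | s_{t-1} = i)
    emis: list of (S, V_k) emission CPTs, one per stream
    Returns filtered posteriors, shape (T, S).
    """
    alpha = np.zeros((len(obs), len(pi)))
    for t, symbols in enumerate(obs):
        lik = np.ones(len(pi))
        for k, o in enumerate(symbols):   # factored output density
            lik *= emis[k][:, o]
        pred = pi if t == 0 else alpha[t - 1] @ A
        alpha[t] = pred * lik
        alpha[t] /= alpha[t].sum()
    return alpha

# Hypothetical 2-state example (not speaking / speaking) with two binary
# sensor streams (audio energy, mouth motion energy).
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
emis = [np.array([[0.8, 0.2], [0.3, 0.7]]),   # audio stream CPT
        np.array([[0.7, 0.3], [0.4, 0.6]])]   # mouth-motion stream CPT
print(forward_filter([(1, 1), (1, 0), (0, 0)], pi, A, emis))
```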
12. Experimental Results
[Figure: Ground truth compared with estimates from the static BN, dynamic BN, and duration-density (DD) DBN.]
13. Duration Density DBN
- Added a duration model to the temporal arcs.
- The learned duration distribution deviates from the exponential (geometric) decay that a standard HMM's fixed self-transition probability implies; the toy sketch below illustrates the difference.
- The additional complexity of inference is a barrier.
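To see why the explicit duration model matters, compare the geometric decay implied by a fixed self-transition probability with a duration histogram estimated from labeled segments. The self-transition value and segment lengths below are made up for illustration.

```python
import numpy as np

p = 0.9                                  # assumed self-transition probability
d = np.arange(1, 31)
geometric = (1 - p) * p ** (d - 1)       # P(duration = d) under a plain HMM

# Duration histogram from labeled state segments (toy frame counts).
segments = np.array([8, 9, 10, 10, 11, 12, 12, 13, 14, 15])
hist = np.bincount(segments, minlength=31)[1:31].astype(float)
hist /= hist.sum()

print("mode of geometric model:", d[np.argmax(geometric)])  # always d = 1
print("mode of learned model:  ", d[np.argmax(hist)])       # peaked away from 1
```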
14. Experimental Results
15. Parameter Learning
- Review the steps in parameter learning for the factored HMM model.
- What is going wrong here?
16. Bayesian Networks for Classification
- Advantages
  - Ease of representing knowledge and constraints from the task domain
  - Efficient algorithms for inference and learning
  - Generality
- Disadvantages
  - Unsupervised learning may yield suboptimal classifier performance
17. Issues with Bayes Net Classifiers
- Show an example of naïve Bayes vs. richer feature structure (structure matters).
- Factorization of the density for parameter estimation.
- The standard learner does not maximize the conditional likelihood (see the numeric illustration below).
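A small numeric illustration of the last point, on assumed toy data: when a feature is duplicated (so the naïve Bayes factorization is wrong), the parameters that maximize the joint likelihood, i.e. the usual counting estimates, are not the parameters that maximize the conditional likelihood of the label.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2000
y = rng.integers(0, 2, N)
# One informative binary feature, duplicated so the naive Bayes
# independence assumption is violated.
x = (rng.random(N) < np.where(y == 1, 0.8, 0.3)).astype(int)
X = np.stack([x, x], axis=1)

def joint_ll(q1, q0):
    """Joint log-likelihood of (X, y) under naive Bayes with
    P(x_k=1 | y=1) = q1, P(x_k=1 | y=0) = q0, uniform class prior."""
    px = np.where(y[:, None] == 1, q1, q0)
    return (np.log(np.where(X == 1, px, 1 - px)).sum(1) + np.log(0.5)).sum()

def cond_ll(q1, q0):
    """Conditional log-likelihood of y given X under the same model."""
    l1 = np.where(X == 1, q1, 1 - q1).prod(1)
    l0 = np.where(X == 1, q0, 1 - q0).prod(1)
    post1 = l1 / (l1 + l0)
    return np.log(np.where(y == 1, post1, 1 - post1)).sum()

qs = np.linspace(0.05, 0.95, 91)
grid = [(joint_ll(a, b), cond_ll(a, b), a, b) for a in qs for b in qs]
print("joint-LL optimum (ML counts):", max(grid)[2:])                      # ~ (0.80, 0.30)
print("conditional-LL optimum:      ", max(grid, key=lambda t: t[1])[2:])  # less extreme
```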
18. Boosted Parameter Learning
- Apply AdaBoost to Bayes net parameter learning (a sketch follows below).
- Improved classifier performance.
- Costs only a constant multiple of the base training cost.
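A sketch of the idea, assuming a naïve Bayes weak learner over binary features: AdaBoost reweights the training examples, and the weighted ML fit just replaces raw counts with weighted sums, so each round costs roughly one ordinary training pass (the constant-multiple claim above). Function names and the toy usage are hypothetical.

```python
import numpy as np

def weighted_nb_fit(X, y, w, alpha=1.0):
    """Weighted-ML naive Bayes for binary features/labels: the example
    weights from boosting turn CPT counts into weighted sums."""
    params, prior = {}, np.zeros(2)
    for c in (0, 1):
        wc = w[y == c]
        params[c] = ((wc[:, None] * X[y == c]).sum(0) + alpha) / (wc.sum() + 2 * alpha)
        prior[c] = wc.sum()
    return prior / prior.sum(), params

def nb_predict(X, prior, params):
    ll = np.zeros((len(X), 2))
    for c in (0, 1):
        p = params[c]
        ll[:, c] = np.log(prior[c]) + (X * np.log(p) + (1 - X) * np.log(1 - p)).sum(1)
    return ll.argmax(1)

def adaboost_nb(X, y, rounds=10):
    """AdaBoost.M1 with weighted-ML naive Bayes as the weak learner."""
    w = np.ones(len(y)) / len(y)
    ensemble = []
    for _ in range(rounds):
        prior, params = weighted_nb_fit(X, y, w)
        pred = nb_predict(X, prior, params)
        err = w[pred != y].sum()
        if err >= 0.5:
            break
        beta = max(err, 1e-10) / (1 - err)
        ensemble.append((np.log(1 / beta), prior, params))
        if err == 0:
            break
        w *= np.where(pred == y, beta, 1.0)   # down-weight correct examples
        w /= w.sum()
    return ensemble

def ensemble_predict(X, ensemble):
    votes = np.zeros((len(X), 2))
    for weight, prior, params in ensemble:
        votes[np.arange(len(X)), nb_predict(X, prior, params)] += weight
    return votes.argmax(1)

# Toy usage on synthetic binary data.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, (500, 4))
y = (X.sum(1) >= 2).astype(int)
print("train accuracy:", (ensemble_predict(X, adaboost_nb(X, y, 25)) == y).mean())
```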
19. Genie Casino Experiment
- Single-player blackjack game vs. multiple agents.
- Game context governs speech with the dealer.
- Record frequency-encoded game state simultaneously with the audio-visual input.
20. Speaker Detection Results
Garg, Pavlovic, Rehg, and Huang, CVPR 00
21. Comparison
22. Overall Results
23. Boosting Analysis
- Post-hoc analysis of boosting as adjusting the counts in the histograms associated with the measurement CPTs and the transition matrix.
24. Structure Learning
- Describe the basic problem.
- K2 algorithm (sketched below)
  - Closed-form expression for marginalizing over the probability of the parameters
  - Node ordering and a limited number of parents for the greedy heuristic
- Draw graphs showing the effect of structure learning and the MCMC version on static BN examples.
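A compact sketch of K2 under exactly those two restrictions, using the Cooper-Herskovits closed-form family score with a uniform parameter prior. Here `data` is a dict of integer arrays, `card` maps each variable to its number of states, and the `max_parents` limit and helper names are assumptions of the sketch.

```python
import numpy as np
from itertools import product
from scipy.special import gammaln

def k2_score(child, parents, data, card):
    """Log Cooper-Herskovits score of one family: the CPT parameters are
    marginalized out in closed form under a uniform Dirichlet prior."""
    r = card[child]
    score = 0.0
    for config in product(*(range(card[p]) for p in parents)):
        mask = np.ones(len(data[child]), bool)
        for p, v in zip(parents, config):
            mask &= data[p] == v
        counts = np.bincount(data[child][mask], minlength=r)
        score += gammaln(r) - gammaln(counts.sum() + r)
        score += gammaln(counts + 1).sum()
    return score

def k2(order, data, card, max_parents=2):
    """Greedy K2: respecting the fixed node ordering, repeatedly add the
    single parent that most improves the family score, else stop."""
    parents = {v: [] for v in order}
    for i, v in enumerate(order):
        best = k2_score(v, parents[v], data, card)
        while len(parents[v]) < max_parents:
            candidates = [c for c in order[:i] if c not in parents[v]]
            if not candidates:
                break
            s, c = max((k2_score(v, parents[v] + [c], data, card), c)
                       for c in candidates)
            if s <= best:
                break
            best = s
            parents[v].append(c)
    return parents
```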
25. Boosted Structure Learning
Joint work with Tanzeem Choudhury
- Structure learner: K2 plus MCMC search over node orderings (Friedman and Koller 00); a sketch of the search follows below.
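One way to realize that search, sketched as a Metropolis chain whose proposal swaps two positions in the ordering. The `score_fn` argument is an assumption: here it could be, say, the total family score of the network the K2 sketch above recovers under an ordering, whereas Friedman and Koller score an ordering by summing over parent sets.

```python
import numpy as np

def mcmc_orderings(variables, score_fn, steps=500, seed=0):
    """Metropolis search over node orderings: propose swapping two
    positions, accept with probability min(1, exp(score' - score)).
    The swap proposal is symmetric, so no Hastings correction is needed."""
    rng = np.random.default_rng(seed)
    order = list(variables)
    s = score_fn(order)
    best, best_s = list(order), s
    for _ in range(steps):
        i, j = rng.choice(len(order), size=2, replace=False)
        proposal = list(order)
        proposal[i], proposal[j] = proposal[j], proposal[i]
        s_new = score_fn(proposal)
        if np.log(rng.random()) < s_new - s:      # MH accept/reject
            order, s = proposal, s_new
            if s > best_s:
                best, best_s = list(order), s
    return best, best_s

# Hypothetical usage with the k2/k2_score sketch from the previous slide:
# score = lambda o: sum(k2_score(v, pa, data, card)
#                       for v, pa in k2(o, data, card).items())
# best_order, _ = mcmc_orderings(list(card), score)
```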
26. Experimental Results
27. Conclusions and Future Work
- Dynamic Bayesian network models are a powerful framework for cue fusion.
- Boosting can be used to construct a supervised DBN learner (for both parameters and structure).
- Future work
  - Use a richer set of audio-visual cues
  - Explore feature selection given labeled state data
  - Expand the context model
  - Integrate with a speech recognizer