AudioVisual Speaker Detection Using Boosted Bayesian Networks

1
Audio-Visual Speaker Detection Using Boosted
Bayesian Networks
  • Jim Rehg
  • College of Computing
  • Georgia Institute of Technology
  • Joint work with Vladimir Pavlovic, Ashutosh Garg,
    Tanzeem Choudhury, Kevin Murphy

2
Smart Kiosk (DEC/Compaq)
  • Active
  • Public
  • Multi-user

  • Initiate interaction
  • Tailor content
  • Kiosk in user's world
  • Context must be inferred
  • Communicate focus of attention
  • Arbitrate requests

3
Lessons From Cybersmith
  • Content-authoring and remote maintenance were
    critical to successful deployment.
  • Content is king, but also costly and time-consuming.
  • Interactive entertainment was far more popular
    than information (possibly limited by site)
  • Most appealing to teenagers and older people.
  • Many people want to talk to the kiosk.

4
Speaker Detection for Kiosk
Rehg, Murphy, Fieguth, CVPR 99
  • Basic question for kiosk: Is anyone speaking to
    me?
  • Speakers face the kiosk, move their lips, and
    generate speech
  • Visual features: frontal faces, mouth motion
    energy
  • Audio features: acoustic energy
  • Bayesian network models
      • Encode coupling between noisy sensors
      • Represent context from kiosk interface
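
As a rough illustration of how such a network fuses noisy sensors, the sketch below computes the posterior that someone is speaking from three binary sensor readings. It is a simplified stand-in, not the network from the talk: it assumes the sensors are conditionally independent given a single Speaking node, and every probability in it is a made-up value chosen for illustration.

```python
# Hypothetical numbers for illustration only; none come from the talk.
PRIOR_SPEAKING = 0.3

# For each sensor: (P(fires | speaking), P(fires | not speaking)).
SENSOR_CPT = {
    "frontal_face": (0.9, 0.4),
    "mouth_motion": (0.8, 0.2),
    "audio_energy": (0.85, 0.3),
}


def posterior_speaking(observations):
    """Return P(speaking | observations) for binary sensor readings.

    observations maps a sensor name to True (fired) or False (silent).
    Sensors that are not mentioned are simply left out of the product.
    """
    p_true = PRIOR_SPEAKING
    p_false = 1.0 - PRIOR_SPEAKING
    for name, fired in observations.items():
        p_fire_true, p_fire_false = SENSOR_CPT[name]
        p_true *= p_fire_true if fired else 1.0 - p_fire_true
        p_false *= p_fire_false if fired else 1.0 - p_fire_false
    # Bayes' rule: normalize over the two values of the hidden state.
    return p_true / (p_true + p_false)
```

Calling `posterior_speaking({"frontal_face": True, "mouth_motion": True, "audio_energy": True})` combines all three cues; dropping a key shows how the belief changes when a sensor is unavailable.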

5
Frontal Face Subnet
[Figure: network diagram. Hidden states: Visible, Frontal. Sensor outputs: Skin (Color-based Skin Detector), Neural Net (CMU Face Detector).]
6
Speech Production Subnet
Mouth Motion Sensor
7
Speaker Detection Network
Speaker: near kiosk, facing display, talking
[Figure: network diagram. Nodes: Speaking, Speech, Frontal, Visible, Kiosk I/O, Sound, Skin, Texture, Neural Net, Mouth. Sensors: Mouth Motion Energy, Color-based Skin Detector, Texture-based Face Detector, CMU Face Detector, Audio Energy.]
CPTs learned from fully-labeled data
8
Speaker Detection Results
[Figure: speaker detection results, Ground truth vs. Estimate.]
[Figure: experimental setup. Labels: subject, secondary, kiosk.]
9
Observations
  • Sporadic errors should be improved through the
    use of a dynamic model.
  • Specification of hidden states is inherently
    ambiguous
  • Hard to evaluate subnet performance
  • Experiment degenerated into spontaneous 3-way
    conversation
  • Hard to determine true speaker state
  • Turn-taking provides powerful context.

10
Dynamic Bayesian Networks
Garg, Pavlovic, Rehg, FG 00
  • Add temporal dependencies between hidden states
    over time.
  • Within each time-slice the static Bayes net is
    replicated.

[Figure: two time slices, Time t-1 and Time t, each containing the nodes Speaking, Frontal, and Speech.]
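
To make the effect of the temporal arcs concrete, here is a minimal forward-filtering sketch for a single binary hidden state. It assumes the per-frame sensor evidence has already been reduced to a likelihood pair, and the function name, argument layout, and any numbers passed in are illustrative assumptions rather than details from the talk.

```python
def forward_filter(prior, transition, likelihoods):
    """Filter a binary hidden state (0 = silent, 1 = speaking).

    prior:       belief before the first frame, [P(s = 0), P(s = 1)]
    transition:  transition[i][j] = P(s_t = j | s_{t-1} = i)
    likelihoods: one [P(obs_t | s_t = 0), P(obs_t | s_t = 1)] pair per frame
    Returns the filtered P(s_t = 1 | obs_1..t) for every frame.
    """
    belief = list(prior)
    filtered = []
    for like in likelihoods:
        # Predict: push the previous belief through the temporal arcs.
        predicted = [
            sum(belief[i] * transition[i][j] for i in range(2))
            for j in range(2)
        ]
        # Update: weight by this frame's sensor likelihood, then normalize.
        unnorm = [predicted[j] * like[j] for j in range(2)]
        total = sum(unnorm)
        belief = [u / total for u in unnorm]
        filtered.append(belief[1])
    return filtered
```

With a "sticky" transition matrix such as `[[0.9, 0.1], [0.1, 0.9]]`, a single frame whose evidence mildly favors silence does not flip the estimate after several frames of speech, which is the kind of sporadic error a static network cannot suppress.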
11
Learning DBN Models
  • This is a discrete HMM with a factored output
    density.
  • Since the network is fully-labeled, exact
    learning is feasible.
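
Because every hidden state is labeled, maximum-likelihood learning reduces to normalized counting. The sketch below estimates a transition matrix and one CPT per sensor from a labeled sequence, which is what a factored output density amounts to for discrete sensors. The function name, the data layout, and the additive smoothing constant `alpha` are assumptions introduced for this example.

```python
def learn_by_counting(states, observations, n_states=2, n_values=2, alpha=1.0):
    """Estimate DBN parameters from a fully labeled sequence.

    states:       list of hidden-state labels, one per frame
    observations: list of tuples, one discrete reading per sensor per frame
    alpha:        additive (Laplace) smoothing count, an assumption here
    Returns (transition, cpts) where
      transition[i][j] = P(s_t = j | s_{t-1} = i)
      cpts[k][s][v]    = P(sensor k reads v | state s)
    """
    # Transition matrix: count consecutive label pairs, then normalize rows.
    trans = [[alpha] * n_states for _ in range(n_states)]
    for prev, cur in zip(states, states[1:]):
        trans[prev][cur] += 1
    trans = [[c / sum(row) for c in row] for row in trans]

    # Factored output density: an independent CPT for each sensor.
    cpts = []
    for k in range(len(observations[0])):
        counts = [[alpha] * n_values for _ in range(n_states)]
        for s, obs in zip(states, observations):
            counts[s][obs[k]] += 1
        cpts.append([[c / sum(row) for c in row] for row in counts])
    return trans, cpts
```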

12
Experimental Results
[Figure: results plot with traces for Ground Truth, Static BN, Dynamic BN, and DD DBN.]
13
Duration Density DBN
  • Added duration model to temporal arcs
  • Learned distribution deviates from exponential
  • Additional complexity of inference is a barrier
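
One way to see why a duration model can help: a first-order Markov chain implicitly assumes that the time spent in a state follows a geometric distribution, the discrete-time analogue of the exponential. The sketch below extracts empirical run lengths from a label sequence so they can be compared against that implied distribution. The helper names are hypothetical and not taken from the talk.

```python
from itertools import groupby


def run_lengths(states, value):
    """Lengths of the consecutive runs of `value` in a label sequence."""
    return [len(list(group)) for key, group in groupby(states) if key == value]


def geometric_pmf(self_transition, duration):
    """P(run lasts exactly `duration` frames) under a first-order Markov chain.

    self_transition is P(s_t = s | s_{t-1} = s); duration starts at 1.
    """
    return self_transition ** (duration - 1) * (1.0 - self_transition)
```

Plotting a histogram of `run_lengths(labels, 1)` next to `geometric_pmf(a, d)` for the learned self-transition probability `a` would show how far the observed speaking durations depart from the Markov assumption.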

14
Experimental Results
15
Parameter Learning
  • Review steps in parameter learning for factored
    HMM model.
  • What is going wrong here?

16
Bayesian Networks for Classification
  • Advantages
      • Ease of representing knowledge and constraints
        from task domain.
      • Efficient algorithms for inference and learning
      • Generality
  • Disadvantages
      • Unsupervised learning may yield suboptimal
        classifier performance

17
Issues with Bayes Net Classifiers
  • Show example of naïve Bayes vs. feature structure
    (structure matters).
  • Factorization of density for parameter estimation
  • Standard learner does not maximize the
    conditional likelihood
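
The last point can be written out with a standard decomposition of the joint log-likelihood. The notation is an assumption introduced here, not taken from the slides: c_n is the class label (speaker state) of training example n, x_n its feature vector, and θ the network parameters.

```latex
\begin{align*}
\mathrm{LL}(\theta)
  &= \sum_{n=1}^{N} \log P_\theta(c_n, x_n) \\
  &= \underbrace{\sum_{n=1}^{N} \log P_\theta(c_n \mid x_n)}_{\text{conditional log-likelihood}}
   + \underbrace{\sum_{n=1}^{N} \log P_\theta(x_n)}_{\text{fit to the features}}
\end{align*}
```

Only the first term measures classification quality. A learner that maximizes the whole sum may trade it away to improve the second term, which tends to dominate when there are many features.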

18
Boosted Parameter Learning
  • Apply AdaBoost to Bayes net parameter learning
  • Improved classifier performance
  • Constant multiple of training cost.
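
A minimal sketch of the idea, under strong simplifying assumptions: the base learner below is a naive Bayes classifier over binary features rather than the DBN from the talk, labels are coded as -1/+1, and the boosting loop is textbook discrete AdaBoost. All function names and the smoothing constant are hypothetical. The point it illustrates is that boosting enters the learner only through weighted counts.

```python
import math


def fit_weighted_nb(X, y, w, alpha=1e-3):
    """Naive Bayes from weighted counts; labels in {-1, +1}, binary features."""
    n_features = len(X[0])
    model = {}
    for label in (-1, 1):
        idx = [i for i, yi in enumerate(y) if yi == label]
        mass = sum(w[i] for i in idx)  # weighted class count
        fire = [sum(w[i] for i in idx if X[i][k] == 1) for k in range(n_features)]
        p_fire = [(f + alpha) / (mass + 2 * alpha) for f in fire]
        model[label] = (mass + alpha, p_fire)
    return model


def predict_nb(model, x):
    """Return the label with the larger unnormalized log posterior."""
    scores = {}
    for label, (mass, p_fire) in model.items():
        score = math.log(mass)
        for xk, p in zip(x, p_fire):
            score += math.log(p if xk == 1 else 1.0 - p)
        scores[label] = score
    return 1 if scores[1] >= scores[-1] else -1


def adaboost_nb(X, y, rounds=10):
    """Discrete AdaBoost; each round refits the counts under new weights."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        model = fit_weighted_nb(X, y, w)
        preds = [predict_nb(model, x) for x in X]
        err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
        if err >= 0.5:  # no better than chance: stop adding learners
            break
        perfect = err < 1e-10
        err = max(err, 1e-10)  # avoid division by zero below
        beta = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((beta, model))
        if perfect:
            break
        # Up-weight the examples this round got wrong, then renormalize.
        w = [wi * math.exp(-beta * yi * p) for wi, yi, p in zip(w, y, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict_ensemble(ensemble, x):
    """Weighted vote of the boosted classifiers."""
    vote = sum(beta * predict_nb(model, x) for beta, model in ensemble)
    return 1 if vote >= 0 else -1
```

Each round costs one pass of weighted counting plus one pass of prediction, so the total cost is the number of rounds times the cost of training the base learner once.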

19
Genie Casino Experiment
  • Single-player blackjack game vs. multiple agents.
  • Game context governs speech with dealer.
  • Record frequency-encoded game state
    simultaneously with audiovisual input.

20
Speaker Detection Results
Garg, Pavlovic, Rehg, Huang, CVPR 00
21
Comparison
22
Overall Results
23
Boosting Analysis
  • Post hoc analysis of boosting as adjusting the
    counts in the histograms associated with the
    measurement CPTs and the transition matrix.

24
Structure Learning
  • Describe basic problem
  • K2 algorithm
  • Closed form expression for marginalizing over
    probability of parameters
  • Node ordering and limited parents for greedy
    heuristic.
  • Draw graphs showing effect of structure learning
    and MCMC version on static BN examples.
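
For reference, the K2 score and its greedy parent search can be sketched as follows. The score is the closed form obtained by marginalizing over the CPT parameters under a uniform prior, computed in log space. The function names, the row-tuple data layout, and the default parent limit are assumptions made for this example.

```python
import math
from collections import defaultdict


def k2_log_score(data, child, parents, arity):
    """Log K2 score of `child` given `parents` on complete discrete data.

    data:  list of rows, each a tuple of integer values, one per variable
    arity: arity[v] = number of values variable v can take
    """
    r = arity[child]
    counts = defaultdict(lambda: [0] * r)
    for row in data:
        counts[tuple(row[p] for p in parents)][row[child]] += 1
    score = 0.0
    # Parent configurations that never occur contribute exactly zero.
    for child_counts in counts.values():
        n_ij = sum(child_counts)
        # log[(r - 1)! / (n_ij + r - 1)!] + sum_k log(n_ijk!)
        score += math.lgamma(r) - math.lgamma(n_ij + r)
        score += sum(math.lgamma(n + 1) for n in child_counts)
    return score


def k2_parents(data, child, predecessors, arity, max_parents=2):
    """Greedily add the predecessor that most improves the K2 score."""
    parents = []
    best = k2_log_score(data, child, parents, arity)
    while len(parents) < max_parents:
        candidates = [p for p in predecessors if p not in parents]
        if not candidates:
            break
        top_score, top = max(
            (k2_log_score(data, child, parents + [p], arity), p)
            for p in candidates
        )
        if top_score <= best:  # no candidate helps: stop
            break
        parents.append(top)
        best = top_score
    return parents
```

The search visits each node in the given ordering and offers it only the nodes that precede it as candidate parents, which is why the result depends on the ordering.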

25
Boosted Structure Learning
Joint with Tanzeem Choudhury
  • Structure learner: K2, with MCMC search over node
    ordering (Koller and Friedman 00)
26
Experimental Results
27
Conclusions and Future Work
  • Dynamic Bayesian network models are a powerful
    framework for cue fusion.
  • Boosting can be used to construct a supervised
    DBN learner (both parameters and structure).
  • Future work
      • Use a richer set of audio-visual cues
      • Explore feature selection given labeled state
        data
      • Expand context model
      • Integrate with speech recognizer