Folie 1 - PowerPoint PPT Presentation

Provided by: mmk6
Transcript and Presenter's Notes

Title: Folie 1


1
(No Transcript)
2
Overview
  • Introduction: structuring of meetings
  • Audio-visual features
  • Models and experimental results
      • Segmentation using semantic features (BIC-like)
      • Multi-layer HMM
      • Multi-stream DBN
      • Noise robustness with a mixed-state DBN
  • Summary

4
Structuring of Meetings
  • Meetings can be modelled as a sequence of events,
    group actions, or individual actions from a set
    A = {A1, A2, A3, ..., AN}
  • Different sets represent different meeting views

[Figure: parallel meeting views along a common time axis. Person 1: Idle, Writing, Speaking, Idle. Group interest: Neutral, Low, High. Discussion phase: Presentation, Discussion, Monologue. Group task: Information sharing, Decision, Information sharing.]
5
Meeting Views
Further sets could represent the current agenda
item or the topic discussed, but may require
higher semantic knowledge for an automatic
analysis.
6
Action Lexicon
  • Action lexicon I: only single actions (8 action
    classes)
  • Action lexicon II: single actions and combinations
    of parallel actions (14 action classes)
7
M4 Meeting Corpus
  • Four participants
  • 59 videos
  • Each 5 minutes
  • Recorded at IDIAP
  • Three fixed cameras
  • Lapel microphones
  • Microphone array

8
System Overview
9
Visual Features
  • Global motion (for each position)
      • Centre of motion
      • Changes in motion (dynamics)
      • Mean absolute deviation
      • Intensity of motion
  • Skin colour blobs (GMM), for each person
      • Head: orientation and motion
      • Hands: position, size, orientation, and motion
  • Moving blobs from background subtraction
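
The slide names the global motion features but does not define them. The sketch below is one plausible reading, assuming the features are computed from the absolute difference of two consecutive grey-level frames. The function name `global_motion_features` and the exact formulas are assumptions for illustration, not taken from the presentation.

```python
def global_motion_features(prev, curr):
    """Return (centre_x, centre_y, mad_x, mad_y, intensity) for two frames.

    Frames are lists of rows of grey values. All definitions here are
    assumptions: the slide only lists the feature names.
    """
    rows, cols = len(curr), len(curr[0])
    # Difference image: absolute grey-level change per pixel.
    diff = [[abs(curr[y][x] - prev[y][x]) for x in range(cols)]
            for y in range(rows)]
    total = sum(sum(row) for row in diff)
    if total == 0:
        # No motion at all between the two frames.
        return (0.0, 0.0, 0.0, 0.0, 0.0)
    pixels = [(x, y) for y in range(rows) for x in range(cols)]
    # Centre of motion: mean pixel position weighted by the amount of change.
    cx = sum(x * diff[y][x] for x, y in pixels) / total
    cy = sum(y * diff[y][x] for x, y in pixels) / total
    # Mean absolute deviation of the motion around its centre.
    mad_x = sum(abs(x - cx) * diff[y][x] for x, y in pixels) / total
    mad_y = sum(abs(y - cy) * diff[y][x] for x, y in pixels) / total
    # Intensity of motion: average change per pixel.
    intensity = total / (rows * cols)
    return (cx, cy, mad_x, mad_y, intensity)
```

Under the same assumptions, the "changes in motion (dynamics)" feature could be taken as the frame-to-frame difference of the centre of motion.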

10
Audio Features
  • Lapel microphones (for each person, for each speech segment)
      • Energy
      • Pitch (SIFT algorithm)
      • Speaking rate (combination of estimators)
      • MFCCs (mel-frequency cepstral coefficients)
  • Microphone array (for each position: each of the
    4 seats, the whiteboard, and the projector screen)
      • SRP-PHAT measure to estimate speech activity
      • Speech and silence segmentation

11
Lexical Features
For each person
  • Speech transcription (or output of ASR)
  • Gesture transcription (or output of an automatic
    gesture recognizer)
  • Gesture inventory
      • Writing
      • Pointing
      • Standing up
      • Sitting down
      • Nodding
      • Shaking head

12
BIC-like approach
  • Strategy similar to the Bayesian information
    criterion (BIC)
  • Segmentation and classification in one step
  • Two windows of variable length are shifted over
    the time scale
  • The inner border is shifted from left to right
  • Each window is classified: if the results for the
    left and right windows differ, the inner border is
    considered a boundary of a meeting event
  • If no boundary is detected, the whole window is
    enlarged
  • Repeat until the right border reaches the end of
    the meeting
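
The window procedure above can be sketched roughly as follows. This is a minimal illustration, not the presenters' implementation. The classifier is a hypothetical stand-in that returns the majority label of a window, and the rule for choosing among several candidate borders (keep the one with the highest combined agreement score) is an assumption. The names `classify`, `segment`, `min_len`, and `step` are likewise invented for this example.

```python
from collections import Counter

def classify(window):
    """Hypothetical stand-in classifier: majority label and its frame count."""
    return Counter(window).most_common(1)[0]

def segment(frames, min_len=2, step=2):
    """Return boundary indices found by the two-window scheme."""
    n = len(frames)
    boundaries = []
    left, right = 0, min(n, 2 * min_len)
    while right - left >= 2 * min_len:
        best = None
        # Shift the inner border from left to right inside [left, right).
        for border in range(left + min_len, right - min_len + 1):
            l_label, l_score = classify(frames[left:border])
            r_label, r_score = classify(frames[border:right])
            # Differing results mark the inner border as a candidate boundary.
            # Assumption: among candidates, keep the best-scoring one.
            if l_label != r_label:
                if best is None or l_score + r_score > best[0]:
                    best = (l_score + r_score, border)
        if best is not None:
            boundaries.append(best[1])
            # Restart with a fresh window that begins at the new boundary.
            left = best[1]
            right = min(n, left + 2 * min_len)
        elif right < n:
            # No boundary detected: enlarge the whole window.
            right = min(n, right + step)
        else:
            break  # the right border has reached the end of the meeting
    return boundaries
```

In a real system the frame labels would be replaced by feature vectors and `classify` by a trained model returning a class and a likelihood. The loop structure would stay the same.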

13
BIC-like approach
Experimental setup: 57 meetings using Lexicon 1
14
Multi-layer Hidden Markov Model
  • Individual action layer (I-HMM)
      • Speaking, Writing, Idle
  • Group action layer (G-HMM)
      • Discussion, Monologue, ...
  • Each layer trained independently
  • Simultaneous segmentation and recognition
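
A toy version of the two-layer idea is sketched below. Everything numeric is an assumption: the probabilities are set by hand rather than trained, the observation symbols (`audio`, `pen`, `none`) are hypothetical, and feeding the number of speaking participants into the group layer is only one possible interface between the layers. The slides do not specify how the I-HMM output is passed to the G-HMM.

```python
import math

def safe_log(p):
    return math.log(p) if p > 0 else float("-inf")

def viterbi(obs, states, start, trans, emit):
    """Most likely state sequence for discrete observations (log domain)."""
    score = {s: safe_log(start[s]) + safe_log(emit[s][obs[0]]) for s in states}
    back = []
    for o in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + safe_log(trans[r][s]))
            score[s] = prev[best] + safe_log(trans[best][s]) + safe_log(emit[s][o])
            ptr[s] = best
        back.append(ptr)
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

def peaked(keys, targets, high):
    """Hypothetical helper: probability `high` on matching positions."""
    low = (1.0 - high) / (len(targets) - 1)
    return {k: {t: (high if i == j else low) for j, t in enumerate(targets)}
            for i, k in enumerate(keys)}

# Individual layer: one person-independent model shared by all participants.
I_STATES = ["Speaking", "Writing", "Idle"]
I_START = {s: 1.0 / 3 for s in I_STATES}
I_TRANS = peaked(I_STATES, I_STATES, 0.8)
I_EMIT = peaked(I_STATES, ["audio", "pen", "none"], 0.8)

# Group layer: observes how many participants speak (capped at 2).
G_STATES = ["Monologue", "Discussion"]
G_START = {s: 0.5 for s in G_STATES}
G_TRANS = peaked(G_STATES, G_STATES, 0.9)
G_EMIT = {"Monologue": {0: 0.1, 1: 0.8, 2: 0.1},
          "Discussion": {0: 0.1, 1: 0.3, 2: 0.6}}

def recognise(streams):
    """Decode individual actions per person, then the group action."""
    individual = [viterbi(s, I_STATES, I_START, I_TRANS, I_EMIT) for s in streams]
    counts = [min(2, sum(p[t] == "Speaking" for p in individual))
              for t in range(len(streams[0]))]
    return individual, viterbi(counts, G_STATES, G_START, G_TRANS, G_EMIT)
```

Because the group layer only sees the decoded individual actions, the two layers can be swapped or retrained separately, which matches the independent training mentioned on the slide.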

15
Multi-layer Hidden Markov Model
  • Smaller observation space -> stable for limited
    training data
  • I-HMMs are person independent -> much more
    training data from different persons available
  • G-HMM less sensitive to small changes in the
    low-level features
  • Two layers are trained independently -> different
    HMM combinations can be explored

16
Multi-layer Hidden Markov Model
Experimental setup: 59 meetings using Lexicon 2
17
Multi-stream DBN Model
  • Counter structure
      • Improves state-duration modelling
      • Action counter C is incremented after each
        action transition
  • Hierarchical approach
      • State-space decomposition
      • Two feature-related sub-action nodes (S1 and S2)
      • Each sub-state corresponds to a cluster of
        feature vectors
      • Unsupervised training
  • Independent modality processing
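
As a small illustration of the counter node only, the snippet below shows how a counter C would evolve along a given frame-level action sequence. It is a deterministic sketch. In the DBN itself the counter is a random variable inside the network, and the function name `action_counter` is an assumption.

```python
def action_counter(actions):
    """Value of counter C per frame: incremented after each action transition."""
    counter, values = 0, []
    for i, action in enumerate(actions):
        # A transition happens whenever the action differs from the previous frame.
        if i > 0 and action != actions[i - 1]:
            counter += 1
        values.append(counter)
    return values
```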

18
Experimental Results
Experimental setup: cross-validation over 53
meetings using Lexicon 1
19
Mixed-State DBN
  • Couple a multi-stream HMM with a linear dynamical
    system (LDS)
  • The HMM is the driving input for the LDS
  • With the HMM as driving input, disturbances can be
    compensated
  • The system can be described as a DBN
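
The coupling can be illustrated with a scalar toy model, assuming the usual first-order form x_t = a * x_(t-1) + b * u(s_t) + noise, where u(s_t) is a drive value chosen by the current HMM state. The parameter values, the `drive` table, and the function name `simulate_lds` are assumptions. The slides do not give the actual system equations.

```python
import random

def simulate_lds(states, drive, a=0.5, b=0.5, x0=0.0, noise_std=0.0, seed=0):
    """Simulate a scalar LDS whose input is selected by a discrete state sequence.

    states: HMM state label per frame.
    drive:  hypothetical mapping from state label to input value u.
    """
    rng = random.Random(seed)
    x, trajectory = x0, []
    for s in states:
        # The discrete state drives the continuous state; noise is optional.
        x = a * x + b * drive[s] + rng.gauss(0.0, noise_std)
        trajectory.append(x)
    return trajectory
```

With `noise_std=0.0` the trajectory is deterministic and moves smoothly toward the drive value of the active state. A single noisy frame therefore shifts the continuous state only partially.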

20
Mixed-State DBN
Experimental setup: 59 meetings using Lexicon 1
21
Summary
  • Joint effort of TUM, IDIAP, and UEDIN
  • Modelling meetings as a sequence of individual
    and group actions
  • Large number of audio-visual features
  • Four different group action modelling frameworks
      • Algorithm based on BIC
      • Multi-layer HMM
      • Multi-stream DBN
      • Mixed-state DBN
  • Good performance of the proposed frameworks
  • Future work
      • Room for further improvement, both in the
        feature domain and in the model structures
      • Investigate combinations of the proposed models