Folie 1 - PowerPoint PPT Presentation

About This Presentation

Title:

Folie 1

Description:

Three fixed cameras. Lapel microphones. Microphone array. Four ... Multi-layer Hidden Markov Model. Experimental setup: 59 meetings using 'Lexicon 2' 15.1 ...

Slides: 22

Provided by: mmk6

Transcript and Presenter's Notes

2
Overview

Introduction: structuring of meetings
Audio-visual features
Models and experimental results
  Segmentation using semantic features (BIC-like)
  Multi-layer HMM
  Multi-stream DBN
  Noise robustness with a mixed-state DBN
Summary

3
Structuring of Meetings

Meetings can be modelled as a sequence of events,
group actions, or individual actions drawn from a set
A = {A_1, A_2, A_3, ..., A_N}

4
Structuring of Meetings

Different sets represent different meeting views

[Figure: timeline of one meeting annotated in parallel views. Person 1: Idle, Writing, Speaking, Idle. Group interest: Neutral, Low, High. Discussion phase: Presentation, Discussion, Monologue. Group task: Information sharing, Decision, Information sharing. Horizontal axis: time.]
5
Meeting Views
Further sets could represent the current agenda
item or the topic discussed, but may require
higher semantic knowledge for an automatic
analysis.
6
Action Lexicon
Action lexicon I: only single actions (8 action classes)
Action lexicon II: single actions and combinations of parallel actions (14 action classes)
7
M4 Meeting Corpus

Four participants
59 videos
Each 5 minutes

Recorded at IDIAP
Three fixed cameras
Lapel microphones
Microphone array

8
System Overview
9
Visual Features

Global motion
  For each position:
    Centre of motion
    Changes in motion (dynamics)
    Mean absolute deviation
    Intensity of motion

Skin colour blobs (GMM)
  For each person:
    Head: orientation and motion
    Hands: position, size, orientation, and motion
    Moving blobs from background subtraction
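
The global-motion features can be illustrated with a minimal sketch. It assumes that motion is measured on a thresholded frame-difference image, which the slides do not state, and the function name, the threshold, and the frame layout (lists of rows of grey levels) are hypothetical.

```python
def global_motion_features(prev_frame, frame, threshold=10):
    """Centre, spread, and intensity of motion from two grey-level frames.

    Frames are equally sized lists of rows of integer pixel values.
    The threshold is an assumed noise floor, not a value from the slides.
    """
    # Difference image: absolute grey-level change per pixel, thresholded.
    diff = [
        [abs(a - b) if abs(a - b) > threshold else 0 for a, b in zip(r0, r1)]
        for r0, r1 in zip(prev_frame, frame)
    ]
    total = sum(sum(row) for row in diff)
    if total == 0:
        return {"centre": None, "mad": None, "intensity": 0.0}

    # Centre of motion: difference-weighted mean pixel position.
    cx = sum(x * v for row in diff for x, v in enumerate(row)) / total
    cy = sum(y * v for y, row in enumerate(diff) for v in row) / total

    # Mean absolute deviation of the motion around its centre (weighted).
    mad_x = sum(abs(x - cx) * v for row in diff for x, v in enumerate(row)) / total
    mad_y = sum(abs(y - cy) * v for y, row in enumerate(diff) for v in row) / total

    # Intensity of motion: mean difference value over the whole region.
    n_pixels = len(diff) * len(diff[0])
    return {"centre": (cx, cy), "mad": (mad_x, mad_y), "intensity": total / n_pixels}
```

Under the same assumptions, the "changes in motion" feature would be the frame-to-frame difference of the centre returned here, computed separately for each position of interest.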

10
Audio Features

Lapel microphones
  For each person and each speech segment:
    Energy
    Pitch (SIFT algorithm)
    Speaking rate (combination of estimators)
    MFC coefficients (MFCCs)

Microphone array
  For each position (each of the 4 seats, the whiteboard, and the projector screen):
    SRP-PHAT measure to estimate speech activity
    Speech and silence segmentation
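
How a per-frame activity measure becomes a speech/silence segmentation can be sketched as follows. This sketch swaps in a simple short-time energy threshold on a single channel in place of the SRP-PHAT measure used with the microphone array. The function names, frame length, and threshold are hypothetical.

```python
def frame_energy(samples, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def speech_segments(activity, threshold):
    """Turn a per-frame activity measure into (start, end) speech segments.

    `end` is exclusive; frames whose activity exceeds `threshold` count as speech.
    """
    segments, start = [], None
    for i, value in enumerate(activity):
        if value > threshold and start is None:
            start = i                      # speech begins
        elif value <= threshold and start is not None:
            segments.append((start, i))    # speech ends
            start = None
    if start is not None:                  # speech runs to the end of the signal
        segments.append((start, len(activity)))
    return segments
```

Any other activity measure, such as one value per seat from the array, could be passed to `speech_segments` in the same way.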

11
Lexical Features
For each person

Speech transcription (or output of ASR)
Gesture transcription (or output of an automatic
gesture recognizer)

Gesture inventory
Writing
Pointing
Standing up
Sitting down
Nodding
Shaking head

12
BIC-like approach

Strategy similar to the Bayesian information criterion (BIC)
Segmentation and classification in one step
Two adjacent windows of variable length are shifted along the time axis
The inner border between them is shifted from left to right
Each window is classified; if the results for the left and right windows differ, the inner border is taken as a boundary of a meeting event
If no boundary is detected, the whole window is enlarged
Repeat until the right border reaches the end of the meeting
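
The window procedure above can be sketched as follows. This is a minimal illustration, not the authors' implementation. The majority-vote classifier over frame labels is a hypothetical stand-in for the real event classifier, and `min_len` and `step` are assumed parameters.

```python
from collections import Counter


def majority_label(window):
    """Hypothetical stand-in classifier: most frequent frame label in the window."""
    return Counter(window).most_common(1)[0][0]


def segment(frames, classify=majority_label, min_len=1, step=1):
    """Return a list of (start, end, label) segments, with `end` exclusive."""
    segments = []
    n = len(frames)
    left = 0                       # left border of the two-window span
    right = left + 2 * min_len     # right border (exclusive)
    while right <= n:
        boundary = None
        # Shift the inner border from left to right.
        for inner in range(left + min_len, right - min_len + 1):
            if classify(frames[left:inner]) != classify(frames[inner:right]):
                boundary = inner   # the two windows disagree: event boundary
                break
        if boundary is None:
            right += step          # no boundary: enlarge the whole window
        else:
            segments.append((left, boundary, classify(frames[left:boundary])))
            left = boundary        # continue after the detected boundary
            right = left + 2 * min_len
    if left < n:                   # remaining frames form the last segment
        segments.append((left, n, classify(frames[left:n])))
    return segments
```

With a real classifier, a larger `min_len` would usually be needed so that each window contains enough frames to classify reliably.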

13
BIC-like approach
Experimental setup: 57 meetings using Lexicon 1
14
Multi-layer Hidden Markov Model

Individual action layer (I-HMM): Speaking, Writing, Idle
Group action layer (G-HMM): Discussion, Monologue, ...
Each layer trained independently
Simultaneous segmentation and recognition
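
The two-layer idea can be sketched with a toy example: a shared lower-layer HMM labels each person's frames, and an upper-layer HMM decodes group actions from those labels. Everything concrete here is an assumption made for illustration: the probabilities, the "loud"/"quiet" observations, the reduction to two states per layer, and the use of the speaker count as the upper layer's observation.

```python
import math


def viterbi(observations, start, trans, emit):
    """Most likely state sequence of a discrete HMM (log-domain Viterbi).

    `start[s]`, `trans[p][s]` and `emit[s][o]` are plain probabilities (> 0).
    """
    states = list(start)
    # best[s] = (log-score, path) of the best path that ends in state s
    best = {
        s: (math.log(start[s]) + math.log(emit[s][observations[0]]), [s])
        for s in states
    }
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            prev = max(states, key=lambda p: best[p][0] + math.log(trans[p][s]))
            score = best[prev][0] + math.log(trans[prev][s]) + math.log(emit[s][obs])
            new_best[s] = (score, best[prev][1] + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


def sticky(states, stay=0.9):
    """Two-state transition table that favours staying in the current state."""
    return {p: {s: stay if p == s else 1 - stay for s in states} for p in states}


# Lower layer (I-HMM): one person-independent model shared by all participants.
I_STATES = ["Speaking", "Idle"]
I_START = {"Speaking": 0.5, "Idle": 0.5}
I_EMIT = {"Speaking": {"loud": 0.9, "quiet": 0.1}, "Idle": {"loud": 0.1, "quiet": 0.9}}

# Upper layer (G-HMM): observes how many people the lower layer labelled Speaking.
G_STATES = ["Monologue", "Discussion"]
G_START = {"Monologue": 0.5, "Discussion": 0.5}
G_EMIT = {"Monologue": {0: 0.1, 1: 0.8, 2: 0.1}, "Discussion": {0: 0.1, 1: 0.3, 2: 0.6}}


def recognise_group_actions(per_person_observations):
    """Decode individual actions per person, then group actions per frame."""
    individual = [
        viterbi(obs, I_START, sticky(I_STATES), I_EMIT)
        for obs in per_person_observations
    ]
    speaker_counts = [frame.count("Speaking") for frame in zip(*individual)]
    return viterbi(speaker_counts, G_START, sticky(G_STATES), G_EMIT)
```

Because the upper layer sees only the lower layer's output, either layer can be retrained or replaced without touching the other.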

15
Multi-layer Hidden Markov Model

Smaller observation space -> stable with limited training data
I-HMMs are person-independent -> much more training data from different persons is available
G-HMM is less sensitive to small changes in the low-level features
The two layers are trained independently -> different HMM combinations can be explored

16
Multi-layer Hidden Markov Model
Experimental setup: 59 meetings using Lexicon 2
17
Multi-stream DBN Model

Counter structure
  Improves state-duration modelling
  Action counter C is incremented after each action transition
Hierarchical approach
  State-space decomposition
  Two feature-related sub-action nodes (S1 and S2)
  Each sub-state corresponds to a cluster of feature vectors
  Unsupervised training
  Independent modality processing
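
The behaviour of the action counter can be shown on a frame-level action sequence. This is only a deterministic sketch of the counting rule stated above. In the DBN itself C is a node of the network, and the function name here is hypothetical.

```python
def action_counter(actions):
    """Return the value of counter C for every frame of an action sequence."""
    counter, values = 0, []
    for i, action in enumerate(actions):
        if i > 0 and action != actions[i - 1]:
            counter += 1           # an action transition occurred
        values.append(counter)
    return values
```

For example, the sequence Monologue, Monologue, Discussion, Discussion, Monologue gives the counter values 0, 0, 1, 1, 2.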

18
Experimental Results
Experimental setup: cross-validation over 53
meetings using Lexicon 1
19
Mixed-State DBN

Couples a multi-stream HMM with a linear dynamical
system (LDS)
The HMM provides the driving input for the LDS
With the HMM as driving input, disturbances can be
compensated
The whole system can be described as a DBN
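
The slides do not give the equations of this coupling. One common way to write an LDS driven by a discrete HMM state, shown here purely as an illustrative assumption, is:

```latex
\begin{align}
x_t &= A\,x_{t-1} + B\,u(q_t) + w_t, & w_t &\sim \mathcal{N}(0, Q),\\
y_t &= C\,x_t + v_t,                 & v_t &\sim \mathcal{N}(0, R).
\end{align}
```

Here $q_t$ is the HMM state at time $t$, $u(q_t)$ is the driving input selected by that state, $x_t$ is the continuous hidden state, and $y_t$ is the observed feature vector. All symbols are assumed names and do not come from the presentation. In this form, the noise terms $w_t$ and $v_t$ can absorb disturbances in the observed features.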

20
Mixed-State DBN
Experimental setup: 59 meetings using Lexicon 1
21
Summary

Joint effort of TUM, IDIAP, and UEDIN
Modelling meetings as a sequence of individual
and group actions
Large number of audio-visual features
Four different group action modelling frameworks
  Algorithm based on BIC
  Multi-layer HMM
  Multi-stream DBN
  Mixed-state DBN
Good performance of the proposed frameworks
Future work
  Room for further improvement, both in the
  feature domain and in the model structures
  Investigate combinations of the proposed models