Title: Dynamic Bayesian Networks for Meeting Structuring
1. Dynamic Bayesian Networks for Meeting Structuring
- Alfred Dielmann, Steve Renals
- (University of Sheffield)
2. Introduction
GOAL
- Automatic analysis of meetings through multimodal event recognition,
  using objective measures and statistical methods
- Events may involve one or more communicative modalities, and
  represent the behaviour of a single participant or of the whole group
3. Multimodal Recognition
[Pipeline diagram: Audio and Video signals from the Meeting Room go
through Signal Pre-processing and Feature Extraction; Specialised
Recognition Systems (Speech, Video, Gestures) and Models feed a
Multimodal Events Recognition stage, backed by a Knowledge Database
and used for Information Retrieval]
4. Group Actions
- The machine observes group behaviours through objective measures
  (external observer)
- The results of this analysis are structured into a sequence of
  symbols (coding system)
  - Exhaustive (covering the entire meeting duration)
  - Mutually exclusive (non-overlapping symbols)
- We used the coding system adopted by the IDIAP framework, composed
  of 5 meeting actions derived from different communicative modalities:
  - Monologue / Dialogue / Note taking / Presentation / Presentation
    at the whiteboard
5. Corpus
- 60 meetings (2 sets of 30) collected in the IDIAP Smart Meeting Room
  - 30 meetings are used for training
  - 23 meetings are used for testing
  - 7 meetings will be used for results validation
- 4 participants per meeting
- 5 hours of multi-channel audio-visual recordings
  - 3 fixed cameras
  - 4 lapel microphones, an 8-element circular microphone array
- Meeting agendas are generated a priori and strictly followed, in
  order to have an average of 5 meeting actions per meeting
- Available for public distribution: http://mmm.idiap.ch/
6Features (1)
Only features derived from audio are currently
used...
Speaker Turns
Dimension reduction
Mic. Array
Beam-forming
Prosody and Acoustic
Lapel Mic.
Pitch baseline
Energy
Rate Of Speech
..
7. Features (2)
Speaker turn features: location-based speech activities L_i(t),
L_j(t-1), L_k(t-2) over speaker locations i, j, k (SRP-PHAT
beamforming)
Kindly provided by IDIAP
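The slide gives only the feature names; as a loose illustration (the
function, array shapes and lag are assumptions, not the authors'
construction), speaker-turn features at time t could stack the
per-location speech activities of the current and previous frames:

```python
import numpy as np

def speaker_turn_features(activity, t, lag=2):
    """Stack per-location speech activities L_i(t), L_j(t-1), ...
    into one speaker-turn feature vector.

    activity: (T, n_locations) array of per-location speech activity
              estimates (e.g. from SRP-PHAT beamforming).
    """
    frames = [activity[max(t - d, 0)] for d in range(lag + 1)]
    return np.concatenate(frames)

# Toy example: 4 speaker locations, 100 frames
activity = (np.random.rand(100, 4) > 0.7).astype(float)
print(speaker_turn_features(activity, t=50))  # 12-dim vector
```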
8. Features (3)
Prosodic feature extraction: mask features using speech activity
- Lapel mic.: RMS energy; pitch (pitch extractor + filters (*));
  rate of speech (MRATE)
- Mic. array: beam-forming
(*) Histogram, median and interpolating filter
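A minimal sketch of the masking and filtering chain, assuming a binary
speech-activity track and placeholder filter sizes (the slide does not
specify the exact parameters, and the histogram filter is omitted
here):

```python
import numpy as np
from scipy.signal import medfilt

def clean_pitch(pitch, speech_activity, kernel=5):
    """Mask a raw pitch track with binary speech activity,
    interpolate over the silent gaps, then median-filter."""
    pitch = np.asarray(pitch, dtype=float)
    valid = (speech_activity > 0) & (pitch > 0)
    idx = np.arange(len(pitch))
    # linear interpolation across masked (non-speech) frames
    interpolated = np.interp(idx, idx[valid], pitch[valid])
    # median filtering removes residual spikes (e.g. octave errors)
    return medfilt(interpolated, kernel_size=kernel)
```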
9Features (4)
Wed like to integrate other features..
Participants Motion features
Video
Image Processing
Other blob positions
Gestures and Actions
Transcripts
Audio.
ASR
Everything that could be automatically extracted
from a recorded meeting
Other
10. Dynamic Bayesian Networks (1)
Bayesian Networks are a convenient graphical way to describe
statistical (in)dependencies among random variables:
- a Directed Acyclic Graph over the variables
- Conditional Probability Tables (CPTs) attached to the nodes
Given a set of examples, EM learning algorithms (e.g. Baum-Welch) can
be used to train the CPTs
Given a set of known evidence nodes, the probability of the other
nodes can be computed through inference
[Figure: example network with nodes A, F, C, S, L, O]
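To make the inference step concrete, here is a toy two-node network (a
hypothetical Rain -> WetGrass example, not the network in the figure)
with one CPT and a posterior computed by enumeration:

```python
# Minimal Bayesian network: Rain -> WetGrass (hypothetical example)
p_rain = {True: 0.2, False: 0.8}                      # prior P(R)
p_wet_given_rain = {True: {True: 0.9, False: 0.1},    # CPT P(W | R)
                    False: {True: 0.2, False: 0.8}}

def posterior_rain(wet_observed: bool) -> float:
    """Infer P(Rain | WetGrass = wet_observed) by enumeration."""
    joint = {r: p_rain[r] * p_wet_given_rain[r][wet_observed]
             for r in (True, False)}
    return joint[True] / sum(joint.values())

print(posterior_rain(True))   # ~0.529
```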
11. Dynamic Bayesian Networks (2)
- DBNs are an extension of BNs with random variables that evolve in
  time:
  - instancing a static BN for each temporal slice t
  - making explicit the temporal dependencies between variables
[Figure: the static network (C, S, L, O) unrolled over slices
t = 0, 1, ..., T]
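A sketch of what "instancing a static BN for each slice" amounts to,
as a plain data-structure exercise (a real toolkit such as GMTK
performs the unrolling internally; the names here are illustrative):

```python
def unroll(template_nodes, intra_edges, inter_edges, T):
    """Build the node and edge lists of the unrolled network.

    template_nodes: variables in one slice, e.g. ["S", "Y"]
    intra_edges:    within-slice edges, e.g. [("S", "Y")]
    inter_edges:    slice-to-slice edges, e.g. [("S", "S")]
    """
    nodes = [f"{n}_{t}" for t in range(T) for n in template_nodes]
    edges = [(f"{a}_{t}", f"{b}_{t}") for t in range(T)
             for a, b in intra_edges]
    edges += [(f"{a}_{t}", f"{b}_{t+1}") for t in range(T - 1)
              for a, b in inter_edges]
    return nodes, edges

nodes, edges = unroll(["S", "Y"], [("S", "Y")], [("S", "S")], T=3)
print(edges)  # [('S_0','Y_0'), ..., ('S_0','S_1'), ('S_1','S_2')]
```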
12. Dynamic Bayesian Networks (3)
- Hidden Markov Models, Kalman Filter Models and other state-space
  models are just special cases of DBNs
[Figure: an HMM represented as an instance of a DBN — hidden states
Q_0, ..., Q_t, Q_{t+1} linked by the transition matrix A and initial
distribution pi; observations Y_0, ..., Y_t, Y_{t+1} generated through
the emission model B]
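As a reminder of the special case, a minimal forward-algorithm sketch
over the standard (pi, A, B) HMM parameterisation:

```python
import numpy as np

def hmm_forward(pi, A, B, obs):
    """Forward algorithm: P(Y_0..Y_T) for a discrete HMM.

    pi:  (N,) initial state distribution
    A:   (N, N) transitions, A[i, j] = P(Q_{t+1}=j | Q_t=i)
    B:   (N, M) emissions,   B[i, k] = P(Y_t=k | Q_t=i)
    obs: sequence of observation symbols (ints in 0..M-1)
    """
    alpha = pi * B[:, obs[0]]
    for y in obs[1:]:
        alpha = (alpha @ A) * B[:, y]
    return alpha.sum()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(hmm_forward(pi, A, B, [0, 1, 0]))
```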
13. Dynamic Bayesian Networks (4)
- Representing HMMs in terms of DBNs makes it easy to create
  variations on the basic theme
[Figures: Factorial HMMs — several independent hidden chains
(Q_t, V_t, Z_t, ...) jointly generating a single observation Y_t;
Coupled HMMs — parallel chains (Q_t, Z_t) with cross-links between
the chains, each emitting its own observations (Y_t, X_t)]
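One way to see a factorial HMM as "just a DBN" is that its independent
chains can be collapsed into a single joint chain; a tiny sketch of
that collapse (illustrative only, since exact inference this way
forfeits the factored structure the model exists to exploit):

```python
import numpy as np

def joint_transition(A1, A2):
    """Joint transition matrix over (q1, q2) from per-chain matrices:
    P((q1', q2') | (q1, q2)) = A1[q1, q1'] * A2[q2, q2']."""
    return np.kron(A1, A2)

A1 = np.array([[0.9, 0.1], [0.2, 0.8]])
A2 = np.array([[0.5, 0.5], [0.3, 0.7]])
print(joint_transition(A1, A2).shape)  # (4, 4) joint state space
```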
14. Dynamic Bayesian Networks (5)
- The use of DBNs and BNs presents some advantages:
  - Intuitive way to represent models graphically, with a standard
    notation
  - Unified theory for a huge number of models
    - Connecting different models in a structured view
    - Making it easier to study new models
  - Unified set of tools (e.g. GMTK) to work with them (training,
    inference, decoding)
    - Maximizes resource reuse
    - Minimizes setup time
15. First Model (1)
- Early integration of features, modelled through a 2-level Hidden
  Markov Model
[Figure: three layers unrolled over time — hidden meeting actions
A_0, ..., A_t, A_{t+1}, ..., A_T; hidden sub-states S_0, ..., S_T;
observable feature vectors Y_0, ..., Y_T]
16First Model (2)
- The main idea behind this model is to decompose
each meeting action in a sequence of sub
actions or substates - (Note that different actions are free to share
the same sub-state)
- The structure is composed by two Ergodic HMM
chains - The top chain links sub-states St with
actions At - The lower one maps directly the feature vectors
Yt into a sub-state St
A0
At
.
S0
St
.
Y0
Yt
17. First Model (3)
- The sequence of actions At is known a priori
- The sequence St is determined during the training process, and the
  meaning of each sub-state is unknown
  - The cardinality of St is one of the model's parameters
- The mapping of observable features Yt into hidden sub-states St is
  obtained through Gaussian Mixture Models
[Figure: the A -> S -> Y dependency structure]
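A compact sketch of the lower level of this model, with one GMM per
sub-state as the Yt -> St emission model (the sub-state count, mixture
size and feature dimension are arbitrary placeholders, and
scikit-learn's GaussianMixture stands in for whatever estimator was
actually used):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

N_SUBSTATES, N_MIX, DIM = 6, 3, 12
rng = np.random.default_rng(0)

# One GMM per hidden sub-state; fitted here on random placeholder
# data in lieu of the frames assigned to each sub-state in training
substate_gmms = []
for _ in range(N_SUBSTATES):
    g = GaussianMixture(n_components=N_MIX)
    g.fit(rng.normal(size=(200, DIM)))
    substate_gmms.append(g)

def emission_loglik(frame):
    """log p(Y_t | S_t = s) for every sub-state s."""
    return np.array([g.score_samples(frame.reshape(1, -1))[0]
                     for g in substate_gmms])

print(emission_loglik(rng.normal(size=DIM)))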
18. Second Model (1)
- Multistream processing of features through two parallel and
  independent Hidden Markov Models
[Figure: action counter C_0, ..., C_T; enable-transitions nodes
E_0, ..., E_T; meeting actions A_0, ..., A_T; two chains of hidden
sub-states, S_t^1 fed by prosodic features Y_t^1 and S_t^2 fed by
speaker turn features Y_t^2]
19Second Model (2)
- Each features-group (or modality) Ym, is mapped
into an independent HMM chain, therefore every
group is evaluated independently and mapped into
an hidden sub-state Stn
As in the previous model, there is another HMM
layer (A), witch represents meeting actions
A0
At
.
S01
St1
.
The whole sub-state St1 x St2 x Stn is
mapped into an action At
S02
St2
.
Y01
Yt1
Y02
Yt2
20Second Model (3)
- It is a variable-duration HMM with explicit
enable node - At represents meeting actions as usual
- Ct counts meeting actions
- Et is a binary indicator variable that enables
states changes inside the node At
Ct 1 1 2 2 2
Et 0 1 0 0 0
At 8 8 5 5 5
.
.
C0
C0
C0
E0
E0
E0
A0
At
At1
.
.
21. Second Model (4)
- Training: when At changes, Ct is incremented and Et is set on for a
  single frame (At, Et and Ct are part of the training dataset)
- The behaviours of Et and Ct learned during the training phase are
  then exploited during decoding

  Ct:  1 1 2 2 2
  Et:  0 1 0 0 0
  At:  8 8 5 5 5

- Decoding: At is free to change only if Et is high, and then
  according to the Ct state
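A toy sketch of the bookkeeping that derives Et and Ct from a labelled
action sequence, following the convention in the table above (the
variable names are mine):

```python
def enable_and_counter(actions):
    """Derive the counter C_t and enable flag E_t from an action
    sequence A_t: E_t fires on the frame *before* an action change,
    and C_t increments at the change itself."""
    C, E = [1], []
    for t in range(1, len(actions)):
        changed = actions[t] != actions[t - 1]
        E.append(1 if changed else 0)        # this is E_{t-1}
        C.append(C[-1] + 1 if changed else C[-1])
    E.append(0)                              # no change after last frame
    return C, E

print(enable_and_counter([8, 8, 5, 5, 5]))
# -> ([1, 1, 2, 2, 2], [0, 1, 0, 0, 0])
```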
22Results
- Using the two models previously described,
results obtained using only audio derived
features
Corr. Sub. Del. Ins. AER
First Model 93.2 2.3 4.5 4.5 11.4
Second Model 94.7 1.5 3.8 0.8 6.1
The second model reduces effectively both the
number of Substitutions and the number of
Insertions
Equivalent to the Word Error Rate measure, used
to evaluate speech recogniser performances
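AER follows the usual WER formula, (Sub + Del + Ins) / N, over a
Levenshtein alignment of the action sequences; a minimal sketch:

```python
def action_error_rate(reference, hypothesis):
    """AER = (substitutions + deletions + insertions) / len(reference),
    computed with a standard Levenshtein alignment (same as WER)."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i-1][j-1] + (reference[i-1] != hypothesis[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    return d[n][m] / n

print(action_error_rate(list("MDPWN"), list("MDPN")))  # 1 del -> 0.2
```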
23. Conclusions
- A new approach has been proposed
- Achieved results seem promising; in the future we'd like to:
  - Validate them on the remaining part of the test-set (or eventually
    an independent test-set)
  - Integrate other features: video, ASR transcripts, Xtalk, ...
  - Try new experiments with the existing models
  - Develop new DBN-based models
25. Multimodal Recognition (2)
Knowledge sources:
- Raw Audio
- Raw Video
- Acoustic Features
- Visual Features
- Automatic Speech Recognition
- Video Understanding
- Gesture Recognition
- Eye Gaze Tracking
- Emotion Detection
- ...

Approaches:
- A standalone high-level recogniser operating on low-level raw data
- Fusion of different recognisers at an early stage, generating hybrid
  recognisers (like AVSR)
- Integration of recogniser outputs through a high-level recogniser