Title: Dynamic Bayesian Networks for Meeting Structuring
1. Dynamic Bayesian Networks for Meeting Structuring
- Alfred Dielmann, Steve Renals
- (University of Sheffield)
2. Introduction
GOAL
- Automatic analysis of meetings through multimodal event recognition,
  using objective measures and statistical methods
- Events may involve one or more communicative modalities, and
  represent the behaviour of a single participant or of the whole group
3. Multimodal Recognition
[Pipeline diagram: Audio and Video signals from the Meeting Room go
through Signal Pre-processing and Feature Extraction; Specialised
Recognition Systems (Speech, Video, Gestures) and Models feed a
Multimodal Events Recognition stage, backed by a Knowledge Database
and used for Information Retrieval]
4. Group Actions
- The machine observes group behaviours through objective measures
  (external observer)
- The results of this analysis are structured into a sequence of
  symbols (coding system)
  - Exhaustive (covering the entire meeting duration)
  - Mutually exclusive (non-overlapping symbols)
- We used the coding system adopted by the IDIAP framework, composed
  of 5 meeting actions derived from different communicative modalities:
  - Monologue / Dialogue / Note taking / Presentation / Presentation
    at the whiteboard
5. Corpus
- 60 meetings (2 sets of 30) collected in the IDIAP Smart Meeting Room
  - 30 meetings are used for training
  - 23 meetings are used for testing
  - 7 meetings will be used for results validation
- 4 participants per meeting
- 5 hours of multi-channel audio-visual recordings
  - 3 fixed cameras
  - 4 lapel microphones, an 8-element circular microphone array
- Meeting agendas are generated a priori and strictly followed, in
  order to have an average of 5 meeting actions per meeting
- Available for public distribution: http://mmm.idiap.ch/
6Features (1)
Only features derived from audio are currently
used...
Speaker Turns
Dimension reduction
Mic. Array
Beam-forming
Prosody and Acoustic
Lapel Mic.
Pitch baseline
Energy
Rate Of Speech
..
7. Features (2)
Speaker turn features: location-based speech activities L_i(t),
L_j(t-1), L_k(t-2) over speaker locations i, j, k (SRP-PHAT
beamforming)
Kindly provided by IDIAP
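The slide gives only the feature names; as a loose illustration (the
function, array shapes and lag are assumptions, not the authors'
construction), speaker-turn features at time t could stack the
per-location speech activities of the current and previous frames:

```python
import numpy as np

def speaker_turn_features(activity, t, lag=2):
    """Stack per-location speech activities L_i(t), L_j(t-1), ...
    into one speaker-turn feature vector.

    activity: (T, n_locations) array of per-location speech activity
              estimates (e.g. from SRP-PHAT beamforming).
    """
    frames = [activity[max(t - d, 0)] for d in range(lag + 1)]
    return np.concatenate(frames)

# Toy example: 4 speaker locations, 100 frames
activity = (np.random.rand(100, 4) > 0.7).astype(float)
print(speaker_turn_features(activity, t=50))  # 12-dim vector
```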
8. Features (3)
Prosodic feature extraction: mask features using speech activity
- Lapel mic.: RMS energy; pitch (pitch extractor + filters (*));
  rate of speech (MRATE)
- Mic. array: beam-forming
(*) Histogram, median and interpolating filter
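A minimal sketch of the masking and filtering chain, assuming a binary
speech-activity track and placeholder filter sizes (the slide does not
specify the exact parameters, and the histogram filter is omitted
here):

```python
import numpy as np
from scipy.signal import medfilt

def clean_pitch(pitch, speech_activity, kernel=5):
    """Mask a raw pitch track with binary speech activity,
    interpolate over the silent gaps, then median-filter."""
    pitch = np.asarray(pitch, dtype=float)
    valid = (speech_activity > 0) & (pitch > 0)
    idx = np.arange(len(pitch))
    # linear interpolation across masked (non-speech) frames
    interpolated = np.interp(idx, idx[valid], pitch[valid])
    # median filtering removes residual spikes (e.g. octave errors)
    return medfilt(interpolated, kernel_size=kernel)
```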
9Features (4)
Wed like to integrate other features..
Participants Motion features
Video
Image Processing
Other blob positions
Gestures and Actions
Transcripts
Audio.
ASR
Everything that could be automatically extracted
from a recorded meeting
Other
10. Dynamic Bayesian Networks (1)
Bayesian Networks are a convenient graphical way to describe
statistical (in)dependencies among random variables:
- a Directed Acyclic Graph over the variables
- Conditional Probability Tables (CPTs) attached to the nodes
Given a set of examples, EM learning algorithms (e.g. Baum-Welch) can
be used to train the CPTs
Given a set of known evidence nodes, the probability of the other
nodes can be computed through inference
[Figure: example network with nodes A, F, C, S, L, O]
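To make the inference step concrete, here is a toy two-node network (a
hypothetical Rain -> WetGrass example, not the network in the figure)
with one CPT and a posterior computed by enumeration:

```python
# Minimal Bayesian network: Rain -> WetGrass (hypothetical example)
p_rain = {True: 0.2, False: 0.8}                      # prior P(R)
p_wet_given_rain = {True: {True: 0.9, False: 0.1},    # CPT P(W | R)
                    False: {True: 0.2, False: 0.8}}

def posterior_rain(wet_observed: bool) -> float:
    """Infer P(Rain | WetGrass = wet_observed) by enumeration."""
    joint = {r: p_rain[r] * p_wet_given_rain[r][wet_observed]
             for r in (True, False)}
    return joint[True] / sum(joint.values())

print(posterior_rain(True))   # ~0.529
```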
11. Dynamic Bayesian Networks (2)
- DBNs are an extension of BNs with random variables that evolve in
  time:
  - instancing a static BN for each temporal slice t
  - making explicit the temporal dependencies between variables
[Figure: the static network (C, S, L, O) unrolled over slices
t = 0, 1, ..., T]
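A sketch of what "instancing a static BN for each slice" amounts to,
as a plain data-structure exercise (a real toolkit such as GMTK
performs the unrolling internally; the names here are illustrative):

```python
def unroll(template_nodes, intra_edges, inter_edges, T):
    """Build the node and edge lists of the unrolled network.

    template_nodes: variables in one slice, e.g. ["S", "Y"]
    intra_edges:    within-slice edges, e.g. [("S", "Y")]
    inter_edges:    slice-to-slice edges, e.g. [("S", "S")]
    """
    nodes = [f"{n}_{t}" for t in range(T) for n in template_nodes]
    edges = [(f"{a}_{t}", f"{b}_{t}") for t in range(T)
             for a, b in intra_edges]
    edges += [(f"{a}_{t}", f"{b}_{t+1}") for t in range(T - 1)
              for a, b in inter_edges]
    return nodes, edges

nodes, edges = unroll(["S", "Y"], [("S", "Y")], [("S", "S")], T=3)
print(edges)  # [('S_0','Y_0'), ..., ('S_0','S_1'), ('S_1','S_2')]
```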
12. Dynamic Bayesian Networks (3)
- Hidden Markov Models, Kalman Filter Models and other state-space
  models are just special cases of DBNs
[Figure: an HMM represented as an instance of a DBN — hidden states
Q_0, ..., Q_t, Q_{t+1} linked by the transition matrix A and initial
distribution pi; observations Y_0, ..., Y_t, Y_{t+1} generated through
the emission model B]
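As a reminder of the special case, a minimal forward-algorithm sketch
over the standard (pi, A, B) HMM parameterisation:

```python
import numpy as np

def hmm_forward(pi, A, B, obs):
    """Forward algorithm: P(Y_0..Y_T) for a discrete HMM.

    pi:  (N,) initial state distribution
    A:   (N, N) transitions, A[i, j] = P(Q_{t+1}=j | Q_t=i)
    B:   (N, M) emissions,   B[i, k] = P(Y_t=k | Q_t=i)
    obs: sequence of observation symbols (ints in 0..M-1)
    """
    alpha = pi * B[:, obs[0]]
    for y in obs[1:]:
        alpha = (alpha @ A) * B[:, y]
    return alpha.sum()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(hmm_forward(pi, A, B, [0, 1, 0]))
```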
13. Dynamic Bayesian Networks (4)
- Representing HMMs in terms of DBNs makes it easy to create
  variations on the basic theme
[Figures: Factorial HMMs — several independent hidden chains
(Q_t, V_t, Z_t, ...) jointly generating a single observation Y_t;
Coupled HMMs — parallel chains (Q_t, Z_t) with cross-links between
the chains, each emitting its own observations (Y_t, X_t)]
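One way to see a factorial HMM as "just a DBN" is that its independent
chains can be collapsed into a single joint chain; a tiny sketch of
that collapse (illustrative only, since exact inference this way
forfeits the factored structure the model exists to exploit):

```python
import numpy as np

def joint_transition(A1, A2):
    """Joint transition matrix over (q1, q2) from per-chain matrices:
    P((q1', q2') | (q1, q2)) = A1[q1, q1'] * A2[q2, q2']."""
    return np.kron(A1, A2)

A1 = np.array([[0.9, 0.1], [0.2, 0.8]])
A2 = np.array([[0.5, 0.5], [0.3, 0.7]])
print(joint_transition(A1, A2).shape)  # (4, 4) joint state space
```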
14. Dynamic Bayesian Networks (5)
- The use of DBNs and BNs presents some advantages:
  - Intuitive way to represent models graphically, with a standard
    notation
  - Unified theory for a huge number of models
    - Connecting different models in a structured view
    - Making it easier to study new models
  - Unified set of tools (e.g. GMTK) to work with them (training,
    inference, decoding)
    - Maximizes resource reuse
    - Minimizes setup time
15. First Model (1)
- Early integration of features, modelled through a 2-level Hidden
  Markov Model
[Figure: three layers unrolled over time — hidden meeting actions
A_0, ..., A_t, A_{t+1}, ..., A_T; hidden sub-states S_0, ..., S_T;
observable feature vectors Y_0, ..., Y_T]
16First Model (2)
- The main idea behind this model is to decompose
each meeting action in a sequence of sub
actions or substates - (Note that different actions are free to share
the same sub-state)
- The structure is composed by two Ergodic HMM
chains - The top chain links sub-states St with
actions At - The lower one maps directly the feature vectors
Yt into a sub-state St
A0
At
.
S0
St
.
Y0
Yt
17. First Model (3)
- The sequence of actions At is known a priori
- The sequence St is determined during the training process, and the
  meaning of each sub-state is unknown
  - The cardinality of St is one of the model's parameters
- The mapping of observable features Yt into hidden sub-states St is
  obtained through Gaussian Mixture Models
[Figure: the A -> S -> Y dependency structure]
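A compact sketch of the lower level of this model, with one GMM per
sub-state as the Yt -> St emission model (the sub-state count, mixture
size and feature dimension are arbitrary placeholders, and
scikit-learn's GaussianMixture stands in for whatever estimator was
actually used):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

N_SUBSTATES, N_MIX, DIM = 6, 3, 12
rng = np.random.default_rng(0)

# One GMM per hidden sub-state; fitted here on random placeholder
# data in lieu of the frames assigned to each sub-state in training
substate_gmms = []
for _ in range(N_SUBSTATES):
    g = GaussianMixture(n_components=N_MIX)
    g.fit(rng.normal(size=(200, DIM)))
    substate_gmms.append(g)

def emission_loglik(frame):
    """log p(Y_t | S_t = s) for every sub-state s."""
    return np.array([g.score_samples(frame.reshape(1, -1))[0]
                     for g in substate_gmms])

print(emission_loglik(rng.normal(size=DIM)))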
18. Second Model (1)
- Multistream processing of features through two parallel and
  independent Hidden Markov Models
[Figure: action counter C_0, ..., C_T; enable-transitions nodes
E_0, ..., E_T; meeting actions A_0, ..., A_T; two chains of hidden
sub-states, S_t^1 fed by prosodic features Y_t^1 and S_t^2 fed by
speaker turn features Y_t^2]
19Second Model (2)
- Each features-group (or modality) Ym, is mapped
into an independent HMM chain, therefore every
group is evaluated independently and mapped into
an hidden sub-state Stn
As in the previous model, there is another HMM
layer (A), witch represents meeting actions
A0
At
.
S01
St1
.
The whole sub-state St1 x St2 x Stn is
mapped into an action At
S02
St2
.
Y01
Yt1
Y02
Yt2
20Second Model (3)
- It is a variable-duration HMM with explicit
enable node - At represents meeting actions as usual
- Ct counts meeting actions
- Et is a binary indicator variable that enables
states changes inside the node At
Ct 1 1 2 2 2
Et 0 1 0 0 0
At 8 8 5 5 5
.
.
C0
C0
C0
E0
E0
E0
A0
At
At1
.
.
21. Second Model (4)
- Training: when At changes, Ct is incremented and Et is set on for a
  single frame (At, Et and Ct are part of the training dataset)
- The behaviours of Et and Ct learned during the training phase are
  then exploited during decoding

  Ct:  1 1 2 2 2
  Et:  0 1 0 0 0
  At:  8 8 5 5 5

- Decoding: At is free to change only if Et is high, and then
  according to the Ct state
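A toy sketch of the bookkeeping that derives Et and Ct from a labelled
action sequence, following the convention in the table above (the
variable names are mine):

```python
def enable_and_counter(actions):
    """Derive the counter C_t and enable flag E_t from an action
    sequence A_t: E_t fires on the frame *before* an action change,
    and C_t increments at the change itself."""
    C, E = [1], []
    for t in range(1, len(actions)):
        changed = actions[t] != actions[t - 1]
        E.append(1 if changed else 0)        # this is E_{t-1}
        C.append(C[-1] + 1 if changed else C[-1])
    E.append(0)                              # no change after last frame
    return C, E

print(enable_and_counter([8, 8, 5, 5, 5]))
# -> ([1, 1, 2, 2, 2], [0, 1, 0, 0, 0])
```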
22Results
- Using the two models previously described,
results obtained using only audio derived
features
Corr. Sub. Del. Ins. AER
First Model 93.2 2.3 4.5 4.5 11.4
Second Model 94.7 1.5 3.8 0.8 6.1
The second model reduces effectively both the
number of Substitutions and the number of
Insertions
Equivalent to the Word Error Rate measure, used
to evaluate speech recogniser performances
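AER follows the usual WER formula, (Sub + Del + Ins) / N, over a
Levenshtein alignment of the action sequences; a minimal sketch:

```python
def action_error_rate(reference, hypothesis):
    """AER = (substitutions + deletions + insertions) / len(reference),
    computed with a standard Levenshtein alignment (same as WER)."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i-1][j-1] + (reference[i-1] != hypothesis[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    return d[n][m] / n

print(action_error_rate(list("MDPWN"), list("MDPN")))  # 1 del -> 0.2
```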
23. Conclusions
- A new approach has been proposed
- Achieved results seem promising; in the future we'd like to:
  - Validate them on the remaining part of the test-set (or eventually
    an independent test-set)
  - Integrate other features: video, ASR transcripts, Xtalk, ...
  - Try new experiments with the existing models
  - Develop new DBN-based models
25. Multimodal Recognition (2)
Knowledge sources:
- Raw Audio
- Raw Video
- Acoustic Features
- Visual Features
- Automatic Speech Recognition
- Video Understanding
- Gesture Recognition
- Eye Gaze Tracking
- Emotion Detection
- ...

Approaches:
- A standalone high-level recogniser operating on low-level raw data
- Fusion of different recognisers at an early stage, generating hybrid
  recognisers (like AVSR)
- Integration of recogniser outputs through a high-level recogniser