Modelling and Analyzing Multimodal Dyadic Interactions Using Social Networks


Sergio Escalera, Petia Radeva, Jordi Vitrià,
Xavier Baró and Bogdan Raducanu

Outline
Introduction
Audio-visual cue extraction and fusion
Social Network extraction and analysis
Experimental Results
Conclusions and future work

Introduction
Social interactions play a very important role in
people's daily lives.
Present trend: analysis of human behavior based
on electronic communications (SMS, e-mails, chat).
New trend: analysis of human behavior based on
nonverbal communication (social signals).
Quantification of social signals represents a
powerful cue to characterize human behavior:
facial expression, hand and body gestures, focus
of attention, voice prosody, etc.

Social Network Analysis (SNA) has been developed
as a tool to model social interactions in terms
of a graph-based structure
- Nodes represent the actors: persons,
communities, institutions, etc.
- Links represent a specific type of
interdependency: friendship, familiarity, business
transactions, etc.
A common way to characterize the information
encoded in a social network is to use several
centrality measures.

Our contribution
In this work, we propose an integrated framework
for the extraction and analysis of a social network
from multimodal (audio/visual) dyadic interactions.
The advantage is that it is based on a totally
non-intrusive technology.
First: we perform speech segmentation through an
audio/visual fusion scheme
- In the audio domain, speech is detected
through clustering of audio features
- In the visual domain, speech is detected
through differential-based feature extraction
from the segmented mouth region
- The fusion scheme is based on stacked
sequential learning
We used a set of videos belonging to the New
York Times Blogging Heads opinion blog. The
videos depict two persons talking about different
subjects in front of a webcam.
- Second: to quantify the dyadic interaction, we
used the Influence Model, whose states encode the
previously integrated audio-visual data
- Third: the social network is extracted based on
the estimated influences, and its properties are
characterized using several centrality
measures.

Block-diagram representation of our integrated
framework

The use of the term "influence" is inspired by the
previous work of Choudhury (T. Choudhury, 2003.
Sensing and Modelling Human Networks, Ph.D.
Thesis, MIT Media Lab).
2. Audio-visual cue extraction and fusion

Audio cue
Description
First 12 MFCC coefficients
Signal energy
Temporal cepstral derivatives (Δ and Δ²)
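
As a rough illustration of how the Δ and Δ² terms extend a per-window feature vector, the sketch below appends first and second temporal derivatives to each frame. It assumes a simple central difference with edge clamping; practical front-ends often use a regression over several frames instead, and the function names here are hypothetical.

```python
def delta(frames):
    """Central-difference time derivative of a list of feature vectors.

    Assumption: edges are handled by repeating the first/last frame.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_deltas(frames):
    """Concatenate static features with their delta and delta-delta."""
    d1 = delta(frames)
    d2 = delta(d1)
    return [list(f) + a + b for f, a, b in zip(frames, d1, d2)]


# Toy input: 5 frames of 13 static features (12 MFCC + energy).
toy_frames = [[float(t)] * 13 for t in range(5)]
toy_full = add_deltas(toy_frames)
```

With 13 static features per window, each extended vector has 39 entries.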

Audio cue
Diarization process
Segmentation
Coarse segmentation according to the Generalized
Likelihood Ratio (GLR) between consecutive windows
Clustering
Agglomerative hierarchical clustering with a BIC
stopping criterion
Segment boundaries are adjusted at the end
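
The coarse segmentation step can be sketched as follows. This is a minimal example, assuming each window is modelled by a single one-dimensional Gaussian fitted by maximum likelihood; the slides do not state the actual model, and real MFCC vectors would need a multivariate version. The function names are hypothetical.

```python
import math


def ml_variance(xs):
    """Maximum-likelihood variance, floored to avoid log(0)."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return max(var, 1e-12)


def glr_distance(win_a, win_b):
    """Log generalized likelihood ratio between two windows.

    Compares "two separate Gaussians" against "one shared Gaussian".
    The value is near 0 when both windows look alike and grows when
    they differ, which suggests a speaker or speech/non-speech change.
    """
    joint = list(win_a) + list(win_b)
    n_a, n_b = len(win_a), len(win_b)
    return 0.5 * ((n_a + n_b) * math.log(ml_variance(joint))
                  - n_a * math.log(ml_variance(win_a))
                  - n_b * math.log(ml_variance(win_b)))


same = glr_distance([0.0, 1.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0])
different = glr_distance([0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0])
```

A boundary candidate would be placed wherever this distance between consecutive windows exceeds a chosen threshold; the threshold itself is not given in the slides.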

Visual cue
Description
Face segmentation based on Viola-Jones detector
Mouth region segmentation
Vector of HOG descriptors for the mouth
region

Visual cue
Classification
Non-speech class modelling
One-class Dynamic Time Warping (DTW), computed
by dynamic programming
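
The dynamic programming equation from the slide is not reproduced in this transcript. As a sketch only, the textbook DTW recurrence is D(i, j) = d(i, j) + min(D(i-1, j), D(i, j-1), D(i-1, j-1)), where d(i, j) is the local distance between frame i of one sequence and frame j of the other. The variant used by the authors may differ, and the Euclidean local distance below is an assumption.

```python
def dtw_cost(seq_a, seq_b):
    """Accumulated DTW cost between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # D[i][j] holds the best cost of aligning the first i and j frames.
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed local distance: Euclidean between the two frames.
            d = sum((x - y) ** 2
                    for x, y in zip(seq_a[i - 1], seq_b[j - 1])) ** 0.5
            D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


stretched = dtw_cost([[0.0], [1.0], [2.0]], [[0.0], [1.0], [1.0], [2.0]])
shifted = dtw_cost([[0.0]], [[3.0]])
```

In a one-class setting, a test sequence could be labelled non-speech when its cost to the non-speech model falls below a threshold; that decision rule is an assumption rather than something stated in the slides.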

Fusion scheme
Stacked sequential learning (suitable for
problems characterized by long runs of identical
labels)
Fusion of audio-visual modalities
Temporal relations between both feature sets are
learned by a two-stage classifier (based on
AdaBoost)
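
The core idea of stacked sequential learning can be sketched as below: a first-stage classifier labels every frame, and the second stage is trained on the original features extended with the first-stage predictions in a temporal window around each frame. The classifiers themselves (AdaBoost in this work) are left out; the function name, the zero padding at sequence edges, and the window size are illustrative assumptions.

```python
def extend_with_context(features, first_stage_preds, half_window):
    """Append neighbouring first-stage predictions to each feature vector.

    features: list of per-frame feature vectors (audio and visual).
    first_stage_preds: one first-stage score per frame.
    half_window: number of neighbours taken on each side.
    """
    n = len(features)
    extended = []
    for t in range(n):
        context = []
        for offset in range(-half_window, half_window + 1):
            k = t + offset
            # Assumption: frames outside the sequence contribute 0.0.
            context.append(first_stage_preds[k] if 0 <= k < n else 0.0)
        extended.append(list(features[t]) + context)
    return extended


toy_ext = extend_with_context([[1.0], [2.0], [3.0]], [0.1, 0.9, 0.8], 1)
```

Because neighbouring predictions enter the second stage as features, isolated label flips inside a long run of identical labels become easier to correct.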

3. Social Network extraction and analysis

The Influence Model (IM) is a tool introduced for
the quantification of interacting processes using
a coupled Hidden Markov Model (HMM)
In the case of social interaction, the states of
the IM encode automatically extracted audio-visual
features

The model's coupling parameters represent the influences
Influence Model Architecture
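
As commonly formulated in the influence model literature, the next state of each chain is drawn from a weighted mixture of pairwise transition distributions, with the weights acting as the influences. The sketch below illustrates that mixture for two speakers with two states each. The symbol name alpha, the state coding, and all numeric values are hypothetical and are not taken from the slides.

```python
def next_state_distribution(i, prev_states, alpha, trans):
    """Mixture P(state of chain i at t | previous states of all chains).

    alpha[i][j]: influence of chain j on chain i (each row sums to 1).
    trans[j][i][s]: distribution over chain i's next state given that
                    chain j was in state s at the previous step.
    """
    n_states = len(trans[0][i][prev_states[0]])
    dist = [0.0] * n_states
    for j, s_prev in enumerate(prev_states):
        row = trans[j][i][s_prev]
        for s in range(n_states):
            dist[s] += alpha[i][j] * row[s]
    return dist


# Hypothetical two-speaker example; state 0 = speaking, 1 = silent.
stay = [[0.9, 0.1], [0.1, 0.9]]      # a chain's effect on itself
opposite = [[0.2, 0.8], [0.8, 0.2]]  # effect of the other speaker
trans = [[stay, opposite], [opposite, stay]]
alpha = [[0.7, 0.3], [0.4, 0.6]]
dist_speaker0 = next_state_distribution(0, [0, 1], alpha, trans)
```

In this toy setup, a larger alpha[i][j] means speaker j's previous state weighs more heavily on what speaker i does next.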

- The construction of the social network is
based on the influence values
A directed link between two nodes A and B
(designated by A → B) implies that A has
influence over B
The SNA is based on several centrality measures
- degree centrality (indegree and outdegree)
- Refers to the number of direct connections
with other persons
- closeness centrality
- Refers to the ease with which two persons
can communicate
- betweenness centrality
- Refers to the relevance of a person in acting
as a bridge between two sub-groups of the
network
- eigenvector centrality
- Refers to the importance of a person in
the network
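
Two of the measures above can be sketched on a small directed graph as follows. The edge list is hypothetical, and the closeness normalisation used here (reachable nodes divided by the sum of shortest-path lengths) is one of several conventions; the slides do not say which one was used.

```python
from collections import deque


def degree_centrality(nodes, edges):
    """Return (indegree, outdegree) dictionaries for a directed graph."""
    indeg = {v: 0 for v in nodes}
    outdeg = {v: 0 for v in nodes}
    for a, b in edges:  # edge (a, b) means "a has influence over b"
        outdeg[a] += 1
        indeg[b] += 1
    return indeg, outdeg


def closeness_centrality(nodes, edges):
    """Reachable-node count over summed shortest-path lengths (BFS)."""
    succ = {v: [] for v in nodes}
    for a, b in edges:
        succ[a].append(b)
    result = {}
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in succ[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        # Nodes that reach nobody get 0.0 by convention here.
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


# Hypothetical network: A influences B and C, B influences C.
nodes = ["A", "B", "C"]
edges = [("A", "B"), ("A", "C"), ("B", "C")]
indeg, outdeg = degree_centrality(nodes, edges)
closeness = closeness_centrality(nodes, edges)
```

In this toy network, A has the highest outdegree (it influences everyone) and C the highest indegree (it is influenced by everyone).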

4. Experimental results

We collected a subset of videos from the New York
Times Blogging Heads opinion blog
We used 17 videos from 15 persons
Videos depict two persons having a conversation
in front of their webcams on different topics
(politics, economy, etc.)
The conversations have an informal character, and
frequent interruptions can occur

Snapshot from a video

Audio features
- The audio stream has been analyzed using
sliding windows of 25 ms with an overlapping
factor of 50%.
- Each window is characterized by 13 features
(12 MFCC + E), complemented with Δ and Δ²
- The shortest length of a valid audio segment
was set to 2.5 ms
Video features
- 32 oriented features (corresponding to the
mouth region) have been extracted using the HOG
descriptor
- The length of the DTW sequences has been
set to 18 frames (which corresponds to 1.5 s)
Fusion process
- Stacked sequential learning was used to fuse
the audio-visual features
- AdaBoost was chosen as the classifier
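
The audio windowing described above can be sketched as follows, reading the overlapping factor as 50% of the window length. The 16 kHz sample rate is an assumption made only for the example, and the function name is hypothetical.

```python
def frame_starts(num_samples, sample_rate, window_s=0.025, overlap=0.5):
    """Start indices of sliding analysis windows, plus window length.

    window_s: window duration in seconds (25 ms in the slides).
    overlap: fraction of the window shared with the next one.
    """
    win = int(round(window_s * sample_rate))
    hop = int(round(win * (1.0 - overlap)))
    starts = list(range(0, num_samples - win + 1, hop))
    return starts, win


# One second of audio at an assumed 16 kHz sample rate.
starts, win = frame_starts(num_samples=16000, sample_rate=16000)
```

Under these assumptions each window covers 400 samples and consecutive windows start 200 samples apart.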

Visual and audio-visual speaker segmentation
accuracy

The extracted social network, showing
participants' labels and influence directions

Centrality measures table

5. Conclusions and future work

- We presented an integrated framework for the
automatic extraction and analysis of a social
network from implicit input (multimodal dyadic
interactions), based on the integration of
audio/visual features.
- In the future, we plan to extend the
current work to study the problem of social
interactions at a larger scale and in different
scenarios
- Starting from the premise that people's lives
are more structured than they might seem a priori,
we plan to study long-term interactions between
persons, with the aim of discovering underlying
behavioral patterns present in our day-to-day
existence