Title: Modelling and Analyzing Multimodal Dyadic Interactions Using Social Networks
- Sergio Escalera, Petia Radeva, Jordi Vitrià, Xavier Baró and Bogdan Raducanu
2 - Outline
- Introduction
- Audio Visual cues extraction and fusion
- Social Network extraction and analysis
- Experimental Results
- Conclusions and future work
3 - Introduction
- Social interactions play a very important role in people's daily lives.
- Present trend: analysis of human behavior based on electronic communications (SMS, e-mails, chat).
- New trend: analysis of human behavior based on nonverbal communication (social signals).
- Quantification of social signals represents a powerful cue to characterize human behavior: facial expression, hand and body gestures, focus of attention, voice prosody, etc.
4 - Social Network Analysis (SNA) has been developed as a tool to model social interactions in terms of a graph-based structure
- - Nodes represent the actors: persons, communities, institutions, etc.
- - Links represent a specific type of interdependency: friendship, familiarity, business transactions, etc.
- A common way to characterize the information encoded in a social network is to use several centrality measures.
5 - Our contribution
- In this work, we propose an integrated framework for the extraction and analysis of a social network from multimodal (A/V) dyadic interactions.
- Its advantage is that it is based on a totally non-intrusive technology.
- First, we perform speech segmentation through an audio/visual fusion scheme:
- - In the audio domain, speech is detected through clustering of audio features.
- - In the visual domain, speech is detected through differential-based feature extraction from the segmented mouth region.
- - The fusion scheme is based on stacked sequential learning.
- We used a set of videos belonging to the New York Times Blogging Heads opinion blog. The videos depict two persons talking about different subjects in front of a webcam.
6 - Second, to quantify the dyadic interaction, we used the Influence Model, whose states encode the previously integrated audio-visual data.
- Third, the social network is extracted based on the estimated influences, and its properties are characterized by several centrality measures.
- Block-diagram representation of our integrated framework
- The use of the term "influence" is inspired by the previous work of Choudhury: T. Choudhury, 2003. Sensing and Modelling Human Networks, Ph.D. Thesis, MIT Media Lab.
7 - 2. Audio Visual cues extraction and fusion
- Audio cue
- - Description
- - - First 12 MFCC coefficients
- - - Signal energy
- - - Temporal cepstral derivatives (Δ and Δ²)
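The temporal derivatives listed above can be sketched as follows. This is a minimal illustration that assumes the widely used regression formula for delta coefficients; the slides do not say which formula or window half-width was used, so `delta`, `with_deltas` and the default `n=2` are hypothetical choices.

```python
def delta(frames, n=2):
    """Regression-based delta of a sequence of feature vectors.

    frames: list of equal-length lists (one static vector per window).
    n: half-width of the regression window (an assumed value).
    """
    denom = 2 * sum(k * k for k in range(1, n + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        vec = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, n + 1):
                # Clamp indices so the edges reuse the first/last frame.
                ahead = frames[min(t + k, last)][d]
                behind = frames[max(t - k, 0)][d]
                acc += k * (ahead - behind)
            vec.append(acc / denom)
        out.append(vec)
    return out


def with_deltas(frames, n=2):
    """Append first- and second-order deltas to every static vector."""
    d1 = delta(frames, n)
    d2 = delta(d1, n)  # the second-order delta is the delta of the delta
    return [s + a + b for s, a, b in zip(frames, d1, d2)]
```

If the 12 MFCCs and the energy are stacked into 13 static values per window, `with_deltas` yields 39 values per window.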
8 - Audio cue
- Diarization process
- - Segmentation
- - - Coarse segmentation according to the Generalized Likelihood Ratio (GLR) between consecutive windows
- - Clustering
- - - Agglomerative hierarchical clustering with a BIC stopping scheme
- - - Segment boundaries are adjusted at the end
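The clustering step with its BIC stopping scheme can be illustrated with a small sketch. It assumes one-dimensional features modelled by a single Gaussian per cluster, which is a simplification of the multi-dimensional audio features; `delta_bic`, `agglomerate` and `penalty_weight` are hypothetical names, and this is not necessarily the exact criterion used by the authors.

```python
import math


def variance(xs):
    mean = sum(xs) / len(xs)
    # Floor the variance so the logarithm below stays finite.
    return max(sum((x - mean) ** 2 for x in xs) / len(xs), 1e-12)


def delta_bic(seg_a, seg_b, penalty_weight=1.0):
    """Delta-BIC between two separate Gaussians and one shared Gaussian.

    Positive values favour keeping the segments apart; negative values
    favour merging them.
    """
    n_a, n_b = len(seg_a), len(seg_b)
    n = n_a + n_b
    gain = 0.5 * (n * math.log(variance(seg_a + seg_b))
                  - n_a * math.log(variance(seg_a))
                  - n_b * math.log(variance(seg_b)))
    # A 1-D Gaussian has two free parameters (mean and variance).
    penalty = 0.5 * penalty_weight * 2 * math.log(n)
    return gain - penalty


def agglomerate(clusters, penalty_weight=1.0):
    """Merge the closest pair of clusters until no merge lowers the BIC."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = delta_bic(clusters[i], clusters[j], penalty_weight)
                if best is None or score < best[0]:
                    best = (score, i, j)
        if best[0] >= 0:  # BIC stopping criterion: no beneficial merge left
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

With four segments drawn alternately from two well-separated sources, `agglomerate` stops at two clusters, one per source.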
9 - Visual cue
- Description
- - Face segmentation based on the Viola-Jones detector
- - Mouth region segmentation
- - Vector of HOG descriptors for the mouth region
10 - Visual cue
- Classification
- - Non-speech class modelling
- - One-class Dynamic Time Warping (DTW), computed by dynamic programming
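The dynamic programming behind DTW can be sketched with the standard recurrence, in which each cell adds the local distance to the cheapest of its three predecessors. This is a minimal sketch of textbook DTW and may differ from the authors' one-class variant; `is_speech`, the reference sequences and the threshold are hypothetical.

```python
import math


def dtw_cost(seq_a, seq_b):
    """Accumulated DTW cost between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between the two aligned frames.
            local = math.sqrt(sum((x - y) ** 2
                                  for x, y in zip(seq_a[i - 1], seq_b[j - 1])))
            cost[i][j] = local + min(cost[i - 1][j],      # insertion
                                     cost[i][j - 1],      # deletion
                                     cost[i - 1][j - 1])  # match
    return cost[n][m]


def is_speech(window, non_speech_refs, threshold):
    """One-class decision: far from every non-speech reference means speech."""
    return min(dtw_cost(window, ref) for ref in non_speech_refs) > threshold
```

Only the non-speech class is modelled here, so a mouth-region sequence is labelled as speech when it cannot be aligned cheaply to any non-speech reference.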
11 - Fusion scheme
- Stacked sequential learning (suitable for problems characterized by long runs of identical labels)
- Fusion of audio-visual modalities
- Determining temporal relations of both feature sets for learning a two-stage classifier (based on AdaBoost)
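The core idea of stacked sequential learning is that the second-stage classifier sees each frame's features together with the first-stage predictions of its temporal neighbours. The sketch below shows only this feature-extension step; the classifiers themselves are not implemented, and `extend_with_context`, `half_width` and the zero padding at the sequence edges are assumptions made for illustration.

```python
def extend_with_context(features, first_stage_preds, half_width=2):
    """Build second-stage inputs for stacked sequential learning.

    features: one feature vector (list) per frame.
    first_stage_preds: one first-stage label or score per frame.
    half_width: number of neighbouring frames taken on each side.
    """
    last = len(first_stage_preds) - 1
    extended = []
    for t, vec in enumerate(features):
        context = []
        for k in range(t - half_width, t + half_width + 1):
            # Frames outside the sequence contribute a neutral 0.
            context.append(first_stage_preds[k] if 0 <= k <= last else 0)
        extended.append(list(vec) + context)
    return extended
```

Because neighbouring predictions are part of the input, the second stage can learn that labels tend to persist over long runs, which fits speech/non-speech segmentation.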
12 - 3. Social Network extraction and analysis
- The Influence Model (IM) is a tool introduced for the quantification of interacting processes using a coupled Hidden Markov Model (HMM)
- In the case of social interaction, the states of the IM encode automatically extracted audio-visual features
- The IM parameters represent the influences
- Influence Model Architecture
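As the Influence Model is commonly written, the distribution over one chain's next state is a weighted sum of pairwise conditional distributions, with the weights playing the role of the influences. The sketch below illustrates that combination; the index convention (`influence[i][j]` is the weight of chain j on chain i), the two-state example and all numeric values are assumptions, not values from this work.

```python
def next_state_distribution(i, prev_states, influence, transitions):
    """Distribution over chain i's next state given all previous states.

    influence[i][j]: weight of chain j on chain i (each row sums to 1).
    transitions[i][j][s]: distribution over chain i's next state when
        chain j was previously in state s.
    """
    n_states = len(transitions[i][0][0])
    dist = [0.0] * n_states
    for j, s in enumerate(prev_states):
        for k in range(n_states):
            dist[k] += influence[i][j] * transitions[i][j][s][k]
    return dist


# Hypothetical two-speaker example: state 0 = silent, state 1 = speaking.
influence = [[0.8, 0.2],
             [0.5, 0.5]]
transitions = [
    [[[0.9, 0.1], [0.2, 0.8]],   # speaker 0 driven by its own past
     [[0.5, 0.5], [0.7, 0.3]]],  # speaker 0 driven by speaker 1
    [[[0.4, 0.6], [0.8, 0.2]],   # speaker 1 driven by speaker 0
     [[0.9, 0.1], [0.3, 0.7]]],  # speaker 1 driven by its own past
]
example = next_state_distribution(0, [1, 0], influence, transitions)
```

Since every row of `influence` sums to 1 and every entry of `transitions` is a distribution, the result is itself a valid distribution.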
13 - The construction of the social network is based on the influence values
- A directed link between two nodes A and B (designated by A → B) implies that A has influence over B
- The SNA is based on several centrality measures:
- - Degree centrality (indegree and outdegree): the number of direct connections with other persons
- - Closeness centrality: the ease with which two persons can communicate
- - Betweenness centrality: the relevance of a person as a bridge between two sub-groups of the network
- - Eigenvector centrality: the importance of a person in the network
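A small sketch can make the link construction and two of the centrality measures concrete. It assumes that each conversation contributes one link pointing from the more influential speaker to the other, and it uses one common normalization of closeness based on outgoing shortest paths; both are assumptions, the function names are hypothetical, and betweenness and eigenvector centrality are left out for brevity.

```python
from collections import deque


def edges_from_influences(dyads):
    """dyads: (a, b, influence_a_on_b, influence_b_on_a) per conversation."""
    edges = []
    for a, b, a_on_b, b_on_a in dyads:
        # Assumed rule: the link points from the more influential speaker.
        if a_on_b > b_on_a:
            edges.append((a, b))
        elif b_on_a > a_on_b:
            edges.append((b, a))
    return edges


def degree_centrality(nodes, edges):
    """Return (indegree, outdegree) dictionaries for a directed graph."""
    indeg = {v: 0 for v in nodes}
    outdeg = {v: 0 for v in nodes}
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
    return indeg, outdeg


def closeness_centrality(nodes, edges):
    """Closeness from outgoing shortest paths, scaled by reachability."""
    succ = {v: [] for v in nodes}
    for src, dst in edges:
        succ[src].append(dst)
    scores = {}
    for start in nodes:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            u = queue.popleft()
            for v in succ[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        reached = len(dist) - 1
        if reached == 0:
            scores[start] = 0.0
        else:
            total = sum(dist.values())
            scores[start] = (reached / (len(nodes) - 1)) * (reached / total)
    return scores
```

For three hypothetical speakers where A dominates B and C, and B dominates C, A ends up with the highest outdegree and closeness, and C with the highest indegree.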
14 - 4. Experimental results
- We collected a subset of videos from the New York Times Blogging Heads opinion blog
- We used 17 videos featuring 15 persons
- Videos depict two persons having a conversation in front of their webcams on different topics (politics, economy, etc.)
- The conversations have an informal character, and frequent interruptions can sometimes occur
Snapshot from a video
15 - Audio features
- - The audio stream has been analyzed using sliding windows of 25 ms with an overlapping factor of 50%
- - Each window is characterized by 13 features (12 MFCC + E), complemented with Δ and Δ²
- - The shortest length of a valid audio segment was set to 2.5 ms
- Video features
- - 32 oriented features (corresponding to the mouth region) have been extracted using the HOG descriptor
- - The length of the DTW sequences has been set to 18 frames (which corresponds to 1.5 s)
- Fusion process
- - Stacked sequential learning was used to fuse the audio-visual features
- - AdaBoost was chosen as the classifier
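The windowing figures above imply a few quantities that are easy to check. The sketch below reads the overlapping factor as 50% of the window length, which is an assumption; `frame_starts_ms` and `video_fps` are hypothetical names.

```python
def frame_starts_ms(duration_ms, window_ms=25.0, overlap=0.5):
    """Start times (ms) of all full analysis windows in a signal."""
    hop = window_ms * (1.0 - overlap)  # 12.5 ms for 25 ms windows at 50%
    starts = []
    t = 0.0
    while t + window_ms <= duration_ms:
        starts.append(t)
        t += hop
    return starts


# 18 video frames spanning 1.5 s imply this frame rate.
video_fps = 18 / 1.5
```

Under these assumptions, one second of audio yields 79 full windows, and the DTW sequence length corresponds to a video rate of 12 frames per second.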
16 - Visual and audio-visual speaker segmentation accuracy
17 - The extracted social network, showing participants' labels and influence directions
18 - Centrality measures table
19 - 5. Conclusions and future work
- We presented an integrated framework for the automatic extraction and analysis of a social network from implicit input (multimodal dyadic interactions), based on the integration of audio/visual features.
- In the future, we plan to extend the current work to study social interactions at a larger scale and in different scenarios.
- Starting from the premise that people's lives are more structured than they might seem a priori, we plan to study long-term interactions between persons, with the aim of discovering underlying behavioral patterns present in our day-to-day existence.