1
Modelling and Analyzing Multimodal Dyadic
Interactions Using Social Networks
  • Sergio Escalera, Petia Radeva, Jordi Vitrià,
    Xavier Baró and Bogdan Raducanu

2
  • Outline
  • Introduction
  • Audio-visual cue extraction and fusion
  • Social Network extraction and analysis
  • Experimental Results
  • Conclusions and future work

3
  • Introduction
  • Social interactions play a very important role in
    people's daily lives.
  • Present trend: analysis of human behavior based
    on electronic communications (SMS, e-mails, chat).
  • New trend: analysis of human behavior based on
    nonverbal communication (social signals).
  • The quantification of social signals (facial
    expressions, hand and body gestures, focus of
    attention, voice prosody, etc.) is a powerful cue
    for characterizing human behavior.

4
  • Social Network Analysis (SNA) has been developed
    as a tool to model social interactions in terms
    of a graph-based structure:
  • - Nodes represent the actors (persons,
    communities, institutions, etc.)
  • - Links represent a specific type of
    interdependency (friendship, familiarity,
    business transactions, etc.); a minimal graph
    sketch is given below
  • A common way to characterize the information
    encoded in a social network is to use several
    centrality measures.
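A minimal sketch (an illustration, not part of the original slides) of this
graph-based structure in Python with the networkx library; the actor names and
relation types below are made up:

    import networkx as nx

    G = nx.DiGraph()
    G.add_nodes_from(["Alice", "Bob"])        # actors: persons, communities, institutions, ...
    G.add_edge("Alice", "Bob", relation="friendship")   # a specific type of interdependency
    G.add_edge("Bob", "Alice", relation="business")
    print(G.edges(data=True))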

5
  • Our contribution
  • In this work, we propose an integrated framework
    for the extraction and analysis of a social
    network from multimodal (A/V) dyadic interactions
  • Its main advantage is that it relies on a
    completely non-intrusive technology
  • First, we perform speech segmentation through an
    audio/visual fusion scheme:
  • - In the audio domain, speech is detected
    through clustering of audio features
  • - In the visual domain, speech is detected
    through differential-based feature extraction
    from the segmented mouth region
  • - The fusion scheme is based on stacked
    sequential learning

We used a set of videos belonging to The New
York Times Bloggingheads opinion blog. The
videos depict two persons talking about different
subjects in front of a webcam.
6
- Second, to quantify the dyadic interaction, we
used the Influence Model, whose states encode the
previously integrated audio-visual data
- Third, the Social Network is extracted from the
estimated influences, and its properties are
characterized using several centrality
measures.
  • Block-diagram representation of our integrated
    framework

The use of the term "influence" is inspired by the
previous work of Choudhury (T. Choudhury, Sensing
and Modelling Human Networks, Ph.D. Thesis, MIT
Media Lab, 2003).
7
2. Audio-visual cue extraction and fusion
  • Audio cue
  • Description
  • The first 12 MFCC coefficients
  • Signal energy
  • Temporal cepstral derivatives (Δ and Δ²)
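As an illustration only (the slides do not name a toolkit), this descriptor
could be computed with librosa; the file path and sampling rate below are
placeholders:

    import numpy as np
    import librosa

    y, sr = librosa.load("dialog.wav", sr=16000)        # placeholder input file
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # keep coefficients 1-12, drop c0
    feats = np.vstack([
        mfcc[1:13],                                  # first 12 MFCC coefficients
        librosa.feature.rms(y=y),                    # signal energy (RMS as a stand-in)
        librosa.feature.delta(mfcc[1:13]),           # Δ (first temporal derivative)
        librosa.feature.delta(mfcc[1:13], order=2),  # Δ² (second temporal derivative)
    ])
    print(feats.shape)                               # (features, frames)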

8
  • Audio cue
  • Diarization process
  • Segmentation
  • Coarse segmentation according to the Generalized
    Likelihood Ratio between consecutive windows
  • Clustering
  • Agglomerative hierarchical clustering with a BIC
    stopping criterion (a merge-test sketch follows)
  • Segment boundaries are adjusted at the end
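A sketch of the BIC merge test that typically drives the stopping decision;
the single-Gaussian modelling and the λ penalty weight are standard choices,
not taken from the slides:

    import numpy as np

    def delta_bic(X_i, X_j, lam=1.0):
        """ΔBIC between one Gaussian over the merged segments and one per segment."""
        d = X_i.shape[1]
        n_i, n_j, X = len(X_i), len(X_j), np.vstack([X_i, X_j])
        logdet = lambda A: np.linalg.slogdet(np.cov(A, rowvar=False))[1]
        penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n_i + n_j)
        return 0.5 * ((n_i + n_j) * logdet(X)
                      - n_i * logdet(X_i) - n_j * logdet(X_j)) - penalty

    # Merge the pair of clusters with the lowest ΔBIC while it is negative;
    # stop the agglomerative clustering otherwise.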

9
  • Visual cue
  • Description
  • Face segmentation based on the Viola-Jones detector
  • Mouth region segmentation
  • Vector of HOG descriptors for the mouth
    region
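An illustrative sketch of this step (OpenCV's Viola-Jones face detector plus a
HOG descriptor from scikit-image); taking the lower third of the face box as
the mouth region and the HOG cell layout are assumptions of the sketch:

    import cv2
    from skimage.feature import hog
    from skimage.transform import resize

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def mouth_hog(frame_bgr):
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            return None                                  # no face in this frame
        x, y, w, h = faces[0]
        mouth = gray[y + 2 * h // 3 : y + h, x : x + w]  # lower third of the face box
        mouth = resize(mouth, (32, 32))                  # fixed size before HOG
        return hog(mouth, orientations=8,                # 2x2 cells x 8 bins = 32 values
                   pixels_per_cell=(16, 16), cells_per_block=(1, 1))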

10
  • Visual cue
  • Classification
  • Non-speech class modelling
  • One-class Dynamic Time Warping based on a
    dynamic programming recurrence
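In its standard form (the slide's exact variant may differ), the recurrence is
the cumulative DTW cost between a mouth-feature sequence x and a non-speech
template y:

    D(i,j) = d(x_i, y_j) + \min\{ D(i-1,j),\; D(i,j-1),\; D(i-1,j-1) \}

where d(x_i, y_j) is the local distance between frames. In a one-class setting,
a sequence is typically labelled non-speech when the resulting (normalized)
cost stays below a threshold.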

11
  • Fusion scheme
  • Stacked sequential learning (suitable for
    problems characterized by long runs of identical
    labels)
  • Fusion of audio-visual modalities
  • Determining the temporal relations of both feature
    sets to learn a two-stage classifier (based
    on AdaBoost)
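A sketch of this two-stage scheme with scikit-learn's AdaBoost; the window
size w and the use of scikit-learn are assumptions (in practice the first-stage
predictions are usually produced by cross-validation rather than on the
training data directly):

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    def stack_window(pred, w):
        """For each frame, collect the predictions of the 2*w surrounding frames."""
        padded = np.pad(pred, w, mode="edge")
        return np.stack([padded[i:i + 2 * w + 1] for i in range(len(pred))])

    def fit_stacked(X, y, w=5):
        base = AdaBoostClassifier().fit(X, y)                 # stage 1: per-frame classifier
        X_ext = np.hstack([X, stack_window(base.predict(X), w)])
        meta = AdaBoostClassifier().fit(X_ext, y)             # stage 2: exploits label runs
        return base, meta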

12
3. Social Network extraction and analysis
  • The Influence Model (IM) is a tool introduced for
    the quantification of interacting processes,
    built on a coupled Hidden Markov Model (HMM)
  • In the case of social interaction, the states of
    the IM encode the automatically extracted
    audio-visual features

Influence Model architecture (the α parameters
represent the influences)
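In the standard formulation of the Influence Model (following Choudhury, 2003),
the next state of chain i (one participant) depends on the current states of
all C chains through a convex combination of pairwise transition probabilities,
whose mixture weights are read as influences:

    P(S_i^{t+1} \mid S_1^t, \dots, S_C^t) = \sum_{j=1}^{C} \alpha_{ij} \, P(S_i^{t+1} \mid S_j^t),
    \qquad \sum_{j} \alpha_{ij} = 1

where \alpha_{ij} quantifies the influence of participant j on participant i.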
13
  • The construction of the Social Network is
    based on the influence values
  • A directed link between two nodes A and B
    (designated by A → B) implies that A has
    influence over B
  • The SNA is based on several centrality measures
    (see the sketch after this list)
  • - degree centrality (indegree and outdegree)
  • - Refers to the number of direct connections
    with other persons
  • - closeness centrality
  • - Refers to how easily two persons can
    communicate
  • - betweenness centrality
  • - Refers to the relevance of a person as
    a bridge between two sub-groups of the
    network
  • - eigenvector centrality
  • - Refers to the importance of a person in
    the network
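An illustrative sketch of this step with networkx; the influence values and the
threshold below are made up, and only edges whose influence exceeds the
threshold are kept:

    import networkx as nx

    influence = {("A", "B"): 0.7, ("B", "A"): 0.3,
                 ("B", "C"): 0.6, ("C", "A"): 0.5}       # illustrative values only
    G = nx.DiGraph([(a, b) for (a, b), w in influence.items() if w > 0.4])

    print(nx.in_degree_centrality(G), nx.out_degree_centrality(G))  # degree (in/out)
    print(nx.closeness_centrality(G))                               # closeness
    print(nx.betweenness_centrality(G))                             # betweenness
    print(nx.eigenvector_centrality(G, max_iter=1000))              # eigenvector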

14
4. Experimental results
  • We collected a subset of videos from The New York
    Times Bloggingheads opinion blog
  • We used 17 videos featuring 15 persons
  • The videos depict two persons having a conversation
    in front of their webcams on different topics
    (politics, economy, etc.)
  • The conversations have an informal character and
    frequent interruptions can occur

Snapshot from a video
15
  • Audio features
  • - The audio stream has been analyzed using
    sliding windows of 25 ms with an overlap
    factor of 50%
  • - Each window is characterized by 13 features
    (12 MFCCs + energy), complemented with Δ and Δ²
  • - The shortest length of a valid audio segment
    was set to 2.5 ms
  • Video features
  • - 32 oriented features (corresponding to the
    mouth region) have been extracted using the HOG
    descriptor
  • - The length of the DTW sequences has been
    set to 18 frames (which corresponds to 1.5 s)
  • Fusion process
  • - Stacked sequential learning was used to fuse
    the audio-visual features
  • - AdaBoost was chosen as the classifier

16
Visual and audio-visual speaker segmentation
accuracy
17
The extracted social network showing
participant labels and influence directions
18
Centrality measures table
19
5. Conclusions and future work
  • We presented an integrated framework for the
    automatic extraction and analysis of a social
    network from implicit input (multimodal dyadic
    interactions), based on the integration of
    audio/visual features.
  • In the future, we plan to extend the current
    work to study the problem of social interactions
    at a larger scale and in different scenarios.
  • Starting from the premise that people's lives
    are more structured than they might seem a priori,
    we plan to study long-term interactions between
    persons, with the aim of discovering the underlying
    behavioral patterns present in our day-to-day
    existence.