VACE Multimodal Meeting Corpus
Lei Chen, Travis Rose, Fey Parrill, Xu Han, Jilin Tu, Zhongqiang Huang, Mary Harper, Francis Quek, David McNeill, Ronald Tuttle, and Thomas Huang

Transcript and Presenter's Notes



1
VACE Multimodal Meeting Corpus
Lei Chen, Travis Rose, Fey Parrill, Xu Han, Jilin
Tu, Zhongqiang Huang, Mary Harper, Francis Quek,
David McNeill, Ronald Tuttle, and Thomas Huang
  • We acknowledge support from:
  • NSF STIMULATE program, Grant No. IRI-9618887,
    "Gesture, Speech, and Gaze in Discourse
    Segmentation"
  • NSF KDI program, Grant No. BCS-9980054,
    "Cross-Modal Analysis of Signal and Sense:
    Multimedia Corpora and Tools for Gesture, Speech,
    and Gaze Research"
  • NSF ITR program, Grant No. IIS-0219875, "Beyond
    the Talking Head and Animated Icon: Behaviorally
    Situated Avatars for Tutoring"
  • ARDA VACE II program, "From Video to Information:
    Cross-Modal Analysis of Planning Meetings"
  • Francis Quek
  • Professor of Computer Science
  • Director, Center for Human-Computer Interaction
  • Virginia Tech

2
Corpus Rationale
  • A quest for meaning: embodied cognition and
    language production drive our research
  • Analysis of natural human-human meetings
  • A resource in support of research in:
  • Multimodal language analysis
  • Speech recognition and analysis
  • Vision-based communicative behavior analysis

3
Why Multimodal Language Analysis?
  • S1: you know like those fireworks?
  • S2: well if we're trying to drive 'em / out
    her<r>e we need to put 'em up her<r>e
  • S1: yeah well what I'm saying is we should
  • S2: in front
  • S1: we should do it we should make it a lin<n>e
    through the room<m>s / so that they explode
    like here then here then here then here
  (The slashes and inline angle-bracket tags are
  timing annotations from the original transcript.)

4
Multimodal Language Example
5
Embodied Communicative Behavior
  • Constructed dynamically at the moment of speaking
    ("thinking for speaking")
  • Dependent on cultural, personal, social, and
    cognitive differences
  • Speakers are often unaware of their own gestures
  • Reveals the contrastive foci of the language
    stream (Hajičová, Halliday, et al.)
  • Is co-expressive (co-temporal) with speech
  • Is multiply determined
  • Temporal synchrony is critical for analysis

6
In a Nutshell
  • Gesture/Speech Framework (McNeill 1992, 2000,
    2001; Quek et al. 1999-2003)

7
ARDA/VACE Program
  • ARDA is to the intelligence community what DARPA
    is to the military
  • Interest is in the exploitation of video data
    (Video Analysis and Content Exploitation)
  • A key VACE challenge: meeting analysis
  • Our key theme: multimodal communication analysis

8
From Video to Information: Cross-Modal Analysis
for Planning Meetings
9
Team
  • Multimodal Meeting Analysis: A Cross-Disciplinary
    Enterprise

10
Overarching Approach
  • Coordinated multidisciplinary research
  • Corpus assembly:
    - Data are transcribed and coded for relevant
      speech/language structure
    - War-gaming (planning) scenarios are captured to
      provide real planning behavior in a controlled
      experimental context (reducing many unknowns)
    - The meeting room is multiply instrumented with
      cross-calibrated video, synchronized audio/video,
      and motion tracking
    - All data components are time-aligned across the
      dataset (see the sketch after this list)
  • Multimodal video processing research:
    - Posture, head position/orientation, gesture
      tracking, hand-shape recognition, and multimodal
      integration
  • Research in tools for analysis, coding, and
    interpretation
  • Speech analysis research in support of
    multimodality
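
A minimal sketch (Python, with illustrative names and an assumed 30 fps video rate; the corpus' actual frame rates and schemas may differ) of how time-aligned streams can be cross-indexed, here mapping word-level timestamps onto video frame indices:

```python
FPS = 30.0  # assumed video frame rate (illustrative only)

def frame_of(t_s, fps=FPS):
    """Map a timestamp in seconds to a video frame index."""
    return int(round(t_s * fps))

def words_in_frame_range(words, first, last, fps=FPS):
    """Select (word, start_s, end_s) tuples that overlap video frames
    [first, last], so a coder can jump from video to the speech stream."""
    t0, t1 = first / fps, (last + 1) / fps
    return [w for w in words if w[1] < t1 and w[2] > t0]

# Hypothetical usage:
# words_in_frame_range([("fireworks", 12.31, 12.90)], 360, 420)
```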

11
Scenarios
  • Each scenario has five participants
  • Roles are tailored to available participant
    expertise
  • Five initial scenarios:
  • Delta II Rocket Launch
  • Foreign Material Exploitation
  • Intervention to Support a Democratic Movement
  • Humanitarian Assistance
  • Scholarship Selection

12
Scenarios (cont'd)
  • Planned scenarios (to be developed):
  • Lost Aircraft Crisis Response
  • Hostage Rescue
  • Downed Pilot Search and Rescue
  • Bomb Shelter Design

13
Scenario Development
  • Humanitarian Assistance walkthrough
  • Purpose: develop a plan for immediate military
    support to the Dec '04 Asian tsunami victims
  • Considerable open-source information is available
    on the Internet for scenario development
  • Roles:
  • Medical Officer
  • Task Force Commander
  • Intel Officer
  • Operations Officer
  • Weather Officer

Mission goals and priorities are provided for each
role.
14
Meulaboh, Indonesia
As intelligence officer, your role is to provide
intelligence support to OPERATION UNIFIED
ASSISTANCE. While the extent of damage is still
unknown, early reporting indicates that coastal
areas throughout South Asia have been affected.
Communications have been lost with entire towns.
Currently, the only means of determining the
magnitude of destruction is from overhead assets.
Data from the South Asia and Sri Lanka region
has already been received from civilian remote
sensing satellites. Although the US military
will be operating in the region on a strictly
humanitarian mission, the threat of hostile
action against US personnel by terrorist
factions opposed to the US still exists. As
intel officer, you are responsible for briefing
the nature of the terrorist threat in the region.
[Satellite imagery: Meulaboh before and after the
tsunami]
15
Corpus Assembly
16
Data Acquisition & Processing
  • Multimodal elicitation experiment
  • Video processing: 10-camera calibration, vector
    extraction, hand tracking, gaze tracking, head
    modeling, head tracking, body tracking
  • Motion capture & interpretation
  • Speech & psycholinguistic coding: speech
    transcription, psycholinguistic coding
  • Speech/audio processing: automatic transcript
    word/syllable alignment to audio, audio feature
    extraction
  • Time-aligned multimedia transcription
17
Meeting Room and Camera Configuration
[Figure: meeting room layout showing participant
positions A-H and cameras C1-C10]

Calibrated camera pairs:
  1: C9-C3    7: C7-C10
  2: C1-C3    8: C2-C5
  3: C9-C1    9: C2-C4
  4: C4-C8   10: C3-C5
  5: C4-C6   11: C7-C9
  6: C6-C8   12: C8-C10

Camera coverage (participant positions):
  C1: D,E,F   C6: B,A,H
  C2: H,G,F   C7: D,C,B
  C3: F,E     C8: B,A
  C4: H,A     C9: D,E
  C5: F,G,H   C10: B,C,D
18
[Video frame: the meeting room as seen from Cam1]
19
Global Pairwise Camera Calibration
  • 48 calibration dots for camera calibration
  • 18 Vicon markers for coordinate-system
    transformation
  • Transformation model: Y = RX + T (rotation R,
    translation T); a minimal estimation sketch
    follows
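
One standard way to estimate such a rigid transform from corresponding marker positions is the Kabsch/Procrustes method; the sketch below (Python/NumPy, with illustrative array names, not the project's calibration code) fits R and T from paired 3D points such as the 18 Vicon markers:

```python
import numpy as np

def rigid_transform(X, Y):
    """Least-squares R, T such that Y ~ R @ X + T (Kabsch algorithm).
    X, Y: (N, 3) arrays of corresponding 3D points, e.g. the same
    markers measured in camera-pair and Vicon coordinates."""
    cx, cy = X.mean(axis=0), Y.mean(axis=0)    # centroids
    H = (X - cx).T @ (Y - cy)                  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    T = cy - R @ cx
    return R, T

# Hypothetical usage with 18 corresponding marker positions:
# R, T = rigid_transform(camera_pts, vicon_pts)
```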

20
Error Distributions in Meeting Room Area
(camera pairs 5 and 12)
  • X direction: max 0.5886 mm, min 0.4 mm,
    mean 0.4755 mm
  • Y direction: max 0.6925 mm, min 0.3077 mm,
    mean 0.4529 mm
  • Z direction: max 0.5064 mm, min 0.3804 mm,
    mean 0.4317 mm
[Histograms: error distributions in the X, Y, and
Z directions]
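
Given reconstructed marker positions and their Vicon ground truth, per-axis statistics like those above can be computed directly; a small sketch (illustrative names, positions assumed to be in mm):

```python
import numpy as np

def per_axis_error_stats(reconstructed, reference):
    """Absolute per-axis error stats, as (max, min, mean) per axis.
    reconstructed, reference: (N, 3) point arrays in mm."""
    err = np.abs(np.asarray(reconstructed) - np.asarray(reference))
    return {axis: (err[:, i].max(), err[:, i].min(), err[:, i].mean())
            for i, axis in enumerate("XYZ")}
```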
21
VICON Motion Capture
  • Motion-capture technology:
  • Near-IR cameras
  • Retro-reflective markers
  • Datastation / PC workstation
  • Vicon modes of operation:
  • Individual points (as seen in calibration)
  • Kinematic models
  • Individual objects

22
VICON Motion Capture
  • Learning about MoCap:
  • 11/03: initial installation
  • 6/04: pilot scenario, using kinematic models
  • 10/04: follow-up training using object models
  • 11/04: rehearsed using Vicon with object models
  • 1/05: data captured for FME scenario
  • Export position and orientation information for
    each participant's head, hands, and body (a
    head-frame sketch appears after this list)
  • Post-processing of motion-capture data takes
    about 1 hour per minute for a 5-participant
    meeting
  • Incorporating MoCap into the workflow:
  • Labeling of point clusters is labor-intensive
  • 3 work-study students @ 20 hours/wk process about
    60 minutes (1 dataset) per week
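
As a rough illustration of the exported head data, the sketch below derives a head position and orientation frame from three labeled head markers (the marker names and layout are assumptions; any rigid cluster of three or more markers supports the same construction):

```python
import numpy as np

def head_frame(left_ear, right_ear, forehead):
    """Build an orthonormal head frame from three labeled markers.
    Returns (origin, R); R's columns are the head's x (right), y
    (facing), and z (up) axes expressed in room coordinates."""
    left, right, front = map(np.asarray, (left_ear, right_ear, forehead))
    origin = (left + right) / 2.0
    x = right - left
    x /= np.linalg.norm(x)                     # ear-to-ear axis
    f = front - origin                         # roughly toward the face
    z = np.cross(x, f)
    z /= np.linalg.norm(z)                     # head-up axis
    y = np.cross(z, x)                         # facing direction
    return origin, np.column_stack([x, y, z])
```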

23
Speech Processing Tasks
  • Formulate an audio workflow to support the
    efficient and effective construction of a large,
    high-quality multimodal corpus
  • Implement support tools to achieve this goal
  • Package time-aligned word transcriptions into
    appropriate data formats that can be efficiently
    shared and used

24
Audio Processing
The audio workflow (a packaging sketch for the
final output follows):
  • Audio recording and meeting metadata annotation
  • Audio segmentation
  • Manual transcription of the segments
  • OOV word resolution
  • Forced alignment of the transcript to the audio
  • Corpus integration
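
For the "package time-aligned word transcriptions" task, a minimal sketch of one shareable output format (plain CSV; the corpus' actual schema is not specified here and may differ):

```python
import csv

def write_word_alignments(words, path):
    """Write (speaker, word, start_s, end_s) tuples, e.g. the output
    of forced alignment, to a simple CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["speaker", "word", "start_s", "end_s"])
        writer.writerows(words)

# Hypothetical usage:
# write_word_alignments([("S1", "fireworks", 12.31, 12.90)], "words.csv")
```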
25
VACE Metadata Approach
26
Data Collection Status
  • Pilot, June '04:
  • Low audio volume; a sound mixer was purchased
  • Video frame drop-out; high-grade DV tapes were
    purchased
  • (AFIT 02-07-05 Democratic movement assistance)
  • (AFIT 02-07-05 Democratic movement assistance,
    session 2)
  • Audio clipping in the close-talking mics; the
    data may be salvageable using the desktop mics
  • AFIT 02-24-05 Humanitarian Assistance (Tsunami)
  • AFIT 03-04-05 Humanitarian Assistance (Tsunami)
  • AFIT 03-18-05 Scholarship Selection
  • AFIT 04-08-05 Humanitarian Assistance (Tsunami)
  • AFIT 04-08-05 Card Game
  • AFIT 04-25-05 Problem Solving Task (cause of
    deterioration of the Lincoln Memorial)
  • AFIT 06-??-05 Problem Solving Task

27
Some Multimodal Meeting Room Results
28
The F2-F1 "Lance Armstrong" Episode
[Meeting dynamics plot: F1 vs. F2, NIST
Microcorpus, July 29, 2003]
29
Gaze: NIST July 29, 2003 Data
Gaze direction tracks social patterns
(interactive gaze) and engagement of objects
(instrumental gaze), which may be a form of
pointing as well as perception.
[Plots: interactive gaze occurrences and a 5-min.
sample of interactive and instrumental gaze, by
gaze source and gaze target]
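
A toy sketch (not the project's coding tool) of the interactive/instrumental split, assuming gaze events are coded as (source, target, start_s, end_s) tuples and the participant IDs are known:

```python
PARTICIPANTS = {"A", "B", "C", "D", "E"}  # hypothetical participant IDs

def classify_gaze(events):
    """Split gaze events into interactive (target is a person) and
    instrumental (target is an object such as a map or model)."""
    interactive = [e for e in events if e[1] in PARTICIPANTS]
    instrumental = [e for e in events if e[1] not in PARTICIPANTS]
    return interactive, instrumental
```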
30
Gaze: AFIT Data
[Gaze link chart: gazer vs. gazee]
31
F-formation analysis
  • "An F-formation arises when two or more people
    cooperate together to maintain a space between
    them to which they all have direct and exclusive
    equal access." (A. Kendon, 1977)
  • An F-formation can be discovered by tracking gaze
    direction in a social group.
  • It is not only about shared space:
  • It reveals common ground and has an associated
    meaning.
  • The cooperative property is crucial.
  • It is useful for detecting units of thematic
    content being jointly developed in a conversation
    (a toy detection sketch follows).
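
The sketch below is a deliberately crude stand-in for F-formation discovery (the project's actual analysis is richer): it flags time windows in which several participants' gaze converges on a common target, using the same hypothetical (source, target, start_s, end_s) event coding as above:

```python
def f_formation_candidates(gaze_events, window_s=2.0, min_members=2):
    """Flag windows where >= min_members distinct participants gaze at
    the same target: a crude proxy for a shared interaction space."""
    if not gaze_events:
        return []
    end = max(e[3] for e in gaze_events)
    candidates, t = [], 0.0
    while t < end:
        shared = {}
        for src, tgt, a, b in gaze_events:
            if a < t + window_s and b > t:     # event overlaps window
                shared.setdefault(tgt, set()).add(src)
        for tgt, members in shared.items():
            if len(members) >= min_members:
                candidates.append((t, t + window_s, tgt, sorted(members)))
        t += window_s
    return candidates
```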

32
NIST F-Formation Coding (76.11s-92.27s)
33
NIST F-Formation Coding (92.27s-108.97s)
34
Summary
  • Corpus collection is based on sound scientific
    foundations
  • Data include audio, video, motion capture, speech
    transcription, and manual codings
  • A suite of tools for visualizing and coding the
    co-temporal data has been developed
  • Research results demonstrate multimodal discourse
    segmentation and meeting-dynamics analysis