Title: VACE Multimodal Meeting Corpus
Lei Chen, Travis Rose, Fey Parrill, Xu Han, Jilin Tu, Zhongqiang Huang, Mary Harper, Francis Quek, David McNeill, Ronald Tuttle, and Thomas Huang
- We acknowledge support from:
- NSF STIMULATE program, Grant No. IRI-9618887, "Gesture, Speech, and Gaze in Discourse Segmentation"
- NSF KDI program, Grant No. BCS-9980054, "Cross-Modal Analysis of Signal and Sense: Multimedia Corpora and Tools for Gesture, Speech, and Gaze Research"
- NSF ITR program, Grant No. IIS-0219875, "Beyond the Talking Head and Animated Icon: Behaviorally Situated Avatars for Tutoring"
- ARDA VACE II program, "From Video to Information: Cross-Modal Analysis of Planning Meetings"
- Francis Quek
- Professor of Computer Science
- Director, Center for Human Computer Interaction
- Virginia Tech
2 Corpus Rationale
- A quest for meaning: embodied cognition and language production drive our research
- Analysis of natural human-human meetings
- Resource in support of research in:
- Multimodal language analysis
- Speech recognition and analysis
- Vision-based communicative behavior analysis
3 Why Multimodal Language Analysis?
- S1: you know like those fireworks?
- S2: well if we're trying to drive 'em / out here we need to put 'em up here
- S1: yeah well what I'm saying is we should
- S2: in front
- S1: we should do it we should make it a line through the rooms / so that they explode like here then here then here then here
4 Multimodal Language Example
5 Embodied Communicative Behavior
- Constructed dynamically at the moment of speaking (thinking for speaking)
- Dependent on cultural, personal, social, and cognitive differences
- Speaker is often unaware of gestures
- Reveals the contrastive foci of the language stream (Hajičová, Halliday et al.)
- Is co-expressive (co-temporal) with speech
- Is multiply determined
- Temporal synchrony is critical for analysis
6 In a Nutshell
- Gesture/Speech Framework (McNeill 1992, 2000, 2001; Quek et al. 1999-2003)
7 ARDA/VACE Program
- ARDA is to the intelligence community what DARPA is to the military
- Interest is in the exploitation of video data (Video Analysis and Content Exploitation)
- A key VACE challenge: Meeting Analysis
- Our key theme: Multimodal communication analysis
8 From Video to Information: Cross-Modal Analysis for Planning Meetings
9 Team
- Multimodal Meeting Analysis: A Cross-Disciplinary Enterprise
10 Overarching Approach
- Coordinated multidisciplinary research
- Corpus assembly
- Data is transcribed and coded for relevant speech/language structure
- War-gaming (planning) scenarios are captured to provide real planning behavior in a controlled experimental context (reducing many unknowns)
- The meeting room is multiply instrumented with cross-calibrated video, synchronized audio/video, and motion tracking
- All data components are time-aligned across the dataset
- Multimodal video processing research
- Research on posture, head position/orientation, gesture tracking, hand-shape recognition, and multimodal integration
- Research on tools for analysis, coding, and interpretation
- Speech analysis research in support of multimodality
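Because every data component is time-aligned across the dataset, downstream analysis needs all modalities resampled onto a common clock. A minimal sketch of that step; the function name, the (timestamp, value) stream representation, and the nearest-sample policy are assumptions for illustration, not the project's actual tooling:

```python
import bisect

def align_streams(streams, t0, t1, rate):
    """Resample several {name: [(timestamp, value), ...]} streams onto
    one common clock from t0 to t1 at `rate` Hz, taking the sample
    nearest in time to each clock tick."""
    n = int((t1 - t0) * rate) + 1
    ticks = [t0 + i / rate for i in range(n)]
    aligned = {}
    for name, samples in streams.items():
        times = [t for t, _ in samples]
        row = []
        for t in ticks:
            i = bisect.bisect_left(times, t)
            # pick whichever neighbor is closer in time
            if i == 0:
                j = 0
            elif i == len(times) or t - times[i - 1] <= times[i] - t:
                j = i - 1
            else:
                j = i
            row.append(samples[j][1])
        aligned[name] = row
    return ticks, aligned
```

With audio samples at 0.0 s and 1.0 s and motion-capture samples at 0.2 s and 0.9 s, a 1 Hz common clock picks the nearest sample from each stream at each tick.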
11 Scenarios
- Each Scenario to have Five Participants
- Roles Tailored to Available Participant Expertise
- Five Initial Scenarios
- Delta II Rocket Launch
- Foreign Material Exploitation
- Intervention to Support Democratic Movement
- Humanitarian Assistance
- Scholarship Selection
12 Scenarios (cont'd)
- Planned Scenarios (to be Developed)
- Lost Aircraft Crisis Response
- Hostage Rescue
- Downed Pilot Search Rescue
- Bomb Shelter Design
13 Scenario Development
- Humanitarian Assistance Walkthrough
- Purpose: Develop Plan for Immediate Military Support to Dec 04 Asian Tsunami Victims
- Considerable Open-Source Information from the Internet for Scenario Development
- Roles
- Medical Officer
- Task Force Commander
- Intel Officer
- Operations Officer
- Weather Officer
Mission, Goals, and Priorities Provided for Each Role
14 Meulaboh, Indonesia
As intelligence officer, your role is to provide
intelligence support to OPERATION UNIFIED
ASSISTANCE. While the extent of damage is still
unknown, early reporting indicates that coastal
areas throughout South Asia have been affected.
Communications have been lost with entire towns.
Currently, the only means of determining the
magnitude of destruction is from overhead assets.
Data from the South Asia and Sri Lanka region
has already been received from civilian remote
sensing satellites. Although the US military
will be operating in the region on a strictly
humanitarian mission, the threat of hostile
action against US personnel by terrorist
factions opposed to the US remains. As intel
officer, you are responsible for briefing the
nature of the terrorist threat in the region.
[Satellite imagery of Meulaboh: before and after the tsunami]
15 Corpus Assembly
16 Data Acquisition and Processing
- Multi-modal Elicitation Experiment
- Video Processing: 10-camera calibration, vector extraction, hand tracking, gaze tracking, head modeling, head tracking, body tracking
- Motion Capture Interpretation
- Speech/Audio Processing: automatic transcript, word/syllable alignment to audio, audio feature extraction
- Speech Transcription and Psycholinguistic Coding
- Time-Aligned Multimedia Transcription
17 Meeting Room and Camera Configuration
[Figure: meeting-room layout with participant positions A-H and cameras C1-C10]
Stereo camera pairs:
1: C9-C3   2: C1-C3   3: C9-C1   4: C4-C8    5: C4-C6   6: C6-C8
7: C7-C10  8: C2-C5   9: C2-C4   10: C3-C5   11: C7-C9  12: C8-C10
Camera coverage (participant positions seen by each camera):
C1: D,E,F   C2: H,G,F   C3: F,E   C4: H,A   C5: F,G,H
C6: B,A,H   C7: D,C,B   C8: B,A   C9: D,E   C10: B,C,D
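Once a camera pair is cross-calibrated, a point seen in both views can be reconstructed in 3-D by triangulation. A minimal sketch using linear (DLT) triangulation; the projection matrices and function name here are hypothetical, not the project's actual calibration code:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3-D point from a calibrated
    camera pair.  P1, P2 are 3x4 projection matrices; x1, x2 are the
    (u, v) observations of the same point in each view."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # the homogeneous 3-D point is the null vector of A
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize
```

For noisy real observations the SVD solution minimizes the algebraic error; a bundle-adjustment refinement would typically follow.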
18 Cam1
19 Global Pairwise Camera Calibration
- 48 calibration dots for calibration
- 18 Vicon markers for coordinate-system transformation: Y = RX + T
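The transformation Y = RX + T from the calibration frame to the Vicon coordinate system can be estimated from the corresponding marker positions by the Kabsch/Procrustes method; a sketch under that assumption (the slides do not specify the algorithm actually used):

```python
import numpy as np

def rigid_transform(X, Y):
    """Least-squares rotation R and translation T with Y ~ R X + T,
    estimated from matched marker positions (Kabsch/Procrustes).
    X, Y: (N, 3) arrays of corresponding points in the two frames."""
    cx, cy = X.mean(axis=0), Y.mean(axis=0)
    H = (X - cx).T @ (Y - cy)           # cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against a reflection
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    T = cy - R @ cx
    return R, T
```

With exact correspondences this recovers R and T exactly; with measurement noise it gives the least-squares rigid fit.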
20 Error Distributions in Meeting Room Area (camera pair 512)
[Figure: error-distribution histograms for the X, Y, and Z directions]
Direction   Minimum     Maximum     Mean
X           0.4000 mm   0.5886 mm   0.4755 mm
Y           0.3077 mm   0.6925 mm   0.4529 mm
Z           0.3804 mm   0.5064 mm   0.4317 mm
21 VICON Motion Capture
- Motion capture technology
- Near-IR cameras
- Retro-reflective markers
- Datastation PC workstation
- Vicon modes of operation
- Individual points (as seen in calibration)
- Kinematic models
- Individual objects
22 VICON Motion Capture
- Learning about MoCap
- 11/03: Initial installation
- 6/04: Pilot scenario, using kinematic models
- 10/04: Follow-up training using object models
- 11/04: Rehearsed using Vicon with object models
- 1/05: Data captured for FME scenario
- Export position information for each participant's head, hand, and body position/orientation
- Post-processing of motion capture data: about 1 hour per minute for a 5-participant meeting
- Incorporating MoCap into workflow
- Labeling of point clusters is labor-intensive
- 3 work-study students @ 20 hours/wk process 60 minutes (1 dataset) per week
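The labor-intensive point-cluster labeling above is commonly bootstrapped by carrying labels from one frame to the next before manual cleanup. A toy illustration; the function name, the nearest-neighbour policy, and the 30 mm jump threshold are assumptions, not the Vicon workflow itself:

```python
import math

def propagate_labels(labeled_prev, unlabeled, max_jump=30.0):
    """Carry marker labels from the previous frame to the current one
    by greedy nearest-neighbour matching.
    labeled_prev: {label: (x, y, z)}; unlabeled: list of (x, y, z).
    Markers that move more than max_jump mm are left unmatched."""
    out = {}
    taken = set()
    for label, p in labeled_prev.items():
        best, best_d = None, max_jump
        for i, q in enumerate(unlabeled):
            if i in taken:
                continue
            d = math.dist(p, q)
            if d < best_d:
                best, best_d = i, d
        if best is not None:
            out[label] = unlabeled[best]
            taken.add(best)
    return out
```

Unmatched points (occlusions, ghost reflections) are exactly where the human labeling effort mentioned in the slides goes.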
23 Speech Processing Tasks
- Formulate an audio workflow to support the efficient and effective construction of a large, high-quality multimodal corpus
- Implement support tools to achieve this goal
- Package time-aligned word transcriptions into appropriate data formats that can be efficiently shared and used
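As one way of packaging time-aligned words, each forced-alignment record can be serialized one row per word. The CSV schema below is hypothetical, for illustration only; the corpus's real interchange format is not shown in the slides:

```python
import csv
import io

def package_alignment(words, speaker, out):
    """Write forced-alignment output, one row per word:
    speaker, start (s), end (s), word.  Hypothetical schema."""
    w = csv.writer(out)
    w.writerow(["speaker", "start", "end", "word"])
    for start, end, word in words:
        w.writerow([speaker, f"{start:.3f}", f"{end:.3f}", word])

# example: two aligned words from a hypothetical S1 utterance
buf = io.StringIO()
package_alignment([(12.310, 12.550, "we"), (12.550, 12.940, "should")],
                  "S1", buf)
```

Keeping start/end times in seconds per word is what allows the transcript to be time-aligned against the video and motion-capture streams.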
24 Audio Processing
Pipeline: Audio Recording and Meeting Metadata Annotation -> Audio Segmentation -> Manual Transcription -> OOV Word Resolution -> Forced Alignment -> Corpus Integration
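The audio-segmentation stage of a pipeline like this is often approximated by finding runs of high-energy frames. A toy stand-in (the threshold, frame conventions, and minimum run length are assumptions, not the project's segmenter):

```python
def segment_by_energy(frames, threshold, min_len=3):
    """Split a frame-level energy track into speech segments: runs of
    at least min_len consecutive frames above `threshold`.
    Returns (start_frame, end_frame) pairs, end exclusive."""
    segments, start = [], None
    for i, e in enumerate(frames):
        if e > threshold and start is None:
            start = i                      # run begins
        elif e <= threshold and start is not None:
            if i - start >= min_len:       # keep only long-enough runs
                segments.append((start, i))
            start = None
    if start is not None and len(frames) - start >= min_len:
        segments.append((start, len(frames)))
    return segments
```

Segments too short to be speech (here, under 3 frames) are dropped, which suppresses clicks and breath noise.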
25 VACE Metadata Approach
26 Data Collection Status
- Pilot, June 04
- Low audio volume: sound mixer purchased
- Video frame drop-out: purchased high-grade DV tapes
- AFIT 02-07-05: Democratic movement assistance
- AFIT 02-07-05: Democratic movement assistance, session 2
- Audio clipping in close-in mics; may be able to salvage data using the desktop mics
- AFIT 02-24-05: Humanitarian Assistance (Tsunami)
- AFIT 03-04-05: Humanitarian Assistance (Tsunami)
- AFIT 03-18-05: Scholarship selection
- AFIT 04-08-05: Humanitarian Assistance (Tsunami)
- AFIT 04-08-05: Card game
- AFIT 04-25-05: Problem-solving task (cause of deterioration of the Lincoln Memorial)
- AFIT 06-??-05: Problem-solving task
27 Some Multimodal Meeting Room Results
28 Lance Armstrong Episode: F1 vs F2
NIST Microcorpus, July 29, 2003: Meeting Dynamics
29 Gaze - NIST July 29, 2003 Data
Gaze direction tracks social patterns (interactive gaze) and engagement of objects (instrumental gaze), which may be a form of pointing as well as perception.
[Figure: gaze source/target timelines showing instrumental gaze and interactive gaze occurrences over a 5 min. sample]
30 Gaze - AFIT Data
[Figure: gazer vs. gazee timeline]
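Gazer/gazee plots like these require deciding, per frame, which participant the tracked gaze vector points at. A 2-D toy sketch; the cone half-angle and all names are assumptions, not the project's gaze coder:

```python
import math

def gaze_target(gazer_pos, gaze_dir, others, cone_deg=15.0):
    """Pick which participant (if any) a gazer is looking at: the one
    whose direction from the gazer lies closest to the tracked gaze
    vector, within a cone_deg half-angle cone.  Positions are (x, y);
    gaze_dir is a unit vector.  Returns a name or None."""
    best, best_ang = None, cone_deg
    for name, pos in others.items():
        dx, dy = pos[0] - gazer_pos[0], pos[1] - gazer_pos[1]
        norm = math.hypot(dx, dy)
        if norm == 0:
            continue
        cosang = (dx * gaze_dir[0] + dy * gaze_dir[1]) / norm
        ang = math.degrees(math.acos(max(-1.0, min(1.0, cosang))))
        if ang < best_ang:
            best, best_ang = name, ang
    return best
```

Frames where no participant falls inside the cone would be candidates for instrumental gaze (engagement with objects) rather than interactive gaze.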
31 F-formation Analysis
- "An F-formation arises when two or more people cooperate together to maintain a space between them to which they all have direct and exclusive equal access." (A. Kendon, 1977)
- An F-formation is discovered from tracking gaze direction in a social group.
- It is not only about shared space.
- It reveals common ground and has an associated meaning.
- The cooperative property is crucial.
- It is useful for detecting units of thematic content being jointly developed in a conversation.
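One crude computational reading of this idea: link participants whose orientations converge on each other, then close the links into groups. Purely illustrative, a stand-in for the F-formation coding shown on the following slides; the cone threshold and grouping rule are assumptions:

```python
import math

def mutually_oriented(positions, facings, cone_deg=30.0):
    """Group participants whose orientations converge: i and j are
    linked when each faces the other to within cone_deg, and links
    are closed transitively into groups (union-find).
    positions, facings: {name: (x, y)}; facings are unit vectors."""
    def faces(i, j):
        dx = positions[j][0] - positions[i][0]
        dy = positions[j][1] - positions[i][1]
        norm = math.hypot(dx, dy)
        fx, fy = facings[i]
        cosang = (dx * fx + dy * fy) / norm
        return math.degrees(math.acos(max(-1.0, min(1.0, cosang)))) <= cone_deg

    names = list(positions)
    parent = {n: n for n in names}
    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n
    for i in names:
        for j in names:
            if i < j and faces(i, j) and faces(j, i):
                parent[find(j)] = find(i)   # merge the two groups
    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).add(n)
    return [g for g in groups.values() if len(g) > 1]
```

The mutual-facing test captures the "cooperative property" the slide stresses: one person merely looking at another does not create an F-formation.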
32 NIST F-Formation Coding (76.11s-92.27s)
33 NIST F-Formation Coding (92.27s-108.97s)
34 Summary
- Corpus collection based on sound scientific foundations
- Data includes audio, video, motion capture, speech transcription, and manual codings
- A suite of tools for visualizing and coding the co-temporal data has been developed
- Research results demonstrate multimodal discourse segmentation and meeting dynamics analysis