Title: FaceTrack: Tracking and summarizing faces from compressed video
Slide 1: FaceTrack: Tracking and summarizing faces from compressed video
- Hualu Wang, Harold S. Stone, Shih-Fu Chang
- Dept. of Electrical Engineering, Columbia University
- NEC Research Institute

Presentation by Andy Rova, School of Computing Science, Simon Fraser University
Slide 2: Introduction
- FaceTrack
  - A system for both tracking and summarizing faces in compressed video data
- Tracking
  - Detect faces and trace them through time within video shots
- Summarizing
  - Cluster the faces across video shots and associate them with different people
- Compressed video
  - Avoids the costly overhead of decoding prior to face detection
Slide 3: System Overview
- The FaceTrack system's goals are related to ideas discussed in previous presentations
- A face-based video summary can help users decide whether they want to download the whole video
- The summary provides good visual indexing information for a database search engine
Slide 4: Problem definition
- The goal of the FaceTrack system is to take an input video sequence, generate a list of the prominent faces that appear in the video, and determine the time periods during which each face appears
Slide 5: General Approach
- Track faces within shots
- Once tracking is done, group faces across video shots into the faces of different people
- Output a list of faces for each sequence
  - For each face, list the shots where it appears, and when
- Face recognition is not performed
  - Very difficult in unconstrained video due to the broad range of face sizes, numbers, orientations, and lighting conditions
Slide 6: General Approach
- Work in the compressed domain as much as possible
  - MPEG-1 and MPEG-2 video, used in applications such as digital TV and DVD
  - Macroblocks and motion vectors can be used directly in tracking
  - Greater computational speed compared to full decoding
- Select frames can always be decoded down to the pixel level for further analysis
  - For example, when grouping faces across shots
Slide 7: MPEG Review
- Three types of frames
  - Intra-coded frames (I-frames)
  - Forward-predicted frames (P-frames)
  - Bidirectionally predicted frames (B-frames)
- Macroblocks are coding units that combine pixel information via the DCT
  - Luminance and chrominance are coded separately
- P-frames and B-frames undergo motion compensation
  - Motion vectors are found and their differences are encoded
Slide 8: System Diagram
Slide 9: Face Tracking
- Challenges
  - Locations of detected faces may not be accurate, since the face detection algorithm works on 16x16 macroblocks
  - False alarms and misses
  - Multiple faces cause ambiguities when they move close to each other
  - The motion approximated by the MPEG motion vectors may not be accurate
- A tracking framework that can handle these issues in the compressed domain is needed
Slide 10: The Kalman Filter
- A linear, discrete-time dynamic system is defined by a pair of difference equations: a state-transition equation and a measurement equation
- We only have access to a sequence of noisy measurements
- Given this noisy observation data, the problem is to find the optimal estimate of the unknown system state variables
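In standard state-space notation (a generic form; the paper's exact symbols may differ), the two difference equations are:

```latex
\mathbf{x}_{k+1} = A\,\mathbf{x}_k + \mathbf{w}_k, \qquad \mathbf{w}_k \sim \mathcal{N}(0, Q)
```
```latex
\mathbf{z}_k = H\,\mathbf{x}_k + \mathbf{v}_k, \qquad \mathbf{v}_k \sim \mathcal{N}(0, R)
```

Here x_k is the hidden system state, z_k the noisy measurement, A the state-transition matrix, H the observation matrix, and w_k, v_k white process and measurement noise.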
Slide 11: The Kalman Filter
- The filter is an iterative algorithm that keeps taking in new observations
- The new states are successively estimated
- The error between the observation and its prediction is called the innovation
- The innovation is amplified by a gain matrix and used as a correction for the state prediction
- The corrected prediction is the new state estimate
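The predict/correct cycle described above can be sketched as follows (a minimal generic Kalman step, not the paper's implementation):

```python
import numpy as np

def kalman_step(x, P, z, A, H, Q, R):
    """One predict/correct cycle of a linear Kalman filter.

    x, P : prior state estimate and its covariance
    z    : new (noisy) observation
    A, H : state-transition and observation matrices
    Q, R : process and measurement noise covariances
    """
    # Predict: propagate the state and covariance through the motion model
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Innovation: error between the observation and the predicted observation
    y = z - H @ x_pred
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    # Correct: the gain-amplified innovation updates the state prediction
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

With a constant-velocity state [position, velocity], feeding in each new detected position successively refines both estimates.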
Slide 12: The Kalman Filter
- In the FaceTrack system, the state vector of the Kalman filter holds the kinematic information of the face
  - Position, velocity (and sometimes acceleration)
- The observation vector is the position of the detected face
  - May not be accurate
- The Kalman filter lets the system predict and update the positions and parameters of the faces
Slide 13: The Kalman Filter
- The FaceTrack system uses a 0.1-second time interval for state updates
- This corresponds to every I-frame and P-frame in a typical MPEG GOP structure
  - GOP (Group Of Pictures) frame structure, for example IBBPBBP
Slide 14: The Kalman Filter
- For I-frames, the face detector results are used directly
- For P-frames, the face detector results are more prone to false alarms
  - Instead, P-frame face locations are predicted (approximately) from the MPEG motion vectors
  - These locations are then fed into the Kalman filter as observations
  - (In contrast, previous trackers assumed the motion-vector-derived locations were correct on their own)
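The motion-vector-based prediction can be sketched roughly as below. The grid layout and the simple averaging are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

def predict_face_position(face_box, motion_vectors):
    """Project a face bounding box forward by averaging the MPEG motion
    vectors of the macroblocks it covers (illustrative sketch only).

    face_box       : (x, y, w, h) in macroblock units
    motion_vectors : 2-D grid of (dx, dy) vectors, one per 16x16 macroblock
    """
    x, y, w, h = face_box
    # Motion vectors of the macroblocks overlapped by the face box
    covered = motion_vectors[y:y + h, x:x + w]
    dx, dy = covered.reshape(-1, 2).mean(axis=0)
    # Shift the box by the average motion; this becomes the Kalman observation
    return (x + dx, y + dy, w, h)
```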
Slide 15: The Face Tracking Framework
- How do we discriminate new faces from previous ones during tracking?
- The Mahalanobis distance is a quantitative indicator of how close a new observation is to the prediction
- This can help separate new faces from existing tracks: if the Mahalanobis distance is greater than a certain threshold, then the newly detected face is unlikely to belong to that existing track
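A minimal sketch of this gating test, using the innovation covariance from the track's Kalman filter (the threshold value here is an illustrative chi-square cutoff, not the paper's):

```python
import numpy as np

def mahalanobis_gate(z, z_pred, S, threshold=9.21):
    """Gate a detection against a track prediction using the squared
    Mahalanobis distance. threshold=9.21 is illustrative (chi-square,
    2 dof, ~99%); the paper's actual value may differ.

    z      : observed face position
    z_pred : position predicted by the track's Kalman filter
    S      : innovation covariance from the filter
    """
    innovation = z - z_pred
    d2 = innovation @ np.linalg.inv(S) @ innovation
    # A distance beyond the gate suggests a new face rather than this track
    return d2 <= threshold, d2
```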
Slide 16: The Face Tracking Framework
- When two faces move close together, the Mahalanobis distance alone cannot keep track of multiple faces
- Case where a face is missed or occluded
  - Hypothesize the continuation of the face track
- Case of a false alarm, or of faces close together
  - Hypothesize the creation of a new track
- The idea is to wait for new observation data before making the final decision about a track
Slide 17: Intra-shot Tracking Challenges
- Multiple hypothesis method
Slide 18: Kalman Motion Models
- The Kalman filter is a framework that can model different types of motion, depending on the system matrices used
- Several models were tested for the paper, with varying results
- Intuition: who pays to research object tracking? The military!
  - Hence many tracking models are based on trajectories unlike those that faces in video are likely to exhibit
  - For example, in most commercial video, a human face will not maneuver like a jet or missile
Slide 19: Kalman Motion Models
- Four motion models were tested for FaceTrack
  - Constant Velocity (CV)
  - Constant Acceleration (CA)
  - Correlated Acceleration (AA)
  - Variable Dimension Filter (VDF)
- The testing was done against ground truth consisting of manually identified face centers in each frame
Slide 20: Kalman Motion Models
- Rather than go through the whole process in exact detail, the next several slides illustrate the differences between the CV and CA models
- The matrices are also expanded to show how the states are updated
Slides 21-23: Constant Velocity (CV) Model
(These slides step through the CV state-update matrices, simplifying at each stage)
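A standard constant-velocity state-space model of this kind looks as follows (a sketch in generic notation; the paper's exact matrices may differ):

```latex
\mathbf{x}_k = \begin{bmatrix} x_k \\ y_k \\ \dot{x}_k \\ \dot{y}_k \end{bmatrix},
\qquad
\mathbf{x}_{k+1} =
\begin{bmatrix}
1 & 0 & T & 0 \\
0 & 1 & 0 & T \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\mathbf{x}_k + \mathbf{w}_k
```

where T = 0.1 s is the update interval and w_k is white process noise that absorbs small deviations from constant velocity.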
Slides 24-25: Constant Acceleration (CA) Model
- Acceleration is now added to the state vector, and is explicitly modeled as constants disturbed by random noise
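Per axis, the corresponding constant-acceleration update takes the standard form below (again a generic sketch, not necessarily the paper's exact matrices; the y-axis block is analogous):

```latex
\mathbf{x}_k = \begin{bmatrix} x_k \\ \dot{x}_k \\ \ddot{x}_k \end{bmatrix},
\qquad
\mathbf{x}_{k+1} =
\begin{bmatrix}
1 & T & T^2/2 \\
0 & 1 & T \\
0 & 0 & 1
\end{bmatrix}
\mathbf{x}_k + \mathbf{w}_k
```

The extra state dimension lets the filter follow accelerating motion, at the cost of more noise sensitivity when the face is actually moving steadily.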
Slide 26: The Correlated Acceleration (AA) Model
- Replaces the constant accelerations with an AR(1) model
  - AR(1): first-order autoregressive, a stochastic process where the immediately previous value affects the current value (plus some random noise)
- Why?
  - There is a strong negative autocorrelation between the accelerations of consecutive frames
  - Positive accelerations tend to be followed by negative accelerations
  - This implies that faces tend to stabilize
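A small simulation makes the AR(1) behaviour concrete. The coefficient rho = -0.6 is an assumed value for illustration, not taken from the paper:

```python
import numpy as np

def simulate_ar1(rho, n, sigma=1.0, seed=0):
    """Simulate an AR(1) acceleration process a[k] = rho * a[k-1] + noise.
    A negative rho reproduces the sign-flipping behaviour described:
    positive accelerations tend to be followed by negative ones."""
    rng = np.random.default_rng(seed)
    a = np.zeros(n)
    for k in range(1, n):
        a[k] = rho * a[k - 1] + sigma * rng.standard_normal()
    return a

def lag1_autocorr(a):
    """Sample autocorrelation between consecutive values."""
    return np.corrcoef(a[:-1], a[1:])[0, 1]
```

Running this with a negative rho yields a strongly negative lag-1 autocorrelation, matching the intuition that face accelerations do not persist.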
Slide 27: The Variable Dimension Filter
- A system that switches between CV (constant velocity) and CA (constant acceleration) modes
- The dimension of the state vector changes when a maneuver is detected, hence "VDF"
- Developed for tracking highly maneuverable targets (probably military jets)
Slide 28: Comparison of Motion Models
(Plot: average tracking error for each model across the first 16 tracking runs)
Slide 29: Comparison of Motion Models
- Why does CV perform best?
  - The small sampling interval justifies viewing face motion as piecewise-linear movements
  - A face cannot achieve very high accelerations (as opposed to a jet fighter)
- AA also performs well because it fits the nature of face motion
  - Commercial-video faces exhibit few persistent accelerations (negative autocorrelation)
Slide 30: Summarization Across Shots
- Select representative frames for tracked faces
  - Large, frontal-view faces are best
- Decode the representative frames into the pixel domain
- Use clustering algorithms to group the faces into different persons
- Make use of domain knowledge
  - For example, people do not usually change clothes within a news segment, but often do change outfits within a sitcom episode
Slide 31: Simulation Results
Slide 32: Conclusions and Future Research
- FaceTrack is an effective face tracking (and summarization) architecture, within which different detection and tracking methods can be used
  - It could be updated to use new face detection algorithms or improved motion models
- Based on the results, the CV and AA motion models are sufficient for commercial face motion
- The summarization techniques need the most development, followed by optimizing tracking for adverse situations