Title: Human-Assisted Motion Annotation
Ce Liu, William T. Freeman, Edward H. Adelson (Massachusetts Institute of Technology)
Yair Weiss (The Hebrew University of Jerusalem)
- Motivations
- Existing motion databases are either synthetic or limited to indoor, experimental setups [1]. Can we obtain ground-truth motion for arbitrary, real-world videos?
- Humans are experts at segmenting moving objects and perceiving the differences between two frames. Can we build a computer vision system that quantifies human perception of motion and generates ground truth for motion analysis?
- Several issues need to be addressed:
- Is human labeling reliable (compared to the veridical ground truth) and consistent (across subjects)?
- How can we efficiently label every pixel of every frame for hundreds of real-world videos?
Figure 1. The graphical user interface (GUI) of our system: (a) main window for labeling contours and feature points; (b) depth controller to change depth values; (c) magnifier; (d) optical flow viewer; (e) control panel.
- Our work
- We designed a human-in-the-loop system to annotate motion for real-world videos [2].
- Semiautomatic layer segmentation: the user labels contours using polygons, and the system automatically propagates the contours to the other frames. The system also propagates the user's corrections across frames (see the block-matching sketch after this list).
- Automatic layer-wise optical flow: the system automatically computes dense optical flow fields for every layer at every frame using user-specified parameters. For each layer, the user picks the flow field that yields the correct matching and agrees with the smoothness and discontinuities of the image.
- Semiautomatic motion labeling: when flow estimation fails, the user can label sparse correspondences between two frames, and the system automatically interpolates them into a dense flow field (see the interpolation sketch after this list).
- Automatic full-frame motion composition.
- We validated our methodology by comparing against veridical ground-truth data and by conducting user studies.
- We created a ground-truth motion database consisting of 10 real-world video sequences (still growing). This database can be used for evaluating motion analysis algorithms as well as other vision and graphics applications.
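As a minimal sketch of how contour propagation can be bootstrapped, the C function below tracks a single polygon vertex from one frame to the next by exhaustive block matching (SSD over a small search window). The real system tracks entire contours with a robust objective; the function name, the SSD criterion, and the border convention are our simplifications, not the paper's method.

```c
#include <float.h>

/* Track the point (x0, y0) from frame I0 to frame I1: compare the
 * (2R+1)x(2R+1) patch around (x0, y0) against every shift within
 * +/-S pixels and keep the shift with the smallest sum of squared
 * differences.  Images are W x H row-major grayscale; in this sketch
 * the caller must keep points at least R + S pixels from the border. */
void track_point(const float *I0, const float *I1, int W, int H,
                 int x0, int y0, int R, int S, int *x1, int *y1)
{
    (void)H;   /* border handling is left to the caller here */
    double best = DBL_MAX;
    *x1 = x0; *y1 = y0;
    for (int dy = -S; dy <= S; dy++)
        for (int dx = -S; dx <= S; dx++) {
            double ssd = 0.0;
            for (int py = -R; py <= R; py++)
                for (int px = -R; px <= R; px++) {
                    double d = I0[(y0 + py) * W + (x0 + px)]
                             - I1[(y0 + dy + py) * W + (x0 + dx + px)];
                    ssd += d * d;
                }
            if (ssd < best) { best = ssd; *x1 = x0 + dx; *y1 = y0 + dy; }
        }
}
```

Running this for every polygon vertex yields a first guess for the next frame's contour, which the user can then correct where the guess drifts.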
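Similarly, a sketch of the sparse-to-dense step: the function below fills a dense flow field from k user-labeled correspondences by inverse-distance weighting. The system solves a regularized interpolation problem instead; IDW is a stand-in chosen only to keep the example short.

```c
/* k sparse correspondences at (px, py) with displacements (pu, pv);
 * writes a dense W x H flow field into u and v (row-major).  Every
 * pixel takes a weighted average of all labeled displacements, with
 * weights falling off as inverse squared distance. */
void interpolate_flow(const double *px, const double *py,
                      const double *pu, const double *pv, int k,
                      int W, int H, double *u, double *v)
{
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            double su = 0, sv = 0, sw = 0;
            for (int i = 0; i < k; i++) {
                double dx = x - px[i], dy = y - py[i];
                double w = 1.0 / (dx * dx + dy * dy + 1e-6);
                su += w * pu[i]; sv += w * pv[i]; sw += w;
            }
            u[y * W + x] = su / sw;
            v[y * W + x] = sv / sw;
        }
}
```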
- Experiment
- We applied our system to annotating a veridical example from [1] (Figure 3). Our annotation is very close to theirs (3.21° AAE, 0.104 AEP); the main difference is at the occluding boundary. The two error metrics are sketched in code after this list.
- We tested the consistency of human annotation (Figure 2). The mean error is 0.989° AAE, 0.112 AEP. The error magnitude correlates with the blurriness of the image.
- We created a ground-truth motion database containing 10 real-world videos with 341 frames (Figure 5, Table 1), covering both indoor and outdoor scenes. The statistics of the ground-truth motion are plotted in Figure 4.
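For reference, the two metrics quoted above can be computed as below. AAE follows the Barron et al. convention of measuring the 3D angle between (u, v, 1) vectors, also used in [1]; AEP is the mean Euclidean distance between flow vectors. The flat-array layout is our assumption, not dictated by the paper.

```c
#include <math.h>

/* Average Angular Error (AAE), in degrees, between two flow fields
 * given as flat arrays of n horizontal (u) and vertical (v) components. */
double aae(const float *u1, const float *v1,
           const float *u2, const float *v2, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double dot = u1[i] * u2[i] + v1[i] * v2[i] + 1.0;
        double n1  = sqrt(u1[i] * u1[i] + v1[i] * v1[i] + 1.0);
        double n2  = sqrt(u2[i] * u2[i] + v2[i] * v2[i] + 1.0);
        double c   = dot / (n1 * n2);
        if (c >  1.0) c =  1.0;   /* clamp against rounding error */
        if (c < -1.0) c = -1.0;
        sum += acos(c);
    }
    return sum / n * (180.0 / acos(-1.0));   /* radians to degrees */
}

/* Average End-Point error (AEP), in pixels. */
double aep(const float *u1, const float *v1,
           const float *u2, const float *v2, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double du = u1[i] - u2[i], dv = v1[i] - v2[i];
        sum += sqrt(du * du + dv * dv);
    }
    return sum / n;
}
```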
Figure 2. The consistency of nine subjects' annotations. Clockwise from top left: the image frame, the mean labeled motion, the mean absolute error (red: higher error; white: lower error), and the error histogram.
Figure 5. Some frames of the ground-truth motion database we created. We obtained ground-truth flow fields that are consistent with object boundaries, as shown in columns (3) and (4). For comparison, the output of an optical flow algorithm [3] is shown in column (5). From Table 1, the performance of this algorithm on our database is worse than its performance on the Yosemite sequence (1.723° AAE, 0.071 AEP).
- System Features
- We used state-of-the-art computer vision algorithms to design our system. Many of the objective functions in contour tracking, flow estimation, and flow interpolation use L1 norms for robustness. Techniques such as iteratively reweighted least squares (IRLS), pyramid-based coarse-to-fine search, and occlusion/outlier detection were used extensively to optimize these nonlinear objective functions (a minimal IRLS sketch follows this list).
- The system was written in C, and Qt 4.3 was used for the GUI design (Figure 1). Our system has all the components to make annotation simple and easy, and it also gives the user full freedom to label motion manually.
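As a minimal sketch of the IRLS idea mentioned above, the toy program below fits a line under an approximate L1 penalty sqrt(r^2 + eps^2) by repeatedly solving a weighted least-squares problem with weights w = 1/sqrt(r^2 + eps^2). The system's flow and contour solvers apply the same reweighting to far larger linear systems; the problem, names, and constants here are purely illustrative.

```c
#include <math.h>
#include <stdio.h>

/* Robust line fit y ~ a*x + b via iteratively reweighted least
 * squares: each pass solves the 2x2 weighted normal equations with
 * weights derived from the current residuals. */
void irls_line_fit(const double *x, const double *y, int n,
                   double *a, double *b)
{
    const double eps = 1e-3;   /* smoothing of the L1 norm at zero */
    *a = 0.0; *b = 0.0;
    for (int it = 0; it < 50; it++) {
        double sw = 0, swx = 0, swxx = 0, swy = 0, swxy = 0;
        for (int i = 0; i < n; i++) {
            double r = y[i] - (*a * x[i] + *b);        /* residual  */
            double w = 1.0 / sqrt(r * r + eps * eps);  /* L1 weight */
            sw   += w;           swx  += w * x[i];
            swxx += w * x[i] * x[i];
            swy  += w * y[i];    swxy += w * x[i] * y[i];
        }
        double det = swxx * sw - swx * swx;
        if (fabs(det) < 1e-12) break;
        *a = (swxy * sw  - swx * swy)  / det;
        *b = (swxx * swy - swx * swxy) / det;
    }
}

int main(void)
{
    /* Points on y = 2x + 1 with one gross outlier at x = 4. */
    double x[] = {0, 1, 2, 3, 4, 5};
    double y[] = {1, 3, 5, 7, 30, 11};
    double a, b;
    irls_line_fit(x, y, 6, &a, &b);
    printf("a = %.3f, b = %.3f\n", a, b);   /* close to 2 and 1 */
    return 0;
}
```

A plain least-squares fit would be dragged toward the outlier; the reweighting drives its influence toward zero, which is why L1 objectives are preferred for contour tracking and flow estimation.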
Table 1. The performance of an optical flow algorithm [3] on our database. AAE is in degrees; AEP is in pixels.

        (a)      (b)       (c)      (d)      (e)      (f)      (g)      (h)
AAE     8.996°   58.905°   2.573°   5.313°   1.924°   5.689°   5.243°   13.306°
AEP     0.976    4.181     0.456    0.346    0.085    0.196    0.385    1.567
References
[1] S. Baker, D. Scharstein, J. P. Lewis, S. Roth, M. J. Black, and R. Szeliski. A database and evaluation methodology for optical flow. In Proc. ICCV, 2007.
[2] C. Liu, W. T. Freeman, E. H. Adelson, and Y. Weiss. Human-assisted motion annotation. Submitted to CVPR 2008.
[3] A. Bruhn, J. Weickert, and C. Schnörr. Lucas/Kanade meets Horn/Schunck: combining local and global optic flow methods. IJCV, 61(3):211-231, 2005.
Figure 4. The marginal ((a)-(h)) and joint ((i)-(n)) statistics of the ground-truth motion from the database we created (log histograms). Symbols u and v denote horizontal and vertical motion, respectively. From these statistics it is evident that horizontal motion dominates vertical motion; vertical motion is sparser than horizontal motion; flow fields are sparser than natural images; and spatial derivatives are sparser than temporal derivatives.
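The marginal log histograms above are straightforward to reproduce; below is a minimal sketch for one flow component, where the bin count and value range are our choices rather than the paper's.

```c
#include <math.h>

#define NBINS 64

/* Bin the n values of one flow component (e.g. u) over [lo, hi]
 * and return log counts, as plotted in Figure 4. */
void log_histogram(const float *u, int n, double lo, double hi,
                   double loghist[NBINS])
{
    int counts[NBINS] = {0};
    for (int i = 0; i < n; i++) {
        int b = (int)((u[i] - lo) / (hi - lo) * NBINS);
        if (b < 0) b = 0;                 /* clamp out-of-range values */
        if (b >= NBINS) b = NBINS - 1;
        counts[b]++;
    }
    for (int b = 0; b < NBINS; b++)
        loghist[b] = log(counts[b] + 1.0);   /* +1 avoids log(0) */
}
```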