Title: Recognizing Action at a Distance
1Recognizing Action at a Distance
- A.A. Efros, A.C. Berg, G. Mori, J. Malik
- UC Berkeley
2Looking at People
Far field
- 3-pixel man
- Blob tracking
- vast surveillance literature
Near field
- 300-pixel man
- Limb tracking
- e.g. Yacoob & Black, Rao & Shah, etc.
3Medium-field Recognition
4Appearance vs. Motion
5Goals
- Recognize human actions at a distance
- Low resolution, noisy data
- Moving camera, occlusions
- Wide range of actions (including non-periodic)
6Our Approach
- Motion-based approach
- Non-parametric: use a large amount of data
- Classify a novel motion by finding the most similar motion from the training set
Related Work
- Periodicity analysis
- Polana & Nelson; Seitz & Dyer; Bobick et al.; Cutler & Davis; Collins et al.
- Model-free
- Temporal Templates (Bobick & Davis)
- Orientation histograms (Freeman et al.; Zelnik & Irani)
- Using MoCap data (Zhao & Nevatia; Ramanan & Forsyth)
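The nearest-neighbor idea under Our Approach (classify a novel motion by finding the most similar motion in the training set) can be sketched roughly as below. This is only an illustration: the Euclidean distance, the three-element descriptors, and the names `distance`, `classify`, and `training` are assumptions made for the example, not the descriptor or similarity measure used in the talk.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length descriptor vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(query, training_set):
    """Return the label of the training descriptor closest to `query`.

    training_set is a list of (descriptor, label) pairs.
    """
    _, best_label = min(training_set, key=lambda pair: distance(query, pair[0]))
    return best_label


# Hypothetical toy descriptors; real motion descriptors would be far larger.
training = [
    ([0.9, 0.1, 0.0], "run"),
    ([0.1, 0.8, 0.1], "walk"),
    ([0.0, 0.2, 0.9], "swing"),
]

print(classify([0.2, 0.7, 0.2], training))  # closest to the "walk" example
```

Because the approach is non-parametric, adding more labeled clips to `training` is the only "training" step in this sketch.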
7Gathering action data
- Tracking
- Simple correlation-based tracker
- User-initialized
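A simple correlation-based tracker of the kind mentioned above might look like the following sketch. It assumes grayscale frames stored as lists of lists, a template cut out by the user in the first frame, and a small search radius around the previous position. The choice of normalized cross-correlation and the names `ncc`, `crop`, and `track` are illustrative assumptions; the slides do not specify these details.

```python
import math


def ncc(patch, template):
    """Normalized cross-correlation of two equal-size 2D lists; 0.0 if either is flat."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b))
    return num / den if den > 0 else 0.0


def crop(frame, top, left, height, width):
    """Cut a height x width window out of the frame."""
    return [row[left:left + width] for row in frame[top:top + height]]


def track(frame, template, prev, radius=2):
    """Search top-left positions within `radius` of prev=(top, left); return the best one."""
    h, w = len(template), len(template[0])
    best, best_score = prev, -2.0  # NCC is always >= -1, so any candidate beats this
    for top in range(max(0, prev[0] - radius), min(len(frame) - h, prev[0] + radius) + 1):
        for left in range(max(0, prev[1] - radius), min(len(frame[0]) - w, prev[1] + radius) + 1):
            score = ncc(crop(frame, top, left, h, w), template)
            if score > best_score:
                best, best_score = (top, left), score
    return best


# Toy example: a 2x2 pattern placed at row 3, column 2 of an otherwise empty 6x6 frame.
template = [[1, 2], [3, 4]]
frame = [[0] * 6 for _ in range(6)]
for r in range(2):
    for c in range(2):
        frame[3 + r][2 + c] = template[r][c]

print(track(frame, template, prev=(2, 2)))  # the figure moved one row down
```

Cropping each frame around the tracked position is what would produce a figure-centric, stabilized sequence.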
8Figure-centric Representation
- Stabilized spatio-temporal volume
- No translation information
- All motion caused by person's limbs
- Good news: indifferent to camera motion
- Bad news: hard!
- Good test to see if actions, not just translation, are being captured
9Remembrance of Things Past
- Explain novel motion sequence by matching to previously seen video clips
- For each frame, match based on some temporal extent
input sequence
Challenge: how to compare motions?
10How to describe motion?
- Appearance
- Not preserved across different clothing
- Gradients (spatial, temporal)
- Same problem (e.g. contrast reversal)
- Edges/Silhouettes
- Too unreliable
- Optical flow
- Explicitly encodes motion
- Least affected by appearance
- but too noisy
11Spatial Motion Descriptor
Image frame
Optical flow
12Spatio-temporal Motion Descriptor
Sequence A
S
Sequence B
t
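One way to read the spatio-temporal descriptor is as a frame-to-frame similarity between Sequence A and Sequence B that is then aggregated over a temporal window, so that a frame is matched together with its neighbors in time. The sketch below assumes a dot-product frame similarity and uniform weights inside the window. Both choices, and the names `frame_similarity` and `motion_similarity`, are assumptions made for illustration.

```python
def frame_similarity(seq_a, seq_b):
    """S[i][j] = dot product of per-frame descriptor i of A and descriptor j of B."""
    return [[sum(x * y for x, y in zip(fa, fb)) for fb in seq_b] for fa in seq_a]


def motion_similarity(S, half_window):
    """Sum frame similarities along the diagonal over a temporal window.

    M[i][j] adds S[i+k][j+k] for k in [-half_window, half_window],
    skipping offsets that fall outside either sequence.
    """
    n, m = len(S), len(S[0])
    M = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for k in range(-half_window, half_window + 1):
                if 0 <= i + k < n and 0 <= j + k < m:
                    M[i][j] += S[i + k][j + k]
    return M


# Toy per-frame descriptors: frames 0 and 2 look identical in isolation.
seq_a = [[1, 0], [0, 1], [1, 0]]
seq_b = [[1, 0], [0, 1], [1, 0]]

S = frame_similarity(seq_a, seq_b)
M = motion_similarity(S, half_window=1)
print(S[0][0], S[2][0])  # identical frame-level scores
print(M[0][0], M[2][0])  # temporal context separates them
```

Under this reading, a 13-frame motion descriptor would correspond to `half_window=6`.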
13Football Actions matching
Input Sequence
Matched Frames
input
matched
14Football Actions classification
10 actions; 4500 total frames; 13-frame motion descriptor
15Classifying Ballet Actions
16 actions; 24800 total frames; 51-frame motion descriptor. Men used to classify women and vice versa.
16Classifying Tennis Actions
6 actions; 4600 frames; 7-frame motion descriptor. Woman player used for training, man for testing.
17Classifying Tennis
- Red bars show classification results
18Querying the Database
input sequence
database
19 2D Skeleton Transfer
- We annotate database with 2D joint positions
- After matching, transfer data to novel sequence
- Adjust the match for best fit
Input sequence
Transferred 2D skeletons
20 3D Skeleton Transfer
- We populate database with rendered stick figures from 3D Motion Capture data
- Matching as before, we get 3D joint positions (kind of)!
Input sequence
Transferred 3D skeletons
21Do as I Do Motion Synthesis
input sequence
synthetic sequence
- Matching two things
- Motion similarity across sequences
- Appearance similarity within sequence (like VideoTextures)
- Dynamic Programming
22Do as I Do
Source Motion
Source Appearance
3400 Frames
Result
23Do as I Say Synthesis
Action labels: run, walk left, swing, walk right, jog
synthetic sequence
- Synthesize given action labels
- e.g. video game control
24Do as I Say
- Red box shows when constraint is applied
25Actor Replacement
SHOW VIDEO
26Conclusions
- In the medium field, action is about motion
- What we propose
- A way of matching motions at coarse scale
- What we get out
- Action recognition
- Skeleton transfer
- Synthesis: "Do as I Do", "Do as I Say"
- What we learned?
- A lot to be said for the little guy!
27Thank You
28Smoothness for Synthesis
- is action similarity between source and target
- is appearance similarity within target frames
- For every source frame i, find best target frame
- by maximizing following cost function
- Optimize using dynamic programming
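The dynamic-programming step can be sketched as a Viterbi-style search. The cost function itself did not survive in this transcript, so the objective below is an assumption: the sum over source frames of the action similarity to the chosen target frame, plus a weight `alpha` times the appearance similarity between consecutively chosen target frames. The names `act`, `app`, `alpha`, and `synthesize` are hypothetical.

```python
def synthesize(act, app, alpha=1.0):
    """Pick one target frame per source frame by dynamic programming.

    act[i][j]: action similarity of source frame i and target frame j.
    app[j][k]: appearance similarity when target frame k is shown right after j.
    Maximizes sum_i act[i][p(i)] + alpha * sum_i app[p(i)][p(i+1)] (assumed objective).
    """
    n, m = len(act), len(act[0])
    score = [act[0][:]]  # best total score of a path ending at each target frame
    back = []            # back[i-1][k]: best predecessor of target k at source frame i
    for i in range(1, n):
        row, ptr = [], []
        for k in range(m):
            j_best = max(range(m), key=lambda j: score[-1][j] + alpha * app[j][k])
            row.append(score[-1][j_best] + alpha * app[j_best][k] + act[i][k])
            ptr.append(j_best)
        score.append(row)
        back.append(ptr)
    # Trace the best path backwards from the best final target frame.
    k = max(range(m), key=lambda j: score[-1][j])
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    return path[::-1]


# Toy data: source frame 1 slightly prefers target 2, but smooth playback favors 0 -> 1 -> 2.
act = [[1.0, 0.0, 0.0],
       [0.0, 0.4, 0.5],
       [0.0, 0.0, 1.0]]
app = [[0.0, 1.0, 0.0],
       [0.0, 0.0, 1.0],
       [0.0, 0.0, 0.0]]

print(synthesize(act, app, alpha=1.0))  # smoothness term included
print(synthesize(act, app, alpha=0.0))  # action similarity only
```

With `alpha=0.0` each source frame simply takes its most similar target frame. Raising `alpha` trades some action similarity for smoother-looking output.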
29The Database Analogy
30Conclusions
- Action is about motion
- Purely motion-based descriptor for actions
- We treat optical flow
- Not as measurement of pixel displacement
- But as a set of noisy features that are carefully smoothed and aggregated
- Can handle very poor, noisy data
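A minimal sketch of treating flow as noisy features that are smoothed and aggregated is given below. It assumes the horizontal and vertical flow components are split into four non-negative channels before blurring, so that opposite motions do not cancel when averaged. That split, the 3x3 box filter used in place of a proper smoothing kernel, and the names `half_wave_channels`, `box_blur`, and `motion_descriptor` are assumptions for illustration and are not stated in the slides.

```python
def half_wave_channels(fx, fy):
    """Split flow into four non-negative channels: fx+, fx-, fy+, fy-."""
    def positive(grid):
        return [[max(v, 0.0) for v in row] for row in grid]

    def negative(grid):
        return [[max(-v, 0.0) for v in row] for row in grid]

    return [positive(fx), negative(fx), positive(fy), negative(fy)]


def box_blur(grid):
    """3x3 mean filter; border pixels average only the neighbors that exist."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            vals = [grid[rr][cc]
                    for rr in range(max(0, r - 1), min(h, r + 2))
                    for cc in range(max(0, c - 1), min(w, c + 2))]
            out[r][c] = sum(vals) / len(vals)
    return out


def motion_descriptor(fx, fy):
    """Blur each rectified channel separately and return the four blurred channels."""
    return [box_blur(channel) for channel in half_wave_channels(fx, fy)]


# Toy 2x2 flow field with left and right motion side by side and no vertical motion.
fx = [[1.0, -1.0], [0.0, 2.0]]
fy = [[0.0, 0.0], [0.0, 0.0]]

desc = motion_descriptor(fx, fy)
print(desc[0][0][0], desc[1][0][0])  # rightward and leftward energy are kept apart
```

Blurring `fx` directly would mix the leftward and rightward vectors into a single average, which is the cancellation the channel split is meant to avoid.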
31Cool Video, Attempt II
33Comparing motion descriptors
t