Title: Human Activity Recognition at Mid and Near Range
1Human Activity Recognition at Mid and Near Range
- Ram Nevatia
- University of Southern California
- Based on work of several collaborators
- F. Lv, P. Natarajan, S. Lee, C. Huang
- International Workshop on Video 2009
- May 26, 2009
2Activity Recognition Motivation
- Is the key content of a video (along with scene
description) - Useful for
- Monitoring (alerts)
- Indexing (forensic, deep analysis,
entertainment) - HCI
- ..
3Activity Recognition Goals
- Goal is not just to give a name, but also a
description (not just the verb but a sentence) - Who, what, when, where, why etc?
- Some of these inferences require object
recognition in addition to action recognition - Actor, object, instrument.
- Context and story understanding is important to
infer intent
4Action as Change of State
- A change in state, is given by some function, say
f (s, s, t), - Example walking changes position of the walker
- An event can also be defined over an interval
where some properties of f are constant (or
within a certain range) - Example walking at a constant speed or in the
same direction - Recognition methods require some estimate of the
state, such as positions or pose of actors, their
trajectories and relation to scene objects
5Event Composition
- Composite Events
- Compositions of other, simpler events.
- Composition is usually, but not necessarily, a
sequence operation, e.g. getting out of a car,
opening a door and entering a building. - Primitive events those we choose not to
decompose, e.g. walking - Primitive events can be recognized directly from
observables, by using standard classifiers. - Graphical models, such as HMMs and CRFs are
natural tools for recognition of composite events.
6Hierarchical Models
- Hierarchical structure of events is naturally
reflected in hierarchical graphical models
7Issues in Activity Recognition
- Variations in image/video appearance due to
changes in viewpoint, illumination, clothing,
style of activity etc. - Inherent ambiguities in 2-D videos
- Reliable detection and tracking of objects,
especially those directly involved in activities - Temporal segmentation
- Recognition of novel events
8Mid vs Near Range
- Mid-range
- Limbs of human body, particularly the arms, are
not distinguishable - Common approach is to detect and track moving
objects and make inferences based on trajectories - Near-range
- Hands/arms are visible activities are defined by
pose transitions, not just the position
transitions - Pose tracking is difficult top-down methods are
commonly used
9Mid-Range Example
- Example of abandoned luggage detection
- Based on trajectory analysis and simple object
detection/recognition - Uses a simple Bayesian classifier and logical
reasoning about order of sub-events - Tested on PETS and ETISEO data
10Tracking in Crowded Environments
- Results from CVPR09 paper
11Dealing with Track Failures
- In crowded environments, track fragmentation is
common - Events of interest themselves may cause
occlusions, e.g. two (or more) people meeting - Possible event detection can trigger a
re-evaluation of the tracks - Meeting event example
- People must have been separate, then get close to
each other and stay together for some time - How to distinguish between passing by and
meeting? Both may cause tracks to vanish.
12Meeting Event Result (Videos)
Meeting Event Detection Result
Tracking Result
13Events requiring fine Pose Tracking
- Many events, e.g. gestures, requiring tracking of
body pose, not just position - Humans pose has large degrees of freedom
- gt50 joint angles/positions
- Bottom up pose tracking approaches are slow and
not robust - Top down approaches attempt to recognize activity
and pose simultaneously - Note that usually data is not pre-segmented into
primitive action segments - Closed-world assumption
14Activity Recognition w/o Tracking
check watch
Action segments
punch
kick
pick up
throw
15Difficulties
- Viewpoint change pose ambiguity (with a single
camera view)
- Spatial and temporal variations (style, speed)
16Key Poses and Action Nets
- Key poses are determined by an automatic method
that computes large changes in energy key poses
may be shared among different actions
17Experiments Training Set
15 action models 177 key poses 6372 nodes in
Action Net
18Action Net Apply constraints
0o
10o
19Experiments Test Set
50 clips, average length 1165 frames 5
viewpoints 10 actors (5 men, 5 women)
20Experiments Results
21A Video Result
extracted blob ground truth
original frame
with action net
without action net
22Working with Natural Environments
- Foreground segmentation is difficult
- Leads to use of lower level features, e.g. edges
and optical flow - Key poses are not discriminative enough w/o
accurate segmentation actor position also needs
to be inferred - We introduce use of continuous pose sequence.
- More general graphical models that include
- Hierarchy
- Transition probabilities may depend on
observations - Observations may depend on multiple states
- Duration models (HMMs imply an exponential decay)
23Experiments
- Tested the approach on videos of 6 actions-
- sit-on-ground(SG),
- standup-from-ground(StG),
- sit-on-chair(SC),
- standup-from-ground(StC),
- pickup(PK),
- point(P).
- Collected instances of these actions around 4
tilt angles and 5 pan angles - A total of 400 instances over all actions with
various backgrounds. - We compared the relative importance of shape,
flow and duration features with our system
(shapeflowduration).
24Results
- Combining flow and shape produces a clear
improvement. - Bulk of the expense is in computing the flow.