Human Activity Recognition at Mid and Near Range

1
Human Activity Recognition at Mid and Near Range

Ram Nevatia
University of Southern California
Based on work of several collaborators
F. Lv, P. Natarajan, S. Lee, C. Huang
International Workshop on Video 2009
May 26, 2009

2
Activity Recognition: Motivation

Is the key content of a video (along with scene
description)
Useful for:
Monitoring (alerts)
Indexing (forensic, deep analysis,
entertainment)
HCI
...

3
Activity Recognition: Goals

Goal is not just to give a name, but also a
description (not just the verb but a sentence)
Who, what, when, where, why, etc.?
Some of these inferences require object
recognition in addition to action recognition
Actor, object, instrument.
Context and story understanding are important to
infer intent

4
Action as Change of State

A change in state is given by some function, say
f(s, s', t)
Example: walking changes the position of the walker
An event can also be defined over an interval
where some properties of f are constant (or
within a certain range)
Example: walking at a constant speed or in the
same direction
Recognition methods require some estimate of the
state, such as positions or pose of actors, their
trajectories and relation to scene objects
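
As a rough illustration of the interval-based definition above, the sketch below estimates speed from a tracked position sequence and reports the intervals over which speed stays within a tolerance band. The function names, the tolerance, the minimum interval length, and the sample trajectory are all assumptions made for illustration; they are not taken from the presentation.

```python
import math

def speeds(track, dt=1.0):
    """Per-step speed from a list of (x, y) positions sampled every dt seconds."""
    return [math.dist(p, q) / dt for p, q in zip(track, track[1:])]

def constant_speed_intervals(track, tol=0.2, min_len=3, dt=1.0):
    """Return (start, end) step ranges where speed stays within tol
    of the speed at the start of the range (hypothetical criterion)."""
    v = speeds(track, dt)
    intervals, start = [], 0
    for i in range(1, len(v) + 1):
        # Close the current interval at the end of the data or when speed drifts.
        if i == len(v) or abs(v[i] - v[start]) > tol:
            if i - start >= min_len:
                intervals.append((start, i - 1))
            start = i
    return intervals

# Made-up track: walk four steps to the right, then stand still.
track = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (4, 0), (4, 0), (4, 0), (4, 0)]
intervals = constant_speed_intervals(track)
```

On this track the function returns two intervals, one for the steady walk and one for standing still, which correspond to two events defined by a constant property of the state change.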

5
Event Composition

Composite Events
Compositions of other, simpler events.
Composition is usually, but not necessarily, a
sequence operation, e.g. getting out of a car,
opening a door and entering a building.
Primitive events: those we choose not to
decompose, e.g. walking
Primitive events can be recognized directly from
observables, by using standard classifiers.
Graphical models, such as HMMs and CRFs, are
natural tools for recognition of composite events.
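
To make the composition idea concrete, here is a minimal sketch of a left-to-right discrete HMM whose hidden states are the three sub-events from the example above, decoded with the standard Viterbi algorithm. The state names, observation labels, and all probabilities are invented for illustration and are not the models used in the presented work.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for a discrete HMM (log-space)."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    score = {s: lg(start_p[s]) + lg(emit_p[s][obs[0]]) for s in states}
    back = []
    for o in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + lg(trans_p[r][s]))
            score[s] = prev[best] + lg(trans_p[best][s]) + lg(emit_p[s][o])
            ptr[s] = best
        back.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical composite event: exit car -> open door -> enter building.
states = ["exit_car", "open_door", "enter_building"]
start_p = {"exit_car": 1.0, "open_door": 0.0, "enter_building": 0.0}
trans_p = {
    "exit_car": {"exit_car": 0.6, "open_door": 0.4, "enter_building": 0.0},
    "open_door": {"exit_car": 0.0, "open_door": 0.6, "enter_building": 0.4},
    "enter_building": {"exit_car": 0.0, "open_door": 0.0, "enter_building": 1.0},
}
# Observations stand in for noisy outputs of primitive-event classifiers.
emit_p = {
    "exit_car": {"car": 0.8, "door": 0.1, "inside": 0.1},
    "open_door": {"car": 0.1, "door": 0.8, "inside": 0.1},
    "enter_building": {"car": 0.1, "door": 0.1, "inside": 0.8},
}
obs = ["car", "car", "door", "inside", "inside"]
path = viterbi(obs, states, start_p, trans_p, emit_p)
```

The zero entries in the transition table enforce the sequence ordering, so the decoded path also gives a temporal segmentation of the observation stream into sub-events.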

6
Hierarchical Models

Hierarchical structure of events is naturally
reflected in hierarchical graphical models

7
Issues in Activity Recognition

Variations in image/video appearance due to
changes in viewpoint, illumination, clothing,
style of activity etc.
Inherent ambiguities in 2-D videos
Reliable detection and tracking of objects,
especially those directly involved in activities
Temporal segmentation
Recognition of novel events

8
Mid vs Near Range

Mid-range
Limbs of human body, particularly the arms, are
not distinguishable
Common approach is to detect and track moving
objects and make inferences based on trajectories
Near-range
Hands/arms are visible; activities are defined by
pose transitions, not just position
transitions
Pose tracking is difficult; top-down methods are
commonly used

9
Mid-Range Example

Example of abandoned luggage detection
Based on trajectory analysis and simple object
detection/recognition
Uses a simple Bayesian classifier and logical
reasoning about order of sub-events
Tested on PETS and ETISEO data

10
Tracking in Crowded Environments

Results from CVPR09 paper

11
Dealing with Track Failures

In crowded environments, track fragmentation is
common
Events of interest themselves may cause
occlusions, e.g. two (or more) people meeting
Possible event detection can trigger a
re-evaluation of the tracks
Meeting event example:
People must have been separate, then get close to
each other and stay together for some time
How to distinguish between passing by and
meeting? Both may cause tracks to vanish.
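
One simple way to separate the two cases, sketched below, is to require that the two people start apart and then stay within a distance threshold for a minimum number of consecutive frames. This is only an assumed rule for illustration: the function name, the thresholds, and the synthetic tracks are hypothetical, and the sketch assumes complete, frame-aligned tracks, so it does not address the track-fragmentation problem raised above.

```python
import math

def classify_encounter(track_a, track_b, near=1.5, min_frames=5):
    """Label two frame-aligned (x, y) tracks as 'meeting', 'passing' or 'none'."""
    close = [math.dist(a, b) <= near for a, b in zip(track_a, track_b)]
    if not close or close[0]:
        return "none"  # the two people must start out separate
    longest = run = 0
    for c in close:
        run = run + 1 if c else 0  # length of the current run of close frames
        longest = max(longest, run)
    if longest >= min_frames:
        return "meeting"
    return "passing" if longest > 0 else "none"

# Two people walk toward each other and keep going: close for only 2 frames.
pass_a = [(t, 0.0) for t in range(10)]
pass_b = [(9 - t, 0.5) for t in range(10)]

# Two people walk toward each other and stop next to each other.
meet_a = [(min(t, 4), 0.0) for t in range(10)]
meet_b = [(max(9 - t, 5), 0.0) for t in range(10)]

labels = (classify_encounter(pass_a, pass_b), classify_encounter(meet_a, meet_b))
```

The duration threshold is what distinguishes the two labels here; in practice it would need to be tuned to the frame rate and scene scale.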

12
Meeting Event Result (Videos)

[Videos: meeting event detection result; tracking result]

13
Events Requiring Fine Pose Tracking

Many events, e.g. gestures, require tracking of
body pose, not just position
Human pose has many degrees of freedom:
>50 joint angles/positions
Bottom-up pose tracking approaches are slow and
not robust
Top-down approaches attempt to recognize activity
and pose simultaneously
Note that usually data is not pre-segmented into
primitive action segments
Closed-world assumption

14
Activity Recognition w/o Tracking

Action segments: check watch, punch, kick, pick up, throw

15
Difficulties

Viewpoint change and pose ambiguity (with a single
camera view)

Spatial and temporal variations (style, speed)

16
Key Poses and Action Nets

Key poses are determined by an automatic method
that computes large changes in energy; key poses
may be shared among different actions
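
The slide does not define the energy measure, so the following is only a guess at what such a selection step could look like. It assumes each pose is a tuple of joint angles, takes motion energy to be the sum of squared joint-angle changes between consecutive frames, and marks a frame as a key pose when the energy just before and just after it differ by more than a threshold. All names, the energy definition, and the threshold are hypothetical.

```python
def motion_energy(poses):
    """Energy of each frame-to-frame transition:
    sum of squared joint-angle changes (assumed definition)."""
    return [
        sum((b - a) ** 2 for a, b in zip(p, q))
        for p, q in zip(poses, poses[1:])
    ]

def key_pose_indices(poses, threshold):
    """Frames where the energy entering and leaving the frame differ sharply."""
    e = motion_energy(poses)
    # e[t - 1] is the transition into frame t, e[t] the transition out of it.
    return [t for t in range(1, len(e)) if abs(e[t] - e[t - 1]) > threshold]

# Made-up sequence of (shoulder, elbow) angles in degrees:
# rest, raise the arm in three steps, then hold.
poses = [(0, 0), (0, 0), (0, 0), (10, 0), (20, 0), (30, 0), (30, 0), (30, 0)]
keys = key_pose_indices(poses, threshold=50)
```

On this sequence the selected frames are the last rest pose before the arm starts moving and the first pose of the hold, i.e. the two poses that bracket the motion.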

17
Experiments: Training Set

15 action models, 177 key poses, 6372 nodes in
Action Net

18
Action Net: Apply constraints

[Figure labels: 0°, 10°]

19
Experiments: Test Set

50 clips, average length 1165 frames; 5
viewpoints; 10 actors (5 men, 5 women)

20
Experiments: Results

21
A Video Result

[Video panels: original frame, extracted blob, ground truth, with action net, without action net]

22
Working with Natural Environments

Foreground segmentation is difficult
Leads to use of lower level features, e.g. edges
and optical flow
Key poses are not discriminative enough without
accurate segmentation; actor position also needs
to be inferred
We introduce the use of continuous pose sequences.
More general graphical models that include:
Hierarchy
Transition probabilities may depend on
observations
Observations may depend on multiple states
Duration models (HMMs imply an exponential decay)
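
The remark about exponential decay can be checked with a few lines of code. In a first-order HMM, a state with self-transition probability a is occupied for exactly d frames with probability a^(d-1) (1 - a), which is a geometric distribution. The sketch below tabulates it; the value 0.9 and the function name are arbitrary choices for illustration.

```python
def hmm_duration_pmf(self_transition, max_d):
    """P(state lasts exactly d frames), d = 1..max_d, for a first-order HMM state."""
    a = self_transition
    return [a ** (d - 1) * (1 - a) for d in range(1, max_d + 1)]

pmf = hmm_duration_pmf(0.9, 200)
mode = 1 + pmf.index(max(pmf))  # most likely duration
mean = sum(d * p for d, p in enumerate(pmf, start=1))  # close to 1 / (1 - a)
```

Even though the expected duration is about 1 / (1 - a) frames, the single most likely duration is always one frame and the probability only decreases from there. Actions with a typical non-trivial length are poorly described by this shape, which is the motivation for explicit duration models.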

23
Experiments

Tested the approach on videos of 6 actions:
sit-on-ground (SG),
standup-from-ground (StG),
sit-on-chair (SC),
standup-from-chair (StC),
pickup (PK),
point (P).
Collected instances of these actions around 4
tilt angles and 5 pan angles
A total of 400 instances over all actions with
various backgrounds.
We compared the relative importance of shape,
flow and duration features with our system
(shape+flow+duration).

24
Results