1
Human Activity Recognition at Mid and Near Range
  • Ram Nevatia
  • University of Southern California
  • Based on work of several collaborators
  • F. Lv, P. Natarajan, S. Lee, C. Huang
  • International Workshop on Video 2009
  • May 26, 2009

2
Activity Recognition: Motivation
  • Activity is the key content of a video (along with scene
    description)
  • Useful for
  • Monitoring (alerts)
  • Indexing (forensic, deep analysis,
    entertainment)
  • HCI
  • ..

3
Activity Recognition: Goals
  • Goal is not just to give a name, but also a
    description (not just the verb but a sentence)
  • Who, what, when, where, why, etc.
  • Some of these inferences require object
    recognition in addition to action recognition
  • Actor, object, instrument.
  • Context and story understanding is important to
    infer intent

4
Action as Change of State
  • A change in state is given by some function, say
    f(s, s', t)
  • Example: walking changes the position of the walker
  • An event can also be defined over an interval
    where some properties of f are constant (or
    within a certain range)
  • Example: walking at a constant speed or in the
    same direction
  • Recognition methods require some estimate of the
    state, such as positions or pose of actors, their
    trajectories and relation to scene objects
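
As a rough illustration of an event defined over an interval, the sketch below segments a position track into stretches where speed stays within a tolerance. The function names, the tolerance `tol` and the minimum length `min_len` are assumptions made for this example, not part of the presented system.

```python
from math import hypot

def speeds(track, dt=1.0):
    """Per-step speed from (x, y) positions sampled every dt seconds."""
    return [hypot(x1 - x0, y1 - y0) / dt
            for (x0, y0), (x1, y1) in zip(track, track[1:])]

def constant_speed_intervals(track, tol=0.2, min_len=3, dt=1.0):
    """Return (start, end) step ranges whose speed range stays within tol.

    Hypothetical helper: greedy left-to-right segmentation.
    """
    v = speeds(track, dt)
    intervals, start = [], 0
    lo = hi = v[0] if v else 0.0
    for i, s in enumerate(v):
        new_lo, new_hi = min(lo, s), max(hi, s)
        if new_hi - new_lo > tol:
            # Speed left the allowed band: close the current interval.
            if i - start >= min_len:
                intervals.append((start, i))
            start, new_lo, new_hi = i, s, s
        lo, hi = new_lo, new_hi
    if len(v) - start >= min_len:
        intervals.append((start, len(v)))
    return intervals

# Walk three steps at unit speed, then stand still for three steps.
track = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 0), (3, 0), (3, 0)]
print(constant_speed_intervals(track))
```

Here the state is only the position; a fuller state estimate would also carry pose and relations to scene objects.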

5
Event Composition
  • Composite Events
  • Compositions of other, simpler events.
  • Composition is usually, but not necessarily, a
    sequence operation, e.g. getting out of a car,
    opening a door and entering a building.
  • Primitive events: those we choose not to
    decompose, e.g. walking
  • Primitive events can be recognized directly from
    observables, by using standard classifiers.
  • Graphical models, such as HMMs and CRFs, are
    natural tools for recognition of composite events.
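
A minimal sketch of how a sequence-type composite event could be decoded with an HMM, assuming a left-to-right model whose three states stand for the sub-events in the example above. The transition values and per-frame classifier scores are invented for illustration; only the Viterbi recursion itself is standard.

```python
import math

NEG_INF = float("-inf")

def log(p):
    """Log that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi(obs_loglik, log_trans, log_init):
    """Most likely state path; obs_loglik[t][s] is the frame-t score of state s."""
    n = len(log_init)
    score = [log_init[s] + obs_loglik[0][s] for s in range(n)]
    back = []
    for frame in obs_loglik[1:]:
        prev, score, ptr = score, [], []
        for s in range(n):
            best = max(range(n), key=lambda p: prev[p] + log_trans[p][s])
            ptr.append(best)
            score.append(prev[best] + log_trans[best][s] + frame[s])
        back.append(ptr)
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# States: 0 = get out of car, 1 = open door, 2 = enter building (assumed).
trans = [[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]]
init = [1.0, 0.0, 0.0]
# Made-up per-frame primitive classifier outputs.
obs = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.8, 0.1],
       [0.2, 0.7, 0.1], [0.1, 0.1, 0.8], [0.1, 0.2, 0.7]]

path = viterbi([[log(p) for p in f] for f in obs],
               [[log(p) for p in row] for row in trans],
               [log(p) for p in init])
print(path)
```

The zero entries in `trans` are what encode the sequence constraint: a sub-event can only persist or hand over to the next one.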

6
Hierarchical Models
  • Hierarchical structure of events is naturally
    reflected in hierarchical graphical models

7
Issues in Activity Recognition
  • Variations in image/video appearance due to
    changes in viewpoint, illumination, clothing,
    style of activity etc.
  • Inherent ambiguities in 2-D videos
  • Reliable detection and tracking of objects,
    especially those directly involved in activities
  • Temporal segmentation
  • Recognition of novel events

8
Mid vs Near Range
  • Mid-range
  • Limbs of human body, particularly the arms, are
    not distinguishable
  • Common approach is to detect and track moving
    objects and make inferences based on trajectories
  • Near-range
  • Hands/arms are visible; activities are defined by
    pose transitions, not just position
    transitions
  • Pose tracking is difficult; top-down methods are
    commonly used

9
Mid-Range Example
  • Example of abandoned luggage detection
  • Based on trajectory analysis and simple object
    detection/recognition
  • Uses a simple Bayesian classifier and logical
    reasoning about order of sub-events
  • Tested on PETS and ETISEO data
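
The ordering of sub-events can be sketched with a simple rule: the bag is dropped first, then the owner moves away and stays away. The function below is a hypothetical illustration of that logic only; the distance and frame thresholds are assumed values, and the Bayesian classifier mentioned above is not modeled.

```python
from math import hypot

def abandoned_frame(owner_track, bag_pos, drop_frame,
                    max_dist=3.0, min_frames=30):
    """Return the first frame at which the bag counts as abandoned, or None.

    owner_track: (x, y) per frame; bag_pos: (x, y) of the stationary bag;
    drop_frame: frame at which the bag was put down (assumed to be known).
    """
    away = 0
    for t in range(drop_frame, len(owner_track)):
        x, y = owner_track[t]
        if hypot(x - bag_pos[0], y - bag_pos[1]) > max_dist:
            away += 1
            if away >= min_frames:
                return t
        else:
            away = 0  # owner came back: restart the clock
    return None

# Owner walks away from the bag at one unit per frame.
leaving = [(t, 0) for t in range(50)]
print(abandoned_frame(leaving, (0, 0), drop_frame=0, min_frames=5))
```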

10
Tracking in Crowded Environments
  • Results from CVPR09 paper

11
Dealing with Track Failures
  • In crowded environments, track fragmentation is
    common
  • Events of interest themselves may cause
    occlusions, e.g. two (or more) people meeting
  • Possible event detection can trigger a
    re-evaluation of the tracks
  • Meeting event example
  • People must have been separate, then get close to
    each other and stay together for some time
  • How to distinguish between passing by and
    meeting? Both may cause tracks to vanish.
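
One way to read the meeting definition above is as a rule over the distance between two tracks: separate first, then close, then close for long enough. The sketch below assumes complete, synchronized tracks and made-up thresholds (`near`, `far`, `min_stay`); it does not handle the track fragmentation discussed on this slide.

```python
from math import hypot

def classify_encounter(track_a, track_b, near=1.5, far=4.0, min_stay=5):
    """Return 'meeting', 'passing' or 'none' for two synchronized (x, y) tracks."""
    was_separate, run, longest = False, 0, 0
    for (ax, ay), (bx, by) in zip(track_a, track_b):
        d = hypot(ax - bx, ay - by)
        if d > far:
            was_separate = True
        if d < near and was_separate:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    if longest >= min_stay:
        return "meeting"
    return "passing" if longest > 0 else "none"

a = [(0, 0)] * 20
stops_near = [(max(10 - t, 1), 0) for t in range(20)]  # approaches and stays
walks_past = [(10 - t, 0) for t in range(20)]          # approaches and leaves
print(classify_encounter(a, stops_near), classify_encounter(a, walks_past))
```

The duration test is what separates meeting from passing by; both produce the same brief closeness that can make tracks vanish.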

12
Meeting Event Result (Videos)
  • Meeting event detection result
  • Tracking result

13
Events requiring fine Pose Tracking
  • Many events, e.g. gestures, require tracking of
    body pose, not just position
  • Human pose has many degrees of freedom
  • >50 joint angles/positions
  • Bottom-up pose tracking approaches are slow and
    not robust
  • Top-down approaches attempt to recognize activity
    and pose simultaneously
  • Note that usually data is not pre-segmented into
    primitive action segments
  • Closed-world assumption

14
Activity Recognition w/o Tracking
  • Action segments: check watch, punch, kick, pick up,
    throw

15
Difficulties
  • Viewpoint change and pose ambiguity (with a single
    camera view)
  • Spatial and temporal variations (style, speed)

16
Key Poses and Action Nets
  • Key poses are determined by an automatic method
    that computes large changes in energy; key poses
    may be shared among different actions
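
The slide does not spell out the selection method, so the following is only a guess at the general idea: compute a per-frame motion energy from pose differences and keep frames where that energy has a pronounced local peak or valley. The energy definition, the `min_change` threshold and both function names are assumptions for this sketch.

```python
def motion_energy(poses):
    """Energy of each step: sum of squared joint-value changes between frames."""
    return [sum((b - a) ** 2 for a, b in zip(p0, p1))
            for p0, p1 in zip(poses, poses[1:])]

def key_pose_frames(poses, min_change=0.5):
    """Frames at strict local extrema of motion energy with a large enough change."""
    e = motion_energy(poses)
    keys = []
    for t in range(1, len(e) - 1):
        is_peak = e[t] > e[t - 1] and e[t] > e[t + 1]
        is_valley = e[t] < e[t - 1] and e[t] < e[t + 1]
        change = max(abs(e[t] - e[t - 1]), abs(e[t] - e[t + 1]))
        if (is_peak or is_valley) and change >= min_change:
            keys.append(t + 1)  # e[t] describes the motion into frame t + 1
    return keys

# A single joint angle that speeds up, then slows to rest.
poses = [[0], [0], [1], [3], [4], [4], [4]]
print(key_pose_frames(poses))
```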

17
Experiments: Training Set
  • 15 action models
  • 177 key poses
  • 6372 nodes in Action Net

18
Action Net: Apply constraints
  • Figure labels: 0°, 10°

19
Experiments: Test Set
  • 50 clips, average length 1165 frames
  • 5 viewpoints
  • 10 actors (5 men, 5 women)

20
Experiments: Results

21
A Video Result
  • Panels: extracted blob, ground truth, original frame,
    with action net, without action net

22
Working with Natural Environments
  • Foreground segmentation is difficult
  • Leads to use of lower level features, e.g. edges
    and optical flow
  • Key poses are not discriminative enough w/o
    accurate segmentation; actor position also needs
    to be inferred
  • We introduce the use of continuous pose sequences.
  • More general graphical models that include:
  • Hierarchy
  • Transition probabilities may depend on
    observations
  • Observations may depend on multiple states
  • Duration models (HMMs imply an exponentially
    decaying state-duration distribution)
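
The remark about duration can be made explicit. In a standard HMM with self-transition probability a_ii (notation introduced here, not taken from the slides), the probability of staying in state i for exactly d steps is geometric, so it decays exponentially with d and always peaks at d = 1:

```latex
p_i(d) = a_{ii}^{\,d-1}\,(1 - a_{ii}), \qquad d = 1, 2, \dots
\qquad
\mathbb{E}[d] = \frac{1}{1 - a_{ii}}
```

An explicit duration model replaces p_i(d) with a distribution that can peak at a typical action length.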

23
Experiments
  • Tested the approach on videos of 6 actions:
  • sit-on-ground (SG),
  • standup-from-ground (StG),
  • sit-on-chair (SC),
  • standup-from-chair (StC),
  • pickup (PK),
  • point (P).
  • Collected instances of these actions around 4
    tilt angles and 5 pan angles
  • A total of 400 instances over all actions with
    various backgrounds.
  • We compared the relative importance of shape,
    flow and duration features with our system
    (shape + flow + duration).

24
Results
  • Combining flow and shape produces a clear
    improvement.
  • Bulk of the expense is in computing the flow.