Title: Extracting features from spatio-temporal volumes (STVs) for activity recognition
1Extracting features from spatio-temporal volumes
(STVs) for activity recognition
- Dheeraj Singaraju
- Reading group 06/29/06
2Motivation for dealing with STVs
- Optical flow based methods would be able to
capture only first order motion.
- Methods that use HMMs deal with single point
trajectories that carry only motion information
and no spatial information
- We aim at a direct scheme for event detection
and classification that does not require feature
tracking, segmentation or computation of optical
flow
- We want to detect points in the space-time
volume which have significant local variation in
both space and time.
3Approaches that we shall discuss
- On Space-Time Interest Points, Ivan Laptev
- Local image features provide compact and abstract
representations of images, e.g. corners
- Extend the concept of a spatial corner detector
to a spatio-temporal corner detector
- Actions as Objects: A Novel Action Representation,
Alper Yilmaz and Mubarak Shah
- Concepts of differential geometry: extract
features from the STV based on local variations
in curvatures of points on the volume
- The curvatures show invariance to rotation and
translation
4Detecting interest points in space
- An image can be modeled by
its linear scale representation
as follows
- To look for interest points one analyzes the
matrix of 2nd moments
A more familiar form of the matrix
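In the notation commonly used for Harris-type detectors, the scale-space representation and the second-moment matrix can be sketched as below. The symbols $f^{sp}$, $g^{sp}$, $\sigma_l$ (local scale) and $\sigma_i$ (integration scale) are assumptions made for illustration and need not match the slide's own notation.

```latex
% Linear scale-space representation of an image f^{sp}
L^{sp}(x, y; \sigma_l^2) = g^{sp}(x, y; \sigma_l^2) * f^{sp}(x, y),
\qquad
g^{sp}(x, y; \sigma^2) = \frac{1}{2\pi\sigma^2}
  \exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right)

% Second-moment matrix: gradient products averaged by a Gaussian window
\mu^{sp}(\cdot\,; \sigma_l^2, \sigma_i^2) = g^{sp}(\cdot\,; \sigma_i^2) *
\begin{pmatrix}
  (L^{sp}_x)^2 & L^{sp}_x L^{sp}_y \\
  L^{sp}_x L^{sp}_y & (L^{sp}_y)^2
\end{pmatrix}
```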
5Detecting interest points in space (contd.)
- We want to choose corners in the image since they
have significant spatial variation.
- We therefore detect positive maxima of the
following function
- How do we detect interest points in space-time?
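The spatial corner function can be sketched in NumPy as follows, assuming the common Harris form H = det(mu) - k * trace(mu)^2. The function names, the zero-padded separable smoothing, and the parameter values (`sigma_l`, `sigma_i`, `k = 0.04`) are illustrative assumptions, not values taken from the slides.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel, truncated at about 3 sigma and normalized to sum to 1."""
    radius = max(1, int(3 * sigma + 0.5))
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def smooth(img, sigma):
    """Separable Gaussian smoothing with zero padding.

    Each image dimension must be at least as long as the kernel.
    """
    k = gaussian_kernel(sigma)
    out = np.asarray(img, dtype=float)
    for axis in range(out.ndim):
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="same"), axis, out)
    return out

def harris_response(img, sigma_l=1.0, sigma_i=2.0, k=0.04):
    """Assumed Harris form: det(mu) - k * trace(mu)**2 at every pixel."""
    L = smooth(img, sigma_l)          # scale-space representation
    Ly, Lx = np.gradient(L)           # axis 0 is y (rows), axis 1 is x (columns)
    # Entries of the 2x2 second-moment matrix, averaged at the integration scale
    mxx = smooth(Lx * Lx, sigma_i)
    mxy = smooth(Lx * Ly, sigma_i)
    myy = smooth(Ly * Ly, sigma_i)
    det = mxx * myy - mxy**2
    trace = mxx + myy
    return det - k * trace**2
```

On an image containing a bright square, the response is positive near the four corners and negative along the straight edges, which is why only positive maxima are kept.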
6Results of detecting interest points in space
- Detecting interest points in space gives interest
points in the stationary background also
- We want to find interest points that have
information in the space as well as the temporal
domain.
7Detecting interest points in space-time
- A spatio-temporal image sequence can be modeled
by its linear scale representation
as follows
- Note that the spatial and temporal dimensions have
separate scale parameters, denoted here by σ and τ
respectively
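A sketch of this spatio-temporal scale-space representation, assuming an anisotropic Gaussian kernel with spatial variance $\sigma_l^2$ and temporal variance $\tau_l^2$ (the subscript $l$ and the symbol $g$ are notational assumptions):

```latex
L(\cdot\,; \sigma_l^2, \tau_l^2) = g(\cdot\,; \sigma_l^2, \tau_l^2) * f(\cdot),
\qquad
g(x, y, t; \sigma_l^2, \tau_l^2) =
  \frac{1}{\sqrt{(2\pi)^3 \sigma_l^4 \tau_l^2}}
  \exp\!\left(-\frac{x^2 + y^2}{2\sigma_l^2} - \frac{t^2}{2\tau_l^2}\right)
```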
8Detecting interest points in space-time (contd.)
- To look for interest points one analyzes the
matrix of 2nd moments
- We therefore look for the maxima of the
following spatio-temporal corner function
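A minimal NumPy sketch of a spatio-temporal corner function, assuming the form H = det(mu) - k * trace(mu)^3 with a 3x3 second-moment matrix built from the derivatives in x, y and t. The function names, the edge-replicated smoothing, the integration factor `s` and the default parameters are assumptions chosen for illustration.

```python
import numpy as np

def gauss_smooth(vol, sigmas):
    """Separable Gaussian smoothing, one standard deviation per axis.

    Borders are handled by replicating the edge values.
    """
    out = np.asarray(vol, dtype=float)
    for axis, sigma in enumerate(sigmas):
        radius = max(1, int(3 * sigma + 0.5))
        x = np.arange(-radius, radius + 1, dtype=float)
        kernel = np.exp(-x**2 / (2.0 * sigma**2))
        kernel /= kernel.sum()
        pad = [(radius, radius) if a == axis else (0, 0)
               for a in range(out.ndim)]
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded)
    return out

def st_corner_response(video, sigma=1.5, tau=1.5, s=2.0, k=0.005):
    """Corner function for a video indexed as (t, y, x)."""
    local = (tau, sigma, sigma)                    # local scales per axis
    integ = tuple(np.sqrt(s) * v for v in local)   # integration scales
    L = gauss_smooth(video, local)
    Lt, Ly, Lx = np.gradient(L)
    # Entries of the symmetric 3x3 second-moment matrix
    xx = gauss_smooth(Lx * Lx, integ)
    yy = gauss_smooth(Ly * Ly, integ)
    tt = gauss_smooth(Lt * Lt, integ)
    xy = gauss_smooth(Lx * Ly, integ)
    xt = gauss_smooth(Lx * Lt, integ)
    yt = gauss_smooth(Ly * Lt, integ)
    det = (xx * (yy * tt - yt * yt)
           - xy * (xy * tt - yt * xt)
           + xt * (xy * yt - yy * xt))
    trace = xx + yy + tt
    return det - k * trace**3
```

For a sequence with no temporal change, every term involving the time derivative vanishes, so the determinant is zero and no positive response appears. This matches the goal of suppressing interest points in the stationary background.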
9Results of detecting interest points in the STV
- Consider a synthetic sequence of a ball moving
towards a wall and colliding with it
- An interest point is detected at the collision
point
10Results of detecting interest points in the STV
- Consider a synthetic sequence of 2 balls moving
towards each other
- Different interest points are calculated at
different spatial and temporal scales
coarser scale
11Effects of scales on interest point detection
Long temporal events are detected for large
values of the temporal scale τ, while short events
are detected for small values of τ
Long spatial events are detected for large values
of the spatial scale σ, while short events are
detected for small values of σ
12Scale selection in space-time
- We consider a prototype event modeled by a
spatio-temporal Gaussian blob
- The scale space representation of f is hence
given by
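A sketch of the prototype and its scale-space representation, assuming the blob has spatial variance $\sigma_0^2$ and temporal variance $\tau_0^2$ (both symbols are assumptions). Because convolving two Gaussians adds their variances, the representation is again a Gaussian:

```latex
f(x, y, t; \sigma_0^2, \tau_0^2) =
  \frac{1}{\sqrt{(2\pi)^3 \sigma_0^4 \tau_0^2}}
  \exp\!\left(-\frac{x^2 + y^2}{2\sigma_0^2} - \frac{t^2}{2\tau_0^2}\right)

L(x, y, t; \sigma^2, \tau^2) =
  g(x, y, t; \sigma_0^2 + \sigma^2, \tau_0^2 + \tau^2)
```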
13Scale selection in space-time (contd.)
- We want to find a differential operator that
assumes simultaneous extrema over spatial and
temporal scales that are characteristic of this
Gaussian prototype event
- To recover the spatio-temporal extent of f, we
consider second order derivatives of L normalized
by powers of the spatial and temporal scales, with
exponents a, b for the spatial derivatives and
c, d for the temporal derivative
- Requiring that the above normalized 2nd order
derivatives assume maxima at the spatial and
temporal scales of the prototype event, we get
a = 1, b = ¼, c = ½ and d = ¾.
14Scale selection in space-time (contd.)
- We therefore define a normalized spatio-temporal
Laplace operator as follows
- The following plots show that the zero crossings
correspond to the maxima that are detected at
the spatial and temporal scales of the prototype event
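A sketch of the scale-normalized derivatives and the resulting Laplace operator, written with $\sigma$ and $\tau$ for the spatial and temporal scale parameters (a notational assumption) and the exponents a = 1, b = 1/4, c = 1/2, d = 3/4:

```latex
% Scale-normalized second-order derivatives
L_{xx,\mathrm{norm}} = \sigma^{2a} \tau^{2b} L_{xx}, \qquad
L_{yy,\mathrm{norm}} = \sigma^{2a} \tau^{2b} L_{yy}, \qquad
L_{tt,\mathrm{norm}} = \sigma^{2c} \tau^{2d} L_{tt}

% Normalized spatio-temporal Laplacian with a = 1, b = 1/4, c = 1/2, d = 3/4
\nabla^2_{\mathrm{norm}} L
  = L_{xx,\mathrm{norm}} + L_{yy,\mathrm{norm}} + L_{tt,\mathrm{norm}}
  = \sigma^{2} \tau^{1/2} \left( L_{xx} + L_{yy} \right)
  + \sigma \tau^{3/2} L_{tt}
```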
15Scale adapted space time interest points
- So far we have found events that are local
extrema in the space time volume at a particular
choice of space and time scales
- We would like to detect interest points that are
extrema over the space time volume as well as
over the scale of the scale-normalized Laplace
operator
- The reason for doing so is that different events
would in general have different spatial and
temporal extents
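Scale selection can be checked numerically on a Gaussian blob, for which the scale-normalized Laplacian at the blob centre has a closed form. The sketch below assumes the normalization sigma^2 * tau^(1/2) * (Lxx + Lyy) + sigma * tau^(3/2) * Ltt and a blob with variances `sigma0_2` and `tau0_2`; the function names and the exhaustive grid search are illustrative only and are not the detection algorithm itself.

```python
import math

def normalized_laplacian_at_center(sigma2, tau2, sigma0_2, tau0_2):
    """Scale-normalized Laplacian at the centre of a Gaussian blob.

    sigma2, tau2       : spatial and temporal scale (variances) of the operator
    sigma0_2, tau0_2   : spatial and temporal variances of the blob
    """
    s = sigma0_2 + sigma2   # variances add under Gaussian convolution
    u = tau0_2 + tau2
    g0 = 1.0 / math.sqrt((2 * math.pi) ** 3 * s ** 2 * u)  # value at the centre
    Lxx = Lyy = -g0 / s     # second derivatives of a Gaussian at its centre
    Ltt = -g0 / u
    sigma, tau = math.sqrt(sigma2), math.sqrt(tau2)
    return sigma**2 * tau**0.5 * (Lxx + Lyy) + sigma * tau**1.5 * Ltt

def select_scales(sigma0_2, tau0_2, candidates):
    """Return the (sigma2, tau2) pair with the strongest normalized response."""
    pairs = ((s2, t2) for s2 in candidates for t2 in candidates)
    return max(pairs, key=lambda p: abs(
        normalized_laplacian_at_center(p[0], p[1], sigma0_2, tau0_2)))
```

With candidate scales such as 1, 2, 4, 8, 16, 32, the selected pair coincides with the blob's own variances whenever they are on the grid, which illustrates why extrema over scale recover the spatial and temporal extent of an event.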
16Algorithm for detecting interest points
17Results on a previously used synthetic example
Note that all the extrema are detected
irrespective of their spatial and temporal extents
DOUBT Why are these points not detected as
interest points ?
18Results of the algorithm on real seq.
Note that events of all spatial and temporal
extents are captured. The size of the circle
shows the spatial extent of the event
19Results of interest pt. detection
Note that the regularity and extent of the
spatio-temporal interest points are actually
representative of the true events in time
20Classification of events
- Every interest point is described by its local
spatio-temporal neighborhood, and we compare
neighborhoods of events to classify events
- The neighborhood of an interest point is defined
by evaluating the following event descriptors
This normalization guarantees the invariance of
the derivative response to image scaling
21Classification of events (contd.)
- To compare two events, we compute the Mahalanobis
distance between their descriptors as
- To detect similar events in the given data, we
apply k-means clustering to the event descriptors
and thus detect groups of interest points with
similar spatio-temporal neighbourhoods
- Once the cluster centers are evaluated from the
training data, given a new event, we evaluate its
distance from the cluster centers. If the
distance from all the centers is above a
threshold we declare it as a background event.
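The three steps above (Mahalanobis distance, k-means clustering, background rejection) can be sketched in NumPy as follows. The function names, the explicit initial centers, the use of one global covariance matrix, and the label -1 for background are assumptions made for this illustration.

```python
import numpy as np

def mahalanobis_sq(X, centers, cov_inv):
    """Squared Mahalanobis distances between rows of X (n, d) and centers (k, d)."""
    diff = X[:, None, :] - centers[None, :, :]            # shape (n, k, d)
    return np.einsum("nkd,de,nke->nk", diff, cov_inv, diff)

def kmeans(X, init_centers, cov_inv, iters=20):
    """Plain Lloyd iterations using the Mahalanobis distance for assignment."""
    centers = np.array(init_centers, dtype=float)
    for _ in range(iters):
        labels = mahalanobis_sq(X, centers, cov_inv).argmin(axis=1)
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members):          # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def classify(x, centers, cov_inv, threshold):
    """Index of the nearest center, or -1 (background) if all are too far."""
    x = np.asarray(x, dtype=float)[None, :]
    d2 = mahalanobis_sq(x, centers, cov_inv)[0]
    j = int(d2.argmin())
    return j if d2[j] <= threshold else -1
```

Here `cov_inv` would be the inverse of a covariance matrix estimated from the training descriptors, for example `np.linalg.inv(np.cov(X, rowvar=False))`, and `threshold` is a squared distance that has to be tuned on data.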
22Results of classification
23Recognizing gaits
- We extract the following features from the
spatio-temporal volume
- Positions of the interest points
- The corresponding scales
- The class of interest points
- We introduce a state for the model determined by
the vector X, where the variables are
- Position of person in the image
- His/her size
- Frequency of the gait
- Phase of the gait at current moment
- Temporal variations of
24Recognizing gaits (contd.)
- We then have the following model for walking
- Such a model helps handle translations as well as
uniform rescaling in the image and the temporal
domain
25Recognizing gaits (contd.)
- Given a model state X, the current time, the
length of a time window, and the set of data
features detected in that recent time window,
the match between the model and the data is
defined by a weighted sum of distances h between
the model features and the data features.
- For each model feature, the data feature used is
the one minimizing the distance h, and the weights
come from an exponential function with a given
variance.
26Recognizing gaits (contd.)
- To find the best match between the model and the
data, we search for the model state that
minimizes this weighted sum of distances
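A minimal sketch of such a matching cost and state search follows. It assumes features are (x, y, t) tuples, that h is a squared Euclidean distance, that the weight is exp(-(t - t0)^2 / xi), and that the state is a hypothetical image translation (dx, dy) searched over a finite candidate list. The actual model state has more variables (size, gait frequency, phase), which are omitted here.

```python
import math

def h(m, d):
    """Assumed distance: squared Euclidean distance between (x, y, t) features."""
    return sum((a - b) ** 2 for a, b in zip(m, d))

def match_cost(model, data, t0, xi):
    """Weighted sum, over model features, of the distance to the closest data feature."""
    total = 0.0
    for m in model:
        closest = min(h(m, d) for d in data)
        # Weight decays with the temporal distance of the feature from t0
        weight = math.exp(-((m[2] - t0) ** 2) / xi)
        total += weight * closest
    return total

def transform(model, state):
    """Apply a hypothetical state (dx, dy): translate the model features."""
    dx, dy = state
    return [(x + dx, y + dy, t) for x, y, t in model]

def best_state(model, data, t0, xi, candidates):
    """Candidate state whose transformed model matches the data best."""
    return min(candidates,
               key=lambda s: match_cost(transform(model, s), data, t0, xi))
```

For example, if the data features are the model features shifted by (3, -2), a search over integer shifts returns (3, -2) with zero cost.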
27Summary of the approach
- An interest point detector is developed that
finds local image features that show high
variation of the image values in space and in
time
- The spatio-temporal extents of detected events
can be estimated by using a normalized Laplacian
operator
- The neighborhoods of the events are described
using scale invariant spatio-temporal descriptors
- Different actions are then compared by checking
for the matches between the event descriptors
28Actions as objects Action sketches
- This method analyzes the spatio-temporal volume
by using differential geometric surface
properties such as peaks, pits, valleys and
ridges
- The authors claim that these are important action
descriptors as they capture both spatial and
temporal properties
- These descriptors are related to the convex and
concave parts of the object contours and/or to
the maxima in the spatio-temporal curvature of a
trajectory, and are hence view invariant.
29STV a collection of contours
- In this approach the spatio-temporal volume is
really a hollow solid object whose boundaries are
defined by the contours of the boundaries of a
person in every image frame.
- It is assumed that the STV can be considered as a
manifold, which helps us to consider small
neighborhoods around a point to be nearly flat.
- Since the STV is really the time evolution of a
contour, we can define a 2D parametric
representation by considering arc length s of the
contour and time t.
30STV a collection of contours (contd.)
t varying, s fixed
s varying, t fixed
The STV is a continuous representation in the
normalized time scale, and it
does not require any time warping for matching two
sequences of different lengths.
31Action descriptors
- We want to compute action descriptors that
correspond to changes in direction, speed and
shape of parts of the contour
- Changes in these quantities are reflected on the
surface of the STV and can be computed using
differential geometry by identifying different
landmarks.
- These landmarks can be classified on the basis of
the local curvatures at points on the STV
32Action descriptors (contd.)
- Differential geometry gives us the concept of
Gaussian Curvature K and Mean Curvature H that
can be evaluated at points on the manifold of the
STV. These curvatures exhibit invariance to
algebraic transformations such as translation and
rotation.
- Local extrema of these curvatures can therefore
be used to identify interest points for
describing actions
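For reference, the standard differential-geometry definitions of K and H for a parametric surface are sketched below. The parameterization $f(s, t) = (x(s,t), y(s,t), t)$ is an assumed way of writing the contour-over-time surface, and the coefficients $L, M, N$ here are those of the second fundamental form, unrelated to any scale-space symbol.

```latex
% Surface swept by the contour: arc length s, time t
f(s, t) = \bigl( x(s, t),\; y(s, t),\; t \bigr)

% First fundamental form
E = f_s \cdot f_s, \qquad F = f_s \cdot f_t, \qquad G = f_t \cdot f_t

% Unit normal and second fundamental form
n = \frac{f_s \times f_t}{\lVert f_s \times f_t \rVert}, \qquad
L = f_{ss} \cdot n, \qquad M = f_{st} \cdot n, \qquad N = f_{tt} \cdot n

% Gaussian and mean curvature
K = \frac{L N - M^2}{E G - F^2}, \qquad
H = \frac{E N + G L - 2 F M}{2 \,(E G - F^2)}
```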
33Action descriptors (contd.)
- The following table shows the different surface
types and their associated curvatures
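A sketch of the usual sign-based classification of surface types from K and H is given below. The assignment of negative H to peaks and ridges depends on the orientation chosen for the surface normal, so the sign convention, the tolerance `eps` and the function name are assumptions.

```python
def surface_type(K, H, eps=1e-6):
    """Classify a surface point from the signs of Gaussian (K) and mean (H) curvature."""
    k = 0 if abs(K) < eps else (1 if K > 0 else -1)
    h = 0 if abs(H) < eps else (1 if H > 0 else -1)
    table = {
        (1, -1): "peak",
        (0, -1): "ridge",
        (-1, -1): "saddle ridge",
        (0, 0): "flat",
        (-1, 0): "minimal surface",
        (1, 1): "pit",
        (0, 1): "valley",
        (-1, 1): "saddle valley",
    }
    # K > 0 together with H = 0 cannot occur on a real surface
    return table.get((k, h), "undefined")
```

The tolerance `eps` decides when a curvature counts as zero and would need tuning for noisy contours.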
34Analysis of action descriptors
- We consider three types of contours: concave
contours, convex contours and straight contours
- The following contours generate typical landmarks
in the spatio-temporal volume
- Straight contour: ridge, valley or flat surface
- Convex contour: peak, ridge or saddle ridge
- Concave contour: pit, valley or saddle valley
Shapes generated from straight contours
35STVs corresponding to hand motion
The STV generated by a hand staying stable. Such
a motion (or lack of it) creates a ridge
36STVs corresponding to hand motion
The STV created by a hand that first moves
downwards and then upwards. Note that a saddle
ridge is created at the point of change of motion
37Properties of the event descriptors
- The landmarks discussed so far are essentially
produced due to stable motion or change in stable
motion.
- The stability of motion enforces that the STV
is smooth enough so that one can consider valid
local planar neighborhoods at points
- Some of the landmarks are related to the
curvature of the point trajectories and body
contours as follows
38View invariance of event descriptors
- Since the landmarks are associated with extrema
of local curvatures, even when the view changes
the transformed landmarks are extrema in the new
STV
- DOUBT Not very confident about the
derivation of the above
- Due to this view invariance, comparing two STVs
is equivalent to checking if there is a
valid Fundamental Matrix relating the set of
event descriptors in 2 given action volumes.
Derived formula relating curvatures of
corresponding points in 2 different views
39Comparing two actions
- We check if a linear system of the following kind
is satisfied by the event descriptors in both the
actions
- This boils down to checking if the last singular
value of A is 0. From a set of possible matches
between the input action sketch and the known
action sketches, we select the action with the
minimum matching score
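The singular-value check can be sketched in NumPy as below, assuming A is the usual epipolar design matrix with one row per pair of corresponding landmarks, so that A f = 0 for the stacked entries f of a fundamental matrix. The function name, the row layout and the use of the raw smallest singular value as the matching score are assumptions.

```python
import numpy as np

def matching_score(pts1, pts2):
    """Smallest singular value of the assumed design matrix A.

    pts1, pts2 : arrays of shape (n, 2) with n >= 9 corresponding points.
    A value near 0 means some fundamental matrix F satisfies
    x2^T F x1 = 0 for every pair.
    """
    pts1 = np.asarray(pts1, dtype=float)
    pts2 = np.asarray(pts2, dtype=float)
    h1 = np.column_stack([pts1, np.ones(len(pts1))])   # homogeneous coordinates
    h2 = np.column_stack([pts2, np.ones(len(pts2))])
    # Row i holds the 9 products h2[i, a] * h1[i, b], so that
    # A @ F.ravel() equals x2^T F x1 for each pair
    A = np.einsum("na,nb->nab", h2, h1).reshape(len(h1), 9)
    # Singular values are returned in descending order
    return np.linalg.svd(A, compute_uv=False)[-1]
```

With noisy landmarks the smallest singular value is never exactly zero, so in practice the scores of the candidate actions are compared and the smallest one is chosen.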
40Summary of the approach
- Using concepts of differential geometry, extract
interest points (action sketches) that have local
spatio-temporal information by virtue of being
local extrema of curvatures in space-time
- These event descriptors are associated with
uniform motion or stable changes in uniform
motion
- Since the action sketches are view invariant,
comparing 2 actions is equivalent to checking if
there is a valid Fundamental Matrix relating the
positions of the action sketches for the
individual actions.