Title: Ankur Agarwal and Bill Triggs
1Learning to Reconstruct 3D Human Pose and Motion
from Silhouettes
- Ankur Agarwal and Bill Triggs
- LEAR
- GRAVIR-CNRS-INRIA, Grenoble
Pattern Recognition and Machine Learning in
Computer Vision Workshop 05 May 2004
2Goal
- Recover 3D human body pose from image silhouettes
- 3D pose = joint angles
- Use either individual images or video sequences
- Applications
- motion capture
- human-computer interaction
- action recognition
- visual surveillance
3Two Broad Classes of Approaches
- Model based approaches
- Presuppose an explicitly known parametric body model
- Inverting kinematics / Numerical optimization
- Subcase: model based tracking
- Learning based approaches
- Avoid accurate 3D modeling/rendering
- e.g. Example based methods
4Model Free Learning based Approach
- Recovers 3D pose (joint angles) by direct regression on robust silhouette descriptors
- Sparse kernel-based regressor trained using human motion capture data
- Advantages
- no need to build an explicit 3D model
- easily adapted to different people / appearances
- may be more robust than model based approach
- Disadvantages
- harder to interpret than explicit model, and may be less accurate
5The Basic Idea
- To learn a compact system that directly outputs pose from an image
- Represent the input (image) by a descriptor vector z.
- Write the multi-parameter output (pose) as a vector x.
- Learn a regressor
- $x = F(z) + \epsilon$
- Note this assumes a functional relationship
between z and x, which might not really be the
case.
6Silhouette Descriptors
7Why Use Silhouettes?
- Capture most of the available pose information
- Can (often) be extracted from real images
- Insensitive to colour, texture, clothing
- No prior labeling (e.g. of limbs) required
- Limitations
- Artifacts like attached shadows are common
- Depth ordering / sidedness information is lost
8Ambiguities
- Which arm / leg is forwards? Front or back view?
- Where is occluded arm? How much is knee bent?
- Silhouette-to-pose problem is inherently multi-valued
- Single-valued regressors sometimes behave erratically
9Shape Context Histograms
- Need to capture silhouette shape but be robust against occlusions/segmentation failures
- Avoid global descriptors like moments
- Use Shape Context Histograms: distributions of local shape context responses
10Shape Context Histograms Encode Locality
- First 2 principal components of Shape Context (SC) distribution from combined training data, with k-means centres superimposed, and an SC distribution from a single silhouette.
- SCs implicitly encode position on the silhouette: a form resembling an average over all human silhouettes is discernible
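A minimal sketch of how such a histogram descriptor could be computed, assuming log-polar shape contexts with 5 radial and 12 angular bins and hard assignment of each local shape context to its nearest k-means centre. The bin counts, radial range, hard voting and the function names are illustrative assumptions, not details taken from the slides.

```python
import numpy as np

def shape_contexts(points, n_r=5, n_theta=12):
    """Log-polar shape context of every contour point (one row per point)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    diff = pts[None, :, :] - pts[:, None, :]            # diff[i, j] = pts[j] - pts[i]
    dist = np.hypot(diff[..., 0], diff[..., 1])
    angle = np.arctan2(diff[..., 1], diff[..., 0]) % (2.0 * np.pi)
    off_diag = ~np.eye(n, dtype=bool)
    mean_dist = dist[off_diag].mean()                   # normalise distances for scale
    # Assumed radial range: 1/8 to 2 times the mean pairwise distance.
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1) * mean_dist
    r_bin = np.searchsorted(r_edges, dist, side="right") - 1
    t_bin = np.minimum((angle / (2.0 * np.pi) * n_theta).astype(int), n_theta - 1)
    valid = off_diag & (r_bin >= 0) & (r_bin < n_r)     # drop points outside the range
    sc = np.zeros((n, n_r * n_theta))
    for i in range(n):
        idx = r_bin[i, valid[i]] * n_theta + t_bin[i, valid[i]]
        sc[i] = np.bincount(idx, minlength=n_r * n_theta)
        total = sc[i].sum()
        if total > 0:
            sc[i] /= total                              # each SC is a distribution
    return sc

def silhouette_descriptor(points, centres):
    """Histogram z of the silhouette's shape contexts over k-means centres."""
    sc = shape_contexts(points)
    d2 = ((sc[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                         # hard vote (an assumption)
    hist = np.bincount(nearest, minlength=len(centres)).astype(float)
    return hist / hist.sum()
```

Because every contour point votes independently, a missing or badly segmented part of the silhouette only perturbs some of the votes, which is the robustness argument made for local over global descriptors.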
11Nonlinear Regression
12Regression Model
- Predict output vector x (here 3D human pose), given input vector z (here a shape context histogram)
- $x = \sum_{k=1}^{p} a_k f_k(z) + \epsilon \equiv A f(z) + \epsilon$
- $\{ f_k(z) \mid k = 1 \ldots p \}$: basis functions
- $A \equiv (a_1 \; a_2 \; \cdots \; a_p)$
- $f(z) = (f_1(z) \; f_2(z) \; \cdots \; f_p(z))^T$
- Kernel bases $f_k = K(z, z_k)$ for given centre points $z_k$ and kernel $K$.
- e.g. $K(z, z_k) = \exp(-\beta \|z - z_k\|^2)$
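As a hedged illustration of the model on this slide, the sketch below evaluates Gaussian kernel basis functions exp(-β‖z - z_k‖²) at a set of centres and applies a weight matrix A. The function names and array layouts are assumptions made for the example.

```python
import numpy as np

def gaussian_basis(Z, centres, beta):
    """Return F with F[k, i] = exp(-beta * ||z_i - z_k||^2), shape (p, n)."""
    Z = np.atleast_2d(np.asarray(Z, dtype=float))              # (n, d) inputs
    centres = np.atleast_2d(np.asarray(centres, dtype=float))  # (p, d) centres z_k
    d2 = ((centres[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-beta * d2)

def predict(A, Z, centres, beta):
    """Evaluate x = A f(z) for each row z of Z; returns an (n, m) array."""
    return (A @ gaussian_basis(Z, centres, beta)).T
```

With the training inputs themselves used as centres, F is the usual symmetric kernel matrix with ones on its diagonal.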
13Regularized Least Squares
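- $A := \arg\min_A \left\{ \sum_{i=1}^{n} \|A f(z_i) - x_i\|^2 + R(A) \right\}$
- $\;\; = \arg\min_A \left\{ \|A F - X\|^2 + R(A) \right\}$
- where $F = (f(z_1) \; \cdots \; f(z_n))$ and $X = (x_1 \; \cdots \; x_n)$ stack the training examples as columns
- R(A): Regularizer / penalty function to control overfitting
- Ridge Regression
- $R(A) = \mathrm{trace}(A^T A)$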
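For the ridge case the minimiser has a closed form. The sketch below solves the normal equations A (F Fᵀ + λI) = X Fᵀ, where the explicit weight `lam` on the trace penalty is an assumption added for the example (`lam = 1` matches the penalty as written on the slide).

```python
import numpy as np

def ridge_fit(F, X, lam=1.0):
    """Minimise ||A F - X||^2 + lam * trace(A^T A) over A.

    F: (p, n) basis responses, one column per training example.
    X: (m, n) target poses, one column per training example.
    Returns A with shape (m, p).
    """
    p = F.shape[0]
    G = F @ F.T + lam * np.eye(p)          # symmetric positive definite for lam > 0
    # A G = X F^T  <=>  G A^T = F X^T, so solve for A^T and transpose.
    return np.linalg.solve(G, F @ X.T).T
```

Solving the linear system directly avoids forming an explicit matrix inverse.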
14Relevance Vector Machine: a brief introduction
- A sparse Bayesian approach to classification and regression, proposed in M. Tipping, NIPS 01.
- Gaussian priors on each parameter (or group of parameters)
- Non-convex priors of the form
- $R(a) = \nu \log|a|$ (so $dR/da = \nu/a$)
- $R(A) = \nu \sum_k \log\|a_k\|$
- $\nu$: Pruning/shrinkage strength
15Contd.
- Advantage: Sparse solutions
- With kernel bases only relevant examples are retained
- With linear bases ($f(z) = z$), relevant features are selected
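One way to see how a log-type penalty produces sparsity is an iteratively reweighted ridge scheme. The sketch below is an assumption-laden illustration, not the training algorithm from the talk: at each pass it replaces ν log‖a_k‖ by a quadratic with the same gradient ν/‖a_k‖ at the current column norm, re-solves the resulting weighted ridge problem, and prunes columns whose norm falls below a hypothetical threshold `prune_tol`.

```python
import numpy as np

def sparse_fit(F, X, nu, n_iter=50, prune_tol=1e-3):
    """Approximately minimise ||A F - X||^2 + nu * sum_k log ||a_k||.

    F: (p, n) basis responses, X: (m, n) targets.
    Returns (A, active): A is (m, p) with pruned columns set to zero,
    active holds the indices of the retained basis functions.
    """
    p = F.shape[0]
    active = np.arange(p)
    # Start from a lightly regularised least-squares solution.
    A = np.linalg.solve(F @ F.T + 1e-6 * np.eye(p), F @ X.T).T
    for _ in range(n_iter):
        scale = np.linalg.norm(A, axis=0)          # current ||a_k|| per column
        keep = scale > prune_tol
        active, A, scale = active[keep], A[:, keep], scale[keep]
        if active.size == 0:
            break
        Fa = F[active]
        # Quadratic surrogate (nu / (2 s_k^2)) * ||a_k||^2 has gradient nu / s_k
        # at ||a_k|| = s_k, matching the log penalty there.
        W = np.diag(nu / (2.0 * scale ** 2))
        A = np.linalg.solve(Fa @ Fa.T + W, Fa @ X.T).T
    A_full = np.zeros((X.shape[0], p))
    A_full[:, active] = A
    return A_full, active
```

Small columns receive a very large quadratic weight and collapse towards zero, while large columns are barely shrunk, which is the pruning behaviour the slide attributes to ν.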
16Pose from Static Images
17Training & Test Data
- For the movements, we use real human motion capture data
- captures typical human movements, not just possible ones
- from www.ict.usc.edu/graphics/animWeb/humanoid
- Unfortunately we don't have the corresponding silhouettes, so we synthesize realistic ones
- POSER human modeler from Curious Labs
- somewhat artificial, but gives ground truth for testing, allows a wide range of training viewpoints.
- Also test on real sequences of another person (without ground truth)
18Methods Tested
- Regressors: we tested both ridge regression and RVM
- Basis: we tested both linear basis (in our nonlinear SC Histogram descriptors) and Gaussian kernels of various widths.
- Performance is very similar for all methods
- Gaussian kernels are a little better than the linear basis.
- The RVM regressors are much sparser than ridge regressors, with very similar performance.
19Synthetic Spiral Walk Test Sequence
Single image, RVM with Gaussian kernel, sparsity 6% (2636 examples, 156 support vectors). Mean angular error per d.o.f. is 6.0°
20Spiral Walk Test Sequence
Mostly OK, but 15% glitches owing to pose ambiguities
21Some statistics
- Mean RMS reconstruction error over all joints: 6.02°
- Graphs for left hip angle and overall heading angle
22Glitches
- Results are OK most of the time, but there are frequent glitches
- regressor either chooses wrong case of an ambiguous pair, or remains undecided.
- Problem is especially evident for heading angle, the most visible pose variable.
- For heading, we can quantify the conflict
- it has a 360° range so we actually regress (cos θ, sin θ)
- denormalization of this unit vector is a sign of conflict
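The heading test above can be sketched as follows: decode the angle with atan2 and use the length of the regressed (cos θ, sin θ) vector as a conflict indicator. The tolerance `tol` and the function names are hypothetical, chosen only for illustration.

```python
import math

def decode_heading(c, s):
    """Return (heading in degrees within [0, 360), norm of the regressed vector)."""
    norm = math.hypot(c, s)
    heading = math.degrees(math.atan2(s, c)) % 360.0
    return heading, norm

def is_conflicted(norm, tol=0.3):
    """Flag outputs whose norm departs from 1 by more than tol (hypothetical threshold)."""
    return abs(norm - 1.0) > tol

# A regressor that averages two opposite headings (0 and 180 degrees)
# outputs a vector near the origin, far from the unit circle.
c = 0.5 * (math.cos(0.0) + math.cos(math.pi))
s = 0.5 * (math.sin(0.0) + math.sin(math.pi))
_, norm = decode_heading(c, s)
print(is_conflicted(norm))
```

A confident prediction lies close to the unit circle, so its norm stays near 1 and only the angle carries information.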
24Real Image example
25Understanding the Problem
- x vs z is actually a multi-branched surface
- Functional treatment can lead to learning the mean of possible solutions, or zig-zagging between different solutions (in kernel spaces)
- Real solution: Multi-valued regression
- A possible solution: resolve ambiguities using temporal information
26Pose from Video Sequences
27Tracking Framework
- Reduce glitches by embedding problem in a tracking framework.
- Idea: use temporal information as a hint to select the correct solution
- To include state information, we use the familiar (dynamical prediction) + (observation update) framework, but implement both parts using learned regression models.
28Joint Regression equations
- Dynamics
- 2nd order linear autoregressive model
- $\check{x}_t = A x_{t-1} + B x_{t-2}$
- State-sensitive observation update
- Nonlinear dependence on state prediction
- $x_t = C \check{x}_t + \sum_k d_k f_k(\check{x}_t, z_t) + \epsilon$
- Kernel selects examples close in both z and x
space
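A minimal sketch of the two ingredients, under stated assumptions: the second-order dynamics are fitted by ordinary least squares on a training pose sequence, and the joint kernel is taken to be a product of Gaussians in predicted state and descriptor. The slides do not give the fitting procedure or the kernel form, so both choices, and the widths `beta_x` and `beta_z`, are illustrative.

```python
import numpy as np

def fit_ar2(X):
    """Least-squares fit of x_t ~ A x_{t-1} + B x_{t-2}.

    X: (T, m) array, one state vector per frame. Returns A, B of shape (m, m).
    """
    X = np.asarray(X, dtype=float)
    past = np.hstack([X[1:-1], X[:-2]])      # rows [x_{t-1}, x_{t-2}]
    target = X[2:]                           # rows x_t
    coef, *_ = np.linalg.lstsq(past, target, rcond=None)
    m = X.shape[1]
    return coef[:m].T, coef[m:].T

def predict_ar2(A, B, x_prev, x_prev2):
    """Dynamical prediction of the current state from the two previous states."""
    return A @ x_prev + B @ x_prev2

def joint_kernel(x_pred, z, x_k, z_k, beta_x, beta_z):
    """Assumed product-Gaussian kernel: large only if close in both x and z."""
    dx = np.sum((np.asarray(x_pred) - np.asarray(x_k)) ** 2)
    dz = np.sum((np.asarray(z) - np.asarray(z_k)) ** 2)
    return np.exp(-beta_x * dx - beta_z * dz)
```

A pure sinusoid sin(ωt) is a convenient sanity check, since it satisfies the recursion exactly with A = 2 cos ω and B = -1.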
29Results with Joint Regression
- Ensures temporal smoothness, handles ambiguities
- Mean RMS reconstruction error over all joints: 4.1°
- Graphs for left hip angle and overall heading angle
30Spiral Walk Test Sequence
RMS reconstruction error is about 4 degrees per
joint angle
31Real Images Test Sequence
Weakness: a weak observation may allow the dynamical model to dominate.
32Conclusion
- Advantages
- Compact model for direct regression on image observations
- No explicit 3D model ⇒ easy adaptability
- Exploits temporal coherency in sequences by explicitly modeling dynamics
- Potentially self-initialized tracking
- (84% correctness using automatic initialization)
- Disadvantages
- Requires segmentation
33Approximating the prior with quadratic bridges
in the RVM training algorithm