Title: Part 2: part-based models
1. Part 2: Part-based models
by Rob Fergus (MIT)
2. Problem with bag-of-words
- All spatial arrangements of the same features have equal probability under bag-of-words methods
- Location information is important
3. Overview of section
- Representation
- Computational complexity
- Location
- Appearance
- Occlusion, Background clutter
- Recognition
- Demos
4. Representation
5. Model: Parts and Structure
6. Representation
- Object as set of parts
- Generative representation
- Model
- Relative locations between parts
- Appearance of part
- Issues
- How to model location
- How to represent appearance
- Sparse or dense (pixels or regions)
- How to handle occlusion/clutter
Figure from Fischler and Elschlager 73
7. History of Parts and Structure approaches
- Fischler & Elschlager 1973
- Yuille 91
- Brunelli & Poggio 93
- Lades, v.d. Malsburg et al. 93
- Cootes, Lanitis, Taylor et al. 95
- Amit & Geman 95, 99
- Perona et al. 95, 96, 98, 00, 03, 04, 05
- Felzenszwalb & Huttenlocher 00, 04
- Crandall & Huttenlocher 05, 06
- Leibe & Schiele 03, 04
- Many papers since 2000
8Sparse representation
Computationally tractable (105 pixels ? 101 --
102 parts) Generative representation of class
Avoid modeling global variability Success in
specific object recognition
- Throw away most image information - Parts need
to be distinctive to separate from other classes
9. Region operators
- Local maxima of interest operator function
- Can give scale/orientation invariance
Figures from Kadir, Zisserman and Brady 04
10. The correspondence problem
- Model with P parts
- Image with N possible assignments for each part
- Consider mapping to be 1-1
11. The correspondence problem
- 1-1 mapping
- Each part assigned to unique feature
- As opposed to:
- 1-Many
- Bag of words approaches
- Sudderth, Torralba, Freeman 05
- Loeff, Sorokin, Arora and Forsyth 05
- Many-1
- Quattoni, Collins and Darrell 04
12. Location
13. Connectivity of parts
- Complexity is given by the size of the maximal clique in the graph
- Consider a 3-part model
- Each part has a set of N possible locations in the image
- Locations of parts 2 and 3 are independent, given the location of L
- Each part has an appearance term, independent between parts
[Figure: shape model and corresponding factor graph over variables L, 2, 3, with appearance factors A(L), A(2), A(3) and shape factors S(L), S(L,2), S(L,3)]
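Reading off the factor graph above (x_L, x_2, x_3 denote the part locations; this is my transcription of the listed factors, not an equation given on the slide), the model score factorizes as

$$ p(x_L, x_2, x_3) \;\propto\; A(x_L)\,A(x_2)\,A(x_3)\;S(x_L)\,S(x_L, x_2)\,S(x_L, x_3), $$

so the best configuration can be found as

$$ \max_{x_L}\Big[\, A(x_L)\,S(x_L)\; \max_{x_2} A(x_2)\,S(x_L, x_2)\; \max_{x_3} A(x_3)\,S(x_L, x_3) \Big], $$

which needs only O(N^2) work instead of the O(N^3) brute-force search over all triples.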
14. Different connectivity structures
From "Sparse Flexible Models of Local Features", Gustavo Carneiro and David Lowe, ECCV 2006.
[Figure: graphical model structures and their matching complexities: Felzenszwalb & Huttenlocher 00: O(N^2); Fergus et al. 03, Fei-Fei et al. 03: O(N^6); Crandall et al. 05, Fergus et al. 05: O(N^2); Crandall et al. 05: O(N^3); also Csurka 04, Vasconcelos 00; Bouchard & Triggs 05; Carneiro & Lowe 06]
15. How much does shape help?
- Crandall, Felzenszwalb, Huttenlocher CVPR 05
- Shape variance increases with increasing model complexity
- Do get some benefit from shape
16. Hierarchical representations
- Pixels → Pixel groupings → Parts → Object
- Multi-scale approach increases number of low-level features
- Amit and Geman 98
- Bouchard and Triggs 05
Images from Amit98, Bouchard05
17. Some class-specific graphs
- Articulated motion
- People
- Animals
- Special parameterisations
- Limb angles
Images from Kumar, Torr and Zisserman 05; Felzenszwalb and Huttenlocher 05
18. Dense layout of parts
- Layout CRF: Winn and Shotton, CVPR 06
19. How to model location?
- Explicit: probability density functions
- Implicit: voting scheme
- Invariance
- Translation
- Scaling
- Similarity/affine
- Viewpoint
20. Explicit shape model
- Cartesian
- E.g. Gaussian distribution
- Parameters of model: μ and Σ
- Independence corresponds to zeros in Σ
- Burl et al. 96, Weber et al. 00, Fergus et al. 03
- Polar
- Convenient for invariance to rotation
Mikolajczyk et al., CVPR 06
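A minimal sketch of the Cartesian case, assuming part locations are expressed as offsets from a landmark part (the function name and the toy numbers are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy model: 2 non-landmark parts, so the shape vector is
# (dx2, dy2, dx3, dy3) -- offsets of parts 2 and 3 from the landmark.
mu = np.array([20.0, 0.0, -15.0, 10.0])       # mean offsets (learned)
Sigma = np.diag([25.0, 25.0, 16.0, 16.0])     # zeros off-diagonal => parts independent

shape_density = multivariate_normal(mean=mu, cov=Sigma)

def shape_log_prob(landmark_xy, part2_xy, part3_xy):
    """Log-probability of a candidate configuration under the Gaussian shape model."""
    offsets = np.concatenate([np.subtract(part2_xy, landmark_xy),
                              np.subtract(part3_xy, landmark_xy)])
    return shape_density.logpdf(offsets)

print(shape_log_prob((100, 100), (121, 101), (86, 109)))
```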
21. Implicit shape model
- Use Hough space voting to find object
- Leibe and Schiele 03, 05
Learning:
- Learn appearance codebook
- Cluster over interest points on training images
- Learn spatial distributions
- Match codebook to training images
- Record matching positions on object
- Centroid is given
[Figure: recognition pipeline, starting from interest points]
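A hedged sketch of the voting step at recognition time (codebook matching is assumed done upstream; the real system also uses soft matches and votes over scale, which are omitted here):

```python
import numpy as np

def hough_vote(feature_positions, codebook_ids, offsets_by_word, image_shape):
    """Accumulate votes for the object centroid.

    feature_positions : (M, 2) array of detected interest-point locations (x, y)
    codebook_ids      : length-M list of codebook entries each feature matched to
    offsets_by_word   : dict mapping codebook id -> list of (dx, dy) offsets to the
                        object centroid, recorded from the training images
    """
    H, W = image_shape
    accumulator = np.zeros((H, W))
    for (x, y), word in zip(feature_positions, codebook_ids):
        for dx, dy in offsets_by_word.get(word, []):
            cx, cy = int(round(x + dx)), int(round(y + dy))
            if 0 <= cx < W and 0 <= cy < H:
                accumulator[cy, cx] += 1.0   # each stored offset casts one vote
    return accumulator  # local maxima = candidate object centroids
```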
22. Deformable Template Matching
Berg, Berg and Malik, CVPR 2005
[Figure: query image matched to template]
- Formulate problem as Integer Quadratic Programming
- O(N^P) in general
- Use approximations that allow P = 50 and N = 2550 in < 2 secs
23. Other invariance methods
- Search over transformations
- Large space (pixels × scales × ...)
- Closed-form solution for translation and scale (Helmer and Lowe 04)
- Features give information
- Characteristic scale
- Characteristic orientation (noisy)
Figures from Mikolajczyk and Schmid
24. Multiple views
- Mixture of 2-D models
- Weber, Welling and Perona, CVPR 00
[Figure: two mixture components, one for frontal faces and one for profile faces]
25. Multiple viewpoints
- Thomas, Ferrari, Leibe, Tuytelaars, Schiele, and Van Gool. Towards Multi-View Object Class Detection. CVPR 06
- Hoiem, Rother, Winn. 3D LayoutCRF for Multi-View Object Class Recognition and Segmentation. CVPR 07
26. Appearance
27. Representation of appearance
- Needs to handle intra-class variation
- Task is no longer matching of descriptors
- Implicit variation (VQ to get discrete appearance)
- Explicit model of appearance (e.g. Gaussians in SIFT space)
- Dependency structure
- Often assume each part's appearance is independent
- Common to assume independence with location
28. Representation of appearance
- Invariance needs to match that of shape model
- Insensitive to small shifts in translation/scale
- Compensate for jitter of features
- e.g. SIFT
- Illumination invariance
- Normalize out
29. Appearance representation
Lepetit and Fua, CVPR 2005
Figure from Winn and Shotton, CVPR 06
30. Occlusion
- Explicit
- Additional match of each part to missing state
- Implicit
- Truncated minimum probability of appearance
[Figure: log-probability of appearance vs. appearance space, truncated at a floor value around μ_part]
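A one-function sketch of the implicit truncation, assuming a Gaussian appearance model (the floor value is a placeholder, not a number from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def truncated_appearance_logp(descriptor, mu, Sigma, log_floor=-25.0):
    """Gaussian appearance log-probability, clipped below so that an occluded
    (or badly matched) part cannot drag the whole configuration score down."""
    logp = multivariate_normal(mean=mu, cov=Sigma).logpdf(descriptor)
    return max(logp, log_floor)
```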
31. Background clutter
- Explicit model
- Generative model for clutter as well as foreground object
- Use a sub-window
- At correct position, no clutter is present
32. Recognition
33. What task?
- Classification
- Object present/absent in image
- Background may be correlated with object
- Localization / Detection
- Localize object within the frame
- Bounding box or pixel-level segmentation
34. Efficient search methods
- Interpretation tree (Grimson 87)
- Condition on assigned parts to give search regions for remaining ones
- Branch & bound, A*
35. Distance transforms
- Felzenszwalb and Huttenlocher 00, 05
- Distance transforms
- O(N^2 P) → O(NP) for tree-structured models
- How it works
- Assume location model is Gaussian (i.e. e^{-d^2})
- Consider a two-part model with μ = 0, σ = 1 on a 1-D image
[Figure: appearance log-probability A_2(x_i) of part 2 at each image pixel x_i; shape term f(d) = -d^2]
36. Distance transforms (2)
- For each position of the landmark part, find the best position for part 2
- Finding the most probable x_i is equivalent to finding the maximum over a set of offset parabolas
- Upper envelope computed in O(N) rather than the obvious O(N^2) via the distance transform (see Felzenszwalb and Huttenlocher 05)
- Add A_L(x) to the upper envelope (offset by μ) to get the overall probability map
[Figure: log-probability vs. image pixel; parabolas centred at candidate positions x_g, x_h, x_i, x_j, x_k, x_l, and their upper envelope]
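A sketch of the O(N) 1-D transform the slide refers to, following the envelope algorithm of Felzenszwalb and Huttenlocher 05 (variable names are mine; it is written for the min-form, and maximising A_2 minus a squared distance is the same problem with f = -A_2):

```python
import numpy as np

def dt1d(f):
    """1-D generalized distance transform:
    D[p] = min_q ( (p - q)**2 + f[q] ), computed in O(N) via the
    lower envelope of parabolas (Felzenszwalb & Huttenlocher 05)."""
    n = len(f)
    D = np.zeros(n)
    v = np.zeros(n, dtype=int)   # locations of parabolas in the envelope
    z = np.zeros(n + 1)          # boundaries between envelope segments
    k = 0
    v[0] = 0
    z[0], z[1] = -np.inf, np.inf
    for q in range(1, n):
        # intersection of the parabola rooted at q with the rightmost one kept so far
        s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
        while s <= z[k]:
            k -= 1
            s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
        k += 1
        v[k] = q
        z[k], z[k + 1] = s, np.inf
    k = 0
    for p in range(n):           # read off the envelope
        while z[k + 1] < p:
            k += 1
        D[p] = (p - v[k]) ** 2 + f[v[k]]
    return D
```

Running dt1d(-A2) then gives, for every landmark position, (minus) the best shape-plus-appearance contribution of part 2; doing this once per part is what turns the O(N^2 P) cost into O(NP) for tree-structured models.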
37. Parts and Structure demo
- Gaussian location model, star configuration
- Translation invariant only
- Use 1st part as landmark
- Appearance model is template matching
- Manual training
- User identifies correspondence on training images
- Recognition (see the sketch below)
- Run template for each part over image
- Get local maxima → set of possible locations for each part
- Impose shape model: O(N^2 P) cost
- Score of each match is a combination of shape model and template responses
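A minimal sketch of the recognition loop described above, assuming diagonal Gaussian offsets and that the template responses have already been reduced to local maxima (all names and data structures are illustrative, not the demo code itself):

```python
import numpy as np

def recognize_star(responses, mu, inv_var):
    """Score candidate landmark locations for a translation-invariant star model.

    responses : list of P dicts, each mapping a candidate (x, y) location of that
                part to its template (appearance) log-score; responses[0] is the
                landmark part.
    mu        : (P, 2) mean offset of each part from the landmark (mu[0] = 0)
    inv_var   : (P, 2) inverse variances of the Gaussian offsets
    Returns the best-scoring landmark location and its total score: O(N^2 P).
    """
    best_loc, best_score = None, -np.inf
    for (lx, ly), land_score in responses[0].items():
        total = land_score
        for p in range(1, len(responses)):
            # best candidate for part p given this landmark position
            part_best = -np.inf
            for (x, y), app in responses[p].items():
                dx, dy = x - lx - mu[p, 0], y - ly - mu[p, 1]
                shape = -0.5 * (inv_var[p, 0] * dx**2 + inv_var[p, 1] * dy**2)
                part_best = max(part_best, app + shape)
            total += part_best
        if total > best_score:
            best_loc, best_score = (lx, ly), total
    return best_loc, best_score
```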
38. Demo images
- Sub-set of Caltech face dataset
- Caltech background images
39. Demo web page
40. Demo (2)
41. Demo (3)
42. Demo (4)
43. Demo: efficient methods
44. Stochastic Grammar of Images, S.C. Zhu et al. and D. Mumford
45. Context and Hierarchy in a Probabilistic Image Model, Jin and Geman (2006)
[Figure: hierarchy of interpretations, e.g. animals, trees, rocks; contours, intermediate objects; linelets, curvelets, T-junctions; discontinuities, gradient; an animal head instantiated by a tiger head]
46. Parts and Structure models: Summary
- Correspondence problem
- Efficient methods for large numbers of parts and positions in image
- Challenge to get representation with desired invariance
- Future directions
- Multiple views
- Approaches to learning
- Multiple category training
48. References
[Agarwal02] S. Agarwal and D. Roth. Learning a sparse representation for object detection. In Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark, pages 113-130, 2002.
[Agarwal_Dataset] S. Agarwal, A. Awan, and D. Roth. UIUC Car dataset. http://l2r.cs.uiuc.edu/~cogcomp/Data/Car, 2002.
[Amit98] Y. Amit and D. Geman. A computational model for visual selection. Neural Computation, 11(7):1691-1715, 1998.
[Amit97] Y. Amit, D. Geman, and K. Wilder. Joint induction of shape features and tree classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(11):1300-1305, 1997.
[Amores05] J. Amores, N. Sebe, and P. Radeva. Fast spatial pattern discovery integrating boosting with constellations of contextual descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, volume 2, pages 769-774, 2005.
[Bar-Hillel05] A. Bar-Hillel, T. Hertz, and D. Weinshall. Object class recognition by boosting a part based model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, volume 1, pages 702-709, 2005.
[Barnard03] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. Jordan. Matching words and pictures. JMLR, 3:1107-1135, February 2003.
[Berg05] A. Berg, T. Berg, and J. Malik. Shape matching and object recognition using low distortion correspondence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, volume 1, pages 26-33, June 2005.
[Biederman87] I. Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94:115-147, 1987.
[Biederman95] I. Biederman. An Invitation to Cognitive Science, Vol. 2: Visual Cognition, chapter Visual Object Recognition, pages 121-165. MIT Press, 1995.
[Blei03] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.
[Borenstein02] E. Borenstein and S. Ullman. Class-specific, top-down segmentation. In Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark, pages 109-124, 2002.
[Burl96] M. Burl and P. Perona. Recognition of planar object classes. In Proc. Computer Vision and Pattern Recognition, pages 223-230, 1996.
[Burl96a] M. Burl, M. Weber, and P. Perona. A probabilistic approach to object recognition using local photometry and global geometry. In Proc. European Conference on Computer Vision, pages 628-641, 1996.
[Burl98] M. Burl, M. Weber, and P. Perona. A probabilistic approach to object recognition using local photometry and global geometry. In Proceedings of the European Conference on Computer Vision, pages 628-641, 1998.
[Burl95] M.C. Burl, T.K. Leung, and P. Perona. Face localization via shape statistics. In Int. Workshop on Automatic Face and Gesture Recognition, 1995.
[Canny86] J. F. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679-698, 1986.
[Crandall05] D. Crandall, P. Felzenszwalb, and D. Huttenlocher. Spatial priors for part-based recognition using statistical models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, volume 1, pages 10-17, 2005.
[Csurka04] G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, pages 1-22, 2004.
[Dalal05] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, pages 886-893, 2005.
[Dempster76] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. JRSS B, 39:1-38, 1977.
[Dorko04] G. Dorko and C. Schmid. Object class recognition using discriminative local features. IEEE Transactions on Pattern Analysis and Machine Intelligence, submitted, 2004.
[FeiFei03] L. Fei-Fei, R. Fergus, and P. Perona. A Bayesian approach to unsupervised one-shot learning of object categories. In Proceedings of the 9th International Conference on Computer Vision, Nice, France, pages 1134-1141, October 2003.
[FeiFei04] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In Workshop on Generative-Model Based Vision, 2004.
[FeiFei05] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, volume 2, pages 524-531, June 2005.
[Felzenszwalb00] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2066-2073, 2000.
[Felzenszwalb05] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. International Journal of Computer Vision, 61:55-79, January 2005.
[Fergus_Datasets] R. Fergus and P. Perona. Caltech Object Category datasets. http://www.vision.caltech.edu/html-files/archive.html, 2003.
[Fergus03] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 264-271, 2003.
[Fergus04] R. Fergus, P. Perona, and A. Zisserman. A visual category filter for Google images. In Proceedings of the 8th European Conference on Computer Vision, Prague, Czech Republic, pages 242-256. Springer-Verlag, May 2004.
[Fergus05] R. Fergus, P. Perona, and A. Zisserman. A sparse object category model for efficient learning and exhaustive recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, volume 1, pages 380-387, 2005.
[Fergus_Technote] R. Fergus, M. Weber, and P. Perona. Efficient methods for object recognition using the constellation model. Technical report, California Institute of Technology, 2001.
[Fischler73] M.A. Fischler and R.A. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computers, C-22(1):67-92, January 1973.
[Grimson87] W. E. L. Grimson and T. Lozano-Perez. Localizing overlapping parts by searching the interpretation tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(4):469-482, 1987.
[Harris98] C. J. Harris and M. Stephens. A combined corner and edge detector. In Proceedings of the 4th Alvey Vision Conference, Manchester, pages 147-151, 1988.
[Hart68] P.E. Hart, N.J. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on SSC, 4:100-107, 1968.
[Helmer04] S. Helmer and D. Lowe. Object recognition with many local features. In Workshop on Generative Model Based Vision 2004 (GMBV), Washington, D.C., July 2004.
[Hofmann99] T. Hofmann. Probabilistic latent semantic indexing. In SIGIR '99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 15-19, 1999, Berkeley, CA, USA, pages 50-57. ACM, 1999.
[Holub05] A. Holub and P. Perona. A discriminative framework for modeling object classes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, volume 1, pages 664-671, 2005.
[Kadir01] T. Kadir and M. Brady. Scale, saliency and image description. International Journal of Computer Vision, 45(2):83-105, 2001.
[Kadir_Code] T. Kadir and M. Brady. Scale Saliency Operator. http://www.robots.ox.ac.uk/~timork/salscale.html, 2003.
[Kumar05] M. P. Kumar, P. H. S. Torr, and A. Zisserman. OBJ CUT. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, pages 18-25, 2005.
[Leibe04] B. Leibe, A. Leonardis, and B. Schiele. Combined object categorization and segmentation with an implicit shape model. In Workshop on Statistical Learning in Computer Vision, ECCV, 2004.
[Leung98] T. Leung and J. Malik. Contour continuity and region based image segmentation. In Proceedings of the 5th European Conference on Computer Vision, Freiburg, Germany, LNCS 1406, pages 544-559. Springer-Verlag, 1998.
[Leung95] T.K. Leung, M.C. Burl, and P. Perona. Finding faces in cluttered scenes using random labeled graph matching. In Proceedings of the 5th International Conference on Computer Vision, Boston, pages 637-644, June 1995.
[Leung98] T.K. Leung, M.C. Burl, and P. Perona. Probabilistic affine invariants for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 678-684, 1998.
[Lindeberg98] T. Lindeberg. Feature detection with automatic scale selection. International Journal of Computer Vision, 30(2):77-116, 1998.
[Lowe99] D. Lowe. Object recognition from local scale-invariant features. In Proceedings of the 7th International Conference on Computer Vision, Kerkyra, Greece, pages 1150-1157, September 1999.
[Lowe01] D. Lowe. Local feature view clustering for 3D object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, pages 682-688. Springer, December 2001.
[Lowe04] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[Mardia89] K.V. Mardia and I.L. Dryden. Shape distributions for landmark data. Advances in Applied Probability, 21:742-755, 1989.
[Sivic05] J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. Discovering object categories in image collections. Technical Report A.I. Memo 2005-005, Massachusetts Institute of Technology, 2005.
[Sivic03] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proceedings of the International Conference on Computer Vision, pages 1470-1477, October 2003.
[Sudderth05] E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Learning hierarchical models of scenes, objects, and parts. In Proceedings of the IEEE International Conference on Computer Vision, Beijing, 2005 (to appear).
[Torralba04] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, pages 762-769, 2004.
[Viola01] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 511-518, 2001.
[Weber00] M. Weber. Unsupervised Learning of Models for Object Recognition. PhD thesis, California Institute of Technology, Pasadena, CA, 2000.
[Weber00a] M. Weber, W. Einhauser, M. Welling, and P. Perona. Viewpoint-invariant learning and detection of human heads. In Proc. 4th IEEE Int. Conf. Autom. Face and Gesture Recog., FG2000, pages 20-27, March 2000.
[Weber00b] M. Weber, M. Welling, and P. Perona. Towards automatic discovery of object categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2101-2108, June 2000.
[Weber00c] M. Weber, M. Welling, and P. Perona. Unsupervised learning of models for recognition. In Proc. 6th Europ. Conf. Comp. Vis., ECCV2000, volume 1, pages 18-32, June 2000.
[Welling05] M. Welling. An expectation maximization algorithm for inferring offset-normal shape distributions. In Tenth International Workshop on Artificial Intelligence and Statistics, 2005.
[Winn05] J. Winn and N. Jojic. LOCUS: Learning object classes with unsupervised segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Beijing, 2005 (to appear).
55. Quest for a Stochastic Grammar of Images, Song-Chun Zhu and David Mumford
60. Example scheme
- Model shape using Gaussian distribution on locations between parts
- Model appearance as pixel templates
- Represent image as collection of regions
- Extracted by template matching (normalized cross-correlation)
- Manually trained model
- Click on training images
61. Connectivity of parts
- To find the best match in the image, we want the most probable state of L
- Run max-product message passing (see the sketch below)
[Figure: star factor graph over L, 2, 3 with factors A(L), A(2), A(3), S(L), S(L,2), S(L,3) and messages m_a, m_b, m_c, m_d]
- Takes O(N^2) to compute
- For each of the N values of L, need to find a max over N states
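A small sketch of the max-product computation on this star graph, working in log space over a grid of N candidate locations; each message is a brute-force O(N^2) max, matching the count on the slide (array names are mine):

```python
import numpy as np

def star_max_product(A_L, A_parts, log_S):
    """Max-product on a star graph with landmark L and non-landmark parts.

    A_L     : (N,) appearance log-scores for the landmark at each candidate location
    A_parts : list of (N,) appearance log-scores, one array per non-landmark part
    log_S   : list of (N, N) shape log-scores, log_S[p][i, j] = log S(x_L = i, x_p = j)
    Returns the best landmark location and its total log-score.
    """
    belief = A_L.copy()
    for A_p, S_p in zip(A_parts, log_S):
        # message from part p to L: for every landmark location i,
        # max over the N candidate locations j of part p  -> O(N^2)
        belief += np.max(S_p + A_p[None, :], axis=1)
    best = int(np.argmax(belief))
    return best, float(belief[best])
```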
62. Different graph structures
[Figure: three 6-part models: fully connected, O(N^6); star structure, O(N^2); tree structure, O(N^2)]
- Sparser graphs cannot capture all interactions between parts
63. Euclidean & Affine Shape
- Translation, rotation and scaling: Euclidean shape
- Removal of camera foreshortenings: Affine shape
- Assume Gaussian density in figure space
- What is the probability density for the shape variables in each of the different spaces?
Figures from Leung98
64. Translation-invariant shape
- Translation-invariant form
- e.g. P = 3: move 1st part to origin
- Shape space density is still Gaussian (see below)
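One way to see why the density stays Gaussian (my notation; the slide does not spell this out): with P = 3 parts stacked as $x = (x_1, x_2, x_3)$, moving the first part to the origin is the linear map

$$ v = \begin{pmatrix} x_2 - x_1 \\ x_3 - x_1 \end{pmatrix} = A x, \qquad A = \begin{pmatrix} -I & I & 0 \\ -I & 0 & I \end{pmatrix}, $$

where $I$ is the 2x2 identity, and a linear map of a Gaussian is Gaussian: if $x \sim \mathcal{N}(\mu, \Sigma)$ then $v \sim \mathcal{N}(A\mu,\, A\Sigma A^{\top})$.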
65. Affine Shape Density
- Affine shape density (Dryden-Mardia)
- Euclidean shape density is of similar form
- Can learn parameters of the DM density with EM! Leung98, Welling05
66. Shape
- "Shape is what remains after differences due to translation, rotation, and scale have been factored out." Kendall84
- Statistical theory of shape: Kendall, Bookstein, Mardia & Dryden
[Figure: mapping from figure space (X, Y) to shape space (U, V)]
Figures from Leung98
67. Learning
68. Learning situations
- Varying levels of supervision
- Unsupervised
- Image labels
- Object centroid/bounding box
- Segmented object
- Manual correspondence (typically sub-optimal)
- Generative models naturally incorporate labelling information (or lack of it)
- Discriminative schemes require labels for all data points
[Figure: example image label: "Contains a motorbike"]
70. Learning using EM
- Task: estimation of model parameters
- Chicken-and-egg type problem, since we initially know neither:
- Model parameters
- Assignment of regions to parts
- Let the assignments be a hidden variable and use the EM algorithm to learn them and the model parameters
71. Learning procedure
- Find regions, their location and appearance
- Initialize model parameters
- Use EM and iterate to convergence:
- E-step: compute assignments of which regions belong to which part
- M-step: update model parameters
- Trying to maximize likelihood: consistency in shape and appearance (see the sketch below)
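A deliberately stripped-down sketch of the E/M loop above, for a single part with a Gaussian appearance model; the hidden assignment is which candidate region in each image is the part. This is only illustrative and not the full constellation-model likelihood (no shape term, no background model):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_single_part(descriptors, n_iters=20):
    """descriptors: list of (N_i, D) arrays, the appearance descriptors of the
    candidate regions found in each training image."""
    pooled = np.vstack(descriptors)                       # initialize from pooled data
    mu = pooled.mean(axis=0)
    Sigma = np.cov(pooled, rowvar=False) + 1e-3 * np.eye(pooled.shape[1])

    for _ in range(n_iters):
        # E-step: posterior over which region is the part, separately per image
        resp = []
        for X in descriptors:
            logp = np.atleast_1d(multivariate_normal(mean=mu, cov=Sigma).logpdf(X))
            w = np.exp(logp - logp.max())
            resp.append(w / w.sum())
        # M-step: responsibility-weighted mean and covariance
        W = np.concatenate(resp)
        mu = (W[:, None] * pooled).sum(axis=0) / W.sum()
        diff = pooled - mu
        Sigma = (W[:, None, None] * (diff[:, :, None] * diff[:, None, :])).sum(axis=0) / W.sum()
        Sigma += 1e-3 * np.eye(len(mu))                   # regularize
    return mu, Sigma
```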
72. Example scheme, using EM for maximum likelihood learning
1. Current estimate of θ
2. Assign probabilities to constellations
[Figure: pdf over constellations for Image 1, Image 2, ..., Image i, with large and small probabilities P]
3. Use probabilities as weights to re-estimate parameters (example: μ)
[Figure: weighted combination of large-P and small-P constellations gives the new estimate of μ]
73. Priors
- Implicit
- Structure of dependencies in model
- Parameterisation of model
- Feature detectors
- Explicit
- p(θ)
- MAP / Bayesian learning
- Fei-Fei 03
74. Learning shape and appearance simultaneously
Fergus et al. 03
75. Learn appearance then shape
Weber et al. 00
[Figure: from ~100 preselected parts, make part choices (Choice 1, Choice 2, ...), estimate parameters for each candidate model (Model 1, Model 2, ...), then predict / measure model performance (validation set or directly from model)]
76. Discriminative training
- Sparse, so parts need to be distinctive of the class
- Boosted parts and structure models
- Amores et al. CVPR 2005
- Bar-Hillel et al. CVPR 2005
- Discriminative features
- Weber et al. 2000
- Ullman et al.
- Train discriminatively on parameters of generative model
- Holub, Welling, Perona ICCV 2005
77. Number of training images
- More supervision, fewer images needed
- Few unknown parameters
- Less supervision, more images.
- Lots of unknown parameters
- Over-fitting problems
78. Number of training examples
[Figure: results for a 6-part motorbike model]