Title: Building local part models for category-level recognition, Cordelia Schmid, INRIA
1Building local part models for category-level recognition
Cordelia Schmid, INRIA
- Joint work with J. Zhang, M. Marszalek, S.
Lazebnik, J. Ponce
2Motivation
- Invariant local descriptors
- => robust recognition of specific objects or scenes
3Motivation
- Recognition of textures and object classes
- => description of intra-class variation, selection of discriminant features, spatial relations
texture recognition
car detection
4Local parts: textons/visual words
- Clusters of local descriptors
5Semi-local parts
- Descriptors + geometric relations
6Overview
- Textons / visual words + SVM
- Semi-local parts + maximum entropy framework
7Visual words + SVM
- Extract invariant regions
- Compute invariant descriptors
- Find clusters and signatures
- Compute distance matrix
- Classification (SVM)
8Region extraction
affine-invariant Harris detector [Mikolajczyk and Schmid 02]
affine-invariant Laplacian detector [Lindeberg 98]
9Region extraction
- Scale/affine rectification process
Rectified patches (rotational ambiguity)
10Descriptors: SIFT [Lowe 99]
- distribution of the gradient over an image patch
[Figure: image patch -> gradient -> 3D histogram (x, y, orientation)]
- 4x4 location grid and 8 orientations (128 dimensions)
- very good performance in image matching [Mikolajczyk and Schmid 03]
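As a rough illustration of the 4x4 location grid with 8 orientation bins, the following Python sketch builds a 128-dimensional gradient histogram for a 16x16 patch. It is a simplified stand-in, not Lowe's implementation: the patch size, the function name `sift_like_descriptor`, and the omission of Gaussian weighting and bin interpolation are assumptions made to keep the example short.

```python
import math

def sift_like_descriptor(patch):
    """Simplified SIFT-style descriptor for a 16x16 grayscale patch.

    Returns a 128-dimensional unit-length list: a 4x4 location grid
    times 8 gradient-orientation bins, weighted by gradient magnitude.
    """
    size = 16
    assert len(patch) == size and all(len(row) == size for row in patch)
    cell = size // 4  # each grid cell covers 4x4 pixels
    hist = [0.0] * (4 * 4 * 8)
    for y in range(size):
        for x in range(size):
            # Central differences, clamped at the patch border.
            dx = patch[y][min(x + 1, size - 1)] - patch[y][max(x - 1, 0)]
            dy = patch[min(y + 1, size - 1)][x] - patch[max(y - 1, 0)][x]
            mag = math.hypot(dx, dy)
            if mag == 0.0:
                continue
            angle = math.atan2(dy, dx) % (2 * math.pi)
            obin = int(angle / (2 * math.pi) * 8) % 8
            index = ((y // cell) * 4 + (x // cell)) * 8 + obin
            hist[index] += mag
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist
```

For a patch whose intensity increases only along x, all of the mass falls into the first orientation bin of each of the 16 grid cells.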
11Descriptors: SPIN [Lazebnik 03]
12Region extraction and description
- Combination of detectors and descriptors
  - Detectors: H = Harris, L = Laplacian; descriptors: SIFT, SPIN
- Different levels of invariance
  - HA: Harris affine
  - HSR: Harris scale + rotation
  - HS: Harris scale
13Signature and EMD
- Hierarchical clustering
- Signature: $S = \{(m_1, w_1), \ldots, (m_k, w_k)\}$, where $m_i$ is a cluster center and $w_i$ its relative weight
- Earth Mover's Distance: $D(S, S') = \sum_{i,j} f_{ij}\, d(m_i, m'_j) \,/\, \sum_{i,j} f_{ij}$
  - robust distance, optimizes the flow $f_{ij}$ between distributions
  - computed from ground distances $d(m_i, m'_j)$
  - can match signatures of different size
  - not sensitive to the number of clusters
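The flow $f_{ij}$ in the distance is not a free parameter. In the standard EMD formulation it is the solution of a transportation problem, sketched below. Priming the centers and weights of the second signature ($m'_j$, $w'_j$) is a notational assumption made here for readability.

```latex
\begin{aligned}
\min_{f_{ij} \ge 0}\ & \sum_{i,j} f_{ij}\, d(m_i, m'_j) \\
\text{s.t.}\ & \sum_{j} f_{ij} \le w_i, \qquad \sum_{i} f_{ij} \le w'_j, \\
& \sum_{i,j} f_{ij} = \min\Bigl(\sum_i w_i,\ \sum_j w'_j\Bigr)
\end{aligned}
```

Because the constraints only involve the weights, the two signatures may contain different numbers of clusters.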
14Vocabulary - χ² distance
- K-means clustering of the descriptors from the training images
- cluster centers -> vocabulary
- Frequency histogram of the clusters for each image
- Histogram comparison with χ² distance
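The vocabulary-based comparison can be sketched in a few lines of Python. This is a minimal illustration, assuming the vocabulary (cluster centers) is already available and that the χ² distance takes the common form 0.5 * sum((a - b)^2 / (a + b)). The function names are hypothetical.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary center closest to the descriptor (squared L2)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(vocabulary)),
               key=lambda k: sqdist(descriptor, vocabulary[k]))

def word_histogram(descriptors, vocabulary):
    """Normalized frequency histogram of visual words for one image."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_word(d, vocabulary)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(vocabulary)

def chi2_distance(h1, h2):
    """Chi-square distance between two histograms, skipping empty bins."""
    return 0.5 * sum((a - b) ** 2 / (a + b)
                     for a, b in zip(h1, h2) if a + b > 0)
```

Unlike signatures compared with EMD, the histograms here must all be built over the same vocabulary so that the bins correspond.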
15Classification
- Distance D(I1,I2) between two images
- Gaussian kernel
- Binary or multi-class SVM
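The slide only names a Gaussian kernel. One plausible way to build it from an image-to-image distance matrix D is K = exp(-D / A), sketched below in Python. Using the mean off-diagonal distance as the scale A is an assumption for this example, and `kernel_matrix` is a hypothetical helper name.

```python
import math

def kernel_matrix(distances):
    """Turn a symmetric distance matrix into K = exp(-D / A).

    A is the mean of the off-diagonal distances (an assumed scaling choice).
    """
    n = len(distances)
    off_diag = [distances[i][j] for i in range(n) for j in range(n) if i != j]
    scale = sum(off_diag) / len(off_diag) if off_diag else 1.0
    if scale == 0.0:
        scale = 1.0  # all images identical; avoid division by zero
    return [[math.exp(-distances[i][j] / scale) for j in range(n)]
            for i in range(n)]
```

The resulting matrix can be handed to any SVM implementation that accepts a precomputed kernel.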
16In-depth study of the approach
- Evaluation of different parameters of the system
- Comparison with existing methods (textures, categories)
- Influence of the background features
- Winner of several competitions of the PASCAL challenge
17Evaluation of detectors and descriptors
- The combination of detectors and descriptors gives the best results
- Laplacian + SIFT is acceptable, at lower computational cost
18Evaluation of invariance
- Best level of invariance depends on the dataset
- Affine is rarely an advantage
19Evaluation of different kernels
- EMD and χ²-kernel give comparable results
20Texture classification
- Lazebnik et al. (CVPR 2003)
  - RIFT + SPIN for sparse interest point description
  - EMD + KNN classifier
- VZ-Joint (Varma and Zisserman, CVPR 2003)
  - Image patch descriptor for dense description (every pixel)
  - χ² distance + KNN classifier
- Hayman et al. (ECCV 2004)
  - VZ-Joint for dense description (every pixel)
  - χ²-kernel + SVM
- Global Gabor Mean+STD (Manjunath et al., PAMI 1996)
  - Gabor features for single feature vector description
  - Mahalanobis distance + KNN classifier
21UIUC textures: textured surfaces
- 25 classes, 40 sample images each
22Comparison on UIUC
23CUReT texture dataset
felt
plaster
styrofoam
- 61 classes, 92 sample images each
- significant illumination changes, viewpoint
changes
24Comparison on CUReT
25Category classification
- Constellation model [Fergus 03]
- Bag of features [Csurka 04]
- Matching kernels [Wallraven 03, Grauman 03]
- Feature selection [Dorko 03, Opelt 04]
26Xerox 7 categories
bikes
books
building
cars
people
phones
trees
27Misclassified images of Xerox7
books - misclassified into faces, faces, buildings
buildings - misclassified into faces, trees, trees
cars - misclassified into buildings, phones,
phones
28Graz bike and people database
bikes
people
background
29Misclassified images of Graz dataset
misclassified bikes
misclassified people
30Comparison on the CalTech database
31Category PASCAL dataset
bikes
cars
motorbikes
people
test set 1
test set 2
training
32Influence of background
- Three types of background
  - Original background features
  - Random background features
  - Constant background features
- Three test groups
  - Background features only
  - Foreground features with different types of background for training and testing
  - Foreground features with different types of background for training, tested on the original test set
33BF training/testing
34FF+BF training/testing
35Training (FF+BF) / original test set
36Semi-local parts + maximum entropy framework
- Semi-local parts: a higher-level image representation
  - Combination of appearance and spatial layout
- Maximum entropy: a probabilistic framework for combining parts and inter-part relations
  - Discriminative framework
  - No independence assumptions
  - Many kinds of features and relations can be combined within a single framework
  - Optimization problem is convex, finding the exact optimum is tractable
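For reference, the general form of a conditional maximum entropy (exponential) model is sketched below. The notation ($f_k$ for feature functions, $\lambda_k$ for their weights, $c_n$ and $I_n$ for the training labels and images) is an assumption of this sketch and not taken from the slides.

```latex
\begin{aligned}
P(c \mid I) &= \frac{1}{Z(I)} \exp\Bigl(\sum_k \lambda_k f_k(I, c)\Bigr),
\qquad
Z(I) = \sum_{c'} \exp\Bigl(\sum_k \lambda_k f_k(I, c')\Bigr) \\
L(\lambda) &= \sum_n \log P(c_n \mid I_n)
\end{aligned}
```

The log-likelihood $L(\lambda)$ is concave in the weights, which is why the training problem is convex and the exact optimum can be found.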
37Semi-Local Parts
- Geometric invariance (scale, similarity, affine)
- Robustness to clutter, occlusion, intra-class variability
- Weakly supervised learning
38Learning a Part Vocabulary
- Ideal approach: simultaneous correspondence search across the entire training set
39Two-Image Matching
- Goal: to find collections of local regions that can be mapped onto each other using a single rigid transformation
- Implementation: local search based on geometric and photometric consistency constraints
- Returns multiple correspondence hypotheses
- Automatically determines the number of regions in correspondence
- Works on unsegmented, cluttered images (weakly supervised learning)
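The idea of keeping only those regions that agree on a single transformation can be illustrated with a small Python sketch. It is not the local search described on this slide: it is a brute-force hypothesize-and-verify over pairs of tentative matches, it assumes a similarity transform, and it ignores photometric consistency. Points are represented as complex numbers, and all names are hypothetical.

```python
def fit_similarity(p1, p2, q1, q2):
    """Similarity transform z -> a*z + b mapping p1 -> q1 and p2 -> q2."""
    a = (q1 - q2) / (p1 - p2)
    b = q1 - a * p1
    return a, b

def consistent_matches(matches, tol=1.0):
    """Indices of the largest set of matches explained by one transform.

    matches: list of (p, q) pairs of complex image coordinates.
    Every pair of matches proposes a transform; the best-supported one wins.
    """
    best = []
    for i in range(len(matches)):
        for j in range(i + 1, len(matches)):
            (p1, q1), (p2, q2) = matches[i], matches[j]
            if p1 == p2:
                continue  # degenerate pair, no transform defined
            a, b = fit_similarity(p1, p2, q1, q2)
            inliers = [k for k, (p, q) in enumerate(matches)
                       if abs(a * p + b - q) <= tol]
            if len(inliers) > len(best):
                best = inliers
    return best
```

The size of the returned set is determined by the data, which mirrors the point that the number of regions in correspondence is found automatically.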
40Scale-Invariant Parts
- Contour-based detector [Jurie and Schmid 04]
41Learning a Part Vocabulary
- Match multiple pairs of training images from the same class to produce candidate parts
- Perform part selection (validation)
  - Match candidate parts against the validation set (both positive and negative images)
  - Validation score: χ² distance between repeatability histograms for positive and negative images
- Learn a probabilistic model of the object class
  - Naïve Bayes
  - Exponential model
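The part selection step can be sketched as follows in Python. The sketch assumes that each candidate part comes with repeatability scores in [0, 1] measured on positive and on negative validation images, and that the score is the χ² distance in its common 0.5 * sum((a - b)^2 / (a + b)) form. The number of bins and all function names are illustrative choices.

```python
def histogram(values, bins):
    """Normalized histogram of values in [0, 1] with equal-width bins."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * bins

def chi2_distance(h1, h2):
    """Chi-square distance between two histograms, skipping empty bins."""
    return 0.5 * sum((a - b) ** 2 / (a + b)
                     for a, b in zip(h1, h2) if a + b > 0)

def validation_score(pos_scores, neg_scores, bins=5):
    """How differently a part behaves on positive vs. negative images."""
    return chi2_distance(histogram(pos_scores, bins),
                         histogram(neg_scores, bins))

def select_parts(candidates, n_keep, bins=5):
    """Keep the n_keep parts with the highest validation score.

    candidates: {part_name: (positive_scores, negative_scores)}
    """
    ranked = sorted(candidates,
                    key=lambda name: validation_score(*candidates[name], bins),
                    reverse=True)
    return ranked[:n_keep]
```

A part that is detected equally well in positive and negative images gets a score near zero and is discarded first.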
42Feature Functions
- (Absolute) repeatability of a detected part instance: number of detected regions, denoted $\rho_k(I)$
- Single-part features
- Overlap features
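To make the role of single-part features concrete, the Python sketch below computes class posteriors under an exponential model in which each class has one weight per part and the feature value is that part's repeatability in the image. This structure, the weight layout, and the function name are assumptions of the sketch. Overlap features and weight learning are left out.

```python
import math

def class_posterior(repeatability, weights):
    """P(class | image) from per-part repeatabilities.

    repeatability: list with one repeatability value per part
    weights:       {class_name: [one weight per part]}
    """
    scores = {c: sum(w * r for w, r in zip(ws, repeatability))
              for c, ws in weights.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}
```

Parts that are never detected in an image contribute nothing to any class score, so the decision is driven by the parts that fire.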
43CalTech Database
- Four classes: airplanes, cars (rear), faces, motorbikes
- 100 training images per class
  - 50 initial images (50 largest candidate parts retained)
  - 50 validation images (20 highest-scoring parts retained)
- 200 test images per class
- 300 images per class in total
44CalTech Database Parts
45CalTech Results
46Airplane Part Detection
misclassified image
47Car Part Detection
misclassified image
48Face Part Detection
misclassified image
49Motorbike Part Detection
misclassified image
50The Birds Database
- Six classes: egret, mandarin duck, snowy owl, puffin, toucan, wood duck
- 50 training images per class
  - 20 initial images (50 largest candidate parts retained)
  - 30 validation images (20 highest-scoring parts retained)
- 50 test images per class
- 100 images per class in total
51Bird Parts
52Birds Database Results
53Bird Part Detection
54Bird Part Detection (cont.)
55Misclassified Images
56Classification rate vs. dictionary size
Birds
Caltech
57Conclusion
- Both local part models perform well for category recognition
- Classification results with both methods are excellent
- Future work: localization (preliminary results with semi-local parts)