Title: Robust real-time face detection
1. Robust real-time face detection
- Paul A. Viola and Michael J. Jones
- Intl. J. Computer Vision
- 57(2), 137-154, 2004
- (originally in CVPR 2001)
- (slides adapted from Bill Freeman, MIT 6.869, April 2005)
2. Scan the classifier over locations and scales
3. Learn the classifier from data
- Training data
- 5000 faces (frontal)
- 10^8 non-faces
- Faces are normalized
- Scale, translation
- Many variations
- Across individuals
- Illumination
- Pose (rotation both in plane and out)
4. Characteristics of the Algorithm
- The feature set is huge (about 16M features)
- Efficient feature selection using AdaBoost
- New image representation: the Integral Image
- Cascaded classifier for rapid detection
- Fastest known frontal face detector for grayscale images
5. Integral Image
- Allows for fast feature evaluation
- Does not work directly on image intensities
- Computed from the image with a few operations per pixel
- (the features used are reminiscent of Haar basis functions)
6. Simple and Efficient Classifier
- Select a small number of important features from a huge library of potential features using AdaBoost (Freund and Schapire, 1995)
7. AdaBoost (Adaptive Boosting)
- Formulated by Yoav Freund and Robert Schapire.
- A meta-algorithm: it can be used in conjunction with many other learning algorithms to improve their performance.
- AdaBoost is adaptive
- subsequent classifiers are tweaked in favor of instances misclassified by previous classifiers.
- Sensitive to noisy data and outliers.
- Less susceptible to overfitting than most algorithms in some problems.
- Calls a weak classifier repeatedly in a series of T rounds.
- For each call
- a distribution of weights D_t is updated that indicates the importance of each example in the data set.
- On each round
- the weights of incorrectly classified examples are increased
- (or, alternatively, the weights of correctly classified examples are decreased),
- so the new classifier focuses more on those examples.
8. AdaBoost
- Given training examples (x_1, y_1), ..., (x_m, y_m) where y_i ∈ {-1, +1}
- Initialize the weight distribution D_1(i) = 1/m
- For t = 1, ..., T
- Find the classifier h_t that minimizes the error with respect to the distribution D_t
- ε_t = Σ_i D_t(i) [y_i ≠ h_t(x_i)] is the weighted error rate of classifier h_t
- If ε_t ≥ 1/2, then stop
- Choose α_t, typically α_t = (1/2) ln((1 - ε_t) / ε_t)
- Update D_{t+1}(i) = D_t(i) exp(-α_t y_i h_t(x_i)) / Z_t
- where Z_t is a normalization factor (chosen so that D_{t+1} sums to 1)
9. AdaBoost
- Output the final classifier H(x) = sign( Σ_{t=1..T} α_t h_t(x) )
- The update rule for the distribution D_t is constructed so that,
- after selecting an optimal classifier for the current distribution,
- examples that the classifier identified correctly are weighted less and
- examples that it identified incorrectly are weighted more.
- When the algorithm tests the classifiers on the updated distribution,
- it will select a classifier that better identifies those examples the previous classifier missed.
- (A minimal code sketch of this loop follows below.)
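To make the loop above concrete, here is a minimal Python/numpy sketch of generic AdaBoost as summarized on slides 8-9. It assumes the weak classifiers are supplied as a list of callables returning labels in {-1, +1}; the names (adaboost, weak_learners) are illustrative, not from the paper.

import numpy as np

def adaboost(X, y, weak_learners, T):
    # X: training inputs, y: labels in {-1, +1},
    # weak_learners: list of callables h(X) -> array of {-1, +1} predictions
    y = np.asarray(y)
    n = len(y)
    D = np.full(n, 1.0 / n)                       # D_1(i) = 1/n
    alphas, chosen = [], []
    for t in range(T):
        # find the weak classifier with minimum weighted error under D_t
        errors = [np.sum(D * (h(X) != y)) for h in weak_learners]
        best = int(np.argmin(errors))
        eps = errors[best]
        if eps >= 0.5:                            # no better than chance: stop
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)   # alpha_t
        pred = weak_learners[best](X)
        D = D * np.exp(-alpha * y * pred)         # up-weight mistakes, down-weight hits
        D = D / D.sum()                           # normalize (divide by Z_t)
        alphas.append(alpha)
        chosen.append(weak_learners[best])
    # final strong classifier H(x) = sign(sum_t alpha_t h_t(x))
    def strong(X_query):
        return np.sign(sum(a * h(X_query) for a, h in zip(alphas, chosen)))
    return strong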
10. Characteristics of the Algorithm
- The feature set is huge (about 16M features)
- Efficient feature selection using AdaBoost
- New image representation: the Integral Image
- Cascaded classifier for rapid detection
11. Cascaded Classifier
- Combines successively more complex classifiers in a cascade structure
- Dramatically increases the speed of the detector by
- focusing attention on promising regions of the image.
- Focus-of-attention approaches
- It is often possible to rapidly determine where in an image a face might occur (Tsotsos et al., 1995; Itti et al., 1998; Amit and Geman, 1999; Fleuret and Geman, 2001).
- More complex processing is reserved only for these promising regions.
- The key measure of such an approach is the false negative rate of the attentional process.
12. Cascaded Classifier
- Training process
- An extremely simple and efficient classifier is
- used as a supervised focus-of-attention operator.
- A face detection attentional operator can
- filter out over 50% of the image
- while preserving 99% of the faces (over a large dataset).
- This filter is exceedingly efficient:
- it can be evaluated in about 20 simple operations per location/scale.
13. Overview
- Features: their form and how to compute them
- Combining features to form a classifier: AdaBoost
- Constructing a cascade of classifiers
- Experimental results
- Discussion
14. Features
- The system uses features rather than image pixels
- Features act to encode ad-hoc domain knowledge that is difficult to learn from a finite quantity of training data
- Much faster than a pixel-based system
15. Image features
- Rectangle filters (Papageorgiou et al., 1998)
- Similar to Haar wavelets
- Differences between sums of pixels in adjacent rectangles
- About 160,000 rectangle features for a 200x200 image
16. Integral Image
- Partial sum: ii(x, y) = Σ_{x' ≤ x, y' ≤ y} i(x', y')
- The sum over any rectangle D needs only the four corner values: D = ii(4) + ii(1) - (ii(2) + ii(3)) (see the sketch after this slide)
- Also known as
- summed area tables (Crow, 1984)
- boxlets (Simard et al., 1998)
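A short numpy sketch of the integral image and the four-corner rectangle sum described above; the two-rectangle feature at the end shows how a Haar-like feature reduces to a handful of lookups. Function names are illustrative; an integer grayscale image is assumed.

import numpy as np

def integral_image(img):
    # summed-area table padded with a zero row/column so that
    # ii[y, x] = sum of img[:y, :x]
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, r, c, h, w):
    # sum over img[r:r+h, c:c+w] from four corner lookups:
    # D = ii(4) + ii(1) - ii(2) - ii(3)
    return ii[r + h, c + w] + ii[r, c] - ii[r, c + w] - ii[r + h, c]

def two_rect_feature(ii, r, c, h, w):
    # horizontal two-rectangle feature: difference between the sums of
    # two adjacent rectangles of equal size
    return rect_sum(ii, r, c, h, w) - rect_sum(ii, r, c + w, h, w)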
17. Huge library of filters
18. Feature Discussion
- Rectangle features are primitive when compared with alternatives such as steerable filters
- steerable filters are excellent for detailed analysis of boundaries, image compression, and texture analysis.
- Rectangle features are sensitive to the presence of edges, bars, and other simple image structure,
- but quite coarse: only three orientations (horizontal, vertical, diagonal)
- Overcomplete set (about 400x): varying aspect ratio and location
19. Computational Advantage
- The face detector scans the input at many scales
- starting at the base scale, faces are detected at a size of 24 x 24 pixels,
- then at 12 scales, each 1.25 times larger than the last
- a 384 x 288 pixel image is scanned at the top scale
- The conventional approach:
- compute a pyramid of 12 images (smaller and smaller images)
- scan a fixed-scale detector over each image.
- Computing the pyramid directly requires significant time:
- it takes around 0.05 seconds to compute a 12-level pyramid of this size (on an Intel PIII 700 MHz processor),
- even when implemented efficiently on conventional hardware (using bilinear interpolation to scale each level of the pyramid)
20. Computational Advantage
- Define a meaningful set of rectangle features
- a single feature can be evaluated at any scale and location in a few operations.
- Effective detectors can be constructed with as few as two rectangle features.
- Given the computational efficiency of these features,
- the face detection process can be completed for an entire image at every scale at 15 frames per second
- about the same time required to evaluate the 12-level image pyramid alone.
21. Learning Classification Functions
- Given the feature set and a training set, many machine learning methods could be used
- Mixture of Gaussians model (Sung and Poggio, 1998)
- Simple image features and a neural network (Rowley et al., 1998)
- Support Vector Machine (Osuna et al., 1997b)
- Winnow learning procedure (Roth et al., 2000)
- With 160,000 features per sub-window, even though each feature can be computed very efficiently, computing the complete set is prohibitively expensive
22. AdaBoost
- A very small number of features can be combined to form an effective classifier
- Boosts the classification performance
- by combining a collection of weak classification functions to form a stronger classifier
- Weak learner:
- we do not expect even the best classification function to classify the training data well
- After the first round of learning,
- examples are re-weighted in order to emphasize those which were incorrectly classified by the previous weak classifier.
- The final strong classifier
- takes the form of a perceptron: a weighted combination of weak classifiers followed by a threshold.
- The training error of the strong classifier approaches zero exponentially in the number of rounds.
23. AdaBoost
- Select a small set of good classification functions which nevertheless have significant variety
- i.e., select effective features which nevertheless have significant variety
- Restrict the weak learner to classification functions
- each of which depends on a single feature
- Select the single rectangle feature
- which best separates the positive and negative examples
- A weak classifier h_j(x) thus consists of a feature f_j, a threshold θ_j, and a polarity p_j indicating the direction of the inequality:
- h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and 0 otherwise, where x is a 24x24 sub-window (see the sketch after this slide)
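A one-function sketch of the weak classifier defined above (single feature, threshold, polarity); f_x would be the feature value computed from the integral image of the 24x24 sub-window. The names are hypothetical.

def weak_classify(f_x, theta, p):
    # h(x) = 1 if p * f(x) < p * theta, else 0
    # p = +1: faces have feature values below the threshold
    # p = -1: faces have feature values above the threshold
    return 1 if p * f_x < p * theta else 0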
24. AdaBoost
- No single feature can perform the classification task with low error
- Features selected early have error rates of 0.1-0.3
- Features selected later have error rates of 0.4-0.5
- Thresholded single features
- = single-node decision trees
- = "decision stumps"
25. Constructing the classifier
- A perceptron yields a sufficiently powerful classifier
- Use AdaBoost to efficiently choose the best features
- add a new h_i(x) at each round
- each h_i(x_k) is a decision stump
26. Constructing the Classifier
- For each round of boosting
- evaluate each rectangle filter on each example
- sort the examples by filter value
- select the best threshold for each filter (minimum error)
- use the sorting to quickly scan for the optimal threshold
- select the best filter/threshold combination
- the weight is a simple function of the error rate
- reweight the examples
- (There are many tricks to make this more efficient.)
27. AdaBoost using single rectangle features
- Given example images (x_1, y_1), ..., (x_n, y_n), where y_i = 0 for negative and y_i = 1 for positive examples
- Initialize weights w_{1,i} = 1/(2m) for negative and 1/(2l) for positive examples, where m and l are the numbers of negative and positive examples
- For t = 1, ..., T
- Normalize the weights: w_{t,i} <- w_{t,i} / Σ_j w_{t,j}
- Select the best weak classifier with respect to the weighted error: ε_t = min_{f, p, θ} Σ_i w_{t,i} |h(x_i, f, p, θ) - y_i|
- Define h_t(x) = h(x, f_t, p_t, θ_t), where f_t, p_t, and θ_t are the parameters that minimize ε_t
- Update the weights: w_{t+1,i} = w_{t,i} β_t^{1 - e_i}, where e_i = 0 if example x_i is classified correctly, e_i = 1 otherwise, and β_t = ε_t / (1 - ε_t)
28. AdaBoost using single rectangle features
- The final strong classifier is C(x) = 1 if Σ_{t=1..T} α_t h_t(x) ≥ (1/2) Σ_{t=1..T} α_t, and 0 otherwise, where α_t = log(1/β_t) (see the sketch after this slide)
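The following is a sketch of the boosting variant on slides 27-28, assuming the rectangle-feature values are precomputed into a matrix F (examples x features). find_best_stump is a placeholder for the weighted-error minimization; a single-pass version of it appears after slide 32.

import numpy as np

def viola_jones_boost(F, y, T, find_best_stump):
    # F: (n_examples, n_features) precomputed feature values
    # y: labels, 0 = non-face, 1 = face
    # find_best_stump(F, y, w) -> (feature index j, threshold theta, polarity p, error eps)
    y = np.asarray(y)
    m, l = np.sum(y == 0), np.sum(y == 1)
    w = np.where(y == 0, 1.0 / (2 * m), 1.0 / (2 * l))   # initial weights w_{1,i}
    stumps, alphas = [], []
    for t in range(T):
        w = w / w.sum()                                  # normalize the weights
        j, theta, p, eps = find_best_stump(F, y, w)      # minimize weighted error
        h = (p * F[:, j] < p * theta).astype(int)        # weak classifier outputs
        beta = eps / (1.0 - eps)
        e = (h != y).astype(int)                         # e_i = 0 if correct, else 1
        w = w * beta ** (1 - e)                          # shrink weights of correct examples
        stumps.append((j, theta, p))
        alphas.append(np.log(1.0 / beta))
    def strong(f_row):
        # C(x) = 1 if sum_t alpha_t h_t(x) >= 0.5 * sum_t alpha_t
        s = sum(a * int(p * f_row[j] < p * theta)
                for a, (j, theta, p) in zip(alphas, stumps))
        return 1 if s >= 0.5 * sum(alphas) else 0
    return strong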
29. Good Reference on Boosting
- Friedman, J., Hastie, T., and Tibshirani, R. "Additive Logistic Regression: a Statistical View of Boosting"
- http://www-stat.stanford.edu/hastie/Papers/boost.ps
- "We show that boosting fits an additive logistic regression model by stagewise optimization of a criterion very similar to the log-likelihood, and present likelihood based alternatives. We also propose a multi-logit boosting procedure which appears to have advantages over other methods proposed so far."
30. Learning Discussion
- The set of weak classifiers is extraordinarily large
- one weak classifier for each distinct feature/threshold combination
- KN weak classifiers, where
- K = the number of features
- N = the number of examples
- Others have used even larger classifier sets
- Wrapper method
- selecting M weak classifiers: O(MNKN), about 10^16 operations
- AdaBoost
- O(MKN), about 10^11 operations
31. Learning Discussion
- Dependency on N?
- Suppose the examples are sorted by a given feature value.
- Any two thresholds that lie between the same pair of sorted examples are equivalent.
- Therefore the total number of distinct thresholds is N.
- For each feature, sort the examples based on feature value
- Compute the optimal threshold for that feature in a single pass over this sorted list.
- For each element in the list, compute
- T+ : the total sum of positive example weights
- T- : the total sum of negative example weights
- S+ : the sum of positive weights below the current example
- S- : the sum of negative weights below the current example
32. Learning Discussion
- The error of a threshold that splits the list at the current example is e = min( S+ + (T- - S-), S- + (T+ - S+) ),
- i.e., the minimum of the error of labeling all examples below the threshold negative and the error of labeling them all positive (see the sketch after this slide).
- The final application demanded a very aggressive process which would discard the vast majority of features.
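A sketch of the single-pass threshold search described on slides 31-32 for one feature; it returns the threshold, polarity, and weighted error of a decision stump of the form used above. It assumes f, y, w are numpy arrays; the names are illustrative.

import numpy as np

def best_threshold(f, y, w):
    # f: feature values, y: labels (0/1), w: example weights
    order = np.argsort(f)
    f, y, w = f[order], y[order], w[order]
    T_pos = w[y == 1].sum()          # total positive weight
    T_neg = w[y == 0].sum()          # total negative weight
    S_pos = S_neg = 0.0              # weight strictly below the current example
    best_theta, best_p, best_err = f[0], 1, float("inf")
    for i in range(len(f)):
        # error if everything below the threshold is labeled negative vs. positive
        err_neg_below = S_pos + (T_neg - S_neg)
        err_pos_below = S_neg + (T_pos - S_pos)
        err = min(err_neg_below, err_pos_below)
        if err < best_err:
            best_theta = f[i]
            best_p = 1 if err_pos_below <= err_neg_below else -1
            best_err = err
        if y[i] == 1:
            S_pos += w[i]
        else:
            S_neg += w[i]
    return best_theta, best_p, best_err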
33. Other feature selection
- Papageorgiou et al. (1998)
- feature selection based on feature variance
- 37 features selected out of 1734 for every image sub-window: still large
- Roth et al. (2000)
- feature selection based on the Winnow exponential perceptron learning rule
- a very large and unusual feature set: each pixel is mapped into a binary vector of d dimensions
- all n pixels are concatenated into an nd-dimensional vector
- a perceptron assigns a weight to each dimension
- the Winnow learning process
- converges to a solution where many of the weights are zero,
- yet a very large number of features are retained (perhaps a few hundred or thousand).
34. Learning Results
- A classifier constructed from 200 features yields reasonable results
- a false positive rate of 1 in 14,084
- For a face detector to be practical for real applications, the false positive rate must be closer to 1 in 1,000,000.
35. Learning Results
- Features selected by AdaBoost are meaningful and easily interpreted
- In terms of detection,
- the results are compelling but not sufficient for many real-world tasks.
- In terms of computation,
- it is very fast, requiring 0.7 seconds to scan a 384 by 288 pixel image.
36. Attentional Cascade
- Achieves increased detection performance while radically reducing computation time
- Construct boosted classifiers which
- reject many of the negative sub-windows while
- detecting almost all positive instances.
- Adjust each strong classifier's threshold to minimize false negatives (lower the threshold)
37. Attentional Cascade
- Further processing (for sub-windows not rejected by earlier stages) requires
- evaluating the rectangle features (between 6 and 9 array references per feature),
- computing the weak classifier for each feature (one threshold operation per feature), and
- combining the weak classifiers (one multiply per feature, an addition, and finally a threshold).
- (A sketch of how a sub-window flows through the cascade follows below.)
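Below is a sketch of how a single sub-window is pushed through the attentional cascade; each stage is assumed to be a (stumps, alphas, threshold) triple as in the boosting sketch above (an assumption about data layout, not the paper's exact structure).

def cascade_classify(features, stages):
    # features: precomputed rectangle-feature values for one sub-window
    # stages: list of (stumps, alphas, threshold) triples, simplest stage first
    for stumps, alphas, threshold in stages:
        score = sum(a * int(p * features[j] < p * theta)
                    for a, (j, theta, p) in zip(alphas, stumps))
        if score < threshold:   # stage threshold is lowered to keep ~all faces
            return 0            # rejected: later (more expensive) stages never run
    return 1                    # passed every stage: report a face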
38. Attentional Cascade
39. Trading speed for accuracy
- Given a nested set of classifier hypothesis classes
- Computational Risk Minimization
40. Training a Cascade of Classifiers
- Detection goals
- good detection rates (85-95%) and
- extremely low false positive rates (on the order of 10^-5 or 10^-6).
- False positive rate of the cascade: F = Π_{i=1..K} f_i
- Detection rate of the cascade: D = Π_{i=1..K} d_i
- To achieve a detection rate of 0.9 with a 10-stage classifier,
- each stage needs a detection rate of about 0.99 (0.99^10 ≈ 0.9),
- while a per-stage false positive rate as high as 0.30 still gives an overall rate of about 6 x 10^-6 (0.30^10 ≈ 6 x 10^-6; the short check below reproduces these numbers).
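A quick check of the arithmetic in the example above (the stage count and rates are the slide's illustrative numbers):

stages = 10
d_per_stage = 0.9 ** (1 / stages)   # ~0.9895, i.e. roughly 0.99 per stage
F_overall = 0.30 ** stages          # ~5.9e-6 overall false positive rate
print(d_per_stage, F_overall)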
41. Training a Cascade of Classifiers
- The expected number of features evaluated per sub-window is N = n_0 + Σ_{i=1..K} ( n_i Π_{j<i} p_j ), where n_i is the number of features in stage i and p_j is the positive (pass) rate of stage j.
- A simple scheme for trading off these errors is to adjust the threshold of the perceptron produced by AdaBoost.
42. Tradeoffs in Training
- Classifiers with more features
- achieve higher detection rates and lower false positive rates,
- but require more time to compute.
- An optimization framework in which
- the number of classifier stages,
- the number of features n_i of each stage, and
- the threshold of each stage
- are traded off in order to minimize the expected number of features N, given targets for F and D.
- Finding this optimum is a tremendously difficult problem.
43. Training the Cascaded Detector
- A simple framework produces an effective and efficient classifier.
- The user selects the maximum acceptable rate for f_i and the minimum acceptable rate for d_i.
- Each layer of the cascade is trained by AdaBoost, with the number of features increased until the target detection and false positive rates are met for that level.
- The rates are determined by testing the current detector on a validation set.
- If the overall target false positive rate is not yet met, another layer is added to the cascade.
- The negative set for training subsequent layers is obtained by collecting all false detections found by running the current detector on a set of images which do not contain any faces.
44. Training the Cascaded Detector
- The user selects f, the maximum acceptable false positive rate per layer, and d, the minimum acceptable detection rate per layer.
- The user selects the target overall false positive rate F_target.
- P = set of positive examples, N = set of negative examples
- F_0 = 1.0; D_0 = 1.0; i = 0
- while F_i > F_target
  - i <- i + 1
  - n_i = 0; F_i = F_{i-1}
  - while F_i > f * F_{i-1}
    - n_i <- n_i + 1
    - Use P and N to train a classifier with n_i features using AdaBoost
    - Evaluate the current cascaded classifier on the validation set to determine F_i and D_i
    - Decrease the threshold for the i-th classifier until the current cascaded classifier has a detection rate of at least d * D_{i-1} (this also affects F_i)
  - N <- empty set
  - If F_i > F_target
    - Evaluate the current cascaded detector on the set of non-face images and
    - put any false detections into the set N
- (A Python-style sketch of this loop, with placeholder helpers, follows below.)
45. Simple Experiment
- Compare a monolithic 200-feature classifier and
- a cascade of ten 20-feature classifiers
- Both trained using
- 5000 faces and 10,000 non-face sub-windows
46. (No transcript)
47. Simple Experiment
- A monolithic 200-feature classifier and
- a cascade of ten 20-feature classifiers,
- both trained using
- 5000 faces and 10,000 non-face sub-windows
- There is little difference between them in terms of accuracy,
- but the cascaded classifier is nearly 10 times faster,
- since its first stage throws out most non-faces so that they are never evaluated by subsequent stages.
48. Detector Cascade Discussion
- Similar to the fast variant of Rowley et al. (1998),
- which trained two neural networks
- one was moderately complex,
- focused on a small region of the image, and
- detected faces with a low false positive rate.
- the second neural network was much faster,
- focused on larger regions of the image, and
- detected faces with a higher false positive rate.
- This method
- generalizes the two-stage cascade to include 38 stages
49. Training Dataset
- 4916 hand-labeled faces, scaled and aligned to a base resolution of 24 by 24 pixels.
50. Structure of the Detector Cascade
- The 38-layer cascade of classifiers includes a total of 6060 features
- The first classifier is constructed using two features and
- rejects about 50% of non-faces while
- correctly detecting close to 100% of faces.
- The next classifier has ten features and
- rejects 80% of non-faces while
- detecting almost 100% of faces.
- The next two layers are 25-feature classifiers,
- then three 50-feature classifiers,
- then classifiers with a variety of different numbers of features, chosen according to the layer training procedure.
51. Speed of the Face Detector
- Speed is proportional to the average number of features computed per sub-window.
- On the MIT+CMU test set, an average of 9 features (out of 6061) are computed per sub-window.
- On a 700 MHz Pentium III, a 384x288 pixel image takes about 0.067 seconds to process (15 fps).
- Roughly 15 times faster than Rowley-Baluja-Kanade and 600 times faster than Schneiderman-Kanade.
52. Scanning the Detector
- Multiple scales
- scaling is achieved by scaling the detector itself, rather than scaling the image
- the features can be evaluated at any scale with the same cost
- Locations
- subsequent locations are obtained by shifting the window by some number of pixels Δ
- the choice of Δ affects both speed and accuracy (see the scanning sketch after this list)
- a step size > 1 pixel tends to
- decrease the detection rate slightly while also
- decreasing the number of false positives
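A sketch of the scanning loop: the window (and the shift between windows) grows with the scale, and a classify_window callable (e.g. wrapping the cascade sketch above) is applied at each location. The parameter names are illustrative; the values 24, 1.25, and 12 scales follow the numbers quoted earlier in the slides.

def scan_image(img_shape, classify_window,
               base=24, scale_step=1.25, n_scales=12, delta=1.0):
    # img_shape: (height, width); classify_window(r, c, win) -> 0/1
    H, W = img_shape
    detections = []
    s = 1.0
    for _ in range(n_scales):
        win = int(round(base * s))                  # scaled detector size
        step = max(1, int(round(delta * s)))        # shift also scales with s
        for r in range(0, H - win + 1, step):
            for c in range(0, W - win + 1, step):
                if classify_window(r, c, win):
                    detections.append((r, c, win))
        s *= scale_step
    return detections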
53. (No transcript)
54. Integration of Multiple Detections
- Post-processing: combine overlapping detections into a single detection
- The set of detections is first partitioned into disjoint subsets
- two detections are in the same subset if their bounding regions overlap.
- Each partition yields a single final detection:
- the corners of the final bounding region are the average of the corners of all detections in the set (see the sketch after this list).
- This decreases the number of false positives.
55. Integration of Multiple Detections
- A simple voting scheme further improves results
- three detectors performed similarly on the final task, but in some cases their errors were different.
- Retain only those detections where at least 2 out of the 3 detectors agree.
- This improves the final detection rate as well as eliminating more false positives.
- Since detector errors are not uncorrelated, the combination results in a measurable, but modest, improvement over the best single detector.
56. Sample results
- MIT+CMU test set
57. Failure Cases
- The detector is trained on frontal, upright faces.
- The faces were only very roughly aligned, so there is some variation in rotation both in plane and out of plane.
- It detects faces tilted up to about 15 degrees in plane and about 45 degrees out of plane (toward a profile view).
- The detector becomes unreliable with more rotation.
- Harsh backlighting, in which the faces are very dark while the background is relatively light, sometimes causes failures.
- A nonlinear variance normalization based on robust statistics could remove such outliers,
- but the problem with such a normalization is the greatly increased computational cost within the integral image framework.
- The detector fails on significantly occluded faces.
- Faces with occluded eyes usually fail.
- A face with a covered mouth will usually still be detected.
58. Summary (Viola-Jones)
- Fastest known face detector for grayscale images
- Three contributions with broad applicability
- a cascaded classifier yields rapid classification
- AdaBoost as an extremely efficient feature selector
- rectangle features + the integral image can be used for rapid image analysis
59. Face detector comparison
- Informal study by Andrew Gallagher, CMU, for CMU 16-721 Learning-Based Methods in Vision, Spring 2007
- The Viola-Jones algorithm (OpenCV implementation) was used (< 2 seconds per image).
- For Schneiderman and Kanade, "Object Detection Using the Statistics of Parts" (IJCV 2004), the www.pittpatt.com demo was used (10-15 seconds per image, including web transmission).
60. Schneiderman-Kanade vs. Viola-Jones