Title: Robust real-time face detection
1. Robust real-time face detection
- Paul A. Viola and Michael J. Jones
- Intl. J. Computer Vision
- 57(2), 137-154, 2004
- (originally in CVPR 2001)
- (slides adapted from Bill Freeman, MIT 6.869, April 2005)
2. Scan the classifier over locations and scales
3. Learn the classifier from data
- Training data
- 5000 faces (frontal)
- 10^8 non-faces
- Faces are normalized
- Scale, translation
- Many variations
- Across individuals
- Illumination
- Pose (rotation both in plane and out)
4. Characteristics of the Algorithm
- The feature set is huge (about 16M features)
- Efficient feature selection using AdaBoost
- New image representation: the Integral Image
- Cascaded classifier for rapid detection
- Fastest known frontal face detector for grayscale images
5. Integral Image
- Allows for fast feature evaluation
- Does not work directly on image intensities
- Computed from the image with a few operations per pixel
- (the features used are reminiscent of Haar basis functions)
6. Simple and Efficient Classifier
- Select a small number of important features from a huge library of potential features using AdaBoost (Freund and Schapire, 1995)
7. AdaBoost (Adaptive Boosting)
- Formulated by Yoav Freund and Robert Schapire.
- A meta-algorithm: it can be used in conjunction with many other learning algorithms to improve their performance.
- AdaBoost is adaptive
- subsequent classifiers are tweaked in favor of instances misclassified by previous classifiers.
- Sensitive to noisy data and outliers.
- Less susceptible to overfitting than most algorithms in some problems.
- Calls a weak classifier repeatedly in a series of T rounds.
- For each call
- a distribution of weights D_t is updated that indicates the importance of each example in the data set.
- On each round
- the weights of incorrectly classified examples are increased
- (or, alternatively, the weights of correctly classified examples are decreased),
- so the new classifier focuses more on those examples.
8. AdaBoost
- Given training examples (x_1, y_1), ..., (x_m, y_m) where y_i ∈ {-1, +1}
- Initialize the weight distribution D_1(i) = 1/m
- For t = 1, ..., T
- Find the classifier h_t that minimizes the error with respect to the distribution D_t
- ε_t = Σ_i D_t(i) [y_i ≠ h_t(x_i)] is the weighted error rate of classifier h_t
- If ε_t ≥ 1/2, then stop
- Choose α_t, typically α_t = (1/2) ln((1 - ε_t) / ε_t)
- Update D_{t+1}(i) = D_t(i) exp(-α_t y_i h_t(x_i)) / Z_t
- where Z_t is a normalization factor (chosen so that D_{t+1} sums to 1)
9. AdaBoost
- Output the final classifier H(x) = sign( Σ_{t=1..T} α_t h_t(x) )
- The update rule for the distribution D_t is constructed so that,
- after selecting an optimal classifier for the current distribution,
- examples that the classifier identified correctly are weighted less and
- examples that it identified incorrectly are weighted more.
- When the algorithm tests the classifiers on the updated distribution,
- it will select a classifier that better identifies those examples the previous classifier missed.
- (A minimal code sketch of this loop follows below.)
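To make the loop above concrete, here is a minimal Python/numpy sketch of generic AdaBoost as summarized on slides 8-9. It assumes the weak classifiers are supplied as a list of callables returning labels in {-1, +1}; the names (adaboost, weak_learners) are illustrative, not from the paper.

import numpy as np

def adaboost(X, y, weak_learners, T):
    # X: training inputs, y: labels in {-1, +1},
    # weak_learners: list of callables h(X) -> array of {-1, +1} predictions
    y = np.asarray(y)
    n = len(y)
    D = np.full(n, 1.0 / n)                       # D_1(i) = 1/n
    alphas, chosen = [], []
    for t in range(T):
        # find the weak classifier with minimum weighted error under D_t
        errors = [np.sum(D * (h(X) != y)) for h in weak_learners]
        best = int(np.argmin(errors))
        eps = errors[best]
        if eps >= 0.5:                            # no better than chance: stop
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)   # alpha_t
        pred = weak_learners[best](X)
        D = D * np.exp(-alpha * y * pred)         # up-weight mistakes, down-weight hits
        D = D / D.sum()                           # normalize (divide by Z_t)
        alphas.append(alpha)
        chosen.append(weak_learners[best])
    # final strong classifier H(x) = sign(sum_t alpha_t h_t(x))
    def strong(X_query):
        return np.sign(sum(a * h(X_query) for a, h in zip(alphas, chosen)))
    return strong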
10. Characteristics of the Algorithm
- The feature set is huge (about 16M features)
- Efficient feature selection using AdaBoost
- New image representation: the Integral Image
- Cascaded classifier for rapid detection
11. Cascaded Classifier
- Combines successively more complex classifiers in a cascade structure
- Dramatically increases the speed of the detector by
- focusing attention on promising regions of the image.
- Focus-of-attention approaches
- It is often possible to rapidly determine where in an image a face might occur (Tsotsos et al., 1995; Itti et al., 1998; Amit and Geman, 1999; Fleuret and Geman, 2001).
- More complex processing is reserved only for these promising regions.
- The key measure of such an approach is the false negative rate of the attentional process.
12. Cascaded Classifier
- Training process
- An extremely simple and efficient classifier is
- used as a supervised focus-of-attention operator.
- A face detection attentional operator can
- filter out over 50% of the image
- while preserving 99% of the faces (over a large dataset).
- This filter is exceedingly efficient:
- it can be evaluated in about 20 simple operations per location/scale.
13. Overview
- Features: their form and how to compute them
- Combining features to form a classifier: AdaBoost
- Constructing a cascade of classifiers
- Experimental results
- Discussion
14. Features
- The system uses features rather than image pixels
- Features act to encode ad-hoc domain knowledge that is difficult to learn from a finite quantity of training data
- Much faster than a pixel-based system
15. Image features
- Rectangle filters (Papageorgiou et al., 1998)
- Similar to Haar wavelets
- Differences between sums of pixels in adjacent rectangles
- About 160,000 rectangle features for a 200x200 image
16. Integral Image
- Partial sum: ii(x, y) = Σ_{x' ≤ x, y' ≤ y} i(x', y')
- The sum over any rectangle D needs only the four corner values: D = ii(4) + ii(1) - (ii(2) + ii(3)) (see the sketch after this slide)
- Also known as
- summed area tables (Crow, 1984)
- boxlets (Simard et al., 1998)
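A short numpy sketch of the integral image and the four-corner rectangle sum described above; the two-rectangle feature at the end shows how a Haar-like feature reduces to a handful of lookups. Function names are illustrative; an integer grayscale image is assumed.

import numpy as np

def integral_image(img):
    # summed-area table padded with a zero row/column so that
    # ii[y, x] = sum of img[:y, :x]
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, r, c, h, w):
    # sum over img[r:r+h, c:c+w] from four corner lookups:
    # D = ii(4) + ii(1) - ii(2) - ii(3)
    return ii[r + h, c + w] + ii[r, c] - ii[r, c + w] - ii[r + h, c]

def two_rect_feature(ii, r, c, h, w):
    # horizontal two-rectangle feature: difference between the sums of
    # two adjacent rectangles of equal size
    return rect_sum(ii, r, c, h, w) - rect_sum(ii, r, c + w, h, w)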
17. Huge library of filters
18. Feature Discussion
- Rectangle features are primitive when compared with alternatives such as steerable filters
- steerable filters are excellent for detailed analysis of boundaries, image compression, and texture analysis.
- Rectangle features are sensitive to the presence of edges, bars, and other simple image structure,
- but quite coarse: only three orientations (horizontal, vertical, diagonal)
- Overcomplete set (about 400x): varying aspect ratio and location
19. Computational Advantage
- The face detector scans the input at many scales
- starting at the base scale, faces are detected at a size of 24 x 24 pixels,
- then at 12 scales, each 1.25 times larger than the last
- a 384 x 288 pixel image is scanned at the top scale
- The conventional approach:
- compute a pyramid of 12 images (smaller and smaller images)
- scan a fixed-scale detector over each image.
- Computing the pyramid directly requires significant time:
- it takes around 0.05 seconds to compute a 12-level pyramid of this size (on an Intel PIII 700 MHz processor),
- even when implemented efficiently on conventional hardware (using bilinear interpolation to scale each level of the pyramid)
20. Computational Advantage
- Define a meaningful set of rectangle features
- a single feature can be evaluated at any scale and location in a few operations.
- Effective detectors can be constructed with as few as two rectangle features.
- Given the computational efficiency of these features,
- the face detection process can be completed for an entire image at every scale at 15 frames per second
- about the same time required to evaluate the 12-level image pyramid alone.
21. Learning Classification Functions
- Given the feature set and a training set, many machine learning methods could be used
- Mixture of Gaussians model (Sung and Poggio, 1998)
- Simple image features and a neural network (Rowley et al., 1998)
- Support Vector Machine (Osuna et al., 1997b)
- Winnow learning procedure (Roth et al., 2000)
- With 160,000 features per sub-window, even though each feature can be computed very efficiently, computing the complete set is prohibitively expensive
22. AdaBoost
- A very small number of features can be combined to form an effective classifier
- Boosts the classification performance
- by combining a collection of weak classification functions to form a stronger classifier
- Weak learner:
- we do not expect even the best classification function to classify the training data well
- After the first round of learning,
- examples are re-weighted in order to emphasize those which were incorrectly classified by the previous weak classifier.
- The final strong classifier
- takes the form of a perceptron: a weighted combination of weak classifiers followed by a threshold.
- The training error of the strong classifier approaches zero exponentially in the number of rounds.
23. AdaBoost
- Select a small set of good classification functions which nevertheless have significant variety
- i.e., select effective features which nevertheless have significant variety
- Restrict the weak learner to classification functions
- each of which depends on a single feature
- Select the single rectangle feature
- which best separates the positive and negative examples
- A weak classifier h_j(x) thus consists of a feature f_j, a threshold θ_j, and a polarity p_j indicating the direction of the inequality:
- h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and 0 otherwise, where x is a 24x24 sub-window (see the sketch after this slide)
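A one-function sketch of the weak classifier defined above (single feature, threshold, polarity); f_x would be the feature value computed from the integral image of the 24x24 sub-window. The names are hypothetical.

def weak_classify(f_x, theta, p):
    # h(x) = 1 if p * f(x) < p * theta, else 0
    # p = +1: faces have feature values below the threshold
    # p = -1: faces have feature values above the threshold
    return 1 if p * f_x < p * theta else 0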
24. AdaBoost
- No single feature can perform the classification task with low error
- Features selected early have error rates of 0.1-0.3
- Features selected later have error rates of 0.4-0.5
- Thresholded single features
- = single-node decision trees
- = "decision stumps"
25. Constructing the classifier
- A perceptron yields a sufficiently powerful classifier
- Use AdaBoost to efficiently choose the best features
- add a new h_i(x) at each round
- each h_i(x_k) is a decision stump
26. Constructing the Classifier
- For each round of boosting
- evaluate each rectangle filter on each example
- sort the examples by filter value
- select the best threshold for each filter (minimum error)
- use the sorting to quickly scan for the optimal threshold
- select the best filter/threshold combination
- the weight is a simple function of the error rate
- reweight the examples
- (There are many tricks to make this more efficient.)
27. AdaBoost using single rectangle features
- Given example images (x_1, y_1), ..., (x_n, y_n), where y_i = 0 for negative and y_i = 1 for positive examples
- Initialize weights w_{1,i} = 1/(2m) for negative and 1/(2l) for positive examples, where m and l are the numbers of negative and positive examples
- For t = 1, ..., T
- Normalize the weights: w_{t,i} <- w_{t,i} / Σ_j w_{t,j}
- Select the best weak classifier with respect to the weighted error: ε_t = min_{f, p, θ} Σ_i w_{t,i} |h(x_i, f, p, θ) - y_i|
- Define h_t(x) = h(x, f_t, p_t, θ_t), where f_t, p_t, and θ_t are the parameters that minimize ε_t
- Update the weights: w_{t+1,i} = w_{t,i} β_t^{1 - e_i}, where e_i = 0 if example x_i is classified correctly, e_i = 1 otherwise, and β_t = ε_t / (1 - ε_t)
28. AdaBoost using single rectangle features
- The final strong classifier is C(x) = 1 if Σ_{t=1..T} α_t h_t(x) ≥ (1/2) Σ_{t=1..T} α_t, and 0 otherwise, where α_t = log(1/β_t) (see the sketch after this slide)
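The following is a sketch of the boosting variant on slides 27-28, assuming the rectangle-feature values are precomputed into a matrix F (examples x features). find_best_stump is a placeholder for the weighted-error minimization; a single-pass version of it appears after slide 32.

import numpy as np

def viola_jones_boost(F, y, T, find_best_stump):
    # F: (n_examples, n_features) precomputed feature values
    # y: labels, 0 = non-face, 1 = face
    # find_best_stump(F, y, w) -> (feature index j, threshold theta, polarity p, error eps)
    y = np.asarray(y)
    m, l = np.sum(y == 0), np.sum(y == 1)
    w = np.where(y == 0, 1.0 / (2 * m), 1.0 / (2 * l))   # initial weights w_{1,i}
    stumps, alphas = [], []
    for t in range(T):
        w = w / w.sum()                                  # normalize the weights
        j, theta, p, eps = find_best_stump(F, y, w)      # minimize weighted error
        h = (p * F[:, j] < p * theta).astype(int)        # weak classifier outputs
        beta = eps / (1.0 - eps)
        e = (h != y).astype(int)                         # e_i = 0 if correct, else 1
        w = w * beta ** (1 - e)                          # shrink weights of correct examples
        stumps.append((j, theta, p))
        alphas.append(np.log(1.0 / beta))
    def strong(f_row):
        # C(x) = 1 if sum_t alpha_t h_t(x) >= 0.5 * sum_t alpha_t
        s = sum(a * int(p * f_row[j] < p * theta)
                for a, (j, theta, p) in zip(alphas, stumps))
        return 1 if s >= 0.5 * sum(alphas) else 0
    return strong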
29. Good Reference on Boosting
- Friedman, J., Hastie, T., and Tibshirani, R. "Additive Logistic Regression: a Statistical View of Boosting"
- http://www-stat.stanford.edu/hastie/Papers/boost.ps
- "We show that boosting fits an additive logistic regression model by stagewise optimization of a criterion very similar to the log-likelihood, and present likelihood based alternatives. We also propose a multi-logit boosting procedure which appears to have advantages over other methods proposed so far."
30. Learning Discussion
- The set of weak classifiers is extraordinarily large
- one weak classifier for each distinct feature/threshold combination
- KN weak classifiers, where
- K = the number of features
- N = the number of examples
- Others have used even larger classifier sets
- Wrapper method
- selecting M weak classifiers: O(MNKN), about 10^16 operations
- AdaBoost
- O(MKN), about 10^11 operations
31. Learning Discussion
- Dependency on N?
- Suppose the examples are sorted by a given feature value.
- Any two thresholds that lie between the same pair of sorted examples are equivalent.
- Therefore the total number of distinct thresholds is N.
- For each feature, sort the examples based on feature value
- Compute the optimal threshold for that feature in a single pass over this sorted list.
- For each element in the list, compute
- T+ : the total sum of positive example weights
- T- : the total sum of negative example weights
- S+ : the sum of positive weights below the current example
- S- : the sum of negative weights below the current example
32. Learning Discussion
- The error of a threshold that splits the list at the current example is e = min( S+ + (T- - S-), S- + (T+ - S+) ),
- i.e., the minimum of the error of labeling all examples below the threshold negative and the error of labeling them all positive (see the sketch after this slide).
- The final application demanded a very aggressive process which would discard the vast majority of features.
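A sketch of the single-pass threshold search described on slides 31-32 for one feature; it returns the threshold, polarity, and weighted error of a decision stump of the form used above. It assumes f, y, w are numpy arrays; the names are illustrative.

import numpy as np

def best_threshold(f, y, w):
    # f: feature values, y: labels (0/1), w: example weights
    order = np.argsort(f)
    f, y, w = f[order], y[order], w[order]
    T_pos = w[y == 1].sum()          # total positive weight
    T_neg = w[y == 0].sum()          # total negative weight
    S_pos = S_neg = 0.0              # weight strictly below the current example
    best_theta, best_p, best_err = f[0], 1, float("inf")
    for i in range(len(f)):
        # error if everything below the threshold is labeled negative vs. positive
        err_neg_below = S_pos + (T_neg - S_neg)
        err_pos_below = S_neg + (T_pos - S_pos)
        err = min(err_neg_below, err_pos_below)
        if err < best_err:
            best_theta = f[i]
            best_p = 1 if err_pos_below <= err_neg_below else -1
            best_err = err
        if y[i] == 1:
            S_pos += w[i]
        else:
            S_neg += w[i]
    return best_theta, best_p, best_err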
33. Other feature selection
- Papageorgiou et al. (1998)
- feature selection based on feature variance
- 37 features selected out of 1734 for every image sub-window: still large
- Roth et al. (2000)
- feature selection based on the Winnow exponential perceptron learning rule
- a very large and unusual feature set: each pixel is mapped into a binary vector of d dimensions
- all n pixels are concatenated into an nd-dimensional vector
- a perceptron assigns a weight to each dimension
- the Winnow learning process
- converges to a solution where many of the weights are zero,
- yet a very large number of features are retained (perhaps a few hundred or thousand).
34. Learning Results
- A classifier constructed from 200 features yields reasonable results
- a false positive rate of 1 in 14,084
- For a face detector to be practical for real applications, the false positive rate must be closer to 1 in 1,000,000.
35. Learning Results
- Features selected by AdaBoost are meaningful and easily interpreted
- In terms of detection,
- the results are compelling but not sufficient for many real-world tasks.
- In terms of computation,
- it is very fast, requiring 0.7 seconds to scan a 384 by 288 pixel image.
36. Attentional Cascade
- Achieves increased detection performance while radically reducing computation time
- Construct boosted classifiers which
- reject many of the negative sub-windows while
- detecting almost all positive instances.
- Adjust each strong classifier's threshold to minimize false negatives (lower the threshold)
37. Attentional Cascade
- Further processing (for sub-windows not rejected by earlier stages) requires
- evaluating the rectangle features (between 6 and 9 array references per feature),
- computing the weak classifier for each feature (one threshold operation per feature), and
- combining the weak classifiers (one multiply per feature, an addition, and finally a threshold).
- (A sketch of how a sub-window flows through the cascade follows below.)
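Below is a sketch of how a single sub-window is pushed through the attentional cascade; each stage is assumed to be a (stumps, alphas, threshold) triple as in the boosting sketch above (an assumption about data layout, not the paper's exact structure).

def cascade_classify(features, stages):
    # features: precomputed rectangle-feature values for one sub-window
    # stages: list of (stumps, alphas, threshold) triples, simplest stage first
    for stumps, alphas, threshold in stages:
        score = sum(a * int(p * features[j] < p * theta)
                    for a, (j, theta, p) in zip(alphas, stumps))
        if score < threshold:   # stage threshold is lowered to keep ~all faces
            return 0            # rejected: later (more expensive) stages never run
    return 1                    # passed every stage: report a face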
38. Attentional Cascade
39. Trading speed for accuracy
- Given a nested set of classifier hypothesis classes
- Computational Risk Minimization
40. Training a Cascade of Classifiers
- Detection goals
- good detection rates (85-95%) and
- extremely low false positive rates (on the order of 10^-5 or 10^-6).
- False positive rate of the cascade: F = Π_{i=1..K} f_i
- Detection rate of the cascade: D = Π_{i=1..K} d_i
- To achieve a detection rate of 0.9 with a 10-stage classifier,
- each stage needs a detection rate of about 0.99 (0.99^10 ≈ 0.9),
- while a per-stage false positive rate as high as 0.30 still gives an overall rate of about 6 x 10^-6 (0.30^10 ≈ 6 x 10^-6; the short check below reproduces these numbers).
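A quick check of the arithmetic in the example above (the stage count and rates are the slide's illustrative numbers):

stages = 10
d_per_stage = 0.9 ** (1 / stages)   # ~0.9895, i.e. roughly 0.99 per stage
F_overall = 0.30 ** stages          # ~5.9e-6 overall false positive rate
print(d_per_stage, F_overall)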
41. Training a Cascade of Classifiers
- The expected number of features evaluated per sub-window is N = n_0 + Σ_{i=1..K} ( n_i Π_{j<i} p_j ), where n_i is the number of features in stage i and p_j is the positive (pass) rate of stage j.
- A simple scheme for trading off these errors is to adjust the threshold of the perceptron produced by AdaBoost.
42. Tradeoffs in Training
- Classifiers with more features
- achieve higher detection rates and lower false positive rates,
- but require more time to compute.
- An optimization framework in which
- the number of classifier stages,
- the number of features n_i of each stage, and
- the threshold of each stage
- are traded off in order to minimize the expected number of features N, given targets for F and D.
- Finding this optimum is a tremendously difficult problem.
43. Training the Cascaded Detector
- A simple framework produces an effective and efficient classifier.
- The user selects the maximum acceptable rate for f_i and the minimum acceptable rate for d_i.
- Each layer of the cascade is trained by AdaBoost, with the number of features increased until the target detection and false positive rates are met for that level.
- The rates are determined by testing the current detector on a validation set.
- If the overall target false positive rate is not yet met, another layer is added to the cascade.
- The negative set for training subsequent layers is obtained by collecting all false detections found by running the current detector on a set of images which do not contain any faces.
44. Training the Cascaded Detector
- The user selects f, the maximum acceptable false positive rate per layer, and d, the minimum acceptable detection rate per layer.
- The user selects the target overall false positive rate F_target.
- P = set of positive examples, N = set of negative examples
- F_0 = 1.0; D_0 = 1.0; i = 0
- while F_i > F_target
  - i <- i + 1
  - n_i = 0; F_i = F_{i-1}
  - while F_i > f * F_{i-1}
    - n_i <- n_i + 1
    - Use P and N to train a classifier with n_i features using AdaBoost
    - Evaluate the current cascaded classifier on the validation set to determine F_i and D_i
    - Decrease the threshold for the i-th classifier until the current cascaded classifier has a detection rate of at least d * D_{i-1} (this also affects F_i)
  - N <- empty set
  - If F_i > F_target
    - Evaluate the current cascaded detector on the set of non-face images and
    - put any false detections into the set N
- (A Python-style sketch of this loop, with placeholder helpers, follows below.)
45. Simple Experiment
- Compare a monolithic 200-feature classifier and
- a cascade of ten 20-feature classifiers
- Both trained using
- 5000 faces and 10,000 non-face sub-windows
46. (No transcript)
47. Simple Experiment
- A monolithic 200-feature classifier and
- a cascade of ten 20-feature classifiers,
- both trained using
- 5000 faces and 10,000 non-face sub-windows
- There is little difference between them in terms of accuracy,
- but the cascaded classifier is nearly 10 times faster,
- since its first stage throws out most non-faces so that they are never evaluated by subsequent stages.
48. Detector Cascade Discussion
- Similar to the fast variant of Rowley et al. (1998),
- which trained two neural networks
- one was moderately complex,
- focused on a small region of the image, and
- detected faces with a low false positive rate.
- the second neural network was much faster,
- focused on larger regions of the image, and
- detected faces with a higher false positive rate.
- This method
- generalizes the two-stage cascade to include 38 stages
49. Training Dataset
- 4916 hand-labeled faces, scaled and aligned to a base resolution of 24 by 24 pixels.
50. Structure of the Detector Cascade
- The 38-layer cascade of classifiers includes a total of 6060 features
- The first classifier is constructed using two features and
- rejects about 50% of non-faces while
- correctly detecting close to 100% of faces.
- The next classifier has ten features and
- rejects 80% of non-faces while
- detecting almost 100% of faces.
- The next two layers are 25-feature classifiers,
- then three 50-feature classifiers,
- then classifiers with a variety of different numbers of features, chosen according to the layer training procedure.
51. Speed of the Face Detector
- Speed is proportional to the average number of features computed per sub-window.
- On the MIT+CMU test set, an average of 9 features (out of 6061) are computed per sub-window.
- On a 700 MHz Pentium III, a 384x288 pixel image takes about 0.067 seconds to process (15 fps).
- Roughly 15 times faster than Rowley-Baluja-Kanade and 600 times faster than Schneiderman-Kanade.
52. Scanning the Detector
- Multiple scales
- scaling is achieved by scaling the detector itself, rather than scaling the image
- the features can be evaluated at any scale with the same cost
- Locations
- subsequent locations are obtained by shifting the window by some number of pixels Δ
- the choice of Δ affects both speed and accuracy (see the scanning sketch after this list)
- a step size > 1 pixel tends to
- decrease the detection rate slightly while also
- decreasing the number of false positives
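A sketch of the scanning loop: the window (and the shift between windows) grows with the scale, and a classify_window callable (e.g. wrapping the cascade sketch above) is applied at each location. The parameter names are illustrative; the values 24, 1.25, and 12 scales follow the numbers quoted earlier in the slides.

def scan_image(img_shape, classify_window,
               base=24, scale_step=1.25, n_scales=12, delta=1.0):
    # img_shape: (height, width); classify_window(r, c, win) -> 0/1
    H, W = img_shape
    detections = []
    s = 1.0
    for _ in range(n_scales):
        win = int(round(base * s))                  # scaled detector size
        step = max(1, int(round(delta * s)))        # shift also scales with s
        for r in range(0, H - win + 1, step):
            for c in range(0, W - win + 1, step):
                if classify_window(r, c, win):
                    detections.append((r, c, win))
        s *= scale_step
    return detections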
53. (No transcript)
54. Integration of Multiple Detections
- Post-processing: combine overlapping detections into a single detection
- The set of detections is first partitioned into disjoint subsets
- two detections are in the same subset if their bounding regions overlap.
- Each partition yields a single final detection:
- the corners of the final bounding region are the average of the corners of all detections in the set (see the sketch after this list).
- This decreases the number of false positives.
55. Integration of Multiple Detections
- A simple voting scheme further improves results
- three detectors performed similarly on the final task, but in some cases their errors were different.
- Retain only those detections where at least 2 out of the 3 detectors agree.
- This improves the final detection rate as well as eliminating more false positives.
- Since detector errors are not uncorrelated, the combination results in a measurable, but modest, improvement over the best single detector.
56. Sample results
- MIT+CMU test set
57. Failure Cases
- The detector is trained on frontal, upright faces.
- The faces were only very roughly aligned, so there is some variation in rotation both in plane and out of plane.
- It detects faces tilted up to about 15 degrees in plane and about 45 degrees out of plane (toward a profile view).
- The detector becomes unreliable with more rotation.
- Harsh backlighting, in which the faces are very dark while the background is relatively light, sometimes causes failures.
- A nonlinear variance normalization based on robust statistics could remove such outliers,
- but the problem with such a normalization is the greatly increased computational cost within the integral image framework.
- The detector fails on significantly occluded faces.
- Faces with occluded eyes usually fail.
- A face with a covered mouth will usually still be detected.
58. Summary (Viola-Jones)
- Fastest known face detector for grayscale images
- Three contributions with broad applicability
- a cascaded classifier yields rapid classification
- AdaBoost as an extremely efficient feature selector
- rectangle features + the integral image can be used for rapid image analysis
59. Face detector comparison
- Informal study by Andrew Gallagher, CMU, for CMU 16-721 Learning-Based Methods in Vision, Spring 2007
- The Viola-Jones algorithm (OpenCV implementation) was used (< 2 seconds per image).
- For Schneiderman and Kanade, "Object Detection Using the Statistics of Parts" (IJCV 2004), the www.pittpatt.com demo was used (10-15 seconds per image, including web transmission).
60. Schneiderman-Kanade vs. Viola-Jones