1
Robust real-time face detection
  • Paul A. Viola and Michael J. Jones
  • Intl. J. Computer Vision
  • 57(2), 137-154, 2004
  • (originally in CVPR 2001)
  • (slides adapted from Bill Freeman, MIT 6.869,
    April 2005)

2
Scan classifier over locations and scales
3
Learn classifier from data
  • Training data
  • 5,000 faces (frontal)
  • 10^8 non-faces
  • Faces are normalized
  • Scale, translation
  • Many variations
  • Across individuals
  • Illumination
  • Pose (rotation both in plane and out of plane)

4
Characteristics of Algorithm
  • Feature set is huge (about 16M features)
  • Efficient feature selection using AdaBoost
  • New image representation Integral Image
  • Cascaded Classifier for rapid detection
  • Fastest known frontal face detector for gray-scale
    images

5
Integral Image
  • Allows for fast feature evaluation
  • Does not work directly on image intensities
  • Compute the integral image with a few operations per
    pixel (features are similar to Haar basis functions)

6
Simple and Efficient Classifier
  • Select a small number of important features from
    a huge library of potential features using
    AdaBoost [Freund and Schapire, 1995]

7
AdaBoost, Adaptive Boosting
  • Formulated by Yoav Freund and Robert Schapire.
  • It is a meta-algorithm that can be used in
    conjunction with many other learning algorithms
    to improve their performance.
  • AdaBoost is adaptive
  • subsequent classifiers are tweaked in favor of
    instances misclassified by previous classifiers.
  • Sensitive to noisy data and outliers.
  • Less susceptible to the overfitting problem than
    most algorithms in some problems.
  • Calls a weak learner repeatedly in a series of
    rounds t = 1, ..., T.
  • For each call
  • a distribution of weights Dt is updated that
    indicates the importance of examples in the data
    set
  • On each round,
  • the weights of each incorrectly classified
    example are increased
  • (or, alternatively, the weights of correctly
    classified examples are decreased)
  • so that the new classifier focuses more on those
    examples

8
AdaBoost
  • Given (x_1, y_1), ..., (x_m, y_m), where y_i ∈ {-1, +1}
  • Initialize D_1(i) = 1/m
  • For t = 1, ..., T
  • Find the classifier h_t that minimizes the error with
    respect to the distribution D_t
  • ε_t = Σ_i D_t(i) [y_i ≠ h_t(x_i)] is the weighted error
    rate of classifier h_t
  • If ε_t ≥ 1/2, then stop
  • Choose α_t, typically α_t = ½ ln((1 - ε_t) / ε_t)
  • Update D_{t+1}(i) = D_t(i) exp(-α_t y_i h_t(x_i)) / Z_t
  • where Z_t is a normalization factor (chosen so that
    D_{t+1} will sum to 1)
  • (a minimal code sketch of this loop follows below)

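As a concrete illustration of the update rule above, here is a minimal
NumPy sketch of the boosting loop. It is not the authors' implementation;
the weak learners are assumed to be arbitrary callables mapping examples
to {-1, +1}, and all names are illustrative.

import numpy as np

def adaboost(X, y, weak_learners, T):
    # X: (m, d) examples; y: labels in {-1, +1};
    # weak_learners: candidate classifiers, each a callable X -> {-1, +1}
    m = len(y)
    D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
    chosen, alphas = [], []
    for t in range(T):
        # pick the weak classifier with the smallest weighted error eps_t
        errors = [np.sum(D * (h(X) != y)) for h in weak_learners]
        best = int(np.argmin(errors))
        eps = max(errors[best], 1e-12)
        if eps >= 0.5:                           # no better than chance: stop
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)  # typical choice of alpha_t
        pred = weak_learners[best](X)
        D = D * np.exp(-alpha * y * pred)        # raise weight of mistakes
        D /= D.sum()                             # Z_t: renormalize D_{t+1}
        chosen.append(weak_learners[best])
        alphas.append(alpha)
    return chosen, alphas

def strong_classify(X, chosen, alphas):
    # final classifier: sign of the alpha-weighted vote
    return np.sign(sum(a * h(X) for h, a in zip(chosen, alphas)))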
9
AdaBoost
  • Output the final classifier
  • H(x) = sign( Σ_t α_t h_t(x) )
  • The equation to update the distribution D_t is
    constructed so that
  • exp(-α_t y_i h_t(x_i)) is less than 1 when the example
    is classified correctly and greater than 1 otherwise
  • After selecting an optimal classifier h_t for the
    distribution D_t
  • Examples that the classifier identified correctly
    are weighted less
  • Examples that it identified incorrectly are
    weighted more.
  • When the algorithm is testing the classifiers on
    the distribution
  • it will select a classifier that better
    identifies those examples that the previous
    classifier missed.

10
Characteristics of Algorithm
  • Feature set is huge (about 16M features)
  • Efficient feature selection using AdaBoost
  • New image representation Integral Image
  • Cascaded Classifier for rapid detection

11
Cascaded Classifier
  • Combining successively more complex classifiers
    in a cascade structure
  • Dramatically increases the speed of the detector
    by
  • Focusing attention on promising regions of the
    image.
  • Focus of attention approaches
  • It is often possible to rapidly determine where
    in an image a face might occur (Tsotsos et al.,
    1995; Itti et al., 1998; Amit and Geman, 1999;
    Fleuret and Geman, 2001).
  • More complex processing is reserved only for
    these promising regions.
  • The key measure of such an approach is the false
    negative rate of the attentional process.

12
Cascaded Classifier
  • Training process
  • An extremely simple and efficient classifier
  • Used as a supervised focus of attention
    operator.
  • A face detection attentional operator
  • Filters out over 50% of the image
  • Preserving 99% of the faces (over a large dataset)
  • This filter is exceedingly efficient
  • it can be evaluated in 20 simple operations per
    location/scale

13
Overview
  • Features: form and computation
  • Combining features to form a classifier: AdaBoost
  • Constructing cascade of classifiers
  • Experimental results
  • Discussions

14
Features
  • Using features rather than image pixels
  • Features act to encode ad-hoc domain knowledge
    that is difficult to learn using a finite
    quantity of training data
  • Much faster than a pixel-based system

15
Image features
  • Rectangle filters [Papageorgiou et al., 1998]
  • Similar to Haar wavelets
  • Differences between sums of pixels in adjacent
    rectangles
  • About 160,000 rectangle features for a 200x200
    image

16
Integral Image
  • Partial sums: ii(x, y) = Σ_{x'≤x, y'≤y} i(x', y')
  • Any rectangle sum is
  • D = 1 + 4 - (2 + 3), where 1-4 are the integral-image
    values at the rectangle's corner points
  • Also known as
  • summed-area tables [Crow, 1984]
  • boxlets [Simard et al., 1998]
  • (a small code sketch follows below)

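The bullets above compress into the following small NumPy sketch
(illustrative, not the paper's code): the integral image is a summed-area
table, and any rectangle sum then takes four array references.

import numpy as np

def integral_image(img):
    # ii[y, x] = sum of img over all rows < y and columns < x;
    # the extra leading row/column of zeros makes lookups uniform
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    # D = 1 + 4 - (2 + 3): four corner lookups in the integral image
    bottom, right = top + height, left + width
    return ii[bottom, right] + ii[top, left] - ii[top, right] - ii[bottom, left]

# A two-rectangle (edge-like) feature is the difference of two such sums.
img = np.random.rand(24, 24)
ii = integral_image(img)
feature_value = rect_sum(ii, 0, 0, 12, 24) - rect_sum(ii, 12, 0, 12, 24)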
17
Huge library of filters
18
Feature Discussion
  • Primitive when compared with steerable filters,
    etc.
  • (Steerable filters are excellent for the detailed
    analysis of boundaries, image compression, and texture
    analysis.)
  • Rectangle features are sensitive to the presence of
    edges, bars, and other simple image structure
  • but quite coarse: only three orientations (vertical,
    diagonal, horizontal)
  • Overcomplete (by a factor of about 400): aspect ratio,
    location

19
Computational Advantage
  • Face detector scans the input at many scales
  • starting at the base scale, detecting faces at a size
    of 24 x 24 pixels
  • then at 12 scales, each 1.25 times larger than the last
  • A 384 x 288 pixel image is scanned at these scales
  • The conventional approach
  • Compute a pyramid of 12 images (smaller and
    smaller images)
  • A fixed-scale detector is scanned over each image
  • Computing the pyramid directly requires
    significant time
  • It takes around 0.05 seconds to compute a 12-level
    pyramid of this size (on an Intel PIII 700 MHz
    processor)
  • Implemented efficiently on conventional hardware
    (using bilinear interpolation to scale each level
    of the pyramid)

20
Computational Advantage
  • Define a meaningful set of rectangle features
  • A single feature can be evaluated at any scale
    and location in a few operations.
  • An effective detector can be constructed with as few
    as two rectangle features.
  • Computational efficiency of features
  • Face detection process can be completed for an
    entire image at every scale at 15 frames per
    second
  • About the same time required to evaluate the 12-level
    image pyramid alone.

21
Learning Classification Functions
  • Any machine learning methods
  • Given the feature set and training set
  • Mixture of Gaussian model (Sung and Poggio, 1998)
  • Simple image features and neural network (Rowley
    et al., 1998)
  • Support Vector Machine (Osuna et al., 1997b)
  • Winnow learning procedure (Roth et al., 2000)

160,000 features: even though each feature can be
computed very efficiently, computing the complete
set is prohibitively expensive
22
AdaBoost
  • A very small number of features can be combined
    to form an effective classifier
  • Boost the classification performance
  • Combining a collection of weak classification
    functions to form a stronger classifier
  • Weak learner
  • Do not expect even the best classification
    function to classify the training data well
  • After the first round of learning
  • Examples are re-weighted in order to emphasize
    those which were incorrectly classified by the
    previous weak classifier.
  • The final strong classifier
    takes the form of a perceptron, a weighted
    combination of weak classifiers followed by a
    threshold.

Training error of the strong classifier
approaches zero exponentially in the number of
rounds
23
AdaBoost
  • Select a small set of good classification
    functions that nevertheless have significant variety
  • i.e., select effective features which nevertheless
    have significant variety
  • Restrict the weak learner to classification
    functions
  • Each function depends on a single feature
  • Select the single rectangle feature
  • which best separates the positive and negative
    examples

The weak classifier applied to a 24x24 sub-window x has
the form h_j(x) = 1 if p_j f_j(x) < p_j θ_j, else 0, where
f_j is a rectangle feature, θ_j a threshold, and p_j a
polarity indicating the direction of the inequality.
24
AdaBoost
  • No single feature can perform the classification
    task with low error
  • Features selected early: error rates 0.1-0.3
  • Features selected later: error rates 0.4-0.5
  • Threshold single features
  • Single node decision trees
  • Decision stumps

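A thresholded single feature is exactly a one-line decision stump. The
sketch below is an illustrative form of the weak classifier described
above, h(x) = 1 if p·f(x) < p·θ; the names are assumptions, not the
authors' code.

def weak_classifier(feature_value, threshold, polarity):
    # polarity (+1 or -1) selects the direction of the inequality:
    # predict "face" (1) when polarity * f(x) < polarity * threshold
    return 1 if polarity * feature_value < polarity * threshold else 0

# e.g. weak_classifier(feature_value, threshold=0.4, polarity=+1)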
25
Constructing the classifier
  • Perceptron yields a sufficiently powerful
    classifier
  • Use AdaBoost to efficiently choose best features
  • add a new h_i(x) at each round
  • each h_i(x_k) is a decision stump

26
Constructing the Classifier
  • For each round of boosting
  • Evaluate each rectangle filter on each example
  • Sort examples by filter values
  • Select best threshold for each filter (min error)
  • Use sorting to quickly scan for optimal threshold
  • Select best filter/threshold combination
  • Weight is a simple function of error rate
  • Reweight examples
  • (There are many tricks to make this more
    efficient.)

27
AdaBoost using single rectangular feature
  • Given example images (x_1, y_1), ..., (x_n, y_n), where
    y_i = 0 for negative and y_i = 1 for positive examples
  • Initialize weights w_{1,i} = 1/(2m) for negatives and
    1/(2l) for positives, where m and l are the numbers of
    negative and positive examples
  • For t = 1, ..., T
  • Normalize the weights: w_{t,i} ← w_{t,i} / Σ_j w_{t,j}
  • Select the best classifier with respect to the
    weighted error: ε_t = min_{f,p,θ} Σ_i w_i |h(x_i, f, p, θ) - y_i|
  • Define h_t(x) = h(x, f_t, p_t, θ_t), with f_t, p_t, θ_t the
    parameters minimizing ε_t
  • Update weights: w_{t+1,i} = w_{t,i} β_t^{1-e_i}, where
    e_i = 0 if example x_i is classified correctly, e_i = 1
    otherwise, and β_t = ε_t / (1 - ε_t)

28
AdaBoost using single rectangular feature
  • The final strong classifier is
  • C(x) = 1 if Σ_t α_t h_t(x) ≥ ½ Σ_t α_t, and 0 otherwise,
    where α_t = log(1 / β_t)
  • (a small evaluation sketch follows below)

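A minimal sketch of evaluating this strong classifier, assuming the
boosting stage has produced, for each selected feature, a threshold,
polarity, and beta value (names illustrative):

import math

def strong_classifier(feature_values, stumps):
    # stumps: list of (threshold, polarity, beta), one per selected feature;
    # feature_values: the matching rectangle-feature values for one window
    vote, total_alpha = 0.0, 0.0
    for f, (theta, p, beta) in zip(feature_values, stumps):
        alpha = math.log(1.0 / beta)              # alpha_t = log(1 / beta_t)
        h = 1 if p * f < p * theta else 0         # weak classifier h_t(x)
        vote += alpha * h
        total_alpha += alpha
    return 1 if vote >= 0.5 * total_alpha else 0  # 1 = face, 0 = non-face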
29
Good Reference on Boosting
  • Friedman, J., Hastie, T. and Tibshirani, R.,
    Additive Logistic Regression: a Statistical View
    of Boosting
  • http://www-stat.stanford.edu/~hastie/Papers/boost.ps
  • We show that boosting fits an additive logistic
    regression model by stagewise optimization of a
    criterion very similar to the log-likelihood, and
    present likelihood based alternatives. We also
    propose a multi-logit boosting procedure which
    appears to have advantages over other methods
    proposed so far.

30
Learning Discussion
  • The set of weak classifiers is extraordinarily
    large
  • One weak classifier for each distinct
    feature/threshold combination
  • K·N weak classifiers
  • K: the number of features
  • N: the number of examples
  • Others have larger classifier sets
  • Wrapper method
  • M weak classifiers: O(M·N·K·N) ≈ 10^16 operations
  • AdaBoost
  • O(M·K·N) ≈ 10^11 operations

31
Learning Discussion
  • Dependency on N?
  • Suppose that the examples are sorted by a given
    feature value.
  • Any two thresholds that lie between the same pair
    of sorted examples are equivalent
  • Therefore the total number of distinct thresholds
    is N
  • For each feature, sort the examples based on
    feature value
  • Compute optimal threshold for that feature in a
    single pass over this sorted list.
  • For each element in the list, compute
  • the total sum of positive example weights, T+
  • the total sum of negative example weights, T-
  • the sum of positive weights below the current
    example, S+
  • the sum of negative weights below the current
    example, S-

32
Learning Discussion
  • Error of a threshold that splits the sorted list at
    the current example:
  • e = min( S+ + (T- - S-), S- + (T+ - S+) )
  • (a single-pass code sketch follows below)
  • The final application demanded a very aggressive
    process which would discard the vast majority of
    features.

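The single-pass threshold search described on the last two slides might
look like the following sketch (illustrative only; labels are 1 for faces
and 0 for non-faces):

import numpy as np

def best_threshold(feature_values, labels, weights):
    # Sort examples by feature value, then sweep once, maintaining the
    # running sums S+ / S- against the totals T+ / T-.
    order = np.argsort(feature_values)
    f, y, w = feature_values[order], labels[order], weights[order]
    T_pos, T_neg = w[y == 1].sum(), w[y == 0].sum()
    S_pos = S_neg = 0.0
    best = (np.inf, None, 1)                      # (error, threshold, polarity)
    for i in range(len(f)):
        # error of a threshold placed just below example i
        err_pos_below = S_neg + (T_pos - S_pos)   # everything below called "face"
        err_neg_below = S_pos + (T_neg - S_neg)   # everything below called "non-face"
        err = min(err_pos_below, err_neg_below)
        if err < best[0]:
            polarity = 1 if err_pos_below < err_neg_below else -1
            best = (err, f[i], polarity)
        if y[i] == 1:                             # now include example i in S+/S-
            S_pos += w[i]
        else:
            S_neg += w[i]
    return best                                   # (error, threshold, polarity)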
33
Other feature selection
  • Papageorgiou et al., 1998
  • Feature selection based on feature variance
  • 37 features selected out of 1,734, evaluated for every
    image sub-window (still a large number)
  • Roth et al., 2000
  • Feature selection process based on the Winnow
    exponential perceptron learning rule
  • A very large and unusual feature set: each pixel
    is mapped into a binary vector of d dimensions
  • These vectors are concatenated over all n pixels into
    an n·d-dimensional binary vector
  • A perceptron assigns a weight to each dimension
  • Winnow learning process
  • Converges to a solution where many of the weights
    are zero
  • Still, a very large number of features is retained
    (perhaps a few hundred or thousand)

34
Learning Results
  • A classifier constructed from 200 features yields
    reasonable results: a detection rate of 95% with a
    false positive rate of 1 in 14,084

For a face detector to be practical for real
applications, the false positive rate must be
closer to 1 in 1,000,000.
35
Learning Results
  • Features selected by AdaBoost are meaningful and
    easily interpreted
  • In terms of detection
  • Results are compelling but not sufficient for
    many real-world tasks.
  • In terms of computation
  • Very fast, requiring 0.7 seconds to scan a 384
    by 288 pixel image.

36
Attentional Cascade
  • Achieves increased detection performance while
    radically reducing computation time
  • Construct boosted classifiers that
  • reject many of the negative sub-windows, while
  • detecting almost all positive instances
  • Adjust the strong classifier threshold to
    minimize false negatives (i.e., lower the threshold)

37
Attentional Cascade
  • Further processing
  • Evaluate the rectangle features (requires between
    6 and 9 array references per feature).
  • Compute the weak classifier for each feature
    (requires one threshold operation per feature)
  • Combine the weak classifiers (requires one
    multiply per feature, an addition, and finally a
    threshold).

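Putting the steps on this slide together, classifying one sub-window
against a cascade might look like the sketch below (a hypothetical data
layout, not the paper's code): the window is rejected at the first stage
it fails, so most windows only ever touch the first, very cheap stages.

def cascade_classify(window_features, stages):
    # stages: list of (stumps, threshold_fraction); stumps are
    # (theta, polarity, alpha) triples; window_features[i] holds the
    # rectangle-feature values stage i needs for the current window
    for feats, (stumps, threshold_fraction) in zip(window_features, stages):
        vote, total_alpha = 0.0, 0.0
        for f, (theta, p, alpha) in zip(feats, stumps):
            vote += alpha * (1 if p * f < p * theta else 0)
            total_alpha += alpha
        # the stage threshold is lowered below 0.5 to keep false negatives rare
        if vote < threshold_fraction * total_alpha:
            return 0            # rejected: later stages are never evaluated
    return 1                    # passed every stage: report a face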
38
Attentional Cascade
  • Subsequent classifiers are trained using those
    examples which pass through all the previous stages

39
Trading speed for accuracy
  • Given a nested set of classifier hypothesis
    classes
  • Computational Risk Minimization

40
Training a Cascade of Classifiers
  • Detection Goals
  • Good detection rates (85-95%) and
  • extremely low false positive rates (on the order
    of 10^-5 or 10^-6)
  • False positive rate of the cascade: F = Π_i f_i
  • Detection rate of the cascade: D = Π_i d_i
  • To achieve a detection rate of 0.9 with a 10-stage
    classifier
  • each stage needs a detection rate of 0.99
    (since 0.99^10 ≈ 0.9)
  • while a per-stage false positive rate of about 30%
    still gives F ≈ 0.30^10 ≈ 6 x 10^-6

41
Training a Cascade of Classifiers
  • The expected number of features evaluated per
    sub-window: N = n_0 + Σ_i ( n_i Π_{j<i} p_j ), where n_i
    is the number of features in stage i and p_j is the
    positive (pass) rate of stage j
  • The simplest scheme for trading off these errors is
    to adjust the threshold of the perceptron produced by
    AdaBoost
  • (a small sketch of the expected-features formula
    follows below)

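A tiny sketch of the expected-feature count, with made-up per-stage
numbers (not the paper's cascade), just to show how a cheap, permissive
first stage keeps the average cost per sub-window low:

def expected_features(stage_sizes, pass_rates):
    # N = n_0 + sum_i ( n_i * prod_{j<i} p_j ), where p_j is the fraction
    # of sub-windows that pass stage j
    N, reach = 0.0, 1.0          # 'reach' = probability of reaching stage i
    for n_i, p_i in zip(stage_sizes, pass_rates):
        N += n_i * reach
        reach *= p_i
    return N

# e.g. a 2-feature first stage passing 50% of windows, then 10, 25, 50 features
print(expected_features([2, 10, 25, 50], [0.5, 0.2, 0.1, 0.05]))   # = 10.0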
42
Tradeoffs in Training
  • Classifiers with more features
  • Achieve higher detection rates and lower false
    positive rates.
  • require more time to compute
  • An optimization framework in which
  • the number of classifier stages,
  • the number of features, n_i, of each stage,
  • the threshold of each stage
  • are traded off in order to minimize the expected
    number of features N given a target for F and D.
  • Finding this optimum is a tremendously difficult
    problem.

43
Training Cascaded Detector
  • A simple framework to produce effective and
    efficient classifier
  • The user selects the maximum acceptable rate for
    f_i and the minimum acceptable rate for d_i.
  • Each layer of the cascade is trained by AdaBoost
    with the number of features used being increased
    until the target detection and false positive
    rates are met for this level.
  • The rates are determined by testing the current
    detector on a validation set.
  • If the overall target false positive rate is not
    yet met then another layer is added to the
    cascade.
  • The negative set for training subsequent layers
    is obtained by collecting all false detections
    found by running the current detector on a set of
    images which do not contain any instances of
    faces.

44
Training Cascaded Detector
  • User selects values for f, the maximum
    acceptable false positive rate per layer, and d,
    the minimum acceptable detection rate per layer
  • User selects the target overall false positive
    rate, F_target
  • P = set of positive examples, N = set of
    negative examples
  • F_0 = 1.0; D_0 = 1.0; i = 0
  • while F_i > F_target
  •   i ← i + 1
  •   n_i = 0; F_i = F_{i-1}
  •   while F_i > f x F_{i-1}
  •     n_i ← n_i + 1
  •     Use P and N to train a classifier with n_i
        features using AdaBoost
  •     Evaluate the current cascaded classifier on a
        validation set to determine F_i and D_i
  •     Decrease the threshold for the ith classifier
        until the current cascaded classifier has a
        detection rate of at least d x D_{i-1} (this also
        affects F_i)
  •   N ← Ø
  •   If F_i > F_target
  •     Evaluate the current cascaded detector on the set
        of non-face images
  •     Put any false detections into the set N
  • (a Python-style sketch of this loop follows below)

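A Python-style transcription of the loop above, purely as a sketch:
train_stage, evaluate_cascade, lower_threshold, and
collect_false_detections are hypothetical callbacks standing in for the
steps described on the slide.

def train_cascade(P, N, nonface_images, f, d, F_target,
                  train_stage, evaluate_cascade,
                  lower_threshold, collect_false_detections):
    cascade = []
    F_prev, D_prev = 1.0, 1.0
    while F_prev > F_target:
        n_i, F_i, D_i = 0, F_prev, D_prev
        stage = None
        while F_i > f * F_prev:
            n_i += 1
            stage = train_stage(P, N, n_i)                  # AdaBoost, n_i features
            F_i, D_i = evaluate_cascade(cascade + [stage])  # on a validation set
            # lower this stage's threshold until detection rate >= d * D_prev
            stage, F_i, D_i = lower_threshold(cascade, stage, d * D_prev)
        cascade.append(stage)
        F_prev, D_prev = F_i, D_i
        N = []                                              # N <- empty set
        if F_prev > F_target:
            # the next layer's negatives are the current detector's mistakes
            N = collect_false_detections(cascade, nonface_images)
    return cascade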
45
Simple Experiment
  • A monolithic 200-feature classifier and
  • A cascade of ten 20-feature classifiers
  • Training using
  • 5,000 faces + 10,000 non-face sub-windows

46
(No Transcript)
47
Simple Experiment
  • A monolithic 200-feature classifier and
  • A cascade of ten 20-feature classifiers
  • Training using
  • 5,000 faces + 10,000 non-face sub-windows
  • Little difference between them in terms of
    accuracy
  • But cascaded classifier is nearly 10 times faster
  • since its first stage throws out most non-faces
    so that they are never evaluated by subsequent
    stages.

48
Detector Cascade Discussion
  • Similar to Rowley et al. (1998) (fast)
  • Trained two neural networks
  • One was moderately complex
  • focused on a small region of the image,
  • detected faces with a low false positive rate.
  • The second neural network was much faster
  • focused on larger regions of the image, and
  • detected faces with a higher false positive rate
  • This method
  • generalizes the two-stage cascade to include 38 stages

49
Training Dataset
  • 4916 hand labeled faces scaled and aligned to a
    base resolution of 24 by 24 pixels.

50
Structure of the Detector Cascade
  • 38-layer cascade of classifiers including a total
    of 6,060 features
  • The first classifier, constructed using two features,
  • rejects about 50% of non-faces while
  • correctly detecting close to 100% of faces
  • The next classifier has ten features
  • rejects 80% of non-faces while
  • detecting almost 100% of faces
  • The next two layers are 25-feature classifiers
  • Then three 50-feature classifiers
  • Then classifiers with a variety of different
    numbers of features, chosen according to the training
    procedure described earlier

51
Speed of Face Detector
  • Speed is proportional to the average number of
    features computed per sub-window.
  • On the MIT+CMU test set, an average of 9 features
    (of the 6,060 total) is computed per sub-window.
  • On a 700 MHz Pentium III, a 384x288 pixel image
    takes about 0.067 seconds to process (15 fps).
  • Roughly 15 times faster than Rowley-Baluja-Kanade
    and 600 times faster than Schneiderman-Kanade.

52
Scanning The Detector
  • Multiple scales
  • Scaling is achieved by scaling the detector
    itself, rather than scaling the image
  • The features can be evaluated at any scale with
    the same cost
  • Locations
  • Subsequent locations are obtained by shifting the
    window by some number of pixels Δ
  • The choice of Δ affects both speed and accuracy
  • A step size > 1 pixel tends to
  • decrease the detection rate slightly, while also
  • decreasing the number of false positives

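A sketch of this scanning loop, assuming a hypothetical
classify_window(ii, x, y, scale) that evaluates the cascade with
rectangle features scaled by `scale` (the integral image makes that cost
independent of scale):

def scan(ii, image_w, image_h, classify_window,
         base=24, scale_step=1.25, delta=1.0):
    # Scale the detector, not the image; shift by Delta * scale pixels.
    detections = []
    scale = 1.0
    while base * scale <= min(image_w, image_h):
        size = int(round(base * scale))
        step = max(1, int(round(delta * scale)))
        for y in range(0, image_h - size + 1, step):
            for x in range(0, image_w - size + 1, step):
                if classify_window(ii, x, y, scale):
                    detections.append((x, y, size))
        scale *= scale_step
    return detections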
53
(No Transcript)
54
Integration of Multiple Detections
  • Post-process: combine overlapping detections into
    a single detection
  • The set of detections are first partitioned into
    disjoint subsets
  • Two detections are in the same subset if their
    bounding regions overlap.
  • Each partition yields a single final detection.
  • The corners of the final bounding region are the
    average of the corners of all detections in the
    set.
  • Decreases the number of false positives.

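A small sketch of that post-processing step (illustrative; boxes are
(x1, y1, x2, y2) corner tuples): overlapping detections are grouped into
disjoint subsets, and each subset is replaced by the average of its
members' corners.

def merge_detections(boxes):
    def overlap(a, b):
        # two axis-aligned boxes overlap unless one lies entirely to the
        # side of, or above/below, the other
        return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

    groups = []                                   # disjoint subsets of boxes
    for box in boxes:
        hits = [g for g in groups if any(overlap(box, m) for m in g)]
        merged = [box] + [m for g in hits for m in g]
        groups = [g for g in groups if g not in hits] + [merged]

    # one final detection per subset: the average of the members' corners
    return [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups]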
55
Integration of Multiple Detections
  • A simple Voting Scheme further improves results
  • Three detectors performed similarly on the final
    task, but in some cases their errors were different.
  • Retain only those detections where at least 2
    out of 3 detectors agree.
  • This improves the final detection rate as well as
    eliminating more false positives.
  • Since the detector errors are correlated rather than
    independent, the combination results in a measurable,
    but modest, improvement over the best single detector.

56
Sample results
MIT+CMU test set
57
Failure Cases
  • Trained on frontal, upright faces.
  • The faces were only very roughly aligned so there
    is some variation in rotation both in plane and
    out of plane.
  • Detect faces that are tilted up to about 15
    degrees in plane and about 45 degrees out of
    plane (toward a profile view).
  • The detector becomes unreliable with more
    rotation.
  • Harsh backlighting in which the faces are very
    dark while the background is relatively light
    sometimes causes failures.
  • A nonlinear variance normalization based on robust
    statistics could remove such outliers
  • The problem with such a normalization is the
    greatly increased computational cost within our
    integral image framework.
  • Fails on significantly occluded faces.
  • Occluded eyes usually fail.
  • A face with a covered mouth will usually still be
    detected.

58
Summary (Viola-Jones)
  • Fastest known face detector for gray-scale images
  • Three contributions with broad applicability
  • Cascaded classifier yields rapid classification
  • AdaBoost as an extremely efficient feature
    selector
  • Rectangle features + integral image can be used
    for rapid image analysis

59
Face detector comparison
  • Informal study by Andrew Gallagher, CMU, for CMU
    16-721 Learning-Based Methods in Vision, Spring
    2007
  • The OpenCV implementation of the Viola-Jones
    algorithm was used (<2 sec per image).
  • For Schneiderman and Kanade, "Object Detection
    Using the Statistics of Parts" (IJCV 2004), the
    www.pittpatt.com demo was used (10-15 seconds
    per image, including web transmission).

60
Schneiderman-Kanade
Viola-Jones