Title: Neural Network-Based Face Detection
1. Neural Network-Based Face Detection
- Henry A. Rowley, Shumeet Baluja, and Takeo Kanade
- January 1998
Presented by Roscoe Cook, UCSD ECE 285, Jan 30, 2002
2. Overview
- Upright frontal face detection using multiple neural networks
- Each neural network examines windows of varying size and location, deciding if each contains a face or not
- The system then arbitrates between the results of the networks to improve results
3. Stage 1: Neural Network-Based Filter
- Receives a preprocessed 20x20 pixel window (subsampled if necessary)
- Outputs a value ranging from -1 to 1
- Window sizes are incremented by a scale factor of 1.2
- The image is examined using each window size at every pixel position
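The multi-scale scan described above can be sketched in Python. This is an illustrative sketch, not the authors' code: the 20x20 window and the 1.2 scale factor come from the slide, while the function name `pyramid_windows` and the shortcut of tracking only the shrinking image dimensions (rather than actually resampling pixels) are assumptions made here.

```python
import numpy as np

def pyramid_windows(image, window=20, scale=1.2):
    """Yield (pyramid_level, y, x) for every 20x20 window position
    at every scale.  The image is repeatedly shrunk by a factor of
    1.2 until it is smaller than the window; at each scale the
    window is placed at every pixel position.  For brevity only the
    dimensions are tracked; a real scan would resample the pixels."""
    h, w = image.shape
    level = 0
    while min(h, w) >= window:
        for y in range(h - window + 1):
            for x in range(w - window + 1):
                yield level, y, x
        # Subsample: shrink both dimensions by the scale factor.
        h, w = int(h / scale), int(w / scale)
        level += 1
```

For a 24x24 image this yields 25 positions at the original scale plus a single position at the next pyramid level (24 / 1.2 = 20), after which the image is too small for the window.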
4. Preprocessing
- Goal: compensate for differences in camera input gains and improve contrast
- Fit a linear function to an oval region inside the window and subtract it from the image
- Histogram equalization: nonlinearly map intensity values to expand the intensity range
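The two preprocessing steps can be sketched as follows. This is a simplified sketch: the slide fits the linear function over an oval region inside the window, whereas the sketch below fits over the whole window; the function names and the uint8-style integer input assumed by `equalize` are choices made here, not taken from the paper.

```python
import numpy as np

def correct_lighting(win):
    """Fit a linear brightness plane a*x + b*y + c to the window by
    least squares and subtract it, flattening illumination gradients.
    (The paper fits over an oval region; a whole-window fit is used
    here for simplicity.)"""
    h, w = win.shape
    ys, xs = np.mgrid[0:h, 0:w]
    A = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)], axis=1)
    coef, *_ = np.linalg.lstsq(A, win.ravel().astype(float), rcond=None)
    plane = (A @ coef).reshape(h, w)
    return win - plane

def equalize(win, levels=256):
    """Histogram equalization: map each intensity through the
    normalized cumulative histogram, spreading values over the full
    range.  Assumes non-negative integer intensities (e.g. uint8)."""
    flat = win.ravel()
    hist = np.bincount(flat, minlength=levels)
    cdf = np.cumsum(hist) / flat.size
    return cdf[flat].reshape(win.shape) * (levels - 1)
```

Applying `correct_lighting` to a window that is a pure brightness ramp returns (near) zeros, since the ramp is exactly the fitted plane.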
5. Preprocessing
7. Neural Network Architecture
- Three types of hidden units:
- 4 look at 10x10 pixel subregions
- 16 look at 5x5 pixel subregions
- 6 look at 20x5 pixel horizontal stripes
- Horizontal stripes are useful for finding features such as a mouth or pair of eyes
- Square subregions are useful for finding individual features, such as an eye or nose
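The receptive-field layout above can be enumerated explicitly. The counts and sizes (4 of 10x10, 16 of 5x5, 6 of 20x5) are from the slide; the exact placement of the six stripes (height 5, stepped 3 rows apart so they overlap) is an assumption here, since the slide does not specify it.

```python
def receptive_fields():
    """Return (y, x, height, width) for each hidden unit's receptive
    field over the 20x20 input: 4 non-overlapping 10x10 squares,
    16 non-overlapping 5x5 squares, and 6 overlapping 20x5 stripes
    (stripe spacing is an assumption)."""
    fields = []
    for y in range(0, 20, 10):          # 4 units: 10x10 subregions
        for x in range(0, 20, 10):
            fields.append((y, x, 10, 10))
    for y in range(0, 20, 5):           # 16 units: 5x5 subregions
        for x in range(0, 20, 5):
            fields.append((y, x, 5, 5))
    for y in range(0, 18, 3):           # 6 units: 20x5 stripes
        fields.append((y, 0, 5, 20))
    return fields
```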
9. Training
- Positive training set: normalized face examples
- Negative training set: generated during training using a bootstrap method
10. Training: Face Examples
- 1,050 face examples gathered from face databases at CMU and Harvard and from the World Wide Web
- Various sizes, orientations, positions, and intensities
- Eyes, tip of nose, and corners and center of mouth labeled manually
- Labeling used to normalize each face to the same scale, orientation, and position
- Normalization maps each face to a 20x20 pixel window
- Fifteen faces generated for the training set from each original image by randomly rotating the images up to 10 degrees, scaling between 90% and 110%, translating up to half a pixel, and mirroring
11. Training: Non-face Examples
- Non-face examples are collected during training:
- 1. Build an initial non-face set of 1000 randomly generated, then preprocessed, images
- 2. Train to output 1 for face and -1 for non-face inputs
- 3. Run the system on an image of scenery which contains no faces, collecting subimages which the network incorrectly identifies as a face (output > 0)
- 4. Select up to 250 of these images at random, apply preprocessing, and add them into the training set as negative examples
- 5. Go to step 2
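The bootstrap loop above can be sketched as a control-flow skeleton. Only the structure (1000 initial negatives, retraining, collecting false alarms with output > 0, capping additions at 250 per round) follows the slide; `train` and `predict` are placeholder callables standing in for the real network, and the data representations are purely illustrative.

```python
import random

def bootstrap_negatives(train, predict, scenery_windows,
                        rounds=3, cap=250):
    """Bootstrap collection of non-face examples: repeatedly train,
    scan face-free scenery, and add up to `cap` windows the network
    wrongly calls faces (output > 0) as new negatives.

    `train(pos, neg)` returns a trained network; `predict(net, w)`
    returns its output for window `w`.  Both are placeholders."""
    positives = ["face"] * 10                        # stand-in positives
    negatives = [("noise", i) for i in range(1000)]  # initial random negatives
    for _ in range(rounds):
        net = train(positives, negatives)
        false_alarms = [w for w in scenery_windows if predict(net, w) > 0]
        random.shuffle(false_alarms)                 # pick at random
        negatives.extend(false_alarms[:cap])
    return negatives
```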
12. Stage 2: Merging Overlapping Detections and Arbitration
- Most faces are detected at multiple nearby positions or scales
- False detections are less consistent
- Heuristics eliminate many false detections:
- Spatial thresholding: collapse multiple detections
- Overlap elimination: when detections overlap, keep only the most dominant
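The spatial-thresholding heuristic can be sketched as below. This is a simplification of the paper's version (which works on a detection pyramid and is followed by overlap elimination): detections are (x, y, scale) triples, and a detection survives only if at least `threshold` detections lie within `dist` of it, in which case the cluster is collapsed to its centroid. The function name and default parameters are assumptions.

```python
def collapse_detections(detections, dist=2, threshold=2):
    """Keep a detection only if at least `threshold` detections fall
    within `dist` of it along every axis (x, y, and scale), and
    collapse each surviving cluster to its centroid.  A simplified
    form of the paper's spatial-thresholding heuristic."""
    kept = []
    for d in detections:
        near = [e for e in detections
                if all(abs(a - b) <= dist for a, b in zip(d, e))]
        if len(near) >= threshold:
            centroid = tuple(sum(c) / len(near) for c in zip(*near))
            if centroid not in kept:     # collapse duplicates
                kept.append(centroid)
    return kept
```

Two detections a pixel apart merge into one centroid; an isolated detection (a typical false alarm) is discarded.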
13. Initial Detection Results
14. Arbitration Between Multiple Networks
- Detection and false-positive rates of individual networks are quite close
- Individual networks have different biases and make different errors (because of self-selection of negative training examples)
- This allows improved results by combining the results of the individual networks
15. Arbitration Strategies
- Simple logic strategies:
- ANDing
- ORing
- Voting
- Neural network strategy:
- Input: the number of detections in a 3x3 region that each face-detecting neural net found
- Output: the decision of whether or not there is a face at the center of the 3x3 region
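The simple logic strategies reduce to combining boolean per-network decisions at a given location. The sketch below assumes distance 0 (exact agreement in location and scale, as in the paper's legend) and encodes the rules from the slides: voting needs a majority, ANDing needs all networks, ORing needs any one.

```python
def arbitrate(outputs, mode="vote"):
    """Combine per-network detections at one location and scale.
    `outputs` is a list of booleans, one per network."""
    votes = sum(outputs)
    if mode == "and":
        return votes == len(outputs)      # e.g. 2 of 2 networks
    if mode == "or":
        return votes >= 1                 # e.g. 1 of 2 networks
    return votes > len(outputs) / 2       # vote: e.g. 2 of 3 networks
```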
16. Results
- Sensitivity analysis: which parts of the face is the detector most sensitive to?
- Testing: two test sets
- Test Set 1 examines the false-positive rate
- Test Set 2 examines the angular sensitivity
17. Sensitivity Analysis
- Goal: find which parts of the face are most important for detection
- Divide the 20x20 pixel input images into 100 2x2 pixel regions
- For every 2x2 region of every image in a positive test set, replace the region with random noise and input the result to the neural network
- The resulting RMS error of the network on the test set indicates how important that portion of the image is for detection
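The occlusion loop above can be sketched directly. The 10x10 grid of 2x2 regions and the RMS-error measure come from the slide; the noise range, the `network` placeholder (returning a value in [-1, 1], with target 1 for a face), and the seeded generator are assumptions made here.

```python
import numpy as np

def sensitivity_map(faces, network, rng=np.random.default_rng(0)):
    """For each of the 100 2x2 regions of the 20x20 input, replace
    that region with random noise in every test face and record the
    network's RMS error against the face target of 1; higher error
    means the region matters more.  `network(img)` is a placeholder."""
    errors = np.zeros((10, 10))
    for gy in range(10):
        for gx in range(10):
            sq = 0.0
            for face in faces:
                img = face.copy()
                img[2*gy:2*gy+2, 2*gx:2*gx+2] = rng.uniform(-1, 1, (2, 2))
                sq += (network(img) - 1.0) ** 2
            errors[gy, gx] = np.sqrt(sq / len(faces))
    return errors
```

A network that only looks at one corner of the window, for example, produces a map with error concentrated in the region covering that corner.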
18. Sensitivity Analysis Results
The networks rely most heavily on the eyes, then
the nose, then the mouth.
19. Testing
- Two test sets
- Set 1: 130 images from CMU
- Sources: web, photographs, newspapers, TV broadcast
- Contains 507 frontal faces
- Wide variety of complex backgrounds
- Useful for measuring false-detection rates
- Set 2: from the FERET database
- One face per image
- Uniform background and good lighting
- Taken from a variety of angles
- Useful for measuring angular sensitivity
20. Detection Threshold Analysis
- Output values range from -1 to 1
- Zero used as the threshold during training
- Changing the threshold varies how conservative the system is
- Tradeoff: false positives vs. missed faces
- Detection and false-positive rates measured while varying the threshold
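The threshold sweep amounts to recomputing the detection rate and false-positive count at each candidate threshold. The [-1, 1] output range and the zero training threshold are from the slide; the function name, data layout (parallel lists of scores and face/non-face labels), and return format are illustrative assumptions.

```python
def sweep_threshold(scores, labels, thresholds):
    """Trade off detection rate against false positives by varying
    the acceptance threshold.  `scores` are network outputs in
    [-1, 1]; `labels` mark which windows are true faces.  Returns
    (threshold, detection_rate, false_positive_count) tuples."""
    curve = []
    n_pos = sum(labels)
    for t in thresholds:
        tp = sum(s > t for s, l in zip(scores, labels) if l)
        fp = sum(s > t for s, l in zip(scores, labels) if not l)
        curve.append((t, tp / n_pos, fp))
    return curve
```

Raising the threshold from the training value of 0 makes the system more conservative: false positives drop, at the cost of missed faces.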
22. Detection and Error Rates for Test Set 1
24. Function Legend
- threshold(distance, threshold): only accept a detection if there are at least threshold detections within a cube (extending along x, y, and scale) in the detection pyramid surrounding the detection. The size of the cube is determined by distance, the number of pixels from the center of the cube to its edge (in either position or scale).
- overlap elimination: a set of detections may erroneously indicate that faces overlap one another. This heuristic examines detections in order (from those having the most votes within a small neighborhood to those having the least), removing conflicting overlaps as it goes.
- voting(distance), AND(distance), OR(distance): these heuristics arbitrate among multiple networks. They take a distance parameter, similar to that of the threshold heuristic, which indicates how close detections from individual networks must be to one another to be counted as occurring at the same location and scale. A distance of zero indicates that the detections must occur at precisely the same location and scale. Voting requires two out of three networks to detect a face, AND requires two out of two, and OR requires one out of two to signal a detection.
- network arbitration(architecture): the results from three detection networks are fed into an arbitration network. The parameter specifies the network architecture used: a simple perceptron, a network with a hidden layer of 5 fully connected hidden units, or a network with two hidden layers of 5 fully connected hidden units each, with additional connections from the first hidden layer to the output.
25. Example Output
26. Improving Speed
- Applying two networks to a 320x240 pixel image (246,766 windows) on a 200 MHz R4400 SGI Indigo 2 takes approximately 383 seconds (the computational cost of arbitration is negligible, less than one second)
- Increasing invariance to translation allows fewer windows to be processed
27. Fast Detection
- When training, allow the face to be offset by as much as 5 pixels in any direction
- Increase the window size to 30x30 pixels to ensure that the entire face falls within the window
- The center of the face will fall within a 10x10 pixel region
- The detector can then be moved in steps of 10 pixels
28. Fast Method
- The algorithm runs much faster
- Many more false positives are produced
- Detections are used as candidates for the original 20x20 pixel method
- The 10x10 pixel regions surrounding all candidates are scanned
- Heuristics: overlap removal, ANDing
- Processing time on the same machine: 7.2 seconds
- A restriction based on skin tones also increases speed
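The two-stage scheme of slides 27 and 28 can be sketched as a coarse-to-fine scan: a tolerant 30x30 network stepped every 10 pixels, with each coarse hit triggering a dense scan of the 10x10 region of candidate face centers. The step sizes and window sizes are from the slides; `coarse_net` and `fine_net` are placeholder callables (taking a window position and returning an output in [-1, 1]) standing in for the real networks.

```python
def coarse_to_fine(image_shape, coarse_net, fine_net, step=10):
    """Step a tolerant 30x30 detector every `step` pixels; for each
    coarse hit, densely scan the 10x10 region of candidate centers
    with the precise 20x20 detector."""
    h, w = image_shape
    hits = []
    for y in range(0, h - 30 + 1, step):
        for x in range(0, w - 30 + 1, step):
            if coarse_net(y, x) > 0:
                # Refine: the face center lies within a 10x10 region.
                for dy in range(10):
                    for dx in range(10):
                        if fine_net(y + dy, x + dx) > 0:
                            hits.append((y + dy, x + dx))
    return hits
```

The speedup comes from the coarse stage: roughly one window per 100 pixels instead of one per pixel, with the expensive dense scan run only around candidates.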
29. Comparison to Other Systems
Tested on a 23-image subset of Test Set 1. Performance is generally comparable to, or somewhat better than, that of the other systems.
30. Conclusion
- Detects between 77.9 and 90.3 percent of faces in a test set of 130 images with unconstrained backgrounds, while maintaining an acceptable rate of false detections
- Can be adjusted to be more or less conservative, depending on the application
- A fast version can process a 320x240 image in two to four seconds on a 200 MHz R4400 SGI Indigo 2