Title: Indexing and Binning Large Databases
1Indexing and Binning Large Databases
2Abstract
- Problems with large databases
- Biometric identification (1:N matching) does not scale well with database size
- No established way to organize high-dimensional biometric data
- Proposed Solution
- Reduce the search space before 1:N matching
- Divide the database using clustering techniques
- Contributions
- We analyze the effect of implementing a binning scheme on search performance and accuracy
- We present binning and pruning approaches using multiple biometrics
- Using hand geometry and signature, we have achieved a search space reduction of 95% with 0% FRR
3Background
- Only biometric identification (1:N matching) can prevent duplicate enrollments and double dipping
- Biometrics are being deployed for immigration and national ID applications
- US-VISIT program
- Voter ID and national ID programs
- Database sizes can run into the millions
- Current research is focused only on accuracy
- Apart from accuracy, scalability, speed, and efficiency also become important at this scale
4Challenges
- Textual/Numeric Data
- Data is scalar (1-D)
- Textual/numeric data can be linearly ordered and
therefore easily indexed
- Biometric Data
- Biometric templates are high dimensional
- No linear ordering or sorting method exists for biometric data
5Search space analysis
- As the number of stored templates increases, template density (TD) also increases
6Identification problem
- The number of false positives grows geometrically with the size of the database
- Let FAR and FRR be the False Acceptance Rate (probability) and False Reject Rate (probability) for 1:1 matching
- For 1:N matching,
- The total number of false accepts is given by
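Under the usual assumption that the N comparisons are independent, the two quantities on this slide are commonly written as below. The subscripted names FAR_N and FRR_N are notation introduced here, the second line assumes that each of the N enrolled users makes one identification attempt, and the approximations hold when N x FAR is small.

```latex
\[
\mathrm{FAR}_N = 1 - (1 - \mathrm{FAR})^N \approx N \cdot \mathrm{FAR},
\qquad
\mathrm{FRR}_N \approx \mathrm{FRR}
\]
\[
\text{total false accepts} \approx N \cdot \mathrm{FAR}_N \approx N^2 \cdot \mathrm{FAR}
\]
```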
7State of the Art
| Biometrics | State of the art | Research problems |
| --- | --- | --- |
| Fingerprint | 0.15% FRR at 1% FAR (FVC 2002) | Fingerprint enhancement; partial fingerprint matching |
| Face Recognition | 10% FRR at 1% FAR (FRVT 2002) | Improving accuracy; face alignment variation; handling lighting variations |
| Hand Geometry | 4% FRR at 0% FAR (Transportation Security Administration tests) | Developing reliable models; identification problem |
| Signature Verification | 1.5% (IBM Israel) | Developing offline verification systems; handling skillful forgeries |
| Voice Verification | <1% FRR (current research) | Handling channel normalization; user habituation; text and language independence |
8State of the Art
| Biometrics | State of the art | Research problems |
| --- | --- | --- |
| Fingerprint | 0.15% FRR at 1% FAR (FVC 2002) | Fingerprint enhancement; partial fingerprint matching |
| Face Recognition | 10% FRR at 1% FAR (FRVT 2002) | Improving accuracy; face alignment variation; handling lighting variations |
| Hand Geometry | 2.6% FRR at 0.02% FAR (CUBS, SUNY-Buffalo) | Developing reliable models; identification problem |
| Signature Verification | 1.5% (IBM Israel) | Developing offline verification systems; handling skillful forgeries |
| Voice Verification | <1% FRR (current research) | Handling channel normalization; user habituation; text and language independence |
9Identification problem (contd.)
- Even if FAR = 0.0001%, false accepts = 1 in 10 for N = 100,000 (lower bound) in the identification case
- No single biometric is capable of meeting this security requirement individually
- Ways to reduce identification errors
- Reduce FAR
- FAR is limited by the feature representation and the recognition algorithm
- It cannot be reduced indefinitely
- Reduce N
- Classify or index the biometric database (e.g., the Henry classification system for fingerprints)
- Index the records based on metadata
- Can we do better?
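The 1-in-10 figure on this slide can be checked with a short sketch. It assumes the N comparisons are independent and reads the quoted FAR as 0.0001%, i.e. a probability of 1e-6; the function name `identification_far` is made up for this illustration.

```python
def identification_far(far: float, n: int) -> float:
    """Probability of at least one false accept over n independent 1:1 comparisons."""
    return 1.0 - (1.0 - far) ** n

far = 1e-6      # 0.0001% expressed as a probability (assumed reading of the slide)
n = 100_000     # database size used on the slide

exact = identification_far(far, n)
approx = n * far  # first-order approximation N x FAR

print(f"exact={exact:.4f}  approx={approx:.4f}")
```

The exact value comes out at about 0.095, just under the N x FAR approximation of 0.1, which is the "1 in 10" quoted above.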
10Fingerprint Features
Fingerprints can be classified based on the ridge flow pattern
Fingerprints can be distinguished based on the ridge characteristics
65% of fingerprints belong to the Loop class
11Henry Classification of Fingerprints
- Ratha et al., 1996 used Henry classification on a database of 1800 templates, tested on 100 templates
- Search space = 25%, FRR = 10%
- Jain and Pankanti, 2000 ran a similar experiment on a database of 700 templates and achieved FRR = 7.4% (focus on classification only)
- The state-of-the-art fingerprint classification system (Cappelli, Maio, Maltoni, Nanni, 2003) has FRR = 4.8% for the 5-class problem and 3.7% for the 4-class problem
- Even though natural classes exist, classification is still non-trivial
- Natural classes do not exist for biometrics like hand geometry
- More sophistication is needed to partition the database
12Analysis of search space reduction
- We can improve performance by reducing the search space during identification
- Let P_SYS = penetration rate, between 0.0 and 1.0
- Penetration rate is the average fraction of the database searched during identification
- Effective size = N × P_SYS
- The total number of false accepts is given by
- State-of-the-art fingerprint systems have P_SYS = 0.5
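One way to see the effect of the penetration rate is to substitute the effective size for N in the identification error. This is a sketch under the assumption of independent comparisons; N_eff and FAR_N are notation introduced here, not taken from the slides.

```latex
\[
N_{\mathrm{eff}} = N \cdot P_{SYS},
\qquad
\mathrm{FAR}_N = 1 - (1 - \mathrm{FAR})^{N \cdot P_{SYS}} \approx N \cdot P_{SYS} \cdot \mathrm{FAR}
\]
```

For example, with FAR = 10^-6, N = 100,000 and P_SYS = 0.5, the approximation gives 0.05 rather than 0.1.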
13Effect of binning on accuracy
- For P_SYS < 0.2, the number of false accepts is almost constant
- Query response time shrinks in proportion to P_SYS
- Capabilities of a low-FAR system
- Will allow us to screen immigrants at airports
- Will make biometric systems more user-friendly by eliminating the need to remember PINs and IDs
14Binning
- Binning can be used to achieve a smaller P_SYS
- Partition the feature space
- Each bin is represented by a cluster center C_K
- Records are compared with only N_B cluster centers
- Bin representatives are computed offline during training
- Challenges
- How to handle clustering of large databases?
- How to handle additions and deletions?
15Tradeoff
- Although binning reduces the search space, it introduces another source of identification error: the bin miss
- If the bin in which the user record exists is not searched, a false reject is generated no matter how good the matcher is
- If P(B) is the probability of searching the correct bin, a genuine user is accepted only when the correct bin is searched and the matcher accepts, so the overall false reject rate becomes 1 - P(B)(1 - FRR)
- Binning increases the probability of false rejects
- Not tolerable in security and screening applications
- Solution
- Use K-means clustering to find K bins
- Check the N_s nearest bins for the record, such that P(B) approaches 1
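The solution above can be sketched in a few lines. This is a minimal pure-Python illustration on synthetic 2-D data, not the authors' implementation: the function names, the farthest-point seeding, and the toy "templates" are all assumptions made for the example.

```python
import math
import random

def init_centers(points, k):
    """Farthest-point seeding (an assumption; any k-means seeding would do)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def nearest_bins(query, centers, n_s):
    """Indices of the n_s cluster centers closest to the query, nearest first."""
    order = sorted(range(len(centers)), key=lambda j: math.dist(query, centers[j]))
    return order[:n_s]

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; returns the k cluster centers (bin representatives)."""
    centers = init_centers(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest_bins(p, centers, 1)[0]].append(p)
        # Move each center to the mean of its bin; keep it if the bin is empty.
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else c
                   for g, c in zip(groups, centers)]
    return centers

def candidates(query, templates, centers, n_s):
    """Indices of enrolled templates whose bin is among the n_s nearest to the query."""
    bins = set(nearest_bins(query, centers, n_s))
    # In practice each template's bin would be stored at enrollment time.
    return [i for i, t in enumerate(templates)
            if nearest_bins(t, centers, 1)[0] in bins]

# Synthetic "templates": three well-separated groups of 30 points each.
rng = random.Random(0)
templates = [(cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
             for cx, cy in [(0, 0), (10, 0), (0, 10)] for _ in range(30)]
centers = kmeans(templates, k=3)
query = (9.5, 0.5)

for n_s in (1, 2, 3):
    shortlist = candidates(query, templates, centers, n_s)
    print(n_s, len(shortlist) / len(templates))  # penetration rate for this n_s
```

Raising `n_s` trades penetration for a higher chance of searching the correct bin: on this toy data the searched fraction is 1/3, 2/3, and 1 of the database for `n_s` = 1, 2, and 3.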
16Formal definition of Binning
- In general, a biometric template may be represented as a vector
- Vectors are grouped into N distinct clusters, each represented by a codebook vector
- The codebook vectors divide the feature space into N distinct Voronoi regions
- Every template is closest to the mean (codebook vector) of the region it belongs to
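The last statement can be written as the standard definition of a Voronoi region. The symbols below are assumptions introduced for illustration: x is a template vector in d dimensions, c_i is the i-th codebook vector, V_i is its region, and the distance is taken to be Euclidean.

```latex
\[
V_i = \left\{\, x \in \mathbb{R}^d \;:\; \lVert x - c_i \rVert \le \lVert x - c_j \rVert
\ \text{for all } j \ne i \,\right\},
\qquad i = 1, \dots, N
\]
```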
17Search Space Partition: Voronoi Regions
18Hand Geometry Template
- Feature extraction stages
- Image capture
- Binarization
- Contour Extraction
- Noise Removal
- 35 Features are extracted
- 25 directly measured features
- 10 ratio and perimeter features
19Signature Template
11 Features Extracted
- Regression Constants b0,b1
- Compactness
- Signature Length
- Major Stroke Length
- Major Stroke Angle
- Connected Components
- Hole Count
- Hole Area
- Stroke Count
- Signing Time
20Results
35-dimensional hand geometry data: best penetration 35.8% for 6 bins, FRR = 0%
11-dimensional signature data: best penetration 35.57% for 6 bins, FRR = 0%
Dataset: training set of 250, testing set of 250
21Multi-modal approach
- The resulting bins have very high template densities
- A different biometric modality should be used to classify templates within a bin
- Multimodal biometrics
- Using multiple biometrics improves accuracy
- It is difficult to forge multiple biometrics
- Composite templates reduce template density
- Statistical independence ensures that the individual binning results are diverse
- The search space (intersection of bins) is reduced due to the low commonality between the individual binning results
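The intersection step can be illustrated with a toy example. The record IDs and database size below are invented, and `combined_candidates` is a hypothetical helper name; the point is only that a record stays in the search space when both modalities place it in a searched bin.

```python
def combined_candidates(hand_candidates, signature_candidates):
    """Records that survive pruning by both modalities (intersection of bins)."""
    return set(hand_candidates) & set(signature_candidates)

n_enrolled = 20                         # hypothetical database size
hand_bin = {0, 1, 2, 3, 4, 5, 6}        # IDs in the searched hand-geometry bins
signature_bin = {5, 6, 7, 8, 9, 10}     # IDs in the searched signature bins

search_space = combined_candidates(hand_bin, signature_bin)
penetration = len(search_space) / n_enrolled

print(sorted(search_space), penetration)
```

Here the hand-geometry bins alone cover 7 of 20 records and the signature bins 6 of 20, while the intersection leaves only 2 of 20 to match against.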
22Multi-Modal Approach
23Multi-Modal Approach
Search space = 5% of original database size, FRR = 0%
24Results of Combination
Best combined penetration rate of 5%
Dataset: training set of 250, testing set of 250
25Binning v/s Indexing
- Applications can have frequent insertions of new templates
- Binning works well when the database is static
- Insertions will require re-partitioning the entire database
- Indexing can be used in both static and dynamic database scenarios
- Trees are commonly used for indexing
- Extend the concept of indexing relational databases to indexing biometric databases
- Much more challenging: no concept of a primary key exists in biometric templates!
26Pyramid Technique: spatial hashing
- Determine the pyramid (i) within which the template lies
- Determine the height (h) of the template from the apex
- The 1-D value = pyramid number (i) + height (h)
- Indexing is done using B+ trees
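The mapping from a template to its 1-D value can be sketched as follows. The sketch follows the usual formulation of the Pyramid Technique and assumes features are normalized to [0, 1], so that the apex shared by all pyramids is the center point (0.5, ..., 0.5). A sorted list with `bisect` stands in for the B+ tree, and the template names and values are invented.

```python
import bisect

def pyramid_value(v):
    """Map a point in [0, 1]^d to its 1-D key: pyramid number i plus height h."""
    d = len(v)
    # The dimension that deviates most from the center decides the pyramid.
    j_max = max(range(d), key=lambda j: abs(0.5 - v[j]))
    i = j_max if v[j_max] < 0.5 else j_max + d   # 2d pyramids in total
    h = abs(0.5 - v[j_max])                      # distance from the apex, 0..0.5
    return i + h

# Hypothetical normalized 2-D templates.
templates = {"a": (0.2, 0.6), "b": (0.5, 0.9), "c": (0.1, 0.5), "d": (0.7, 0.45)}

# Sorted (key, name) pairs: a stand-in for the B+ tree over the 1-D keys.
index = sorted((pyramid_value(v), name) for name, v in templates.items())
keys = [k for k, _ in index]

def in_pyramid(i):
    """Names of all templates stored in pyramid i, via a 1-D range scan."""
    lo = bisect.bisect_left(keys, i)
    hi = bisect.bisect_right(keys, i + 0.5)
    return [name for _, name in index[lo:hi]]

print(in_pyramid(0))
```

Because the height never exceeds 0.5, all keys of pyramid i fall in the interval [i, i + 0.5], so a query restricted to one pyramid becomes a single range scan over the 1-D index.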
27Various Indexing Techniques
- Grid Files
- KD Tree
- R Tree and its variants
- X Tree
- Pyramid Technique
28Comparative Study
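| Method | Scalable | Order Invariant | Dynamic | Range Query | No Overlap |
| --- | --- | --- | --- | --- | --- |
| Grid File | Y | Y | N | N | Y |
| R Tree | Y | N | N | N | N |
| R* Tree | Y | N | N | N | N |
| R+ Tree | Y | N | N | N | Y |
| KD Tree | Y | N | N | N | Y |
| X Tree | Y | N | Y | Y | Y |
| Pyramid Technique | Y | Y | Y | Y | Y |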
29Results of Indexing
- 35-dimensional hand geometry data
- Best penetration: 27%
- FRR = 0%
- Dataset: training set of 450, testing set of 450
- Parallel combination with signature will further reduce the search space
30Multimodal Biometrics
312D Biometric Signature Fingerprint Fusion
[Figure: impostor score pairs and true match score pairs]
32Optimal Fusion Algorithm: Signature Fused With Fingerprint
[Figure: ROC plot. Axis: False Accept Rate (FAR). Labels: Optimal Fusion ROC, Fusion Algorithm, Unrealizable Performance Area, Suboptimal Performance Area, True Match Score Pairs, Impostor Score Pairs]
The optimal ROC is the boundary between unrealizable performance and achievable but suboptimal performance.
33Optimal Fusion Algorithm Decision Regions: 99.04% Accuracy at Specified FAR of 1 in a Million
[Figure: decision regions. Axes: 1st Biometric Score, 2nd Biometric Score]
The decision region boundary is irregular due to the finite sample size; the more data, the smoother the boundaries.
34RSS Fusion Algorithm for Fingerprint Signature Provides a Suboptimal Performance ROC
[Figure: ROC plot. Axis: False Accept Rate (FAR). Labels: Optimal ROC, RSS Fusion ROC, RSS Fusion, True Match Score Pairs, Impostor Score Pairs]
35RSS Fusion Decision Regions: 96.11% Accuracy at Specified FAR of 1 in a Million
[Figure: decision regions. Axes: 1st Biometric Score, 2nd Biometric Score]
36OR Fusion Algorithm for Fingerprint Signature Provides a Suboptimal Performance ROC
[Figure: ROC plot. Axis: False Accept Rate (FAR). Labels: Optimal ROC, OR Fusion ROC, OR Fusion, True Match Score Pairs, Impostor Score Pairs]
37OR Fusion Decision Regions: 96.85% Accuracy at Specified FAR of 1 in a Million
[Figure: decision regions. Axes: 1st Biometric Score, 2nd Biometric Score]
38AND Fusion Algorithm for Fingerprint Signature Provides a Suboptimal Performance ROC
[Figure: ROC plot. Axis: False Accept Rate (FAR). Labels: Optimal ROC, AND Fusion ROC, AND Fusion, True Match Score Pairs, Impostor Score Pairs]
39AND Fusion Decision Regions: 62.91% Accuracy at Specified FAR of 1 in a Million
[Figure: decision regions. Axes: 1st Biometric Score, 2nd Biometric Score]
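The slides name the RSS, OR, and AND fusion rules without defining them. One common reading is sketched below, assuming that higher scores mean better matches and that RSS stands for root-sum-of-squares; the function names and thresholds are hypothetical, and the optimal fusion rule from the earlier slides is not reproduced here.

```python
import math

def rss_fusion(s1, s2, threshold):
    """Accept when the root-sum-of-squares of the two scores reaches the threshold."""
    return math.hypot(s1, s2) >= threshold

def or_fusion(s1, s2, t1, t2):
    """Accept when either biometric alone clears its own threshold."""
    return s1 >= t1 or s2 >= t2

def and_fusion(s1, s2, t1, t2):
    """Accept only when both biometrics clear their thresholds."""
    return s1 >= t1 and s2 >= t2

# A strong first score paired with a weak second score.
s1, s2 = 0.9, 0.1
print(rss_fusion(s1, s2, threshold=0.8),
      or_fusion(s1, s2, 0.5, 0.5),
      and_fusion(s1, s2, 0.5, 0.5))
```

Each rule carves a differently shaped accept region out of the two-score plane: a quarter-disc complement for RSS, an L-shaped region for OR, and a rectangle for AND.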
40ROC