Indexing and Binning Large Databases
Abstract
Problems with large databases
• Biometric identification (1:N matching) does not scale well with database size
• There is no established way to organize high-dimensional biometric data
Proposed solution
• Reduce the search space before 1:N matching
• Divide the database using clustering techniques
Contributions
• We analyze the effect of a binning scheme on search performance and accuracy
• We present binning and pruning approaches using multiple biometrics
• Using hand geometry and signature, we achieve a search space reduction of 95% with 0% FRR
Background
• Only biometric identification (1:N matching) can prevent duplicate enrollments and double dipping
• Biometrics are being deployed for immigration and national ID applications
  • US-VISIT program
  • Voter ID and national ID programs [3]
• Database sizes can run into the millions
• Current research is focused only on accuracy
• At this scale, scalability, speed and efficiency become important alongside accuracy
Challenges
Textual/numeric data
• Data is scalar (1-D)
• Textual/numeric data can be linearly ordered and is therefore easily indexed
Biometric data
• Biometric templates are high dimensional
• No linear ordering or sorting method exists for biometric data
Search space analysis
• As the number of stored templates increases, the template density (TD) also increases
• TD is defined in terms of $N_C$ (cluster count) and $D_{IC}$ (average intra-cluster distance)
Identification problem
• The number of false positives grows roughly quadratically with the size of the database
• Let FAR and FRR be the False Acceptance Rate and False Reject Rate (both probabilities) for 1:1 matching
• For 1:N matching,
$FRR_N = FRR$
$FAR_N = 1 - (1 - FAR)^N \approx N \times FAR$
• The total number of false accepts is given by
$\text{False accepts} = N \times FAR_N = N \left(1 - (1 - FAR)^N\right) \approx N^2 \times FAR$
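The growth of the identification error with N can be checked numerically. The sketch below only evaluates the two formulas above; the function names and the sample values (a FAR of 1 in a million, N = 100,000) are illustrative choices, not taken from a particular system.

```python
def identification_far(far: float, n: int) -> float:
    """FAR_N = 1 - (1 - FAR)^N: chance of at least one false accept in a 1:N search."""
    return 1.0 - (1.0 - far) ** n


def total_false_accepts(far: float, n: int) -> float:
    """N * FAR_N, the total number of false accepts as defined above."""
    return n * identification_far(far, n)


if __name__ == "__main__":
    far, n = 1e-6, 100_000  # illustrative values only
    print(f"FAR_N          = {identification_far(far, n):.4f}")
    print(f"approx N * FAR = {n * far:.4f}")
    print(f"false accepts  = {total_false_accepts(far, n):.1f}")
```

The linear approximation N × FAR slightly overestimates the exact FAR_N, and the gap widens as N × FAR approaches 1.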
State of the Art
Biometric              | State of the art                                                 | Research problems
Fingerprint            | 0.15% FRR at 1% FAR (FVC 2002)                                   | Fingerprint enhancement; partial fingerprint matching
Face Recognition       | 10% FRR at 1% FAR (FRVT 2002)                                    | Improving accuracy; face alignment variation; handling lighting variations
Hand Geometry          | 4% FRR at 0% FAR (Transportation Security Administration tests)  | Developing reliable models; identification problem
Signature Verification | 1.5% (IBM Israel)                                                | Developing offline verification systems; handling skillful forgeries
Voice Verification     | <1% FRR (current research)                                       | Handling channel normalization; user habituation; text and language independence
State of the Art (contd.)
Biometric              | State of the art                               | Research problems
Fingerprint            | 0.15% FRR at 1% FAR (FVC 2002)                 | Fingerprint enhancement; partial fingerprint matching
Face Recognition       | 10% FRR at 1% FAR (FRVT 2002)                  | Improving accuracy; face alignment variation; handling lighting variations
Hand Geometry          | 2.6% FRR at 0.02% FAR (CUBS, SUNY-Buffalo)     | Developing reliable models; identification problem
Signature Verification | 1.5% (IBM Israel)                              | Developing offline verification systems; handling skillful forgeries
Voice Verification     | <1% FRR (current research)                     | Handling channel normalization; user habituation; text and language independence
Identification problem (contd.)
• Even if FAR = 0.0001%, false accepts = 1 in 10 for N = 100,000 (lower bound) in the identification case
• No single biometric is capable of meeting this security requirement individually
Ways to reduce identification errors:
• Reduce FAR
  • FAR is limited by the feature representation and the recognition algorithm
  • It cannot be reduced indefinitely
• Reduce N
  • Classify or index the biometric database (e.g., the Henry classification system for fingerprints)
  • Index the records based on metadata
Can we do better?
Fingerprint Features
Fingerprints can be classified based on the ridge flow pattern
Fingerprints can be distinguished based on the ridge characteristics
65% of fingerprints belong to the Loop class
Henry Classification of Fingerprints
• [Ratha et al., 1996] used Henry classification on a database of 1800 templates, tested on 100 templates. Search space: 25%; FRR: 10%
• [Jain, Pankanti, 2000] ran a similar experiment on a database of 700 templates and achieved FRR: 7.4% (focus on classification only)
• The state-of-the-art fingerprint classification system [Cappelli, Maio, Maltoni, Nanni, 2003] has an FRR of 4.8% for the 5-class problem and 3.7% for the 4-class problem
• Even though natural classes exist, classification is still non-trivial
• Natural classes do not exist for biometrics like hand geometry
• More sophisticated methods are needed for partitioning the database
Analysis of search space reduction
• We can improve performance by reducing the search space during identification
• Let $P_{SYS}$ be the penetration rate (between 0.0 and 1.0): the average fraction of the database searched during identification
• Effective size $= N \times P_{SYS}$
• For 1:N matching over the effective size,
$FRR_{N P_{SYS}} = FRR$
$FAR_{N P_{SYS}} = 1 - (1 - FAR)^{N P_{SYS}} \approx N \times P_{SYS} \times FAR$
• The total number of false accepts is given by
$\text{False accepts} = (N P_{SYS}) \times FAR_{N P_{SYS}} \approx (N P_{SYS})^2 \times FAR$
• State-of-the-art fingerprint systems have $P_{SYS} = 0.5$
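To see how the penetration rate changes the false-accept count, the following sketch evaluates the expressions above for a few penetration rates. It assumes that the effective size N × P_SYS replaces N throughout; the function name and the FAR and N values are hypothetical and chosen only for illustration.

```python
def false_accepts(far: float, n: int, p_sys: float = 1.0) -> float:
    """Expected false accepts when only a fraction p_sys of the database is searched."""
    n_eff = n * p_sys                      # effective database size N * P_SYS
    far_eff = 1.0 - (1.0 - far) ** n_eff   # FAR over the effective size
    return n_eff * far_eff


if __name__ == "__main__":
    far, n = 1e-6, 1_000_000  # illustrative values only
    for p_sys in (1.0, 0.75, 0.5, 0.3, 0.1):
        count = false_accepts(far, n, p_sys)
        print(f"P_SYS = {p_sys:4.2f} -> false accepts = {count:12.1f}")
```

Because the count depends on the square of the effective size, halving the penetration rate cuts the false accepts by roughly a factor of four while N × P_SYS × FAR stays small.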
Effect of binning on accuracy
• For $P_{SYS} < 0.2$, the number of false accepts is almost constant
• Query response time is reduced in proportion to $P_{SYS}$
Capabilities of a low-FAR system
• Will allow us to screen immigrants at airports
• Will make biometric systems more user-friendly by eliminating the need to remember PINs and IDs
[Figure: false accepts (axis up to 10 × 10^7) versus database size N (0 to 10 × 10^5), with curves labeled 1.0, 0.75, 0.5, 0.3 and 0.1]
Binning
• Binning can be used to achieve a smaller $P_{SYS}$
• Partition the feature space
• Each bin is represented by a cluster center $C_K$
• Records are compared with only the $N_B$ cluster centers
• Bin representatives are computed offline during training
Challenges
• How to handle clustering of large databases?
• How to handle additions and deletions?
Tradeoff
• Although binning reduces the search space, it introduces another source of identification error: the bin miss
• If the bin that holds the user's record is not searched, a false reject occurs no matter how good the matcher is
• If P(B) is the probability of searching the correct bin, any P(B) < 1 increases the probability of false rejects
• This is not tolerable in security and screening applications
Solution:
• Use K-means clustering to find K bins
• Check the $N_S$ nearest bins for the record, such that P(B) = 1
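Formal definition of Binning
• In general, a biometric template may be represented as a vector
$x_i = [x_{i1}, x_{i2}, x_{i3}, x_{i4}, \ldots, x_{ik}] \in \mathbb{R}^k$
• The vectors are partitioned into N distinct clusters, each represented by a 'code book vector'
$[Y_1, Y_2, Y_3, Y_4, \ldots, Y_N], \quad Y_j \in \mathbb{R}^k$
• The code book vectors divide the feature space into N distinct Voronoi regions
$V_i = \{\, x : \lVert x - Y_i \rVert^2 \le \lVert x - Y_j \rVert^2 \ \ \forall j \ne i \,\}$
$\bigcup_i V_i = \mathbb{R}^k$ and $\bigcap_i V_i = \emptyset$
• Every template is closest to the mean (code book vector) of the region it belongs to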
Search Space Partition: Voronoi Regions
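The binning scheme described above (K-means bins, then a search restricted to the nearest bins) can be sketched in a few lines. This is a minimal illustration in plain Python, not the implementation used for the reported results: the function names, the fixed iteration count and the random initialization are all assumptions.

```python
import math
import random


def nearest_bins(x, centers, n_search=1):
    """Indices of the n_search code book vectors closest to x (Euclidean distance)."""
    order = sorted(range(len(centers)), key=lambda j: math.dist(x, centers[j]))
    return order[:n_search]


def kmeans(templates, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns (centers, bin index of every template)."""
    rng = random.Random(seed)
    centers = [list(t) for t in rng.sample(templates, k)]
    for _ in range(iters):
        labels = [nearest_bins(t, centers)[0] for t in templates]
        for j in range(k):
            members = [t for t, lab in zip(templates, labels) if lab == j]
            if members:  # keep the previous center if a bin becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment so that labels agree with the returned centers.
    labels = [nearest_bins(t, centers)[0] for t in templates]
    return centers, labels


def candidates(query, centers, labels, n_search):
    """Ids of enrolled templates lying in the n_search bins nearest to the query."""
    bins = set(nearest_bins(query, centers, n_search))
    return [i for i, lab in enumerate(labels) if lab in bins]


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic 4-dimensional "templates" around three centers (made-up data).
    templates = [[rng.gauss(c, 0.3) for _ in range(4)]
                 for c in (0.0, 3.0, 6.0) for _ in range(30)]
    centers, labels = kmeans(templates, k=3)
    hits = candidates(templates[0], centers, labels, n_search=1)
    print("fraction of database searched:", len(hits) / len(templates))
```

Raising `n_search` trades a larger penetration rate for a lower chance of a bin miss; with `n_search` equal to the number of bins the search covers the whole database.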
Hand Geometry Template
Feature extraction stages
• Image capture
• Binarization
• Contour extraction
• Noise removal
35 features are extracted
• 25 directly measured features
• 10 ratio and perimeter features
Signature Template
11 features extracted
• Regression constants b0, b1
• Compactness
• Signature length
• Major stroke length
• Major stroke angle
• Connected components
• Hole count
• Hole area
• Stroke count
• Signing time
Results
Signature Binning
[Figure: penetration rate (%) versus number of bins (3 to 30)]
• 11-dimensional signature data
• Best penetration: 35.57% for 6 bins
• FRR = 0%
Hand Geometry Binning
[Figure: penetration rate (%) versus number of bins (3 to 30)]
• 35-dimensional hand geometry data
• Best penetration: 35.8% for 6 bins
• FRR = 0%
$P_{SYS} = \dfrac{N_S \times T_B + N_B}{N_D}$
$N_D$: size of the database, $N_B$: number of bins, $T_B$: average templates in a bin, $N_S$: number of bins to search
Dataset: 250 in the training set and 250 in the testing set
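As a rough check on the penetration-rate formula, the sketch below assumes the form P_SYS = (N_S × T_B + N_B) / N_D, that is, the templates in the searched bins plus one comparison per cluster center. It further assumes equally sized bins, so that T_B = N_D / N_B. The function name is hypothetical and the choice of two searched bins is purely illustrative; the slides do not state how many bins were searched.

```python
def penetration_rate(n_db: int, n_bins: int, n_search: int) -> float:
    """P_SYS = (N_S * T_B + N_B) / N_D, assuming equally sized bins."""
    avg_bin_size = n_db / n_bins  # T_B under the equal-size assumption
    return (n_search * avg_bin_size + n_bins) / n_db


if __name__ == "__main__":
    # 250 templates and 6 bins as in the dataset above; 2 searched bins is an assumption.
    print(f"{penetration_rate(250, 6, 2):.4f}")
```

For 250 templates, 6 bins and two searched bins this evaluates to about 0.357.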
Multi-modal approach
• The resulting bins have very high template densities
• A different biometric modality should be used to classify templates within a bin
Multimodal biometrics
• Using multiple biometrics improves accuracy
• It is difficult to forge multiple biometrics
• Composite templates reduce template density
• Statistical independence ensures that the individual binning results are diverse
• The search space (intersection of bins) is reduced due to low commonality between the individual binning results
Multi-Modal Approach
Search space: 5% of the original database size; FRR = 0%
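The pruning step of the multi-modal approach amounts to intersecting the candidate sets returned by the individual binning stages. A minimal sketch follows; the function names and the template ids are made up for illustration.

```python
def combined_candidates(signature_hits, hand_hits):
    """Templates that survive both binning stages (intersection of the candidate sets)."""
    return set(signature_hits) & set(hand_hits)


def penetration_rate(candidates, database_size):
    """Fraction of the database that still has to be matched."""
    return len(candidates) / database_size


if __name__ == "__main__":
    database_size = 20
    signature_hits = {0, 1, 2, 3, 4, 5, 6, 7}   # ids from signature binning (made up)
    hand_hits = {5, 6, 7, 8, 9, 10, 11}         # ids from hand geometry binning (made up)
    both = combined_candidates(signature_hits, hand_hits)
    print(sorted(both), penetration_rate(both, database_size))
```

The less the two candidate sets overlap, the smaller the intersection, which is why statistically independent modalities help.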
Results of Combination
[Figure: penetration rate (%) versus number of bins (5 to 30) for Signature, Hand Geometry and Combination]
• Best combined penetration rate of 5%
$P_{SYS} = \dfrac{N_S \times T_B}{N_D}$
$N_D$: size of the database, $T_B$: average templates in a bin, $N_S$: number of bins to search
Dataset: 250 in the training set and 250 in the testing set
Binning v/s Indexing
• Applications can have frequent insertions of new templates
• Binning works well when the database is static
• Insertions will require re-partitioning the entire database
• Indexing can be used in both static and dynamic database scenarios
• Trees are commonly used for indexing
• Extend the concept of indexing relational databases to indexing biometric databases
• Much more challenging: no concept of a primary key exists in biometric templates!
Pyramid Technique: spatial hashing
• Determine the pyramid (i) within which the template lies
• Determine the height (h) of the template from the apex
• The 1-D value = pyramid number (i) + height (h)
• Indexing is done using B+ trees
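The mapping to a 1-D value can be illustrated as follows. The sketch assumes the usual formulation of the Pyramid Technique: features are normalized to the unit hypercube [0, 1]^d, the apex of all 2d pyramids is the center point (0.5, ..., 0.5), and the pyramid is chosen by the dimension that deviates most from the center. A sorted Python list searched with `bisect` stands in for the B+ tree; the function names are hypothetical.

```python
import bisect


def pyramid_value(point):
    """Map a point in [0, 1]^d to the 1-D value: pyramid number (i) + height (h)."""
    d = len(point)
    # Dimension with the largest deviation from the center decides the pyramid.
    j_max = max(range(d), key=lambda j: abs(0.5 - point[j]))
    pyramid = j_max if point[j_max] < 0.5 else j_max + d
    height = abs(0.5 - point[j_max])  # distance from the apex along that dimension
    return pyramid + height


def build_index(templates):
    """Sorted (1-D value, template id) pairs; a stand-in for a B+ tree."""
    return sorted((pyramid_value(t), i) for i, t in enumerate(templates))


def lookup(index, lo, hi):
    """Ids of templates whose 1-D value lies in the interval [lo, hi]."""
    start = bisect.bisect_left(index, (lo, -1))
    stop = bisect.bisect_right(index, (hi, float("inf")))
    return [i for _, i in index[start:stop]]


if __name__ == "__main__":
    templates = [[0.2, 0.6], [0.6, 0.9], [0.45, 0.1]]  # made-up 2-D points
    index = build_index(templates)
    print(index)
    print(lookup(index, 1.0, 2.0))  # templates that fall in pyramid 1
```

Since the height never exceeds 0.5, the values of pyramid i always lie in [i, i + 0.5], so pyramids never collide in the 1-D key space.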
Various Indexing Techniques
• Grid Files
• KD Tree
• R Tree
• R+ Tree
• X Tree
• Pyramid Technique
Comparative Study
Method        | Scalable | Order Invariant | Dynamic | Range Query | No Overlap
Grid File     | Y        | Y               | N       | N           | Y
R Tree        | Y        | N               | N       | N           | N
R* Tree       | Y        | N               | N       | N           | N
R+ Tree       | Y        | N               | N       | N           | Y
KD Tree       | Y        | N               | N       | N           | Y
X Tree        | Y        | N               | Y       | Y           | Y
Pyramid Tech. | Y        | Y               | Y       | Y           | Y
Results of Indexing
• 35-dimensional hand geometry data
• Best penetration: 27%
• FRR = 0%
• Dataset: 450 in the training set and 450 in the testing set
• A parallel combination with signature will further reduce the search space
Multimodal Biometrics
2D Biometric: Signature & Fingerprint Fusion
[Figure: impostor score pairs and true match score pairs]
Optimal Fusion Algorithm: Signature Fused With Fingerprint
[Figure: Optimal Fusion ROC, accuracy (1-FRR) versus False Accept Rate (FAR), with the unrealizable performance area on one side and the suboptimal performance area on the other]
• The optimal ROC is the boundary between unrealizable performance and achievable but suboptimal performance
Optimal Fusion Algorithm Decision Regions
• 99.04% accuracy at a specified FAR of 1 in a million
[Figure: match zone and no-match zone, 1st biometric score axis versus 2nd biometric score axis]
• The decision region boundary is irregular due to the finite sample size; the more data, the smoother the boundaries
RSS Fusion Algorithm for Fingerprint & Signature Provides a Suboptimal Performance ROC
[Figure: RSS Fusion ROC compared with the Optimal ROC, accuracy (1-FRR) versus False Accept Rate (FAR)]
RSS Fusion Decision Regions
• 96.11% accuracy at a specified FAR of 1 in a million
[Figure: match zone and no-match zone, 1st biometric score axis versus 2nd biometric score axis]
OR Fusion Algorithm for Fingerprint & Signature Provides a Suboptimal Performance ROC
[Figure: OR Fusion ROC compared with the Optimal ROC, accuracy (1-FRR) versus False Accept Rate (FAR)]
OR Fusion Decision Regions
• 96.85% accuracy at a specified FAR of 1 in a million
[Figure: match zone and no-match zone, 1st biometric score axis versus 2nd biometric score axis]
AND Fusion Algorithm for Fingerprint & Signature Provides a Suboptimal Performance ROC
[Figure: AND Fusion ROC compared with the Optimal ROC, accuracy (1-FRR) versus False Accept Rate (FAR)]
AND Fusion Decision Regions
• 62.91% accuracy at a specified FAR of 1 in a million
[Figure: match zone and no-match zone, 1st biometric score axis versus 2nd biometric score axis]
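The three suboptimal rules compared above can be written as simple decision functions on a pair of match scores. This is a sketch under stated assumptions: RSS is taken to mean the root-sum-of-squares of the two scores, higher scores are taken to indicate better matches, and the thresholds are placeholders that would in practice be tuned to the specified FAR. The optimal fusion algorithm itself is not reproduced here.

```python
import math


def and_fusion(s1: float, s2: float, t1: float, t2: float) -> bool:
    """Match only if both biometrics exceed their thresholds."""
    return s1 >= t1 and s2 >= t2


def or_fusion(s1: float, s2: float, t1: float, t2: float) -> bool:
    """Match if either biometric exceeds its threshold."""
    return s1 >= t1 or s2 >= t2


def rss_fusion(s1: float, s2: float, t: float) -> bool:
    """Match if the root-sum-of-squares of the two scores exceeds one threshold."""
    return math.hypot(s1, s2) >= t


if __name__ == "__main__":
    s1, s2 = 0.9, 0.4  # made-up fingerprint and signature scores
    print("AND:", and_fusion(s1, s2, 0.5, 0.5))
    print("OR: ", or_fusion(s1, s2, 0.5, 0.5))
    print("RSS:", rss_fusion(s1, s2, 0.95))
```

Each rule carves a differently shaped match zone out of the score plane: a quadrant for AND, the complement of a quadrant for OR, and the outside of a quarter circle for RSS.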
Thank You