Indexing and Binning Large Databases

Abstract

Problems with large databases
• Biometric identification (1:N matching) does not scale well with size
• No established way to organize high-dimensional biometric data

Proposed solution
• Reduce the search space before 1:N matching
• Divide the database using clustering techniques

Contributions
• We analyze the effect of implementing a binning scheme on search performance and accuracy
• We present binning and pruning approaches using multiple biometrics
• Using hand geometry and signature, we have achieved a search space reduction of 95% without any false rejects (FRR = 0%)

Background

Only biometric identification (1:N matching) can prevent duplicate enrollments and double dipping

Biometrics are being deployed for immigration and national ID applications
• US-VISIT program
• Voter ID and national ID programs [3]

Database sizes can run into millions
• Current research is focused only on accuracy
• Apart from accuracy, scalability, speed and efficiency also become important at this scale

Challenges

Textual/Numeric Data

• Data is scalar (1D)
• Textual/numeric data can be linearly ordered and therefore easily indexed

Biometric Data

• Biometric templates are high dimensional
• No linear ordering or sorting method exists for biometric data

Search space analysis

As the number of stored templates increases, the template density (TD) also increases

N_C: cluster count
D_IC: average intra-cluster distance

[Equation: TD defined in terms of N_C and D_IC]

Identification problem

The number of false positives grows rapidly (approximately as N^2) with the size of the database

Let FAR and FRR be the False Accept Rate (probability) and False Reject Rate (probability) for 1:1 matching

For a 1:N matching,

$FRR_N = FRR$

$FAR_N = 1 - (1 - FAR)^N \approx N \cdot FAR$

The total number of false accepts is given by

$\text{False accepts} = N \cdot FAR_N = N \left(1 - (1 - FAR)^N\right) \approx N^2 \cdot FAR$
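The 1:N relations FAR_N = 1 - (1 - FAR)^N and false accepts = N · FAR_N can be checked numerically. The sketch below is only an illustration: the function names are made up for this example, and it assumes the N comparisons are independent, as the formula itself does.

```python
def identification_far(far: float, n: int) -> float:
    """Probability of at least one false accept when one probe is
    compared against n enrolled templates (independence assumed)."""
    return 1.0 - (1.0 - far) ** n


def false_accepts(far: float, n: int) -> float:
    """False-accept count in the sense used above: N * FAR_N."""
    return n * identification_far(far, n)


if __name__ == "__main__":
    far = 1e-6      # 1:1 FAR of 0.0001%
    n = 100_000     # database size
    print("exact FAR_N      :", identification_far(far, n))
    print("approx N * FAR   :", n * far)
    print("false accepts    :", false_accepts(far, n))
    print("approx N^2 * FAR :", n * n * far)
```

The linear approximation N · FAR slightly overestimates the exact FAR_N, and the gap widens as N · FAR approaches 1.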

State of the Art

Biometric               State of the art                                              Research problems
Fingerprint             0.15% FRR at 1% FAR (FVC 2002)                                Fingerprint enhancement; partial fingerprint matching
Face recognition        10% FRR at 1% FAR (FRVT 2002)                                 Improving accuracy; face alignment variation; handling lighting variations
Hand geometry           4% FRR at 0% FAR (Transportation Security Administration      Developing reliable models; identification problem
                        tests); 2.6% FRR at 0.02% FAR (CUBS, SUNY-Buffalo)
Signature verification  1.5% (IBM Israel)                                             Developing offline verification systems; handling skillful forgeries
Voice verification      <1% FRR (current research)                                    Handling channel normalization; user habituation; text and language independence

Identification problem (contd.)

Even if FAR = 0.0001%, false accepts = 1 in 10 for N = 100000 (lower bound) in the identification case, since FAR_N ≈ N · FAR = 10^-6 × 10^5 = 0.1.

No single biometric is capable of meeting this security requirement individually

Ways to reduce identification errors:
• Reduce FAR
  - FAR is limited by the feature representation and the recognition algorithm
  - Cannot be reduced indefinitely
• Reduce N
  - Classify or index the biometric database (e.g., the Henry classification system for fingerprints)
  - Index the records based on meta-data

Can we do better?

Fingerprint Features

Fingerprints can be classified based on the ridge flow pattern

Fingerprints can be distinguished based on the ridge characteristics

65% of fingerprints belong to the Loop class

Henry Classification of Fingerprints

[Ratha et al., 1996] used Henry classification on a database of 1800 templates, tested on 100 templates
• Search space: 25%; FRR: 10%

[Jain, Pankanti, 2000] ran a similar experiment on a database of 700 templates and achieved FRR: 7.4% (focus on classification only)

The state-of-the-art fingerprint classification system [Cappelli, Maio, Maltoni, Nanni, 2003] has FRR 4.8% for the 5-class problem and 3.7% for the 4-class problem

• Though natural classes exist for fingerprints, classification is still non-trivial
• Natural classes do not exist for biometrics like hand geometry
• More sophisticated methods are needed for partitioning the database

Analysis of search space reduction

We can improve performance by reducing the search space during identification

Let P_SYS be the penetration rate (between 0.0 and 1.0)
• Penetration rate is the average fraction of the database searched during identification
• Effective size = N · P_SYS

For a 1:N matching,

$FRR_{N \cdot P_{SYS}} = FRR$

$FAR_{N \cdot P_{SYS}} = 1 - (1 - FAR)^{N \cdot P_{SYS}} \approx N \cdot P_{SYS} \cdot FAR$

The total number of false accepts is given by

$\text{False accepts} = N \cdot P_{SYS} \cdot FAR_{N \cdot P_{SYS}} \approx (N \cdot P_{SYS})^2 \cdot FAR$

State-of-the-art fingerprint systems have P_SYS = 0.5
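To see how the penetration rate enters the error analysis, the sketch below evaluates false accepts = N · P_SYS · (1 - (1 - FAR)^(N · P_SYS)) for several penetration rates. The function name and the FAR and N values are assumptions chosen for illustration, not measurements from this work.

```python
def false_accepts_with_penetration(far: float, n: int, p_sys: float) -> float:
    """False accepts when only a fraction p_sys of the n templates is searched."""
    effective_n = n * p_sys
    far_effective = 1.0 - (1.0 - far) ** effective_n
    return effective_n * far_effective


if __name__ == "__main__":
    far = 1e-6      # hypothetical 1:1 FAR
    n = 100_000     # hypothetical database size
    for p_sys in (1.0, 0.75, 0.5, 0.3, 0.1):
        exact = false_accepts_with_penetration(far, n, p_sys)
        approx = (n * p_sys) ** 2 * far
        print(f"P_SYS={p_sys:4.2f}  exact={exact:10.2f}  approx={approx:10.2f}")
```

Because the effective size enters squared, halving P_SYS cuts the false accepts to roughly a quarter while N · P_SYS · FAR stays small.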

Effect of binning on accuracy

For P_SYS < 0.2, the false accepts are almost constant

Query response time improves by a factor of P_SYS

Capabilities of a low-FAR system
• Will allow us to screen immigrants at airports
• Will make biometric systems more user-friendly by eliminating the need to remember PINs and IDs

[Figure: false accepts (0 to 10 x 10^7) versus database size N (0 to 10 x 10^5), with one curve per penetration rate P_SYS = 1.0, 0.75, 0.5, 0.3 and 0.1]

Binning

Binning can be used to achieve a smaller P_SYS
• Partition the feature space
• Each bin is represented by a cluster center C_K
• Records are compared with only N_B cluster centers
• Bin representatives are computed offline during training

Challenges
• How to handle clustering of large databases?
• How to handle additions and deletions?

Tradeoff

Although binning reduces the search space, it introduces another source of identification error: the bin miss

If the bin containing the user's record is not searched, a false reject occurs no matter how good the matcher is

If P(B) is the probability of searching the correct bin, the false reject rate under binning is at least 1 - P(B)

Binning increases the probability of false rejects
• Not tolerable in security and screening applications

Solution:
• Use K-means clustering to find K bins
• Check the N_S nearest bins for the record, such that P(B) = 1
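The idea of clustering templates into K bins and then searching only the N_S bins nearest to a probe can be sketched as follows. This is a minimal illustration in plain Python, assuming Euclidean distance and Lloyd's K-means; the function names and the synthetic 2-D data are hypothetical and stand in for real templates.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_order(x, centers):
    """Indices of the cluster centers, nearest first."""
    return sorted(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))


def kmeans(points, k, iters=25, seed=0):
    """Plain Lloyd's K-means; returns k cluster centers (bin representatives)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest_order(p, centers)[0]].append(p)
        for j, group in enumerate(groups):
            if group:  # an empty bin keeps its previous center
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    return centers


def build_bins(points, centers):
    """Map each bin index to the ids (list positions) of the templates it holds."""
    bins = {j: [] for j in range(len(centers))}
    for template_id, p in enumerate(points):
        bins[nearest_order(p, centers)[0]].append(template_id)
    return bins


def candidates(query, centers, bins, n_s):
    """Template ids found in the n_s bins whose centers are nearest to the query."""
    found = []
    for j in nearest_order(query, centers)[:n_s]:
        found.extend(bins[j])
    return found


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic 2-D "templates" scattered around three made-up locations.
    enrolled = [[cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)]
                for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(50)]
    centers = kmeans(enrolled, k=6)
    bins = build_bins(enrolled, centers)
    # A noisy re-capture of enrolled template 10.
    probe = [enrolled[10][0] + 0.05, enrolled[10][1] - 0.05]
    for n_s in (1, 2, 6):
        hits = candidates(probe, centers, bins, n_s)
        print(n_s, len(hits) / len(enrolled), 10 in hits)
```

Raising n_s lowers the chance of a bin miss at the cost of a larger penetration rate; with n_s equal to the number of bins, the whole database is searched.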

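Formal definition of Binning

In general, a biometric template may be represented as a vector

$x_i = [x_{i1}, x_{i2}, x_{i3}, x_{i4}, \ldots, x_{ik}] \in \mathbb{R}^k$

The vectors are grouped into N distinct clusters, each represented by a 'code book vector'

$Y = [Y_1, Y_2, Y_3, Y_4, \ldots, Y_N], \quad Y_i \in \mathbb{R}^k$

The code book vectors divide the feature space into N distinct Voronoi regions

$V_i = \{\, x : \|x - Y_i\|^2 \le \|x - Y_j\|^2 \;\; \forall j \ne i \,\}$

$\bigcup_i V_i = \mathbb{R}^k \quad \text{and} \quad V_i \cap V_j = \emptyset \;\; (i \ne j)$

Every template is closest to the mean (code book vector) of the region it belongs to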

Search Space Partition: Voronoi Regions

Hand Geometry Template

Feature extraction stages
• Image capture
• Binarization
• Contour extraction
• Noise removal

35 features are extracted
• 25 directly measured features
• 10 ratio and perimeter features

Signature Template

11 features extracted
• Regression constants b0, b1
• Compactness
• Signature length
• Major stroke length
• Major stroke angle
• Connected components
• Hole count
• Hole area
• Stroke count
• Signing time

Results

Signature Binning

[Figure: penetration rate (axis range 0 to 80) versus number of bins (3 to 30) for signature binning]

11-dimensional signature data
Best penetration: 35.57% for 6 bins
FRR = 0%

Hand Geometry Binning

[Figure: penetration rate (axis range 0 to 70) versus number of bins (3 to 30) for hand geometry binning]

35-dimensional hand geometry data
Best penetration: 35.8% for 6 bins
FRR = 0%

$P_{SYS} = \frac{N_S \cdot T_B}{N_D}$

N_S: number of bins to search
T_B: average number of templates in a bin
N_B: number of bins
N_D: size of the database

Dataset: 250 in the training set and 250 in the testing set
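The penetration rate P_SYS = N_S · T_B / N_D is the average fraction of the database searched. A small sketch can compare the formula with a direct measurement over a set of queries. All bin sizes and query traces below are hypothetical, as are the function names.

```python
def penetration_from_formula(n_s: int, t_b: float, n_d: int) -> float:
    """P_SYS = N_S * T_B / N_D."""
    return n_s * t_b / n_d


def empirical_penetration(bin_sizes, searched_bins_per_query) -> float:
    """Average fraction of the database touched over a set of queries.

    bin_sizes[j] is the number of templates in bin j;
    searched_bins_per_query holds, per query, the bin indices searched.
    """
    n_d = sum(bin_sizes)
    fractions = [sum(bin_sizes[j] for j in searched) / n_d
                 for searched in searched_bins_per_query]
    return sum(fractions) / len(fractions)


if __name__ == "__main__":
    bin_sizes = [50, 50, 50, 50, 50]        # made-up, equally sized bins
    queries = [[0, 1], [2, 3], [1, 4]]      # each query searches N_S = 2 bins
    print(penetration_from_formula(2, 50, 250))
    print(empirical_penetration(bin_sizes, queries))
```

With equally sized bins the two values agree; with unbalanced bins the empirical average depends on which bins the queries actually hit.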

Multi-modal approach

Resulting bins have very high template densities
• A different biometric modality should be used to classify templates within a bin

Multimodal biometrics
• Using multiple biometrics improves accuracy
• It is difficult to forge multiple biometrics
• Composite templates reduce template density
• Statistical independence ensures that individual binning results are diverse
• The search space (intersection of bins) is reduced due to low commonality between the individual binning results
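The reduction obtained by intersecting the bins returned for each modality can be illustrated with plain set operations. In this sketch the user ids and the two candidate lists are invented; in practice each list would come from binning one modality (for example signature and hand geometry).

```python
def combined_candidates(candidates_a, candidates_b):
    """Users that appear in the searched bins of BOTH modalities."""
    return set(candidates_a) & set(candidates_b)


def penetration(candidates, database_size: int) -> float:
    """Fraction of the database that remains to be matched."""
    return len(set(candidates)) / database_size


if __name__ == "__main__":
    database_size = 20
    signature_bin_users = [0, 1, 2, 3, 4, 5, 6, 7]       # hypothetical
    hand_bin_users = [5, 6, 7, 8, 9, 10, 11, 12]         # hypothetical
    both = combined_candidates(signature_bin_users, hand_bin_users)
    print(sorted(both))
    print(penetration(signature_bin_users, database_size),
          penetration(hand_bin_users, database_size),
          penetration(both, database_size))
```

The genuine user survives the intersection only if both modalities select the correct bin, so a bin miss in either modality becomes a false reject.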

Multi-Modal Approach

Search Space: 5% original database size; FRR – 0%

Results of Combination

[Figure: penetration rate (axis range 0 to 45) versus number of bins (5 to 30) for Signature, Hand Geometry and Combination]

Best combined penetration rate of 5%

$P_{SYS} = \frac{N_S \cdot T_B}{N_D}$

N_S: number of bins to search
T_B: average number of templates in a bin
N_D: size of the database


Binning v/s Indexing

Applications can have frequent insertions of new templates
• Binning works well when the database is static
• Insertions will require re-partitioning the entire database

Indexing can be used in both static and dynamic database scenarios
• Trees are commonly used for indexing
• Extend the concept of indexing relational databases to indexing biometric databases
• Much more challenging: no concept of a primary key exists in biometric templates!

Pyramid Technique: spatial hashing

• Determine the pyramid (i) within which the template lies
• Determine the height (h) of the template from the apex
• The 1-D value = pyramid number (i) + height (h)
• Indexing is done using B+ trees
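The mapping from a template to a single key (pyramid number plus height) can be sketched as below. The sketch assumes feature vectors normalized to the unit hypercube [0, 1]^d with the apex of every pyramid at the center (0.5, ..., 0.5), and it uses a sorted list with `bisect` as a stand-in for a B+ tree. The class and function names are hypothetical.

```python
import bisect


def pyramid_value(v):
    """Map a point in [0, 1]^d to the 1-D key i + h."""
    d = len(v)
    # Dimension in which the point lies farthest from the center.
    j_max = max(range(d), key=lambda j: abs(0.5 - v[j]))
    # Pyramids 0..d-1 lie below the center, d..2d-1 above it.
    i = j_max if v[j_max] < 0.5 else j_max + d
    h = abs(0.5 - v[j_max])      # height measured from the apex
    return i + h


class PyramidIndex:
    """Sorted-list stand-in for a B+ tree keyed by pyramid value."""

    def __init__(self):
        self._keys = []
        self._ids = []

    def insert(self, template_id, v):
        key = pyramid_value(v)
        pos = bisect.bisect_right(self._keys, key)
        self._keys.insert(pos, key)
        self._ids.insert(pos, template_id)

    def key_range(self, lo, hi):
        """Template ids whose pyramid value lies in [lo, hi]."""
        a = bisect.bisect_left(self._keys, lo)
        b = bisect.bisect_right(self._keys, hi)
        return self._ids[a:b]


if __name__ == "__main__":
    index = PyramidIndex()
    index.insert("t0", [0.1, 0.6, 0.5])   # pyramid 0
    index.insert("t1", [0.5, 0.9, 0.5])   # pyramid 4 (dimension 1, upper side)
    index.insert("t2", [0.4, 0.5, 0.2])   # pyramid 2
    print(index.key_range(4.0, 4.5))      # everything stored in pyramid 4
```

Because heights never exceed 0.5, the keys of pyramid i fall in [i, i + 0.5], so pyramids never overlap on the 1-D axis and new templates can be inserted without re-partitioning.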

Various Indexing Techniques

• Grid Files
• KD Tree
• R Tree
• R+ Tree
• X Tree
• Pyramid Technique

Comparative Study

Method         Scalable  Order Invariant  Dynamic  Range Query  No Overlap
Grid File      Y         Y                N        N            Y
R Tree         Y         N                N        N            N
R* Tree        Y         N                N        N            N
R+ Tree        Y         N                N        N            Y
KD Tree        Y         N                N        N            Y
X Tree         Y         N                Y        Y            Y
Pyramid Tech   Y         Y                Y        Y            Y

Results of Indexing

35-dimensional hand geometry data
Best penetration: 27%
FRR = 0%


Parallel combination with signature will further reduce the search space

Multimodal Biometrics

2D Biometric: Signature & Fingerprint Fusion

[Figure: scatter plots of impostor score pairs and true match score pairs]

Optimal Fusion Algorithm: Signature Fused With Fingerprint

[Figure: optimal fusion ROC, accuracy (1-FRR) versus False Accept Rate (FAR), obtained by applying the fusion algorithm to the true match score pairs and impostor score pairs; regions labeled "Unrealizable Performance Area" and "Suboptimal Performance Area"]

The ROC is the boundary between what is possible and suboptimal performance.

Optimal Fusion Algorithm Decision Regions: 99.04% accuracy at a specified FAR of 1 in a million

[Figure: Match Zone and No-Match Zone plotted on the 1st biometric score axis versus the 2nd biometric score axis]

The decision region boundary is irregular due to the finite sample size; the more data, the smoother the boundaries.

RSS Fusion Algorithm for Fingerprint & Signature Provides a Suboptimal Performance ROC

[Figure: RSS fusion ROC compared with the optimal ROC, accuracy (1-FRR) versus FAR]

RSS Fusion Decision Regions: 96.11% accuracy at a specified FAR of 1 in a million

[Figure: RSS fusion decision regions; vertical axis: 2nd biometric score]


OR Fusion Algorithm for Fingerprint & Signature Provides a Suboptimal Performance ROC

[Figure: OR fusion ROC compared with the optimal ROC, accuracy (1-FRR) versus FAR]

OR Fusion Decision Regions: 96.85% accuracy at a specified FAR of 1 in a million

[Figure: OR fusion decision regions; vertical axis: 2nd biometric score]


AND Fusion Algorithm for Fingerprint & Signature Provides a Suboptimal Performance ROC

[Figure: AND fusion ROC compared with the optimal ROC, accuracy (1-FRR) versus FAR]

AND Fusion Decision Regions: 62.91% accuracy at a specified FAR of 1 in a million

[Figure: AND fusion decision regions; vertical axis: 2nd biometric score]
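The three fixed fusion rules compared here (RSS, OR, AND) can be written as simple decision functions on a pair of match scores. This sketch rests on several assumptions: higher scores mean better matches, RSS is read as the root sum of squares of the two scores, and the thresholds and score pairs are invented for illustration. It does not reproduce the optimal fusion algorithm or the reported accuracies.

```python
import math


def and_rule(s1, s2, t1, t2):
    """Accept only if BOTH scores clear their thresholds."""
    return s1 >= t1 and s2 >= t2


def or_rule(s1, s2, t1, t2):
    """Accept if EITHER score clears its threshold."""
    return s1 >= t1 or s2 >= t2


def rss_rule(s1, s2, t):
    """Accept if the root sum of squares of the two scores clears t."""
    return math.hypot(s1, s2) >= t


def rates(genuine_pairs, impostor_pairs, decide):
    """Return (accuracy, FAR), where accuracy = 1 - FRR, for a decision function."""
    accuracy = sum(decide(a, b) for a, b in genuine_pairs) / len(genuine_pairs)
    far = sum(decide(a, b) for a, b in impostor_pairs) / len(impostor_pairs)
    return accuracy, far


if __name__ == "__main__":
    genuine = [(0.9, 0.8), (0.9, 0.2), (0.7, 0.6)]    # made-up score pairs
    impostor = [(0.1, 0.2), (0.6, 0.1), (0.2, 0.3)]   # made-up score pairs
    print("AND", rates(genuine, impostor, lambda a, b: and_rule(a, b, 0.5, 0.5)))
    print("OR ", rates(genuine, impostor, lambda a, b: or_rule(a, b, 0.5, 0.5)))
    print("RSS", rates(genuine, impostor, lambda a, b: rss_rule(a, b, 0.8)))
```

Each rule fixes the shape of the decision region in the score plane (a corner for AND, an L-shape for OR, a quarter circle for RSS), which is one way to read why each yields a suboptimal ROC.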


Thank You
