Indexing and Binning Large Databases

Upload: etana

Post on 18-Jan-2016

Page 1: Indexing and Binning Large Databases

Indexing and Binning Large Databases

Page 2: Indexing and Binning Large Databases

Abstract

Problems with large databases
• Biometric identification (1:N matching) does not scale well with size
• No established way to organize high-dimensional biometric data

Proposed solution
• Reduce the search space before 1:N matching
• Divide the database using clustering techniques

Contributions
• We analyze the effect of implementing a binning scheme on search performance and accuracy
• We present binning and pruning approaches using multiple biometrics
• Using hand geometry and signature, we have achieved a search space reduction of 95% without any FRR

Page 3: Indexing and Binning Large Databases

Background

• Only biometric identification (1:N matching) can prevent duplicate enrollments and double dipping
• Biometrics are being deployed for immigration and national ID applications
  • US-VISIT program
  • Voter ID and national ID programs [3]
• Database sizes can potentially run into millions
• Current research is focused only on accuracy
• Apart from accuracy, scalability, speed and efficiency also become important at this scale

Page 4: Indexing and Binning Large Databases

Challenges

Textual/Numeric Data
• Data is scalar (1D)
• Textual/numeric data can be linearly ordered and therefore easily indexed

Biometric Data
• Biometric templates are high dimensional
• No linear ordering or sorting method exists for biometric data

Page 5: Indexing and Binning Large Databases

Search space analysis

As the number of stored templates increases, template density (TD) also increases:

$$TD = \frac{N_C}{D_{IC}}$$

where $N_C$ is the cluster count and $D_{IC}$ is the average intra-cluster distance.

Page 6: Indexing and Binning Large Databases

Identification problem

The number of false positives grows quadratically with the size of the database.

Let FAR and FRR be the False Acceptance Rate (probability) and False Reject Rate (probability) for 1:1 matching. For 1:N matching,

$$FAR_N = 1 - (1 - FAR)^N \approx N \cdot FAR \quad (\text{for } N \cdot FAR \ll 1)$$

$$FRR_N = FRR$$

The total number of false accepts is given by

$$\text{False accepts} = N \cdot FAR_N = N \left(1 - (1 - FAR)^N\right) \approx N^2 \cdot FAR$$
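The scaling argument above can be checked numerically. The following is a minimal sketch, assuming the relation FAR_N = 1 - (1 - FAR)^N for 1:N matching and one identification search per enrolled user; the function names and the example values are illustrative and not taken from the slides.

```python
def far_identification(far: float, n: int) -> float:
    """Probability of at least one false accept when one probe is
    compared against n enrolled templates (1:N matching)."""
    return 1.0 - (1.0 - far) ** n


def expected_false_accepts(far: float, n: int) -> float:
    """Expected number of false accepts if each of the n enrolled
    users triggers one 1:N search (assumption of this sketch)."""
    return n * far_identification(far, n)


if __name__ == "__main__":
    far = 1e-6   # hypothetical 1:1 FAR of 0.0001 %
    n = 100_000  # hypothetical database size
    print("FAR_N          :", far_identification(far, n))
    print("N * FAR (bound):", n * far)
    print("False accepts  :", expected_false_accepts(far, n))
```

The linear term N * FAR is an upper bound on FAR_N, and the two stay close only while N * FAR is well below 1.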

Page 7: Indexing and Binning Large Databases

State of the Art

Biometric                State of the art                                                    Research problems
Fingerprint              0.15% FRR at 1% FAR (FVC 2002)                                      Fingerprint enhancement; partial fingerprint matching
Face Recognition         10% FRR at 1% FAR (FRVT 2002)                                       Improving accuracy; face alignment variation; handling lighting variations
Hand Geometry            4% FRR at 0% FAR (Transportation Security Administration tests)     Developing reliable models; identification problem
Signature Verification   1.5% (IBM Israel)                                                   Developing offline verification systems; handling skillful forgeries
Voice Verification       <1% FRR (current research)                                          Handling channel normalization; user habituation; text and language independence

Page 8: Indexing and Binning Large Databases

State of the Art

Biometric                State of the art                                Research problems
Fingerprint              0.15% FRR at 1% FAR (FVC 2002)                  Fingerprint enhancement; partial fingerprint matching
Face Recognition         10% FRR at 1% FAR (FRVT 2002)                   Improving accuracy; face alignment variation; handling lighting variations
Hand Geometry            2.6% FRR at 0.02% FAR (CUBS, SUNY-Buffalo)      Developing reliable models; identification problem
Signature Verification   1.5% (IBM Israel)                               Developing offline verification systems; handling skillful forgeries
Voice Verification       <1% FRR (current research)                      Handling channel normalization; user habituation; text and language independence

Page 9: Indexing and Binning Large Databases

Identification problem (contd.)

• Even if FAR = 0.0001%, the identification case gives about 1 false accept in 10 searches for N = 100,000 (lower bound)
• No single biometric is capable of meeting this security requirement individually
• Ways to reduce identification errors:
  • Reduce FAR
    • FAR is limited by the feature representation and the recognition algorithm
    • Cannot be reduced indefinitely
  • Reduce N
    • Classify or index the biometric database (e.g. the Henry classification system for fingerprints)
    • Index the records based on meta-data
• Can we do better?

Page 10: Indexing and Binning Large Databases

Fingerprint Features

Fingerprints can be classified based on the ridge flow pattern

Fingerprints can be distinguished based on the ridge characteristics

65% of fingerprints belong to the Loop class

Page 11: Indexing and Binning Large Databases

Henry Classification of Fingerprints

• [Ratha et al., 1996] used Henry classification on a database of 1800 templates, tested on 100 templates. Search space: 25%; FRR: 10%
• [Jain, Pankanti, 2000] ran a similar experiment on a database of 700 templates and achieved an FRR of 7.4% (focus on classification only)
• The state-of-the-art fingerprint classification system [Cappelli, Maio, Maltoni, Nanni, 2003] has an FRR of 4.8% for the 5-class problem and 3.7% for the 4-class problem
• Even though natural classes exist, classification is still non-trivial
• Natural classes do not exist for biometrics like hand geometry
• More sophistication is needed for partitioning the database

Page 12: Indexing and Binning Large Databases

Analysis of search space reduction

We can improve performance by reducing the search space during identification.

• Let $P_{SYS}$ be the penetration rate (between 0.0 and 1.0)
• The penetration rate is the average fraction of the database searched during identification
• Effective size = $N \cdot P_{SYS}$

For 1:N matching,

$$FAR_{N \cdot P_{SYS}} = 1 - (1 - FAR)^{N \cdot P_{SYS}} \approx N \cdot P_{SYS} \cdot FAR$$

$$FRR_{N \cdot P_{SYS}} = FRR$$

The total number of false accepts is given by

$$\text{False accepts} = N \cdot FAR_{N \cdot P_{SYS}} \approx N^2 \cdot P_{SYS} \cdot FAR$$

State-of-the-art fingerprint systems have $P_{SYS} = 0.5$.
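To make the effect of the penetration rate concrete, here is a small sketch that evaluates the identification FAR when only a fraction P_SYS of the database is searched. It assumes the relation 1 - (1 - FAR)^(N * P_SYS); the function name, the FAR value and the database size are illustrative assumptions.

```python
def far_with_penetration(far: float, n: int, p_sys: float) -> float:
    """Identification FAR when only a fraction p_sys of the n
    enrolled templates is searched (effective size n * p_sys)."""
    if not 0.0 <= p_sys <= 1.0:
        raise ValueError("p_sys must lie between 0.0 and 1.0")
    return 1.0 - (1.0 - far) ** (n * p_sys)


if __name__ == "__main__":
    far, n = 1e-6, 100_000  # hypothetical values
    for p_sys in (1.0, 0.5, 0.05):
        print(p_sys, far_with_penetration(far, n, p_sys))
```

With p_sys = 1.0 the function reduces to the plain 1:N case, so the same code covers both the binned and the unbinned system.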

Page 13: Indexing and Binning Large Databases

Effect of binning on accuracy

• For P_SYS < 0.2, the number of false accepts is almost constant
• Query response time scales by a factor of P_SYS
• Capabilities of a low-FAR system
  • Will allow us to screen immigrants at airports
  • Will make biometric systems more user-friendly by eliminating the need to remember PINs and IDs

[Figure: false accepts (0 to 10 × 10^7) versus database size N (0 to 10 × 10^5), with one curve for each P_SYS value of 1.0, 0.75, 0.5, 0.3 and 0.1]

Page 14: Indexing and Binning Large Databases

Binning

• Binning can be used to achieve a smaller P_SYS
• Partition the feature space
  • Each bin is represented by a cluster center C_K
  • Records are compared with only N_B cluster centers
  • Bin representatives are computed offline during training
• Challenges
  • How to handle clustering of large databases?
  • How to handle additions and deletions?

Page 15: Indexing and Binning Large Databases

Tradeoff

• Although binning reduces the search space, it introduces another source of identification error: the bin miss
• If the bin that holds the user's record is not searched, a false reject results no matter how good the matcher is
• Let P(B) be the probability of searching the correct bin
  • Binning increases the probability of false rejects
  • Not tolerable in security and screening applications
• Solution:
  • Use K-means clustering to find K bins
  • Check the N_S nearest bins for the record, such that P(B) = 1
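As a worked step, the bin-miss argument can be written out as follows. This is a sketch under the assumption that selecting the correct bin and the matcher's decision are independent events; the symbol $FRR_B$ (false reject rate of the binned system) is introduced here for illustration and does not appear in the slides.

```latex
% A genuine user is accepted only if the correct bin is searched
% AND the matcher accepts the template found there.
\begin{align}
  1 - FRR_B &= P(B)\,(1 - FRR) \\
  FRR_B     &= 1 - P(B)\,(1 - FRR) \;=\; FRR + \bigl(1 - P(B)\bigr)(1 - FRR)
\end{align}
% With P(B) = 1 the extra term vanishes and FRR_B = FRR.
```

Under this assumption, any P(B) below 1 adds false rejects on top of the matcher's own FRR, which is why enough neighbouring bins are searched to reach P(B) = 1.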

Page 16: Indexing and Binning Large Databases

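Formal definition of Binning

• In general, a biometric template may be represented as a vector
  $$x_i = [x_{i1}, x_{i2}, x_{i3}, x_{i4}, \ldots, x_{ik}]$$
• Vectors are grouped into N distinct clusters, each represented by a "code book vector"
  $$Y = [Y_1, Y_2, Y_3, Y_4, \ldots, Y_N], \quad Y_i \in \mathbb{R}^k$$
• The code book vectors divide the feature space into N distinct Voronoi regions
  $$V_i = \{\, x \in \mathbb{R}^k : \|x - Y_i\|^2 \le \|x - Y_j\|^2 \;\; \forall j \ne i \,\}$$
• Every template is closest to the mean (codebook vector) of the region it belongs to
  $$\bigcup_i V_i = \mathbb{R}^k \quad \text{and} \quad V_i \cap V_j = \emptyset \;\; (i \ne j)$$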
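The binning scheme described above (K-means codebook, then a search restricted to the N_S nearest bins) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the random 35-dimensional data, the plain Lloyd iteration and all function names are assumptions.

```python
import numpy as np


def squared_distances(data, centers):
    """Squared Euclidean distance from every row of data to every center."""
    return ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)


def kmeans(data, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns (codebook vectors, bin label per template)."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = squared_distances(data, centers).argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members):  # keep the old center if a bin is empty
                centers[j] = members.mean(axis=0)
    # Final assignment: each template goes to its Voronoi region.
    labels = squared_distances(data, centers).argmin(axis=1)
    return centers, labels


def candidate_indices(query, centers, labels, n_search):
    """Indices of templates stored in the n_search bins nearest to the query."""
    d2 = ((centers - query) ** 2).sum(axis=1)
    nearest_bins = np.argsort(d2)[:n_search]
    return np.flatnonzero(np.isin(labels, nearest_bins))


# Synthetic stand-in for 250 hand-geometry templates with 35 features.
rng = np.random.default_rng(1)
templates = rng.normal(size=(250, 35))
centers, labels = kmeans(templates, k=6)

probe = templates[0] + 0.05 * rng.normal(size=35)  # noisy re-capture
candidates = candidate_indices(probe, centers, labels, n_search=2)
print("penetration rate:", len(candidates) / len(templates))
```

Raising n_search trades a larger penetration rate for a lower chance of a bin miss; with n_search equal to the number of bins, the whole database is searched.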

Page 17: Indexing and Binning Large Databases

Search Space Partition: Voronoi Regions

Page 18: Indexing and Binning Large Databases

Hand Geometry Template

Feature extraction stages
• Image capture
• Binarization
• Contour extraction
• Noise removal

35 features are extracted
• 25 directly measured features
• 10 ratio and perimeter features

Page 19: Indexing and Binning Large Databases

Signature Template

11 features extracted
• Regression constants b0, b1
• Compactness
• Signature length
• Major stroke length
• Major stroke angle
• Connected components
• Hole count
• Hole area
• Stroke count
• Signing time

Page 20: Indexing and Binning Large Databases

Results

Signature Binning
• 11-dimensional signature data
• Best penetration: 35.57% for 6 bins
• FRR = 0%

[Figure: Signature Binning, penetration rate (0 to 80) versus number of bins (3 to 30)]

Hand Geometry Binning
• 35-dimensional hand geometry data
• Best penetration: 35.8% for 6 bins
• FRR = 0%

[Figure: Hand Geometry Binning, penetration rate (0 to 70) versus number of bins (3 to 30)]

$$P_{SYS} = \frac{N_S \cdot T_B}{N_D}$$

$N_D$: size of the database; $N_B$: number of bins; $N_S$: number of bins to search; $T_B$: average number of templates in a bin

Dataset: 250 in the training set and 250 in the testing set

Page 21: Indexing and Binning Large Databases

Multi-modal approach

• Resulting bins have very high template densities
• A different biometric modality should be used to classify templates within a bin
• Multimodal biometrics
  • Using multiple biometrics improves accuracy
  • It is difficult to forge multiple biometrics
  • Composite templates reduce template density
  • Statistical independence ensures that individual binning results are diverse
  • The search space (intersection of bins) is reduced due to low commonality between the individual binning results
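The intersection step can be illustrated with plain sets. This is a toy sketch, assuming each modality's binning returns the set of user IDs found in the bins it searched; the IDs, the database size and the function names are hypothetical.

```python
def combined_candidates(signature_ids, hand_ids):
    """Users that survive both binning stages (intersection of bins)."""
    return set(signature_ids) & set(hand_ids)


def penetration_rate(candidates, database_size):
    """Fraction of the database that still has to be matched."""
    return len(candidates) / database_size


if __name__ == "__main__":
    database_size = 20                      # hypothetical
    signature_ids = {1, 2, 3, 4, 5, 6, 7}   # from the signature bins searched
    hand_ids = {5, 6, 7, 8, 9, 10, 11}      # from the hand-geometry bins searched
    survivors = combined_candidates(signature_ids, hand_ids)
    print(sorted(survivors), penetration_rate(survivors, database_size))
```

The genuine user stays in the candidate set only if both modalities retrieve the correct bin, so each stage still has to search enough bins to avoid a bin miss.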

Page 22: Indexing and Binning Large Databases

Multi-Modal Approach

Page 23: Indexing and Binning Large Databases

Multi-Modal Approach

Search Space: 5% original database size; FRR – 0%

Page 24: Indexing and Binning Large Databases

Results of Combination

[Figure: penetration rate (0 to 45) versus number of bins (5 to 30) for Signature, Hand Geometry and Combination]

Best combined penetration rate of 5%

$$P_{SYS} = \frac{N_S \cdot T_B}{N_D}$$

$N_D$: size of the database; $N_S$: number of bins to search; $T_B$: average number of templates in a bin

Dataset: 250 in the training set and 250 in the testing set

Page 25: Indexing and Binning Large Databases

Binning v/s Indexing

• Applications can have frequent insertions of new templates
• Binning works well when the database is static
  • Insertions will require re-partitioning the entire database
• Indexing can be used in both static and dynamic database scenarios
  • Trees are commonly used for indexing
  • Extend the concept of indexing relational databases to indexing biometric databases
  • Much more challenging: no concept of a primary key exists in biometric templates!

Page 26: Indexing and Binning Large Databases

Pyramid Technique: spatial hashing

• Determine the pyramid (i) within which the template lies
• Determine the height (h) of the template from the apex
• The 1-D value = pyramid number (i) + height (h)
• Indexing is done using B+ trees
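The mapping from a template to its 1-D key can be sketched as below. The sketch follows the usual formulation of the Pyramid Technique and assumes every feature has been normalized to the range [0, 1], so that the apex of all pyramids is the center point (0.5, ..., 0.5); the function name is illustrative, and the B+ tree itself is not shown.

```python
def pyramid_value(point):
    """Map a d-dimensional point in [0, 1]^d to the 1-D key i + h.

    i is the pyramid number (0 .. 2d-1) and h is the height of the
    point, i.e. its distance from the center along the dominant axis.
    """
    d = len(point)
    # Dimension in which the point is farthest from the center 0.5.
    j_max = max(range(d), key=lambda j: abs(0.5 - point[j]))
    height = abs(0.5 - point[j_max])
    # Pyramids 0..d-1 lie below the center, d..2d-1 above it.
    pyramid = j_max if point[j_max] < 0.5 else j_max + d
    return pyramid + height


if __name__ == "__main__":
    print(pyramid_value([0.1, 0.5]))   # pyramid 0
    print(pyramid_value([0.5, 0.9]))   # pyramid 3 (dimension 1, upper side)
```

Because the height never exceeds 0.5, keys of different pyramids fall into disjoint intervals, which is what allows a one-dimensional index such as a B+ tree to store them.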

Page 27: Indexing and Binning Large Databases

Various Indexing Techniques

• Grid Files
• KD Tree
• R Tree
• R+ Tree
• X Tree
• Pyramid Technique

Page 28: Indexing and Binning Large Databases

Comparative Study

Method         Scalable   Order Invariant   Dynamic   Range Query   No Overlap
Grid File      Y          Y                 N         N             Y
R Tree         Y          N                 N         N             N
R* Tree        Y          N                 N         N             N
R+ Tree        Y          N                 N         N             Y
KD Tree        Y          N                 N         N             Y
X Tree         Y          N                 Y         Y             Y
Pyramid Tech   Y          Y                 Y         Y             Y

Page 29: Indexing and Binning Large Databases

Results of Indexing

• 35-dimensional hand geometry data
• Best penetration: 27%
• FRR = 0%
• Dataset: 450 in the training set and 450 in the testing set
• Parallel combination with signature will further reduce the search space

Page 30: Indexing and Binning Large Databases

Multimodal Biometrics

Page 31: Indexing and Binning Large Databases

2D Biometric: Signature & Fingerprint Fusion

[Figure: impostor score pairs and true match score pairs]

Page 32: Indexing and Binning Large Databases

Optimal Fusion Algorithm: Signature Fused With Fingerprint

[Figure: true match score pairs, impostor score pairs and the fusion algorithm; Optimal Fusion ROC plotting accuracy (1 - FRR) against False Accept Rate (FAR), with the unrealizable performance area on one side and the suboptimal performance area on the other]

The optimal ROC is the boundary between unrealizable performance and suboptimal performance.

Page 33: Indexing and Binning Large Databases

Optimal Fusion Algorithm Decision Regions: 99.04% Accuracy @ Specified FAR of 1 in a Million

[Figure: Match Zone and No-Match Zone, 1st biometric score axis versus 2nd biometric score axis]

• Irregular decision region boundary due to finite sample size
• The more data, the smoother the boundaries

Page 34: Indexing and Binning Large Databases

RSS Fusion Algorithm for Fingerprint & Signature Provides a Suboptimal Performance ROC

[Figure: true match score pairs, impostor score pairs and RSS fusion; RSS Fusion ROC compared with the Optimal ROC, accuracy (1 - FRR) versus False Accept Rate (FAR)]

Page 35: Indexing and Binning Large Databases

RSS Fusion Decision Regions: 96.11% Accuracy @ Specified FAR of 1 in a Million

[Figure: Match Zone and No-Match Zone, 1st biometric score axis versus 2nd biometric score axis]

Page 36: Indexing and Binning Large Databases

OR Fusion Algorithm for Fingerprint & Signature Provides a Suboptimal Performance ROC

[Figure: true match score pairs, impostor score pairs and OR fusion; OR Fusion ROC compared with the Optimal ROC, accuracy (1 - FRR) versus False Accept Rate (FAR)]

Page 37: Indexing and Binning Large Databases

OR Fusion Decision Regions: 96.85% Accuracy @ Specified FAR of 1 in a Million

[Figure: Match Zone and No-Match Zone, 1st biometric score axis versus 2nd biometric score axis]

Page 38: Indexing and Binning Large Databases

AND Fusion Algorithm for Fingerprint & Signature Provides a Suboptimal Performance ROC

[Figure: true match score pairs, impostor score pairs and AND fusion; AND Fusion ROC compared with the Optimal ROC, accuracy (1 - FRR) versus False Accept Rate (FAR)]

Page 39: Indexing and Binning Large Databases

AND Fusion Decision Regions: 62.91% Accuracy @ Specified FAR of 1 in a Million

[Figure: Match Zone and No-Match Zone, 1st biometric score axis versus 2nd biometric score axis]
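The three suboptimal fusion rules compared on the preceding slides can be sketched as simple decision functions on a pair of match scores. The slides do not give the exact formulas, so this sketch assumes the common readings: RSS as the root of the sum of squared scores compared with one threshold, and OR / AND as logical combinations of two per-biometric thresholds. Function names, thresholds and score scale are all assumptions.

```python
import math


def rss_fusion(score_1, score_2, threshold):
    """Match if the root-sum-square of the two scores reaches the threshold."""
    return math.hypot(score_1, score_2) >= threshold


def or_fusion(score_1, score_2, threshold_1, threshold_2):
    """Match if either biometric alone reaches its threshold."""
    return score_1 >= threshold_1 or score_2 >= threshold_2


def and_fusion(score_1, score_2, threshold_1, threshold_2):
    """Match only if both biometrics reach their thresholds."""
    return score_1 >= threshold_1 and score_2 >= threshold_2


if __name__ == "__main__":
    pair = (0.9, 0.1)  # hypothetical (fingerprint, signature) scores
    print("RSS:", rss_fusion(*pair, threshold=0.8))
    print("OR :", or_fusion(*pair, 0.5, 0.5))
    print("AND:", and_fusion(*pair, 0.5, 0.5))
```

Each rule carves a differently shaped match zone in the score plane (a circular boundary for RSS, an L-shaped one for OR, a rectangle for AND).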

Page 40: Indexing and Binning Large Databases

ROC

Page 41: Indexing and Binning Large Databases

Thank You