nov 2010 hug: fuzzy table - b.a.h

47
Fuzzy Table Distributed Fuzzy Matching Database Ed Kohlwey [email protected]

Upload: yahoo-developer-network

Post on 19-Jun-2015

6.851 views

Category:

Technology


1 download

TRANSCRIPT

Page 1: Nov 2010 HUG: Fuzzy Table - B.A.H

Fuzzy Table Distributed Fuzzy Matching Database

Ed Kohlwey

[email protected]

Page 2: Nov 2010 HUG: Fuzzy Table - B.A.H

Session Agenda Fuzzy Matching?

The Big Data Problem

A Scalable Solution

Performance

Questions?

2

Page 3: Nov 2010 HUG: Fuzzy Table - B.A.H

Fuzzy Matching?

3

Page 4: Nov 2010 HUG: Fuzzy Table - B.A.H

What is Fuzzy Matching?

*Euclidean Distance in this example

These images are very similar, but obliviously not the same.To find image #2 given image #1, some sort of fuzzy matching technique needs to be used

DistanceFunction*

31.46

Feature

Extraction &Normalization

Feature

Extraction &Normalization

1

2

Start with some multimediaimage/voice/audio/video/etc

Create a Vector or Matrix of doubles

4

Images from Flickr; Licensed under Creative Commons http://www.flickr.com/photos/mdpettitt/455527136/sizes/l/in/photostream/http://www.flickr.com/photos/mdpettitt/455539917/sizes/l/in/photostream/

Page 5: Nov 2010 HUG: Fuzzy Table - B.A.H

How Is Fuzzy Matching Being Used Today?

5

Page 6: Nov 2010 HUG: Fuzzy Table - B.A.H

Why Do We Care?

At the forefront of strategy and technology consulting for nearly a century

Deep functional knowledge spanning strategy and organization, technology, operations, and analytics

US government agencies in the defense, security, and civil sectors, as well as corporations, institutions, and not-for-profit organizations

6

Page 7: Nov 2010 HUG: Fuzzy Table - B.A.H

Biometrics – A Fuzzy Matching Problem

Same Person?Lifted From A Crime Scene

Law Enforcement Database

7

Page 8: Nov 2010 HUG: Fuzzy Table - B.A.H

Biometrics – Example

*Euclidean Distance in this example

DistanceFunction*

2.41

Feature

Extraction &Normalization

Feature

Extraction &Normalization

1

2

Query Biometrics Database

Create a Vector or Matrix of doubles

8

Page 9: Nov 2010 HUG: Fuzzy Table - B.A.H

The Big Data Problem

9

Page 10: Nov 2010 HUG: Fuzzy Table - B.A.H

Growth of Multimedia Databases Flickr – over 5 billion images ImageShack – over 20 billion unique images

http://techcrunch.com/2009/04/07/who-has-the-most-photos-of-them-all-hint-it-is-not-facebook/http://ksudigg.wetpaint.com/page/YouTube+Statisticshttp://techcrunch.com/2009/04/28/as-youtube-passes-a-billion-unique-us-viewers-hulu-rushes-into-third-place/

Youtube – over 6 billion videos Hulu – over 380 million videos

10

Page 11: Nov 2010 HUG: Fuzzy Table - B.A.H

Growth of Biometric Databases Combined U.S. government databases will soon

hold billions of identities DHS’s US-VISIT has the world’s largest and

fastest biometric database: over 110 million identities and 145,000 transactions daily*

The FBI’s Integrated Automated Fingerprint Identification System has 66 million identities with 8,000 added daily **

* US-VISIT: The world’s largest biometric application. William Graves.** http://www.fbi.gov/hq/cjisd/iafis/iafis_facts.htm*** http://www.business-standard.com/india/news/national-population-register-to-start-biometrics-data-collectiondec/399135/**** http://www.findbiometrics.com/articles/i/5220/***** http://www.alltrustnetworks.com/News/6Million/tabid/378/Default.aspx

US-VISIT

India working on a database of fingerprints and face images it’s population of 1.2 billion ***

European Union’s Biometric Matching System (supporting visa applications, immigration, and border control) will grow to 70 Million****

AllTrust Networks Paycheck Secure system has over 6 Million identities and has performed over 70 Million transactions*****

11

Page 12: Nov 2010 HUG: Fuzzy Table - B.A.H

Biometric Databases are a Big (Data) Problem Large scale operations

Searching and storing 100’s of millions to billions of Identities

Multiple biometric templates and raw files per identity Fingerprints, Faces, and Iris

New raw files and templates stored on each verification Computation to update models for identity

Results are expected in real time Cost efficient storage and retrieval is hard

Need innovative ways to reduce costs per match

500 M Identities x (16 KB to 300 KB) x (10 to 20)

= 1 – 2 PB

500 M Identities x (256 b to 3 KB) x (10 to 20)= 2 – 27 TB

12

Page 13: Nov 2010 HUG: Fuzzy Table - B.A.H

A Scalable Solution

13

Page 14: Nov 2010 HUG: Fuzzy Table - B.A.H

Hadoop and Multimedia Databases HDFS as file storage for petabytes worth of multimedia

(images/audio/video/etc) Redundancy Distribution

Mahout and MapReduce used for indexing and binning similar objects Improve overall search speeds

Improving feature selection by analyzing the entire database with MapReduce Select most effective features in distinguishing identities

N-to-N matching search (special type of Identification search) to cleanse database

What about low latency matching?

14

Page 15: Nov 2010 HUG: Fuzzy Table - B.A.H

Fuzzy Table: Large-scale, Low Latency, Fuzzy Matching Database Designed for biometric applications, has uses in other domains Enables fast parallel search of keys that cannot be effectively

ordered Biometrics Images Audio Video

Enabled by Mahout and MapReduce for binning, re-encoding, and other bulk data operations

Inherits nice features of Hadoop: Horizontal scalability over commodity hardware Distributed and parallel computation High reliability and redundancy

15

Page 16: Nov 2010 HUG: Fuzzy Table - B.A.H

Fuzzy Table Architecture

16

Page 17: Nov 2010 HUG: Fuzzy Table - B.A.H

Bulk Binning and Real-time Classification

17

* Efficient fingerprint search based on database clustering. Manhua Liu, Xudong Jiang, Alex Chichung Kot

Page 18: Nov 2010 HUG: Fuzzy Table - B.A.H

Fuzzy Table: Bulk Data Processing Component Canopy Clustering and K-means Clustering partitions data into

bins Reduces search space This concept is based on work done in academia*

Centroids from K-means clustering are used to create a “Bin classifier” Determines the best bins to search for a given key

{Key, Value} records are stored as Sequence Files in HDFS Spread across the cluster for optimal parallel searching

MapReduce is used for all other bulk or batch data processing Batch operations (many to many search, duplicate

detection) Encoding the raw files into feature vectors Feature evalutation

*Efficient Search and Retrieval in Biometric Databases by Amit Mhatre, Srinivas Palla, Sharat Chikkerur and Venu Govindaraju* Efficient fingerprint search based on database clustering. Manhua Liu, Xudong Jiang, Alex Chichung Kot

18

Page 19: Nov 2010 HUG: Fuzzy Table - B.A.H

19

Procedure

Page 20: Nov 2010 HUG: Fuzzy Table - B.A.H

Fuzzy Table: Data Storage and Bins Bins are represented as directories, contain chunk

files: /fuzzytable/_table_fingerprints/_bin_000001/_chunk_000001

Chunk files contain many {Key, Value} pairs Key is biometric template, Value is a reference to the

biographic record Chunks are same size as HDFS block to simplify data-local

search HDFS load balancing distributes data evenly across

cluster Enables parallel search Replication provides fault tolerance and speculative

execution of queries Data Servers only search local chunk files

Results returned in real-time as soon as a match is found Preserve principle of keeping computation next to data

20

Page 21: Nov 2010 HUG: Fuzzy Table - B.A.H

Low Latency Component

21

After data is organized, we want to retrieve it quickly

Does not use MapReduce MapReduce is high latency due to jar shipping,

other misc. tasks which support redundancy in the process

Need lightweight framework to perform realtime queries with minimum overhead

Provides real time matching and responses over Apache Avro-based protocol.

Page 22: Nov 2010 HUG: Fuzzy Table - B.A.H

Fuzzy Table Query

22

Page 23: Nov 2010 HUG: Fuzzy Table - B.A.H

Fuzzy Table Query

23

Page 24: Nov 2010 HUG: Fuzzy Table - B.A.H

Fuzzy Table Query

24

Page 25: Nov 2010 HUG: Fuzzy Table - B.A.H

Fuzzy Table Query

25

Page 26: Nov 2010 HUG: Fuzzy Table - B.A.H

Fuzzy Table Query

26

Page 27: Nov 2010 HUG: Fuzzy Table - B.A.H

Fuzzy Table Query

27

Page 28: Nov 2010 HUG: Fuzzy Table - B.A.H

Fuzzy Table Query

28

Page 29: Nov 2010 HUG: Fuzzy Table - B.A.H

Fuzzy Table Query

29

Page 30: Nov 2010 HUG: Fuzzy Table - B.A.H

Fuzzy Table: Optimizations Master Server HDFS Metadata Caching

The HDFS Namenode is a performance bottleneck for low latency searches

Master Server caches HDFS Block locations for all Fuzzytable files (Bins and Chunk Files)

Periodic refresh of the cache so its metadata is always fresh

Increased HDFS replication factor (Replication factor of N) Fuzzytable is close to a read only system Data replication enables speculative execution

Data Servers only perform searches against data that resides locally on disk

30

Page 31: Nov 2010 HUG: Fuzzy Table - B.A.H

Performance

31

Page 32: Nov 2010 HUG: Fuzzy Table - B.A.H

Performance and Scalability Testing On EC2 Employed EC2 for all testing Downloaded ~1 TB of images from Flickr (100 Nodes) Performed the Bulk Processing Components tasks across all

1 TB of images (80 nodes) Duplicate detection and removal Feature extraction and normalization Mahout’s canopy clustering Mahout’s k-means clustering Join Clusters with Features Post processing data into bins and chunk files

Run a series of test iterations against the low latency component

Querying increasing cluster sizes Queries performed using random images from the larger set

32

Page 33: Nov 2010 HUG: Fuzzy Table - B.A.H

Average Query Times

33

# Of Data Servers

Tim

e T

o R

esp

on

d (

ms)

Page 34: Nov 2010 HUG: Fuzzy Table - B.A.H

Average Query Times

34

# Of Data Servers

Tim

e T

o R

esp

on

d (

ms) Linear Scalability to ~ 7 Nodes

Lower limit due to I/O latencies

Page 35: Nov 2010 HUG: Fuzzy Table - B.A.H

Longest Query Times

35

# Of Data Servers

Tim

e T

o R

esp

ond (

ms)

Frequent Namenode access + large number of DFS clientsbegins to erode performance

Page 36: Nov 2010 HUG: Fuzzy Table - B.A.H

Shortest Query Times

36

# Of Data Servers

Tim

e T

o R

esp

ond (

ms)

~500 ms

Page 37: Nov 2010 HUG: Fuzzy Table - B.A.H

EC2 Results Discussion

37

Linear scalability – great! One data point shows 500 ms queries are possible

I/O Latency is a lower bound on average query response time Combined disk, network

Future enhancements Reduce disk penalty via hardware, cleaver data

structures, specialized data store Reliance on HDFS/Namenode for filesystem

metadata is another bottleneck Optimizations to HDFS client Distributed Namenode

Page 38: Nov 2010 HUG: Fuzzy Table - B.A.H

Performance and Scalability (Local)

38

Instrumented Master Server code Compared initial implementation that

accesses Namenode frequently with rework that caches filesystem metadata

Results matched those anticipated from EC2 testing

Page 39: Nov 2010 HUG: Fuzzy Table - B.A.H

Caching Performance

39

# Threads Polling The Master Server

Avera

ge R

esp

on

se T

ime (

ns)

Major discrepency, grows with load

Page 40: Nov 2010 HUG: Fuzzy Table - B.A.H

Conclusion & Future Work Large-scale, real-time Multimedia/Biometric Database

search is a hard problem And it’s becoming computationally more expensive as the

amount of data grows Hadoop is a potential solution to this problem MapReduce can be used for bulk processing to enable

distributed, low latency fuzzy matching over HDFS Hadoop is a great platform for solving all sorts of Big Data

and distributed computing problems, even for low latency searching

Future work Hadoop-level optimizations Currently implementing a new version based on Hbase which

supports online insertion and reorganization

40

Page 41: Nov 2010 HUG: Fuzzy Table - B.A.H

Contact Information – Cloud Computing Team

Booz Allen Hamilton Inc.134 National Business Parkway.

Annapolis Junction, Maryland 20701(301)543-4611

[email protected]

Michael RidleyAssociate

Booz Allen Hamilton Inc.134 National Business Parkway.

Annapolis Junction, Maryland 20701(301)543-4400

[email protected]

Jason TrostAssociate

Booz Allen Hamilton Inc.134 National Business Parkway.

Annapolis Junction, Maryland 20701(301)821-8000

[email protected]

Edmund KohlweySenior Consultant

Booz Allen Hamilton Inc.134 National Business Parkway.

Annapolis Junction, Maryland 20701(301)821-8000

[email protected]

Robert GordonAssociate

Booz Allen Hamilton Inc.134 National Business Parkway.

Annapolis Junction, Maryland 20701(301)617-3523

[email protected]

Jesse YatesConsultant

@jason_trost@ekohlwey

@jesse_yates

@mikeridley

41

Page 42: Nov 2010 HUG: Fuzzy Table - B.A.H

Thanks Lalit Kapoor (@idefine) – Former team

member Brandyn White (@brandynwhite) – Assistance

with Flickr image retrieval

42

Page 43: Nov 2010 HUG: Fuzzy Table - B.A.H

Questions

43

Page 44: Nov 2010 HUG: Fuzzy Table - B.A.H

Questions?

44

Page 45: Nov 2010 HUG: Fuzzy Table - B.A.H

Appendix

45

Page 46: Nov 2010 HUG: Fuzzy Table - B.A.H

Technologies Used Cloudera’s Distribution of Hadoop (CDH3)

MapReduce HDFS

Mahout Avro Amazon EC2 Ubuntu Linux Java Python Bash

46

Page 47: Nov 2010 HUG: Fuzzy Table - B.A.H

Fuzzy Table: Low Latency Fuzzy Matching Component Details The low latency component consists of three main

parts Client – submits queries for Keys and get back {Key, Value}

pairs Master Server – serve metadata about which Data Servers

host which bins Data Servers – Actually perform fuzzy matching searches

Data Servers perform fuzzy matching against Keys in order to find {Key, Value} records

double score = fuzzyMatcher.match(key, storedRec.getKey());

if(score >= threshold)

return storedRec;

Fuzzy matching searches are performed in parallel across many Data Servers

47