Bloom Filters

Upload: jael-glenn

Post on 31-Dec-2015

TRANSCRIPT

Page 1: Bloom Filters

1Bloom Filters

Bloom Filters

Lookup questions: Does item “x” exist in a set or multiset?

Data set may be very big or expensive to access. Filter lookup questions with negative results before accessing data.

Allow false positive errors, as they only cost us an extra data access.

Don’t allow false negative errors, because they result in wrong answers.

Page 2: Bloom Filters

Bloom Filter [B70]

Encoding an attribute a ∈ U:

Maintain a bit vector V of size m.

Use k hash functions (h1..hk), hi: U → [1..m].

Encoding: For item x, “turn on” bits V[h1(x)]..V[hk(x)].

Lookup: For item i, check bits V[h1(i)]..V[hk(i)]. If all equal 1, return “Probably Yes”. Else “Definitely No”.
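
The encoding and lookup steps above can be sketched in Python as follows. This is a minimal illustration rather than the construction from [B70]: the class name `BloomFilter`, the method names, the use of salted SHA-256 digests to simulate the k hash functions, and the 0-based bit positions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Bit vector V of size m with k hash functions (illustrative sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # the bit vector V, all bits initially off

    def _positions(self, item: str) -> list[int]:
        # Assumption: h_i(x) is simulated by hashing the item with salt i.
        # Positions are 0-based (0..m-1) rather than 1..m as on the slide.
        positions = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest, "big") % self.m)
        return positions

    def add(self, item: str) -> None:
        # Encoding: turn on bits V[h_1(x)] .. V[h_k(x)].
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # Lookup: True means "Probably Yes", False means "Definitely No".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1024, k=4)
bf.add("http://example.com/a")
print(bf.might_contain("http://example.com/a"))  # True: no false negatives
```

An inserted item is always reported as present, because its bits are never turned off. A `True` answer for an item that was never inserted is the false positive the filter tolerates.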

Page 3: Bloom Filters

Bloom Filter

[Figure: an item x is hashed by h1(x), h2(x), h3(x), ..., hk(x) into positions of the bit vector V[0..m-1], shown as 01000 10100 00010.]

Page 4: Bloom Filters

Bloom Errors

[Figure: the bit vector V[0..m-1] (01000 10100 00010) with items a, b, c, d, and the positions h1(x), h2(x), h3(x), ..., hk(x) of an item x.]

x didn’t appear, yet its bits are already set
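
A tiny hand-built case makes this error concrete. The assignment of bit positions to items below is made up for illustration (the slide does not give it), and only two items are inserted. The positions were chosen so that the resulting vector matches the bit pattern 01000 10100 00010 shown on the slide.

```python
m = 15
V = [0] * m

# Hypothetical hash positions (k = 3), chosen by hand for illustration.
positions = {
    "a": [1, 5, 7],
    "b": [5, 7, 13],
    "x": [1, 7, 13],  # x is never inserted
    "y": [2, 7, 13],  # y is never inserted either
}

# Insert only a and b.
for item in ("a", "b"):
    for pos in positions[item]:
        V[pos] = 1


def lookup(item: str) -> bool:
    """Return True if every bit for the item is set."""
    return all(V[pos] == 1 for pos in positions[item])


print("".join(map(str, V)))  # 010001010000010
print(lookup("x"))  # True: false positive, all of x's bits were set by a and b
print(lookup("y"))  # False: bit 2 is still 0, so y is definitely absent
```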

Page 5: Bloom Filters

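Error Estimation

Assumption: Hash functions are perfectly random.

Probability of a bit being 0 after hashing all n elements:

$$\left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} = e^{-\gamma}, \qquad \gamma = \frac{nk}{m}$$

Let $p = e^{-kn/m}$. The probability of a false positive is:

$$f = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k} = (1 - p)^{k}$$

Assuming we are given m and n, the optimal k is:

$$f = \exp\left(k \ln\left(1 - e^{-kn/m}\right)\right), \qquad g = k \ln\left(1 - e^{-kn/m}\right)$$

$$\frac{dg}{dk} = \ln\left(1 - e^{-kn/m}\right) + \frac{kn}{m} \cdot \frac{e^{-kn/m}}{1 - e^{-kn/m}}$$

$$\frac{dg}{dk} = 0 \;\Rightarrow\; k_{\min} = (\ln 2)\,\frac{m}{n}$$

$$f(k_{\min}) = (1/2)^{k_{\min}} = (0.6185)^{m/n}$$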
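
As a numerical check, the following sketch evaluates the approximate false-positive rate f = (1 - e^(-kn/m))^k and the optimal k = (ln 2)(m/n). The function names and the example sizes (m = 8000 bits, n = 1000 items) are assumptions chosen for illustration.

```python
import math


def false_positive_rate(m: int, n: int, k: float) -> float:
    """Approximate false-positive probability f = (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(m: int, n: int) -> float:
    """Real-valued k that minimizes f for given m and n: k = (ln 2) * m / n."""
    return math.log(2) * m / n


m, n = 8000, 1000  # assumed example sizes: 8 bits per stored item
k = optimal_k(m, n)
print(f"k_min        = {k:.3f}")
print(f"f(k_min)     = {false_positive_rate(m, n, k):.5f}")
print(f"(1/2)^k_min  = {0.5 ** k:.5f}")
print(f"0.6185^(m/n) = {0.6185 ** (m / n):.5f}")
```

The last three printed values should agree to several decimal places. In practice k must be an integer, so one of the two integers nearest to k_min would be used.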

Page 6: Bloom Filters

Bloom Filter Tradeoffs

Three factors: m, k and n. Normally, n and m are given, and we select k.

Small k:

– Fewer computations.

– The actual number of bits accessed (nk) is smaller, so the chance of a “step over” is smaller too.

– However, fewer bits need to be stepped over to generate an error.

For big k, the exact opposite holds.

Not surprisingly, when k is optimal, the “hit ratio” (ratio of bits flipped in the array) is exactly 0.5.
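
The trade-off can be seen by tabulating, for several values of k, the expected fraction of bits set (1 - e^(-kn/m)) and the approximate false-positive rate (that fraction raised to the power k). The ratio m/n = 8 is an assumed example value, and the function names are illustrative.

```python
import math

bits_per_item = 8.0  # assumed example ratio m/n


def fill_ratio(k: float) -> float:
    """Expected fraction of bits turned on after inserting n items."""
    return 1.0 - math.exp(-k / bits_per_item)


def false_positive_rate(k: float) -> float:
    """Approximate probability that all k probed bits are already set."""
    return fill_ratio(k) ** k


for k in range(1, 11):
    print(f"k={k:2d}  fill={fill_ratio(k):.3f}  f={false_positive_rate(k):.5f}")

k_opt = math.log(2) * bits_per_item
print(f"optimal k = {k_opt:.2f}, fill ratio there = {fill_ratio(k_opt):.3f}")
```

With a small k the array stays sparse, but few set bits are needed for an error. With a large k the array fills up, so each probe is more likely to hit a set bit. The error rate is lowest between the two, where about half the bits are set.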

Page 7: Bloom Filters

Summary Cache [FCAB00]

Proxy servers maintain local cache to minimize expensive internet requests.

Proxy must maintain an efficient lookup method into the cache.

The lookup structure must be stored in DRAM for performance.

Structure must be compact, as DRAM is expensive and is used for “Hot Items” storage and more.

Pages are usually replaced in the cache using an LRU algorithm.

Page 8: Bloom Filters

ICP – Request Handling

[Figure: a client, four proxies (each with its own cache), and the Internet.]

Page 9: Bloom Filters

Internet Cache Protocol (ICP)

Allows for scaling out when using proxies.

A protocol that supports discovery and retrieval of documents from neighboring caches.

Establishes a hierarchy of proxy caches.

If a page is not found in the local proxy cache, the proxy searches for the page in neighboring proxies.

If the page is not found anywhere, the proxy fetches it from the Internet.
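
The lookup order described above can be sketched as follows. This is a simplified model, not the ICP wire protocol: the `Proxy` class, its `query` and `resolve` methods, and the `fetch_from_internet` callback are hypothetical names, neighbor queries are plain method calls instead of network messages, and caching a page obtained from a neighbor is an assumption.

```python
from typing import Callable, Optional


class Proxy:
    def __init__(self, name: str) -> None:
        self.name = name
        self.cache: dict[str, str] = {}
        self.neighbors: list["Proxy"] = []

    def query(self, url: str) -> Optional[str]:
        # Stand-in for an inter-proxy query: return the page if cached here.
        return self.cache.get(url)

    def resolve(
        self, url: str, fetch_from_internet: Callable[[str], str]
    ) -> tuple[str, str]:
        """Return (page, source) following local -> neighbors -> Internet."""
        # 1. Local cache.
        if url in self.cache:
            return self.cache[url], f"local:{self.name}"
        # 2. Neighboring proxies, asked one by one until one has the page.
        for neighbor in self.neighbors:
            page = neighbor.query(url)
            if page is not None:
                self.cache[url] = page
                return page, f"neighbor:{neighbor.name}"
        # 3. Not found anywhere: fetch from the Internet.
        page = fetch_from_internet(url)
        self.cache[url] = page
        return page, "internet"


p1, p2 = Proxy("p1"), Proxy("p2")
p1.neighbors.append(p2)
p2.cache["http://example.com/"] = "<html>cached</html>"

origin = lambda url: "<html>origin</html>"
print(p1.resolve("http://example.com/", origin)[1])  # neighbor:p2
print(p1.resolve("http://example.org/", origin)[1])  # internet
```

Note that every local miss costs one query per neighbor, which is the overhead that Summary Cache aims to reduce.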

Page 10: Bloom Filters

ICP – Request Handling

[Figure: a client, four proxies (each with its own cache), and the Internet.]

Page 11: Bloom Filters

Summary Cache

Each proxy maintains a Bloom Filter representing its local cache.

Also, it holds Bloom Filters representing caches of other proxies.

Updates to Bloom Filters are exchanged periodically, or after a certain percentage of the documents in the cache has been replaced.

An ICP request is sent only to a proxy whose Bloom Filter indicates that it holds the requested document.
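
A sketch of how the stored summaries narrow down which neighbors to query. The function names, the filter parameters (m = 256, k = 4), and the salted SHA-256 hashing are assumptions for illustration and are not the scheme Summary Cache actually uses.

```python
import hashlib


def positions(url: str, m: int, k: int) -> list[int]:
    # Assumed hashing: salted SHA-256 digests reduced modulo m.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{url}".encode()).digest(), "big") % m
        for i in range(k)
    ]


def make_summary(urls: list[str], m: int = 256, k: int = 4) -> list[int]:
    """Build the Bloom filter bit vector summarizing one proxy's cache."""
    bits = [0] * m
    for url in urls:
        for pos in positions(url, m, k):
            bits[pos] = 1
    return bits


def candidates(url: str, summaries: dict[str, list[int]], k: int = 4) -> list[str]:
    """Names of the neighbors whose summary says they may hold the URL."""
    return [
        name
        for name, bits in summaries.items()
        if all(bits[pos] for pos in positions(url, len(bits), k))
    ]


# Local copies of the neighbors' summaries, refreshed periodically.
summaries = {
    "proxy_b": make_summary(["http://example.com/a", "http://example.com/b"]),
    "proxy_c": make_summary(["http://example.com/c"]),
}

targets = candidates("http://example.com/c", summaries)
print(targets)  # includes "proxy_c"; any other name would be a false hit
```

Only the proxies in `targets` would receive a query. If the list is empty, the request can go straight to the Internet.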

Page 12: Bloom Filters

ICP – With Summary Cache

[Figure: a client, four proxies (each with its own cache), and the Internet.]

Page 13: Bloom Filters

Summary Cache – Bloom Filters

To support deletions and updates, the proxy maintains the Bloom Filter and also an array of counters C, initially set to 0.

The Bloom Filter is filled with the contents of the cache.

Each bit in the BF is allotted a 4-bit counter.

On insert of item i, all counters C[hj(i)] are increased (to a maximum of 15).

On deletion of item i, the counters C[hj(i)] are decreased.

When a counter C[b] increases from 0 to 1, bit V[b] is turned on.

When a counter C[b] decreases from 1 to 0, bit V[b] is turned off.
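
The counter scheme can be sketched as below. The class and method names and the salted SHA-256 hashing are assumptions for illustration. The sketch also assumes that only previously inserted items are deleted and does not treat saturated counters specially on deletion.

```python
import hashlib


class CountingBloomFilter:
    MAX_COUNT = 15  # largest value a 4-bit counter can hold

    def __init__(self, m: int, k: int) -> None:
        self.m, self.k = m, k
        self.V = [0] * m  # bit vector
        self.C = [0] * m  # one counter per bit, initially 0

    def _positions(self, item: str) -> list[int]:
        # Assumed hashing: salted SHA-256 digests reduced modulo m.
        return [
            int.from_bytes(hashlib.sha256(f"{j}:{item}".encode()).digest(), "big")
            % self.m
            for j in range(self.k)
        ]

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            if self.C[pos] == 0:
                self.V[pos] = 1  # counter goes 0 -> 1: turn the bit on
            if self.C[pos] < self.MAX_COUNT:
                self.C[pos] += 1  # saturate at 15

    def delete(self, item: str) -> None:
        # Assumes the item was inserted earlier.
        for pos in self._positions(item):
            if self.C[pos] > 0:
                self.C[pos] -= 1
                if self.C[pos] == 0:
                    self.V[pos] = 0  # counter goes 1 -> 0: turn the bit off

    def contains(self, item: str) -> bool:
        return all(self.V[pos] for pos in self._positions(item))


cbf = CountingBloomFilter(m=128, k=4)
cbf.insert("http://example.com/a")
cbf.insert("http://example.com/b")
cbf.delete("http://example.com/a")
print(cbf.contains("http://example.com/b"))  # True: b's bits survive the delete
```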

Page 14: Bloom Filters

Summary Cache – Bloom Filters

Hashing scheme:

– Generate 128 bits using MD5 on the URL.

– Divide into segments of M bits (usually 32).

– Calculate the modulus of each segment by m, providing 128/M hash values (4, for 32-bit segments).

– If 128 bits are not enough, calculate MD5 of the URL concatenated with itself.

Bloom Filter Exchange:

– Header contains MD5 properties, size of array.

– If refresh rate is high, send only deltas.

– Otherwise, send entire Bloom Filter.

– Bit counts are internal and not exchanged.
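
The hashing scheme can be sketched with Python's `hashlib.md5`. The function name `url_hashes`, the big-endian reading of each segment, and the handling of a third or later round (appending the URL once more each time) are assumptions. The slide only specifies the first two rounds.

```python
import hashlib


def url_hashes(url: str, m: int, k: int = 4, segment_bits: int = 32) -> list[int]:
    """Derive k bit positions in [0, m) from MD5 digests of the URL.

    Assumes segment_bits is a multiple of 8 that divides 128.
    """
    seg_bytes = segment_bits // 8
    encoded = url.encode("utf-8")
    data = encoded
    values: list[int] = []
    while len(values) < k:
        digest = hashlib.md5(data).digest()  # 128 bits = 16 bytes
        # Split the digest into segments and reduce each one modulo m.
        for start in range(0, len(digest), seg_bytes):
            segment = int.from_bytes(digest[start:start + seg_bytes], "big")
            values.append(segment % m)
        # Not enough hash values yet: hash the URL concatenated with itself.
        data += encoded
    return values[:k]


print(url_hashes("http://www.squid-cache.org/", m=1000))
print(url_hashes("http://www.squid-cache.org/", m=1000, k=8))
```

With 32-bit segments one digest yields the four values of the first call. The second call needs a second MD5 round to reach eight.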

Page 15: Bloom Filters

Summary Cache - Errors

False Misses

– The requested document is cached at some remote proxy, but the summary does not reflect that fact.

– Hit ratio is reduced; a redundant Internet access is performed.

False Hits

– Document is not at a remote proxy, but the summary suggests that it is.

– An inter-proxy query message is wasted.

Remote Stale Hits

– Document is cached at a remote proxy, but is stale.

– Occurs in both ICP and Summary Cache.

– Might not be a totally wasted effort, as delta compression can be used.

Page 16: Bloom Filters

Implementation - Squid

Squid – A publicly available web proxy cache software. http://www.squid-cache.org

Summary Cache is implemented in Squid v1.1.14

A variation called cache digest is implemented in Squid 1.2b20