references - university of texas at austinstandard bloom filter a bloom filter is an array of m bits...

22
2/16/2017 1 Bloom Filters References A. Broder and M. Mitzenmacher, Network applications of Bloom A. Broder and M. Mitzenmacher, Network applications of Bloom filters: A survey,” Internet Mathematics, vol. 1 no. 4, pp. 485-509, 2004. Li Fan, Pei Cao, Jussara Almeida, Andrei Broder, “Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol,” IEEE/ACM Transactions on Networking, Vol. 8, No. 3, June 2000. O i in f ntin Bl m filt s o Origin of counting Bloom filters 2/16/2017 Bloom Filters (Simon S. Lam) 1

Upload: others

Post on 27-May-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

1

Bloom Filters

ReferencesA. Broder and M. Mitzenmacher, “Network applications of BloomA. Broder and M. Mitzenmacher, Network applications of Bloom filters: A survey,” Internet Mathematics, vol. 1 no. 4, pp. 485-509, 2004.

Li Fan, Pei Cao, Jussara Almeida, Andrei Broder, “Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol,” IEEE/ACM Transactions on Networking, Vol. 8, No. 3, June 2000.

O i in f ntin Bl m filt so Origin of counting Bloom filters

2/16/2017 Bloom Filters (Simon S. Lam) 1

Page 2: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

2

Origin and applicationsRandomized data structure introduced by

Burton Bloom [CACM 1970]o It represents a set for membership queries, with

false positiveso Probability of false positive can be controlled byo Probability of false positive can be controlled by

design parameterso When space efficiency is important, a Bloom filter

ma be used if the effect f false p sitives can bemay be used if the effect of false positives can be mitigated.

First applications in dictionaries and databases

2/16/2017 Bloom Filters (Simon S. Lam) 2

Page 3: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

3

First application in networking: distributed cache (2000)distributed cache (2000)

Proxy 2

Proxy 1Cache 1

Proxy 2Cache 2Summary 1Summary 3Cache 1

Summary 2Summary 3

y

Proxy 3Proxy 3Cache 3Summary 1Summary 2Summary 2

N li ti i t ki i 2000

2/16/2017 Bloom Filters (Simon S. Lam) 3

Numerous applications in networking since 2000

Page 4: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

4

Standard Bloom FilterA Bloom filter is an array of m bits representing

a set S = { x1, x2, … , xn} of n elements{ 1 2 n}o Array set to 0 initially

k independent hash functions h1, … , hk with range {1 2 }{1, 2, …, m}o Assume that each hash function maps each item in the

universe to a random number uniformly over the rangeuniverse to a random number uniformly over the range {1, 2, …, m}

For each element x in S, the bit hi(x) in the array i t t 1 f 1 i kis set to 1, for 1 ≤ i ≤ k, o A bit in the array may be set to 1 multiple times for

different elementsff m

2/16/2017 Bloom Filters (Simon S. Lam) 4

Page 5: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

5

A Bloom filter example(three hash functions)( )

Insert X1 and X2

Check Y1 and Y2

2/16/2017 Bloom Filters (Simon S. Lam) 5

Page 6: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

6

Standard Bloom Filter (cont.)

To check membership of y in S, check whether hi(y), 1≤i≤k, are all set to 1whether hi(y), ≤ ≤k, are all set too If not, y is definitely not in So Else, we conclude that y is in S, but sometimes this

conclusion is wrong (false positive)For many applications, false positives are

t bl l th b bilit facceptable as long as the probability of a false positive is small enough

We will assume that kn < m

2/16/2017 Bloom Filters (Simon S. Lam) 6

Page 7: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

7

False positive probability After all members of S have been hashed to a Bloom

filter, the probability that a specific bit is still 0 is

/1' (1 )kn kn mp e pm

−= − =

For a non member, it may be found to be a member of S (all of its k bits are nonzero) with false positive

m

of S (all of its k bits are nonzero) with false positive probability

(1 ') (1 )k kp p− −

2/16/2017 Bloom Filters (Simon S. Lam) 7

Page 8: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

8

False positive probability (cont.)

Define

11' (1 ') (1 (1 ) )k kn kf pm

= − = − −

/(1 ) (1 )k kn m kf p e−= − = −

Two competing forces as k increases

o Larger k > is smaller for a fixed p’(1 ')kp−o Larger k -> is smaller for a fixed p

o Larger k -> p’= is smaller -> 1-p’ larger(1 1 / )knm−

(1 )p−

2/16/2017 Bloom Filters (Simon S. Lam) 8

Page 9: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

9

False positive rate vs. kmNumber of bits per member 8mn

=

Number of

2/16/2017 9Bloom Filters (Simon S. Lam)

Page 10: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

10

Optimal number k from derivativeRewrite asf

/ /

Rewrite as exp(ln(1 ) ) exp( ln(1 ))kn m k kn m

ff e k e− −= − = −

/Let ln(1 )Minimizing will minimize exp( )

kn mg k eg f g

−= −=g p( )g f g

//

/(1 )ln(1 )

1

kn mkn m

kn mg k eek k

−−

∂ ∂ −= − +∂ ∂

/ //ln(1 ) ln(2) ln(2) 0

1kn m kn m

kn mk ne ee m

− −−= − + = − + =

/1 kn mk e k∂ − ∂

if we plug ( / ) ln 2 which is optimal( i i f l b l i )

k m n=1 e m−

(It is in fact a global optimum)2/16/2017 Bloom Filters (Simon S. Lam) 10

Page 11: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

11

Optimal k from symmetryAlternatively, from we get/kn mp e−=

ln( )mk pln( )

From previous slide, we have

k pn

= −

/

From previous slide, we have

ln(1 ) ln( ) ln(1 )kn m mg k e p p−= − = − −

From above, symmetry indicates that the minimum value for g occurs when p=1/2.

n

g pThus

ln(1 / 2) ln(2)optm mkn n

= − =

2/16/2017 Bloom Filters (Simon S. Lam) 11

n n

Page 12: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

12

Optimal k from symmetryusing the precise probability of false positiveusing the precise probability of false positive

' (1 ') exp( ln(1 '))kf p k p= − = −

From ' (1 1 / ) , solving for knp m k= −

( ) p( ( ))f p p

( ) , g1= ln( ')

l (1 1 / )

p

k pln(1 1 / )n m−

(in equation for ' above)Let ' ln(1 ') fg k p= − ( q )( )1 ln( ') ln(1 ')

ln(1 1 / )

fg p

p pn m

= −

2/16/2017 Bloom Filters (Simon S. Lam) 12

ln(1 1 / )n m−

Page 13: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

13

Using the precise probability of false positive to get optimal k (cont.)p g p ( )

From previous slide11' ln( ') ln(1 ')

ln(1 1 / )g p p

n m= −

By symmetry, g’ (also f’) minimized at p’=1/2

Optimal k is 1 1' ln( ') ln(1 / 2)

ln(1 1 / ) ln(1 1 / )optk pn m n m

= =− −

2/16/2017 Bloom Filters (Simon S. Lam) 13

Page 14: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

14

Optimal number of hash functions Using the false positive rate is

ln(2) ln(2) /m m

ln(2)optmkn

=

In practice, k should be an integer. May choose an integer l ll h k d h hi h d

( ) ( ) / (1 ) (0.5) (0.6185) , where ln(2) 0.6931 m nn np− = =

value smaller than kopt to reduce hashing overhead

m/n denotes bits per entry False positive ratebits per entry

2/16/2017 Bloom Filters (Simon S. Lam) 14

Page 15: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

15

False positive rate vs. bits per entry

4 hash functions

False positive raterate

Using optimal number of hash functions

2/16/2017 Bloom Filters (Simon S. Lam) 15

m/n

Page 16: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

16

Standard Bloom Filter tricksTwo Bloom filters representing sets S1 and

S2 with the same number of bits and using gthe same hash functions.o A Bloom filter that represents the union of S1 and

S2 can be obtained by taking the OR of the bitS2 can be obtained by taking the OR of the bit vectors

A Bloom filter can be halved in size. Suppose h i i f 2the size is a power of 2.o Just OR the first and second halves of the bit

vectorvectoro When hashing to do a lookup, the highest order bit

is masked

2/16/2017 Bloom Filters (Simon S. Lam) 16

Notation: OR denotes bitwise or

Page 17: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

17

Counting Bloom filters

Proposed by Fan et al. [2000] for distributed cachingcach ng

Every entry in a counting Bloom filter is a small counter (rather than a single bit).( g )o When an item is inserted into the set, the

corresponding counters are each incremented by 1h d l d f h ho When an item is deleted from the set, the

corresponding counters are each decremented by 1To avoid counter overflow its size must beTo avoid counter overflow, its size must be

sufficiently large. It was found that 4 bits per counter are enough.u ug .

2/16/2017 Bloom Filters (Simon S. Lam) 17

Page 18: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

18

Counter overflow probabilityConsider a set of n elements, k hashConsider a set of n elements, k hash

functions, and m counterso C(i) is the count for the ith counter

1 1[ ( ) ] 1j nk jnk

P c i jj

− = = − j m m

1[ ( ) ]nk 1[ ( ) ] jP c i jj m

≥ ≤

(a very loose upper bound)j

enkjm

2/16/2017 Bloom Filters (Simon S. Lam) 18

Page 19: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

19

Counter overflow probability (cont.)Choose k such that k ≤ m/n (ln 2)

Then ln 2j j

enk e ln 2[ ( ) ] enk eP c i jjm j

≥ ≤ ≤

j

1

ln 2[max ( ) ]j

i m

eP c i j mj≤ ≤

≥ ≤

for some i

Using 4 bits, each counter counts from 0 to 15

15

1[max ( ) 16] 1.37 10

i mP c i m −

≤ ≤≥ ≤ × ×

2/16/2017 Bloom Filters (Simon S. Lam) 19

Page 20: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

20

Counter overflow consequences

When a counter does overflow, it may be left at its maximum value. at ts max mum value.

This can later cause a false negative only if eventually the counter goes down to 0 when it y gshould have remain at nonzero.

The expected time to this event is very large p y gbut is something we need to keep in mind for any application that does not allow false

tinegatives

2/16/2017 Bloom Filters (Simon S. Lam) 20

Page 21: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

21

Conclusions

Wherever a list or set is used, and space is at a premium, a Bloom filter may be used if the a prem um, a Bloom f lter may be used f theeffect of false positives can be mitigatedo No false negative

With a counting Bloom filter, false negatives are possible, albeit highly unlikely

2/16/2017 Bloom Filters (Simon S. Lam) 21

Page 22: References - University of Texas at AustinStandard Bloom Filter A Bloom filter is an array of m bits representing a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

2/16/2017

22

The End

2/16/2017 Bloom Filters (Simon S. Lam) 22