bloom filters
1Bloom Filters
Bloom Filters
Lookup questions: Does item “x” exist in a set or multiset?
Data set may be very big or expensive to access. Filter lookup questions with negative results before accessing data.
Allow false positive errors, as they only cost us an extra data access.
Don’t allow false negative errors, because they result in wrong answers.
Bloom Filter [B70]
Encoding an attribute a ∈ U:
Maintain a bit vector V of size m.
Use k hash functions (h1..hk), hi: U → [1..m].
Encoding: For item x, “turn on” bits V[h1(x)]..V[hk(x)].
Lookup: For item x, check bits V[h1(x)]..V[hk(x)]. If all equal 1, return “Probably Yes”. Else “Definitely No”.
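The encoding and lookup steps can be sketched in a few lines of Python. This is a minimal illustration, not the construction from [B70]: the class name `BloomFilter`, the method names, and the use of salted SHA-256 to stand in for the k hash functions are all assumptions made for the example, and positions are 0-based.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Bit vector V of m bits, probed by k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        # One byte per bit: wasteful, but keeps the sketch easy to read.
        self.bits = bytearray(m)

    def _positions(self, item: str) -> Iterator[int]:
        # Assumption: h_i(x) is SHA-256 of the item salted with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Encoding: turn on bits V[h1(x)]..V[hk(x)].
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # True means "probably yes"; False means "definitely no".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1024, k=3)
for word in ("alpha", "beta", "gamma"):
    bf.add(word)
```

Every inserted item is reported as present, so there are no false negatives. An item that was never inserted is rejected unless all k of its bits happen to have been set by other items.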
Bloom Filter
[Figure: item x is hashed by h1(x), h2(x), h3(x), ..., hk(x) to positions in the bit vector V (V0 .. Vm-1, shown as 01000 10100 00010).]
Bloom Errors
[Figure: bit vector V (V0 .. Vm-1, shown as 01000 10100 00010) with items a, b, c, d; h1(x), h2(x), h3(x), ..., hk(x) point at bits of V.]
x didn’t appear, yet its bits are already set.
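Error Estimation
Assumption: Hash functions are perfectly random.
Probability of a bit being 0 after hashing all elements:
$$\left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}$$
Let $p = e^{-kn/m}$; the probability of a false positive is:
$$f = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k} = (1 - p)^{k}$$
Assuming we are given m and n, the optimal k is:
$$f = \exp\left(k \ln\left(1 - e^{-kn/m}\right)\right), \qquad g = k \ln\left(1 - e^{-kn/m}\right)$$
$$\frac{dg}{dk} = \ln\left(1 - e^{-kn/m}\right) + \frac{kn}{m} \cdot \frac{e^{-kn/m}}{1 - e^{-kn/m}}$$
$$\frac{dg}{dk} = 0 \;\Rightarrow\; k_{\min} = (\ln 2)\,\frac{m}{n}$$
$$f(k_{\min}) = (1/2)^{k_{\min}} \approx (0.6185)^{m/n}$$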
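As a quick numeric check of these formulas, the sketch below evaluates the approximate false-positive rate and the optimal k. The function names and the example sizes (m = 8000 bits, n = 1000 items, so 8 bits per item) are assumptions chosen for illustration.

```python
import math


def false_positive_rate(m: int, n: int, k: float) -> float:
    """Approximate false-positive probability f = (1 - e^{-kn/m})^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(m: int, n: int) -> float:
    """Real-valued minimizer k_min = (ln 2) * m / n; round it in practice."""
    return math.log(2) * m / n


m, n = 8000, 1000                          # assumed sizes: 8 bits per item
k_min = optimal_k(m, n)                    # roughly 5.5
f_min = false_positive_rate(m, n, k_min)   # roughly 0.02
f_k5 = false_positive_rate(m, n, 5)        # nearest integers on either side
f_k6 = false_positive_rate(m, n, 6)
```

Since k must be an integer, one would normally evaluate the rate at the two integers around k_min and pick the smaller; both are only slightly worse than the real-valued optimum.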
Bloom Filter Tradeoffs
Three factors: m, k and n.
Normally, n and m are given, and we select k.
Small k:
– Less computation.
– The actual number of bits accessed (nk) is smaller, so the chance of a “step over” is smaller too.
– However, fewer bits need to be stepped over to generate an error.
For big k, the exact opposite holds.
Not surprisingly, when k is optimal, the “hit ratio” (the fraction of bits set in the array) is exactly 0.5.
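The claim about the fraction of set bits can be checked with a small simulation. This sketch assumes perfectly random hashing, modeled with Python's `random` module rather than real hash functions; the function name `simulate_fill`, the fixed seed, and the sizes m = 10,000 and n = 1,000 are illustrative choices.

```python
import math
import random


def simulate_fill(m: int, n: int, k: int, seed: int = 0) -> float:
    """Insert n items, each setting k uniformly random bits; return the fraction of bits set."""
    rng = random.Random(seed)
    bits = bytearray(m)
    for _ in range(n * k):
        bits[rng.randrange(m)] = 1
    return sum(bits) / m


m, n = 10_000, 1_000
k_opt = round(math.log(2) * m / n)   # nearest integer to (ln 2) * m / n
# Compare a small k, the near-optimal k, and a large k.
fills = {k: simulate_fill(m, n, k) for k in (2, k_opt, 14)}
```

With these sizes the near-optimal k leaves about half of the bits set, while the small k leaves the array mostly empty and the large k fills most of it.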
Summary Cache [FCAB00]
Proxy servers maintain local cache to minimize expensive internet requests.
Proxy must maintain an efficient lookup method into the cache.
The lookup structure must be stored in DRAM for performance.
Structure must be compact, as DRAM is expensive and is used for “Hot Items” storage and more.
Pages are usually replaced in the cache using an LRU algorithm.
ICP – Request Handling
[Figure: a client, four proxies each with its own cache, and the Internet.]
Internet Cache Protocol (ICP)
Allows for scaling out when using proxies.
A protocol that supports discovery and retrieval of documents from neighboring caches.
Establishes a hierarchy of proxy caches.
If a page is not found in the local proxy cache, the proxy searches for the page in neighboring proxies.
If the page is not found anywhere, the proxy fetches it from the Internet.
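The request flow described here can be sketched as a three-step lookup. This is a simplified in-memory model, not the ICP wire protocol: the `Proxy` class, its methods, and the `origin` callback that stands in for the Internet are all hypothetical names, and a neighbor hit returns the page directly instead of a separate query and retrieval.

```python
from typing import Callable, Dict, List, Optional


class Proxy:
    def __init__(self, name: str) -> None:
        self.name = name
        self.cache: Dict[str, str] = {}
        self.neighbors: List["Proxy"] = []
        self.queries_received = 0

    def icp_query(self, url: str) -> Optional[str]:
        """Answer a neighbor's query from the local cache only."""
        self.queries_received += 1
        return self.cache.get(url)

    def fetch(self, url: str, origin: Callable[[str], str]) -> str:
        if url in self.cache:            # 1. local cache hit
            return self.cache[url]
        for peer in self.neighbors:      # 2. ask every neighbor in turn
            page = peer.icp_query(url)
            if page is not None:
                self.cache[url] = page
                return page
        page = origin(url)               # 3. nobody has it: go to the Internet
        self.cache[url] = page
        return page


a, b, c = Proxy("a"), Proxy("b"), Proxy("c")
a.neighbors = [b, c]
c.cache["http://example.org/x"] = "page-x"
origin_calls: List[str] = []


def origin(url: str) -> str:
    origin_calls.append(url)
    return "from-origin:" + url


first = a.fetch("http://example.org/x", origin)    # served by neighbor c
second = a.fetch("http://example.org/y", origin)   # falls through to the origin
```

Note that every miss in the local cache costs one query per neighbor, whether or not any neighbor holds the page.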
Summary Cache
Each proxy maintains a Bloom Filter representing its local cache.
Also, it holds Bloom Filters representing caches of other proxies.
Updates to Bloom Filters are exchanged periodically, or after a certain percentage of the documents in the cache has been replaced.
An ICP request is sent only to a proxy whose Bloom Filter indicates that it probably holds the requested document.
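A minimal sketch of this routing decision follows. The `Summary` class is a small Bloom filter written for the example, with salted SHA-256 standing in for the real hash functions; `peers_to_query`, the proxy names, and the sizes m = 4096 and k = 4 are likewise assumptions, not part of [FCAB00].

```python
import hashlib
from typing import Dict, List


class Summary:
    """Small Bloom filter used as a summary of one proxy's cache."""

    def __init__(self, m: int = 4096, k: int = 4) -> None:
        self.m = m
        self.k = k
        self.bits = 0  # the whole bit vector is kept in one Python int

    def _positions(self, url: str) -> List[int]:
        # Assumption: salted SHA-256 stands in for the k hash functions.
        return [
            int.from_bytes(hashlib.sha256(f"{i}|{url}".encode("utf-8")).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits |= 1 << pos

    def might_contain(self, url: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(url))


def peers_to_query(url: str, peer_summaries: Dict[str, Summary]) -> List[str]:
    """Return only the peers whose summary answers 'probably yes' for this URL."""
    return [name for name, summary in peer_summaries.items() if summary.might_contain(url)]


summaries = {"proxy_b": Summary(), "proxy_c": Summary()}
summaries["proxy_c"].add("http://example.org/x")
targets = peers_to_query("http://example.org/x", summaries)
```

Here only `proxy_c` is contacted. A peer with an empty summary is never queried, and a false positive in a summary costs one wasted query rather than a wrong answer.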
ICP – With Summary Cache
[Figure: a client, four proxies each with its own cache, and the Internet.]
Summary Cache – Bloom Filters
To support deletions and updates, the proxy maintains the Bloom Filter and also an array of counters C, initially set to 0.
The Bloom Filter is filled with the contents of the cache.
Each bit in the BF is allotted a 4-bit counter.
On insert of item i, all C[hj(i)] are increased (to a maximum of 15).
On deletion of item i, the counters C[hj(i)] are decreased.
When a counter C[p] increases from 0 to 1, V[p] is turned on.
When a counter C[p] decreases from 1 to 0, V[p] is turned off.
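The counter scheme can be sketched as follows. The class name `CountingBloomFilter`, the salted SHA-256 hashing, and the example sizes are assumptions for illustration. Counters are stored as plain Python integers capped at 15 rather than packed into 4 bits, and a counter that has reached 15 is simply decremented on deletion like any other, which is one possible reading of the rules above.

```python
import hashlib
from typing import List

MAX_COUNT = 15  # largest value a 4-bit counter can hold


class CountingBloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m    # C: one counter per bit position
        self.bits = bytearray(m)   # V: the bit vector that lookups use

    def _positions(self, item: str) -> List[int]:
        # Assumption: salted SHA-256 stands in for the k hash functions.
        return [
            int.from_bytes(hashlib.sha256(f"{j}:{item}".encode("utf-8")).digest()[:8], "big") % self.m
            for j in range(self.k)
        ]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            if self.counters[pos] < MAX_COUNT:   # saturate at 15
                self.counters[pos] += 1
                if self.counters[pos] == 1:      # 0 -> 1 turns the bit on
                    self.bits[pos] = 1

    def remove(self, item: str) -> None:
        for pos in self._positions(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1
                if self.counters[pos] == 0:      # 1 -> 0 turns the bit off
                    self.bits[pos] = 0

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))


cbf = CountingBloomFilter(m=256, k=3)
cbf.add("http://example.org/a")
cbf.add("http://example.org/b")
cbf.remove("http://example.org/a")   # b is still reported as present
```

Removing an item only clears a bit when no other cached item still maps to it, which is what a plain bit vector cannot express.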
Summary Cache – Bloom Filters
Hashing scheme
– Generate 128 bits using MD5 on the URL.
– Divide them into segments of M bits (usually 32).
– Calculate the modulus of each segment by m, providing 128/M hash values (4 for 32-bit segments).
– If 128 bits are not enough, calculate the MD5 of the URL concatenated with itself.
Bloom Filter Exchange
– Header contains MD5 properties and the size of the array.
– If the refresh rate is high, send only deltas.
– Otherwise, send the entire Bloom Filter.
– Bit counts are internal and not exchanged.
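The hashing scheme can be sketched with Python's `hashlib`. The function name `md5_hash_values`, the big-endian interpretation of each segment, and the choice to keep appending the URL when more than two digests are needed are assumptions; the slides do not specify those details.

```python
import hashlib
from typing import List


def md5_hash_values(url: str, m: int, k: int, segment_bits: int = 32) -> List[int]:
    """Derive k bit positions in [0, m) from MD5 digests of the URL."""
    seg_bytes = segment_bits // 8
    values: List[int] = []
    text = url
    while len(values) < k:
        digest = hashlib.md5(text.encode("utf-8")).digest()   # 128 bits
        # Cut the digest into M-bit segments and reduce each one modulo m.
        for start in range(0, len(digest) - seg_bytes + 1, seg_bytes):
            segment = int.from_bytes(digest[start:start + seg_bytes], "big")
            values.append(segment % m)
            if len(values) == k:
                break
        # If more values are still needed, the next round hashes the URL
        # concatenated with itself (assumed to repeat further as required).
        text += url
    return values


positions = md5_hash_values("http://www.squid-cache.org/", m=1000, k=6)
```

With 32-bit segments one digest yields four values, so asking for six values takes the last two from the digest of the doubled URL.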
Summary Cache - Errors
False Misses
– The requested document is cached at some remote proxy, but the summary does not reflect that fact.
– The hit ratio is reduced, and a redundant Internet access is performed.
False Hits
– The document is not at a remote proxy, but the summary suggests that it is.
– An inter-proxy query message is wasted.
Remote Stale Hits
– The document is cached at a remote proxy, but is stale.
– Occurs in both ICP and Summary Cache.
– Might not be a totally wasted effort, as delta compression can be used.
Implementation - Squid
Squid – A publicly available web proxy cache software. http://www.squid-cache.org
Summary Cache is implemented in Squid v1.1.14
A variation called cache digest is implemented in Squid 1.2b20