Data Structures for Big Data:
Bloom Filter
Vinicius Vielmo Cogo
Smalltalks, DI, FC/UL. October 16, 2014.
Big Data
is relative
is not defined by a specific number of TB, PB, or EB
is when the data becomes big for you
is when your solutions become inefficient or impractical
Data Structures for Big Data
Traditional DSs are subject to the same problems
e.g., lists, trees
(e.g., YARN, NoSQL)
or
(e.g., index, metadata)
We have reached the point of thinking about new DSs for BD
Outline
Bloom Filter
Use Cases
Implementations
Other Filters
Other Data Structures for Big Data
Bloom Filter
Membership testing
Does my collection contain this element?
Bloom Filter
City: Coimbra, Leiria
Bloom Filter
Index i: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
bf[i]:   0 0 0 0 0 0 0 0 0 0 0  0  0  0  0
http://billmill.org/bloomfilter-tutorial/
Bloom Filter
City: Coimbra, Leiria
Hash Function: Fnv, Murmur
Index i: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
bf[i]:   0 0 0 0 0 0 0 0 0 0 0  0  0  0  0
Bloom Filter
City: Coimbra, Leiria
Hash Function: Fnv, Murmur
Inserting Coimbra: i=4, i=7
Index i: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
bf[i]:   0 0 0 0 0 0 0 0 0 0 0  0  0  0  0
Bloom Filter
City: Coimbra, Leiria
Hash Function: Fnv, Murmur
Inserting Coimbra: i=4, i=7
Index i: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
bf[i]:   0 0 0 0 1 0 0 1 0 0 0  0  0  0  0
Bloom Filter
City: Coimbra, Leiria
Hash Function: Fnv, Murmur
Index i: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
bf[i]:   0 0 0 0 1 0 0 1 0 0 0  0  0  0  0
Bloom Filter
City: Coimbra, Leiria
Hash Function: Fnv, Murmur
Inserting Leiria: i=2, i=9
Index i: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
bf[i]:   0 0 0 0 1 0 0 1 0 0 0  0  0  0  0
Bloom Filter
City: Coimbra, Leiria
Hash Function: Fnv, Murmur
Inserting Leiria: i=2, i=9
Index i: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
bf[i]:   0 0 1 0 1 0 0 1 0 1 0  0  0  0  0
Bloom Filter
City: Coimbra, Leiria
Hash Function: Fnv, Murmur
Index i: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
bf[i]:   0 0 1 0 1 0 0 1 0 1 0  0  0  0  0
Bloom Filter
City: Braga, Guarda, Coimbra, Lisboa
Bloom Filter
City: Braga, Guarda, Coimbra, Lisboa
Hash Function: Fnv, Murmur
Querying Braga: i=10, i=14
Index i: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
bf[i]:   0 0 1 0 1 0 0 1 0 1 0  0  0  0  0
Result: false
Bloom Filter
City: Braga, Guarda, Coimbra, Lisboa
Hash Function: Fnv, Murmur
Querying Guarda: i=2, i=12
Index i: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
bf[i]:   0 0 1 0 1 0 0 1 0 1 0  0  0  0  0
Result: false
Bloom Filter
City: Braga, Guarda, Coimbra, Lisboa
Hash Function: Fnv, Murmur
Querying Coimbra: i=4, i=7
Index i: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
bf[i]:   0 0 1 0 1 0 0 1 0 1 0  0  0  0  0
Result: true
Bloom Filter
City: Braga, Guarda, Coimbra, Lisboa
Hash Function: Fnv, Murmur
Querying Lisboa: i=7, i=9
Index i: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
bf[i]:   0 0 1 0 1 0 0 1 0 1 0  0  0  0  0
Result: true (but it is a false positive)
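The insert and query steps from the previous slides can be replayed in a few lines of Python. This is only a sketch: the `SLIDE_INDEXES` table is a hypothetical stand-in for the Fnv and Murmur hash functions, hard-coding the bit positions shown on the slides, and the function names are illustrative.

```python
# Bit positions copied from the slides; a real filter would compute them
# with hash functions such as Fnv and Murmur.
SLIDE_INDEXES = {
    "Coimbra": (4, 7),
    "Leiria": (2, 9),
    "Braga": (10, 14),
    "Guarda": (2, 12),
    "Lisboa": (7, 9),
}

bf = [0] * 15  # the 15-position bit array from the slides


def add(city):
    """Set every bit position that the city maps to."""
    for i in SLIDE_INDEXES[city]:
        bf[i] = 1


def might_contain(city):
    """Answer True only if all of the city's positions are set."""
    return all(bf[i] == 1 for i in SLIDE_INDEXES[city])


add("Coimbra")
add("Leiria")

for city in ("Braga", "Guarda", "Coimbra", "Lisboa"):
    print(city, might_contain(city))
```

Lisboa was never inserted, yet its two positions (7 and 9) were set by Coimbra and Leiria, which is why the filter answers true for it.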
Bloom Filter
DS proposed by Burton Howard Bloom in 1970
Design principles
  Space-efficient
    Smaller than the original dataset
  Time-efficient
    Low latency R/W
    O(k) per operation (k hash functions), which is much smaller than O(n)
    High throughput
  Probabilistic
    E.g., myCollection.mightContain(myObject)
    False positives happen (but at a configurable rate)
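A minimal sketch of these principles in Python, under stated assumptions: the class and method names are illustrative (not taken from any library), the bitmap uses one byte per position for readability rather than compactness, and the k positions are derived from a single SHA-256 digest by double hashing, which stands in for Fnv and Murmur.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter with m positions and k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per position, for readability

    def _indexes(self, item):
        # Derive k positions from two 64-bit halves of one digest
        # (double hashing); an assumption, not the slides' hash functions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + j * h2) % self.m for j in range(self.k)]

    def add(self, item):
        # O(k): one write per hash function.
        for i in self._indexes(item):
            self.bits[i] = 1

    def might_contain(self, item):
        # O(k): one read per hash function; never a false negative.
        return all(self.bits[i] for i in self._indexes(item))


cities = BloomFilter(m=1024, k=3)
cities.add("Coimbra")
cities.add("Leiria")
print(cities.might_contain("Coimbra"))  # True
print(cities.might_contain("Braga"))    # False, unless it is a false positive
```

Both operations touch k positions regardless of how many elements were inserted, which is where the O(k) cost comes from.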
Bloom Filter
Important variables
n = Expected collection size
m = Bitmap size
p = False positive rate (e.g., 0.0001% or 1 in 1M)
k = Optimal number of hash functions
City: Coimbra, Leiria
Hash Function: Fnv, Murmur
Index i: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
bf[i]:   0 0 0 0 0 0 0 0 0 0 0  0  0  0  0
Important variables
Bloom Filter
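The relations between these variables are usually written as below. This is the standard textbook form, assuming k independent and uniformly distributed hash functions; the symbol names are assumptions here: n is the expected collection size, m the bitmap size, p the false positive rate, and k the number of hash functions.

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k}, \qquad
m = -\frac{n \ln p}{(\ln 2)^{2}}, \qquad
k = \frac{m}{n} \ln 2
```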
Bloom Filter
Users define two of them (normally n and one other)
The other two are calculated with those equations
Interesting relations:
Bigger collection (n) → Larger bitmap (m)
Bigger collection (n) → More false positives (p)
Larger bitmap (m) → Fewer false positives (p)
Larger bitmap (m) → Fewer hash functions (k)
Fewer hash functions (k)
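As a worked example of defining two variables and computing the other two, the sketch below takes n and a target p and derives m and k with the standard formulas m = -n ln p / (ln 2)^2 and k = (m/n) ln 2. The function names are illustrative, and the symbols m and p are the conventional names for bitmap size and false positive rate.

```python
import math


def optimal_size(n, p):
    """Return (m, k) for n expected elements and target false positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k


def false_positive_rate(n, m, k):
    """Approximate false positive rate after inserting n elements."""
    return (1 - math.exp(-k * n / m)) ** k


n = 1_000_000
m, k = optimal_size(n, p=0.01)
print(m, k)
print(false_positive_rate(n, m, k))          # close to the 1% target
print(false_positive_rate(2 * n, m, k))      # bigger collection, more false positives
print(false_positive_rate(n, 2 * m, k))      # larger bitmap, fewer false positives
```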
Bloom filter size vs. False positive rate
Bloom Filter
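The trade-off between size and false positive rate can be tabulated with a short sketch. It assumes an optimally configured filter, for which the bits needed per element are -ln(p) / (ln 2)^2; the function name is illustrative.

```python
import math


def bits_per_element(p):
    """Bits per stored element for an optimally sized filter with rate p."""
    return -math.log(p) / (math.log(2) ** 2)


for p in (0.1, 0.01, 0.001, 0.0001):
    print(f"p = {p:<6} -> {bits_per_element(p):.1f} bits per element")
```

Each tenfold reduction of the false positive rate costs the same fixed number of extra bits per element, so the size grows with the logarithm of 1/p.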
Use Cases
Reducing unnecessary disk reads
Client → BloomFilter (RAM) → Dataset (Hard Disk)
1? → F → No (no disk read)
2? → T → necessary read(2) → 2
3? → T (false positive) → unnecessary read(3) → No
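The flow in this diagram can be sketched as a store that consults an in-memory filter before touching slow storage. Everything here is an assumption for illustration: `GuardedStore` is a hypothetical class, a dict stands in for the hard disk, and positions come from a SHA-256 digest rather than from Fnv or Murmur.

```python
import hashlib


class GuardedStore:
    """Dataset on slow storage, fronted by an in-memory Bloom filter."""

    def __init__(self, records, m=64, k=2):
        self.disk = dict(records)  # stand-in for the hard disk
        self.disk_reads = 0
        self.m, self.k = m, k
        self.bits = [0] * m
        for key in self.disk:
            for i in self._indexes(key):
                self.bits[i] = 1

    def _indexes(self, key):
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + j * h2) % self.m for j in range(self.k)]

    def get(self, key):
        # A negative answer is always correct, so the disk read is skipped.
        if not all(self.bits[i] for i in self._indexes(key)):
            return None
        # A positive answer may be a false positive, so the disk decides.
        self.disk_reads += 1
        return self.disk.get(key)


store = GuardedStore({2: "record 2"})
print(store.get(2))      # record 2 (a necessary read)
print(store.get(1))      # None
print(store.disk_reads)  # 1 if key 1 was filtered out, 2 if it was a false positive
```

A false positive costs one wasted read but never a wrong answer, because the dataset itself has the final say.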
Use Cases
Google BigTable, Apache Cassandra and HBase
  Reducing disk lookups
Google Chrome
  Look up a list of known malicious URLs
Bitcoin
  Get only the transactions relevant to your wallet
Others
In my Ph.D. work
  Look up a list of known privacy-sensitive DNA sequences
Implementations
guava-libraries https://code.google.com/p/guava-libraries/
Orestes-Bloomfilter https://github.com/Baqend/Orestes-Bloomfilter
java-bloomfilter https://github.com/magnuss/java-bloomfilter
java-longfastbloomfilter https://code.google.com/p/java-longfastbloomfilter/
Other Filters
Counting Bloom filters: allow deletions (use a 4-bit counter instead of 1 bit)
Buffered Bloom filters: sub-filters in SSD with buffered R/W, exploiting bit locality
Quotient and Cascade filters: use an SSD, instead of the main memory, for scalability
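The counting variant can be sketched as follows. This is an illustrative design rather than a reference implementation: the class name is hypothetical, counters saturate at 15 (the largest 4-bit value) and are then left untouched, and positions come from a SHA-256 digest by double hashing.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter variant that keeps a small counter per position."""

    def __init__(self, m, k, max_count=15):
        self.m, self.k = m, k
        self.max_count = max_count  # 15 is the largest value of a 4-bit counter
        self.counters = [0] * m

    def _indexes(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + j * h2) % self.m for j in range(self.k)]

    def add(self, item):
        for i in self._indexes(item):
            if self.counters[i] < self.max_count:  # saturate, never overflow
                self.counters[i] += 1

    def might_contain(self, item):
        return all(self.counters[i] > 0 for i in self._indexes(item))

    def remove(self, item):
        # Only items the filter reports as present are removed.
        if not self.might_contain(item):
            return False
        for i in self._indexes(item):
            if 0 < self.counters[i] < self.max_count:  # saturated counters stay fixed
                self.counters[i] -= 1
        return True


cbf = CountingBloomFilter(m=256, k=3)
cbf.add("Coimbra")
print(cbf.might_contain("Coimbra"))  # True
cbf.remove("Coimbra")
print(cbf.might_contain("Coimbra"))  # False
```

Removing an element that was never added but happens to be a false positive would corrupt other elements' counters, so callers are expected to delete only what they inserted.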
Other DSs (and techniques) for Big Data
Locality-sensitive hashing (LSH): hashing similar elements into the same bucket with high probability
HyperLogLog for computing cardinality: counting the number of distinct elements in a collection
Log-Structured Merge (LSM) trees: indexed access to files with high insert volume and background batch synchronization
Thank you!