Hash - A probabilistic approach for big data
TRANSCRIPT
Luca Mastrostefano
Who am I?
● Product manager of MyMemory at Translated
● IT background
● Algorithms lover
Luca Mastrostefano
Syllabus
Problem                            | Use case
Fast and exact search              | Databases - Search
Stream filter                      | Translated - MyMemory
Counting unique items in a stream  | ClickMeter - IPs analysis
Probabilistic search               | Memopal - Search for similar files
Search algorithms
Databases - Fast and exact search
Static, extendible and linear hash indexes
Use case
Sometimes even a logarithmic complexity is too expensive.
B+ tree index
Images from Data Management - Maurizio Lenzerini
Select/Insert ≅ log_F(# items)
Search - Hash index
Static hash index
Images from Data Management - Maurizio Lenzerini
Select/Insert ≅ 2 + (# overflow pages)
Directories
Dynamic hash index - Extendible
Images from Data Management - Maurizio Lenzerini
Select/Insert ≅ 2 + (# overflow pages)
# overflow pages almost constant
Dynamic hash index - Linear
Intuition:
● Avoid the directories to save one memory access.
● Split one bucket at a time: it fits real-time environments!
Select/Insert ≅ 1 + (# overflow pages)
# overflow pages almost constant
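The addressing scheme behind the intuition above can be sketched as follows (a minimal illustration, not the slides' code; `LinearHashAddresser` and its parameters are hypothetical names). A linear hash table keeps a `level` and a `split` pointer instead of a directory, and grows one bucket at a time:

```python
class LinearHashAddresser:
    """Sketch of linear-hashing bucket addressing: no directory,
    one bucket split at a time."""

    def __init__(self, initial_buckets=4):
        self.n0 = initial_buckets  # buckets at level 0
        self.level = 0             # current doubling round
        self.split = 0             # next bucket to be split in this round

    def bucket_for(self, key):
        h = hash(key)
        b = h % (self.n0 * 2 ** self.level)
        if b < self.split:
            # this bucket was already split in the current round,
            # so the next-level hash function applies
            b = h % (self.n0 * 2 ** (self.level + 1))
        return b

    def split_one(self):
        # split exactly one bucket; when the round completes,
        # the address space has doubled and a new round starts
        self.split += 1
        if self.split == self.n0 * 2 ** self.level:
            self.level += 1
            self.split = 0
```

Because only one bucket is touched per split, inserts never pay for a full rehash, which is what makes the structure fit real-time workloads.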
Indexes comparison - Secondary memory accesses
B+ tree index: Select/Insert ≊ log_F(# items) → 4 accesses ≊ 30 ms
Linear hash index: Select/Insert ≊ const → 1 access ≊ 7 ms
4x speedup in case of billions of entries
Stream filter: x ∈ U ?
Translated - MyMemory
Bloom filter
Use case
The latency of secondary memory does not fit an environment where milliseconds matter.
Stream filter - Naïve approach
60+ GB
Hash index (1.5B items)
Network delay
5% of items ∈ Dataset
…
Stream filter - Bloom filter
Bloom filter - Insert
0 0 0 0 0 0 0 0 0 0 0 0 0 0
n1
...
nn
n items to insert
h1 h2 h3 k hash functions
Bit array of length m
Bloom filter - Insert
0 1 0 0 0 0 0 0 1 0 0 0 1 0
h1 h... hk
n1
Bloom filter - Insert
0 1 1 0 0 1 0 0 1 0 0 1 1 0
h1 h... hk
nn
Bloom filter - Search
0 1 1 0 0 1 0 0 1 0 0 1 1 0
n
a
b
...
h1 h... hk
Items to search for
Same hash
functions
Fixed bit array
Bloom filter - Search [No false negatives]
0 1 1 0 0 1 0 0 1 0 0 1 1 0
h1 h... hk
“a” DOES NOT belong to the set
a
n
b
...
Bloom filter - Search [True positive]
0 1 1 0 0 1 0 0 1 0 0 1 1 0
h1 h... hk
“n” MAY belong to the set
n
b
...
Bloom filter - Search [Possible false positive]
0 1 1 0 0 1 0 0 1 0 0 1 1 0
h1 h... hk
b
...
“b” MAY belong to the set
Bloom filter - Analysis
n items to insert
k hash
functions
m bits
0 1 1 0 0 1 0 0 1 0 0 1 1 0
z
...
h1 h2 h3
b
...
h1 h... hk
The probability of a false positive is:
P ≅ (1 - e^(-kn/m))^k
Bloom filter - Implementation
n items to insert
k hash
functions
m bits
● Optimal number of hash functions: k = (m/n) · ln 2
● Optimal number of bits m for the desired probability p of false positives: m = -n · ln(p) / (ln 2)²
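The two formulas above can be turned into a minimal Bloom filter sketch (illustrative only, not MyMemory's implementation; the class name and the double-hashing trick for deriving the k hash functions are choices of this example):

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter: m bits, k hash functions via double hashing."""

    def __init__(self, n_items, fp_rate):
        # optimal sizing: m = -n*ln(p)/(ln 2)^2, k = (m/n)*ln 2
        self.m = max(1, int(-n_items * math.log(fp_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # derive k bit positions from one SHA-256 digest (double hashing)
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))
```

With n = 1.5 billion items and p = 1%, the sizing formula gives m ≈ 14 billion bits (under 2 GB) and k ≈ 7, matching the numbers on the results slide.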
Bloom filter - Results
Naïve approach: 60+ GB
Bloom filter: 2 GB (14B bits), 7 hash functions, 1% false positives
Bloom filter - Results [MyMemory]
2 GB bloom filter → 60+ GB hash index (1.5B items)
only ~5% of connections reach the index
Counting unique items in a stream
ClickMeter - Number of unique IPs per link
Flajolet-Martin for unique hash counting
Use case
Counting unique elements can be very costly in terms of memory.
Counting unique items - Naïve approach
One bit per possible IP, from 0.0.0.0 to 255.255.255.255:
... 1 1 0 0 1 0 0 1 0 0 1 1 ...
500 MB per link (4B-bit array)
5 PB with 10M links
Counting unique items - Flajolet-Martin
Flajolet-Martin
...0 1 0 1 0 1 0 1 0 0 1 0 0 0
P(n trailing zeros) = ?
Flajolet-Martin
...0 1 0 1 0 1 0 1 0 0 1 0 0 0
P(n trailing zeros) = (½)^n
# seen hashes ≅ ?
Flajolet-Martin
...0 1 0 1 0 1 0 1 0 0 1 0 0 0
P(n trailing zeros) = (½)^n
# seen hashes ≅ 2^n
… x x x x x x x x 0 0 0
… x x x x x x x x 0 0 1
… x x x x x x x x 0 1 0
… x x x x x x x x 0 1 1
… x x x x x x x x 1 0 0
… x x x x x x x x 1 0 1
… x x x x x x x x 1 1 0
… x x x x x x x x 1 1 1
Flajolet-Martin
Element | Hash function | Hashed value  | Max number of trailing zeros
x1      | Hash          | ...010011011  | 0
x2      | Hash          | ...100101010  | 1
x1      | Hash          | ...010011011  | 1
...
xn      | Hash          | ...010000000  | log2(n)
Flajolet-Martin
Element | Hash functions | Hashed value  | Max number of trailing zeros
x1      | Hash1          | ...010011011  | 0
x1      | Hash..         | ...111001000  | 3
x1      | Hashk          | ...110100001  | 0
...
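The per-hash maxima above can be sketched in a few lines of code (illustrative only; the `fm_estimate` helper is hypothetical, and it takes the median across hash functions, a common robustness trick, rather than the plain single-hash 2^R estimator):

```python
import hashlib
import statistics

def _trailing_zeros(x: int) -> int:
    # number of trailing zero bits of x (0 -> 0 by convention here)
    return (x & -x).bit_length() - 1 if x else 0

def fm_estimate(stream, num_hashes=64):
    """Flajolet-Martin sketch: track the max trailing zeros per hash
    function, estimate uniques as the median of 2^max over the hashes."""
    maxima = [0] * num_hashes
    for item in stream:
        for i in range(num_hashes):
            # salt the hash to simulate num_hashes independent functions
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=i.to_bytes(8, "big")).digest()
            h = int.from_bytes(digest, "big")
            maxima[i] = max(maxima[i], _trailing_zeros(h))
    return statistics.median(2 ** m for m in maxima)
```

Note the memory footprint: 64 small counters per link, a few hundred bytes, regardless of how many IPs flow through the stream.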
Flajolet-Martin - Results
Naïve approach: 500 MB per link, 5 PB with 10M links
Flajolet-Martin: 1.5 KB per link, 15 GB with 10M links, 2% error
Probabilistic search
Memopal - Search for similar files
Locality-sensitive hashing & min hashing
Use case
The difference between a petabyte and a
gigabyte index is worth an approximation.
Search - Naïve approach
2 B files
1 PB of index
Slow search
Search - Min hash
Similarity

Document 1:
“Day was departing, and the embrowned air
Released the animals that are on earth
From their fatigues; and I the only one
Made myself ready to sustain the war,
Both of the way and likewise of the woe,
Which memory that errs not shall retrace.”

Document 2:
“Midway upon the journey of our life
I found myself within a forest dark,
For the straightforward pathway had been lost.
Ah me! how hard a thing it is to say
What was this forest savage, rough, and stern,
Which in the very thought renews the fear.”

Are they similar?

Jaccard = (Number of substrings in common) / (Total number of unique substrings)
Similarity
Substrings => Shingles of length S
Storage ≅ S * Doc_length * #Docs
Complexity ≅ Doc_length * #Docs
“Midway upon the journey of our life”
Set of shingles = { “Midway upon the”, “upon the journey”, “the journey of”, ... }
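Shingling and the Jaccard measure from the previous slide fit in a few lines (a minimal sketch; the helper names and the choice of 3-word shingles are assumptions of this example, since the slides leave S and the tokenization unspecified):

```python
def shingles(text, s=3):
    """Set of s-word shingles of a document (word shingles chosen here;
    character shingles are an equally common choice)."""
    words = text.split()
    return {" ".join(words[i:i + s]) for i in range(len(words) - s + 1)}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| -- the similarity the slides define."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

For example, `shingles("Midway upon the journey of our life")` yields the five overlapping triples shown above, and `jaccard` compares two such sets exactly, at the cost the slide quotes: storage and work proportional to the document length.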
Similarity
Fingerprint => 32-bit hash of a shingle
Storage ≅ 4 bytes * Doc_length * #Docs
Complexity ≅ Doc_length * #Docs
Set of fingerprints = { … 100101101 …, … 011010000 …, … 110010011 …, … }
Similarity
We need to find a signature Sig(D) of length K such that
if Sig(D1) ~ Sig(D2) then D1 ~ D2
Storage ≅ 4 bytes * K * #Docs
Complexity ≅ K * #Docs
With K << Doc_length
MinHash - Signature creation
Generate the fingerprints of the documents:
Doc1 = [ …10101, …01100, …10010, …00111 ]
Take a random permutation Hn of the fingerprints:
Doc1 = [ …00111, …01100, …10101, …10010 ]
The minhash of this permutation is its first fingerprint.
Define minhash(Hn, Doci) = first fingerprint of Doci hashed with Hn
Sig(Doci) of length K = [minhash1, minhash2, …, minhashK]
MinHash
Sig(Doc) is a set of K min-hashing fingerprints:
Signature(Doc1) = [ … 100101101 …, … 011010000 …, … 110010011 …, … 011100011 …, … 100100001 …, … ]
…
Signature(Docn) = [ … 100001101 …, … 101010110 …, … 110010011 …, … 010100101 …, … 100100001 …, … ]
MinHash
If Sig(D1) ~ Sig(D2) then Doc1 ~ Doc2

Signature(Doc1)   Signature(Doc2)   X
… 100101101 …     … 100001101 …     0
… 011010000 …     … 101010110 …     0
… 110010011 …     … 110010011 …     1
… 011100011 …     … 010100101 …     0
… 100100001 …     … 100100001 …     1
…                 …                 …

P(X = 1) = Jaccard(Doc1, Doc2)
∑ X / K ≃ Jaccard(Doc1, Doc2)
MinHash - Implementation
1. Generate the fingerprints of the document.
2. Define K hash functions: h1, h2, ..., hK.
3. Define Sig(Doc) = [h1(Doc), h2(Doc), ..., hK(Doc)]
4. Define O = { i | hi(Doc1) = hi(Doc2) }
5. Sim(Doc1, Doc2) = |O| / K ≃ Jaccard(Doc1, Doc2)

Storage ≅ 4 bytes * K * #Docs
Complexity ≅ K * #Docs
With K << Doc_length
![Page 49: Hash - A probabilistic approach for big data](https://reader034.vdocuments.us/reader034/viewer/2022051705/5884c9a61a28ab767c8b4c9b/html5/thumbnails/49.jpg)
Locality-Sensitive Hashing
Divide the signature Sig(Doc) into B bands of R rows each, such that B*R = K:
band 1: … 100101101 …
band 2: … 011010000 …
band ...: … 110010011 …
band B: …
(each band holds R fingerprints of the signature)
Locality-Sensitive Hashing - Analysis
Probability of two documents having at least one band in common: 1 - (1 - j^R)^B
● Threshold ≅ (1/B)^(1/R)
[S-curve: probability of becoming a candidate as a function of the Jaccard j of the documents; its shape is controlled by R and B]
![Page 51: Hash - A probabilistic approach for big data](https://reader034.vdocuments.us/reader034/viewer/2022051705/5884c9a61a28ab767c8b4c9b/html5/thumbnails/51.jpg)
Locality-Sensitive Hashing - Analysis
Probability of two documents having at least one band in common: 1 - (1 - j^R)^B
● Threshold ≅ (1/B)^(1/R)
● True Positive
● True Negative
● False Positive
● False Negative
[S-curve: the threshold splits the curve into true/false positive and true/false negative regions]
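The band probability and the threshold formula are easy to check numerically (a sketch; the B = 16, R = 8 values in the comment are hypothetical, not from the slides):

```python
def candidate_probability(j, b, r):
    """P(at least one identical band) = 1 - (1 - j^r)^b for Jaccard j."""
    return 1 - (1 - j ** r) ** b

def threshold(b, r):
    """Approximate Jaccard value where the S-curve rises steeply."""
    return (1 / b) ** (1 / r)

# e.g. K = 128 split as B = 16 bands of R = 8 rows: threshold(16, 8) ≈ 0.71,
# so pairs well above ~0.71 Jaccard almost always become candidates,
# while pairs far below almost never do.
```

Tuning B and R trades false positives against false negatives: more bands (larger B, smaller R) lowers the threshold and catches more true positives at the cost of more spurious candidates.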
![Page 52: Hash - A probabilistic approach for big data](https://reader034.vdocuments.us/reader034/viewer/2022051705/5884c9a61a28ab767c8b4c9b/html5/thumbnails/52.jpg)
Probabilistic search - Results
From:
Storage ≅ Shingle_length * Doc_length * #Docs
Complexity ≅ Doc_length * #Docs
To:
Storage ≅ 4 bytes * K * #Docs
Complexity ≅ K * #Docs * p(“candidate”)
With K << Doc_length and p(“candidate”) << 1
![Page 53: Hash - A probabilistic approach for big data](https://reader034.vdocuments.us/reader034/viewer/2022051705/5884c9a61a28ab767c8b4c9b/html5/thumbnails/53.jpg)
Probabilistic search - Results
Naïve approach: 2 B files, 1 PB of index, slow search
Min hash + LSH: 2 B files, 1.5 TB of index, fast search & update
![Page 54: Hash - A probabilistic approach for big data](https://reader034.vdocuments.us/reader034/viewer/2022051705/5884c9a61a28ab767c8b4c9b/html5/thumbnails/54.jpg)
Thank you
![Page 55: Hash - A probabilistic approach for big data](https://reader034.vdocuments.us/reader034/viewer/2022051705/5884c9a61a28ab767c8b4c9b/html5/thumbnails/55.jpg)
P(|questions| > 0) = 1 - [1 - p(question)]^|audience|
Any questions?