Probabilistic Data Structures and Approximate Solutions by Oleksandr Pryymak
DESCRIPTION
Probabilistic Data Structures and Approximate Solutions by Oleksandr Pryymak. http://nbviewer.ipython.org/gist/235/d3ee622926b5f77f03df

TRANSCRIPT
Probabilistic Data Structures and Approximate Solutions
by Oleksandr Pryymak
PyData London 2014
IPython notebook with code >>
Probabilistic || Approximate: Why?
Often:
● an approximate answer is sufficient
● need to trade accuracy for scalability or speed
● need to analyse a stream of data
Catch:
● despite typically achieving good results, there is a chance of bad worst-case behaviour
● use on large datasets (law of large numbers)
Code: Approximation

import random

def average(seq):
    return sum(seq) / float(len(seq))

x = [random.randint(0, 80000) for _ in xrange(10000)]
y = [i >> 8 for i in x]  # trim 8 bits off of each integer
z = x[:500]              # 5% sample (x is uniform)

avx = average(x)
avy = average(y) * 2**8  # add the 8 bits back
avz = average(z)

print avx
print avy, 'error %.06f%%' % (100 * abs(avx - avy) / float(avx))
print avz, 'error %.06f%%' % (100 * abs(avx - avz) / float(avx))

Output:
39547.8816
39420.7744 error 0.321401%
39591.424 error 0.110100%
C. Titus Brown “Awesome Big Data Algorithms”
Code: Sampling Data
Interview question: Get K samples from an infinite stream
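One classic answer is reservoir sampling (Algorithm R). A minimal sketch; the stream and K below are illustrative:

import random

def reservoir_sample(stream, k):
    # keep a uniform random sample of k items from a stream
    # of unknown (possibly unbounded) length
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # keep the new item with probability k/(i+1)
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print reservoir_sample(xrange(1000000), 5)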
Probabilistic Data Structures
Generally they:
● use less space than a full dataset
● require a higher CPU load
● are stream-friendly
● can be parallelized
● have a controlled error rate
Hash functions
A one-way function: a key of arbitrary length -> a fixed-length message
message = hash(key)
However, collisions are possible:
hash(key1) = hash(key2)
Code: Hashing
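A minimal sketch of hashing arbitrary keys into a fixed range, assuming the third-party mmh3 package (MurmurHash3 bindings), an illustrative choice:

import mmh3

def bucket(key, m):
    # map an arbitrary-length key to one of m buckets
    return mmh3.hash(key) % m

print mmh3.hash('cataract')    # fixed-length (32-bit) message
print bucket('cataract', 1024)
print bucket('periti', 1024)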
Hash collisions and performance
● Cryptographic hashes (like bcrypt) are not ideal for this use
● We need a fast algorithm with the lowest number of collisions:
Hash           Lowercase              Random UUID        Numbers
=============  =====================  =================  =====================
Murmur         145 ns, 6 collis       259 ns, 5 collis   92 ns, 0 collis
FNV-1          184 ns, 1 collis       730 ns, 5 collis   92 ns, 0 collis
DJB2           156 ns, 7 collis       437 ns, 6 collis   93 ns, 0 collis
SDBM           148 ns, 4 collis       484 ns, 6 collis   90 ns, 0 collis
SuperFastHash  164 ns, 85 collis      344 ns, 4 collis   118 ns, 18742 collis
CRC32          250 ns, 2 collis       946 ns, 0 collis   130 ns, 0 collis
LoseLose       338 ns, 215178 collis  -                  -
by Ian Boyd: http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
Murmur2 collisions
● cataract collides with periti
● roquette collides with skivie
● shawl collides with stormbound
● dowlases collides with tramontane
● cricketings collides with twanger
● longans collides with whigs
Hash randomness visualised as a hashmap:
● Murmur2 on a sequence of numbers: great (uniform spread)
● DJB2 on a sequence of numbers: not so great
Comparison: Locality Sensitive Hashing (LSH)
Image hashes
Kernelized locality-sensitive hashing for scalable image search. B. Kulis, K. Grauman; 2009 IEEE 12th International Conference on Computer Vision (ICCV).
Abstract: Fast retrieval methods are critical for large-scale and data-driven vision applications. Recent work has explored ways to embed high-dimensional features or complex distance functions into a low-dimensional Hamming space where items can be ...
Membership test: Bloom filter
A Bloom filter is probabilistic but only yields false positives.
Hash each item k times to get indices into a bit field of m bits (1..m).
At least one 0 means w definitely isn't in the set.
All 1s mean w probably is in the set.
Use a Bloom filter to serve requests
Code: Bloom filter
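A minimal Bloom filter sketch. The sizes and the double-hashing scheme (two md5-derived hashes combined to simulate k hashes, the Kirsch-Mitzenmacher trick) are illustrative choices, not from the talk:

import hashlib

class BloomFilter(object):
    def __init__(self, m=1024, k=5):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indices(self, item):
        digest = hashlib.md5(item).hexdigest()
        h1 = int(digest[:16], 16)
        h2 = int(digest[16:], 16)
        # simulate k hash functions from two
        return [(h1 + i * h2) % self.m for i in xrange(self.k)]

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = 1

    def __contains__(self, item):
        # all bits set -> probably in set; any 0 -> definitely not
        return all(self.bits[i] for i in self._indices(item))

bf = BloomFilter()
bf.add('shawl')
print 'shawl' in bf       # True
print 'stormbound' in bf  # almost certainly False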
Use a Bloom filter to store graphs
Graphs only gain nodes because of Bloom filter false positives.
Pell et al., PNAS 2012
Counting Distinct Elements
In: an infinite stream of data
Question: how many distinct elements are there?

This is similar to:

In: coin flips
Question: how many times has the coin been flipped?
Coin flips: intuition
● Long runs of HEADs in a random series are rare.
● The longer you look, the more likely you are to see a long one.
● So the longest run you have seen is correlated with how many times the coin has been flipped.
Code: Cardinality estimation
Cardinality estimation
Basic algorithm:
● n = 0
● For each input item:
  ○ Hash the item into a bit string
  ○ Count the trailing zeroes in the bit string
  ○ If this count > n:
    ■ Let n = count
● Estimated cardinality (“count distinct”) = 2^n
Cardinality estimation: HyperLogLog
Demo by: http://www.aggregateknowledge.com/science/blog/hll.html
Billions of distinct values in 1.5KB of RAM with 2% relative error
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
P. Flajolet, É. Fusy, O. Gandouet, F. Meunier; 2007
Code: HyperLogLog
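A minimal sketch of the HyperLogLog idea, without the small- and large-range corrections from the paper; the register count and md5 are illustrative choices:

import hashlib

def hyperloglog(stream, b=10):
    m = 1 << b                        # number of registers
    registers = [0] * m
    for item in stream:
        x = int(hashlib.md5(item).hexdigest(), 16)
        j = x & (m - 1)               # low b bits pick a register
        w = x >> b                    # remaining bits
        rank = 1                      # position of the first 1-bit
        while w & 1 == 0 and rank < 128:
            w >>= 1
            rank += 1
        registers[j] = max(registers[j], rank)
    alpha = 0.7213 / (1 + 1.079 / m)  # bias correction for large m
    return alpha * m * m / sum(2.0 ** -r for r in registers)

print hyperloglog(str(i) for i in xrange(100000))  # ~100000, ~3% std error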
Count-min sketch
count(value) = min{w1[h1(value)], ... wd[hd(value)]}
Frequency histogram estimation with a chance of over-counting
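A minimal sketch matching the formula above: d rows of w counters, one hash per row (md5 with per-row seeds here, purely illustrative):

import hashlib

class CountMinSketch(object):
    def __init__(self, w=1000, d=5):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in xrange(d)]

    def _hash(self, value, row):
        data = '%d:%s' % (row, value)
        return int(hashlib.md5(data).hexdigest(), 16) % self.w

    def add(self, value, count=1):
        for row in xrange(self.d):
            self.table[row][self._hash(value, row)] += count

    def count(self, value):
        # min over rows bounds the over-count from collisions
        return min(self.table[row][self._hash(value, row)]
                   for row in xrange(self.d))

cms = CountMinSketch()
for word in ['a', 'b', 'a', 'c', 'a']:
    cms.add(word)
print cms.count('a')  # ~3 (may over-count, never under-counts)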
Code: Frequent Itemsets
Machine Learning: Feature hashing
High-dimensional machine learning without a feature dictionary
by Andrew Clegg “Approximate methods for scalable data mining”
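A minimal sketch of the hashing trick: token features are mapped straight into a fixed-size vector, no dictionary required. The signed-hashing detail and md5 are illustrative choices:

import hashlib

def hash_features(tokens, n=1024):
    vec = [0.0] * n
    for tok in tokens:
        h = int(hashlib.md5(tok).hexdigest(), 16)
        idx = h % n                          # which slot the feature lands in
        sign = 1.0 if (h >> 64) & 1 else -1.0  # signed hashing reduces collision bias
        vec[idx] += sign
    return vec

v = hash_features('the cat sat on the mat'.split())
print sum(1 for x in v if x != 0)  # number of occupied slots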
Locality-sensitive hashing
To approximate nearest neighbours
by Andrew Clegg “Approximate methods for scalable data mining”
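One common construction (an illustrative choice, not necessarily the one in the slides) is random-hyperplane LSH for cosine similarity: nearby vectors receive the same bit signature with high probability:

import random

def lsh_signature(vec, planes):
    # one bit per hyperplane: which side of the plane is the vector on?
    bits = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

dim, n_planes = 8, 16
planes = [[random.gauss(0, 1) for _ in xrange(dim)]
          for _ in xrange(n_planes)]

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [1, 2, 3, 4, 5, 6, 7, 9]  # close to a
print lsh_signature(a, planes) == lsh_signature(b, planes)  # likely True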
Probabilistic Databases
● PrDB (University of Maryland)
● Orion (Purdue University)
● MayBMS (Cornell University)
● BlinkDB v0.1 alpha (UC Berkeley and MIT)
BlinkDB: queries
Queries with Bounded Errors and Bounded Response Times on Very Large Data
BlinkDB: architecture
References
Mining of Massive Datasets
by Jure Leskovec, Anand Rajaraman, and Jeff Ullman
http://infolab.stanford.edu/~ullman/mmds.html
Summary
● know the data structures
● know what you sacrifice
● control errors
http://nbviewer.ipython.org/gist/235/d3ee622926b5f77f03df
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/ by Ilya Katsov