Probabilistic Data Structures and Approximate Solutions by Oleksandr Pryymak PyData London 2014 IPython notebook with code >>

Post on 23-Jan-2015

Probabilistic Data Structures and Approximate Solutions by Oleksandr Pryymak. http://nbviewer.ipython.org/gist/235/d3ee622926b5f77f03df

TRANSCRIPT

Page 1: Probabilistic Data Structures and Approximate Solutions Oleksandr Pryymak

Probabilistic Data Structures and Approximate Solutions

by Oleksandr Pryymak
PyData London 2014
IPython notebook with code

Page 2

Probabilistic || Approximate: Why?

Often:
● an approximate answer is sufficient
● need to trade accuracy for scalability or speed
● need to analyse a stream of data

Catch:
● despite typically achieving a good result, there is a chance of bad worst-case behaviour
● use on large datasets (law of large numbers)

Page 3

Code: Approximation

    import random
    from numpy import average  # assumption: the slide relied on a pylab-style namespace

    x = [random.randint(0, 80000) for _ in range(10000)]
    y = [i >> 8 for i in x]  # trim 8 bits off of integers
    z = x[:500]              # 5% sample (x is uniform)

    avx = average(x)
    avy = average(y) * 2**8  # add 8 bits
    avz = average(z)

    print(avx)
    print(avy, 'error %.06f%%' % (100 * abs(avx - avy) / float(avx)))
    print(avz, 'error %.06f%%' % (100 * abs(avx - avz) / float(avx)))

    39547.8816
    39420.7744 error 0.321401%
    39591.424 error 0.110100%

C. Titus Brown “Awesome Big Data Algorithms”

Page 5

Probabilistic Data Structures

Generally they:
● use less space than a full dataset
● require higher CPU load
● are stream-friendly
● can be parallelized
● have a controlled error rate

Page 6

Hash functions

One-way function: maps a key of arbitrary length to a message of fixed length

message = hash(key)

However, collisions are possible:

hash(key1) = hash(key2)
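Both properties can be sketched in a few lines. This is a minimal illustration that assumes Python's standard `zlib.crc32` as the hash; the slides do not say which function was used at this point. The helper names `hash32`, `hash8` and `find_collision` are hypothetical, and the 8-bit truncation is there only to make a collision easy to observe.

```python
import zlib

def hash32(key):
    """Map a string key of any length to a fixed-length 32-bit integer."""
    return zlib.crc32(key.encode("utf-8"))

def hash8(key):
    """Hypothetical helper: keep only 8 bits so collisions show up quickly."""
    return hash32(key) & 0xFF

def find_collision(keys, hash_fn):
    """Return the first pair of distinct keys sharing a hash value, or None."""
    seen = {}
    for key in keys:
        h = hash_fn(key)
        if h in seen and seen[h] != key:
            return seen[h], key
        seen[h] = key
    return None

# 1000 distinct keys but only 256 possible 8-bit values:
# by the pigeonhole principle at least two keys must collide.
keys = ["key%d" % i for i in range(1000)]
pair = find_collision(keys, hash8)
print(pair, hash8(pair[0]), hash8(pair[1]))
```

The same argument applies to the full 32-bit output, just with far more keys needed before a collision becomes likely.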

Page 7

Code: Hashing

Page 8

Hash collisions and performance

● Cryptographic hashes (like bcrypt) are not ideal for our use
● Need a fast algorithm with the lowest number of collisions:

Hash           Lowercase      Random UUID    Numbers
=============  =============  =============  ==============
Murmur         145 ns         259 ns         92 ns
               6 collis       5 collis       0 collis
FNV-1          184 ns         730 ns         92 ns
               1 collis       5 collis       0 collis
DJB2           156 ns         437 ns         93 ns
               7 collis       6 collis       0 collis
SDBM           148 ns         484 ns         90 ns
               4 collis       6 collis       0 collis
SuperFastHash  164 ns         344 ns         118 ns
               85 collis      4 collis       18742 collis
CRC32          250 ns         946 ns         130 ns
               2 collis       0 collis       0 collis
LoseLose       338 ns         -              -
               215178 collis

by Ian Boyd: http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

Murmur2 collisions

● cataract collides with periti
● roquette collides with skivie
● shawl collides with stormbound
● dowlases collides with tramontane
● cricketings collides with twanger
● longans collides with whigs

Page 9

Hash randomness visualised (hashmap)

Great: murmur2 on a sequence of numbers

Not so great: DJB2 on a sequence of numbers

Page 10

Comparison: Locality Sensitive Hashing (LSH)

Page 12

Membership test: Bloom filter

A Bloom filter is probabilistic but only yields false positives.

Hash each item k times to get k indices into a bit field (positions 1..m).

At least one 0 means w definitely isn’t in the set.

All 1s would mean w probably is in the set.
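The rules above can be sketched as a small class. This is an illustrative version, not the notebook's code: the class name `BloomFilter`, the parameters `m` and `k`, and the use of salted `hashlib.sha256` digests to derive the k indices are all assumptions made for portability. A fast non-cryptographic hash such as Murmur would be the more typical choice.

```python
import hashlib

class BloomFilter:
    """Bit field of m positions; each item sets k positions chosen by hashing."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for readability

    def _indices(self, item):
        # Derive k indices by salting the item with the hash number (assumption).
        for seed in range(self.k):
            digest = hashlib.sha256(("%d:%s" % (seed, item)).encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = 1

    def __contains__(self, item):
        # Any 0 means "definitely not in the set"; all 1s means "probably in the set".
        return all(self.bits[i] for i in self._indices(item))

bf = BloomFilter(m=10000, k=5)
for word in ("cataract", "roquette", "shawl"):
    bf.add(word)

print("shawl" in bf)     # an added item is always reported as present
print("longans" in bf)   # usually False, but a false positive is possible
```

Items that were added are never reported as absent, so the only possible error is a false positive, whose rate depends on m, k and the number of items stored.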

Page 13

Use Bloom filter to serve requests

Page 14

Code: bloom filter

Page 15

Use Bloom filter to store graphs

Graphs only gain nodes because of Bloom filter false positives.

Pell et al., PNAS 2012

Page 16

Counting Distinct Elements

In: infinite stream of data
Question: how many distinct elements are there?

is similar to:

In: coin flips
Question: how many times has the coin been flipped?

Page 17

Coin flips: intuition

● Long runs of HEADs in a random series are rare.

● The longer you look, the more likely you see a long one.

● Long runs are very rare and are correlated with how many coins you’ve flipped.
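A small simulation can make this intuition concrete. The sketch below is illustrative only: the helper `longest_heads_run`, the fixed seed and the sample sizes are assumptions, and `2 ** run` is shown as a deliberately crude, noisy guess at the number of flips rather than a proper estimator.

```python
import random

def longest_heads_run(flips):
    """Length of the longest consecutive run of heads (True values)."""
    best = current = 0
    for is_head in flips:
        current = current + 1 if is_head else 0
        best = max(best, current)
    return best

rng = random.Random(42)  # fixed seed so reruns give the same series
flips = [rng.random() < 0.5 for _ in range(100000)]

# Look at longer and longer prefixes of the same series:
# the longest run can only stay the same or grow.
for n in (10, 100, 1000, 10000, 100000):
    run = longest_heads_run(flips[:n])
    print(n, run, 2 ** run)
```

A run of length r has probability about 2^-r of starting at any given flip, which is why the longest run observed carries information about how many flips were made.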

Page 18

Code: Cardinality estimation

Page 19

Cardinality estimation

Basic algorithm:

● n = 0
● For each input item:
    ○ Hash item into bit string
    ○ Count trailing zeroes in bit string
    ○ If this count > n:
        ■ Let n = count
● Estimated cardinality (“count distinct”) = 2^n
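The basic algorithm translates almost line for line into Python. This is a sketch under stated assumptions: the function names are hypothetical, and the first 4 bytes of an MD5 digest are used as the 32-bit bit string simply because `hashlib` ships with Python.

```python
import hashlib

def trailing_zeroes(x, width=32):
    """Number of trailing zero bits in x (width if x is 0)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1

def estimate_cardinality(items):
    """Single-register estimate: 2 ** (max trailing zeroes seen in any hash)."""
    n = 0
    for item in items:
        digest = hashlib.md5(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:4], "big")  # 32-bit hash value (assumption)
        count = trailing_zeroes(h)
        if count > n:
            n = count
    return 2 ** n

stream = ["user-%d" % (i % 5000) for i in range(20000)]  # 5000 distinct values
print(estimate_cardinality(stream))
```

Because duplicates hash to the same bit string, repeating an item never changes the result. The estimate is always a power of two and has high variance, which is the weakness HyperLogLog addresses by averaging over many registers.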

Page 20

Cardinality estimation: HyperLogLog

Demo by: http://www.aggregateknowledge.com/science/blog/hll.html

Billions of distinct values in 1.5KB of RAM with 2% relative error

HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm

P.Flajolet, É.Fusy, O.Gandouet, F.Meunier; 2007
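A simplified HyperLogLog, following the raw estimator and the small-range correction described in the paper, might look like the sketch below. It is not the notebook's implementation: the class name, the default `p=10` (1024 registers), and the choice of a 64-bit slice of SHA-1 as the hash are assumptions, and the large-range correction is omitted because a 64-bit hash makes it unnecessary at these sizes.

```python
import hashlib
import math

class HyperLogLog:
    """Simplified HyperLogLog with m = 2**p registers (p >= 7 assumed)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant from the paper, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash (assumption)
        idx = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of leftmost 1 bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate

hll = HyperLogLog(p=10)
for i in range(100000):
    hll.add("item-%d" % i)
print(round(hll.count()))
```

With m registers the expected relative error is roughly 1.04 / sqrt(m), so about 3% for the 1024 registers used here. More registers trade memory for accuracy.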

Page 21

Code: HyperLogLog

Page 22

Count-min sketch

count(value) = min{w1[h1(value)], ... wd[hd(value)]}

Frequency histogram estimation with chance of over-counting
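The formula above corresponds to d rows of counters, each indexed by its own hash function. The following is a minimal sketch, assuming salted `hashlib.sha256` digests stand in for the d hash functions h1..hd; the class and parameter names (`CountMinSketch`, `width`, `depth`) are illustrative.

```python
import hashlib

class CountMinSketch:
    """depth rows of width counters; count() returns the minimum over rows."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, value):
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.sha256(("%d:%s" % (row, value)).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, value, amount=1):
        for r in range(self.depth):
            self.rows[r][self._index(r, value)] += amount

    def count(self, value):
        # Collisions only ever add to a counter, so the minimum is the
        # least contaminated estimate and never undercounts.
        return min(self.rows[r][self._index(r, value)] for r in range(self.depth))

cms = CountMinSketch(width=200, depth=4)
for word in ["a"] * 50 + ["b"] * 20 + ["c"] * 5:
    cms.add(word)
print(cms.count("a"), cms.count("b"), cms.count("c"), cms.count("never-seen"))
```

Every reported count is at least the true count, which is the "chance of over-counting" named on the slide. Widening the rows lowers the size of the overcount, and adding rows lowers its probability.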

Page 23

Code: Frequent Itemsets

Page 24

Machine Learning: Feature hashing

High-dimensional machine learning without feature dictionary

by Andrew Clegg “Approximate methods for scalable data mining”
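The idea of working without a feature dictionary can be sketched as follows: each token is hashed straight to a column index, so no token-to-index mapping has to be stored. This is an illustrative version of the hashing trick; the function name `hash_features`, the bucket count and the use of one hash bit as a sign are assumptions rather than details taken from the slide.

```python
import hashlib

def hash_features(tokens, n_buckets=16):
    """Turn a list of string tokens into a fixed-length numeric vector."""
    vec = [0.0] * n_buckets
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h % n_buckets                       # column chosen by the hash
        sign = 1.0 if (h >> 63) & 1 else -1.0      # sign bit so collisions tend to cancel
        vec[idx] += sign
    return vec

doc = "the quick brown fox jumps over the lazy dog".split()
print(hash_features(doc))
```

The vector length is fixed in advance regardless of vocabulary size. The price is that distinct tokens can share a column, which is the same collision trade-off as in the hash-based structures earlier in the talk.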

Page 25

Locality-sensitive hashing

To approximate nearest neighbours

by Andrew Clegg “Approximate methods for scalable data mining”
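One common LSH family uses random hyperplanes: each bit of a signature records which side of a random hyperplane a vector falls on, so vectors separated by a small angle agree on most bits. The sketch below assumes this family, which may differ from what the slide pictured, and all names (`make_hyperplanes`, `signature`, `hamming`) are hypothetical.

```python
import random

def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )

def hamming(a, b):
    """Number of positions where two signatures differ."""
    return sum(x != y for x, y in zip(a, b))

planes = make_hyperplanes(n_bits=16, dim=3)
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]     # same direction as a
c = [-1.0, -2.0, -3.0]  # opposite direction to a

print(hamming(signature(a, planes), signature(b, planes)))
print(hamming(signature(a, planes), signature(c, planes)))
```

Candidate neighbours are then found by comparing short signatures, or by bucketing on them, instead of computing exact distances against every stored vector.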

Page 26

Probabilistic Databases

● PrDB (University of Maryland)

● Orion (Purdue University)

● MayBMS (Cornell University)

● BlinkDB v0.1alpha (UC Berkeley and MIT)

Page 27

BlinkDB: queries

Queries with Bounded Errors and Bounded Response Times on Very Large Data

Page 28

BlinkDB: architecture

Page 29

References

Mining of Massive Datasets
by Jure Leskovec, Anand Rajaraman, and Jeff Ullman
http://infolab.stanford.edu/~ullman/mmds.html