e.g.m. petrakishashing1 data organization in main memory or disk sequential, binary trees, … ...

E.G.M. Petrakis Hashing 1

Hashing

Data organization in main memory or disk sequential, binary trees, …

The location of a key depends on other keys => unnecessary key comparisons to find a key

Question: find key with a single comparison Hashing: the location of a record is

computed using its key only Fast for random accesses - slow for range

queries

Hash Table

Hash Function: transforms keys to array indices

h(key)dataindex

h(key): Hash Function

0 4967000 1 2 8421002 3 . . . . . .

395 396 4618396 397 4957397 398 399 1286399 400 401

. . 990 0000990 991 0000991 992 1200992 993 0047993 994 995 9846995 996 4618996 997 4967997 998 999 0001999

position

key record

h(key) = key mod 1000h(key) = key mod 1000

Good Hash Functions

1. Uniform: distribute keys evenly in space

2. Perfect: two records cannot occupy the same location or

3. Order preserving: Difficult to find such hash functions Property 2 is the most essential Most functions are no better than

h(key) = key mod m Hash collision:

)h(k)h(k:kk jiji

)()(:, jiji khkhjikk )()(:, jiji khkhjikk

Collision Resolution

1. Open Addressing (rehashing): compute new position to store the key in the table (no extra space)

i. linear probingii. double hashing

2. Separate Chaining: lists of keys mapped to the same position (uses extra space)

Open Addressing

Computes a new address to store the key if it is occupied (rehashing)

if occupied too, compute a new address, … until an empty position is found

primary hash function: i=h(key) rehash function: rh(i)=rh(h(key)) hash sequence: (h0,h1,h2…) = (h(key),

rh(h(key)), rh(rh(h(key)))…) To find a key follow the same hash

sequence

Example

i=h(key)=key mod 100

rh(i) = (i+1) mod 100 key: 193 i=h(193)=93 rh(i)=(93+1)=94 Key 193 will occupy

position 94

0 100 1 101 2 . . .

. 90 990 91 991 92 992 93 993 94 . . .

Problem 1: Locate Empty Positions

No empty position can be found i. the table is full

check on number of empty positionsii. the hash function fails to find an empty

position although the table is not full !! i=h(key) = key mod 1000 rh(i) = (i + 200) mod 1000 => checks only 5

positions on a table of 1000 positions rh(i) = (i+1) mod 1000 successive positions rh(i) = (i+c) mod 1000 where GCD(c,m) = 1

Problem 2: Primary Clustering

Different keys that hash into different addresses compete with each other in successive rehashes i=h(key) = key mod 100 rh(i) = (i+1) mod 100 keys: 1990, 1991, 1992,

1993, 1994 => 94

0 100 1 101 2 . . .

. 90 990 91 991 92 992 93 993 94 . . .

Problem 3: Secondary Clustering

Different keys which hash to the same hash value have the same rehash sequence i=h(key) = key mod 10 rh(i,j) = (i + j) mod 10i. key 23 : h(23) = 3

rh = 4, 6, 9, 3, …ii. key 13 : h(13) = 3

rh = 4, 6, 9, 3, …

0 10 1 2 3 53 4 14 5 15 6 46 7 8 9

Linear Probing

Store the key into the next free position h0 = h(key) usually h0 = key mod m

hi = (hi-1 + 1) mod m, i >= 1

1 301 2 22 3 102 4 452 5 35 6 7 8 9 99

S = {22, 35, 301, 99, 102, 452}

Observation 1

Different insertion sequences => different hash sequences S1={11,3,27,99,8,50,7

7,22,12,31,33,40,53}=>28 probes

S2={53,40,33,31,12,22,77,50,8,99,27,3,11}=> 30 probes

H(key) = key mod 13H(key) = key mod 13

number

of probes0 17 2

1 27 1 2 12 4 3 3 1 4 40 4 5 31 1 6 53 6 7 33 1 8 99 1 9 8 2

10 22 2 11 11 1 12 50 2

Observation 2

Deletions are not easy: i=h(key) = key mod 10 rh(i) = (i+1) mod 10

Action: delete(65) and search(5) Problem: search will stop at the

empty position and will not find 5 Solution:

mark position as deleted rather than empty the marked position can be reused

Observation 3

Linear probing tends to create long sequences of occupied positions the longer a

sequence is, the longer it tends to become

P: probability to use a position in the cluster

Observation 4

Linear probing suffers from both primary and secondary clustering

Solution: double hashing uses two hash functions h1, h2 and a

rehashing function rh

Double Hashing

Two hash functions and a rehashing function primary hash function i=h1(key)= key mod m

secondary hash function h2(key)

rehashing function: rh(key) = (i + h2(key)) mod m

h2(m,key) is some function of m, key helps rh in computing random positions in the

hash table h2 is computed once for each key!

Example of Double Hashing

i. hash function: h1(key) = key mod m

q = (key div m) mod m ii. rehash function: rh(i, key) = (i + h2(key)) mod m

0q2 div m(key)h2

Example (continued)

A. m = 10, key = 23

h1(23) = 3, h2(23) = 2

rh(3,2)=(3+2) mod 10 = 5

rehash sequence: 5, 7, 9, 1, …

m = 10, key = 13

h1(key)=3, h2(13)=1, rh(3,1)=(3+1)mod10=4

rehash sequence: 4, 5, 6,…

Performance of Open Addressing

Distinguish between successful and unsuccessful search

Assume a series of probes to random positions independent events load factor: λ = n/m λ: probability to probe an occupied position each position has the same probability P=1/m

Unsuccessful Search

The hash sequence is exhausted let u be the expected number of

probes u equals the expected length of the

hash sequence P(k): probability to search k positions

in the hash sequence

2probes)P(1probes)P(

________________________

P(k)P(k)P(k)

P(3)P(3)P(3)

P(2)P(2)

kP(k)u1k

)ocupied positions 1k P(first

probes) kP(u

independent events

u increases with λ =>performance drops as

λ increases

Successful Search

The hash sequence is not exhausted

the number of probes to find a key equals the number of probes s at the time the key was inserted plus 1

λ was less at that time consider all values of λ

11)dx(uλ1

increases with λ

u: equivalent to unsuccessful search

approximation

Performance

The performance drops as λ increases the higher the value of λ is, the higher

the probability of collisions

Unsuccessful search is more expensive than successful search unsuccessful search exhausts the

hash sequence

Experimental ResultsSUCCESSFUL UNSUCCESSFUL

LOAD FACTOR

LINEAR i + bkey DOUBLE LINEAR i + bkey DOUBLE

25% 1.17 1.16 1.15 1.39 1.37 1.33

50% 1.50 1.44 1.39 2.50 2.19 2.00

75% 2.50 2.01 1.85 8.50 4.64 4.00

90% 5.50 2.85 2.56 50.50 11.40 10.00

95% 10.50 3.52 3.15 200.50 22.04 20.00

Performance on Full TableSUCCESSFUL

TABLE SIZE (m)

LINEAR i + bkey DOUBLE

UNSUCCESSFUL LOG2m

100 6.60 4.62 4.12 50.50 6.64

500 14.35 6.22 5.72 250.50 8.97

1000 20.15 6.91 6.41 500.50 9.97

5000 44.64 8.52 8.02 2500.5 12.29

10000 63.00 9.21 8.71 5000.50 13.29

Separate Chaining

Keys hashing to the same hash value are stored in separate lists

one list per hash position can store more than m records easy to implement the keys in each list can be ordered

40 130

227417

h(key) = key mod m

Performance of Separate Chaining

Depends on the average chain size insertions are independent events let P(c,n,m): probability that a

position has been selected c times after n insertions on a table of size m

P(c,n,m): probability that the chain has length c => binomial distribution

cncqpc

nm)n,P(c,

p=1/m: success caseq=1-p: failure case

nm)n,P(c,

=> P(c,n,m)=(1/c!)λce-λ

Poison

mn, =>

Unsuccessful Search

The entire chain is searched the average number of comparisons

equals its average length u

λec!λ

cλ)cP(c,u λ

Successful Search

Not the whole chain is searched the average number of comparisons

equals the length s of the chain at time the key was inserted plus 1

the performance at the time a key was inserted equals that of unsuccessful search!

11)dx(xλ1

1)dx(uλ1

Performance

The performance drops with the length of the chains worst case: all keys are stored in a

single chain worst case performance: O(N) unsuccessful search performs better

than successful search!! WHY ? no problem with deletions!!

Coalesced Hashing

The hash sequence is implemented as a linked list within the hash table no rehash function the next hash position

is the next available position in linked list

extra space for the list

0 … … 1 … … 2 49 7 3 … … 4 … … 5 29 2 6 … … 7 59 - 8 … … 9 19 5

h(key) = key mod 10keys: 19, 29, 49, 59

0 nilkey -1 1 nilkey 0 2 . 1 3 . 2 4 . 3 5 . 4 6 . 5 7 . 6 8 . 7 9 nilkey 8

initially: avail = 9

h(key) = key mod 10

keys: 14,29,34,28,42,39,84,38

0 nilkey -1 1 nilkey 0 2 42 -1 3 38 -1 4 14 8 5 84 -1 6 39 -1 7 28 3 8 34 5 9 29 6

initialization

List of empty positions

Holds lists ofrehashingpositions and list of emptypositions

Performance of Coalesced Hashing

Unsuccessful search

Successful search

0.752λ

e41 2λ

0.754λ

8λ1e2λ

probes/search

e.g.m. petrakishashing1 data organization in main memory or disk sequential, binary trees, … ...

Documents

sequential clinical scheduling with service criteria...

e.g.m. petrakistext clustering1 clustering “clustering is...

f329 unnecessary medications 1 2006 unnecessary medications...

sequential circuits - digital logic design (eee...

sequential abstract state machines capture sequential...

unnecessary delay

the right to silence; unnecessary

e.g.m. petrakisintroduction1 technical university of crete...

sequential logic...

e.g.m. petrakisvideo processing1 video is a rich...

e.g.m petrakismultimedia information retrieval 1 information...

son: love theme from midnight express e.g.m. mar 07

learning sequential decision rules using simulation...

unnecessary consumption

chapter 10 sequential files and structures. class 10:...

unnecessary spending habits

unnecessary austerity unnecessary government shutdown

1 sequential circuits definition of sequential circuit...

unnecessary suffering 7 march 2012

sequential abstract state machines capture sequential...