dictionaries (maps) - softeng.polito.it · advanced programming dictionaries, hash tables 5 9...


Upload: dangtuyen

Post on 17-Feb-2019


Advanced Programming Dictionaries, Hash Tables


Dictionaries (Maps)

Hash tables

ADT Dictionary or Map

It has the following operations:
• INSERT: inserts a new element, associated with a unique value of a field (the key)
• SEARCH: searches for an element with a given value of the key; if it exists, it is returned
• DELETE: removes the element with the given key, if it exists


Uses of dictionaries

• Symbol table in a compiler
  • Key: name of identifier
  • Values: types, context
• Citizens in a country
  • Key: social security number
  • Values: name, surname, age, address

Associative array

A dictionary can easily be implemented with an associative array, in which a value is indexed by its key instead of by its position. Ex:
• Citizens = { {“jr50”, “john”, “red”}, {“bg40”, “bill”, “green”} }
• Citizens[“jr50”] = {“jr50”, “john”, “red”}
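As a rough illustration, Python's built-in dict behaves like an associative array, so the Citizens example could be sketched as follows. The variable names and the third record are made up here for illustration and do not come from the slides.

```python
# Associative array sketch: values are indexed by key instead of by position.
# The first two records reuse the citizens from the example.
citizens = {
    "jr50": ("jr50", "john", "red"),
    "bg40": ("bg40", "bill", "green"),
}

# SEARCH: look up a record by its key.
record = citizens["jr50"]

# INSERT: associate a new record with its (unique) key.
citizens["mw30"] = ("mw30", "mary", "white")  # hypothetical extra record

# DELETE: remove the record with a given key, if it exists.
citizens.pop("bg40", None)
```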


Goal

Complexity of insert/search/delete:
• O(1) in the average case
• Θ(n) in the worst case

Hash tables

An implementation of associative arrays: an array containing the elements, where the address of each element is computed by a hash function in O(1) time. Ex:
• Hash(“jr50”) = 117: the element (john, red) is in position 117 of the array


Associative array

[Figure: associative array T with slots 0 to 9. U (all keys) is the set of keys 0 to 9; K (used keys) is the subset 2, 3, 5, 8. Each used key selects its own slot of T, which stores the key and its value.]

Dictionary implemented with an associative array

• T: associative array, key: key, x: value
• Search(T, key)
  • return T[key]
• Insert(T, x)
  • T[key[x]] ← x
• Delete(T, x)
  • T[key[x]] ← NIL
• Time complexity O(1), memory O(|U|)

|U| is the number of different values the key can take.
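The three operations can be sketched in Python as below. This is only an illustrative sketch: the class name is invented here, keys are assumed to be small non-negative integers, and insert takes the key and value separately instead of an element x carrying its own key.

```python
class DirectAddressTable:
    """One slot per possible key, so memory is O(|U|)."""

    def __init__(self, universe_size):
        # Slot i holds the element whose key is i, or None (NIL) if absent.
        self.slots = [None] * universe_size

    def search(self, key):
        return self.slots[key]

    def insert(self, key, value):
        self.slots[key] = (key, value)

    def delete(self, key):
        self.slots[key] = None


table = DirectAddressTable(universe_size=10)  # keys are the integers 0..9
for k, v in [(2, "two"), (3, "three"), (5, "five"), (8, "eight")]:
    table.insert(k, v)
table.delete(3)
```

Every operation is a single array access, but the array has one slot per possible key even when only a few keys are in use.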


Assumptions

Two assumptions are needed:
• No two elements have the same key (keys are unique)
• The size of T equals the number of possible values of the key, |U|
  • This is critical: if |U| is large, the array is infeasible
  • Ex: key = SSN, 10 characters, |U| = 24^10 ≈ 6 × 10^13
    • Assuming an alphabet of 24 values
  • But the citizens of a country number in the order of 10^7 to 10^9
• It is essential that the size of the array be O(|K|) and not O(|U|)
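The gap between |U| and |K| in the SSN example can be checked with a few lines of Python. The constant names are chosen here for illustration, and the number of used keys takes the upper end of the range quoted in the slide.

```python
# Size of the key universe versus the number of keys actually in use.
ALPHABET_SIZE = 24   # alphabet size assumed in the slide
KEY_LENGTH = 10      # characters in an SSN-like key

universe_size = ALPHABET_SIZE ** KEY_LENGTH   # |U| = 24^10
used_keys = 10 ** 9                           # upper end of the 10^7 to 10^9 range

# Fraction of a |U|-sized array that would ever be occupied.
occupancy = used_keys / universe_size
print(f"|U| = {universe_size:.2e}, occupancy = {occupancy:.2e}")
```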

Hash tables

• A kind of associative array with size O(|K|) and not O(|U|)
• Insert/search/delete are O(1) on average
• However, the index must be computed from the key in a different way: with a hash function


Hash function

• A hash table is an array of size m (m << |U|)
• The hash function h maps a key to a position (index) in the array
  • h: U → { 0, 1, ..., m-1 }
• Element x is stored in
  • T[h(key[x])]

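Hash function

[Figure: hash function h mapping keys k1 to k5 of the universe U to slots 0 to m-1 of table T. The slots are labeled h(k1), h(k4), h(k3) and h(k2) = h(k5): keys k2 and k5 hash to the same slot.]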


Collision

• A collision occurs when h(ki) = h(kj) and ki ≠ kj
• It is essential to:
  • Minimize the number of collisions
    • This depends on the hash function
  • Manage collisions

Example

The key is a string of characters. Hash function:

h(k) = (Σ ci) mod m

with
• ci: ASCII code of the i-th character of string k
• m: number of elements (size) of array T


Ex (II)

m = 15.
• h(“pippo”) = (112+105+112+112+111) mod 15 = 552 mod 15 = 12
• h(“pluto”) = (112+108+117+116+111) mod 15 = 564 mod 15 = 9
• h(“paperino”) = (112+97+112+101+114+105+110+111) mod 15 = 862 mod 15 = 7
• h(“topolino”) = (116+111+112+111+108+105+110+111) mod 15 = 884 mod 15 = 14
• h(“paperoga”) = (112+97+112+101+114+111+103+97) mod 15 = 847 mod 15 = 7

Collision between the strings “paperino” and “paperoga”

Ex (II)

m = 15.
• h(“Mickey”) = (77 + 105 + 99 + 107 + 101 + 121) mod 15 = 10
• h(“Minnie”) = (77 + 105 + 110 + 110 + 105 + 101) mod 15 = 8
• h(“Donald”) = (68 + 111 + 110 + 97 + 108 + 100) mod 15 = 9
• h(“Daisy”) = (68 + 97 + 105 + 115 + 121) mod 15 = 11
• h(“foo”) = (102 + 111 + 111) mod 15 = 9
• h(“bar”) = (98 + 97 + 114) mod 15 = 9

Collision among the strings “Donald”, “foo” and “bar”: all three hash to 9
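The additive hash function of these examples can be sketched in Python to reproduce the values and make the collisions visible. The function name is chosen here for illustration; ord() gives the ASCII code for these characters.

```python
def additive_hash(key, m):
    """Sum the character codes of `key` and reduce modulo the table size m."""
    return sum(ord(c) for c in key) % m


M = 15
names = ["pippo", "pluto", "paperino", "topolino", "paperoga",
         "Mickey", "Minnie", "Donald", "Daisy", "foo", "bar"]

# Group the keys by slot: any slot with more than one key is a collision.
slots = {}
for name in names:
    slots.setdefault(additive_hash(name, M), []).append(name)

for index in sorted(slots):
    print(index, slots[index])
```

Because the sum ignores character order, any two anagrams collide under this function regardless of m.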


Collisions mitigation

The best hash functions distribute the |K| elements as uniformly (randomly) as possible among the m available positions. Typical strategies:
• pick m as a prime number
• manipulate the bits of k

Collision management

• Chaining
• Open addressing


Chaining (I)

Position i can contain more than one element. This can be implemented through a linked list.

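Chaining (II)

[Figure: hash table T with slots 0 to m-1 and chaining. Keys k1 to k6 of the universe U are hashed into T; each slot points to a list of the keys that hash to it. The lists shown are (k1, k4), (k2, k5), (k3) and (k6).]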


Chaining (III)

• T[i] is a pointer to a list, initially NIL
• CHAINED-HASH-INSERT(T, x)
  • Insert x at the head of list T[h(key[x])]
• CHAINED-HASH-SEARCH(T, k)
  • Search for an element with key k in list T[h(k)]
• CHAINED-HASH-DELETE(T, x)
  • Remove x from list T[h(key[x])]
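A minimal Python sketch of the three chained operations follows. It rests on several assumptions that are not in the slides: the class name is invented, Python's built-in hash() stands in for h, and each chain is a Python list rather than a true linked list, so head insertion here is not O(1).

```python
class ChainedHashTable:
    """Hash table with chaining: each slot holds a chain of (key, value) pairs."""

    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]  # every chain starts empty (NIL)

    def _h(self, key):
        # Assumption: the built-in hash() plays the role of the hash function h.
        return hash(key) % self.m

    def insert(self, key, value):
        # Insert at the head of the chain; assumes the key is not already present.
        # A real linked list would make this step O(1).
        self.slots[self._h(key)].insert(0, (key, value))

    def search(self, key):
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False


table = ChainedHashTable(m=4)
for n in range(10):              # 10 keys in 4 slots forces collisions
    table.insert(n, n * n)
removed = table.delete(7)
```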

Chaining - Complexity

• Assumption: unordered, singly linked lists
• Insert: O(1)
• Search: O(length of list)
• Delete: O(length of list)
  • Requires a search


Search (hash + chaining) - complexity

• We have
  • n: number of elements in hash table T
  • m: size of hash table T
  • α = n/m: load factor of hash table T
• Normally α > 1
• What happens if m, n → ∞ (with the same α)?

Search (hash + chaining) – complexity (II)

• Search
  • Worst case: all elements end up in one unordered linked list
    • Time to compute h(k) +
    • Time to traverse the list, Θ(n)
  • Best case: depends on how uniformly h(k) distributes the elements
• Let’s assume h(k) achieves simple uniform hashing (it distributes the elements in a perfectly uniform way). This requires that the table grows with the number of elements, so that α remains constant.


Search (hash + chaining) – complexity (II)

Search
• Time to compute h(k) = O(1).
• Time to traverse the list depends on
  • the length of list T[h(k)]
  • whether the element is found or not found
• In both cases the complexity is Θ(1+α).

Summing up: O(1) + Θ(1+α) = Θ(1+α), which is O(1) when α is constant.

Open Addressing

• T[i] can contain only one element
• In case of collision another free cell is searched for
  • the next one, the one after that, etc.
• It must hold that α < 1


Hash-Insert

HASH-INSERT(T, k)
    i ← 0
    repeat j ← h(k, i)
        if T[j] = NIL
            then T[j] ← k
                 return
            else i ← i + 1
    until i = m
    error “hash table overflow”

Hash-Search

HASH-SEARCH(T, k)
    i ← 0
    repeat j ← h(k, i)
        if T[j] = k
            then return j
        i ← i + 1
    until T[j] = NIL or i = m
    return NIL
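The two procedures translate fairly directly into Python. The sketch below makes assumptions of its own: keys are integers, the probe function is linear probing with h'(k) = k mod m, and hash_insert returns the slot it used, which the pseudocode does not.

```python
def probe(k, i, m):
    """Assumed probe function: h(k, i) = (h'(k) + i) mod m with h'(k) = k mod m."""
    return (k % m + i) % m


def hash_insert(table, k):
    """Store integer key k in the first free slot of its probe sequence."""
    m = len(table)
    for i in range(m):
        j = probe(k, i, m)
        if table[j] is None:
            table[j] = k
            return j
    raise OverflowError("hash table overflow")


def hash_search(table, k):
    """Return the slot holding k, or None if k is not in the table."""
    m = len(table)
    for i in range(m):
        j = probe(k, i, m)
        if table[j] == k:
            return j
        if table[j] is None:   # an empty slot ends the probe sequence
            return None
    return None


table = [None] * 7
for key in (10, 17, 3):        # 10, 17 and 3 all satisfy k mod 7 == 3
    hash_insert(table, key)
```

The three keys land in consecutive slots 3, 4 and 5, and a search for an absent key with the same hash value stops at the first empty slot.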


Re-hash functions

• Linear probing
  • h(k, i) = (h’(k) + i) mod m
• Quadratic probing
  • h(k, i) = (h’(k) + c1·i + c2·i^2) mod m
• Double hashing
  • h(k, i) = (h1(k) + i·h2(k)) mod m
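To compare the three schemes, the sketch below prints the first few probe positions for one key. The auxiliary functions (h'(k) = h1(k) = k mod m, h2(k) = 1 + k mod (m-1)) and the constants c1 = 1, c2 = 3 are illustrative assumptions, not values from the slides.

```python
def linear_probe(k, i, m):
    return (k % m + i) % m


def quadratic_probe(k, i, m, c1=1, c2=3):
    # c1 and c2 are illustrative constants.
    return (k % m + c1 * i + c2 * i * i) % m


def double_hash_probe(k, i, m):
    h1 = k % m
    h2 = 1 + (k % (m - 1))   # never zero, so successive probes always move
    return (h1 + i * h2) % m


m, k = 13, 14
for name, fn in [("linear", linear_probe),
                 ("quadratic", quadratic_probe),
                 ("double", double_hash_probe)]:
    print(name, [fn(k, i, m) for i in range(5)])
```

With a prime m, the double-hashing sequence visits every slot exactly once over i = 0, ..., m-1, since the step h2(k) is never a multiple of m.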

Ex - insert

• m = 10
• Open addressing with linear probing
• Positions are numbered from 1 to 10

Hash values sequence:
• h(A)=5, h(B)=4, h(C)=9, h(D)=4, h(E)=8, h(F)=8, h(G)=10


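Ex - insert (II)

Table contents after each insertion (positions 1 to 10; a dot marks an empty cell):

Insert  h    1  2  3  4  5  6  7  8  9  10
A       5    .  .  .  .  A  .  .  .  .  .
B       4    .  .  .  B  A  .  .  .  .  .
C       9    .  .  .  B  A  .  .  .  C  .
D       4    .  .  .  B  A  D  .  .  C  .
E       8    .  .  .  B  A  D  .  E  C  .
F       8    .  .  .  B  A  D  .  E  C  F
G       10   G  .  .  B  A  D  .  E  C  F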

Ex - search (III)

Search:
• D (h(D)=4)
  • Read 4
  • Read 5
  • Read 6 ⇒ found
• G (h(G)=10)
  • Read 10
  • Read 1 ⇒ found
• M (h(M)=4)
  • Read 4
  • Read 5
  • Read 6
  • Read 7 ⇒ not found
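The insert and search walkthrough can be replayed with a short Python sketch. It assumes, as the example suggests, that positions run from 1 to 10 and that probing wraps from position 10 back to 1; the function names and the dict-based table are choices made here for illustration.

```python
M = 10
# Hash values given in the example; positions are numbered 1..M.
HASHES = {"A": 5, "B": 4, "C": 9, "D": 4, "E": 8, "F": 8, "G": 10, "M": 4}


def positions(key):
    """Yield the linear-probing sequence for `key`, wrapping from M back to 1."""
    start = HASHES[key]
    for i in range(M):
        yield (start - 1 + i) % M + 1


def insert(table, key):
    for pos in positions(key):
        if pos not in table:
            table[pos] = key
            return pos
    raise OverflowError("hash table overflow")


def search(table, key):
    """Return (found, list of positions read)."""
    read = []
    for pos in positions(key):
        read.append(pos)
        if pos not in table:        # empty cell: the key cannot be further on
            return False, read
        if table[pos] == key:
            return True, read
    return False, read


table = {}   # maps position -> key; a missing position is an empty cell
for key in "ABCDEFG":
    insert(table, key)

for key in "DGM":
    print(key, search(table, key))
```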


Delete

Deletion is very complex, because it changes the rehash/collision sequence. In practice open addressing is used only if no deletions are needed.

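Complexity

With uniform hashing and linear probing:
• The expected number of probes for an unsuccessful search is at most 1/(1–α); the complexity is the same for insert
• The expected number of probes for a successful search is at most

$$\frac{1}{\alpha} \ln \frac{1}{1-\alpha} + \frac{1}{\alpha}$$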
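To get a feel for how quickly the cost grows with the load factor, the bounds can be tabulated in Python. This sketch takes the insert bound to be 1/(1–α) and the search bound to be (1/α) ln(1/(1–α)) + 1/α; the function names are chosen here for illustration.

```python
import math


def probes_insert(alpha):
    """Bound 1/(1 - alpha) on the expected number of probes for an insert."""
    return 1.0 / (1.0 - alpha)


def probes_search(alpha):
    """Bound (1/alpha) * ln(1/(1 - alpha)) + 1/alpha for a search."""
    return math.log(1.0 / (1.0 - alpha)) / alpha + 1.0 / alpha


for alpha in (0.25, 0.5, 0.75, 0.9):
    print(f"alpha={alpha:.2f}  insert <= {probes_insert(alpha):6.2f}  "
          f"search <= {probes_search(alpha):6.2f}")
```

Both bounds depend only on α, not on n, and the insert bound grows without limit as α approaches 1.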


Hash functions

Uniform hashing

The best hash functions perform uniform hashing: if all keys have the same probability, all values of h(k) should also be equally probable. With P(k) the probability of key k:

$$\sum_{k:\,h(k)=j} P(k) = \frac{1}{m}, \qquad j = 0, 1, \dots, m-1$$


Keys are not uniform

However, keys are often not equally distributed (e.g. words in a language, names and surnames).
• use all characters
• amplify the differences

Keys as numbers

Usually keys are strings of characters. The easiest thing is to treat them as integers.
• Ex: “abc” becomes ‘a’*256^2 + ‘b’*256 + ‘c’

However, with very long strings this is impractical, and variants have to be used. In the following, the key is an integer.
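The string-to-integer conversion amounts to reading the characters as digits of a base-256 number. A small Python sketch, with a function name chosen here for illustration and assuming ASCII input:

```python
def string_to_int(s):
    """Interpret the bytes of an ASCII string as digits of a base-256 number."""
    value = 0
    for byte in s.encode("ascii"):
        value = value * 256 + byte
    return value


key = string_to_int("abc")

# The same value written out term by term, as in the example.
expanded = ord("a") * 256**2 + ord("b") * 256 + ord("c")
print(key, expanded)
```

The integer grows by 8 bits per character, which is why long strings quickly exceed a machine word and need a variant of this scheme.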


Hash function = mod m

• k is an integer:
  • h(k) = k mod m
• Requires m ≥ n/α
  • m: size of the table, n: number of elements

Choice of m

• Avoid
  • Powers of 2
    • k mod 2^p keeps only the p lowest bits of k, so the high bits are lost
  • Powers of 10
    • Same as above, if k is a decimal number
• Use
  • A prime number
  • Far from powers of 2


Ex

• n = 2000
• On average 3 comparisons per search
• m = 701 is a prime, close to 2000/3 but far from powers of 2
• h(k) = k mod 701
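The effect of the choice of m can be seen on a handful of keys that differ only in their high bits. The keys below are made up for illustration, and the function name is not from the slides.

```python
def division_hash(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


# Keys that differ only in their high bits (all share the low byte 0x2A).
keys = [0x012A, 0x022A, 0x032A, 0x042A]

# m = 256 is a power of 2: only the low 8 bits survive, so every key collides.
slots_pow2 = [division_hash(k, 256) for k in keys]

# m = 701 is prime and far from a power of 2: the high bits influence the result.
slots_prime = [division_hash(k, 701) for k in keys]

print(slots_pow2, slots_prime)
```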

Hash function = multiply

• k is an integer:
  • A is a constant, 0 < A < 1
  • frac(x) = x − ⌊x⌋
  • h(k) = ⌊ m · frac(k · A) ⌋
• k · A “shuffles” the bits of k
• Multiplying by m expands [0, 1) into [0, m)


Choice of m and A

• m is not critical. Using a power of 2 simplifies the multiplication
• The best A depends on how the keys are statistically distributed
• A = (√5 – 1) / 2 = 0.6180339887... is a good choice
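A floating-point sketch of the multiplication method with this value of A is shown below. The function name, the sample key and the table size are chosen here for illustration; the sketch is accurate only while k · A fits comfortably in a double, and very large keys would call for integer arithmetic instead.

```python
import math

A = (math.sqrt(5) - 1) / 2        # about 0.6180339887


def multiplication_hash(k, m, a=A):
    """Multiplication method: h(k) = floor(m * frac(k * a))."""
    frac = (k * a) % 1.0          # fractional part of k * a
    return math.floor(m * frac)


m = 2 ** 14                        # a power of 2 is fine for this method
slot = multiplication_hash(123456, m)
print(slot)
```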