dictionaries (maps) - softeng.polito.it · advanced programming dictionaries, hash tables 5 9...
TRANSCRIPT
Advanced Programming Dictionaries, Hash Tables
1
Dictionaries (Maps)
Hash tables
2
ADT Dictionary or Map
Has following operations: n INSERT: inserts a new element, associated
to unique value of a field (key) n SEARCH: searches an element with a certain
value of the key. If it esists, it returns it n DELETE: cancels element with given key, if
exists
Advanced Programming Dictionaries, Hash Tables
2
3
Uses of dictionaries
n Symbol table in a compiler n Key: nameof identifier n Values: types, context
n Citizens in a country n Key: social security number n Values: name, surname, age, address
4
Associative array
A dictionary would be easily implemented with an associative array (index of value = key instead of position) Ex: n Citizens = {{“jr50”, “john”, “red”},
{“bg40”, “bill”, “green”}, } n Citizens[“jr50”] = {“jr50”, “john”, “red”}
Advanced Programming Dictionaries, Hash Tables
3
5
Goal
Complexity of insert/search/delete: n O(1) average case n Θ(n) worst case
6
Hash tables
Implementation of associative arrays An array containing elements. Address of element is computed by hash function, in time O(1). Ex: n Hash(“jr50”) = 117: element john red is in
position 117 of vector
Advanced Programming Dictionaries, Hash Tables
4
7
Associative array U (all keys)
K (used keys)
• 7 • 4 • 9
• 5 • 3 • 8
• 6
• 0
• 1
• 2
0 1 2 3 4 5 6 7 8 9
2 3
5
8
key
value
T
8
Dictionary implemented w associative array
n T: associative array, key: key, x: value n Search(T, key)
n Return T[key]
n Insert(T, x) n T[key[x]] ← x
n Delete(T, x) n T[key[x]] ← NIL
n Complexity O(1), memory O(|U|)
O(|U|) number of different values of key
Advanced Programming Dictionaries, Hash Tables
5
9
Assumptions
Two assumptions are needed: n No two elements with same key (keys are unique) n Size of T == size of max number of possible
values of key, |U|. n This is critical, if |U| is large, array unfeasible n Ex: key = SSN, 10chars, |U| = 24 10 ≈ 10 13
n Assuming 24 values alphabet n But, the citizens of a country are in the order 10 7 - 10 9
n It is essential that size of array be O(|K|) and not O(|U|)
10
Hash tables
n A kind of associative array with size O(|K|) and not O(|U|)
n Insert/search/delete are O(1) on average n However, the way of computing index
given key must be different: hash function
Advanced Programming Dictionaries, Hash Tables
6
11
Hash function
n Hash table is array with size m (m<<|U|) n Hash function h, from key to position in
array (index) n h: U → { 0, 1, ..., m-1 }
n Element x is stored in n T[h(key[x])]
12
Hash function
• k1
0 1 2 3 4 5 6 7 8
m-1
T
U
• k3
• k2
• k4
• k5
h(k1) h(k4)
h(k2)=h(k5)
h(k3)
Advanced Programming Dictionaries, Hash Tables
7
13
Collision
n Collision n when h(ki)=h(kj) and ki ≠ kj,
n Essential to: n Minimize number of collisions
n Depend on hash function
n Manage collisions
14
Example
Key is a string of characters Hash function
h(k) = Σ(ci) mod m with n ci ASCII code of i-th char of string k n m number of elements (size) of array T
Advanced Programming Dictionaries, Hash Tables
8
15
Ex (II) m = 15. n h(“pippo”) = (112+105+112+112+111)mod 15= 552 mod
15 = 12 n h(“pluto”) = (112+108+117+116+111)mod 15= 564 mod
15 = 9 n h(“paperino”) =
(112+97+112+101+114+105+110+111)mod 15= 862 mod 15 = 7
n h(“topolino”) = (116+111+112+111+108+105+110+111)mod 15= 884 mod 15 = 14
n h(“paperoga”) = (112+97+112+101+114+111+103+97)mod 15= 847 mod 15 = 7
Collision with strings “paperino” and “paperoga”
Ex (II)
m = 15. n h("Mickey”) = (77 + 105 + 99 + 107 + 101 + 121) mod 15 = 10 n h("Minnie") = (77 + 105 + 110 + 110 + 105 + 101) mod 15 = 8 n h("Donald") = (68 + 111 + 110 + 97 + 108 + 100) mod 15 = 9 n h("Daisy") = (68 + 97 + 105 + 115 + 121) mod 15 = 11 n h("foo") = (102 + 111 + 111) mod 15 = 9 n h("bar") = (98 + 97 + 114) mod 15 = 9
16
Collision with strings “foo” and “bar”
Advanced Programming Dictionaries, Hash Tables
9
17
Collisions mitigation
The best hash functions are capable of distributing as uniformly (randomly) as possible the |K| elements among the m positions available Typical strategies:
pick m as a prime number manipulate bits of k
18
Collision management
n Chaining n Open Addressing
Advanced Programming Dictionaries, Hash Tables
10
19
Chaining (I)
Position i can contain more than one element This can be implemented through a linked list
20
Chaining (II)
• k1
0 1 2 3 4 5 6 7 8
m-1
T
U
• k3
• k2
• k4
• k5
k1 k4
k3
k2 k5
• k6
k6
Advanced Programming Dictionaries, Hash Tables
11
21
Chaining (III)
n T[i] is a pointer to a list, initially NIL. n CHAINED-HASH-INSERT(T,x)
n insert x at head of list T[h(key[x])]
n CHAINED-HASH-SEARCH(T,k) n Search element with key k in list T[h(k)]
n CHAINED-HASH-DELETE(T,x) n Cancel x from list T[h(key[x])]
22
Chaining - Complexity
n Assumption: unorderd list, single chaining n Insert: O(1) n Search: O(length of lists) n Cancel: O(length of lists)
n Requires a search
Advanced Programming Dictionaries, Hash Tables
12
23
Search (hash + chaining) - complexity
n We have n n : number of elements in hash table T n m : size of hash table T n α=n/m: load factor for hash table T
n Normally α>1 n What if m,n→∞ (with same α) ?
24
Search (hash + chaining) – complexity (II)
n Search n Worst case: a linked list, not ordered
n Time to compute h(k) + n Time to transverse the list, Θ(n)
n Best case: depends on how uniformly h(k) distributes the elements
n Let’s assume h(k) is capable of simple uniform hashing (distributes in perfect uniform way) (this requires that the table grows with the elements, so that α remains constant)
Advanced Programming Dictionaries, Hash Tables
13
25
Search (hash + chaining) – complexity (II)
Search Time to compute h(k) = O(1). Time to trasverse the list, depends on length of list T[h(k)] depends on element found/not found In both cases complexity is Θ(1+α).
summing up O(1) + Θ(1+α) = O(1)
26
Open Addressing
T[i] can contain only one element In case of collision another free cell is searched for
next one, after next, etc Must be α<1.
Advanced Programming Dictionaries, Hash Tables
14
27
Hash-Insert
HASH-INSERT(T, k) 1 i ← 0 2 repeat j ← h(k, i) 3 if T[j] = NIL
4 then T[j] ← k 5 return 6 else i ← i + 1 7 until i = m 8 error “hash table overflow”
28
Hash-Search
HASH-SEARCH(T, k) 1 i ← 0 2 repeat j ← h(k, i) 3 if T[j] = k 4 then return j 5 i ← i + 1 6 until T[j] = NIL or i = m 7 return NIL
Advanced Programming Dictionaries, Hash Tables
15
29
Re-hash functions
n Linear probing n h(k, i) = (h’(k)+i) mod m
n Quadratic probing n h(k, i) = (h’(k)+ c1i + c2i2) mod m
n Double hashing n h(k, i) = (h1(k)+ i h2(k) ) mod m
30
Ex - insert
n m = 10 n open addressing with linear probing.
Hash values sequence: n h(A)=5, h(B)=4, h(C)=9, h(D)=4, h(E)=8,
h(F)=8, h(G)=10
Advanced Programming Dictionaries, Hash Tables
16
31
Ex - insert (II) A
B A
B A C
B A D C
B A D E C
B A D E C F
G B A D E C F
5
4
9
4
8
8
10
32
Ex - search (III)
search: n D: (h(D)=4)
n Read 4 n Read 5 n Read 6 ⇒ found
n G: (h(G)=10) n Read 10 n Read 1 ⇒ found
n M: (h(M)=4) n Read 4, n Read 5, n Read 6, n Read 7, ⇒ not found
Advanced Programming Dictionaries, Hash Tables
17
33
Delete
Very complex, because changes the rehash/collision sequence In practice open hashing is used only if no delete
34
Complexity
With uniform hashing and linear probing: n The number of probing trials is 1/(1–α),
and complexity is the same as for insert n Complexity of search is
ααα1
11ln1 +−
Advanced Programming Dictionaries, Hash Tables
18
35
Hash functions
36
Uniform hashing
Best hash functions do a uniform hashing: if keys have the same probability, also h(k) should have equal probability ∑
=
−==jkhk
mjm
kP)(:
1,,1,0,1)( …
Advanced Programming Dictionaries, Hash Tables
19
37
Keys are not uniform
However, keys often are not equally distributed (ex words in a language, ex names and surnames)
use all characters amplify the differences
38
Keys as numbers Usually keys are strings of characters Easiest thing is to treat them as integers
n Ex: “abc” becomes ‘a’*2562 + ‘b’*256 + ‘c’
However, with very long strings this is impractical, variants have to be used In the following the key is an integer
Advanced Programming Dictionaries, Hash Tables
20
39
Hash function = mod m
n k is an integer : n h(k) = k mod m
n Requires m≥n/α. n m size, n number of elements
40
Choice of m
n Avoid n Powers of 2
n Division by m looses high bits of k
n Powers of 10 n Same as above, if k is decimal number
n Use n A prime number n Far from powers of 2
Advanced Programming Dictionaries, Hash Tables
21
41
Ex
n n = 2000 n On average 3 comparisons in searches n m = 701 is a prime, close to 2000/3 but far
from powers of 2 n h(k) = k mod 701
42
Hash function = multiply
n K integer: n A constant 0<A<1 n Frac(x) = x - ⎣x⎦ n h(k) = ⎣ m ⋅ frac(k ⋅ A) ⎦
n k⋅A “shuffles” bits of k, n Multiplying by m expands [0,1] in [0,m]
Advanced Programming Dictionaries, Hash Tables
22
43
Choice of m and A
n M is not critical. Using a power of 2 simplifies the multiplication
n Best A depends on how keys are statistically distributed
n A = (√5 – 1) / 2 = 0.6180339887... Is a good choice