dbms 2001notes 4.2: hashing1 principles of database management systems 4.2: hashing techniques pekka...
TRANSCRIPT
DBMS 2001 Notes 4.2: Hashing 1
Principles of Database Management Systems
4.2: Hashing Techniques
Pekka Kilpeläinen(after Stanford CS245 slide originals by Hector Garcia-Molina, Jeff Ullman and
Jennifer Widom)
DBMS 2001 Notes 4.2: Hashing 2
Hashing?
• Locating the storage block of a record by the hash value h(k) of its key k
• Normally really fast– records (often) located by a single
disk access
DBMS 2001 Notes 4.2: Hashing 3
key h(key)
Hashing
<key>
.
.
Buckets(typically 1disk block)
DBMS 2001 Notes 4.2: Hashing 4
.
.
.
Two alternatives
records
.
.
.
key h(key)
(1) Hash value determines the storage block directly
• to implement a primary index
DBMS 2001 Notes 4.2: Hashing 5
key h(key)
Index
recordkey 1
Two alternatives
• for a secondary index
(2) Records located indirectly via index buckets
DBMS 2001 Notes 4.2: Hashing 6
Example hash function
• Key = ‘x1 x2 … xn’ n byte character string
• Have b buckets• h = (x1 + x2 + … + xn) mod b
{0, 1, …, b-1}
DBMS 2001 Notes 4.2: Hashing 7
This may not be best function …
Good hash Expected number of function: keys/bucket is the
same for all buckets
Read Knuth Vol. 3 if you reallyneed to select a good function.
DBMS 2001 Notes 4.2: Hashing 8
Next: example to illustrateinserts, overflows,
deletes
h(K)
DBMS 2001 Notes 4.2: Hashing 9
EXAMPLE 2 records/bucket
INSERT:h(a) = 1h(b) = 2h(c) = 1h(d) = 0
0
1
2
3
d
ac
b
h(e) = 1
e
DBMS 2001 Notes 4.2: Hashing 10
0
1
2
3
a
bc
e
d
EXAMPLE: deletion
Delete:ef
fg
maybe move“g” up
cd
DBMS 2001 Notes 4.2: Hashing 11
Rule of thumb:• Try to keep space utilization
between 50% and 80% Utilization = # keys used
total # keys that fit
• If < 50%, wasting space• If > 80%, overflows significant
depends on how good hashfunction is & on # keys/bucket
DBMS 2001 Notes 4.2: Hashing 12
How do we cope with growth?
• Overflows and reorganizations• Dynamic hashing: # of buckets
may vary• Extensible• Linear• also others ...
DBMS 2001 Notes 4.2: Hashing 13
Extensible hashing: two ideas
(a) Use i of b bits output by hash function
b h(K)
use i grows over time….
00110101
For example, b=32
DBMS 2001 Notes 4.2: Hashing 14
(b) Use directory
h(K)[i ] to bucket
.
.
.
.
Directory contains 2i pointers to buckets, and stores i.
Each bucket stores j, indicating #bits used for placing the records in this block (j i)
DBMS 2001 Notes 4.2: Hashing 15
Extensible Hashing: Insertion
• If there's room in bucket h(k)[i], place record there; Otherwise …
• If j=i, set i=i+1 and double the directory • If j<i, split the block in two, distribute
records among them now using j+1 bits of h(k); (Repeat until some records end up in the new bucket); Update pointers of bucket array
• See the next example
DBMS 2001 Notes 4.2: Hashing 16
Example: h(k) is 4 bits; 2 keys/block
i = 1
1
1
0001
1001
1100
Insert 1010
11100
1010
New directory
200
01
10
11
i =
2
2
(j)
DBMS 2001 Notes 4.2: Hashing 17
10001
21001
1010
21100
Insert:
0111
0000
00
01
10
11
2i =
Example continued
0111
0000
0111
0001
2
2
DBMS 2001 Notes 4.2: Hashing 18
00
01
10
11
2i =
21001
1010
21100
20111
20000
0001
Insert:
1001
Example continued
1001
1001
1010
000
001
010
011
100
101
110
111
3i =
3
3
DBMS 2001 Notes 4.2: Hashing 19
Extensible hashing: deletion
• Reverse insert procedure …
• Example: Walk thru insert example in reverse!
DBMS 2001 Notes 4.2: Hashing 20
Extensible hashing
Can handle growing files- without full reorganizations
Summary
+
Indirection(Not bad if directory in memory)
Directory doubles in size(First it fits in memory, then it does
not sudden performance degradation)
-
-
Only one data block examined+
DBMS 2001 Notes 4.2: Hashing 21
Linear hashing: grow # of buckets by one
No bucket directory needed
Two ideas:(a) Use i low order bits of hash
01110101grows
b
i(b) File grows linearly
DBMS 2001 Notes 4.2: Hashing 22
Linear Hashing: Parameters
• n: number of buckets in use– buckets numbered 0…n-1
• i: number of bits of h(k) used to address buckets
• r: number of records in hash table– ratio r/n limited to fit an avg bucket in a block– next example: r 1.7n, and block holds 2 records
=> AVG bucket occupancy is 1.7/2 = 0.85 of a block
)log(ni
DBMS 2001 Notes 4.2: Hashing 23
Example: 2 keys/block, b=4 bits, n=2, i =1
00 01
1111
0000
1010
If h(k)[i ] = (a1 … ai)2 < n, then
look at bucket h(k)[i ]; else
look at bucket h(k)[i ] - 2i -1 = (0a2 … ai)2
Rule
• insert 0101
0101
• now r=4 >1.7n get new bucket
10and distribute keys btw
buckets 00 and 10
DBMS 2001 Notes 4.2: Hashing 24
n=3, i =2; distribute keys btw buckets 00 and 10:
00 01 10
0101
1111
0000
1010
1010
DBMS 2001 Notes 4.2: Hashing 25
n=3, i =2; insert 0001:
00 01 10
0101
1111
0000
0001
1010
• can have overflow chains!
DBMS 2001 Notes 4.2: Hashing 26
n=3, i =2
00 01 10
0101
1111
0000 1010
0001 • insert 0111
• bucket 11 not in use redirect to 01
0111
• now r=6 > 1.7n-> get new bucket 11
DBMS 2001 Notes 4.2: Hashing 27
n=4, i =2; distribute keys btw 01 and 11
00 01 10 11
0101
1111
0000 1010
00010111
1111
01110001
DBMS 2001 Notes 4.2: Hashing 28
Example Continued: How to grow beyond this?
00 01 10 11
111110100101
0101
0000
m = 11 (max used block)
i = 2
0 0 0 0 101 110 111
3
. . .
100
100
101
101
0101
0101
DBMS 2001 Notes 4.2: Hashing 29
Linear Hashing
Can handle growing files- without full reorganizations
No indirection directory of extensible hashingCan have overflow chains
- but probability of long chains can be kept low by controlling the r/n fill ratio (?)
Summary
+
+
-
DBMS 2001 Notes 4.2: Hashing 30
Hashing- How it works- Dynamic hashing
- Extensible- Linear
Summary
DBMS 2001 Notes 4.2: Hashing 31
Next:
• Indexing vs Hashing• Index definition in SQL
DBMS 2001 Notes 4.2: Hashing 32
• Hashing good for probes given keye.g., SELECT …
FROM RWHERE R.A = 5
Indexing vs Hashing
DBMS 2001 Notes 4.2: Hashing 33
• INDEXING (Including B-Trees) good for
Range Searches:e.g., SELECT
FROM RWHERE R.A > 5
Indexing vs Hashing
DBMS 2001 Notes 4.2: Hashing 34
Index definition in SQL
• Create index name on rel (attr)• Create unique index name on rel
(attr)defines candidate key
• Drop INDEX name
DBMS 2001 Notes 4.2: Hashing 35
CANNOT SPECIFY TYPE OF INDEX(e.g. B-tree, Hashing, …)
OR PARAMETERS(e.g. Load Factor, Size of Hash,...)
... at least in SQL …Oracle and IBM DB2 UDB provide a
PCTFREE clause to inditate the proportion of B-tree blocks initially left unfilled
Oracle: “Hash clusters” with built-in or DBA-specified hash function
Note
DBMS 2001 Notes 4.2: Hashing 36
The BIG picture….
• Chapters 2 & 3: Storage, records, blocks...
• Chapter 4: Access Mechanisms- Indexes
- B trees- Hashing
• Chapters 6 & 7: Query ProcessingNEXT