hashing best study
TRANSCRIPT
-
7/31/2019 Hashing Best Study
1/40
File Structures SNU-OOPSLA Lab. 1
Chap12. Extendible Hashing
SNU-OOPSLA-LAB
File Structures by Folk, Zoellick and Riccardi
-
Chapter Objectives
Describe the problem solved by extendible hashing and related approaches
Explain how extendible hashing works; show how it combines tries with conventional, static hashing
Use the buffer, file, and index classes of previous chapters to implement extendible hashing, including deletion
Review studies of extendible hashing performance
Examine alternative approaches to the same problem, including dynamic hashing, linear hashing, and hashing schemes that control splitting by allowing for overflow buckets
-
Contents
12.1 Introduction
12.2 How extendible hashing works
12.3 Implementation
12.4 Deletion
12.5 Extendible hashing performance
12.6 Alternative approaches
-
12.1 Introduction
Dynamic files undergo a lot of growth
Static hashing
described in chapter 11 (direct hashing)
typically worse than a B-tree for dynamic files
eventually requires file reorganization
Extendible hashing
hashing for dynamic files
Fagin, Nievergelt, Pippenger, and Strong (ACM TODS 1979)
-
Overview(1)
Direct access (hashing) files have a static size, so they are not
suitable for files whose size is unknown in advance
A dynamic file structure is desired that retains the feature
of fast retrieval by primary key, and that also expands
and contracts as the number of records in the file
fluctuates (without reorganizing the whole file)
Similar motivation!
Indexed-sequential File ==> B tree
Hashing ==> Extendible Hashing
-
Overview(2)
Extendible Hashing
Primary key --(hashing function)--> H(key)
H(key) --(extract first d bits)--> directory index
Directory index --(table look-up)--> file pointer
-
12.2 How Extendible Hashing works
The idea comes from tries (radix searching)
The branching factor of the tree is equal to the # of alternative
symbols in each position of the key
e.g.) Radix 26 trie - able, abrahms, adams, anderson, andrews, baird
Use the first n characters for branching
Figure: radix 26 trie for the six keys
(root)
  a
    b
      l -> able
      r -> abrahms
    d -> adams
    n
      d
        e -> anderson
        r -> andrews
  b -> baird
-
Extendible Hashing
H maps keys to a fixed address space whose size is the largest prime less than a power of 2 (65521 < 2^16)
File pointers point to blocks of records known as buckets; an entire bucket is read by one physical data transfer, and buckets may be added to or removed from the file dynamically
The first d bits of H(key) are used as an index into a directory array containing 2^d entries, which usually resides in primary memory
The value d, the directory size (2^d), and the number of buckets change automatically as the file expands and contracts
-
Extendible Hashing Example
Directory with d=3 and 4 buckets
Directory entries     Bucket   Bucket depth   Keys in bucket
000, 001, 010, 011    B0       d=1            H(key)=0..
100                   B100     d=3            H(key)=100..
101                   B101     d=3            H(key)=101..
110, 111              B11      d=2            H(key)=11..
-
Turning the trie into a directory
Using Trie for extendible hashing
(1) Use Radix 2 Trie :
Keys in A : beginning with 0
Keys in B : beginning with 10
Keys in C : beginning with 11
(2) Retrieve from secondary storage the buckets containing keys, instead of individual keys
Figure: radix 2 trie addressing buckets A, B, and C
(root)
  0 -> A
  1
    0 -> B
    1 -> C
-
Representation of Trie (1)
A tree is not preferable (the directory is not big); use a flattened array
1. Make a complete (full) binary tree
2. Collapse it into the directory structure
Figure: the complete binary tree collapsed into a directory
00 -> A
01 -> A
10 -> B
11 -> C
-
Representation of Trie(2)
Directory is a complete binary tree
Directory entry : a pointer to the associated bucket
Given an address beginning with the bits 10, directory entry 10 (binary, i.e., index 2)
gives the pointer to the associated bucket
Introduced for uniform distribution
-
Retrieve a record
Steps in retrieving a record with a given key
find H(given key)
extract first d bits of H(given key)
use this value as an index into the directory to find a pointer
use this pointer to read a bucket into primary memory
locate the desired record within the bucket (scan)
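The retrieval steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Record`, `Bucket`, and `Directory` structs are hypothetical in-memory stand-ins (not the textbook classes), the hash value is 16 bits wide, and the caller supplies H(key) so that no particular hash function is implied.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory stand-ins for the directory and the bucket file.
struct Record { std::string key; int recAddr; };
struct Bucket { std::vector<Record> records; };

struct Directory {
    int depth;                    // d: number of hash bits used as the index
    std::vector<int> cells;       // 2^d cells, each holding a bucket number
    std::vector<Bucket> buckets;  // stands in for bucket blocks on disk
};

// Extract the first (most significant) d bits of a 16-bit hash value.
int FirstBits(std::uint16_t hash, int d) {
    assert(d >= 0 && d <= 16);
    return hash >> (16 - d);
}

// Returns the record address for key, or -1 if the key is not in the file.
// The caller supplies H(key), so no particular hash function is assumed.
int Search(const Directory& dir, const std::string& key, std::uint16_t hash) {
    int index = FirstBits(hash, dir.depth);                // first d bits of H(key)
    const Bucket& bucket = dir.buckets[dir.cells[index]];  // directory look-up, then bucket read
    for (const Record& r : bucket.records)                 // scan the bucket
        if (r.key == key) return r.recAddr;
    return -1;
}
```

With the d=3 example directory (cells 000 to 011 sharing one bucket, 100 and 101 with their own buckets, 110 and 111 sharing one), a key whose hash begins with 100 is looked up in the second bucket only.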
-
Expansion & Contraction(1)
A pair of adjacent buckets with the same value of d that share a common value of the first d-1 bits of H(key) can be combined if their average load is below 50%, so that all records fit into one bucket
File contraction is the reverse of expansion; the directory can be compacted and d decremented whenever every pair of pointers has the same value
-
Expansion & Contraction(2)
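Bucket B0 overflows, then splits into B00 and B01 (the directory depth stays at d=3)
Directory entries     Bucket   Bucket depth   Keys in bucket
000, 001              B00      d=2            H(key)=00..
010, 011              B01      d=2            H(key)=01..
100                   B100     d=3            H(key)=100..
101                   B101     d=3            H(key)=101..
110, 111              B11      d=2            H(key)=11..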
-
Expansion & Contraction(3)
Bucket B100 overflows, d increases to 4
Directory entries          Bucket   Bucket depth   Keys in bucket
0000, 0001, 0010, 0011     B00      d=2            H(key)=00..
0100, 0101, 0110, 0111     B01      d=2            H(key)=01..
1000                       B1000    d=4            H(key)=1000..
1001                       B1001    d=4            H(key)=1001..
1010, 1011                 B101     d=3            H(key)=101..
1100, 1101, 1110, 1111     B11      d=2            H(key)=11..
-
Splitting to Handle Overflow (1)
When overflow occurs
e.g. 1) Overflow of bucket A
Split A into A and D
Use an additional, previously unused bit
No need to expand the directory
Figure: directory before and after the split
Before: 00 -> A, 01 -> A, 10 -> B, 11 -> C
After: 00 -> A, 01 -> D, 10 -> B, 11 -> C
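When a bucket is shallower than the directory, a split only has to repoint some directory cells to the new bucket. The following sketch computes that range of cells. It is modeled on the NewRange member listed for class Bucket, but the free-function signature and the `Range` struct are assumptions made for illustration.

```cpp
#include <cassert>

struct Range { int start; int end; };  // inclusive range of directory cells

// sharedAddr: the bucketDepth-bit address prefix shared by all keys in the
// overflowing bucket. Returns the directory cells that should point to the
// new bucket, which takes over the addresses whose next bit is 1.
Range NewRange(int sharedAddr, int bucketDepth, int dirDepth) {
    assert(bucketDepth < dirDepth);      // otherwise the directory must double first
    int bitsToFill = dirDepth - (bucketDepth + 1);
    int start = (sharedAddr << 1) | 1;   // extend the shared prefix with a 1 bit
    int end = start;
    for (int j = 0; j < bitsToFill; ++j) {
        start = start << 1;              // pad with 0s for the lowest cell
        end = (end << 1) | 1;            // pad with 1s for the highest cell
    }
    return {start, end};
}
```

For bucket A above (shared prefix 0, bucket depth 1, directory depth 2) the result is the single cell 01, which is the cell that now points to D.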
-
Splitting to Handle Overflow(2)
e.g. 2) Overflow of bucket B
No additional unused bits are available (need to expand the directory)
1. Divide B using 3 bits of hash address
2. Make a complete (full) binary tree
3. Collapse it into the directory structure
Figure: directory before the split
00 -> A, 01 -> A, 10 -> B, 11 -> C
-
Figure (slide 19): result of the overflow of bucket B
1. Result of overflow of bucket B: trie with 0 -> A, 100 -> B, 101 -> D, 11 -> C
2. Complete binary tree: every leaf extended to depth 3
3. Directory:
000 -> A
001 -> A
010 -> A
011 -> A
100 -> B
101 -> D
110 -> C
111 -> C
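Expanding the directory amounts to copying each cell into two adjacent cells of a directory twice the size, after which the overflowing bucket can be split as in the non-expanding case. A minimal sketch, assuming the directory is held as a vector of bucket numbers (a simplification, not the textbook's Directory class):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Double a directory of bucket references: old cell i is copied to new
// cells 2i and 2i+1, so every address gains one extra (still unused) bit.
std::vector<int> DoubleSize(const std::vector<int>& cells) {
    std::vector<int> doubled(cells.size() * 2);
    for (std::size_t i = 0; i < cells.size(); ++i) {
        doubled[2 * i] = cells[i];
        doubled[2 * i + 1] = cells[i];
    }
    return doubled;
}
```

Numbering the buckets A=0, B=1, C=2, the directory {A, A, B, C} doubles to {A, A, A, A, B, B, C, C}; repointing cell 101 to the new bucket D then gives the directory in the example.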
-
Creating Address
Function hash(KEY)
Fold/Add hashing algorithm
Do not MOD the hash value by the address space size, since no fixed address space exists
Output from the hash function for a number of keys
bill 0000 0011 0110 1100
lee 0000 0100 0010 1000
pauline 0000 1111 0110 0101
alan 0100 1100 1010 0010
julie 0010 1110 0000 1001
mike 0000 0111 0100 1101
elizabeth 0010 1100 0110 1010
mark 0000 1010 0000 0111
-
#include <string.h> // strlen

int Hash (char * key)
{
    int sum = 0;
    int len = strlen(key);
    if (len % 2 == 1) len ++; // make len even; the extra character is the terminating '\0'
    for (int j = 0; j < len; j += 2)
        sum = (sum + 100 * key[j] + key[j+1]) % 19937;
    return sum;
}
Figure 12.7 Function Hash (key) returns a 15-bit integer hash value for key
-
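int MakeAddress (char * key, int depth)
{
    int retval = 0;
    int mask = 1;
    int hashVal = Hash(key);
    // reverse the bits
    for (int j = 0; j < depth; j++)
    {
        retval = retval << 1;        // make room for the next bit
        int lowbit = hashVal & mask; // take the low-order bit of the hash value
        retval = retval | lowbit;    // append it to the address
        hashVal = hashVal >> 1;      // move on to the next hash bit
    }
    return retval;
}
Figure 12.9 Function MakeAddress(key,depth)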
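As a quick check of the two functions, the sketch below recomputes entries of the hash output table and one address. `Hash` and `MakeAddress` follow the figures (with `const char *` parameters for convenience); `ToBits16` is a hypothetical helper added here only to format a value as 16 bits in groups of four.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Fold-and-add hash, as in Figure 12.7.
int Hash(const char* key) {
    int sum = 0;
    int len = static_cast<int>(std::strlen(key));
    if (len % 2 == 1) len++;  // odd length: the last pair uses the trailing '\0'
    for (int j = 0; j < len; j += 2)
        sum = (sum + 100 * key[j] + key[j + 1]) % 19937;
    return sum;
}

// Address of `depth` bits built from the low-order hash bits in reverse order.
int MakeAddress(const char* key, int depth) {
    int retval = 0;
    int hashVal = Hash(key);
    for (int j = 0; j < depth; j++) {
        retval = (retval << 1) | (hashVal & 1);
        hashVal = hashVal >> 1;
    }
    return retval;
}

// Hypothetical helper: format the low 16 bits of value as "0000 0000 0000 0000".
std::string ToBits16(int value) {
    std::string bits;
    for (int i = 15; i >= 0; --i) {
        bits += ((value >> i) & 1) ? '1' : '0';
        if (i % 4 == 0 && i != 0) bits += ' ';
    }
    return bits;
}
```

For example, `ToBits16(Hash("bill"))` yields `0000 0011 0110 1100`, matching the table, and `MakeAddress("bill", 3)` takes the three low-order bits 100 in reverse order, giving address 001.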
-
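class Bucket: protected TextIndex
{
protected:
    Bucket (Directory & dir, int maxKeys = defaultMaxKeys);
    int Insert (char * key, int recAddr);
    int Remove (char * key);
    Bucket * Split (); // split the bucket and redistribute the keys
    int NewRange (int & newStart, int & newEnd); // directory range for the new bucket
    int Redistribute (Bucket & newBucket);
    int FindBuddy (); // find the bucket that is paired with this one
    int TryCombine (); // attempt to combine with the buddy bucket
    int Combine (Bucket * buddy, int buddyIndex);
    int Depth; // number of address bits shared by the keys in this bucket
    Directory & Dir; // directory that contains the bucket
    int BucketAddr;
    friend class Directory;
    friend class BucketBuffer;
};
Figure 12.10 Main members of class Bucket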
-
class Directory
{
public:
    Directory (..);
    ~Directory ();
    int Open (..);
    int Create ();
    int Close ();
    int Insert ();
    int Delete ();
    int Search ();
protected:
    int DoubleSize ();
    int Collapse ();
    int InsertBucket (.);
    int Find ();
    int StoreBucket ();
    int LoadBucket ()..
};
Figure 12.11 Definition of class Directory
-
12.4 Deletion
When to combine buckets
Buddy buckets: buckets that are siblings at the leaf level of the tree (a buddy is something like a friend)
e.g., B and D on page 19 are buddy buckets
Examine the directory to see if we can make changes there
Shrink the directory if none of the buckets requires the depth of address information that is currently available in the directory
-
Buddy Bucket
Given a bucket with an address uvwxy, where u, v, w, x, and y have values of either 0 or 1, the buddy bucket, if it exists, has the address uvwxz, such that
z = y XOR 1
If enough keys are deleted, the contents of buddy
buckets can be combined into a single bucket
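The buddy rule can be sketched as a small function. It is modeled on the FindBuddy member listed for class Bucket, but the free-function signature and the -1 "no buddy" return value are assumptions made for illustration.

```cpp
#include <cassert>

// Returns the directory index of the buddy bucket, or -1 if there is none.
// bucketAddr:  the bucket's address, using bucketDepth bits
// bucketDepth: number of address bits the bucket uses
// dirDepth:    number of address bits the directory uses
int FindBuddy(int bucketAddr, int bucketDepth, int dirDepth) {
    if (dirDepth == 0) return -1;           // single-cell directory: no buddy
    if (bucketDepth < dirDepth) return -1;  // bucket is not at the leaf level
    return bucketAddr ^ 1;                  // flip the last bit: z = y XOR 1
}
```

With a directory of depth 3, the bucket at address 100 has its buddy at 101 and vice versa, while a bucket of depth 2 has no buddy at the current directory depth.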
-
Collapsing the Directory
Collapse condition
If the directory is a single cell, downsizing is impossible
If there is a pair of directory cells that do not both point to the same bucket, collapsing is impossible
Allocating space
Allocate half the size of the original
Copy the bucket references shared by each cell pair to a single
cell in the new directory
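The collapse condition and the copy step can be sketched as follows, again assuming the directory is held as a vector of bucket numbers whose size is a power of two (a simplification of the textbook's Directory class):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Halves the directory in place and returns true if every cell pair
// (2i, 2i+1) refers to the same bucket; otherwise leaves it unchanged.
bool Collapse(std::vector<int>& cells) {
    if (cells.size() <= 1) return false;             // single cell: cannot shrink
    for (std::size_t i = 0; i < cells.size(); i += 2)
        if (cells[i] != cells[i + 1]) return false;  // some pair differs
    std::vector<int> half(cells.size() / 2);         // half the original size
    for (std::size_t i = 0; i < half.size(); ++i)
        half[i] = cells[2 * i];                      // one reference per pair
    cells = half;
    return true;
}
```

Calling it repeatedly shrinks the directory until some pair of cells refers to two different buckets.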
-
12.5 Extendible Hashing Performance
Time : O(1)
If the directory can be kept in RAM: a single access
Otherwise: two accesses are necessary
Space utilization of the bucket
r (# of records), b (block size), N (# of blocks)
Utilization = r / (bN)
Average utilization ==> 0.69
Space utilization for the directory
How large a directory should we expect to have, given an expected number of keys?
Expected value for the directory size by Flajolet (1983)
Estimated directory size = (3.92 / b) × r^(1 + 1/b)
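Both formulas are easy to evaluate directly. The sketch below assumes the estimate reads (3.92 / b) × r^(1 + 1/b); the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Bucket space utilization: r records stored in N blocks of b records each.
double Utilization(double r, double b, double N) {
    return r / (b * N);
}

// Estimated directory size for r records and block size b,
// using the formula (3.92 / b) * r^(1 + 1/b).
double EstimatedDirectorySize(double r, double b) {
    return (3.92 / b) * std::pow(r, 1.0 + 1.0 / b);
}
```

Because the exponent 1 + 1/b approaches 1 as b grows, larger blocks make the estimated directory size grow almost linearly in r, while very small blocks make it grow much faster.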
-
Space utilization for buckets
Periodic and fluctuating
With uniformly distributed addresses, all the buckets tend to fill up at the same time -> split at the same time
As buckets fill up : 90%
After a concentrated series of splits : 50%
r : # of records, b : block size, N : # of blocks
N ~= r / (b ln 2)
Utilization = r / (bN) ~= ln 2 = 0.69
Average utilization of 69%
B-tree space utilization
Normal B-tree : 67%, B-tree with redistribution in insertion : 85%
-
12.6 Alternative Approaches(1):Dynamic Hashing
Similar to extendible hashing
Use a directory to track bucket addresses
Extend the directory through the use of tries
Start with a hash function that covers an address space of a fixed size
When overflow occurs
the bucket splits, forming the leaves of a trie that grows down from the original address node
-
Alternative Approaches(2): Dynamic Hashing
Two kinds of nodes
External node: references a data bucket
Internal node: points to two child index nodes
When a node's bucket splits, the node changes from an external node to an internal node
Two hash functions
Apply the first hash function to reach the original address space
if an external node is found : the search is completed
if an internal node is found : apply the second hash function
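The two-level search can be sketched as below. Everything here is an illustrative assumption rather than the published algorithm's exact form: `Node`, `Split`, and `FindBucket` are hypothetical names, the first hash value `h1` is taken as an index into the original address space, and the second hash value `h2` is treated as a string of bits consumed low-order bit first.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

struct Node {
    int bucket;                      // bucket number; meaningful only for an external node
    std::unique_ptr<Node> child[2];  // both set for an internal node, both null otherwise
    explicit Node(int b) : bucket(b) {}
    bool IsExternal() const { return child[0] == nullptr; }
};

// Turn an external node into an internal node with two external children.
void Split(Node& node, int bucket0, int bucket1) {
    node.child[0].reset(new Node(bucket0));
    node.child[1].reset(new Node(bucket1));
    node.bucket = -1;
}

// h1 selects a node in the original address space; while the node is
// internal, successive bits of h2 choose the child to follow.
int FindBucket(const std::vector<std::unique_ptr<Node>>& roots,
               std::size_t h1, unsigned h2) {
    const Node* node = roots[h1].get();
    while (!node->IsExternal()) {
        node = node->child[h2 & 1u].get();
        h2 >>= 1;
    }
    return node->bucket;
}
```

Starting from four external nodes, splitting the fourth into buckets 40 and 41 and then splitting 41 into 410 and 411 means a key hashing to that address is routed by one or two bits of the second hash value.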
-
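Figure: growth of the index in dynamic hashing
(a) Original address space: nodes 1, 2, 3, 4
(b) Original address space: nodes 1, 2, 3, 4; node 4 has split into 40 and 41
(c) Original address space: nodes 1, 2, 3, 4; node 2 has split into 20 and 21, node 4 into 40 and 41, and node 41 into 410 and 411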
-
Dynamic Hashing vs. Extendible Hashing(1)
Overflow handling
Both schemes extend the hash function locally, as a binary search trie
Both schemes use a directory structure
Dynamic hashing: a linked structure
Extendible hashing: perfect tree expressible as an array
Space Utilization
the same for both schemes (space utilization : 69%)
-
Dynamic Hashing and Extendible Hashing(2)
Growth of directory
Dynamic hashing: slower, more gradual growth
Extendible hashing: extends the directory by doubling it
Actual size of an index node
An index node in dynamic hashing is larger than a directory cell in extendible hashing (because of pointers)
Page fault
Dynamic hashing: more than one page fault (with a linked structure for the directory)
Extendible hashing: single page fault
-
Alternative Approaches(3): Linear Hashing
Unlike extendible hashing and dynamic hashing, linear hashing does not use a directory.
The actual address space is extended one bucket at a time as buckets overflow
Because the extension of the address space does not necessarily correspond to the bucket that is overflowing, linear hashing necessarily involves the use of overflow buckets, even as the address space expands
No directories: avoids the additional seek resulting from the additional layer
Use more bits of hashed value
h_d(k) : depth-d hashing function (using function make_address)
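Address computation in linear hashing can be sketched as follows. The form of h_d used here (the d low-order bits of the hash value) is an assumption, chosen so that splitting bucket 00 creates bucket 100; `n` is a hypothetical name for the number of buckets already split in the current round.

```cpp
#include <cassert>

// h_d(k): keep the d low-order bits of the hash value (assumed form of h_d).
unsigned HashAtDepth(unsigned hash, int d) {
    return hash & ((1u << d) - 1u);
}

// d: current depth; n: number of buckets already split in this round,
// i.e. buckets 0 .. n-1 have been split and the next to split is bucket n.
unsigned LinearAddress(unsigned hash, int d, unsigned n) {
    unsigned addr = HashAtDepth(hash, d);
    if (addr < n)                         // this bucket has already been split,
        addr = HashAtDepth(hash, d + 1);  // so one more bit decides between its halves
    return addr;
}
```

With d=2 and one bucket split, a hash ending in 100 goes to the new bucket 100, a hash ending in 000 stays in bucket 000, and a hash ending in 01 still uses the 2-bit address 01.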
-
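Figure: the growth of address space in linear hashing (1)
(a) Buckets a b c d at addresses 00 01 10 11
(b) Buckets a b c d A at addresses 000 01 10 11 100; overflow bucket w
(c) Buckets a b c d A B at addresses 00 01 10 11 100 101; overflow bucket x
(d) Buckets a b c d A B C at addresses 00 01 10 11 100 101 110; overflow buckets x and y
(continued...)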
-
Figure: the growth of address space in linear hashing (2)
(e) Buckets a b c d A B C D at addresses 00 01 10 11 100 101 110 111; overflow bucket x
-
Alternative Approaches(5):Approaches to Controlling Splitting
Postpone splitting: increase space utilization
B-Tree: redistribution rather than splitting
Hashing: placing records in chains of overflow buckets to postpone splitting
Triggering event for splitting
Linear hashing
Every time any bucket overflows
The bucket that is split is not necessarily the overflowing bucket
Litwin (1980): overall load factor of the file
Below 2 seeks, 75% ~ 80% storage utilization
-
Alternative Approaches(5):Approaches to Controlling Splitting
Postpone splitting for extendible hashing
Use chained overflow buckets
Avoid doubling directory space
1.1 seek, 76% ~ 81% storage utilization
-
Let's Review!
12.1 Introduction
12.2 How extendible hashing works
12.3 Implementation
12.4 Deletion
12.5 Extendible hashing performance
12.6 Alternative approaches