copyright 2003curt hill hash indexes are they better or worse than a b+tree?
TRANSCRIPT
![Page 1: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/1.jpg)
Copyright 2003Curt Hill
Hash indexes
Are they better or worse than a B+Tree?
![Page 2: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/2.jpg)
Copyright 2003Curt Hill
Tasks
• Consider the basics of hashing• Consider how this applies to
indexing schemes• Consider variations • Consider the Hash Join
![Page 3: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/3.jpg)
Copyright 2003Curt Hill
Basics of hashing
• Internal hashing– A table in memory– Bucket usually holds one entry– Covered in a separate presentation:
• Hashing.ppt
• External hashing– On disk– Bucket is a page – holds multiple
entries
![Page 4: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/4.jpg)
Copyright 2003Curt Hill
External Hashing
• Hash function takes the key and computes an integer
• How is this integer used?• Direct file
– Key is an integer• Directory of heap file
– Works well if one directory page can hold the correct number of page ids
• Lookup table– Converts integer to page id number
![Page 5: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/5.jpg)
Copyright 2003Curt Hill
How does it work?
• Hash function takes the key and computes a page number
• Search the page for the correct data page
• Access the data• For very large indices the number of
accesses can still be quite small
![Page 6: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/6.jpg)
Copyright 2003Curt Hill
Diagram
Data . . .
Hash buckets
. . .
0 1 2 N-1
![Page 7: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/7.jpg)
Copyright 2003Curt Hill
Example• Assume 1 million records with 4 records
to a page– 250000 pages of data
• Assume the key and pointer is 24 bytes with a 512 byte page– 21 keys and pointers in a page
• Assume buckets are ¾ full– 67000 buckets with 15 keys
• Hash function computes a value in range 0 – 66999
• Without collisions it takes two accesses to get data
![Page 8: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/8.jpg)
Copyright 2003Curt Hill
Static and Dynamic
• Static hashing works well until the buckets fill up– Then a bucket requires an overflow
bucket– Searching the original and overflow
pages increases the accesses and performance drops
• Dynamic hashing involves techniques where the sizes may grow gracefully
![Page 9: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/9.jpg)
Copyright 2003Curt Hill
Extendible/Extensible Hashing
• Mechanism for altering the size of the hash table without the usual pain
• Common strategy for internal hashes is to double the hash table and rehash each entry
• This is too expensive for an index• Instead we do incremental doubling of
the buckets and index– Spreads the cost nicely
![Page 10: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/10.jpg)
Copyright 2003Curt Hill
Scheme
• We generate a hash that is in a range much larger than we need
• Typically modulo some large prime number
• Use only the bottom so many bits of that result to select the bucket
• Start the process with just one bit• We also have the notion of global and
local depth
![Page 11: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/11.jpg)
Copyright 2003Curt Hill
First example
01
11 2 1484
5 7 9 111
Bit pattern ends in 0
Bit pattern ends in 1
Numbers shown are hash function output.
Index Buckets
![Page 12: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/12.jpg)
Copyright 2003Curt Hill
Splitting a bucket• The next insertion will overfill a
bucket• The exact action is dependent on
the local and global levels• If the local level = global level
– Add one to global level (number of bits)
– Double the index• Add one more bit
– Double the bucket• Distribute values between the two
• If the local level < global level only double bucket
![Page 13: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/13.jpg)
Copyright 2003Curt Hill
Bucket and Index Split
0
1
11 2 1484
5 7 9 111
2
00011011
1
2
2
2 4 8 14
3 7
5 9
11
Add 3 – split 1 bucket into 01 and 11
![Page 14: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/14.jpg)
Copyright 2003Curt Hill
Continued Insertions
• Notice that there were two pointers to the unsplit bucket
• Insertions to a bucket that has a lower level than the global level only splits the bucket not the index
• It separates the two pointers
![Page 15: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/15.jpg)
Copyright 2003Curt Hill
Bucket Only Split 2
00011011
1
2
2
2 4 8 14
5 9
73 11
2
11
2
2
2
4 8
5 9
73 11
10
01
00
2 2 10 14
Add 10split bucket
![Page 16: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/16.jpg)
Copyright 2003Curt Hill
Extensible Hashing
• When the index exceeds one page– The upper so many bits may be checked so
the entire index is not searched
• The mechanism is different than a tree• The net effect is not that much different• The index may grow smoothly without
changes to the hash function or drastic rewriting
![Page 17: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/17.jpg)
Copyright 2003Curt Hill
Not without problems• When the index is doubled there is
work which is added to the insertion• When the index will not fit in
memory substantial I/Os occur• When number of records per block
is small we can end up with much larger global levels than needed– Suppose 2 records per block and 3
records have the same key for the last 20 bits
– Global level of 20, even when most local levels are in 1-5 range
![Page 18: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/18.jpg)
Copyright 2003Curt Hill
Linear Hashing
• A different scheme with mechanism different than extensible hashing but some common properties– Splits are incrementally added– Some flexibility when they occur
• Like extensible hashing we use the bottom so many bits of a larger hash function
• Round robin bucket splitting• Overflow buckets are used but may be
later consumed
![Page 19: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/19.jpg)
Copyright 2003Curt Hill
Linear Hashing Numbers
• N – the number of buckets– Not always a power of two
• I – the number of used bits in the hash function
• R – the number of records in the structure
• M – the hash result– 0 M 2i
– M may larger or smaller than N– If M > N we use M - 2i-1
![Page 20: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/20.jpg)
Copyright 2003Curt Hill
Adding a bucket
• Any strategy can be used to determine when a bucket is added– Adding a bucket increases N
• When the ratio of records to buckets crosses a threshold
• When a bucket is forced to add an overflow bucket
![Page 21: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/21.jpg)
Copyright 2003Curt Hill
Linear Example
I = 1R = 3N = 2
0000
10100
11111
R/N = 1.5Split > 1.7
![Page 22: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/22.jpg)
Copyright 2003Curt Hill
Adding an Item• When an item is added it is put in the
proper bucket• If if does not fit add an overflow bucket• If the R/N threshold is crossed add a
new bucket– This causes the corresponding bucket to be
redistributed over the two buckets
• The number of hash bits used may be increased
![Page 23: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/23.jpg)
Copyright 2003Curt Hill
Insert 0101I = 1R = 3N = 2
0000
10100
11111
Add 0101 to this
Increases R/N ratio
I = 2R = 4N = 3
000000
01 0101
101010
1111
![Page 24: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/24.jpg)
Copyright 2003Curt Hill
Explanation
• The bucket that was added to was not split– It was not its turn– Buckets are split in round robin
fashion
![Page 25: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/25.jpg)
Copyright 2003Curt Hill
Insert 0111I = 2R = 4N = 3
000000
01 0101
101010
1111
R/N = 1.3
I = 2R = 5N = 3
000000
01 0101
101010
1111
R/N = 1.67
0111
Add 0111No splitOverflow
![Page 26: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/26.jpg)
Copyright 2003Curt Hill
Insertion and Searching• The hash function is now using two bits
which gives it four possibilities, but there are only three buckets
• If the hash result M < N just use that bucket
• If the hash result M N subtract from M 2i-1
• In last case 0111 was inserted – Last two bits are 11, but there is not yet a
11 bucket, so 10 was subtracted from it
• Searching uses same type of scheme
![Page 27: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/27.jpg)
Copyright 2003Curt Hill
Insert 1100I = 2R = 6N = 40000
00
01 0101
101010
11110111
000000
01 0101
101010
1100
011111
1111
R/N = 1.5
![Page 28: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/28.jpg)
Copyright 2003Curt Hill
Dynamic Hashing Summary
• Linear hashing lacks the index of extensible hashing
• There are similarities– Hash function where only the bottom
so many bits are used– Gradual splits– Quick lookups
![Page 29: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/29.jpg)
Copyright 2003Curt Hill
Joins• If both files are sorted on the join
the previously mentioned zipper join is used
• However, if the join field is not the primary key sorting the relation on this field may be expensive – Especially so if the outer join is larger
than an inner join– The number of joined records is small
compared to either relation size
![Page 30: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/30.jpg)
Copyright 2003Curt Hill
Hash Join• Recall that a Cartesian Product makes
all possible combinations of records from two relations– This could mean read numbering the
products of block– That is exactly what we want to avoid
• Hash join partitions two relations into pieces based on a hash function
• Then only joins partitions that reacted similarly to the hash function
• Of course only works on Equi-Joins
![Page 31: Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649ca75503460f949698e7/html5/thumbnails/31.jpg)
Copyright 2003Curt Hill
Hash Join Process• Hash the smaller of the two files on
the join field• Read in the other file• Hash each key into a bucket
– The only candidates for equality are here
• Produce the output• Smaller but still substantial