10 B Tree Hash Index
TRANSCRIPT
8/3/2019

Lecture 10
B-Trees, Part 2 & Hash-Based Indexes
R&G Chapter 10
Administrivia
The new Homework 3 is now available
Due 1 week from Sunday
Homework 4 available the week after
Midterm exams available here
Review
Last time we discussed file organization:
Unordered heap files
Sorted files
Clustered trees
Unclustered trees
Unclustered hash tables
Indexes:
B-Trees: dynamic, good for changing data, range queries
Hash tables: fastest for equality queries, useless for range queries
Review (2)
For any index, 3 alternatives for data entries k*:
1. Data record with key value k
2. <k, rid of data record with search key value k>
3. <k, list of rids of data records with search key value k>
Choice orthogonal to the indexing technique
Today:
Indexes
Composite Keys, Index-Only Plans
B-Trees
details of insertion and deletion
Hash Indexes: how to implement with changing data sets
Inserting a Data Entry into a B+ Tree
Find correct leaf L. Put data entry onto L.
If L has enough space, done!
Else, must split L (into L and a new node L2): redistribute entries evenly, copy up middle key.
Insert index entry pointing to L2 into parent of L.
This can happen recursively
To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.)
Splits grow tree; root split increases height.
Tree growth: gets wider or one level taller at top.
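The copy-up vs. push-up distinction can be sketched in a few lines (a minimal illustration with hypothetical helper names, not the course's reference implementation):

```python
# Minimal sketch of the two split rules:
# leaves COPY the middle key up, internal nodes PUSH it up.

def split_leaf(keys):
    """Split a full leaf; the middle key stays in the new right leaf
    and a copy of it goes up to the parent."""
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]          # right[0] is copied up

def split_internal(keys):
    """Split a full internal node; the middle key moves up to the
    parent and appears in neither half."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid + 1:], keys[mid]  # keys[mid] pushed up

left, right, up = split_leaf([5, 10, 15, 20, 25])
assert up == 15 and up in right           # copy-up: 15 still in a leaf

left, right, up = split_internal([5, 10, 15, 20, 25])
assert up == 15 and up not in left and up not in right  # push-up
```

Copy-up keeps every key present at the leaf level, where the data entries live; push-up avoids storing a separator twice in the index levels.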
Indexes with Composite Search Keys
Composite Search Keys: search on a combination of fields.
Equality query: every field value is equal to a constant value. E.g. w.r.t. <age, sal> index:
age = 20 and sal = 75
Range query: some field value is not a constant. E.g.:
age = 20; or age = 20 and sal > 10
Data entries in index sorted by search key to support range queries.
Lexicographic order, or
Spatial order.
[Figure: examples of composite-key indexes using lexicographic order. Data records (name, age, sal) are sorted by name: bob <12, 10>, cal <11, 80>, joe <12, 20>, sue <13, 75>. One index holds data entries sorted by <age, sal>: <11,80>, <12,10>, <12,20>, <13,75>; another holds data entries sorted by <sal, age>: <10,12>, <20,12>, <75,13>, <80,11>.]
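Using the (age, sal) values from the figure, Python's tuple comparison demonstrates lexicographic order on a composite key (a quick illustration, not part of the slides):

```python
# Data entries from the figure as (age, sal) pairs; Python compares
# tuples lexicographically, exactly like a composite search key.
entries = [(12, 20), (12, 10), (11, 80), (13, 75)]
assert sorted(entries) == [(11, 80), (12, 10), (12, 20), (13, 75)]

# The same records keyed as (sal, age) sort into a different order:
by_sal = sorted((sal, age) for age, sal in entries)
assert by_sal == [(10, 12), (20, 12), (75, 13), (80, 11)]
```

Under <age, sal>, all entries with age = 12 are contiguous, which is why a sorted composite index supports equality-prefix and range queries cheaply.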
Composite Search Keys
To retrieve Emp records with age=30 AND sal=4000, an index on <age, sal> would be better than an index on age or an index on sal.
Choice of index key orthogonal to clustering etc.
If condition is: 20 < age < 30 AND 3000 < sal < 5000, a clustered tree index on <age, sal> or <sal, age> is best.
Index-Only Plans
A number of queries can be answered without retrieving any tuples from one or more of the relations involved, if a suitable index is available.

SELECT D.mgr
FROM Dept D, Emp E
WHERE D.dno = E.dno

SELECT D.mgr, E.eid
FROM Dept D, Emp E
WHERE D.dno = E.dno

SELECT E.dno, COUNT(*)
FROM Emp E
GROUP BY E.dno

SELECT E.dno, MIN(E.sal)
FROM Emp E
GROUP BY E.dno
-- Tree index!

SELECT AVG(E.sal)
FROM Emp E
WHERE E.age = 25
AND E.sal BETWEEN 3000 AND 5000
-- Tree index! (on <E.age, E.sal> or <E.sal, E.age>)
Index-Only Plans (Contd.)
Index-only plans are possible if the key is <dno, age>, or we have a tree index with key <age, dno>.
Which is better?
What if we consider the second query?

SELECT E.dno, COUNT(*)
FROM Emp E
WHERE E.age = 30
GROUP BY E.dno

SELECT E.dno, COUNT(*)
FROM Emp E
WHERE E.age > 30
GROUP BY E.dno
B-Trees: Insertion and Deletion
Insertion: find leaf where new record belongs
If leaf is full, redistribute*
If siblings too full, split, copy middle key up
If index node too full, redistribute*
If index siblings full, split, push middle key up
Deletion:
Find leaf where record exists, remove it
If leaf is less than 50% full, redistribute*
If siblings too empty, merge, remove key above
If index node above too empty, redistribute*
If index siblings too empty, merge, move above key down
B-Trees: For Homework 2
Insertion: find leaf where new record belongs
If leaf is full, redistribute*
If siblings too full, split, copy middle key up
If index node too full, redistribute*
If index siblings full, split, push middle key up
Deletion:
Find leaf where record exists, remove it
If leaf is less than 50% full, redistribute*
If siblings too empty, merge, remove key above
If index node above too empty, redistribute*
If index siblings too empty, merge, move above key down
This means that after deletion, nodes will often be < 50% full
B-Trees: For Homework 2 (cont)
Splits
When splitting nodes, choose the middle key
If there is an even number of keys, choose the middle
Your code must handle duplicate keys
We promise that there will never be more than 1 page of duplicate values. Thus, when splitting, if the middle key is identical to the key to the left, you must find the closest splittable key to the middle.
B-Tree Demo
Hashing
Static and dynamic hashing techniques exist; trade-offs depend on how the data changes over time.
Static Hashing: good if data never changes
Extendible Hashing: uses directory to handle changing data
Linear Hashing: avoids directory, usually faster
Static Hashing
# primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed.
h(k) mod N = bucket to which data entry with key k belongs. (N = # of buckets)
[Figure: h(key) mod N maps each key to one of the primary bucket pages 0, 1, ..., N-1; each primary page may have a chain of overflow pages.]
Static Hashing (Contd.)
Buckets contain data entries.
Hash fn works on search key field of record r. Must distribute values over range 0 ... N-1.
h(key) = (a * key + b) usually works well.
a and b are constants; lots known about how to tune h.
Long overflow chains can develop and degrade performance.
Extendible and Linear Hashing: dynamic techniques to fix this problem.
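A toy static hash file shows how a fixed N forces overflow chains (the constants a, b and the tiny bucket capacity are illustrative choices for the demo, not tuned values):

```python
# Sketch of a static hash index: N primary buckets fixed up front,
# with overflow chains once a primary page fills.

N = 4            # number of primary buckets (fixed forever)
CAPACITY = 2     # data entries per primary bucket page (tiny, for demo)
a, b = 31, 17    # h(key) = a*key + b; then mod N picks the bucket

buckets = [[] for _ in range(N)]    # primary pages
overflow = [[] for _ in range(N)]   # overflow chain per bucket

def insert(key):
    i = (a * key + b) % N
    if len(buckets[i]) < CAPACITY:
        buckets[i].append(key)
    else:
        overflow[i].append(key)     # chain grows; nothing is re-hashed

for k in [1, 5, 9, 13, 17]:         # these all hash to bucket 0
    insert(k)

assert buckets[0] == [1, 5]         # primary page full
assert overflow[0] == [9, 13, 17]   # long overflow chain develops
```

Because N never changes, a skewed or growing data set can only lengthen the chains; the dynamic schemes below exist to fix exactly this.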
Extendible Hashing
Situation: bucket (primary page) becomes full.
Why not re-organize file by doubling # of buckets?
Reading and writing all pages is expensive!
Idea: use directory of pointers to buckets; double # of buckets by doubling the directory, splitting just the bucket that overflowed!
Directory much smaller than file, doubling much cheaper.
No overflow pages!
Trick lies in how hash function is adjusted!
Extendible Hashing Details
Need directory with pointer to each bucket
Need hash function to incrementally double range
can just use increasingly more LSBs of h(key)
Must keep track of global depth
how many times directory doubled so far
Must keep track of local depth
how many times each bucket has been split
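These pieces fit together in a toy extendible hash table. This is a sketch (identity hash, small capacity), not the textbook's pseudocode, but it replays the lecture's running example:

```python
# Toy extendible hashing: the directory has 2^global_depth slots and a
# key's slot is its last global_depth bits (identity hash for clarity).

CAPACITY = 4

class Bucket:
    def __init__(self, depth):
        self.depth = depth            # local depth: bits this bucket uses
        self.keys = []

global_depth = 2
directory = [Bucket(2) for _ in range(4)]

def insert(key):
    global global_depth, directory
    while True:
        b = directory[key & ((1 << global_depth) - 1)]
        if len(b.keys) < CAPACITY:
            b.keys.append(key)
            return
        if b.depth == global_depth:   # split would outgrow the directory:
            directory = directory * 2 # double it by copying (LSB trick)
            global_depth += 1
        b.depth += 1                  # split b on one more bit
        bit = 1 << (b.depth - 1)
        image = Bucket(b.depth)       # the "split image"
        image.keys = [k for k in b.keys if k & bit]
        b.keys = [k for k in b.keys if not k & bit]
        for i, slot in enumerate(directory):
            if slot is b and i & bit: # re-point half of b's slots
                directory[i] = image
        # loop: retry (the target bucket may still be full)

# Replay the lecture's example: fill buckets A-D, then insert 20 and 9.
for k in [4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19]:
    insert(k)
insert(20)                            # bucket A full: split AND doubling
assert global_depth == 3
assert directory[0].keys == [32, 16]      # bucket A
assert directory[4].keys == [4, 12, 20]   # bucket A2, its split image
insert(9)                             # bucket B full: split only
assert global_depth == 3              # directory unchanged this time
assert directory[1].keys == [1, 9]
assert directory[5].keys == [5, 21, 13]
```

The key point the code makes concrete: doubling the directory is just copying the pointer array, and a split only forces a doubling when the overflowing bucket's local depth already equals the global depth.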
Insert h(r)=20 (Causes Doubling)
[Figure: before the insert, global depth is 2 and the four directory entries (00, 01, 10, 11) point to Bucket A (4, 12, 32, 16), Bucket B (1, 5, 21, 13), Bucket C (10), and Bucket D (15, 7, 19), each with local depth 2. Inserting 20 finds Bucket A full, so the directory doubles to eight entries (000-111, global depth 3) and A splits on its third-from-last bit into Bucket A (32, 16) and its 'split image' Bucket A2 (4, 12, 20), both with local depth 3; B, C, D keep local depth 2.]
Now Insert h(r)=9 (Causes Split Only)
[Figure: inserting 9 finds Bucket B (1, 5, 21, 13) full, but B's local depth (2) is less than the global depth (3), so only B splits: Bucket B keeps 1 and 9 while its split image B2 takes 5, 21, 13, both now at local depth 3. The directory just re-points one entry to B2; no doubling occurs.]
Points to Note
20 = binary 10100. Last 2 bits (00) tell us r belongs in A or A2. Last 3 bits needed to tell which.
Global depth of directory: max # of bits needed to tell which bucket an entry belongs to.
Local depth of a bucket: # of bits used to determine if an entry belongs to this bucket.
When does bucket split cause directory doubling?
If, before insert, local depth of bucket = global depth. Insert causes local depth to become > global depth; directory is doubled by copying it over and 'fixing' pointer to split image page.
(Use of least significant bits enables efficient doubling via copying of directory!)
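The bit arithmetic is easy to check directly:

```python
# 20 = 0b10100. The last 2 bits place it with A/A2; the 3rd bit picks one.
assert 20 & 0b11 == 0b00    # last 2 bits: the suffix A and A2 share
assert 20 & 0b111 == 0b100  # last 3 bits: selects A2, not A
assert 32 & 0b111 == 0b000  # 32 = 0b100000: stays in A after the split
```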
Directory Doubling
Why use least significant bits in directory? Allows for doubling via copying!
[Figure: a directory with global depth 2 (entries 00, 01, 10, 11) doubles to global depth 3 (000-111) by appending a copy of itself; with least-significant-bit addressing, each new entry starts out pointing to the same bucket as the entry it was copied from, and the bucket contents (e.g. 4, 12, 32, 16; 1, 21; 10; 15, 7, 19; 5, 13) are untouched.]
Deletion
Delete: if removal of data entry makes bucket empty, it can be merged with its 'split image'. If each directory element points to the same bucket as its split image, can halve directory.
[Figure: deleting 10 leaves its bucket empty, so it is merged with its split image; the merged bucket's local depth drops from 2 to 1 while the directory (global depth 2) is unchanged.]
Deletion (cont)
(Same rule as before: merge an emptied bucket with its split image; halve the directory when every element points to the same bucket as its split image.)
[Figure: deleting 15 empties another bucket, which merges with its split image; now each directory element points to the same bucket as its split image, so the directory is halved and global depth drops from 2 to 1.]
Comments on Extendible Hashing
If directory fits in memory, equality search answered with one disk access; else two.
100MB file, 100 bytes/rec, 4K pages contains 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that directory will fit in memory.
Directory grows in spurts, and, if the distribution of hash values is skewed, directory can grow large.
Biggest problem:
Multiple entries with same hash value cause problems!
If bucket already full of same hash value, will keep doubling forever! So must use overflow buckets if there are duplicates.
Linear Hashing
This is another dynamic hashing scheme, an alternative to Extendible Hashing.
LH handles the problem of long overflow chains without using a directory, and handles duplicates.
Idea: use a family of hash functions h0, h1, h2, ...
hi(key) = h(key) mod (2^i * N); N = initial # buckets
h is some hash function (range is not 0 to N-1). If N = 2^d0, for some d0, hi consists of applying h and looking at the last di bits, where di = d0 + i.
hi+1 doubles the range of hi (similar to directory doubling)
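The hash family is tiny to write down; here N = 4 and the base hash h is the identity (an illustrative choice, since any h with a wide range works):

```python
# The family h_i(key) = h(key) mod (2^i * N), with N = 4 and h = identity.
N = 4

def h_i(i, key):
    return key % (2 ** i * N)

# h_0 looks at the last 2 bits, h_1 at the last 3 (d_i = d_0 + i):
assert h_i(0, 9) == h_i(0, 13) == 1         # 9 and 13 collide under h_0
assert h_i(1, 9) == 1 and h_i(1, 13) == 5   # h_1 tells them apart
```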
Linear Hashing Example
Let's start with N = 4 buckets. Start at round 0 with Next = 0; there are 2^round * N buckets. Each time any bucket fills, split the Next bucket. If 0 <= h_round(key) < Next, use h_round+1(key) instead.
[Figure: Start: bucket 0 holds 4, 12, 32, 16; bucket 1 holds 1, 21, 13, 5; bucket 2 holds 10; bucket 3 holds 15. Add 9: bucket 1 overflows (9 goes to an overflow page), which triggers a split of bucket Next = 0: 32 and 16 stay in bucket 0 while 4 and 12 move to new bucket 4, and Next advances to 1. Add 20: h_0(20) = 0 < Next, so h_1(20) = 4 sends 20 to bucket 4.]
Linear Hashing Example (cont)
Overflow chains do exist, but eventually get split. Instead of doubling, new buckets are added one at a time.
[Figure: Add 6: 6 joins 10 in bucket 2, with no split. Add 17: bucket 1's overflow chain grows again, triggering a split of the Next bucket (bucket 1); its entries are redistributed by h_1 and Next advances again.]
Linear Hashing (Contd.)
Directory avoided in LH by using overflow pages, and choosing bucket to split round-robin.
Splitting proceeds in 'rounds'. Round ends when all N_R initial (for round R) buckets are split. Buckets 0 to Next-1 have been split; Next to N_R yet to be split.
Current round number also called Level.
Search: to find bucket for data entry r, find h_round(r):
If h_round(r) in range 'Next to N_R', r belongs here.
Else, r could be in bucket h_round(r) or bucket h_round(r) + N_R; must apply h_round+1(r) to find out.
Overview of LH File
In the middle of a round:
[Figure: buckets 0 to Next-1 have been split in this round, and their 'split image' buckets (created through splitting) sit past the round's original range; bucket Next is the next to be split; buckets from Next up to N_R existed at the beginning of this round and form the range of h_Level(search key value). If h_Level(search key value) falls in the already-split range, h_Level+1 must be used to decide whether the entry is in the original bucket or its split image.]
Linear Hashing (Contd.)
Insert: find bucket by applying h_round / h_round+1:
If bucket to insert into is full:
Add overflow page and insert data entry.
(Maybe) split Next bucket and increment Next.
Can choose any criterion to 'trigger' split.
Since buckets are split round-robin, long overflow chains don't develop!
Doubling of directory in Extendible Hashing is similar; switching of hash functions is implicit in how the # of bits examined is increased.
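The whole insert procedure can be sketched as follows. The split trigger chosen here (split whenever an insert creates an overflow) is just one of the valid criteria the slide mentions:

```python
# Toy linear-hash file: N initial buckets, split pointer `nxt`,
# round number `level`. Overflow entries simply share each bucket list.
N, CAPACITY = 4, 4
level, nxt = 0, 0

buckets = [[] for _ in range(N)]

def bucket_of(key):
    i = key % (2 ** level * N)             # h_level
    if i < nxt:                            # already split this round:
        i = key % (2 ** (level + 1) * N)   # use h_{level+1} instead
    return i

def insert(key):
    global level, nxt
    i = bucket_of(key)
    buckets[i].append(key)
    if len(buckets[i]) > CAPACITY:         # overflow: split bucket nxt
        buckets.append([])                 # (round-robin, NOT bucket i!)
        old, buckets[nxt] = buckets[nxt], []
        for k in old:                      # redistribute via h_{level+1}
            buckets[k % (2 ** (level + 1) * N)].append(k)
        nxt += 1
        if nxt == 2 ** level * N:          # all round-start buckets split:
            level, nxt = level + 1, 0      # next round begins

for k in [32, 16, 4, 12, 9]:
    insert(k)
insert(20)                                 # overflows bucket 0 -> split it
assert buckets[0] == [32, 16]
assert buckets[4] == [4, 12, 20]           # split image at 0 + 2^level * N
assert nxt == 1
assert bucket_of(4) == 4                   # lookups now use h_1 for bucket 0
```

Note that the bucket that overflows is not necessarily the one that gets split; Next advances round-robin regardless, which is exactly why no single chain can grow without bound.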
Another Example of Linear Hashing
On split, h_Level+1 is used to re-distribute entries.
[Figure: Round = 0, N = 4. Left (Next = 0): h_0 uses the last 2 bits and h_1 the last 3 (this pairing is shown for illustration only); bucket 00 holds 32, 44, 36; bucket 01 holds 9, 25, 5; bucket 10 holds 14, 18, 10, 30; bucket 11 holds 31, 35, 7, 11. A data entry r with h(r) = 5 goes to primary bucket page 01. Right (after inserting 43, Next = 1): bucket 11 gains an overflow page holding 43, and bucket 00 has been split by h_1 into bucket 000 (32) and new bucket 100 (44, 36).]
Example: End of a Round
[Figure: left: late in round 0, Next = 3; buckets 000, 001, 010 have been split and their split images occupy positions 100, 101, 110, while bucket 11 still holds 35, 31, 7, 11 with overflow entry 43. Right: a further insert (50) triggers the split of bucket 3, ending the round: its entries are redistributed by h_1 between buckets 011 and 111, Next resets to 0, Round becomes 1, and all 8 primary buckets (000-111) are now addressed by h_1.]
LH Described as a Variant of EH
The two schemes are actually quite similar: begin with an EH index where directory has N elements.
Use overflow pages, split buckets round-robin.
First split is at bucket 0. (Imagine directory being doubled at this point.) But elements <1, N+1>, <2, N+2>, ... are the same. So, need only create directory element N, which differs from element 0, now.
When bucket 1 splits, create directory element N+1, etc.
So, directory can double gradually. Also, primary bucket pages are created in order. If they are allocated in sequence too (so that finding the ith is easy), we actually don't need a directory! Voila, LH.
Summary
Hash-based indexes: best for equality searches, cannot support range searches.
Static Hashing can lead to long overflow chains.
Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.)
Directory to keep track of buckets, doubles periodically.
Can get large with skewed data; additional I/O if this does not fit in main memory.
Summary (Contd.)
Linear Hashing avoids directory by splitting buckets round-robin, and using overflow pages.
Overflow pages not likely to be long.
Duplicates handled easily.
Space utilization could be lower than Extendible Hashing, since splits not concentrated on 'dense' data areas.
Can tune criterion for triggering splits to trade off slightly longer chains for better space utilization.
For hash-based indexes, a skewed data distribution is one in which the hash values of data entries are not uniformly distributed!