10 B Tree Hash Index


Post on 07-Apr-2018


TRANSCRIPT

  • 8/3/2019 10 B Tree Hash Index

    1/38

    B-Trees, Part 2

Hash-Based Indexes

R&G Chapter 10

    Lecture 10

  • 2/38

    Administrivia

The new Homework 3 is now available

    Due 1 week from Sunday

    Homework 4 available the week after

    Midterm exams available here

  • 3/38

    Review

Last time we discussed file organizations:

Unordered heap files

    Sorted Files

    Clustered Trees

Unclustered Trees

Unclustered Hash Tables

    Indexes

B-Trees: dynamic, good for changing data, range queries

Hash tables: fastest for equality queries, useless for range queries

  • 4/38

    Review (2)

For any index, 3 alternatives for data entries k*:

Data record with key value k

<k, rid of data record with search key value k>

<k, list of rids of data records with search key value k>

Choice is orthogonal to the indexing technique.

  • 5/38

    Today:

    Indexes

    Composite Keys, Index-Only Plans

    B-Trees

    details of insertion and deletion

Hash Indexes: how to implement with changing data sets

  • 6/38

    Inserting a Data Entry into a B+ Tree

Find correct leaf L. Put data entry onto L.

If L has enough space, done!

Else, must split L (into L and a new node L2). Redistribute entries evenly, copy up middle key.

Insert index entry pointing to L2 into parent of L.

This can happen recursively:

To split an index node, redistribute entries evenly, but push up the middle key. (Contrast with leaf splits.)

Splits grow the tree; a root split increases the height.

Tree growth: gets wider, or one level taller at top.
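The copy-up vs. push-up distinction can be sketched in a few lines of Python (a minimal illustration; the helper names are mine, not from the lecture):

```python
def split_leaf(entries):
    """Split a full leaf: redistribute evenly, COPY the middle key up.
    The separator key also stays in the new right leaf."""
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    return left, right, right[0]      # copy up: key remains at leaf level

def split_index(keys):
    """Split a full index node: redistribute evenly, PUSH the middle key up.
    The separator moves to the parent and disappears from this level."""
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid + 1:]
    return left, right, keys[mid]     # push up: key leaves this level
```

With keys 2 3 5 7 8, a leaf split copies 5 up while keeping it in the right leaf; an index split pushes 5 up and drops it from the node.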

  • 7/38

    Indexes with Composite Search Keys

Composite Search Keys: Search on a combination of fields.

Equality query: Every field value is equal to a constant value. E.g., wrt index on <age, sal>:

age = 20 and sal = 75

Range query: Some field value is not a constant. E.g.:

age = 20; or age = 20 and sal > 10

Data entries in index sorted by search key to support range queries.

Lexicographic order, or

Spatial order.

[Figure: examples of composite-key indexes using lexicographic order. Data records (name, age, sal) sorted by name: bob 12 10, cal 11 80, joe 12 20, sue 13 75. Data entries sorted by <age, sal>: 11,80 / 12,10 / 12,20 / 13,75. Data entries sorted by <sal, age>: 10,12 / 20,12 / 75,13 / 80,11.]
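The benefit of lexicographic order can be demonstrated with Python's bisect module on hypothetical <age, sal> entries: an equality query binary-searches the full key, and a prefix range query (all entries with a given age) finds a contiguous run:

```python
import bisect

# Hypothetical data entries for a composite <age, sal> index, kept in
# lexicographic order as in the figure.
entries = sorted([(12, 20), (12, 10), (11, 80), (13, 75)])

# Equality query (age = 12 AND sal = 10): binary search on the full key.
i = bisect.bisect_left(entries, (12, 10))
assert entries[i] == (12, 10)

# Range query on a key prefix (age = 12, any sal): because the order is
# lexicographic, all matches are contiguous.
lo = bisect.bisect_left(entries, (12, float("-inf")))
hi = bisect.bisect_right(entries, (12, float("inf")))
assert entries[lo:hi] == [(12, 10), (12, 20)]
```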

  • 8/38

    Composite Search Keys

To retrieve Emp records with age=30 AND sal=4000, an index on <age, sal> would be better than an index on age or an index on sal.

Choice of index key is orthogonal to clustering etc.

    If condition is: 20

  • 9/38

    Index-Only Plans

A number of queries can be answered without retrieving any tuples from one or more of the relations involved, if a suitable index is available.

SELECT D.mgr FROM Dept D, Emp E WHERE D.dno=E.dno

SELECT D.mgr, E.eid FROM Dept D, Emp E WHERE D.dno=E.dno

SELECT E.dno, COUNT(*) FROM Emp E GROUP BY E.dno

SELECT E.dno, MIN(E.sal) FROM Emp E GROUP BY E.dno

SELECT AVG(E.sal) FROM Emp E WHERE E.age=25 AND E.sal BETWEEN 3000 AND 5000

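As a toy illustration of the GROUP BY query above, the aggregate can be computed from the index's data entries alone (hypothetical dno values; no Emp tuples are touched):

```python
from collections import Counter

# Hypothetical data entries from an index on Emp.dno. The query
#   SELECT E.dno, COUNT(*) FROM Emp E GROUP BY E.dno
# needs only these keys, never the full Emp tuples.
dno_entries = [10, 10, 20, 10, 30, 20]

counts = Counter(dno_entries)         # dno -> COUNT(*)
```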

  • 10/38

    Index-Only Plans (Contd.)

Index-only plans are possible if the key is <dno, age>, or if we have a tree index with key <age, dno>.

Which is better?

What if we consider the second query?

SELECT E.dno, COUNT(*) FROM Emp E WHERE E.age=30 GROUP BY E.dno

SELECT E.dno, COUNT(*) FROM Emp E WHERE E.age>30 GROUP BY E.dno

  • 11/38

    B-Trees: Insertion and Deletion

Insertion: Find leaf where new record belongs

If leaf is full, redistribute*

If siblings too full, split, copy middle key up

If index too full, redistribute*

If index siblings full, split, push middle key up

Deletion: Find leaf where record exists, remove it

If leaf is less than 50% full, redistribute*

If siblings too empty, merge, remove key above

If index node above too empty, redistribute*

If index siblings too empty, merge, move above key down

  • 12/38

    B-Trees: For Homework 2

Insertion: Find leaf where new record belongs

If leaf is full, redistribute*

If siblings too full, split, copy middle key up

If index too full, redistribute*

If index siblings full, split, push middle key up

Deletion: Find leaf where record exists, remove it

(For Homework 2, the redistribution and merge steps of deletion are skipped: just remove the entry.)

This means that after deletion, nodes will often be < 50% full

  • 13/38

    B-Trees: For Homework 2 (cont)

Splits

When splitting nodes, choose the middle key

If there is an even number of keys, choose the middle

Your code must handle duplicate keys

We promise that there will never be more than 1 page of duplicate values. Thus, when splitting, if the middle key is identical to the key to its left, you must find the closest splittable key to the middle.
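A sketch of the duplicate-aware split rule (a hypothetical helper, relying on the promise above that duplicates never exceed one page):

```python
def find_split_index(keys):
    """Return a split position i (with keys[i-1] != keys[i]) as close to
    the middle as possible; raises if every key is identical."""
    mid = len(keys) // 2
    for offset in range(len(keys)):
        for i in (mid - offset, mid + offset):
            # A boundary is splittable only if it does not cut a run of
            # duplicates in two.
            if 0 < i < len(keys) and keys[i - 1] != keys[i]:
                return i
    raise ValueError("all keys identical: no splittable boundary")
```

For example, find_split_index([1, 2, 2, 2, 3]) moves the split point off the run of 2s to the nearest clean boundary.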

  • 14/38

    B-Tree Demo

  • 15/38

    Hashing

Static and dynamic hashing techniques exist; the trade-offs depend on how the data changes over time.

Static Hashing: Good if data never changes

    Extendable Hashing

    Uses directory to handle changing data

    Linear Hashing

    Avoids directory, usually faster

  • 16/38

Static Hashing

# primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed.

h(k) mod N = bucket to which data entry with key k belongs. (N = # of buckets)

[Figure: a key is fed through h, then mod N, selecting one of the primary bucket pages 0 .. N-1, each of which may have a chain of overflow pages.]

  • 17/38

    Static Hashing (Contd.)

    Buckets contain data entries.

Hash fn works on search key field of record r. Must distribute values over range 0 ... N-1.

h(key) = (a * key + b) usually works well.

a and b are constants; lots known about how to tune h.

Long overflow chains can develop and degrade performance.

Extendible and Linear Hashing: dynamic techniques to fix this problem.
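A minimal sketch of the static scheme with the h(key) = a * key + b form above (a and b here are arbitrary illustrative constants, not tuned values):

```python
N = 4                       # number of primary bucket pages, fixed up front
a, b = 31, 7                # illustrative constants (values are arbitrary)

def bucket_of(key):
    """Static hashing: h(key) = a*key + b, reduced mod N."""
    return (a * key + b) % N

# Every insert/lookup for a key always lands on the same primary page;
# once that page fills, further entries go to its overflow chain.
```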

  • 18/38

    Extendible Hashing

    Situation: Bucket (primary page) becomes full.

Why not re-organize file by doubling # of buckets?

    Reading and writing all pages is expensive!

    Idea: Use directory of pointers to buckets,

    double # of buckets by doubling the directory,

    splitting just the bucket that overflowed!

Directory much smaller than file, so doubling it is much cheaper.

No overflow pages!

    Trick lies in how hash function is adjusted!

  • 19/38

    Extendible Hashing Details

    Need directory with pointer to each bucket

    Need hash function to incrementally double range

    can just use increasingly more LSBs of h(key)

Must keep track of global depth:

how many times the directory has doubled so far

Must keep track of local depth:

how many times each bucket has been split
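Putting the directory, LSB indexing, and the two depths together, a simplified extendible-hashing sketch might look like this (my own illustration, not the lecture's code; it assumes integer keys with h(k) = k, and ignores recursive splits and overflow handling during redistribution):

```python
class ExtendibleHash:
    """Simplified extendible hashing with an LSB-indexed directory."""

    def __init__(self, bucket_size=4):
        self.bucket_size = bucket_size
        self.global_depth = 0                 # times the directory doubled
        bucket = {"depth": 0, "items": []}    # local depth + contents
        self.directory = [bucket]             # 2**global_depth pointers

    def _dir_index(self, key):
        # Use the last global_depth LSBs of h(key) to index the directory.
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._dir_index(key)]
        if len(bucket["items"]) < self.bucket_size:
            bucket["items"].append(key)
            return
        if bucket["depth"] == self.global_depth:
            # Local depth = global depth: double the directory by copying.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split the overflowed bucket on one more LSB.
        bucket["depth"] += 1
        bit = 1 << (bucket["depth"] - 1)
        image = {"depth": bucket["depth"], "items": []}
        pending, bucket["items"] = bucket["items"] + [key], []
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:       # re-point half the slots
                self.directory[i] = image
        for k in pending:                     # redistribute the entries
            self.directory[self._dir_index(k)]["items"].append(k)

# Inserting a 5th key into a full bucket doubles the directory once:
eh = ExtendibleHash(bucket_size=4)
for k in (0, 1, 2, 3, 4):
    eh.insert(k)
```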

  • 20/38

  • 21/38

    Insert h(r)=20 (Causes Doubling)

[Figure: before the insert, global depth = 2, directory slots 00 .. 11; Bucket A = 4* 12* 32* 16* (full), Bucket B = 1* 5* 21* 13*, Bucket C = 10*, Bucket D = 15* 7* 19*, all with local depth 2. Inserting 20* overflows Bucket A: the directory doubles to slots 000 .. 111 (global depth 3), and A splits into Bucket A (32* 16*) and Bucket A2, the `split image' of A (4* 12* 20*), both with local depth 3; B, C, D keep local depth 2.]

  • 22/38

    Now Insert h(r)=9 (Causes Split Only)

[Figure: next, inserting 9* overflows Bucket B, whose local depth (2) is less than the global depth (3), so only the bucket splits and the directory stays the same size: Bucket B keeps 1* 9*, its split image Bucket B2 gets 5* 21* 13*, both now at local depth 3; only directory pointers change.]

  • 23/38

    Points to Note

20 = binary 10100. Last 2 bits (00) tell us r belongs in A or A2. Last 3 bits needed to tell which.

Global depth of directory: max # of bits needed to tell which bucket an entry belongs to.

Local depth of a bucket: # of bits used to determine if an entry belongs to this bucket.

When does a bucket split cause directory doubling?

If, before the insert, local depth of bucket = global depth. The insert causes local depth to become > global depth; the directory is doubled by copying it over and `fixing' the pointer to the split image page.

(Use of least significant bits enables efficient doubling via copying of the directory!)

  • 24/38

Directory Doubling

Why use least significant bits in directory? Allows for doubling via copying!

[Figure: a directory with global depth 2 (slots 00 .. 11) is doubled to global depth 3 (slots 000 .. 111) by copying it; with LSB indexing, each slot in the new half initially points to the same bucket as the corresponding slot in the old half.]
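In code, LSB-indexed doubling really is just a copy (illustrative Python, using strings as stand-ins for bucket pages):

```python
# Doubling an LSB-indexed directory is just concatenating it with itself:
# slot i and slot i + 2**d keep pointing at the same buckets, so nothing
# moves until the one overflowed bucket is split and re-pointed.
old = ["A", "B", "C", "D"]        # directory at global depth 2 (slots 00..11)
new = old + old                   # directory at global depth 3 (slots 000..111)
assert new[0b010] is new[0b110]   # both halves alias the same bucket
```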

  • 25/38

    Deletion

Delete: If removal of data entry makes a bucket empty, it can be merged with its `split image'. If each directory element points to the same bucket as its split image, can halve the directory.

[Figure: deleting 10* leaves its bucket empty; the bucket is merged with its split image, and the merged bucket's local depth drops from 2 to 1.]

  • 26/38

    Deletion (cont)

Delete: If removal of data entry makes a bucket empty, it can be merged with its `split image'. If each directory element points to the same bucket as its split image, can halve the directory.

[Figure: after also deleting 15*, every directory element points to the same bucket as its split image, so the directory is halved from slots 00 .. 11 down to slots 0 .. 1.]

  • 27/38

    Comments on Extendible Hashing

If directory fits in memory, equality search answered with one disk access; else two.

100MB file, 100 bytes/rec, 4K pages contains 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that directory will fit in memory.

Directory grows in spurts, and, if the distribution of hash values is skewed, directory can grow large.

Biggest problem:

Multiple entries with same hash value cause problems!

If bucket already full of same hash value, will keep doubling forever! So must use overflow buckets if duplicates.

  • 28/38

    Linear Hashing

This is another dynamic hashing scheme, an alternative to Extendible Hashing.

LH handles the problem of long overflow chains without using a directory, and handles duplicates.

Idea: Use a family of hash functions h0, h1, h2, ...

hi(key) = h(key) mod (2^i * N); N = initial # of buckets

h is some hash function (range is not 0 to N-1). If N = 2^d0, for some d0, hi consists of applying h and looking at the last di bits, where di = d0 + i.

hi+1 doubles the range of hi (similar to directory doubling)
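The hash-function family can be written directly from the definition above (h is taken to be the identity purely for illustration):

```python
N = 4                                   # initial # of buckets, N = 2**d0

def h(key):
    return key                          # identity, purely for illustration

def h_i(i, key):
    """i-th member of the family: h_i(key) = h(key) mod (2**i * N),
    i.e. the last d0 + i bits of h(key)."""
    return h(key) % ((2 ** i) * N)

# h_{i+1} doubles the range of h_i:
# h_0 uses the last 2 bits, h_1 the last 3 bits, and so on.
```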

  • 29/38

Linear Hashing Example

Let's start with N = 4 buckets.

Start at round 0, Next = 0, with 2^round * N buckets. Each time any bucket fills, split the Next bucket.

If (0 <= h_round(key) < Next), use h_round+1(key) instead.

[Figure (h(k) = k): start with bucket 0 = 4* 12* 32* 16* (full), bucket 1 = 1* 21* 13* 5*, bucket 2 = 10*, bucket 3 = 15*. Adding 9* fills bucket 1, so bucket Next = 0 is split using h1: bucket 0 keeps 32* 16*, new bucket 4 gets 4* 12*, and Next advances to 1. Adding 20*: h0(20) = 0 < Next, so h1(20) = 4 sends it to bucket 4.]

  • 30/38

Overflow chains do exist, but eventually get split. Instead of doubling, new buckets are added one at a time.

    Linear Hashing Example (cont)

[Figure: continuing the example with Add 6 and Add 17; overflow chains form on not-yet-split buckets, and each trigger splits bucket Next, which advances round-robin.]

  • 31/38

    Linear Hashing (Contd.)

Directory avoided in LH by using overflow pages, and choosing bucket to split round-robin.

Splitting proceeds in `rounds'. Round ends when all N_R initial (for round R) buckets are split. Buckets 0 to Next-1 have been split; Next to N_R yet to be split.

Current round number also called Level.

Search: To find bucket for data entry r, find h_round(r):

If h_round(r) in range `Next to N_R', r belongs here.

Else, r could be h_round(r) or h_round(r) + N_R; must apply h_round+1(r) to find out.
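The search rule can be expressed as a small hypothetical helper (the names are mine; h(k) = k for illustration):

```python
def lh_bucket(key, round_, next_, n0, h=lambda k: k):
    """Locate the bucket for `key` in a linear-hashed file with initial
    size n0, current round number, and split pointer Next."""
    n_round = (2 ** round_) * n0
    b = h(key) % n_round                  # apply h_round first
    if b < next_:                         # bucket already split this round
        b = h(key) % (2 * n_round)        # h_{round+1} picks b or b + n_round
    return b
```

For example, with n0 = 4, round 0, and Next = 1, key 4 re-hashes into the split image (bucket 4), while key 5 stays in bucket 1.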

  • 32/38

    Overview of LH File

In the middle of a round:

[Figure: buckets 0 to Next-1 were split in this round, each with a `split image' bucket created through splitting; bucket Next is the next to be split; buckets Next through the end of the round's initial range existed at the beginning of this round, and this is the range of h_Level(search key value). If h_Level(search key value) falls in the already-split range, must use h_Level+1(search key value) to decide if the entry is in the `split image' bucket.]

  • 33/38

    Linear Hashing (Contd.)

Insert: Find bucket by applying h_round / h_round+1:

If bucket to insert into is full:

Add overflow page and insert data entry.

(Maybe) Split Next bucket and increment Next.

Can choose any criterion to `trigger' split.

Since buckets are split round-robin, long overflow chains don't develop!

Doubling of directory in Extendible Hashing is similar; switching of hash functions is implicit in how the # of bits examined is increased.
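A compact sketch tying insert, overflow, and round-robin splitting together (a hypothetical class of my own; the split trigger here is `any overflow', one of several possible criteria, and h(k) = k for illustration):

```python
class LinearHash:
    """Sketch of linear hashing: entries beyond page_size model the
    overflow chain, splits proceed round-robin via Next."""

    def __init__(self, n0=4, page_size=4):
        self.n0, self.page_size = n0, page_size
        self.round, self.next = 0, 0
        self.buckets = [[] for _ in range(n0)]   # primary page + overflow

    def _addr(self, key):
        n_round = (2 ** self.round) * self.n0
        b = key % n_round                 # h_round
        if b < self.next:                 # already split this round
            b = key % (2 * n_round)       # h_{round+1} decides
        return b

    def insert(self, key):
        bucket = self.buckets[self._addr(key)]
        overflow = len(bucket) >= self.page_size
        bucket.append(key)                # beyond page_size = overflow page
        if overflow:                      # trigger: split bucket Next
            n_round = (2 ** self.round) * self.n0
            self.buckets.append([])       # split image, bucket Next + n_round
            pending, self.buckets[self.next] = self.buckets[self.next], []
            self.next += 1
            for k in pending:             # re-hash with h_{round+1}
                self.buckets[k % (2 * n_round)].append(k)
            if self.next == n_round:      # all initial buckets split
                self.round, self.next = self.round + 1, 0

# Three keys that all hash to bucket 0 force one split:
lh = LinearHash(n0=4, page_size=2)
for k in (0, 4, 8):
    lh.insert(k)
```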

  • 34/38

    Another Example of Linear Hashing

On split, h_Level+1 is used to re-distribute entries.

[Figure: Round = 0, N = 4, Next = 0; primary pages hold 32* 44* 36* / 9* 25* 5* / 14* 18* 10* 30* / 31* 35* 7* 11*; a data entry r with h(r) = 5 goes to primary bucket page 01. After inserting 43* (which overflows bucket 11), bucket 0 is split with h1: 32* stays in 000, 44* 36* move to new bucket 100, Next = 1, and 43* sits on an overflow page.]

  • 35/38

    Example: End of a Round

[Figure: left, Round = 0 with Next = 3 and an overflow chain (43*) on bucket 11. An insert triggers the split of bucket 3, the last bucket of the round: 31* 7* move to new bucket 111, while 35* 11* 43* remain in 011 with an overflow page. Next resets to 0 and Round becomes 1, with h1 now covering buckets 000 .. 111.]

  • 36/38

    LH Described as a Variant of EH

The two schemes are actually quite similar:

Begin with an EH index where directory has N elements.

Use overflow pages, split buckets round-robin.

First split is at bucket 0. (Imagine directory being doubled at this point.) But elements <1, N+1>, <2, N+2>, ... are the same. So, need only create directory element N, which differs from element 0, now.

When bucket 1 splits, create directory element N+1, etc.

So, directory can double gradually. Also, primary bucket pages are created in order. If they are allocated in sequence too (so that finding the i-th is easy), we actually don't need a directory! Voila, LH.

  • 37/38

    Summary

Hash-based indexes: best for equality searches, cannot support range searches.

Static Hashing can lead to long overflow chains.

Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.)

Directory to keep track of buckets, doubles periodically.

Can get large with skewed data; additional I/O if this does not fit in main memory.

  • 38/38

    Summary (Contd.)

Linear Hashing avoids directory by splitting buckets round-robin, and using overflow pages.

Overflow pages not likely to be long.

Duplicates handled easily.

Space utilization could be lower than Extendible Hashing, since splits not concentrated on `dense' data areas.

Can tune criterion for triggering splits to trade off slightly longer chains for better space utilization.

For hash-based indexes, a skewed data distribution is one in which the hash values of data entries are not uniformly distributed!