1 optimization recap and examples. 2 optimization introduction for every sql expression, there are...

1

Optimization Recap and Optimization Recap and examplesexamples

2

Optimization introductionOptimization introduction

• For every SQL expression, there are

many possible ways of implementation.

• The different alternatives could result

in huge run-time differences.

• Our aim is to introduce the basic

hardware used, and optimization

principles

3

Disk-Memory-CPUDisk-Memory-CPU

Delete from Sailors where

sid=90

DISK

sailors

Reserves

Main Memory

CPU

4

Hardware RecapHardware Recap• The DB is kept on the Disk.

• The Disk is divided into BLOCKS

• Any processing of the information occurs in the Main Memory.

• Therefore, a block which we want to access has to be brought from the Disk to the memory, and perhaps written back.

• Blocks are read/written from/to the Disk as single units.

• The time of reading/writing a block to/from the disk is an I/O operation, and takes a lot of time.

5

Hardware RecapHardware Recap

• We assume a constant time for each Disk access,

and that only disk access define the run time. • We do not consider writing to the disk

• Every table in the DB is stored as a File (on the Disk), which is a ‘bunch of Blocks’.

• We will deal with files that are ‘heap-sorted’, i.e., there is no order in the file tuples

• Every block contains many tuples, each of them has a Record ID (RID), which states its location:

(number of block, number of tuple within the block)

6

SIDSNAM

E

ratin

g

ag

e

192

3

Joe832

332

1

Phil941

133

2

Boe733

122

6

Bill623

144

4

Paul121

111

2

Jim333

144

5

Vicky954

RID

Block 1

Block 2

Block 3

(1,1)

(1,2)

(1,3)

(2,1)

(2,2)

(2,3)

(3,1)

7

SIDSNAM

E

ratin

g

ag

e

192

3

Joe832

332

1

Phil941

133

2

Boe733

122

6

Bill623

144

4

Paul121

111

2

Jim333

144

5

Vicky954

B blocks

t tuples

Q: What would be the cost of the following queries?

Select * from sailors

Select * from sailors where sname=‘Jim’

Select * from sailors where rating>4

Answer: B

8

Indexes on filesIndexes on files

• An Index on a table is an additional file

which helps access the data fast.

• An index holds ‘data entries’ to the table file

• The index can have the structure of a B+

Tree, or a hash function.

9

Tree index Tree index on sname of on sname of

sailorssailors

‘A’->’M’ B1

‘N’->’Z’ B2

‘N’->’T’ L3

‘U’->’Z’ L4

‘A’->’G’ L1

‘H’->’M’ L2

Root block

Leaf blocks

Branch blocks

‘Bill’(2,1)

‘Boe’(1,3)

‘Vicky’ (3,1)

…

‘Paul’ (2,2)

‘Phil’ (1,2)

‘Jim’(2,3)

‘Joe’(1,1)

B1

L4L3L2L1

B2

SIDSNAM

E

ratin

gage

1923Joe832

3321Phil941

1332Boe733

1226Bill623

1444Paul121

1112Jim333

1445Vicky954

10

Tree indexTree index

• The tree is kept balanced

• The tree entries are always ordered

• The leaves point to the exact location of

tuples

• Getting to the leaf is typically 2-3 I/O

• Each leaf points to the next/previous leaf

• A Clustered index means that the index and

the table are ordered by the same attribute

11


sailorssailors

‘Bill’(2,1)

‘Boe’(1,3)

‘Phil’ (1,2)

…

‘Joe’ (2,2)

‘Joe’ (3,1)

‘Jim’(2,3)

‘Joe’(1,1)

B1

L4L3L2L1

B2

SIDSNAM

E

ratin

gage

1923Joe832

3321Phil941

1332Boe733

1226Bill623

1444Joe121

1112Jim333

1445Joe954

How would the following queries be processed?

Select * from sailors where sname=‘Joe’


Select * from sailors where sname>’J’

Notice: index Notice: index is not is not

clusteredclustered

12


sailorssailors

‘Bill’(1,1)

‘Boe’(1,2)

‘Phil’ (3,1)

…

‘Joe’ (2,2)

‘Joe’ (2,3)

‘Jim’(1,1)

‘Joe’(2,1)

B1

L4L3L2L1

B2

SIDSNAM

E

ratin

gage

1226Bill6 23

1332Boe733

1112Jim333

1923Joe832

1444Joe121

1445Joe954

3321Phil941

How would the following queries be processed?

Select * from sailors where sname=‘Joe’


Select * from sailors where sname>’J’

Notice: index Notice: index is clusteredis clustered

13

Hash indexHash index

• Works in a similar way, but using a

hash function instead of a tree

• Works only for equality conditions

• Average of 1.2 I/O to get to the tuple

location

14

Natural Join Natural Join

• We want to compute

• Naïve algorithm:

SELECT *

FROM Reserves R, Sailors S

WHERE R.sid = S.sid

Foreach tuple r in RForeach tuple s in S

if r.sid=s.sidadd r,s to result

Cost: BR+tR*BS

Running example data

tR=5000

tS=10,000

50 tuples per block

12 buffer pages

= 100+5000*200=1,000,100

15

Natural Join Natural Join • We want to compute

• We have 4 optional algorithms:

1. Block Nested Loops Join

2. Index Nested Loops Join

3. Sort Merge Join

4. Hash Join

SELECT *

FROM Reserves R, Sailors S

WHERE R.sid = S.sid

This is assuming there is not enough

space in the memory for the smaller of the 2

relations+2

16

Block Nested Loop JoinBlock Nested Loop Join

• Suppose there are B available blocks in the

memory, BR blocks of relation R, and BS

blocks of relations S, and BR<BS

• Until all blocks of R have been read:

– Read B-2 blocks of R

– Read all blocks of S (one by one), and write the

result

• Run time: BR + BS * ceil(BR /(B-2))= 100+200*100/10=2,100

17

Index Nested LoopIndex Nested Loop

• Suppose there is an index on sid of Sailors

• Until all blocks of R have been read:– Read a block of R

– For each tuple in the block, use the index of S to locate the matching tuples in S.

• We mark the time it takes to read the tuples in S that match a single tuple in R as X.

• Run time: BR + tR*X

• If the index is clustered, X=2-4

• If it is not clustered, we evaluate X.= 100+5000*3=15,100

18

• Q: So when would we typically choose

to use an index-nested loop over block-

nested?

• A: Look at the inequality…

19

Sort-Merge JoinSort-Merge Join

• Sort both relations on the join column

• Join them according to the join algorithm:

sidbiddayagent

2810312/4/96Joe

2810311/3/96Frank

3110110/2/96Joe

3110212/7/96Sam

3110113/7/96Sam

581032/6/96Frank

sidsnameratingage

22dustin745

28yuppy935

31lubber855

36lubber636

44guppy535

58rusty1035

20

Run time of Sort-MergeRun time of Sort-Merge

• M,N: number of blocks of the relations

• Sorting: MlogM+NlogN

• Merging: N+M if no partition is scanned twice.

• Total: MlogM+NlogN+N+M

• Especially good if one or both of the relations are already sorted.

= 100*7+200*8+100+200=2,600

21

QuestionQuestion

Suppose:

tuple size= 100 bytes

number of tuples (employees)=3,000

Page size=1000 bytes

You have an unclustered index on Hobby.

You know that 50 employees collect stamps.

Would you use the index?

And for 1,000 stamp-lovers?

SELECT E.dnoFROM Employees EWHERE E.hobby=‘stamps’

22

Question 2Question 2

• Length of tuples, Number of tuples– Emp: 20 bytes, 20,000 tuples

– Dept: 40 bytes, 5000 tuples

• Pages contain 4000 bytes; 12 buffer pages

• Which algorithm would you use if there is an unclustered tree index on E.eid? And clustered?

SELECT E.ename FROM Employees E, Departments DWHERE E.eid=D.eid

1 optimization recap and examples. 2 optimization introduction for every sql expression, there are...

Documents