1 optimization recap and examples. 2 optimization introduction for every sql expression, there are...
Post on 20-Dec-2015
215 views
TRANSCRIPT
1
Optimization Recap and Optimization Recap and examplesexamples
2
Optimization introductionOptimization introduction
• For every SQL expression, there are
many possible ways of implementation.
• The different alternatives could result
in huge run-time differences.
• Our aim is to introduce the basic
hardware used, and optimization
principles
3
Disk-Memory-CPUDisk-Memory-CPU
Delete from Sailors where
sid=90
DISK
sailors
Reserves
Main Memory
CPU
4
Hardware RecapHardware Recap• The DB is kept on the Disk.
• The Disk is divided into BLOCKS
• Any processing of the information occurs in the Main Memory.
• Therefore, a block which we want to access has to be brought from the Disk to the memory, and perhaps written back.
• Blocks are read/written from/to the Disk as single units.
• The time of reading/writing a block to/from the disk is an I/O operation, and takes a lot of time.
5
Hardware RecapHardware Recap
• We assume a constant time for each Disk access,
and that only disk access define the run time. • We do not consider writing to the disk
• Every table in the DB is stored as a File (on the Disk), which is a ‘bunch of Blocks’.
• We will deal with files that are ‘heap-sorted’, i.e., there is no order in the file tuples
• Every block contains many tuples, each of them has a Record ID (RID), which states its location:
(number of block, number of tuple within the block)
6
SIDSNAM
E
ratin
g
ag
e
192
3
Joe832
332
1
Phil941
133
2
Boe733
122
6
Bill623
144
4
Paul121
111
2
Jim333
144
5
Vicky954
RID
Block 1
Block 2
Block 3
(1,1)
(1,2)
(1,3)
(2,1)
(2,2)
(2,3)
(3,1)
7
SIDSNAM
E
ratin
g
ag
e
192
3
Joe832
332
1
Phil941
133
2
Boe733
122
6
Bill623
144
4
Paul121
111
2
Jim333
144
5
Vicky954
B blocks
t tuples
Q: What would be the cost of the following queries?
Select * from sailors
Select * from sailors where sname=‘Jim’
Select * from sailors where rating>4
Answer: B
8
Indexes on filesIndexes on files
• An Index on a table is an additional file
which helps access the data fast.
• An index holds ‘data entries’ to the table file
• The index can have the structure of a B+
Tree, or a hash function.
9
Tree index Tree index on sname of on sname of
sailorssailors
‘A’->’M’ B1
‘N’->’Z’ B2
‘N’->’T’ L3
‘U’->’Z’ L4
‘A’->’G’ L1
‘H’->’M’ L2
Root block
Leaf blocks
Branch blocks
‘Bill’(2,1)
‘Boe’(1,3)
‘Vicky’ (3,1)
…
‘Paul’ (2,2)
‘Phil’ (1,2)
‘Jim’(2,3)
‘Joe’(1,1)
B1
L4L3L2L1
B2
SIDSNAM
E
ratin
gage
1923Joe832
3321Phil941
1332Boe733
1226Bill623
1444Paul121
1112Jim333
1445Vicky954
10
Tree indexTree index
• The tree is kept balanced
• The tree entries are always ordered
• The leaves point to the exact location of
tuples
• Getting to the leaf is typically 2-3 I/O
• Each leaf points to the next/previous leaf
• A Clustered index means that the index and
the table are ordered by the same attribute
11
Tree index Tree index on sname of on sname of
sailorssailors
‘Bill’(2,1)
‘Boe’(1,3)
‘Phil’ (1,2)
…
‘Joe’ (2,2)
‘Joe’ (3,1)
‘Jim’(2,3)
‘Joe’(1,1)
B1
L4L3L2L1
B2
SIDSNAM
E
ratin
gage
1923Joe832
3321Phil941
1332Boe733
1226Bill623
1444Joe121
1112Jim333
1445Joe954
How would the following queries be processed?
Select * from sailors where sname=‘Joe’
Select * from sailors
Select * from sailors where sname>’J’
Notice: index Notice: index is not is not
clusteredclustered
12
Tree index Tree index on sname of on sname of
sailorssailors
‘Bill’(1,1)
‘Boe’(1,2)
‘Phil’ (3,1)
…
‘Joe’ (2,2)
‘Joe’ (2,3)
‘Jim’(1,1)
‘Joe’(2,1)
B1
L4L3L2L1
B2
SIDSNAM
E
ratin
gage
1226Bill6 23
1332Boe733
1112Jim333
1923Joe832
1444Joe121
1445Joe954
3321Phil941
How would the following queries be processed?
Select * from sailors where sname=‘Joe’
Select * from sailors
Select * from sailors where sname>’J’
Notice: index Notice: index is clusteredis clustered
13
Hash indexHash index
• Works in a similar way, but using a
hash function instead of a tree
• Works only for equality conditions
• Average of 1.2 I/O to get to the tuple
location
14
Natural Join Natural Join
• We want to compute
• Naïve algorithm:
SELECT *
FROM Reserves R, Sailors S
WHERE R.sid = S.sid
Foreach tuple r in RForeach tuple s in S
if r.sid=s.sidadd r,s to result
Cost: BR+tR*BS
Running example data
tR=5000
tS=10,000
50 tuples per block
12 buffer pages
= 100+5000*200=1,000,100
15
Natural Join Natural Join • We want to compute
• We have 4 optional algorithms:
1. Block Nested Loops Join
2. Index Nested Loops Join
3. Sort Merge Join
4. Hash Join
SELECT *
FROM Reserves R, Sailors S
WHERE R.sid = S.sid
This is assuming there is not enough
space in the memory for the smaller of the 2
relations+2
16
Block Nested Loop JoinBlock Nested Loop Join
• Suppose there are B available blocks in the
memory, BR blocks of relation R, and BS
blocks of relations S, and BR<BS
• Until all blocks of R have been read:
– Read B-2 blocks of R
– Read all blocks of S (one by one), and write the
result
• Run time: BR + BS * ceil(BR /(B-2))= 100+200*100/10=2,100
17
Index Nested LoopIndex Nested Loop
• Suppose there is an index on sid of Sailors
• Until all blocks of R have been read:– Read a block of R
– For each tuple in the block, use the index of S to locate the matching tuples in S.
• We mark the time it takes to read the tuples in S that match a single tuple in R as X.
• Run time: BR + tR*X
• If the index is clustered, X=2-4
• If it is not clustered, we evaluate X.= 100+5000*3=15,100
18
• Q: So when would we typically choose
to use an index-nested loop over block-
nested?
• A: Look at the inequality…
19
Sort-Merge JoinSort-Merge Join
• Sort both relations on the join column
• Join them according to the join algorithm:
sidbiddayagent
2810312/4/96Joe
2810311/3/96Frank
3110110/2/96Joe
3110212/7/96Sam
3110113/7/96Sam
581032/6/96Frank
sidsnameratingage
22dustin745
28yuppy935
31lubber855
36lubber636
44guppy535
58rusty1035
20
Run time of Sort-MergeRun time of Sort-Merge
• M,N: number of blocks of the relations
• Sorting: MlogM+NlogN
• Merging: N+M if no partition is scanned twice.
• Total: MlogM+NlogN+N+M
• Especially good if one or both of the relations are already sorted.
= 100*7+200*8+100+200=2,600
21
QuestionQuestion
Suppose:
tuple size= 100 bytes
number of tuples (employees)=3,000
Page size=1000 bytes
You have an unclustered index on Hobby.
You know that 50 employees collect stamps.
Would you use the index?
And for 1,000 stamp-lovers?
SELECT E.dnoFROM Employees EWHERE E.hobby=‘stamps’
22
Question 2Question 2
• Length of tuples, Number of tuples– Emp: 20 bytes, 20,000 tuples
– Dept: 40 bytes, 5000 tuples
• Pages contain 4000 bytes; 12 buffer pages
• Which algorithm would you use if there is an unclustered tree index on E.eid? And clustered?
SELECT E.ename FROM Employees E, Departments DWHERE E.eid=D.eid