query optimization r&g, chapter 15 lecture 17. administrivia homework 3 available from class...

28
Query Optimization R&G, Chapter 15 Lecture 17

Post on 20-Dec-2015

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

Query Optimization

R&G, Chapter 15Lecture 17

Page 2: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

Administrivia

• Homework 3 available from class website– Due date: Tuesday, March 20 by end of class period

• Homework 4 available today– Implement nested loops and hash join operators for (new!)

minibase– Due date: April 10 (after Spring Break)

• Midterm 2 is 3/22, 2 weeks from today– In class, covers lectures 10-17– Review will be held Tuesday 3/20 7-9 pm 306 Soda Hall

• Internships at Google this summer…– See http://www.postgresql.org/developer/summerofcode– Booth at the UCB TechExpo this Thursday:– http://csba.berkeley.edu/tech_expo.html– Contact [email protected]

Page 3: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

Review

Query Optimizationand Execution

Relational Operators

Files and Access Methods

Buffer Management

Disk Space Management

DB

We are here

•Query plans are a tree of operators that compute the result of a query•Optimization is the process of picking the best plan•Execution is the process of executing the plan

Page 4: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

Query Plans: turning text into tuples

SELECT A.aname, max(F.feedingshift)FROM Animals A, Feeding FWHERE A.aid = F.aid AND(A.species = 'Big Cat' or A.species = 'Bear')GROUP BY A.aname

HAVING COUNT(*) > 1

Aslan 3

Bageera 3

Elsa 3

Shere Khan 3

Tigger 2

Name Shift

10 2 1 100 3

10 3 2 100 3

20 3 2 100 3

20 2 3 100 3

30 3 2 100 3

… … … … …

40 100 Aslan Big Cat

50 300 Baloo Bear

60 100 Bageera Big Cat

70 100Shere Khan

Big Cat

90 100 Dumbo Elephant

… … … …

OperatorsQuery Plan

Query ResultQuery

Page 5: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

Query Optimization steps

1. Parse query from text to ‘intermediate model’

2. Traverse ‘intermediate model’ and produce alternate query plans– Query plan = relational algebra

tree– Plan cost = cumulative cost of tree– Consider equivalent but alternative

relational algebra tress– Optimizer keeps track of cost and

properties of plans

3. Pick the cheapest plan

Query parser

Query optimizer

SELECT A.aname, max(F.feedingshift)FROM Animals A, Feeding FWHERE A.aid = F.aid AND(A.species = 'Big Cat' or A.species = 'Bear')GROUP BY A.aname

HAVING COUNT(*) > 1

Block 2Block 1

Block 3

Cost = 200 Cost = 150 Cost = 500

To execution engine

Page 6: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

500,500 IOs

Optimized query is 124x cheaper than the original!

Sailors Reserves

sid=sid

bid=100 rating > 5

sname

(Page-Oriented Nested loops)

(On-the-fly)

(On-the-fly)

ReservesSailors

sid=sid

bid=100

sname

(Page-Oriented Nested loops)

(On-the-fly)

rating>5

(Scan &Write totemp T2)(On-the-fly)

4010 IOs

SELECT S.snameFROM Reserves R, Sailors SWHERE R.sid=S.sid AND R.bid=100 AND S.rating>5

Page 7: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

System R-Style Optimization• Impact:

– Most widely used currently; works well for < 10 joins.• Cost estimation:

– Plan cost = cumulative cost of plan tree– Cost = function(I/O, CPU) costs– Very inexact, but works ok in practice.– Statistics, maintained in system catalogs, used to

estimate cost of operations and result sizes.• Plan Space: Too large, must be pruned.

– Many plans share common, “overpriced” subtrees• ignore them all!

– Only left-deep plans are considered.• May cause optimizer to miss good plans

– Avoid Cartesian products.

Page 8: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

Query Optimization steps

1. Parse query from text to ‘intermediate model’-> A list of query blocks

2. For each query block, pick the best plan– Convert block to tree of

relational algebra operators– Build alternative plans

bottom up:– Leaf nodes represent

access paths to relations– Consider reordering

relational algebra tree– Push selections and

projections down– Consider left-deep join

plans– Avoid cross products – Consider all possible join

methods– Prune expensive plans with

the same properties

Block 2Block 1

Block 3

BNL Cost = 200no sorted order

BNL Cost = 150no sorted order

X

SM Cost = 300Sorted by bid

HJ Cost = 125no sorted order

SM Cost = 300Sorted by bid

HJ Cost = 125no sorted order

X

X

X

X

Rel 1 access path

Rel 2 access path

Page 9: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

Schema for Examples

• Reserves:– Each tuple is 40 bytes long, 100 tuples per page,

1000 pages. 100 distinct bids.• Sailors:

– Each tuple is 50 bytes long, 80 tuples per page, 500 pages. 10 ratings, 40,000 sids.

Sailors (sid: integer, sname: string, rating: integer, age: real)Reserves (sid: integer, bid: integer, day: dates, rname: string)

Page 10: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

Translating SQL to Relational AlgebraSELECT S.sid, MIN (R.day)FROM Sailors S, Reserves R, Boats BWHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red” AND S.rating > 5GROUP BY S.sidHAVING COUNT (*) >= 2

For each sailor with a rating > 5 that has reserved at least 2 red boats, find the sailor id and the earliest date on which the sailor has a reservation for a red boat.

Page 11: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

HAVING COUNT(*)>2

S.sid, MIN(R.day)

Translating SQL to Relational AlgebraSELECT S.sid, MIN (R.day)

FROM Sailors S, Reserves R, Boats BWHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red” AND S.rating > 5GROUP BY S.sidHAVING COUNT (*) >= 2

Sailors Reserves Boats

B.color = “red”

GROUP BY S.Sid

V

S.rating > 5

Page 12: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

• Allow us to choose different join orders and to `push’ selections and projections ahead of joins.

• Selections can be cascaded:

c1…cn(R) c1(…(cn(R))…)

Relational Algebra Equivalences

SELECT S.sid, MIN (R.day)FROM Sailors S, Reserves R, Boats BWHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red” AND S.rating > 5GROUP BY S.sidHAVING COUNT (*) >= 2

HAVING COUNT(*)>2

B.color = “red”

GROUP BY S.Sid

S.sid, MIN(R.day)

Sailors

Reserves BoatsS.rating > 5

Can apply these predicates separately•Can ‘push’ S.rating > 5 down to Sailors

Page 13: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

• Selections can be commuted:

c1(c2(R)) c2(c1(R))

Relational Algebra Equivalences

SELECT S.sid, MIN (R.day)FROM Sailors S, Reserves R, Boats BWHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red” AND S.rating > 5GROUP BY S.sidHAVING COUNT (*) >= 2

HAVING COUNT(*)>2

S.Rating > 5

GROUP BY S.Sid

S.sid, MIN(R.day)

Boats

Reserves SailorsB.color = “red”

Can apply these predicates in different order

Page 14: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

• Projections can be cascaded:

a1(R) a1(…(a1, …, an(R))…)

Relational Algebra Equivalences

SELECT S.sid, MIN (R.day)FROM Sailors S, Reserves R, Boats BWHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red” AND S.rating > 5GROUP BY S.sidHAVING COUNT (*) >= 2

HAVING COUNT(*)>2

B.color = “red”

GROUP BY S.Sid

S.sid, MIN(R.day)

Reserves Boats

S.rating > 5

Sailors

S.sid

Can project S.sid to reduce size of tuples

Page 15: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

Relational Algebra Equivalences

• Eager projection– Can cascade and “push”

some projections thru selection

– Can cascade and “push” some projections below one side of a join

– Rule of thumb: can project anything not needed “downstream”

HAVING COUNT(*)>2

B.color = “red”

GROUP BY S.Sid

BoatsS.rating > 5

Sailors

S.sid

Reserves

R.sid, R.bid, R.day

SELECT S.sid, MIN (R.day)FROM Sailors S, Reserves R, Boats BWHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red” AND S.rating > 5GROUP BY S.sidHAVING COUNT (*) >= 2

??

B.bid, B.color

??

S.sid, R.day

S.sid, MIN(R.day)

??

??

Page 16: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

• Cartesian Product and Joins

– R (S T) (R S) T (associative)

– R S S R (commutative)– This means we can do joins in any order.

• But…beware of cartesian product!– Only consider R X S if there is a join predicate between

R & Select S.sid, R.bid, B.bidFrom Sailors S, Reserves R, Boats BWhere S.sid = R.sid and R.bid = B.bid

What about this query?Select S.sid, B.bidFrom Sailors S, Reserves R, Boats BWhere S.sid = R.sid

Relational Algebra Equivalences

No need to consider plans with S X B

User asked for Cartesian product so you have to compute it!

Page 17: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

Cost Estimation• For each plan considered, must estimate total cost:

– Must estimate cost of each operation in plan tree.– Plan cost = cumulative cost of the operators– Operator cost is computed from:

• Input cardinalities.• Cost of each operator

– e.g. (sequential scan, index scan, joins, etc.)

– Must estimate size of result for each operation in tree!• It will contribute to the input cardinality for the next operator

up the tree!• Computed from cardinality of input relations + selecitivity of

the predicates.• For selections and joins, assume predicates are independent.

– Q: Is “cost” the same as estimated “run time”?

Page 18: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

Statistics and Catalogs• Optimizer needs statistics about relations and their

indexes to compute cardinality and selectivity estimates.

• System Catalogs contain metadata about tables and relations– Just another set of relations; you can query them like any

other table!• For optimizer, they typically contain:

– # tuples (cardinality) and # pages (NPages) per rel’n.– # distinct key values (NKeys) for each index.– low/high key values (Low/High) for each index.– Index height (IHeight) for each tree index.– # index pages (INPages) for each index.

• Statistics must be kept up to date– Updating whenever data changes is too expensive; lots of

approximation anyway, so slight inconsistency ok.– Updated periodically

• E.g. DB2: runstats on table feeding– Statistics out of date -> optimizer choosing very bad plans!

Page 19: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

The optimizer relies on good statistics

SELECT S.sname, R.bidFROM Sailors S, Reserves RWHERE S.sid = R.sid

T1: Create table Sailors(…)T2: Create table Reserves(…)T3: Runstats on table SailorsT4: Runstats on table ReservesT5: Insert 100 pages of Sailors into SailorsT6: Runstats on SailorsT6: Insert 1000 pages of Reservations into ReservesT7: Run Query:

Table StatsName NumPagesSailors 1Reserves 1

X 100

•Assume 7 pages of memory and only BNL

BNLJ: SailorsXReserves 100 + 100/5x1 = 120

Optimizer would incorrectly pick this plan!

ReservesXSailors1 + 1/5x100 = 101

Page 20: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

Size Estimation and Reduction Factors

• Consider a query block:• Maximum Cardinality = product of the cardinalities of

relations in the FROM clause.• Reduction factor (RF) associated with each predicate

reflects the impact of the predicate in reducing result size. – Result cardinality = Max # tuples * product of all RF’s.

• RF usually called “selectivity”– only R&G seem to call it Reduction Factor

SELECT attribute listFROM relation listWHERE pred1 AND ... AND predk

Page 21: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

Result Size Estimation• Result cardinality =

Max # tuples * product of all RF’s.• Term col=value (given index I on col, or

knowledge about # column values)RF = 1/NKeys(I)

• Term col1=col2 (This is handy for joins too…)RF = 1/MAX(NKeys(I1), NKeys(I2))

• Term col>valueRF = (High(I)-value)/(High(I)-Low(I))

(Implicit assumptions: values are uniformly distributed and predicates are independent!)

• Question: What if the optimizer has no statistics?

Page 22: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

No statistics is a common case

/* default selectivity estimate for equalities such as "A = b" */

#define DEFAULT_EQ_SEL 0.005

/* default selectivity estimate for inequalities such as "A < b" */

#define DEFAULT_INEQ_SEL 0.3333333333333333

/* default selectivity estimate for range inequalities "A > b AND A < c" */

#define DEFAULT_RANGE_INEQ_SEL 0.005

/* default selectivity estimate for pattern-match operators such as LIKE */

#define DEFAULT_MATCH_SEL 0.005

/* default number of distinct values in a table */

#define DEFAULT_NUM_DISTINCT 200

/* default selectivity estimate for boolean and null test nodes */

#define DEFAULT_UNK_SEL 0.005

#define DEFAULT_NOT_UNK_SEL (1.0 - DEFAULT_UNK_SEL)

•RF = 1/NKeys(I) for a column with an index• What about a non-index column,

e.g. R.day = ’01/31/2007’?•The original System R paper suggested 1/10 for this case, and many systems today still use that!•Here is what POSTGRES does:

Page 23: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

Estimating Join Cardinality

• Q: Given a join of R and S, what is the range of possible result sizes (in #of tuples)?– Suppose the join is on a key for R and S

• e.g. Students(sid, sname, did), Dorm(did,d.addr)

Select S.sid, D.addressFrom Students S, Dorms DWhere S.did = D.did

What is the cardinality? A student can only live in at most 1 dorm-> each S tuple can match with at most 1 D tuple-> cardinality (S join D) = cardinality of S

Page 24: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

• General case: join on {A} ({A} is key for neither)– estimate each tuple r of R generates uniform number of

matches in S and each tuple s of S generates uniform number of matches in R, e.g.

RF = min(NTuples(R) * NTuples(S)/NKeys(A,S) NTuples(S) * NTuples(R)/NKeys(A,R))

Sailors: 100 tuples, 75 unique names -> 1.3 tuples for each sailor name

Boats: 20 tuples, 10 unique names -> 2 tuples for each boat name

e.g. SELECT S.name, B.name FROM Sailors S, Boats B WHERE S.name = B.name

Estimating Join Cardinality

RF = 100*20/10 = 200RF = 20*100/75 = 26.6

Page 25: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

Plan Enumeration• A heuristic decision in System R:

only left-deep join trees are considered.– As the number of joins increases, the number of alternative

plans grows rapidly; we need to restrict the search space.– Left-deep trees allow us to generate all fully pipelined plans.

• Intermediate results not written to temporary files.• Not all left-deep trees are fully pipelined (e.g., SM join).

BA

C

D

BA

C

D

C DBA

Page 26: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

Enumeration of Left-Deep Plans

• Left-deep plans differ:– Order of relations– Access method for each relation, – Join method for each join.

• Enumerated using N passes (if N relations joined):– Pass 1: Find best 1-relation plan for each relation.– Pass 2: Find best way to join result of each 1-relation plan (as

outer) to another relation. (All 2-relation plans.) – Pass N: Find best way to join result of a (N-1)-relation plan (as

outer) to the N’th relation. (All N-relation plans.)

• For each subset of relations, retain only:– Cheapest plan overall, plus– Cheapest plan for each interesting order of the tuples.

Page 27: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

A Note on “Interesting Orders”

• An intermediate result has an “interesting order” if it is sorted by any of:

– ORDER BY attributes– GROUP BY attributes– Join attributes of yet-to-be-added

(downstream) joins

Page 28: Query Optimization R&G, Chapter 15 Lecture 17. Administrivia Homework 3 available from class website –Due date: Tuesday, March 20 by end of class period

Enumeration of Plans (Contd.)

• Avoid Cartesian products!– An N-1 way plan is not combined with an additional

relation unless there is a join condition between them, or all predicates in WHERE have been used up.

– i.e., consider a Cartesian product only if the user specified one!

• ORDER BY, GROUP BY, aggregates – handled as a final step, using either an `interestingly

ordered’ plan or an additonal sort/hash operator.• In spite of pruning plan space, System R

approach is still exponential in the # of tables.