access path selection in a relation database management system (summarized in section 2)

Access Path Selection in a Relation Database Management

System(summarized in section 2)

Processing an SQL statement

• parsing, optimization, code generation, execution

• an SQL statement may have many query blocks (nesting)

Optimizer

• validates parsed query• collects statistics on referenced relations & columns• discovers available access paths for each relation• checks for type errors in expressions• Access path selection:

– determines order of evaluation of query blocks– a tree of alternate path choices is created for each query block with

more than one relation– minimum cost access path is chosen from the tree

• results of optimizer is passed to code generation and execution components

RSS (Research Storage System)

• storage manager for System R

• Maintains physical storage, access paths, locking, logging, and recovery• Relations are stored as a collection of tuples• tuples are stored on 4K pages; pages are organized into segments• segments completely contain one or more relations• tuples are accessed via a scan: sequential scan or index scan• indexes are B-trees with linked leaves• sequential scan touches all the pages of a segment that contains a relation once• index scans touch all the leaf pages of the index once; relation pages >=1 times• if index and data tuples are in the same order, the data is “clustered”• scans may takes a set of predicates to apply to a tuple before returning it

– predicates are of the form (column op value)

Cost computation

• cost = page fetches + W*(RSI calls) – cost = IO costs + W * CPU costs

• an index that matches a boolean factor of the query is an efficient access path

Statistics

• NCARD(T): cardinality of the relation T• TCARD(T): number of pages used for T• P(T): fraction of pages in a segment used for

T

• ICARD(I): number of distinct keys in index I• NINDX(I): number of pages in index I

Selectivity

• column = value : F = 1/ICARD(column) if there is an index. F = 1/10 otherwise

• column1 = column2: F = 1/MAX(ICARD(column1), ICARD(column2)); F = 1/ICARD(column i); F = 1/10

• column > value: F = (high key value - value) / (high key - low key)

• column between value1 and value2: F= (value2 - value1)/ (high key - low key)

• column IN (list of values): F = (# of items in list) * (selectivity for column = value)max is 1/2

• columnA IN subquery: F = (card. of subquery) / ( card. of subquery relations)

• (pred1) OR (pred2): F = F(pred1) + F(pred2) - F(pred1) * F(pred2)

• (pred1) AND (pred2): F = F(pred1) * F(pred2)

• NOT pred: F = 1 - F(pred)

QCARD

• QCARD is ( card. of all relations) * ( F(pred i))

• RSICARD is the expected number of calls to RSI ( card. of all relations) * ( F(sargable pred i))

• An “interesting order” is an order specified by the GROUP BY or ORDER BY clause

• Single relation cost: cheapest access path which produces the “interesting order” or cheapest access path plus sorting cost of result

Cost Table (p. 515)

• index pages fetched plus data pages fetched plus W times RSI tuple retrieval calls.

• unique index matching an equal predicate: 1+1+W

• clustered index I matching one or more boolean factors: F(preds) *(NINDX(I) + TCARD) + W * RSICARD

• etc…

Joins

• nested loops and merging scans

• merging scans require sorts on the join column -- another “interesting order”

• n-way joins can be done by a succession of 2-way joins; not necessarily using the same technique. Results may be pipelined if a sort is not required.

Join ordering

• n! permutations of relation join orders

• join of (k+1) relation with previous k relations is independent of first k join order

• avoid Cartesian products when possible; make them as late as possible

Construct a tree

• construct a tree of possible join orderings: keep the cheapest order that produces an interesting ordering.

• First find the best way to access each single relation for each interesting ordering and unordered.

• Next, find the best way of joining any relation to each of these.

• Repeat until all relations have been added to each branch• Choose the cheapest strategy that has an interesting

ordering, or the cheapest strategy plus a sort.• Total number of solutions to store: 2n

access path selection in a relation database management system (summarized in section 2)

Documents