access path selection in a relation database management system (summarized in section 2)
TRANSCRIPT
Access Path Selection in a Relation Database Management
System(summarized in section 2)
Processing an SQL statement
• parsing, optimization, code generation, execution
• an SQL statement may have many query blocks (nesting)
Optimizer
• validates parsed query• collects statistics on referenced relations & columns• discovers available access paths for each relation• checks for type errors in expressions• Access path selection:
– determines order of evaluation of query blocks– a tree of alternate path choices is created for each query block with
more than one relation– minimum cost access path is chosen from the tree
• results of optimizer is passed to code generation and execution components
RSS (Research Storage System)
• storage manager for System R
• Maintains physical storage, access paths, locking, logging, and recovery• Relations are stored as a collection of tuples• tuples are stored on 4K pages; pages are organized into segments• segments completely contain one or more relations• tuples are accessed via a scan: sequential scan or index scan• indexes are B-trees with linked leaves• sequential scan touches all the pages of a segment that contains a relation once• index scans touch all the leaf pages of the index once; relation pages >=1 times• if index and data tuples are in the same order, the data is “clustered”• scans may takes a set of predicates to apply to a tuple before returning it
– predicates are of the form (column op value)
Cost computation
• cost = page fetches + W*(RSI calls) – cost = IO costs + W * CPU costs
• an index that matches a boolean factor of the query is an efficient access path
Statistics
• NCARD(T): cardinality of the relation T• TCARD(T): number of pages used for T• P(T): fraction of pages in a segment used for
T
• ICARD(I): number of distinct keys in index I• NINDX(I): number of pages in index I
Selectivity
• column = value : F = 1/ICARD(column) if there is an index. F = 1/10 otherwise
• column1 = column2: F = 1/MAX(ICARD(column1), ICARD(column2)); F = 1/ICARD(column i); F = 1/10
• column > value: F = (high key value - value) / (high key - low key)
• column between value1 and value2: F= (value2 - value1)/ (high key - low key)
• column IN (list of values): F = (# of items in list) * (selectivity for column = value)max is 1/2
• columnA IN subquery: F = (card. of subquery) / ( card. of subquery relations)
• (pred1) OR (pred2): F = F(pred1) + F(pred2) - F(pred1) * F(pred2)
• (pred1) AND (pred2): F = F(pred1) * F(pred2)
• NOT pred: F = 1 - F(pred)
QCARD
• QCARD is ( card. of all relations) * ( F(pred i))
• RSICARD is the expected number of calls to RSI ( card. of all relations) * ( F(sargable pred i))
• An “interesting order” is an order specified by the GROUP BY or ORDER BY clause
• Single relation cost: cheapest access path which produces the “interesting order” or cheapest access path plus sorting cost of result
Cost Table (p. 515)
• index pages fetched plus data pages fetched plus W times RSI tuple retrieval calls.
• unique index matching an equal predicate: 1+1+W
• clustered index I matching one or more boolean factors: F(preds) *(NINDX(I) + TCARD) + W * RSICARD
• etc…
Joins
• nested loops and merging scans
• merging scans require sorts on the join column -- another “interesting order”
• n-way joins can be done by a succession of 2-way joins; not necessarily using the same technique. Results may be pipelined if a sort is not required.
Join ordering
• n! permutations of relation join orders
• join of (k+1) relation with previous k relations is independent of first k join order
• avoid Cartesian products when possible; make them as late as possible
Construct a tree
• construct a tree of possible join orderings: keep the cheapest order that produces an interesting ordering.
• First find the best way to access each single relation for each interesting ordering and unordered.
• Next, find the best way of joining any relation to each of these.
• Repeat until all relations have been added to each branch• Choose the cheapest strategy that has an interesting
ordering, or the cheapest strategy plus a sort.• Total number of solutions to store: 2n