using the optimizer to generate an effective regression suite: a first step murali m. krishna...

Using the Optimizer to Using the Optimizer to Generate an Effective Generate an Effective

Regression Suite: Regression Suite: A First StepA First Step

Murali M. KrishnaMurali M. Krishna

Presented by Harumi KunoPresented by Harumi Kuno

HPHP

Space of possible queries, Space of possible queries, schemas, data is practically infiniteschemas, data is practically infinite

22

Po

ssib

le s

chem

as

Possible queries

Possible data

Innumerable plans are possible for any query in this space.

33

Po

ssib

le d

esig

ns

Possible queries

Possible data

Problem: how to construct an effective Problem: how to construct an effective and economical query optimizer / and economical query optimizer /

executor regression suite?executor regression suite?• If a new plan is now produced

for a given query, is its performance worse than before? (Optimizer regression)

• If the same plan is produced, is its performance worse? (Executor regression)

Goal: Building the regression Goal: Building the regression test suite should…test suite should…

… … cover the plan space (code coverage is cover the plan space (code coverage is not enough)not enough)

… … be economical not only for the optimizer be economical not only for the optimizer but also for the execution engine but also for the execution engine

… … require little knowledge of the optimizer require little knowledge of the optimizer (e.g., without hiring an expensive expert)(e.g., without hiring an expensive expert)

44

55

Po

ssib

le d

esig

ns

Possible queries

Possible data

Historically, test suite focus has been Historically, test suite focus has been on known workloadson known workloads

Customer workloads

benchmarks

known

problems

• Developers and QA add “point” test cases

• Test specific optimization rules

• Improve code coverage

• May be very complex (large search space)

Disadvantages of historical Disadvantages of historical approachapproach

Can’t quantify/evaluate benefit added by new Can’t quantify/evaluate benefit added by new teststests

When a regression occurs (especially if query is When a regression occurs (especially if query is complex), hard to find the exact cause; Need an complex), hard to find the exact cause; Need an human expert to find and fix problemhuman expert to find and fix problem

Time to run increases with added testsTime to run increases with added tests

QA groups use code coverage as metric but this QA groups use code coverage as metric but this does not guarantee plan coveragedoes not guarantee plan coverage

66

InsightInsight

Enumerate the Optimizer Plan Space (OPS) Enumerate the Optimizer Plan Space (OPS) over physical operators (PO)over physical operators (PO)

Target OPS (PO, n): All possible legal plans with Target OPS (PO, n): All possible legal plans with n operators chosen from POn operators chosen from PO PO: subset of physical operators used by DBMS that PO: subset of physical operators used by DBMS that

we want to testwe want to test

n will have to be quite small (n will have to be quite small (≤ ≤ 10)10)

Goal: try to cover this finite space with as few Goal: try to cover this finite space with as few queries as possible – a non trivial problem!queries as possible – a non trivial problem!

77

Difficult design decisionsDifficult design decisions How many table instances in each query?How many table instances in each query?

4, 10, 20? Plan space grows exponentially4, 10, 20? Plan space grows exponentially A 4 table query may be good enough for our purposes A 4 table query may be good enough for our purposes

(regression testing the cost model)(regression testing the cost model) A 20 table query plan is composed of a series of 4 table A 20 table query plan is composed of a series of 4 table

queriesqueries

How many physical operators per query?How many physical operators per query? 3 successive hash joins behave very differently from 3 NL 3 successive hash joins behave very differently from 3 NL

joins at exec. timejoins at exec. time Regression suite should capture execution dependencies Regression suite should capture execution dependencies

(e.g., intra-query resource contention)(e.g., intra-query resource contention) We settle for a small number of operators in each of our We settle for a small number of operators in each of our

regression queries regression queries

88

Initial ExperimentInitial Experiment Focus on join space only Focus on join space only Start with a simple skeleton queryStart with a simple skeleton query select T1.a select T1.a from T1, T2, T3, T4 from T1, T2, T3, T4 where T1.a = T2.a & T2.c = T3.c & T3.d = T4.d &where T1.a = T2.a & T2.c = T3.c & T3.d = T4.d & T1.b ≤ C1 & T2.b ≤ C2 & T3.b ≤ C3 & T4.b ≤ C4T1.b ≤ C1 & T2.b ≤ C2 & T3.b ≤ C3 & T4.b ≤ C4 1 ≤ Ci ≤ |T|, 1 ≤ i ≤ 4 1 ≤ Ci ≤ |T|, 1 ≤ i ≤ 4

Column b is unique and exactly controls # Column b is unique and exactly controls # of rows from each instanceof rows from each instance

All queries in the regression suite will have All queries in the regression suite will have the same structurethe same structure

99

T1 T2 T3 T4

AssumptionAssumption 5 different join methods:5 different join methods:J1: cartesian product, J1: cartesian product,

J2: regular hash join (where one or both inputs have to be J2: regular hash join (where one or both inputs have to be optionally optionally partitioned on the fly), partitioned on the fly),

J3: small table broadcast hash join (here the smaller input J3: small table broadcast hash join (here the smaller input is is broadcast to all the sites/cpus of the bigger table), broadcast to all the sites/cpus of the bigger table),

J4: merge join (one or both inputs may need to be sorted), J4: merge join (one or both inputs may need to be sorted), and and

J5: index nested loops join.J5: index nested loops join.1010

Enumerating Join Plan SpaceEnumerating Join Plan Space

# of non bushy join plans = (4!)*5^3 = 3000 # of non bushy join plans = (4!)*5^3 = 3000

= subset of OPS ({J1, J2, J3, J4, J5}, 3)= subset of OPS ({J1, J2, J3, J4, J5}, 3) Insight: We can do with just one table and use 4 Insight: We can do with just one table and use 4

instances of the same tableinstances of the same table What is important is the number of rows from What is important is the number of rows from

each table participating in each query (from the each table participating in each query (from the costing perspective) – not the order of the tablescosting perspective) – not the order of the tables

Therefore the number of join plans of interest Therefore the number of join plans of interest is 3,000/(4!) = 125is 3,000/(4!) = 125

1111

Structure of Table T Structure of Table T (details in paper)(details in paper)

After some trial and error, we picked T with After some trial and error, we picked T with the following schemathe following schema

T (int a, int b, int c, int d, [e, f, g])T (int a, int b, int c, int d, [e, f, g])a: primary clust. key, hash part. 8 ways on column aa: primary clust. key, hash part. 8 ways on column ab: unique random secondary key b: unique random secondary key c: Beta (4, 4): integer-valued, Normal-like over c: Beta (4, 4): integer-valued, Normal-like over [1 .. [1 .. 8388608] with mean = 4194419, std. dev = 8388608] with mean = 4194419, std. dev = 13981311398131d: Beta (2, 0.5): integer-valued, Zipfian-like over [1 .. d: Beta (2, 0.5): integer-valued, Zipfian-like over [1 .. 8388608] with mean = 6711152 , skew = -1.258388608] with mean = 6711152 , skew = -1.25

e: uniformly distributed between 1 and 256 e: uniformly distributed between 1 and 256 f: uniformly distributed between 1 and 4096 f: uniformly distributed between 1 and 4096 g: uniformly distributed between 1 and 65536 g: uniformly distributed between 1 and 65536

1212

Generating the Regression QueriesGenerating the Regression Queries

Ultimately, we want 125 queries, each with a Ultimately, we want 125 queries, each with a distinct join plandistinct join plan

We can try to generate them manually – very We can try to generate them manually – very tedious; will require immense knowledge of tedious; will require immense knowledge of the optimizerthe optimizer

Skeleton query lets us generate many queries Skeleton query lets us generate many queries with diverse plans ‘automatically’ by varying with diverse plans ‘automatically’ by varying constants in predicates; No knowledge of constants in predicates; No knowledge of optimizer required. No pain and lots of gain optimizer required. No pain and lots of gain

1313

Generating the Regression Queries Generating the Regression Queries (2)(2)

Key idea: Vary the constants C1 thru C4 to ‘cover’ Key idea: Vary the constants C1 thru C4 to ‘cover’ the 4 dimensional cardinality spacethe 4 dimensional cardinality space

By varying the cardinalities of the participating tables we are By varying the cardinalities of the participating tables we are essentially covering the cardinality space, which in turn will essentially covering the cardinality space, which in turn will cover the plan space by letting the optimizer choose among the cover the plan space by letting the optimizer choose among the various join implementationsvarious join implementations

We chose each Ci from these 10 valuesWe chose each Ci from these 10 values {1, 10, 100, 1K,10K, 100K, 1M, 2M, 4M, 8M} {1, 10, 100, 1K,10K, 100K, 1M, 2M, 4M, 8M} resulting in 10^4 queriesresulting in 10^4 queries These queries were prepared and only 42These queries were prepared and only 42 (< 35%) distinct plans were found(< 35%) distinct plans were found Linear Skeleton query not good enough! Only one Linear Skeleton query not good enough! Only one

cartesian product possible; no star joinscartesian product possible; no star joins

1414

Generalized Skeleton QueryGeneralized Skeleton Query

select T1.a select T1.a

from T T1, T T2, T T3, T T4 from T T1, T T2, T T3, T T4

where T1.a = T2.a &T2.c = T3.c & T3.d = T4.d &where T1.a = T2.a &T2.c = T3.c & T3.d = T4.d &

T1.e = T3.e & T1.f = T4.f & T2.g = T4.gT1.e = T3.e & T1.f = T4.f & T2.g = T4.g & &

T1.b ≤ C1 & T2.b ≤ C2 & T3.b ≤ C3 & T4.b ≤ C4T1.b ≤ C1 & T2.b ≤ C2 & T3.b ≤ C3 & T4.b ≤ C4

6 = (4 choose 2) join predicates6 = (4 choose 2) join predicates 2^6 subsets of join predicates possible to allow for all join 2^6 subsets of join predicates possible to allow for all join

geometries (includes star joins, 0-3 CPs etc.)geometries (includes star joins, 0-3 CPs etc.) Number of queries generated = 64*10K = 640,000Number of queries generated = 64*10K = 640,000 When these queries were compiled, # of distinct plans When these queries were compiled, # of distinct plans

found was 101 (> 80%) found was 101 (> 80%) Picking the right skeleton query crucial Picking the right skeleton query crucial

1515

T1

T3 T4

T2

Generating the Regression SuiteGenerating the Regression Suite Pick a small number of queries corresponding to Pick a small number of queries corresponding to

each distinct planeach distinct plan How to pick these queries may be a good How to pick these queries may be a good

research topicresearch topic We picked 2 queries for each plan (‘closest’ and We picked 2 queries for each plan (‘closest’ and

‘farthest’ from the origin)‘farthest’ from the origin) Intuition: These represent the two extremes for Intuition: These represent the two extremes for

the planthe plan For a good starting point, verify cardinality For a good starting point, verify cardinality

estimates and optimality of plans in as many estimates and optimality of plans in as many cases as possible. Possible because queries are cases as possible. Possible because queries are simple.simple.

1616

Comparison with TPC-H/DSComparison with TPC-H/DS Compiled queries with at least 4 tables in H and Compiled queries with at least 4 tables in H and

DS (turned off bushy plans)DS (turned off bushy plans) Found all sequences of 3 consecutive joins in Found all sequences of 3 consecutive joins in

non bushy plansnon bushy plans Note: J1-J2-J3-J1-J4-J5 yields 4 sets of 3 joinsNote: J1-J2-J3-J1-J4-J5 yields 4 sets of 3 joins Even so, found 67 distinct plans (incrementally Even so, found 67 distinct plans (incrementally

harder to generate additional plans manually)harder to generate additional plans manually) Our suite was 50% better Our suite was 50% better Hard to do manually Hard to do manually Key point: Our queries are very SIMPLE while Key point: Our queries are very SIMPLE while

DS queries are very COMPLEX (easy to DS queries are very COMPLEX (easy to diagnose and fix problems)diagnose and fix problems)

Our suite can be generated easily as wellOur suite can be generated easily as well Not suggesting that our suite replace H/DSNot suggesting that our suite replace H/DS 1717

ConclusionsConclusions

Use enumeration to generate a regression Use enumeration to generate a regression suite that covers the plan space – suite that covers the plan space – automaticallyautomatically

No knowledge of optimizer internals No knowledge of optimizer internals requiredrequired

BUT need to come up with a good skeleton BUT need to come up with a good skeleton queryquery

Validated approach by applying it to join Validated approach by applying it to join plan space.plan space.

1818

Future WorkFuture Work

How to extend to other SQL operators?How to extend to other SQL operators? How to extend to larger number of tables How to extend to larger number of tables

(number of generated queries is ‘doubly (number of generated queries is ‘doubly exponential’)? If T = 5, generated queries exponential’)? If T = 5, generated queries = 102.4 million= 102.4 million

Will sampling work to reduce #of queries?Will sampling work to reduce #of queries? How to come up with a better metric that How to come up with a better metric that

takes into account the quality of plans in takes into account the quality of plans in the suite?the suite?

1919

using the optimizer to generate an effective regression suite: a first step murali m. krishna...

Documents