lightweight graphical models for selectivity estimation without independence assumptions

Lightweight Graphical Models for Selectivity Estimation without Independence Assumptions

Kostas Tzoumas Amol Deshpande Christian S. Jensen

Query Optimization• Need to decide (among others) the

“best” join plan.• Very simplified picture:

– Cost of a plan = # of intermediate tuples it produces.

– Use upper plan if |LO|<|OC|, lower plan otherwise.

• Use size estimates of intermediate relations.

• Cardinality estimation: Estimating relation sizes at compile time.– Typically using statistical summaries.

• Errors in estimates can lead to wrong plan. Independence assumptions a frequent factor of errors.

L O C

LOcost = |LO|

L O C

OCcost = |OC|

2

Why Correlations Matterselect c_name,c_addressfrom lineitem,orders,customerwhere l_orderkey=o_orderkey and o_custkey=c_custkey and o_totalprice in [t1,t2] and l_extendedprice in [e1,e2] and c_acctbal in [b1,b2]

lineitem orders customer

eprice orderkey orderkey tprice custkey custkey acctbal

An order’s total price is a function of the prices of the order’s items. Causes LO to have more tuples than in average.

3

Independence Assumption|L’| = Pr(eprice in [e1,e2]) |L||O’| = Pr(tprice in [t1,t2]) |O||L’O’| = Pr(l_orderkey=o_orderkey)|L’||O’|

|L’O’| = Pr(l_orderkey=o_orderkey) Pr(eprice in [e1,e2]) Pr(tprice in [t1,t2]) |L||O|

•Results to under-estimation of |L’O’| wrong query plan, nested loop join.•Can result in orders of magnitude slower execution.•Solution: estimate joint probabilities: |L’O’| = Pr (l_orderkey=o_orderkey,

eprice in [e1,e2],

tprice in [t1,t2]) |L||O|

L O C

HJ

NLJ

σL σΟ σC

L’ O’ C’

L’O’

4

select c_name,c_addressfrom lineitem,orders,customerwhere l_orderkey=o_orderkey and o_custkey=c_custkey and o_totalprice in [t1,t2] and l_extendedprice in [e1,e2] and c_acctbal in [b1,b2]

Problem Setting

• Database + workload schema graph• Schema graph Set of random variables (descriptive

attributes and join indicators)• Goal: Approximate P(JLO, JLP, JLS, JSC, JOC, L.sdate, L.cdate,

L.rdate, O.odate, P.size, S.acctbal,C.acctbal)

sdatecdaterdate

lineitem

odate

orders

size

part

acctbal

supplier

acctbal

customer

JLO ≡L.okey=O.okey

JOC ≡O.ckey=C.ckey

JLS ≡L.skey=S.skey

JLP ≡L.pkey=P.pkey

JSC ≡S.nkey=C.nkey

Join indicatorA binary random variableCaptures the “event” that two random tuples join.We don’t need “key” attributes in the model (hard to approximate).Pr(JOC=T)=sel(O.ckey=C.ckey)

Problem: Exponential blow-

up of storage space and selectivity

estimation time.

5

Graphical Models• Exploit independence and conditional independence

to factor the full joint distribution.– X┴Y ↔P(X,Y)=P(X)P(Y)– X┴Z|Y ↔ P(X,Y,Z)=P(X,Y)P(Y,Z)/P(Y)

• Bayesian networks– Graphical representation of a set of cond. inds– Express a factorization of the joint distribution

• Can be used directly for selectivity estimation [Getoor’01].– High dimensional distributions high overhead– Construction algorithms do not scale.

X Y Z

6

Fixed Structure• “Tailored” solution

– 2D histograms only, scalable construction• Relational independencies:

– L.rdate ┴ O.odate σL.rdate=a,O.odate=b(LxO)=σL.rdate=a(L) x σO.odate=b(O)

– L.rdate ┴ JSC

– JLO ┴ JSC

• Design decisions– Join indicator has at most two parents– BN within a relation a directed tree

7

Bayesian Network

sdate

cdaterdate

size S.acctbal S.acctbal

odate

JLP JLS

JSC

JOC

JLO

lineitem

part supplier customer

orders

8

Moral Graph

sdate

cdaterdate

size S.acctbal C.acctbal

odate

JLP JLS

JSC

JOC

JLO

9

Chordal Graph

sdate

cdaterdate

size S.acctbal C.acctbal

odate

JLP JLS

JSC

JOC

JLO

10

Junction Tree

sdaterdate

C.acctbalodatesdate

C.acctbalS.acctbal

sdate

sdatecdateodate

cdateodate

JLO

sdateS.acctbal

JLS

sdatesizeJLP

odateC.acctbal

JOC

S.acctbalC.acctbal

JSC

sdate

sdate

sdat

eod

ate

sdate

sdat

e

cdateodate

odat

eac

ctba

l

Distributions to be kept = marginals of “cliques” and “separators.”

Factorization achieved!

Storage space:•(sdate,rdate) One 2D histogram•(odate,acctbal,JOC) Two 2D histograms (JOC=false,true)•(acctbal,odate,sdate) Three 1D histogramsOnly 2D histograms needed! 11

Efficient selectivity estimation via junction tree propagation, or a custom dynamic programming algorithm.

Scalable Model Construction• Create a “local” Bayesian network for each relation R(X,Y,…)

by testing pairwise correlations.– P(X,Y) = select X,Y,count(*) from R group by X,Y– Extract the maximum spanning tree using mutual information

I(X;Y) as weight.

• Find the best two predictors of each join indicator JRS.– P(X,Y,JRS=T) = select R.X,S.Y,count(*) from R,S where R.a=S.a group by R.X,S.Y

– P(X,Y,JRS=F) = P(X)P(Y) – P(X,Y,JRS=T)

• Very efficient; same complexity as CORDS [Ilyas’04].12

Selectivity Estimation

sdaterdate

C.acctbalodatesdate

C.acctbalS.acctbal

sdate

sdatecdateodate

cdateodate

JLO

sdateS.acctbal

JLS

sdatesizeJLP

odateC.acctbal

JOC

S.acctbalC.acctbal

JSC

sdate

sdate

sdat

eod

ate

sdate

sdat

e

cdateodate

odat

eac

ctba

l

Estimate size of query:select c_name,c_addressfrom lineitem,orders,customerwhere l_orderkey=o_orderkey and o_custkey=c_custkey and l_sdate<=“25/7/2011” and c_acctbal<=200000

Equivalent: Estimate Pr(JLO=true, JOC=true, sdate<=“25/7/2011”, C.acctbal<=200000)

Extract “Steiner tree”: Minimal subtree that contains JLO, JOC, sdate, C.acctbal 13

Selectivity Estimation

sdatecdateodate

cdateodate

JLO

odateacctbal

JOCcdateodate

odat

e

Estimate Pr(JLO=true, JOC=true, sdate<=“25/7/2011”, acctbal<=200000)

φ2=P(cdate,odate,JLO)

φ3=P(odate,acctbal,JOC)

φ1=P(sdate,cdate,odate)

μ12=P(cdate,odate)

μ23=P(odate)

1. Substituteφ1

*(cdate,odate)= φ1[sdate<=“25/7/2011”]φ2

*(cdate,odate)= φ2[JLO=true]φ3

*(odate)= φ3[JOC=true,acctbal<=200000]2. Multiplyφ12

*(cdate,odate)= φ1

*φ2*/μ12

3. Marginalizeφ12

**(odate)= Σcdateφ12

*

4. Multiplyφ123

*(odate)= φ12

**φ3*/μ23

5. Return Σodateφ123

*

Also in paper: dynamic programming algorithm that minimizes #multiplications by exploiting query optimizer order of requests.14

Implementation• Model construction outside DBMS. Create distributions

with SQL queries.– Stores junction tree as tables in PostgreSQL catalog.

• Graphical model foundation as PostgreSQL “nodes”.– Probability distributions stored as equi-width multidimensional

histograms. – Cliques, multiplication, division, marginalization.– Junction tree, selectivity estimation algorithms

• Integration with PostgreSQL query optimizer.– Load Steiner tree from catalog tables.– Bypass PostgreSQL selectivity estimation functions.

15

Impact of Capturing Correlationsselect c_name,c_addressfrom lineitem,orders,customerwhere l_orderkey=o_orderkey and o_custkey=c_custkey and o_tprice in [t1,t2] and l_eprice in [e1,e2] and c_acctbal in [b1,b2]

Avoid under-estimationby capturing the pricecorrelation.Estimates very close toreality.Optimizer picks different plan than default PGSQL.

Huge impact in execution time.

Can do selectivity estimation efficiently. Optimization time in the range of 10s of milliseconds

Cost of plans (intermediate tuples)

Execution & optimization times

16

Accuracy of EstimatesMultiplicative error:max(real,estimate) / min(real,estimate)Penalizes both under- and over- estimation. Subset of TPC-H schema.Tweaked data generator to introduce correlations.400 random queries.Geometric average ofmultiplicative error.

One order of magnitude better estimates for 5-join queries

Optimization time less than 25 milliseconds

17

Conclusions and Future Work• Attribute value independence assumption unfounded

– Heuristic, not good enough– Most frequent source of horrible plans

• Practical adaptation of graphical models implemented in PostgreSQL– Low overhead, good estimate quality

• Specialized synopses– Low-dimensional, minimize multiplicative error, error

guarantees after multiplication

• Incremental updates– Building graphical model as a side-effect of query execution

lightweight graphical models for selectivity estimation without independence assumptions

Documents

rdate o

okeyjoc o

c graphsdatecdaterdatesizes

orderkey preprice

e2 prtprice

random tuples

t2 loresults

t2 olo