a comparison of column, row and array dbmss to process recursive queries carlos ordonez att labs
TRANSCRIPT
Acknowledgments
Michael Stonebraker Visited MIT 2013, 2014 Wellington Cabrera, PhD student;
Achyuth Gurram, MS student Divesh Srivastava for inviting me to
spend my sabbatical at ATT
Introduction
Recursion defined in ANSI SQL Graph algorithms: paths,
reachability, neighborhood analysis Complexity: Cubic, NP-completeness Before: Deductive databases:
datalog Harder query optimization than
traditional SPJ queries
Directed Graphs
Definitions: Directed Graph G=(V,E), maybe cyclic! A vertex in V : i or j and i,j=1..n. and edge
(i,j ) has a direction and weight v storage: adjacency list table E : |E|=N
Problems: Transitive closure: vertices j reachable
from i Power matrix Ek
Examples
V=cities,E=roads. Is there some path from San Diego to
NYC?: path from i to j? shortest one? V=employees, E=manager -> employee q1: all employees under i q2: Is j supervised by i? Bill of materials: The well-known
part/subpart manufacturing DB: all subparts Y from part X
Technical details
Linear recursion; Intuition: R=R*E Inner joins; no negation SELECTs must have same data types No GROUP-BY, DISTINCT, HAVING, NOT IN, OUTER
JOIN clauses inside R Any SQL query on R is valid Seminaive; recursion depth k: loop
with k-1 joins or (rarely) fixpoint
Stonebraker: One size does not fit all!!
Analytics: Row: OLTP, point queries Column: DSS/cube queries Array: math, science
Other: Stream: one pass; in-RAM MMDB: OLTP Hadoop/noSQL: yawn
DBMS storage elevator storyrow | column | array Row: old, single file, block,
B-trees/hash, horizontal partitioning Column: new, multiple files, var. size
blocks, ordered values, compressed, no row-level index!
Array: very different storage; attributes={dimensions|columns}; chunk==subarray; multidimensional; grid index in RAM
Algorithms
Semi-naïve: classical, general, reasonably efficient, expressive
Direct: very efficient; TC only; in-place update; matrix-based; requires arrays; not good for SQL; not used today! [TKDE 2010, Teradata DBMS]
Optimizations: SPJ Relational algebra + physical operators
Join: hash or sort-merge (nested loop does not make sense with E)
Projection: push dup elimination & aggreg. Selection: push filter
To be explored later Outer joins External joins Indexing: row-level only?
Join: hash versus sort-mergeGoal: O(N)
Main computation:
Join optimization: Column: projection={unordered,
ordered values} Row: unordered, ordered versus index Array: default={ordered, indexed} choice={sparse,dense}
Projection
Duplicate elimination reachability binary edges
Aggregation shortest/longest path count # paths length vs weight/cost
Issues with select operator
Incorrect to use a predicate involving a join expression column in recursive step
Cicles => Infinite recursion Monotonically increasing v, OK to
prune Recursion depth k: required in
practice
Benchmark with graphs
Real Skewed Complex structure; sample==different But Fixed size
Synthetic Vary n,N Vary shape NEW: Cliques!
Conclusions Query optimizations
Confirm decades of research: required But impact definitely varies G knowledge helps (catch 22)
Benchmark with tuned query processing Column DBMS faster; cliques/skewed
degree OK Array DBMS competitive for dense/clique G Row DBMS reasonable
Conclusions Graph features impacting time and I/O
Density: avg vertex degree; deg(i) skew Cliques: K Cycles: deep k
Recommendations Tune DBMS storage Tune join (skewed hash join) Beware of large cliques and short cycles Increment recursion depth k gradually
Future work
Develop operators for Array DBMS Time complexity based on G Beyond semi-naïve: Direct,
logarithmic More specific graph problems like
CC, Neighborhood analysis, vertex similarity
Query processing cost model