Hive CBO: Accelerating Complex Queries
John Pullokkaran
11/19/2013
© Hortonworks Inc. 2013
CBO in Hive – Why?
• Ease of Use
• View Chaining
• Ad hoc queries involving multiple views
• Enables BI Tools front ending Hive
CBO in Hive – How? Hive SQL Rewrite
• Use Optiq CBO
• Introduce transformation rules in Optiq
• Convert Hive op tree to Optiq op tree
• Optimize Optiq op tree
• Convert optimized Optiq op tree back to Hive AST
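The round trip listed above (Hive op tree to optimizer tree, optimize, back to an AST) can be sketched in Python. This is a toy illustration under stated assumptions, not Hive or Optiq code: RelNode, hive_to_rel, swap_join_inputs, and rel_to_ast are hypothetical names, the Hive op tree is modeled as nested dicts, the "AST" is rendered as plain text, and the single transformation rule (put the larger join input last) is chosen only to show where a rule would plug in.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RelNode:
    """Hypothetical stand-in for an optimizer-side operator node."""
    op: str
    table: str = ""
    rows: int = 0
    inputs: Tuple["RelNode", ...] = ()


def hive_to_rel(hive_op: dict) -> RelNode:
    """Step 1: convert a (toy) Hive op tree into the optimizer's tree."""
    return RelNode(
        op=hive_op["op"],
        table=hive_op.get("table", ""),
        rows=hive_op.get("rows", 0),
        inputs=tuple(hive_to_rel(c) for c in hive_op.get("inputs", [])),
    )


def estimated_rows(node: RelNode) -> int:
    """Crude assumption: an operator emits as many rows as its largest input."""
    if not node.inputs:
        return node.rows
    return max(estimated_rows(c) for c in node.inputs)


def swap_join_inputs(node: RelNode) -> RelNode:
    """Step 2: apply one toy transformation rule bottom-up.

    The rule moves the larger input of a two-way join to the last position.
    """
    inputs = tuple(swap_join_inputs(c) for c in node.inputs)
    if (node.op == "join" and len(inputs) == 2
            and estimated_rows(inputs[0]) > estimated_rows(inputs[1])):
        inputs = (inputs[1], inputs[0])
    return RelNode(node.op, node.table, node.rows, inputs)


def rel_to_ast(node: RelNode) -> str:
    """Step 3: convert the optimized tree back into AST-like text."""
    if node.op == "scan":
        return node.table
    return "(" + " JOIN ".join(rel_to_ast(c) for c in node.inputs) + ")"
```

Calling rel_to_ast(swap_join_inputs(hive_to_rel(tree))) on a two-table join runs all three steps in order; a real optimizer would apply many rules and choose among the resulting plans by cost.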
CBO in Hive – How? Optiq
• Open source, Apache licensed, query execution framework implemented in Java
• Used by:
I. Apache Cascade
II. Apache Drill
III. Lucid DB
IV. SqlStream
• Based on Volcano paper & derived from Eigenbase Project
• ~20 man-years of development
• More than 50 optimization rules
• Plan search space can be controlled
CBO in Hive – How? Cost computation
• Emphasis on latency reduction
• Cost computation will be used for:
I. Join ordering
II. Join algorithm selection
III. Tez vertex boundary selection
• Shuffling cost is important
• Cost formula uses CPU, I/O, and cardinality
• I/O data size = Cardinality * avg size of tuple
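How cardinality and tuple size feed a cost estimate, and how that estimate can drive join ordering, is sketched below in Python. This is an illustrative toy, not the actual Hive or Optiq cost model: TableStats, io_data_size, scan_cost, left_deep_cost, and best_join_order are hypothetical names, the two weights and the fixed join selectivity are made-up values, and the brute-force search over permutations stands in for a real planner's search strategy.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class TableStats:
    name: str
    cardinality: int      # estimated number of rows
    avg_tuple_size: int   # average bytes per row


# Hypothetical weights; a real model would calibrate these.
CPU_COST_PER_ROW = 1.0
IO_COST_PER_BYTE = 0.01


def io_data_size(stats: TableStats) -> int:
    """I/O data size = cardinality * average tuple size (from the slide)."""
    return stats.cardinality * stats.avg_tuple_size


def scan_cost(stats: TableStats) -> float:
    """Combine a CPU term (per row) with an I/O term (per byte)."""
    return (CPU_COST_PER_ROW * stats.cardinality
            + IO_COST_PER_BYTE * io_data_size(stats))


def left_deep_cost(order, selectivity: float = 0.001) -> float:
    """Cost of a left-deep join plan: each join pays for reading both inputs.

    Assumes every join keeps `selectivity` of the cross product.
    """
    current = order[0]
    total = 0.0
    for right in order[1:]:
        total += scan_cost(current) + scan_cost(right)
        current = TableStats(
            name=f"({current.name} JOIN {right.name})",
            cardinality=max(1, int(current.cardinality
                                   * right.cardinality * selectivity)),
            avg_tuple_size=current.avg_tuple_size + right.avg_tuple_size,
        )
    return total


def best_join_order(tables):
    """Brute force: try every left-deep order and keep the cheapest."""
    return min(permutations(tables), key=left_deep_cost)
```

Enumerating all permutations only works for a handful of tables, which is one reason a planner needs a way to control the plan search space.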
CBO in Hive – How? Cost computation (cont’d)
• I/O cost differentiates among:
I. Network
II. Local Disk
III. HDFS
• CPU cost and I/O cost are normalized into standard units of time
• The relation between CPU cost and the various I/O costs is important
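One way to picture normalizing CPU and I/O cost into time units is sketched below in Python. The names Medium, io_seconds, cpu_seconds, and operator_seconds are hypothetical, and the throughput figures are invented placeholders rather than measured or Hive-defined values; only the distinction among network, local disk, and HDFS comes from the slide.

```python
from enum import Enum


class Medium(Enum):
    NETWORK = "network"
    LOCAL_DISK = "local_disk"
    HDFS = "hdfs"


# Hypothetical throughput in bytes per second for each I/O medium.
BYTES_PER_SECOND = {
    Medium.LOCAL_DISK: 100_000_000,
    Medium.HDFS: 50_000_000,
    Medium.NETWORK: 25_000_000,
}

# Hypothetical CPU throughput in rows processed per second.
ROWS_PER_CPU_SECOND = 1_000_000


def io_seconds(cardinality: int, avg_tuple_size: int, medium: Medium) -> float:
    """I/O time = (cardinality * avg tuple size) / throughput of the medium."""
    return cardinality * avg_tuple_size / BYTES_PER_SECOND[medium]


def cpu_seconds(cardinality: int) -> float:
    """CPU time = rows processed / rows per second."""
    return cardinality / ROWS_PER_CPU_SECOND


def operator_seconds(cardinality: int, avg_tuple_size: int,
                     medium: Medium) -> float:
    """Both terms are in seconds, so they can simply be added."""
    return cpu_seconds(cardinality) + io_seconds(cardinality,
                                                 avg_tuple_size, medium)
```

Because both terms share a unit, the ratio between the CPU constant and each I/O constant decides which plans look cheap, so those ratios matter more than the absolute numbers.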
CBO in Hive – How? Control Flow
• “SemanticAnalyzer.analyzeInternal” optionally calls into the CBO
• CBO invocation is conditional on lossless query tree translation.
• All of the CBO code is contained in new packages
• Optiq would pull in 3 new jars
• “SemanticAnalyzer.analyzeInternal” is called with the CBO-returned AST.
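The control flow described above can be sketched as a small Python function. This is a guess at the shape of the logic, not the actual Hive code: analyze_internal only mirrors the name of SemanticAnalyzer.analyzeInternal, LossyTranslationError and the cbo callable are hypothetical, the AST is modeled as a plain string, and the recursive second call is one assumed way of feeding the CBO-returned AST back in.

```python
class LossyTranslationError(Exception):
    """Hypothetical signal that the query tree cannot be translated losslessly."""


def analyze_internal(ast: str, cbo=None) -> str:
    """Toy analyzer: optionally route the AST through a CBO first.

    `cbo` is a hypothetical callable that returns an optimized AST, or raises
    LossyTranslationError when the translation would lose information.
    """
    if cbo is not None:
        try:
            optimized_ast = cbo(ast)
        except LossyTranslationError:
            optimized_ast = None  # fall back to the original AST
        if optimized_ast is not None:
            # Re-enter analysis with the CBO-returned AST, with CBO disabled
            # so the optimizer runs at most once.
            return analyze_internal(optimized_ast, cbo=None)
    # Regular (non-CBO) analysis path, reduced here to a placeholder string.
    return f"plan({ast})"
```

Passing the optimizer in as an optional argument keeps the regular path untouched when the CBO is disabled or declines the query, which matches the idea of keeping all CBO code in separate packages.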
CBO in Hive
Questions?