Hive CBO: Accelerating Complex Queries

John Pullokkaran

11/19/2013


© Hortonworks Inc. 2013

CBO in Hive – Why?


• Ease of Use

• View Chaining

• Ad hoc queries involving multiple views

• Enables BI Tools front ending Hive


CBO in Hive – How? Hive SQL Rewrite

• Use Optiq CBO

• Introduce transformation rules in Optiq

• Convert Hive op tree to Optiq op tree

• Optimize Optiq op tree

• Convert optimized Optiq op tree back to Hive AST
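
The three conversion steps above can be pictured with a toy pipeline. This is only a sketch under stated assumptions: the nested-tuple tree format, the `Scan` and `Join` classes, and the "smallest input first" rule are hypothetical stand-ins, not Hive's or Optiq's actual classes or rules, and the toy ignores join conditions entirely.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Scan:
    table: str
    rows: int  # estimated cardinality (hypothetical statistic)


@dataclass(frozen=True)
class Join:
    left: "Node"
    right: "Node"


Node = Union[Scan, Join]


def to_planner_tree(hive_tree):
    """Step 1: convert the (toy) Hive op tree into planner nodes."""
    if hive_tree[0] == "scan":
        _, table, rows = hive_tree
        return Scan(table, rows)
    _, left, right = hive_tree
    return Join(to_planner_tree(left), to_planner_tree(right))


def leaves(node):
    """Collect all table scans under a node, left to right."""
    if isinstance(node, Scan):
        return [node]
    return leaves(node.left) + leaves(node.right)


def optimize(node):
    """Step 2: toy rule that rebuilds a left-deep join, smallest inputs first."""
    scans = sorted(leaves(node), key=lambda s: s.rows)
    plan = scans[0]
    for scan in scans[1:]:
        plan = Join(plan, scan)
    return plan


def to_ast(node):
    """Step 3: convert the optimized tree back to the nested-tuple form."""
    if isinstance(node, Scan):
        return ("scan", node.table, node.rows)
    return ("join", to_ast(node.left), to_ast(node.right))


# Hypothetical query: sales JOIN store JOIN date_dim
hive_tree = (
    "join",
    ("join", ("scan", "sales", 1_000_000), ("scan", "store", 100)),
    ("scan", "date_dim", 3_650),
)
rewritten = to_ast(optimize(to_planner_tree(hive_tree)))
```

In this toy run the two small tables are joined first and the large `sales` table last. A real planner would pick among many candidate plans by cost rather than apply a single fixed rule.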


CBO in Hive – How? Optiq

• Open source, Apache licensed, query execution framework implemented in Java

• Used by:

I. Apache Cascade

II. Apache Drill

III. LucidDB

IV. SQLstream

• Based on Volcano paper & derived from Eigenbase Project

• ~20 man-years of development

• More than 50 optimization rules

• Plan search space can be controlled


CBO in Hive – How? Cost computation

• Emphasis on latency reduction

• Cost computation will be used for:

I. Join ordering

II. Join algorithm selection

III. Tez vertex boundary selection

• Shuffling cost is important

• Cost formula uses CPU, I/O, and cardinality

• I/O data size = Cardinality * avg size of tuple
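
As a small worked example of the I/O data size formula above, the sketch below multiplies an estimated row count by an average tuple size. The function name and the table statistics are hypothetical, chosen only to make the arithmetic concrete.

```python
def io_data_size(cardinality: int, avg_tuple_size: int) -> int:
    """I/O data size in bytes = cardinality * average tuple size."""
    return cardinality * avg_tuple_size


# Hypothetical statistics for two join inputs.
orders_bytes = io_data_size(cardinality=2_000_000, avg_tuple_size=120)
customers_bytes = io_data_size(cardinality=50_000, avg_tuple_size=200)

# The smaller input is the cheaper one to move (for example, to shuffle).
smaller_input = "customers" if customers_bytes < orders_bytes else "orders"
```

With these made-up numbers, `orders` comes to 240,000,000 bytes and `customers` to 10,000,000 bytes, so `customers` is the cheaper side to move.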


CBO in Hive – How? Cost computation (cont’d)

• I/O cost differentiates among:

I. Network

II. Local Disk

III. HDFS

• CPU cost and I/O cost are normalized into standard units of time

• The relation between CPU cost and the various I/O costs is important
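
One way to picture the normalization is a single cost figure in abstract time units, with a separate weight per I/O medium. The sketch below is an assumption-laden illustration: the constants, the weights' ordering, and the two candidate plans are all hypothetical and are not taken from the Hive or Optiq cost model.

```python
# Hypothetical weights, all expressed in the same abstract "time units".
CPU_COST_PER_ROW = 1
IO_COST_PER_BYTE = {"local_disk": 4, "hdfs": 6, "network": 10}


def total_cost(rows: int, bytes_by_medium: dict) -> int:
    """CPU cost plus I/O cost, both normalized to time units."""
    cpu = rows * CPU_COST_PER_ROW
    io = sum(IO_COST_PER_BYTE[medium] * n for medium, n in bytes_by_medium.items())
    return cpu + io


rows = 1_000
data_bytes = rows * 100  # cardinality * avg tuple size (hypothetical 100 bytes)

# Hypothetical plan A: read from HDFS, move data over the network, spill locally.
plan_a = total_cost(rows, {"hdfs": data_bytes, "network": data_bytes, "local_disk": data_bytes})

# Hypothetical plan B: read from HDFS only, no data movement.
plan_b = total_cost(rows, {"hdfs": data_bytes})

cheaper = "B" if plan_b < plan_a else "A"
```

Because every term is in the same unit, plans that differ in where the bytes travel can be compared directly. The ratio between the CPU weight and the I/O weights decides which plan wins, which is why that relation matters.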


CBO in Hive – How? Control Flow

• “SemanticAnalyzer.analyzeInternal” calls into CBO optionally

• CBO invocation conditional on lossless query tree translation.

• All of the CBO code is contained in new packages

• Optiq would pull in 3 new jars

• “SemanticAnalyzer.analyzeInternal” is called with the AST returned by CBO.
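
The control flow above can be sketched as follows. This is a Python toy, not Hive's Java code: `analyze_internal`, `run_cbo`, `LossyTranslation`, the supported-operator set, and the flat tuple standing in for an AST are all hypothetical names and simplifications introduced for illustration.

```python
class LossyTranslation(Exception):
    """Raised when the query tree cannot be translated without loss."""


# Hypothetical set of operators the optimizer can translate losslessly.
SUPPORTED_OPS = {"scan", "filter", "join", "select"}
OP_ORDER = {"scan": 0, "filter": 1, "join": 2, "select": 3}


def run_cbo(ast):
    """Toy optimizer: refuses unsupported ops, otherwise applies filters before joins."""
    unsupported = [op for op in ast if op not in SUPPORTED_OPS]
    if unsupported:
        raise LossyTranslation(f"cannot translate: {unsupported}")
    return tuple(sorted(ast, key=OP_ORDER.__getitem__))


def analyze_internal(ast, cbo_enabled=True, _from_cbo=False):
    """Optionally call into CBO, then analyze the AST it returns."""
    if cbo_enabled and not _from_cbo:
        try:
            optimized_ast = run_cbo(ast)
        except LossyTranslation:
            optimized_ast = None  # fall back to the original AST
        if optimized_ast is not None:
            # Re-enter analysis with the AST returned by CBO.
            return analyze_internal(optimized_ast, cbo_enabled, _from_cbo=True)
    return {"plan": ast, "used_cbo": _from_cbo}


optimized = analyze_internal(("scan", "join", "filter", "select"))
fallback = analyze_internal(("scan", "lateral_view", "select"))
disabled = analyze_internal(("scan", "join", "filter", "select"), cbo_enabled=False)
```

The key property in this sketch is that a query the optimizer cannot translate losslessly takes the existing, non-CBO path unchanged, so enabling CBO never blocks a query from being planned.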


CBO in Hive

Questions?
