October 2014 HUG: Hive on Spark
TRANSCRIPT
Hive on Spark
Szehon Ho, Software Engineer at Cloudera, Apache Hive Committer
October 2014
Background (Hive)
• Apache Hive: a data query and management tool for distributed datasets, exposed via a SQL-like query language called HiveQL
Background (Hive)
• 2007-2013: MapReduce was the only distributed processing engine
• Map() and Reduce() primitives are not designed for long data pipelines
• Complex SQL-like queries are inefficiently expressed as many MR stages
• Disk IO between MR stages
• Shuffle-sort between each Map and Reduce
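To make that cost concrete, here is a pure-Python, single-machine sketch (all names hypothetical, not Hive code) of two chained MapReduce stages. Each stage shuffle-sorts by key, reduces, and then writes its output to a file that the next stage must read back, mimicking the per-stage disk IO described above.

```python
import json
import os
import tempfile
from itertools import groupby
from operator import itemgetter

def mr_stage(records, map_fn, reduce_fn):
    """One MapReduce stage: map, shuffle-sort by key, reduce, then
    persist the result to a file (standing in for HDFS) that the
    next stage must read back: the disk IO between MR stages."""
    mapped = [kv for rec in records for kv in map_fn(rec)]
    mapped.sort(key=itemgetter(0))                        # shuffle-sort
    reduced = [reduce_fn(k, [v for _, v in grp])
               for k, grp in groupby(mapped, key=itemgetter(0))]
    with tempfile.NamedTemporaryFile("w", suffix=".json",
                                     delete=False) as f:  # "HDFS" write
        json.dump(reduced, f)
        path = f.name
    with open(path) as f:                                 # "HDFS" read
        result = json.load(f)
    os.unlink(path)
    return result

# Stage 1 counts words; stage 2 regroups the counts by count parity.
lines = ["a b a", "b c"]
counts = mr_stage(lines,
                  lambda line: [(w, 1) for w in line.split()],
                  lambda k, vs: (k, sum(vs)))
by_parity = mr_stage(counts,
                     lambda kv: [(kv[1] % 2, kv[0])],
                     lambda k, vs: (k, sorted(vs)))
print(counts)      # [['a', 2], ['b', 2], ['c', 1]]
print(by_parity)   # [[0, ['a', 'b']], [1, ['c']]]
```

A single query compiled to many such stages pays the write-then-read round trip at every boundary, which is the inefficiency the slide calls out.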
[Diagram: a Hive query compiled into a chain of Map()/Reduce() stages, with HDFS writes and reads between each MR stage]
Background (Hive)
• 2013: the Hive community started work on Hive on Tez
• Tez: DAG execution graph
[Diagram: the same Hive query as a Tez DAG, with reducers feeding reducers directly rather than separate MR jobs writing to HDFS]
Background (Spark)
• Generalized distributed processing framework created in ~2011 by UC Berkeley AMPLab
• Many advantages (community, ease of use); on track to succeed MapReduce
Background (Spark)
• Community momentum:
  • Already the most active project in the Hadoop ecosystem
  • June 2014: 255 contributors from 50 companies
  • First half of 2014: ~1200 commits, 250,000 LOC changed
• Integration with many Hadoop components, e.g. Pig, Flume, Mahout, Crunch, Solr, and now Hive
Background (Spark)
• Clean programming abstraction: Resilient Distributed Dataset (RDD):
  • A fault-tolerant dataset that can be a stage in a data pipeline
  • Created from an existing data set such as an HDFS file, or by transformation from another RDD (RDDs chain up)
  • Expressive APIs, much richer than MapReduce
    • Transformations: map, filter, groupBy
    • Actions: cache, save
• => More efficient representation of Hive queries
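As an illustration of the RDD idea, here is a minimal single-machine sketch (a toy class, not Spark's API): transformations are recorded lazily and the chain only runs when an action is called.

```python
class ToyRDD:
    """A toy, single-process stand-in for Spark's RDD: transformations
    are chained up lazily and only evaluated when an action runs."""
    def __init__(self, data, ops=()):
        self.data, self.ops = data, ops

    def map(self, fn):            # transformation: just record it
        return ToyRDD(self.data, self.ops + (("map", fn),))

    def filter(self, pred):       # transformation: just record it
        return ToyRDD(self.data, self.ops + (("filter", pred),))

    def collect(self):            # action: now the whole pipeline runs
        out = self.data
        for kind, fn in self.ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = ToyRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(rdd.collect())   # [0, 4, 16, 36, 64]
```

Because the chain is a single in-memory pipeline rather than separate jobs, a multi-operator Hive query maps onto it without per-stage disk round trips.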
Hive on Spark
• Shark project:
  • AMPLab GitHub project, a fork of Hive
  • Not maintained by the Hive community; sunsetted in 2014
• Hive on Spark:
  • Done in the Hive community
  • Architecturally compatible: keeps the same physical abstraction for Hive on Spark as for Hive on Tez/MR
  • Eases code maintenance
  • Maximizes re-use of common functionality across execution engines
Hive on Spark
• Hive on Spark, user benefits:
  • Another seamless execution option (MR, Tez, Spark)
  • Leverage Spark clusters coming into use for ML, graph processing, streaming, etc.
  • Continued efficiency and performance improvements via the strong Spark community
High-Level Design

[Diagram: Hive Query → Logical Op Tree → TaskCompiler → Work → Task. The TaskCompiler is specialized per engine: MapRedCompiler produces MapRedWork/MapRedTask, TezCompiler produces TezWork/TezTask, SparkCompiler produces SparkWork/SparkTask.]

Common across engines:
• HQL syntax
• Tool integrations (auditing plugins, authorization, Drivers, Thrift clients, UDF, StorageHandler)
• Logical optimizations
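The per-engine split can be sketched as a small class hierarchy. This is a hypothetical Python mirror of the diagram only; the real Hive compilers are Java classes, and the Work/Task payloads here are stand-in strings.

```python
class TaskCompiler:
    """Shared front end: each engine subclasses this and turns the
    engine-independent logical operator tree into engine-specific
    Work objects wrapped in a Task."""
    def compile(self, logical_op_tree):
        raise NotImplementedError

class MapRedCompiler(TaskCompiler):
    def compile(self, logical_op_tree):
        # MR may need several chained jobs for one operator tree.
        return {"task": "MapRedTask", "work": ["MapRedWork", "MapRedWork"]}

class TezCompiler(TaskCompiler):
    def compile(self, logical_op_tree):
        return {"task": "TezTask", "work": ["TezWork"]}

class SparkCompiler(TaskCompiler):
    def compile(self, logical_op_tree):
        return {"task": "SparkTask", "work": ["SparkWork"]}

# Everything above the TaskCompiler (parsing, logical optimization,
# tool integrations) is shared; only the compiler choice differs.
compilers = {"mr": MapRedCompiler(), "tez": TezCompiler(), "spark": SparkCompiler()}
plan = compilers["spark"].compile(["TableScan", "Filter", "Select"])
print(plan["task"])   # SparkTask
```

Keeping Spark behind the same TaskCompiler/Work/Task seam is what the slide means by "architecturally compatible" with Tez and MR.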
Simple Example

Hive Query:
SELECT COUNT(*) FROM status_updates WHERE ds = '2014-10-01' GROUP BY region;

Operator Tree:
TableScan (status_updates) → Filter (ds = '2014-10-01') → Select (region) → Group-By (count) → Select

• The Group-By triggers a reduce boundary
Simple Example

MapRed Work Tree: Map → Reduce
• Mapper: TableScan → Filter → Select → Group-By → ReduceSink
• Shuffle-sort
• Reducer: GroupBy → Select → FileOutput
Simple Example

Spark Work Tree: RDD chain
• mapPartition(): TableScan → Filter → Select → Group-By → ReduceSink
• groupBy() (no sorting)
• mapPartition(): GroupBy → Select → FileOutput
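The "no sorting" point can be shown in plain Python: hash-based grouping produces the per-region counts without ever globally sorting the rows, unlike the MR shuffle-sort. A hypothetical sketch, not Hive or Spark code.

```python
from collections import defaultdict

def group_by_count(rows, ds):
    """Hash-based grouping, no sort required. Mirrors the Spark work
    tree for the example query: scan -> filter -> select -> group-by."""
    counts = defaultdict(int)              # groupBy() via hashing
    for row in rows:                       # TableScan
        if row["ds"] == ds:                # Filter (ds = '2014-10-01')
            counts[row["region"]] += 1     # Select(region) + Group-By(count)
    return dict(counts)

status_updates = [
    {"ds": "2014-10-01", "region": "us"},
    {"ds": "2014-10-01", "region": "eu"},
    {"ds": "2014-10-01", "region": "us"},
    {"ds": "2014-09-30", "region": "us"},   # filtered out
]
region_counts = group_by_count(status_updates, "2014-10-01")
print(region_counts)   # {'us': 2, 'eu': 1}
```

Skipping the sort matters because the query never asked for ordered output; the MR plan pays for a shuffle-sort anyway, while the Spark plan does not.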
Join Example

Hive Query:
SELECT * FROM (SELECT * FROM src WHERE src.key < 10) src1 JOIN (SELECT * FROM src WHERE src.key < 10) src2 ORDER BY src1.key;

Operator Tree:
• Two branches of TableScan → Filter → Select feed a Join → Select → Sort → Select
• The Join and the Sort each trigger a reduce boundary
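A pure-Python sketch of what this operator tree computes (hypothetical helper names; since the query as transcribed has no ON clause, the sketch treats the join as a cross join of the two filtered branches):

```python
def run_join_example(src):
    """Mirror of the join example's operator tree: two filtered scans
    of src, joined, then ordered by src1.key."""
    src1 = [r for r in src if r["key"] < 10]   # TableScan + Filter, branch 1
    src2 = [r for r in src if r["key"] < 10]   # TableScan + Filter, branch 2
    # Join (reduce boundary 1): no join condition in the transcribed
    # query, so pair every src1 row with every src2 row.
    joined = [(a, b) for a in src1 for b in src2]
    # Sort (reduce boundary 2): ORDER BY src1.key.
    return sorted(joined, key=lambda ab: ab[0]["key"])

src = [{"key": 5, "value": "v5"},
       {"key": 12, "value": "v12"},   # filtered out by key < 10
       {"key": 2, "value": "v2"}]
result = run_join_example(src)
print([(a["key"], b["key"]) for a, b in result])
# [(2, 5), (2, 2), (5, 5), (5, 2)]
```

Each of the two reduce boundaries in this pipeline becomes a separate shuffle in the physical plans that follow.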
Join Example

MapRed Work Tree: 2 MapReduce works
• First MR:
  • Map: TableScan → Filter → Select → ReduceSink (one map per input branch)
  • Shuffle-sort
  • Reduce: Join → Select → FileOutput
• Disk IO: intermediate output written to HDFS between the two jobs
• Second MR:
  • Map: TableScan → ReduceSink (Sort)
  • Shuffle-sort
  • Reduce: Select → FileOutput
Join Example

Spark Work Tree: RDD transform chain
• mapPartition(): TableScan → Filter → Select → ReduceSink (one per input branch)
• union(), then Partition/Sort()
• mapPartition(): Join → Select → ReduceSink
• sortBy() (no spill to disk)
• mapPartition(): Select → FileOutput
Improvements to Spark
• Reduce-side join: SPARK-2978
  • Spark had group() and sort(), but not partition+sort like the MR-style shuffle-sort
  • Can help other apps migrate from MapReduce to Spark
• Remote Spark context (pushed down to the AM)
  • SparkContext is not allowed to run concurrently in the client application process
  • SparkContext is heavy-weight
• Spark monitoring APIs
• Elastic scaling of Spark applications: SPARK-3174
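The partition+sort primitive from SPARK-2978 (surfaced in Spark as repartitionAndSortWithinPartitions) can be sketched in plain Python: hash-partition the key-value pairs, then sort within each partition only, with no global sort across partitions. This is a single-machine illustration of the semantics, not Spark code.

```python
from operator import itemgetter

def partition_and_sort(pairs, num_partitions):
    """MR-style shuffle-sort semantics: route each pair to a partition
    by key hash, then sort each partition by key independently."""
    partitions = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        partitions[hash(k) % num_partitions].append((k, v))  # partition
    for p in partitions:
        p.sort(key=itemgetter(0))            # sort within each partition
    return partitions

pairs = [(3, "c"), (1, "a"), (2, "b"), (1, "aa"), (3, "cc")]
parts = partition_and_sort(pairs, 2)
# All values for a key land in one partition, in key order, so a
# reduce-side join can stream each partition MapReduce-style.
print(parts)
```

Because each reducer sees its keys already grouped and ordered, MR-ported code (like Hive's reduce-side join operators) runs on Spark without buffering a whole partition in memory.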
Community
• Thanks to contributors from many organizations
• Follow our progress on HIVE-7292
• Thank you!