October 2014 HUG: Hive on Spark
TRANSCRIPT
Hive on Spark
Szehon Ho, Software Engineer at Cloudera, Apache Hive Committer
October 2014
Background (Hive)
• Apache Hive: a data query and management tool for distributed datasets, exposed via a SQL-like query language called HiveQL
Background (Hive)
• 2007-2013: MapReduce was the only distributed processing engine
• Map() and Reduce() primitives are not designed for long data pipelines
• Complex SQL-like queries are inefficiently expressed as many MR stages
• Disk IO between MR stages
• Shuffle-sort between each Map and Reduce
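To make that cost concrete, here is a pure-Python, single-machine sketch (all names hypothetical, not Hive code) of two chained MapReduce stages. Each stage shuffle-sorts by key, reduces, and then writes its output to a file that the next stage must read back, mimicking the per-stage disk IO described above.

```python
import json
import os
import tempfile
from itertools import groupby
from operator import itemgetter

def mr_stage(records, map_fn, reduce_fn):
    """One MapReduce stage: map, shuffle-sort by key, reduce, then
    persist the result to a file (standing in for HDFS) that the
    next stage must read back: the disk IO between MR stages."""
    mapped = [kv for rec in records for kv in map_fn(rec)]
    mapped.sort(key=itemgetter(0))                        # shuffle-sort
    reduced = [reduce_fn(k, [v for _, v in grp])
               for k, grp in groupby(mapped, key=itemgetter(0))]
    with tempfile.NamedTemporaryFile("w", suffix=".json",
                                     delete=False) as f:  # "HDFS" write
        json.dump(reduced, f)
        path = f.name
    with open(path) as f:                                 # "HDFS" read
        result = json.load(f)
    os.unlink(path)
    return result

# Stage 1 counts words; stage 2 regroups the counts by count parity.
lines = ["a b a", "b c"]
counts = mr_stage(lines,
                  lambda line: [(w, 1) for w in line.split()],
                  lambda k, vs: (k, sum(vs)))
by_parity = mr_stage(counts,
                     lambda kv: [(kv[1] % 2, kv[0])],
                     lambda k, vs: (k, sorted(vs)))
print(counts)      # [['a', 2], ['b', 2], ['c', 1]]
print(by_parity)   # [[0, ['a', 'b']], [1, ['c']]]
```

A single query compiled to many such stages pays the write-then-read round trip at every boundary, which is the inefficiency the slide calls out.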
[Diagram: a Hive query compiled into a chain of Map()/Reduce() stages, with HDFS writes and reads between each MR stage]
Background (Hive)
• 2013: the Hive community started work on Hive on Tez
• Tez: DAG execution graph
[Diagram: the same Hive query as a Tez DAG, with reducers feeding reducers directly rather than separate MR jobs writing to HDFS]
Background (Spark)
• Generalized distributed processing framework created in ~2011 by UC Berkeley AMPLab
• Many advantages (community, ease of use); on track to succeed MapReduce
Background (Spark)
• Community momentum:
  • Already the most active project in the Hadoop ecosystem
  • June 2014: 255 contributors from 50 companies
  • First half of 2014: ~1200 commits, 250,000 LOC changed
• Integration with many Hadoop components, e.g. Pig, Flume, Mahout, Crunch, Solr, and now Hive
Background (Spark)
• Clean programming abstraction: Resilient Distributed Dataset (RDD):
  • A fault-tolerant dataset that can be a stage in a data pipeline
  • Created from an existing data set such as an HDFS file, or by transformation from another RDD (RDDs chain up)
  • Expressive APIs, much richer than MapReduce
    • Transformations: map, filter, groupBy
    • Actions: cache, save
• => More efficient representation of Hive queries
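As an illustration of the RDD idea, here is a minimal single-machine sketch (a toy class, not Spark's API): transformations are recorded lazily and the chain only runs when an action is called.

```python
class ToyRDD:
    """A toy, single-process stand-in for Spark's RDD: transformations
    are chained up lazily and only evaluated when an action runs."""
    def __init__(self, data, ops=()):
        self.data, self.ops = data, ops

    def map(self, fn):            # transformation: just record it
        return ToyRDD(self.data, self.ops + (("map", fn),))

    def filter(self, pred):       # transformation: just record it
        return ToyRDD(self.data, self.ops + (("filter", pred),))

    def collect(self):            # action: now the whole pipeline runs
        out = self.data
        for kind, fn in self.ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = ToyRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(rdd.collect())   # [0, 4, 16, 36, 64]
```

Because the chain is a single in-memory pipeline rather than separate jobs, a multi-operator Hive query maps onto it without per-stage disk round trips.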
Hive on Spark
• Shark project:
  • AMPLab GitHub project, a fork of Hive
  • Not maintained by the Hive community; sunsetted in 2014
• Hive on Spark:
  • Done in the Hive community
  • Architecturally compatible: keeps the same physical abstraction for Hive on Spark as for Hive on Tez/MR
  • Eases code maintenance
  • Maximizes re-use of common functionality across execution engines
Hive on Spark
• Hive on Spark, user benefits:
  • Another seamless execution option (MR, Tez, Spark)
  • Leverage Spark clusters coming into use for ML, graph processing, streaming, etc.
  • Continued efficiency and performance improvements via the strong Spark community
High-Level Design

[Diagram: Hive Query → Logical Op Tree → TaskCompiler → Work → Task. The TaskCompiler is specialized per engine: MapRedCompiler produces MapRedWork/MapRedTask, TezCompiler produces TezWork/TezTask, SparkCompiler produces SparkWork/SparkTask.]

Common across engines:
• HQL syntax
• Tool integrations (auditing plugins, authorization, Drivers, Thrift clients, UDF, StorageHandler)
• Logical optimizations
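The per-engine split can be sketched as a small class hierarchy. This is a hypothetical Python mirror of the diagram only; the real Hive compilers are Java classes, and the Work/Task payloads here are stand-in strings.

```python
class TaskCompiler:
    """Shared front end: each engine subclasses this and turns the
    engine-independent logical operator tree into engine-specific
    Work objects wrapped in a Task."""
    def compile(self, logical_op_tree):
        raise NotImplementedError

class MapRedCompiler(TaskCompiler):
    def compile(self, logical_op_tree):
        # MR may need several chained jobs for one operator tree.
        return {"task": "MapRedTask", "work": ["MapRedWork", "MapRedWork"]}

class TezCompiler(TaskCompiler):
    def compile(self, logical_op_tree):
        return {"task": "TezTask", "work": ["TezWork"]}

class SparkCompiler(TaskCompiler):
    def compile(self, logical_op_tree):
        return {"task": "SparkTask", "work": ["SparkWork"]}

# Everything above the TaskCompiler (parsing, logical optimization,
# tool integrations) is shared; only the compiler choice differs.
compilers = {"mr": MapRedCompiler(), "tez": TezCompiler(), "spark": SparkCompiler()}
plan = compilers["spark"].compile(["TableScan", "Filter", "Select"])
print(plan["task"])   # SparkTask
```

Keeping Spark behind the same TaskCompiler/Work/Task seam is what the slide means by "architecturally compatible" with Tez and MR.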
Simple Example

Hive Query:
SELECT COUNT(*) FROM status_updates WHERE ds = '2014-10-01' GROUP BY region;

Operator Tree:
TableScan (status_updates) → Filter (ds = '2014-10-01') → Select (region) → Group-By (count) → Select

• The Group-By triggers a reduce boundary
Simple Example

MapRed Work Tree: Map → Reduce
• Mapper: TableScan → Filter → Select → Group-By → ReduceSink
• Shuffle-sort
• Reducer: GroupBy → Select → FileOutput
Simple Example

Spark Work Tree: RDD chain
• mapPartition(): TableScan → Filter → Select → Group-By → ReduceSink
• groupBy() (no sorting)
• mapPartition(): GroupBy → Select → FileOutput
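The "no sorting" point can be shown in plain Python: hash-based grouping produces the per-region counts without ever globally sorting the rows, unlike the MR shuffle-sort. A hypothetical sketch, not Hive or Spark code.

```python
from collections import defaultdict

def group_by_count(rows, ds):
    """Hash-based grouping, no sort required. Mirrors the Spark work
    tree for the example query: scan -> filter -> select -> group-by."""
    counts = defaultdict(int)              # groupBy() via hashing
    for row in rows:                       # TableScan
        if row["ds"] == ds:                # Filter (ds = '2014-10-01')
            counts[row["region"]] += 1     # Select(region) + Group-By(count)
    return dict(counts)

status_updates = [
    {"ds": "2014-10-01", "region": "us"},
    {"ds": "2014-10-01", "region": "eu"},
    {"ds": "2014-10-01", "region": "us"},
    {"ds": "2014-09-30", "region": "us"},   # filtered out
]
region_counts = group_by_count(status_updates, "2014-10-01")
print(region_counts)   # {'us': 2, 'eu': 1}
```

Skipping the sort matters because the query never asked for ordered output; the MR plan pays for a shuffle-sort anyway, while the Spark plan does not.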
Join Example

Hive Query:
SELECT * FROM (SELECT * FROM src WHERE src.key < 10) src1 JOIN (SELECT * FROM src WHERE src.key < 10) src2 ORDER BY src1.key;

Operator Tree:
• Two branches of TableScan → Filter → Select feed a Join → Select → Sort → Select
• The Join and the Sort each trigger a reduce boundary
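A pure-Python sketch of what this operator tree computes (hypothetical helper names; since the query as transcribed has no ON clause, the sketch treats the join as a cross join of the two filtered branches):

```python
def run_join_example(src):
    """Mirror of the join example's operator tree: two filtered scans
    of src, joined, then ordered by src1.key."""
    src1 = [r for r in src if r["key"] < 10]   # TableScan + Filter, branch 1
    src2 = [r for r in src if r["key"] < 10]   # TableScan + Filter, branch 2
    # Join (reduce boundary 1): no join condition in the transcribed
    # query, so pair every src1 row with every src2 row.
    joined = [(a, b) for a in src1 for b in src2]
    # Sort (reduce boundary 2): ORDER BY src1.key.
    return sorted(joined, key=lambda ab: ab[0]["key"])

src = [{"key": 5, "value": "v5"},
       {"key": 12, "value": "v12"},   # filtered out by key < 10
       {"key": 2, "value": "v2"}]
result = run_join_example(src)
print([(a["key"], b["key"]) for a, b in result])
# [(2, 5), (2, 2), (5, 5), (5, 2)]
```

Each of the two reduce boundaries in this pipeline becomes a separate shuffle in the physical plans that follow.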
Join Example

MapRed Work Tree: 2 MapReduce works
• First MR:
  • Map: TableScan → Filter → Select → ReduceSink (one map per input branch)
  • Shuffle-sort
  • Reduce: Join → Select → FileOutput
• Disk IO: intermediate output written to HDFS between the two jobs
• Second MR:
  • Map: TableScan → ReduceSink (Sort)
  • Shuffle-sort
  • Reduce: Select → FileOutput
Join Example

Spark Work Tree: RDD transform chain
• mapPartition(): TableScan → Filter → Select → ReduceSink (one per input branch)
• union(), then Partition/Sort()
• mapPartition(): Join → Select → ReduceSink
• sortBy() (no spill to disk)
• mapPartition(): Select → FileOutput
Improvements to Spark
• Reduce-side join: SPARK-2978
  • Spark had group() and sort(), but not partition+sort like the MR-style shuffle-sort
  • Can help other apps migrate from MapReduce to Spark
• Remote Spark context (pushed down to the AM)
  • SparkContext is not allowed to run concurrently in the client application process
  • SparkContext is heavy-weight
• Spark monitoring APIs
• Elastic scaling of Spark applications: SPARK-3174
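The partition+sort primitive from SPARK-2978 (surfaced in Spark as repartitionAndSortWithinPartitions) can be sketched in plain Python: hash-partition the key-value pairs, then sort within each partition only, with no global sort across partitions. This is a single-machine illustration of the semantics, not Spark code.

```python
from operator import itemgetter

def partition_and_sort(pairs, num_partitions):
    """MR-style shuffle-sort semantics: route each pair to a partition
    by key hash, then sort each partition by key independently."""
    partitions = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        partitions[hash(k) % num_partitions].append((k, v))  # partition
    for p in partitions:
        p.sort(key=itemgetter(0))            # sort within each partition
    return partitions

pairs = [(3, "c"), (1, "a"), (2, "b"), (1, "aa"), (3, "cc")]
parts = partition_and_sort(pairs, 2)
# All values for a key land in one partition, in key order, so a
# reduce-side join can stream each partition MapReduce-style.
print(parts)
```

Because each reducer sees its keys already grouped and ordered, MR-ported code (like Hive's reduce-side join operators) runs on Spark without buffering a whole partition in memory.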
Community
• Thanks to contributors from many organizations
• Follow our progress on HIVE-7292
• Thank you!