october 2014 hug : hive on spark

18
1 Headline Goes Here Speaker Name or Subhead Goes Here DO NOT USE PUBLICLY PRIOR TO 10/23/12 Hive on Spark Szehon Ho Software Engineer at Cloudera, Apache Hive Committer October 2014

Upload: yahoo-developer-network

Post on 12-Apr-2017

4.259 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: October 2014 HUG : Hive On Spark

1

Headline Goes HereSpeaker Name or Subhead Goes Here

DO NOT USE PUBLICLY PRIOR TO 10/23/12Hive on Spark

Szehon Ho Software Engineer at Cloudera, Apache Hive CommitterOctober 2014

Page 2: October 2014 HUG : Hive On Spark

2

Background (Hive)

• Apache Hive: a data query and management tool for a distributed dataset, exposed via a SQL-like query language called HiveQL

Page 3: October 2014 HUG : Hive On Spark

3

Background (Hive)

• 2007-2013, MapReduce = only distributed processing engine• Map(), Reduce() primitives, not designed for long data pipelines• Complex SQL-like queries inefficiently expressed as many MR

stages.• Disk IO between MR’s• Shuffle-sort between M+R

Map() Red()

Hive Query

Map() Red() Map() Red()

HDFS

Page 4: October 2014 HUG : Hive On Spark

4

Background (Hive)

• 2013 Hive Community started work on Hive on Tez • Tez DAG execution graph

Map() Red()

Hive Query

Map() Red()Red()

HDFS

Page 5: October 2014 HUG : Hive On Spark

5

Background (Spark)

• Generalized distributed processing framework created in ~2011 by UC Berkeley AMPLab

• Many advantages (community, ease-of-use), heading to succeed MapReduce

Page 6: October 2014 HUG : Hive On Spark

6

Background (Spark)

• Community Momentum:• Already the most active project in Hadoop ecosystem

• June 2014: 255 contributors from 50 companies• First half of 2014: ~1200 commits, 250000 LOC changed

• Integration from with many Hadoop components, ie Pig, Flume, Mahout, Crunch, Solr, now Hive.

Page 7: October 2014 HUG : Hive On Spark

7

Background (Spark)

• Clean programming abstraction: Resilient Distributed Dataset (RDD):• A fault-tolerant dataset, can be a stage in a data pipeline.• Created from existing data set like HDFS file, or

transformation from other RDD (chain-up RDD’s)• Expressive API’s, much more than MapReduce

• Transformations: map, filter, groupBy• Actions: cache, save

• => More efficient representation of Hive queries

Page 8: October 2014 HUG : Hive On Spark

8

Hive on Spark

• Shark Project:• AMPLab github project, fork of Hive• Not maintained by Hive community, sunsetted 2014

• Hive on Spark:• Done in Hive community• Architecturally compatible, by keeping same physical abstraction for Hive on

Spark as Hive on Tez/MR.• Code maintenance• Maximize re-use of common functionality across execution engine

Page 9: October 2014 HUG : Hive On Spark

9

Hive on Spark

• Hive on Spark, User Benefits• Another seamless execution option (MR, Tez, Spark)• Leverage Spark clusters coming in use for ML, Graph Processing,

Streaming, etc.• Continued efficiency, performance improvements via strong Spark

community.

Page 10: October 2014 HUG : Hive On Spark

10

SparkCompilerMapRedCompiler TezCompiler

High-Level Design

Hive Query

Logical Op Tree

Task

TaskCompiler

Work

MapRedTask

MapRedWork

TezTask SparkTask

Common across engines: • HQL syntax• Tool Integrations (auditing plugins, authorization,

Drivers, Thrift clients, UDF, StorageHandler)• Logical optimizations

MapRedWork

TezWork

TezWork

TezWork SparkWk

SparkWk

SparkWk

Page 11: October 2014 HUG : Hive On Spark

11

Simple ExampleSELECT COUNT(*) from status_updates where ds = ‘2014-10-01’ group by region;

TableScan (status_updates)

Filter (ds=‘2014 10-01’)

Select (region)

Group-By (count)

Select

Operator Tree:

Hive Query:

GBY trigger reduce-boundary:

Page 12: October 2014 HUG : Hive On Spark

12

Simple Example

ReducerGroupBy

SelectFileOutput

MapperTableScan

Filter

Select

Group-By

ReduceSink

MapRed Work Tree• Map->Reduce

ShuffleSort

Page 13: October 2014 HUG : Hive On Spark

13

Simple Example

mapPartition()GroupBy

Select

FileOutput

mapPartition()TableScan

Filter

Select

Group-By

ReduceSink

Spark Work Tree:• RDD Chain

groupBy()

No sorting

Page 14: October 2014 HUG : Hive On Spark

14

Join ExampleTableScan

Filter

Select

Join

Select

Sort

Select

TableScan

Filter

Select

SELECT * FROM (SELECT * FROM src WHERE src.key < 10) src1 JOIN (SELECT * FROM src WHERE src.key < 10) src2 ORDER BY src1.key;

• Operator Tree:• Join/Sort trigger Reduce

boundary

Hive Query:

Page 15: October 2014 HUG : Hive On Spark

15

Join Example

Map

ReduceSink (Sort)

TableScan

MapTableScan

Filter

Select

Reduce Sink ReduceJoin

Select

FileOutput

Reduce

FileOutput

Select

MapTableScan

FilterSelect

Reduce Sink

HDFS

ShuffleSort ShuffleSort

Disk IO

MapRed Work Tree• 2 MapReduce Works

Page 16: October 2014 HUG : Hive On Spark

16

Join Example

mapPartition()Join

Select

Reduce Sink

mapPartition()

FileOutput

Select

union() Partition/Sort()

sortBy()

No spill to disk

mapPartition()TableScan

FilterSelect

Reduce Sink

mapPartition()TableScan

FilterSelect

Reduce Sink

Spark Work Tree:RDD Transform Chain

Page 17: October 2014 HUG : Hive On Spark

17

Improvements to Spark

• Reduce-side join: SPARK-2978• Spark had group(), sort(), but not partition+sort like MR-style shuffle-sort.• Can help other apps migrate from Map-Reduce to Spark

• Remote Spark-context (push down to AM)• SparkContext is not allowed concurrently in client application process.• SparkContext is heavy-weight

• Spark Monitoring API’s• Elastic scaling of Spark application: SPARK-3174

Page 18: October 2014 HUG : Hive On Spark

18

Community

• Thanks to contributors from many organizations:

• Follow our progress on HIVE-7292• Thank you!