trihug feb: hive on spark

Hive on Spark

Szehon Ho // Cloudera Software Engineer, Apache Hive PMC

2 © 2014 Cloudera, Inc. All rights reserved.

Background (Hive) •  Apache Hive: SQL-based data query and management tool for a

distributed dataset •  Founded in 2007 at Facebook, most of our customers run Hive

jobs in production.


Background (Hive) •  Inflexibility of MapReduce framework => Inefficient Hive

•  Map(), Reduce() primitives, not designed for long data pipelines •  Complex SQL-like queries inefficiently expressed as many MR stages.

•  Disk IO between MR’s •  Shuffle-sort between M+R

Map() Red()

Hive Query

Map() Red() Map() Red()

HDFS


Background (Hive) •  2013 Hive Community started work on Hive on Tez

•  Tez DAG execution graph

Map() Red()

Hive Query

Map() Red()

Red()

HDFS


Background (Spark) •  Generalized distributed processing framework created in ~2011 by

UC Berkeley AMPLab •  Popular framework, heading to succeed MapReduce


Background (Spark) •  Clean programming abstrac:on: Resilient Distributed Dataset (RDD):

•  A fault-‐tolerant dataset, can be a stage in a data pipeline. •  Created from exis:ng data set like HDFS file, or transforma:on from other RDD (chain-‐up RDD’s)

•  Expressive API’s, much more than MapReduce •  Transforma:ons: map, filter, groupBy •  Ac:ons: cache, save

•  => More efficient representa:on of Hive queries


Background (Spark) •  Community Momentum:

•  Spark Summit 2014: Already the most active project in Hadoop ecosystem, top 3 most active Apache projects.

•  Since Spark 1.0 in June, two more biggest releases 1.1, 1.2

Compared to Other Projects

Map

Redu

ce

YARN

HDFS

St

orm

Spar

k

0

200

400

600

800

1000

1200

1400

Map

Redu

ce

YARN

HDFS

St

orm

Spar

k 0

50000

100000

150000

200000

250000

300000

Commits Lines of Code Changed

Activity in past 6 months

Compared to Other Projects

Map

Redu

ce

YARN

HDFS

St

orm

Spar

k

0

200

400

600

800

1000

1200

1400

Map

Redu

ce

YARN

HDFS

St

orm

Spar

k 0

50000

100000

150000

200000

250000

300000

Commits Lines of Code Changed

Activity in past 6 months


Background (Spark) •  Community Momentum:

•  Advanced analytics, data science, ML, graph processing, etc. •  Integration from with many Hadoop tools, ie Pig, Flume, Mahout, Crunch, Solr •  Hive jobs can now leverage these Spark clusters as well


Hive on Spark •  Shark Project:

•  AMPLab github project, fork of Hive • Not maintained by Hive community, sunseUed 2014

• Hive on Spark: • Done in Hive community •  Architecturally compa:ble, by keeping same physical abstrac:on for Hive on Spark as Hive on Tez/MR.

•  Code maintenance •  Maximize re-‐use of common func:onality across execu:on engine


High-Level Design

10

Hive Query

Logical Op Tree

Task

TaskCompiler

Work

MapRedTask

MapWork

TezTask SparkTask

Common across engines: •  HQL syntax •  Tool Integrations (auditing plugins,

authorization, Drivers, Thrift clients, UDF, StorageHandler)

•  Logical optimizations

ReduceWork

MapWork

ReduceWork

MapWork MapWk

RedWk

MapWk

SparkCompiler MapRedCompiler TezCompiler


Simple Example

11

SELECT COUNT(*) from status_updates where ds = ‘2014-10-01’ group by region;

TableScan (status_updates)

Filter (ds=‘2014 10-01’)

Select (region)

Group-By (count)

Select

Operator Tree:

Hive Query:

GBY trigger reduce-boundary:


Simple Example

12

Reducer GroupBy

Select FileOutput

Mapper TableScan

Filter

Select

Group-By

ReduceSink

MapRed Work Tree •  Map->Reduce

ShuffleSort


Simple Example

13

mapPartition() GroupBy

Select

FileOutput

mapPartition() TableScan

Filter

Select

Group-By

ReduceSink

Spark Work Tree: •  RDD Chain

groupBy()

No sorting


Join Example TableScan

Filter

Select

Join

Select

Sort

Select

TableScan

Filter

Select

SELECT * FROM (SELECT key FROM src WHERE src.key < 10) src1 JOIN (SELECT key FROM src WHERE src.key < 10) src2 ON src1.key = src2.key ORDER BY src1.key;

Hive Query:


Join Example

Map

ReduceSink (Sort)

TableScan

Map TableScan

Filter Select

Reduce Sink Reduce Join

Select

FileOutput

Reduce

FileOutput

Select Map

TableScan Filter Select

Reduce Sink

HDFS

ShuffleSort ShuffleSort

Disk IO

MapRed Work Tree •  2 MapReduce Works


Join Example

mapPartition() Join

Select

Reduce Sink

mapPartition()

FileOutput

Select

union() Partition/ Sort()

sortBy()

No spill to disk


Filter Select

Reduce Sink


Filter Select

Reduce Sink

Spark Work Tree: RDD Transform Chain


Demo


Improvements to Spark •  Largest MR Java app ported on to Spark, can serve as reference.

•  Spark Umbrella JIRA for improvements needed by Hive: SPARK-‐3145 •  Implement Java version of Scala API’s (various), shade Spark Guava Library: SPARK-‐2848 • Monitoring API’s (SPARK-‐2636, various) •  Shuffle-‐Sort Transform: SPARK-‐2978

•  Spark had group(), sort(), but not par::on+sort like MR-‐style shuffle-‐sort. •  Elas:c scaling of Spark applica:on: SPARK-‐3174


Community •  Thanks to contributors from many organiza:ons:

•  Follow our progress on HIVE-‐7292 •  Thank you!

Thank you.

trihug feb: hive on spark

Technology