Apache Tez : Accelerating Hadoop Query Processing

Jeff Markham, Technical Director, APAC, Hortonworks

Upload: teddy-choi

Post on 27-Jan-2015


DESCRIPTION

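Jeff Markham, Technical Director for APAC at Hortonworks, introduces Tez. Tez is software that replaces MapReduce to accelerate query processing on Hadoop. The talk covers why Tez was created, how it is structured, how optimization is carried out, and how much performance has improved.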

TRANSCRIPT

Page 1: Apache Tez : Accelerating Hadoop Query Processing

Apache Tez : Accelerating Hadoop Query Processing

Jeff Markham Technical Director, APAC Hortonworks

Page 2: Apache Tez : Accelerating Hadoop Query Processing

© Hortonworks Inc. 2013

Tez – Introduction

• Distributed execution framework targeted towards data-processing applications.

• Based on expressing a computation as a dataflow graph.

• Built on top of YARN – the resource management framework for Hadoop.

• Open source Apache incubator project and Apache licensed.

Page 3: Apache Tez : Accelerating Hadoop Query Processing

YARN: Taking Hadoop Beyond Batch

HADOOP 1.0 (MapReduce as Base)

• Pig (data flow), Hive (SQL), Others (Cascading)
• MapReduce (cluster resource management & data processing)
• HDFS (redundant, reliable storage)

HADOOP 2.0 (Apache Tez as Base)

• Data flow: Pig; SQL: Hive; Others (Cascading)
• Tez (execution engine)
• Batch: MapReduce
• Real-time stream processing: Storm
• Online data processing: HBase, Accumulo
• YARN (cluster resource management)
• HDFS2 (redundant, reliable storage)

Page 4: Apache Tez : Accelerating Hadoop Query Processing

Apache Tez (“Speed”)

• Replaces MapReduce as primitive for Pig, Hive, Cascading etc.
– Smaller latency for interactive queries
– Higher throughput for batch queries
– 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft

YARN ApplicationMaster to run DAG of Tez Tasks

Task with pluggable Input, Processor and Output

Tez Task - <Input, Processor, Output>

Page 5: Apache Tez : Accelerating Hadoop Query Processing

Tez: Building blocks for scalable data processing

• Classical ‘Map’: HDFS Input → Map Processor → Sorted Output
• Classical ‘Reduce’: Shuffle Input → Reduce Processor → HDFS Output
• Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input → Reduce Processor → Sorted Output

Page 6: Apache Tez : Accelerating Hadoop Query Processing

Hive-on-MR vs. Hive-on-Tez

SELECT a.x, AVERAGE(b.y) AS avg
FROM a JOIN b ON (a.id = b.id)
GROUP BY a
UNION
SELECT x, AVERAGE(y) AS AVG
FROM c
GROUP BY x
ORDER BY AVG;

[Figure: the same query plan shown twice. Hive – MR: a chain of separate map (M) and reduce (R) jobs, each writing its result to HDFS before the next job starts. Stage labels: SELECT a.state; SELECT b.id; JOIN (a, c) SELECT c.price; JOIN (a, b) GROUP BY a.state; COUNT(*), AVERAGE(c.price). Hive – Tez: a single graph of map and reduce stages with no HDFS writes in between. Stage labels: SELECT a.state, c.itemId; SELECT b.id; JOIN (a, c); JOIN (a, b) GROUP BY a.state; COUNT(*), AVERAGE(c.price).]

Tez avoids unneeded writes to HDFS
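The difference in HDFS writes can be illustrated with a toy count. This is only a sketch: the four-stage plan and its stage names are hypothetical, and the counting rule (one MapReduce job per stage, each persisting its output) is a simplifying assumption rather than a description of how Hive actually compiles the query above.

```python
from typing import Dict, List

# Hypothetical plan: each key is a stage, each value lists the stages
# that consume its output. The names are illustrative only.
PLAN: Dict[str, List[str]] = {
    "join_a_b_group_by": ["join_a_c"],
    "join_a_c": ["aggregate"],
    "aggregate": ["order_by"],
    "order_by": [],
}


def hdfs_writes_mapreduce(plan: Dict[str, List[str]]) -> int:
    """Assume one MapReduce job per stage: every stage writes to HDFS."""
    return len(plan)


def hdfs_writes_dag(plan: Dict[str, List[str]]) -> int:
    """Assume a single DAG: only stages without a consumer write to HDFS."""
    return sum(1 for consumers in plan.values() if not consumers)


print(hdfs_writes_mapreduce(PLAN), hdfs_writes_dag(PLAN))
```

Under these assumptions the chained jobs persist four results while the single graph persists only the final one.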

Page 7: Apache Tez : Accelerating Hadoop Query Processing

Tez Sessions

… because Map/Reduce query startup is expensive

• Tez Sessions
– Hot containers ready for immediate use
– Removes task and job launch overhead (~5s – 30s)
• Hive
– Session launch/shutdown in background (seamless, user not aware)
– Submits query plan directly to Tez Session

Native Hadoop service, not ad-hoc
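A rough way to see why a session helps is to model the launch overhead as a cost paid at most once. The class below is a hypothetical toy, not the Tez client API; the 5-second overhead is taken from the low end of the range quoted on the slide, and the 2-second query time is made up.

```python
class ToySession:
    """Toy model of a session whose launch cost is paid at most once."""

    def __init__(self, launch_overhead_s: float) -> None:
        self.launch_overhead_s = launch_overhead_s
        self.warm = False

    def start(self) -> None:
        """Warm the session ahead of time, e.g. in the background."""
        self.warm = True

    def run_query(self, work_s: float) -> float:
        """Return the latency seen by the user for one query."""
        if self.warm:
            return work_s
        self.warm = True  # the first query pays for the launch
        return work_s + self.launch_overhead_s


prewarmed = ToySession(launch_overhead_s=5.0)
prewarmed.start()
cold = ToySession(launch_overhead_s=5.0)
print(prewarmed.run_query(2.0), cold.run_query(2.0), cold.run_query(2.0))
```

In this model only a cold session exposes the launch cost to the user, and only on its first query.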

Page 8: Apache Tez : Accelerating Hadoop Query Processing

Tez Delivers Interactive Query - Out of the Box!

Feature | Description | Benefit
Tez Session | Overcomes Map-Reduce job-launch latency by pre-launching Tez AppMaster | Latency
Tez Container Pre-Launch | Overcomes Map-Reduce latency by pre-launching hot containers ready to serve queries. | Latency
Tez Container Re-Use | Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out of box performance! | Latency
Runtime re-configuration of DAG | Runtime query tuning by picking aggregation parallelism using online query statistics | Throughput
Tez In-Memory Cache | Hot data kept in RAM for fast access. | Latency
Complex DAGs | Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput. | Throughput

Page 9: Apache Tez : Accelerating Hadoop Query Processing

Tez – Design Themes

• Empowering End Users
• Execution Performance

Page 10: Apache Tez : Accelerating Hadoop Query Processing

Tez – Empowering End Users

• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying deployment

Page 11: Apache Tez : Accelerating Hadoop Query Processing

Tez – Empowering End Users

• Expressive dataflow definition APIs
– Enable definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical plan at runtime.
– Targeted towards data processing applications like Hive/Pig but not limited to them. Hive/Pig query plans naturally map to Tez dataflow graphs with no translation impedance.

[Figure: a logical dataflow graph expanded into tasks TaskA-1, TaskA-2, TaskB-1, TaskB-2, TaskC-1, TaskC-2, TaskD-1, TaskD-2, TaskE-1 and TaskE-2.]
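The idea of defining a logical graph and letting the framework expand it into tasks can be sketched as follows. `LogicalDag` is a hypothetical stand-in written for this illustration and is not the Tez DAG API; the edges between vertices A to E are also assumed, since the transcript does not preserve the figure's topology.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class LogicalDag:
    """Hypothetical logical plan: named vertices plus directed edges."""

    vertices: Dict[str, int] = field(default_factory=dict)  # name -> parallelism
    edges: List[Tuple[str, str]] = field(default_factory=list)

    def add_vertex(self, name: str, parallelism: int) -> "LogicalDag":
        self.vertices[name] = parallelism
        return self

    def add_edge(self, src: str, dst: str) -> "LogicalDag":
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both vertices must be added before the edge")
        self.edges.append((src, dst))
        return self

    def expand(self) -> List[str]:
        """Expand each logical vertex into its numbered physical tasks."""
        return [
            f"Task{name}-{i}"
            for name, parallelism in self.vertices.items()
            for i in range(1, parallelism + 1)
        ]


dag = LogicalDag()
for vertex in "ABCDE":
    dag.add_vertex(vertex, parallelism=2)
# Assumed topology: A and B feed D, B and C feed E.
dag.add_edge("A", "D").add_edge("B", "D").add_edge("B", "E").add_edge("C", "E")
tasks = dag.expand()
print(tasks)
```

The user describes five vertices and four edges; the ten task names matching the figure come out of the expansion step.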

Page 12: Apache Tez : Accelerating Hadoop Query Processing

Tez – Empowering End Users

• Expressive dataflow definition APIs

[Figure: Distributed Sort. A Preprocessor Stage (Task-1, Task-2) sends Samples to a Sampler, which produces Ranges for the Partition Stage (Task-1, Task-2), followed by the Aggregate Stage (Task-1, Task-2).]
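The sample, range and partition steps named in the distributed sort figure can be sketched in a single process. This is a minimal illustration of sample-based range partitioning under assumed parameters (sample size, seed, non-empty inputs); it does not reflect how Tez or its clients implement the sort.

```python
import bisect
import random
from typing import List


def compute_ranges(samples: List[int], num_reducers: int) -> List[int]:
    """Sampler: pick num_reducers - 1 split points from the sorted samples."""
    ordered = sorted(samples)
    return [ordered[(i * len(ordered)) // num_reducers] for i in range(1, num_reducers)]


def partition_by_range(records: List[int], split_points: List[int]) -> List[List[int]]:
    """Partition stage: route each record to the bucket for its key range."""
    buckets: List[List[int]] = [[] for _ in range(len(split_points) + 1)]
    for record in records:
        buckets[bisect.bisect_right(split_points, record)].append(record)
    return buckets


def distributed_sort(inputs: List[List[int]], num_reducers: int,
                     samples_per_input: int = 4, seed: int = 0) -> List[List[int]]:
    """Preprocess (sample), compute ranges, partition, then sort each bucket."""
    rng = random.Random(seed)
    samples = [s for part in inputs
               for s in rng.sample(part, min(samples_per_input, len(part)))]
    split_points = compute_ranges(samples, num_reducers)
    buckets: List[List[int]] = [[] for _ in range(num_reducers)]
    for part in inputs:
        for i, chunk in enumerate(partition_by_range(part, split_points)):
            buckets[i].extend(chunk)
    # Aggregate stage: each bucket is sorted independently.
    return [sorted(bucket) for bucket in buckets]


data = [[5, 3, 9, 1], [8, 2, 7, 4, 6, 0]]
sorted_buckets = distributed_sort(data, num_reducers=3)
print(sorted_buckets)
```

Because the buckets cover disjoint, ordered key ranges, concatenating them in order yields a globally sorted result without a final merge.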

Page 13: Apache Tez : Accelerating Hadoop Query Processing

Tez – Empowering End Users

• Flexible Input-Processor-Output runtime model
– Construct physical runtime executors dynamically by connecting different inputs, processors and outputs.
– End goal is to have a library of inputs, outputs and processors that can be programmatically composed to generate useful tasks.

[Figure: three example tasks. Mapper: HDFSInput, MapProcessor, FileSortedOutput. Reducer: ShuffleInput, ReduceProcessor, HDFSOutput. PairwiseJoin: Input1 and Input2, JoinProcessor, FileSortedOutput.]
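Composing a task from an input, a processor and an output can be sketched with plain functions. Every name below is a hypothetical stand-in (in-memory lists play the role of HDFS and shuffle data), so this shows only the composition idea, not the Tez runtime interfaces.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, int]


def make_task(read: Callable[[], Iterable], process: Callable[[Iterable], Iterable],
              write: Callable[[Iterable], object]) -> Callable[[], object]:
    """Build a runnable task by wiring an input, a processor and an output."""
    def run() -> object:
        return write(process(read()))
    return run


def lines_input(lines: List[str]) -> Callable[[], Iterator[str]]:
    """Stand-in for a file input: yields the given lines."""
    return lambda: iter(lines)


def word_count_map(records: Iterable[str]) -> Iterator[Pair]:
    """Map-style processor: emit (word, 1) for every word."""
    for line in records:
        for word in line.split():
            yield (word, 1)


def sorted_output(pairs: Iterable[Pair]) -> List[Pair]:
    """Stand-in for a sorted output: returns the pairs ordered by key."""
    return sorted(pairs)


def sum_reduce(pairs: Iterable[Pair]) -> Iterable[Pair]:
    """Reduce-style processor: sum the values of each key."""
    totals: Dict[str, int] = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals.items()


mapper = make_task(lines_input(["b a", "a"]), word_count_map, sorted_output)
shuffled = mapper()
reducer = make_task(lambda: shuffled, sum_reduce, dict)
result = reducer()
print(shuffled, result)
```

The mapper and reducer differ only in which three pieces are plugged together, which is the point of keeping inputs, processors and outputs separate.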

Page 14: Apache Tez : Accelerating Hadoop Query Processing

Tez – Empowering End Users

• Data type agnostic
– Tez is only concerned with the movement of data. Files and streams of bytes.
– Does not impose any data format on the user application. MR application can use Key-Value pairs on top of Tez. Hive and Pig can use tuple oriented formats that are natural and native to them.

[Figure: a Tez Task moves bytes in and out as files or streams; user code interprets them as key-value pairs or tuples.]
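The separation between byte movement and data format can be illustrated with two applications sharing one transport. The encodings here (a length-prefixed key-value record and a JSON-encoded tuple) are arbitrary choices made for the sketch, and `move_bytes` is a hypothetical stand-in for the framework.

```python
import json
import struct
from typing import Tuple


def move_bytes(payload: bytes) -> bytes:
    """Stand-in for the framework: transports bytes without interpreting them."""
    return bytes(payload)


def encode_kv(key: str, value: int) -> bytes:
    """Key-value application: length-prefixed UTF-8 key plus a 64-bit value."""
    raw_key = key.encode("utf-8")
    return struct.pack(">I", len(raw_key)) + raw_key + struct.pack(">q", value)


def decode_kv(data: bytes) -> Tuple[str, int]:
    (key_len,) = struct.unpack_from(">I", data, 0)
    key = data[4:4 + key_len].decode("utf-8")
    (value,) = struct.unpack_from(">q", data, 4 + key_len)
    return key, value


def encode_tuple(row: tuple) -> bytes:
    """Tuple-oriented application: one JSON array per row."""
    return json.dumps(list(row)).encode("utf-8")


def decode_tuple(data: bytes) -> tuple:
    return tuple(json.loads(data.decode("utf-8")))


kv = decode_kv(move_bytes(encode_kv("state", 42)))
row = decode_tuple(move_bytes(encode_tuple(("CA", 3, 9.5))))
print(kv, row)
```

The transport function never inspects the payload, so each application is free to pick the format that suits it.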

Page 15: Apache Tez : Accelerating Hadoop Query Processing

Tez – Empowering End Users

• Simplifying deployment
– Tez is a completely client side application.
– No deployments to do. Simply upload to any accessible FileSystem and change local Tez configuration to point to that.
– Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
– Leverages YARN local resources.

[Figure: two Client Machines, each running a TezClient, point at different Tez libraries (Tez Lib 1, Tez Lib 2) stored on HDFS; Node Managers run the TezTasks.]

Page 16: Apache Tez : Accelerating Hadoop Query Processing

Tez – Empowering End Users

• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying usage

With powerful APIs come great responsibilities :)
Tez is a framework on which end user applications can be built

Page 17: Apache Tez : Accelerating Hadoop Query Processing

Tez – Execution Performance

• Performance gains over Map Reduce
• Optimal resource management
• Plan reconfiguration at runtime
• Dynamic physical data flow decisions

Page 18: Apache Tez : Accelerating Hadoop Query Processing

Tez – Execution Performance

• Performance gains over Map Reduce
– Eliminate replicated write barrier between successive computations.
– Eliminate job launch overhead of workflow jobs.
– Eliminate extra stage of map reads in every workflow job.
– Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.

[Figure: Pig/Hive - MR compared with Pig/Hive - Tez.]

Page 19: Apache Tez : Accelerating Hadoop Query Processing

Tez – Execution Performance

• Plan reconfiguration at runtime
– Dynamic runtime concurrency control based on data size, user operator resources, available cluster resources and locality.
– Advanced changes in dataflow graph structure.
– Progressive graph construction in concert with user optimizer.

[Figure: HDFS Blocks and YARN Resources inform the plan. Stage 1 runs 50 maps producing 100 partitions; Stage 2 is planned with 100 reducers and is cut to 10 reducers at runtime because there is only 10 GB of data.]
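The 100-to-10 reducer adjustment in the figure can be reproduced with a simple sizing rule. The rule and the 1 GB-per-reducer target are assumptions chosen so that the numbers match the slide; the transcript does not state how Tez actually picks the parallelism.

```python
import math

GB = 1024 ** 3


def choose_reducers(configured: int, observed_bytes: int,
                    target_bytes_per_reducer: int) -> int:
    """Shrink the planned reducer count to fit the data actually produced."""
    needed = max(1, math.ceil(observed_bytes / target_bytes_per_reducer))
    return min(configured, needed)


def partitions_per_reducer(num_partitions: int, reducers: int) -> int:
    """How many upstream partitions each reducer reads after the change."""
    return math.ceil(num_partitions / reducers)


reducers = choose_reducers(configured=100, observed_bytes=10 * GB,
                           target_bytes_per_reducer=1 * GB)
print(reducers, partitions_per_reducer(100, reducers))
```

With 10 GB observed, the rule keeps 10 reducers and each one reads 10 of the 100 partitions; the planned count is never exceeded when the data turns out to be large.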

Page 20: Apache Tez : Accelerating Hadoop Query Processing

Tez – Execution Performance

• Optimal resource management
– Reuse YARN containers to launch new tasks.
– Reuse YARN containers to enable shared objects across tasks.

[Figure: the Tez Application Master (in its own YARN Container) sends Start Task to a TezTask Host running in a YARN Container, receives Task Done, and sends Start Task again. TezTask1 and TezTask2 run in the same host and use its Shared Objects.]
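Container reuse with shared objects can be sketched as a long-lived host that runs tasks one after another and caches expensive objects between them. `ToyContainer` and the lookup-table example are hypothetical and only illustrate the caching effect, not the Tez object registry API.

```python
from typing import Callable, Dict, List


class ToyContainer:
    """Toy long-lived container that keeps shared objects across tasks."""

    def __init__(self) -> None:
        self.shared_objects: Dict[str, object] = {}
        self.tasks_run = 0

    def get_or_create(self, key: str, factory: Callable[[], object]) -> object:
        """Return the cached object, building it only on first use."""
        if key not in self.shared_objects:
            self.shared_objects[key] = factory()
        return self.shared_objects[key]

    def run(self, task: Callable[["ToyContainer"], object]) -> object:
        self.tasks_run += 1
        return task(self)


loads: List[int] = []  # records how often the expensive load happens


def load_lookup_table() -> Dict[str, str]:
    loads.append(1)
    return {"CA": "California", "NY": "New York"}


def make_lookup_task(code: str) -> Callable[[ToyContainer], object]:
    def task(container: ToyContainer) -> object:
        table = container.get_or_create("states", load_lookup_table)
        return table[code]
    return task


container = ToyContainer()
results = [container.run(make_lookup_task(code)) for code in ("CA", "NY")]
print(results, len(loads))
```

Two tasks run in the same container, yet the table is loaded once; with a fresh container per task it would be loaded twice.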

Page 21: Apache Tez : Accelerating Hadoop Query Processing

Tez – Execution Performance

• Dynamic physical data flow decisions
– Decide the type of physical byte movement and storage on the fly.
– Store intermediate data on distributed store, local store or in-memory.
– Transfer bytes via blocking files or streaming and the spectrum in between.

[Figure: decided at runtime. A producer with a small output hands data to the consumer in memory; otherwise the producer writes a local file that the consumer reads.]
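A runtime decision of this kind might look like the rule below. The threshold, the `must_survive_node_loss` flag and the `Store` names are all assumptions made for the sketch; the slide only states that the choice between in-memory, local and distributed storage is made on the fly.

```python
from enum import Enum


class Store(Enum):
    IN_MEMORY = "in-memory"
    LOCAL_FILE = "local file"
    DISTRIBUTED = "distributed store"


def choose_store(output_bytes: int, memory_budget_bytes: int,
                 must_survive_node_loss: bool = False) -> Store:
    """Pick where a producer's intermediate output should live."""
    if must_survive_node_loss:
        return Store.DISTRIBUTED  # assumed reason to pay for replication
    if output_bytes <= memory_budget_bytes:
        return Store.IN_MEMORY  # small output: hand it over directly
    return Store.LOCAL_FILE  # larger output: spill to local disk


MB = 1024 ** 2
print(choose_store(4 * MB, 64 * MB).value, choose_store(512 * MB, 64 * MB).value)
```

The decision uses the output size observed at runtime, so the same logical edge can be served in memory for one query and from a local file for another.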

Page 22: Apache Tez : Accelerating Hadoop Query Processing

Tez – Sessions

• Key for interactive queries
• Analogous to database sessions and represents a connection between the user and the cluster
• Run multiple DAGs / queries in the same session
• Maintains a pool of reusable containers for low latency execution of tasks within and across queries
• Takes care of data locality and releasing resources when idle
• Session cache in the Application Master and in the container pool reduce re-computation and re-initialization

[Figure: a Client issues Start Session and Submit DAG to the Application Master. Other labels: Task Scheduler, Container Pool, Pre-Warmed JVM, Shared Object Registry.]

Page 23: Apache Tez : Accelerating Hadoop Query Processing

Tez – Benchmark Performance

Significant (but not all) speed-ups due to Tez:
• DAG support and runtime graph re-configuration enable utilizing the parallelism of the cluster
• Tez Session and container re-use enable efficient and low latency execution

Page 24: Apache Tez : Accelerating Hadoop Query Processing

Tez – Performance Analysis

TPC-DS – Query 27 with Hive on Tez

• Tez Session populates container pool
• Dimension table calculation and HDFS split generation in parallel
• Dimension tables broadcasted to Hive MapJoin tasks
• Final Reducer pre-launched and fetches completed inputs

Page 25: Apache Tez : Accelerating Hadoop Query Processing

Tez – Current status

• Apache Incubator Project
– Rapid development. Over 600 jiras opened. Over 400 resolved.
– Growing community of contributors and users.
• Focus on stability
– Testing and quality are highest priority.
– Code ready and deployed on multi-node environments.
• Support for a vast topology of DAGs
– Already functionally equivalent to Map Reduce. Existing Map Reduce jobs can be executed on Tez with few or no changes.
– Hive re-targeted to use Tez for execution of queries (HIVE-4660).
– Work started on Pig to use Tez for execution of scripts (PIG-3446).

Page 26: Apache Tez : Accelerating Hadoop Query Processing

Tez – Roadmap

• Richer DAG support
– Support for co-scheduling and streaming
– Better fault tolerance with checkpoints
• Performance optimizations
– More efficiencies in transfer of data
– Improve session performance
• Usability
– Stability and testability
– Recovery and history
– Tools for performance analysis and debugging

Page 27: Apache Tez : Accelerating Hadoop Query Processing

Tez – Key Takeaways

• Distributed execution framework that works on computations represented as dataflow graphs
• Naturally maps to execution plans produced by query optimizers
• Customizable execution architecture designed to enable dynamic performance optimizations at runtime
• Works out of the box with the platform figuring out the hard stuff
• Spans the spectrum from interactive latency to batch
• Open source Apache project: your use-cases and code are welcome
• It works and is already being used by Hive and Pig

Page 28: Apache Tez : Accelerating Hadoop Query Processing

Thank You!