YARN Ready: Integrating to YARN with Tez

© Hortonworks Inc. 2013 Page 1 Apache Tez : Accelerating Hadoop Data Processing Bikas Saha @bikassaha

Upload: hortonworks

Post on 15-Jan-2015


DESCRIPTION

YARN Ready webinar series helps developers integrate their applications to YARN. Tez is one vehicle to do that. We take a deep dive including code review to help you get started.

TRANSCRIPT

Page 1: YARN Ready: Integrating to YARN with Tez

© Hortonworks Inc. 2013 Page 1

Apache Tez : Accelerating Hadoop Data Processing

Bikas Saha @bikassaha

Page 2: YARN Ready: Integrating to YARN with Tez

Tez – Introduction

• Distributed execution framework targeted towards data-processing applications.
• Based on expressing a computation as a dataflow graph.
• Highly customizable to meet a broad spectrum of use cases.
• Built on top of YARN – the resource management framework for Hadoop.
• Open source Apache project and Apache licensed.

Page 3: YARN Ready: Integrating to YARN with Tez

Tez – Hadoop 1 ---> Hadoop 2

Monolithic
• Resource Management – MR
• Execution Engine – MR

Layered
• Resource Management – YARN
• Engines – Hive, Pig, Cascading, Your App!

Page 4: YARN Ready: Integrating to YARN with Tez

Tez – Empowering Applications

• Tez solves hard problems of running on a distributed Hadoop environment
• Apps can focus on solving their domain specific problems
• This design is important to be a platform for a variety of applications

App
• Custom application logic
• Custom data format
• Custom data transfer technology

Tez
• Distributed parallel execution
• Negotiating resources from the Hadoop framework
• Fault tolerance and recovery
• Horizontal scalability
• Resource elasticity
• Shared library of ready-to-use components
• Built-in performance optimizations
• Security

Page 5: YARN Ready: Integrating to YARN with Tez

Tez – End User Benefits

• Better performance of applications
  • Built-in performance + application-defined optimizations
• Better predictability of results
  • Minimization of overheads and queuing delays
• Better utilization of compute capacity
  • Efficient use of allocated resources
• Reduced load on distributed filesystem (HDFS)
  • Reduce unnecessary replicated writes
• Reduced network usage
  • Better locality and data transfer using new data patterns
• Higher application developer productivity
  • Focus on application business logic rather than Hadoop internals

Page 6: YARN Ready: Integrating to YARN with Tez


Tez – Design considerations

Don’t solve problems that have already been solved. Or else you will have to solve them again!

• Leverage discrete task based compute model for elasticity, scalability and fault tolerance

• Leverage several man years of work in Hadoop Map-Reduce data shuffling operations

• Leverage proven resource sharing and multi-tenancy model for Hadoop and YARN

• Leverage built-in security mechanisms in Hadoop for privacy and isolation


Look to the Future with an eye on the Past

Page 7: YARN Ready: Integrating to YARN with Tez

Tez – Problems that it addresses

• Expressing the computation
  • Direct and elegant representation of the data processing flow
  • Interfacing with application code and new technologies
• Performance
  • Late Binding : Make decisions as late as possible using real data available at runtime
  • Leverage the resources of the cluster efficiently
  • Just work out of the box!
  • Customizable engine to let applications tailor the job to meet their specific requirements
• Operation simplicity
  • Painless to operate, experiment and upgrade

Page 8: YARN Ready: Integrating to YARN with Tez

Tez – Simplifying Operations

• No deployments to do. No side effects. Easy and safe to try it out!
• Tez is a completely client side application.
• Simply upload to any accessible FileSystem and change local Tez configuration to point to that.
• Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
• Leverages YARN local resources.

[Figure: Client Machines running TezClient; NodeManagers running TezTask; HDFS holding Tez Lib 1 and Tez Lib 2]

Page 9: YARN Ready: Integrating to YARN with Tez

Tez – Expressing the computation

Distributed data processing jobs typically look like DAGs (Directed Acyclic Graphs).
• Vertices in the graph represent data transformations
• Edges represent data movement from producers to consumers

[Figure: Distributed Sort as a DAG with a Preprocessor Stage (Task-1, Task-2), a Sampler, a Partition Stage (Task-1, Task-2) and an Aggregate Stage (Task-1, Task-2); Samples and Ranges flow between stages]
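To make the vertex and edge idea on this slide concrete, here is a minimal, self-contained Java sketch that records a DAG and derives an order in which its stages could run. It is not the Tez API: the class `DagSketch` and its methods are hypothetical, and the edges used for the distributed sort (Preprocessor to Sampler, Preprocessor to Partition, Sampler to Partition, Partition to Aggregate) are an assumption read off the stage names in the figure.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical DAG model for illustration only; not the Tez DAG API. */
final class DagSketch {
    // For each vertex, the vertices that consume its output.
    private final Map<String, List<String>> consumers = new LinkedHashMap<>();
    // For each vertex, how many producers feed it.
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    DagSketch addVertex(String name) {
        consumers.putIfAbsent(name, new ArrayList<>());
        inDegree.putIfAbsent(name, 0);
        return this;
    }

    /** Adds an edge that moves data from producer to consumer. */
    DagSketch addEdge(String producer, String consumer) {
        addVertex(producer);
        addVertex(consumer);
        consumers.get(producer).add(consumer);
        inDegree.merge(consumer, 1, Integer::sum);
        return this;
    }

    /** Returns the vertices so that every producer precedes its consumers (Kahn's algorithm). */
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : remaining.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String vertex = ready.remove();
            order.add(vertex);
            for (String consumer : consumers.get(vertex)) {
                // A consumer becomes runnable once all of its producers are ordered.
                if (remaining.merge(consumer, -1, Integer::sum) == 0) {
                    ready.add(consumer);
                }
            }
        }
        if (order.size() != consumers.size()) {
            throw new IllegalStateException("graph contains a cycle, so it is not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        DagSketch sort = new DagSketch()
                .addEdge("Preprocessor", "Sampler")    // assumed edge: samples
                .addEdge("Preprocessor", "Partition")  // assumed edge: data to partition
                .addEdge("Sampler", "Partition")       // assumed edge: ranges
                .addEdge("Partition", "Aggregate");
        System.out.println(sort.executionOrder()); // [Preprocessor, Sampler, Partition, Aggregate]
    }
}
```

The cycle check is what makes the "acyclic" part of DAG explicit: a graph with a loop has no valid order and is rejected.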

Page 10: YARN Ready: Integrating to YARN with Tez

Tez – Expressing the computation

Tez provides the following APIs to define the processing
• DAG API
  • Defines the structure of the data processing and the relationship between producers and consumers
  • Enables definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical DAG at runtime
  • This is how the connection of tasks in the job gets specified
• Runtime API
  • Defines the interfaces through which the framework and app code interact with each other
  • App code transforms data and moves it between tasks
  • This is how we specify what actually executes in each task on the cluster nodes

Page 11: YARN Ready: Integrating to YARN with Tez


Tez – DAG API

// Define DAG
DAG dag = DAG.create();

// Define vertices (Map1 is the producer, Reduce1 the consumer)
Vertex Map1 = Vertex.create(Processor.class);
Vertex Reduce1 = Vertex.create(Processor.class);

// Define Edge
Edge edge = Edge.create(Map1, Reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, Output.class, Input.class);

// Connect them
dag.addVertex(Map1).addVertex(Reduce1).addEdge(edge)…

Defines the global processing flow

[Figure: example DAG with vertices Map1, Map2, Reduce1, Reduce2 and Join; edges labeled Scatter Gather]

Page 12: YARN Ready: Integrating to YARN with Tez


Tez – DAG API

Data movement – Defines routing of data between tasks

• One-To-One : Data from the ith producer task routes to the ith consumer task.

• Broadcast : Data from a producer task routes to all consumer tasks.

• Scatter-Gather : Producer tasks scatter data into shards and consumer tasks gather the data. The ith shard from all producer tasks routes to the ith consumer task.


Edge properties define the connection between producer and consumer tasks in the DAG
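The three data movement types above can be illustrated with a small, self-contained Java sketch that routes shards from producer tasks to consumer tasks. This is a model of the routing rules only, under the assumption that each producer's output is a list of shards; `DataMovementSketch`, `Movement` and `route` are hypothetical names and are not part of the Tez API.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model of the three data movement types; not the Tez API. */
final class DataMovementSketch {
    enum Movement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }

    /**
     * shardsByProducer.get(p).get(s) is shard s written by producer task p.
     * Returns, for each consumer task, the shards it receives.
     */
    static List<List<String>> route(Movement movement,
                                    List<List<String>> shardsByProducer,
                                    int numConsumers) {
        List<List<String>> received = new ArrayList<>();
        for (int c = 0; c < numConsumers; c++) {
            received.add(new ArrayList<>());
        }
        for (int p = 0; p < shardsByProducer.size(); p++) {
            List<String> shards = shardsByProducer.get(p);
            switch (movement) {
                case ONE_TO_ONE:
                    // The ith producer feeds only the ith consumer, so task counts must match.
                    if (shardsByProducer.size() != numConsumers) {
                        throw new IllegalArgumentException("one-to-one needs equal task counts");
                    }
                    received.get(p).addAll(shards);
                    break;
                case BROADCAST:
                    // Every consumer sees the full output of every producer.
                    for (List<String> consumer : received) {
                        consumer.addAll(shards);
                    }
                    break;
                case SCATTER_GATHER:
                    // The ith shard of every producer goes to the ith consumer.
                    if (shards.size() != numConsumers) {
                        throw new IllegalArgumentException("need one shard per consumer");
                    }
                    for (int s = 0; s < shards.size(); s++) {
                        received.get(s).add(shards.get(s));
                    }
                    break;
            }
        }
        return received;
    }

    public static void main(String[] args) {
        List<List<String>> shards = List.of(List.of("a0", "a1"), List.of("b0", "b1"));
        for (Movement movement : Movement.values()) {
            System.out.println(movement + " -> " + route(movement, shards, 2));
        }
    }
}
```

With two producers and two consumers, scatter-gather delivers `a0` and `b0` to the first consumer and `a1` and `b1` to the second, which is the shuffle pattern familiar from MapReduce.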

Page 13: YARN Ready: Integrating to YARN with Tez

Tez – Logical DAG expansion at Runtime

[Figure: the logical DAG with vertices Map1, Map2, Reduce1, Reduce2 and Join, expanded at runtime]
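One way to picture the expansion of a logical DAG is that each logical vertex becomes several tasks and each logical edge becomes task-to-task connections. The Java sketch below shows this for a single scatter-gather edge. It is only an illustration: the class `DagExpansionSketch`, the task naming scheme and the parallelism values are assumptions, not how Tez represents tasks internally.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of logical-to-physical expansion; not Tez internals. */
final class DagExpansionSketch {
    /** Names the tasks of one logical vertex, for example Map1[0] and Map1[1]. */
    static List<String> expandVertex(String vertexName, int parallelism) {
        List<String> tasks = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            tasks.add(vertexName + "[" + i + "]");
        }
        return tasks;
    }

    /** Expands a scatter-gather edge: every producer task connects to every consumer task. */
    static List<String> expandScatterGather(List<String> producers, List<String> consumers) {
        List<String> links = new ArrayList<>();
        for (String producer : producers) {
            for (String consumer : consumers) {
                links.add(producer + " -> " + consumer);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        List<String> mapTasks = expandVertex("Map1", 2);       // parallelism chosen for the example
        List<String> reduceTasks = expandVertex("Reduce1", 3); // parallelism chosen for the example
        for (String link : expandScatterGather(mapTasks, reduceTasks)) {
            System.out.println(link);
        }
    }
}
```

A single logical edge between two vertices therefore stands for six connections in this example, which is why the application only describes the logical graph and leaves the expansion to the framework.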

Page 14: YARN Ready: Integrating to YARN with Tez

Tez – Runtime API

Flexible Inputs-Processor-Outputs Model
• Thin API layer to wrap around arbitrary application code
• Compose inputs, processor and outputs to execute arbitrary processing
• Applications decide logical data format and data transfer technology
• Customize for performance
• Built-in implementations for Hadoop 2.0 data services – HDFS and YARN ShuffleService. Built on the same API. Your impls are as first class as ours!
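The inputs-processor-outputs composition described on this slide can be sketched in a few lines of self-contained Java. The interfaces below (`Input`, `Output`, `Processor`) are deliberately simplified and hypothetical; the real Tez runtime interfaces have different names and signatures. The point is only that the processor holds the application logic while inputs and outputs can be swapped independently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical inputs-processor-outputs composition; the real Tez interfaces differ. */
final class TaskModelSketch {
    interface Input { List<String> read(); }
    interface Output { void write(String record); }
    interface Processor { void run(Input input, Output output); }

    /** Stand-in for a built-in input such as an HDFS reader. */
    static final class ListInput implements Input {
        private final List<String> records;
        ListInput(List<String> records) { this.records = records; }
        @Override public List<String> read() { return records; }
    }

    /** Stand-in for a built-in output; collects records in memory. */
    static final class ListOutput implements Output {
        final List<String> records = new ArrayList<>();
        @Override public void write(String record) { records.add(record); }
    }

    /** Application logic: the only part the application has to supply in this sketch. */
    static final Processor UPPERCASE = (input, output) -> {
        for (String record : input.read()) {
            output.write(record.toUpperCase(Locale.ROOT));
        }
    };

    /** Wires one input, one processor and one output into a task and runs it. */
    static void runTask(Input input, Processor processor, Output output) {
        processor.run(input, output);
    }

    public static void main(String[] args) {
        ListOutput out = new ListOutput();
        runTask(new ListInput(List.of("tez", "yarn")), UPPERCASE, out);
        System.out.println(out.records); // [TEZ, YARN]
    }
}
```

Replacing `ListInput` or `ListOutput` with another implementation of the same interface changes where data comes from or goes to without touching the processor.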

Page 15: YARN Ready: Integrating to YARN with Tez

Tez – Library of Inputs and Outputs

• Classical ‘Map’: HDFS Input → Map Processor → Sorted Output
• Classical ‘Reduce’: Shuffle Input → Reduce Processor → HDFS Output
• Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input → Reduce Processor → Sorted Output

• What is built in?
  – Hadoop InputFormat/OutputFormat
  – SortedGroupedPartitioned Key-Value Input/Output
  – UnsortedGroupedPartitioned Key-Value Input/Output
  – Key-Value Input/Output

Page 16: YARN Ready: Integrating to YARN with Tez

Tez – Composable Task Model

• Adopt: Hive Processor with HDFS Input and Remote File Server Input; HDFS Output and Local Disk Output
• Evolve: Your Processor with HDFS Input and Remote File Server Input; HDFS Output and Local Disk Output
• Optimize: Your Processor with RDMA Input and Native DB Input; Kafka Pub-Sub Output and Amazon S3 Output

Page 17: YARN Ready: Integrating to YARN with Tez

Tez – Performance

• Benefits of expressing the data processing as a DAG
  • Reducing overheads and queuing effects
  • Gives system the global picture for better planning
• Efficient use of resources
  • Re-use resources to maximize utilization
  • Pre-launch, pre-warm and cache
  • Locality & resource aware scheduling
• Support for application defined DAG modifications at runtime for optimized execution
  • Change task concurrency
  • Change task scheduling
  • Change DAG edges
  • Change DAG vertices (TBD)

Page 18: YARN Ready: Integrating to YARN with Tez

Tez – Benefits of DAG execution

Faster Execution and Higher Predictability
• Eliminate replicated write barrier between successive computations.
• Eliminate job launch overhead of workflow jobs.
• Eliminate extra stage of map reads in every workflow job.
• Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.
• Better locality because the engine has the global picture

[Figure: Pig/Hive on MR compared with Pig/Hive on Tez]

Page 19: YARN Ready: Integrating to YARN with Tez

Tez – Container Re-Use

• Reuse YARN containers/JVMs to launch new tasks
• Reduce scheduling and launching delays
• Shared in-memory data across tasks
• JVM JIT friendly execution

[Figure: the Tez Application Master, in its own YARN Container, exchanges Start Task and Task Done messages with a TezTask Host in a YARN Container / JVM that runs TezTask1 and TezTask2 and holds Shared Objects]
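A rough way to see the effect of container reuse is to count container launches with and without it. The self-contained Java sketch below does this for a fixed pool size. It is a simplified model, not Tez scheduling logic: the class `ContainerReuseSketch` is hypothetical, tasks are assumed to run in waves of at most `maxContainers`, and held containers are assumed never to be released. The figures in `main` are the ones quoted later in the deck for TPC-DS Query 4 (8374 tasks on 50 containers).

```java
/** Simplified launch-count model for container reuse; not the Tez scheduler. */
final class ContainerReuseSketch {
    /**
     * Counts how many containers must be launched to run the given number of tasks
     * when at most maxContainers can run at the same time.
     */
    static int containerLaunches(int tasks, int maxContainers, boolean reuse) {
        if (tasks < 0 || maxContainers <= 0) {
            throw new IllegalArgumentException("tasks must be >= 0 and maxContainers > 0");
        }
        int launches = 0;
        int held = 0;          // containers already launched and kept for reuse
        int remaining = tasks;
        while (remaining > 0) {
            int wave = Math.min(remaining, maxContainers);
            if (reuse) {
                // Only launch what the held containers cannot cover.
                int extra = Math.max(0, wave - held);
                launches += extra;
                held += extra;
            } else {
                // Every task pays for a fresh container and JVM.
                launches += wave;
            }
            remaining -= wave;
        }
        return launches;
    }

    public static void main(String[] args) {
        System.out.println(containerLaunches(8374, 50, false)); // 8374
        System.out.println(containerLaunches(8374, 50, true));  // 50
    }
}
```

Under these assumptions reuse replaces thousands of container and JVM startups with fifty, which is where the reduced scheduling and launching delay comes from.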

Page 20: YARN Ready: Integrating to YARN with Tez

Tez – Sessions

• Standard concepts of pre-launch and pre-warm applied
• Key for interactive queries
• Represents a connection between the user and the cluster
• Multiple DAGs executed in the same session
• Containers re-used across queries
• Takes care of data locality and releasing resources when idle

[Figure: the Client sends Start Session and Submit DAG to the Application Master, which contains a Task Scheduler, a Container Pool, a Shared Object Registry and a Pre Warmed JVM]

Page 21: YARN Ready: Integrating to YARN with Tez

Tez – Current status

• Apache Project
  – Rapid development. Over 1100 jiras opened. Over 800 resolved
  – Growing community of contributors and users
• 0.5 being released. In voting for release
  – Developer release focused on ease of development
  – Stable API
  – Better debugging
  – Documentation and code samples
• Support for a vast topology of DAGs
• Being used by multiple applications such as Apache Hive, Apache Pig, Cascading, Scalding

Page 22: YARN Ready: Integrating to YARN with Tez

Tez – Adoption Path

Pre-requisite : Hadoop 2 with YARN
Tez has zero deployment pain. No side effects or traces left behind on your cluster. Low risk and low effort to try out.
• Using Hive, Pig, Cascading, Scalding
  • Try them with Tez as execution engine
• Already have MapReduce based pipeline
  • Use configuration to change MapReduce to run on Tez by setting ‘mapreduce.framework.name’ to ‘yarn-tez’ in mapred-site.xml
  • Consolidate MR workflow into MR-DAG on Tez
  • Change MR-DAG to use more efficient Tez constructs
• Have custom pipeline
  • Wrap custom code/scripts into Tez inputs-processor-outputs
  • Translate your custom pipeline topology into Tez DAG
  • Change custom topology to use more efficient Tez constructs

Page 23: YARN Ready: Integrating to YARN with Tez

Tez – Theory to Practice

• Performance
• Scalability

Page 24: YARN Ready: Integrating to YARN with Tez


Tez – Hive TPC-DS Scale 200GB latency

Page 25: YARN Ready: Integrating to YARN with Tez

Tez – Pig performance gains

• Demonstrates performance gains from a basic translation to a Tez DAG

[Figure: bar chart comparing MR and Tez; y-axis: Time in mins, 0 to 160]

Page 26: YARN Ready: Integrating to YARN with Tez

Tez – Observations on Performance

• Number of stages in the DAG
  • The higher the number of stages in the DAG, the better the performance of Tez (over MR).
• Cluster/queue capacity
  • The more congested a queue is, the better the performance of Tez (over MR), due to container reuse.
• Size of intermediate output
  • The larger the intermediate output, the better the performance of Tez (over MR), due to reduced HDFS usage.
• Size of data in the job
  • For smaller data and more stages, the performance of Tez (over MR) will be better, as launch overhead is a higher percentage of the total time for smaller jobs.
• Offload work to the cluster
  • Move as much work as possible to the cluster by modelling it via the job DAG. Exploit the parallelism and resources of the cluster. E.g. MR split calculation.
• Vertex caching
  • The more re-computation can be avoided, the better the performance.
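The observation about launch overhead and job size can be made concrete with a little arithmetic. The Java sketch below computes the share of total time spent on launch overhead when every stage is launched as a separate job. Everything here is an assumption for illustration: the class `LaunchOverheadSketch`, the fixed per-job launch cost and the numbers in `main` are invented, not measurements from the deck.

```java
/** Back-of-the-envelope model of launch overhead; all numbers are illustrative assumptions. */
final class LaunchOverheadSketch {
    /**
     * Fraction of total time spent on launch overhead when a pipeline runs as
     * `jobs` separate jobs, each paying the same fixed launch cost.
     */
    static double overheadFraction(int jobs, double launchSecondsPerJob, double computeSeconds) {
        double overhead = jobs * launchSecondsPerJob;
        return overhead / (overhead + computeSeconds);
    }

    public static void main(String[] args) {
        // Assumed: 4 stages, 15 s launch cost per job.
        System.out.println(overheadFraction(4, 15.0, 60.0));   // small job: half the time is overhead
        System.out.println(overheadFraction(4, 15.0, 6000.0)); // large job: overhead is about 1%
        // The same small pipeline launched once, as a single DAG would be.
        System.out.println(overheadFraction(1, 15.0, 60.0));
    }
}
```

Under this model, removing per-stage launches matters most when compute time is short and the stage count is high, which matches the slide's point about smaller data and more stages.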

Page 27: YARN Ready: Integrating to YARN with Tez

Tez – Data at scale

[Figure: Hive TPC-DS Scale 10TB]

Page 28: YARN Ready: Integrating to YARN with Tez

Tez – DAG definition at scale

[Figure: Hive : TPC-DS Query 64 Logical DAG]

Page 29: YARN Ready: Integrating to YARN with Tez

Tez – Container Reuse at Scale

• 78 vertices + 8374 tasks on 50 containers (TPC-DS Query 4)

Page 30: YARN Ready: Integrating to YARN with Tez

Tez – Bridging the Data Spectrum

[Figure: Typical pattern in a TPC-DS query. A Fact Table is joined in turn with Dimension Tables 1, 2 and 3, producing Result Tables 1, 2 and 3, using Broadcast Join and Shuffle Join steps]

[Figure: Broadcast join for small data sets. A Fact Table is joined with its Dimension Tables in a single Broadcast Join]

Based on data size, the query optimizer can run either plan as a single Tez job
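The idea that a query optimizer picks a plan from the data size can be sketched as a simple threshold rule. The Java below is a hypothetical illustration only: `JoinPlanSketch`, the byte threshold and the table sizes are assumptions, and a real optimizer such as Hive's weighs more than one number.

```java
/** Hypothetical size-based join choice; not the logic of any real query optimizer. */
final class JoinPlanSketch {
    enum JoinStrategy { BROADCAST_JOIN, SHUFFLE_JOIN }

    /**
     * Picks a broadcast join when the dimension table is small enough to send to
     * every task, and a shuffle join otherwise.
     */
    static JoinStrategy choose(long dimensionTableBytes, long broadcastThresholdBytes) {
        return dimensionTableBytes <= broadcastThresholdBytes
                ? JoinStrategy.BROADCAST_JOIN
                : JoinStrategy.SHUFFLE_JOIN;
    }

    public static void main(String[] args) {
        long threshold = 10L * 1024 * 1024; // assumed 10 MB cut-off for the example
        System.out.println(choose(2L * 1024 * 1024, threshold));        // BROADCAST_JOIN
        System.out.println(choose(5L * 1024 * 1024 * 1024, threshold)); // SHUFFLE_JOIN
    }
}
```

Either outcome maps onto an edge type in the DAG (broadcast or scatter-gather), so both plans can be expressed as one job.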

Page 31: YARN Ready: Integrating to YARN with Tez

Tez – Community

• Early adopters and code contributors welcome
  – Adopters to drive more scenarios. Contributors to make them happen.
• Tez meetup for developers and users
  – http://www.meetup.com/Apache-Tez-User-Group
• Technical blog series
  – http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing
• Useful links
  – Work tracking: https://issues.apache.org/jira/browse/TEZ
  – Code: https://github.com/apache/tez
  – Developer list: [email protected]
  – User list: [email protected]
  – Issues list: [email protected]

Page 32: YARN Ready: Integrating to YARN with Tez

Tez – Takeaways

• Distributed execution framework that works on computations represented as dataflow graphs
• Customizable execution architecture designed to enable dynamic performance optimizations at runtime
• Works out of the box with the platform figuring out the hard stuff
• Span the spectrum of interactive latency to batch, small to large
• Open source Apache project – your use-cases and code are welcome
• It works and is already being used by Hive and Pig

Page 33: YARN Ready: Integrating to YARN with Tez

Tez

Thanks for your time and attention!

Video with Deep Dive on Tez
http://goo.gl/BL67o7
http://www.infoq.com/presentations/apache-tez

Questions?

@bikassaha

Page 34: YARN Ready: Integrating to YARN with Tez

Next Steps

Page 35: YARN Ready: Integrating to YARN with Tez

4 Next Steps

1. Review YARN Resources

2. Review webinar recording

3. Attend Office Hours

4. Attend the next YARN webinar

Page 36: YARN Ready: Integrating to YARN with Tez

Resources

Set up HDP 2.1 environment

• Leverage Sandbox: Hortonworks.com/Sandbox

Get Started with YARN

• http://hortonworks.com/get-started/YARN

Video Deep Dive on Tez

• http://goo.gl/BL67o7 and http://www.infoq.com/presentations/apache-tez

Tez meetup for developers and users

• http://www.meetup.com/Apache-Tez-User-Group

Technical blog series

http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing

Useful links

Work tracking: https://issues.apache.org/jira/browse/TEZ

Code: https://github.com/apache/tez

Developer list: [email protected]

User list: [email protected]

Page 37: YARN Ready: Integrating to YARN with Tez

Hortonworks Office Hours

YARN Office Hours
Dial in and chat with YARN experts

We plan Office Hours for September 11th and October 9th @ 10am PT (2nd Thursdays)
Invitations will go out to those that attended or reviewed YARN webinars

These will also be posted to hortonworks.com/webinars
Registration required.

Page 38: YARN Ready: Integrating to YARN with Tez

YARN Ready Webinar Schedule

Timeline
• Introduction to YARN Ready
• Native Integration
• Slider
• Office Hours: August 14
• Tez: August 21
• Ambari: Sept. 4
• Office Hours: Sept. 11
• Scalding: Sept. 18
• Spark: Oct. 2
• Office Hours: Oct. 9

Legend: Upcoming Webinars, Office Hours, Recorded Webinars

Sign up here for the Series, Individual webinar or office hours:
http://info.hortonworks.com/YarnReady-BigData-Webcast-Series.html

You can also visit http://hortonworks.com/webinars/#library

Page 39: YARN Ready: Integrating to YARN with Tez

Thank you!