Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)

Post on 12-Apr-2017

Introduction to YARN and Apex as YARN Application

Priyanka Gugale (priyag@apache.org)

September 30th 2016

Apache Apex - Stream Processing

Easily Operable - Exposes an easy API for developing Operators (part of an application) and Applications

Highly Scalable - Scales statically as well as dynamically

Highly Performant - Can reach single digit millisecond end-to-end latency

Fault Tolerant - Automatically recovers from failures - without manual intervention

Stateful - Guarantees that no state will be lost

Apex Malhar library

YARN Native - Uses the Hadoop YARN framework for resource negotiation

Apex Platform Overview


An Apex Application is a DAG (Directed Acyclic Graph)

A DAG is composed of vertices (Operators) and edges (Streams).

A Stream is a sequence of data tuples that connects operators at end-points called Ports.

An Operator takes one or more input streams, performs computations, and emits one or more output streams.

● Each operator is the user's business logic or a built-in operator from our open source library
● An operator may have multiple instances that run in parallel

DAG Components

• Tuple
● Atomic data that flows over a stream

• Operator
● Basic compute unit per tuple

• Stream
● Connector abstraction between operators
● Tuples flow over this
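The tuple / operator / stream model above can be sketched in a few lines of plain Python. This is an illustration only, not the Apex Java API: the `Operator` and `Stream` classes and the `run` helper are hypothetical names invented here, and the single-threaded draining loop is an assumption made to keep the example short.

```python
from collections import deque


class Operator:
    """Hypothetical compute unit: applies a function to each incoming tuple."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn          # business logic applied per tuple
        self.outputs = []     # streams attached to this operator's output port

    def process(self, tup):
        result = self.fn(tup)
        # Returning None means "emit nothing" for this tuple.
        if result is not None:
            for stream in self.outputs:
                stream.put(result)


class Stream:
    """Hypothetical connector: an ordered queue of tuples between two operators."""

    def __init__(self, source, sink):
        self.sink = sink
        self.queue = deque()
        source.outputs.append(self)

    def put(self, tup):
        self.queue.append(tup)


def run(streams):
    """Drain the streams repeatedly until no tuples are in flight."""
    progressed = True
    while progressed:
        progressed = False
        for stream in streams:
            while stream.queue:
                stream.sink.process(stream.queue.popleft())
                progressed = True


collected = []
parse = Operator("parse", lambda t: int(t))
double = Operator("double", lambda t: t * 2)
sink = Operator("sink", lambda t: collected.append(t))

s1 = Stream(parse, double)   # edge: parse -> double
s2 = Stream(double, sink)    # edge: double -> sink

for raw in ["1", "2", "3"]:
    parse.process(raw)
run([s1, s2])
```

Each tuple is handled atomically by one operator at a time, and the streams are the only coupling between operators, which is what lets a real platform place operators in different processes.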

[Figure: Operator1 → Stream carrying tuple 1, tuple 2, tuple 3 → Operator2]

How is Apex YARN Native?

Introducing YARN

● YARN - Yet Another Resource Negotiator
● A framework that facilitates writing arbitrary distributed processing frameworks and applications
● YARN applications/frameworks: e.g. MapReduce2, Apache Spark, Apache Giraph, Apache Apex, etc.

Introducing YARN

MapReduce 1 ≈ YARN

● Job Tracker ≈ Resource Manager, Application Master, Timeline Server
● Task Tracker ≈ Node Manager
● Map Slot, Reduce Slot (replaced in YARN by generic containers)

Hadoop beyond Batch

YARN for better resource utilization

More applications than MapReduce

• Resource Manager
● Manages and allocates cluster resources
● Application scheduling
● Applications Manager

• Node Manager
● Per-machine agent
● Manages life-cycle of container
● Monitors resources

• Application Master
● Per-application
● Manages application scheduling and task execution
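The split of responsibilities above can be sketched as a toy model. This is not the Hadoop API: the class names mirror the YARN roles, but every method, the memory-only resource model, and the first-fit placement policy are assumptions made for illustration (the real YARN schedulers are considerably more involved).

```python
class NodeManager:
    """Hypothetical per-machine agent that manages the containers it runs."""

    def __init__(self, host, memory_mb):
        self.host = host
        self.memory_mb = memory_mb
        self.running = []   # ids of containers whose life cycle this node manages

    def start_container(self, container_id):
        self.running.append(container_id)


class ResourceManager:
    """Hypothetical cluster-wide allocator using first-fit over node memory."""

    def __init__(self, nodes):
        self.nodes = {n.host: n for n in nodes}
        self.free_mb = {n.host: n.memory_mb for n in nodes}
        self.next_id = 1

    def allocate(self, memory_mb):
        for host, free in self.free_mb.items():
            if free >= memory_mb:
                self.free_mb[host] = free - memory_mb
                container_id = f"container_{self.next_id}"
                self.next_id += 1
                return container_id, host
        return None  # no node can satisfy the request right now


node1, node2 = NodeManager("node1", 2048), NodeManager("node2", 4096)
rm = ResourceManager([node1, node2])

first = rm.allocate(1536)    # fits on node1
second = rm.allocate(1024)   # node1 has only 512 MB left, so node2 is used
third = rm.allocate(8192)    # larger than any node's free memory

# In this sketch the per-application master asks the owning node to start
# each container it was granted.
for container_id, host in (first, second):
    rm.nodes[host].start_container(container_id)
```

The point of the sketch is the division of labour: the allocator only decides where resources are granted, while starting and tracking the container is the node agent's job.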

Hadoop v2 (YARN) Architecture

[Figure: Clients, the ResourceManager, and NodeManagers hosting App Masters and containers (Cont). Arrow legend: MapReduce Status, Job Submission, Node Status, Resource Request]

Application Submission workflow

[Figure: a YarnClient, a node running the RM (ApplicationsManager + Scheduler), and nodes running NMs that host the Application Master and Containers. The figure's legend also marks Heartbeats.]

RM = Resource Manager, NM = Node Manager, AM = Application Master

1) Submit application
2) Launch Application Master
3) AM registers with RM
4) AM negotiates for containers
5) Launch Container (shown twice in the figure, once per container)
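The five steps can be traced with a small simulation that only records who talks to whom. The function name `submit_application`, the message strings, the choice of the first node for the AM, and the round-robin container placement are all assumptions for illustration; no Hadoop classes are used.

```python
def submit_application(app_name, num_containers, nodes):
    """Return the ordered list of messages for one simulated submission."""
    events = []
    # 1) The client submits the application to the ResourceManager.
    events.append(f"client -> RM: submit {app_name}")
    # 2) The RM has a NodeManager launch the Application Master
    #    (assumed here: always on the first node).
    am_node = nodes[0]
    events.append(f"RM -> NM({am_node}): launch AM")
    # 3) The AM registers with the RM.
    events.append("AM -> RM: register")
    # 4) The AM negotiates for containers.
    events.append(f"AM -> RM: request {num_containers} containers")
    # Assumed placement policy: round-robin over the nodes.
    granted = [nodes[i % len(nodes)] for i in range(num_containers)]
    # 5) A container is launched on each granted node.
    for node in granted:
        events.append(f"AM -> NM({node}): launch container")
    return events


events = submit_application("demo-app", 2, ["node1", "node2"])
for line in events:
    print(line)
```

Note that only step 1 involves the client; from step 3 onward the Application Master drives the conversation.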

Apex as YARN application

[Figure: a generic YARN application next to an Apex application. Generic: YarnClient, ResourceManager (AsM + Scheduler), a NodeManager (NM) on each Node, AppMaster and tasks in YarnContainers. Apex: the Apex CLI's StrAMClient acts as the YarnClient, StrAM is the AppMaster, and StrAMChild YarnContainers host operators O1, O2 and O3. Protocols: ClientRMProtocol (client to RM), AMRMProtocol (AM to RM), ContainerManagerProtocol (AM to NM)]

Apache Apex Meetup

Application Components of Apex - StrAMClient

• Part of the Apex client interface
• Invoked by the “launch” command of the Apex CLI
• Tasks:
● Copy the required application package files into HDFS
● Validate the logical plan
● Serialize the logical plan to HDFS
● Launch the Application Master, i.e. StrAM
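As a rough illustration of the "validate, then serialize" tasks, the sketch below checks one property a logical plan must have (being acyclic, since the application is a DAG) and then serializes it. The slides do not say which checks StrAMClient actually performs or which serialization format it uses, so the acyclicity check, the JSON format, and the names `validate_acyclic` and `prepare_launch` are assumptions.

```python
import json
from collections import deque


def validate_acyclic(operators, streams):
    """Kahn's algorithm: True if every operator can be topologically ordered."""
    indegree = {op: 0 for op in operators}
    downstream = {op: [] for op in operators}
    for source, sink in streams:
        downstream[source].append(sink)
        indegree[sink] += 1
    ready = deque(op for op, degree in indegree.items() if degree == 0)
    visited = 0
    while ready:
        op = ready.popleft()
        visited += 1
        for nxt in downstream[op]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    # Operators on a cycle never reach indegree 0, so they are never visited.
    return visited == len(operators)


def prepare_launch(operators, streams):
    """Validate the plan, then return its serialized form (JSON in this sketch)."""
    if not validate_acyclic(operators, streams):
        raise ValueError("logical plan contains a cycle")
    return json.dumps({"operators": operators, "streams": streams})


plan = prepare_launch(["O1", "O2", "O3"], [("O1", "O2"), ("O2", "O3")])
```

Validating on the client side means a malformed application is rejected before any cluster resources are requested.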


Application Components of Apex - StrAM

• Streaming Application Master
• Started by StrAMClient on a YarnContainer
• Tasks:
● Convert the logical plan to a physical plan
● Serialize operators to HDFS
● Request resources from the ResourceManager
● Start StrAMChild in YarnContainer(s)
● Monitor StrAMChild using the ContainerManager protocol
● Generate application statistics
● Host results on a web service (dtManage)
● Checkpoint/commit application state
● Fault tolerance
● Support security
● Shut down the application
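The first task, turning a logical plan into a physical plan, can be pictured as expanding each logical operator into its parallel instances and grouping those instances into containers. The sketch below is a simplified assumption of that idea, not Apex's planner: the function `to_physical_plan`, the `partitions` mapping, the `O2#0` naming, and the fixed-size packing policy are all hypothetical.

```python
def to_physical_plan(logical_ops, partitions, ops_per_container):
    """Expand logical operators into instances, then group them into containers."""
    instances = []
    for op in logical_ops:
        count = partitions.get(op, 1)   # default: a single instance
        for index in range(count):
            instances.append(f"{op}#{index}")
    # Assumed packing policy: fill containers in order, a fixed number of
    # operator instances per container.
    return [
        instances[start:start + ops_per_container]
        for start in range(0, len(instances), ops_per_container)
    ]


# O2 runs as two parallel instances; each container holds up to two instances.
containers = to_physical_plan(["O1", "O2", "O3"], {"O2": 2}, ops_per_container=2)
print(containers)
```

The number of inner lists is the number of containers the master would then have to request from the ResourceManager.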


Application Components of Apex - StrAMChild

• Deployed on a YarnContainer
• Started by the NodeManager as instructed by StrAM
• Instance of StreamingContainer
• Contains Operators (compute-related)
• Contains BufferServer (stream-related)
• Tasks:
● Regularly send heartbeat to StrAM
● Execute commands from StrAM
● Shut down or kill self if instructed
● Manage the lifecycle of an Operator
● Network communication using BufferServer
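The heartbeat relationship between a child container and its master can be sketched as a small tracker on the master side. The class `HeartbeatMonitor`, the 30-second timeout, and the use of explicit timestamps instead of a real clock are assumptions chosen to keep the example deterministic; the slides do not specify Apex's actual intervals or data structures.

```python
class HeartbeatMonitor:
    """Hypothetical master-side record of the last heartbeat seen per container."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def heartbeat(self, container_id, now_s):
        """Called whenever a child container reports in."""
        self.last_seen[container_id] = now_s

    def expired(self, now_s):
        """Containers whose last heartbeat is older than the timeout."""
        return sorted(
            cid for cid, seen in self.last_seen.items()
            if now_s - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=30)
monitor.heartbeat("child-1", now_s=0)
monitor.heartbeat("child-2", now_s=0)
monitor.heartbeat("child-1", now_s=25)   # child-2 stays silent
late = monitor.expired(now_s=40)
```

A master that finds a container in the expired list could treat it as failed and redeploy its operators elsewhere, which is one way the "automatically recovers from failures" property could be realized.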


Apex as YARN application

[Figure: Apex as a YARN application. The Apex CLI's StrAMClient (YarnClient) talks to the ResourceManager (AsM + Scheduler) over ClientRMProtocol; StrAM (AppMaster) talks to the ResourceManager over AMRMProtocol and to the NodeManagers (NM) over ContainerManagerProtocol; StrAMChild YarnContainers host operators O1, O2 and O3]

Summary – Apex platform

• Enables YARN to be used for Streaming Applications
• Takes care of YARN-specific work
• User can focus on business logic defined in Operators


Q&A


Resources


• http://apex.apache.org/
• Learn more: http://apex.apache.org/docs.html
• Subscribe - http://apex.apache.org/community.html
• Download - http://apex.apache.org/downloads.html
• Follow @ApacheApex - https://twitter.com/apacheapex
• Meetups – http://www.meetup.com/pro/apacheapex/
• More examples: https://github.com/DataTorrent/examples
• Slideshare: http://www.slideshare.net/ApacheApex/presentations
• https://www.youtube.com/results?search_query=apache+apex
• Free Enterprise License for Startups - https://www.datatorrent.com/product/startup-accelerator/
