Post on 12-Apr-2017
Introduction to YARN and Apex as YARN Application
Priyanka Gugale (priyag@apache.org)
September 30th 2016
Apache Apex - Stream Processing
• Easily Operable - Exposes an easy API for developing Operators (part of an application) and Applications
• Highly Scalable - Scales statically as well as dynamically
• Highly Performant - Can reach single-digit millisecond end-to-end latency
• Fault Tolerant - Automatically recovers from failures without manual intervention
• Stateful - Guarantees that no state will be lost
• Apex Malhar library
• YARN Native - Uses the Hadoop YARN framework for resource negotiation
Apex Platform Overview
An Apex Application is a DAG (Directed Acyclic Graph)
● A DAG is composed of vertices (Operators) and edges (Streams).
● A Stream is a sequence of data tuples that connects operators at end-points called Ports.
● An Operator takes one or more input streams, performs computations, and emits one or more output streams.
● Each operator is the user's business logic or a built-in operator from our open source library.
● An operator may have multiple instances that run in parallel.
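The DAG structure described above can be sketched as a small graph model. This is a toy illustration in Python, not the Apex API (which is Java); the class `Dag`, its methods, and the operator names are hypothetical and chosen only for this example.

```python
from collections import defaultdict, deque


class Dag:
    """Toy application DAG: operators are vertices, streams are edges."""

    def __init__(self):
        self.operators = set()
        self.streams = defaultdict(list)  # upstream operator -> downstream operators

    def add_operator(self, name):
        self.operators.add(name)

    def add_stream(self, source, sink):
        # A stream connects an output port of `source` to an input port of `sink`.
        if source not in self.operators or sink not in self.operators:
            raise ValueError("both ends of a stream must be known operators")
        self.streams[source].append(sink)

    def topological_order(self):
        """Return operators upstream-first; raise if the graph contains a cycle."""
        indegree = {op: 0 for op in self.operators}
        for sinks in self.streams.values():
            for sink in sinks:
                indegree[sink] += 1
        ready = deque(sorted(op for op, d in indegree.items() if d == 0))
        order = []
        while ready:
            op = ready.popleft()
            order.append(op)
            for sink in self.streams[op]:
                indegree[sink] -= 1
                if indegree[sink] == 0:
                    ready.append(sink)
        if len(order) != len(self.operators):
            raise ValueError("not a DAG: cycle detected")
        return order


dag = Dag()
for name in ("reader", "parser", "counter", "writer"):
    dag.add_operator(name)
dag.add_stream("reader", "parser")
dag.add_stream("parser", "counter")
dag.add_stream("counter", "writer")
order = dag.topological_order()
```

The ordering check is what makes the graph "acyclic" in practice: a stream that loops back to an upstream operator leaves some operators permanently waiting, and the sketch rejects it.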
DAG Components
• Tuple
● Atomic data that flows over a stream
• Operator
● Basic compute unit, invoked per tuple
• Stream
● Connector abstraction between operators
● Tuples flow over this
[Figure: Operator1 and Operator2 connected by a Stream carrying tuple 1, tuple 2 and tuple 3]
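To make the three components concrete, here is a minimal sketch of tuples flowing over a stream between two operators. It uses Python generators as a stand-in for the stream; the function names `operator1` and `operator2` echo the figure labels, and the word-splitting logic is an assumption made purely for illustration.

```python
def operator1(lines):
    """Upstream operator: emits one tuple per word onto its output stream."""
    for line in lines:
        for word in line.split():
            yield word  # each yielded value is one tuple


def operator2(stream):
    """Downstream operator: invoked once per incoming tuple, emits (word, length)."""
    for tup in stream:
        yield (tup, len(tup))


# The generator returned by operator1 plays the role of the stream
# that connects the two operators.
stream = operator1(["apex on yarn", "stream processing"])
results = list(operator2(stream))
```

Neither operator knows about the other; each only reads from or writes to the stream, which is the point of the connector abstraction.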
How is Apex YARN Native?
Introducing YARN
● YARN - Yet Another Resource Negotiator
● A framework that facilitates writing arbitrary distributed processing frameworks and applications
● YARN applications/frameworks: e.g. MapReduce2, Apache Spark, Apache Giraph, Apache Apex
Introducing YARN: MapReduce 1 vs. YARN
● Job Tracker ≈ Resource Manager + Application Master + Timeline Server
● Task Tracker ≈ Node Manager
● Map Slot / Reduce Slot ≈ Container
Hadoop beyond Batch
YARN for better resource utilization
More applications than MapReduce
• Resource Manager
● Manages and allocates cluster resources
● Application scheduling
● Applications Manager
• Node Manager
● Per-machine agent
● Manages the life-cycle of containers
● Monitors resources
• Application Master
● Per-application
● Manages application scheduling and task execution
Hadoop v2 (YARN) Architecture
[Figure: Clients submit jobs to the ResourceManager; each NodeManager hosts App Masters and Containers. Arrows in the figure: MapReduce Status, Job Submission, Node Status, Resource Request.]
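The division of labour between the Resource Manager and the Node Managers can be sketched as simple resource accounting. This is a toy model, not YARN's scheduler: the "most free memory first" policy, the memory figures, and the host names are all assumptions for illustration.

```python
class NodeManager:
    """Per-machine agent: tracks free memory and the containers it hosts."""

    def __init__(self, host, memory_mb):
        self.host = host
        self.free_mb = memory_mb
        self.containers = []

    def launch(self, container_id, memory_mb):
        self.free_mb -= memory_mb
        self.containers.append(container_id)


class ResourceManager:
    """Cluster-wide allocator: grants containers on nodes with enough free memory."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.next_id = 1

    def allocate(self, memory_mb):
        # Hypothetical policy: pick the node with the most free memory.
        node = max(self.nodes, key=lambda n: n.free_mb)
        if node.free_mb < memory_mb:
            return None  # the request cannot be satisfied right now
        container_id = f"container_{self.next_id}"
        self.next_id += 1
        node.launch(container_id, memory_mb)
        return container_id, node.host


rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 2048)])
grants = [rm.allocate(2048) for _ in range(4)]
```

With 6 GB in the cluster and 2 GB per request, the first three requests are granted and the fourth comes back as `None`, which is the signal an Application Master would have to wait on.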
Application Submission workflow
[Figure: A YarnClient, the RM (ApplicationsManager + Scheduler), and Nodes each running an NM; the Nodes host the Application Master and the Containers.]
Legend: RM = Resource Manager, NM = Node Manager, AM = Application Master. A separate legend symbol marks heartbeats.
1) Submit application
2) Launch Application Master
3) AM registers with RM
4) AM negotiates for containers
5) Launch Container
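The five steps can be traced with a small simulation that records each interaction in order. This is a sketch of the message sequence only; the classes below are hypothetical stand-ins and do not use the Hadoop client libraries.

```python
class ResourceManager:
    def __init__(self, log, node_manager):
        self.log = log
        self.node_manager = node_manager
        self.registered = set()

    def submit_application(self, app_name):
        self.log.append(f"1) submit {app_name}")
        # The RM asks a Node Manager to start the Application Master.
        return self.node_manager.launch_app_master(app_name)

    def register(self, app_master):
        self.log.append("3) AM registers with RM")
        self.registered.add(app_master.app_name)

    def negotiate(self, app_master, count):
        if app_master.app_name not in self.registered:
            raise RuntimeError("AM must register before requesting containers")
        self.log.append(f"4) AM negotiates for {count} containers")
        return [f"container_{i}" for i in range(1, count + 1)]


class NodeManager:
    def __init__(self, log):
        self.log = log

    def launch_app_master(self, app_name):
        self.log.append("2) launch Application Master")
        return ApplicationMaster(app_name)

    def launch_container(self, container_id):
        self.log.append(f"5) launch {container_id}")


class ApplicationMaster:
    def __init__(self, app_name):
        self.app_name = app_name

    def run(self, rm, nm, count):
        rm.register(self)
        for container_id in rm.negotiate(self, count):
            nm.launch_container(container_id)


log = []
nm = NodeManager(log)
rm = ResourceManager(log, nm)
am = rm.submit_application("demo-app")
am.run(rm, nm, count=2)
```

Step 5 repeats once per granted container, so an application that asks for two containers logs two launches.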
Apex as YARN application
[Figure: The Apex CLI (StrAMClient) uses a YarnClient to submit the application to the ResourceManager (AsM + Scheduler). NodeManagers (NM) on each Node host YarnContainers: one runs StrAM (the AppMaster) and the others run StrAMChild instances holding operators O1, O2 and O3. Protocols shown: ClientRMProtocol (client to ResourceManager), AMRMProtocol (AppMaster to ResourceManager), ContainerManagerProtocol (AppMaster to NodeManager).]
Apache Apex Meetup
Application Components of Apex - StrAMClient
• Part of the Apex client interface
• Invoked by the “launch” command of the Apex CLI
• Tasks:
● Copy the required application package files into HDFS
● Validate the logical plan
● Serialize the logical plan to HDFS
● Launch the Application Master, i.e. StrAM
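The client-side tasks listed above can be sketched as one function. This is an illustration only: a local temporary directory stands in for HDFS, JSON stands in for whatever serialization Apex actually uses, and the file name `logical-plan.json`, the plan layout, and the function names are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def validate(plan):
    """Reject plans whose streams reference operators that are not declared."""
    declared = set(plan["operators"])
    for stream in plan["streams"]:
        for end in (stream["source"], stream["sink"]):
            if end not in declared:
                raise ValueError(
                    f"stream {stream['name']} references unknown operator {end}"
                )


def launch(plan, package_files, storage_dir):
    """Mimic the client-side steps; storage_dir stands in for an HDFS directory."""
    storage = Path(storage_dir)
    for name, content in package_files.items():  # copy package files
        (storage / name).write_bytes(content)
    validate(plan)                                # validate the logical plan
    plan_path = storage / "logical-plan.json"     # hypothetical file name
    plan_path.write_text(json.dumps(plan))        # serialize the logical plan
    return plan_path  # a real client would now ask YARN to start the app master


plan = {
    "operators": ["O1", "O2", "O3"],
    "streams": [
        {"name": "s1", "source": "O1", "sink": "O2"},
        {"name": "s2", "source": "O2", "sink": "O3"},
    ],
}
with tempfile.TemporaryDirectory() as tmp:
    path = launch(plan, {"app.jar": b"placeholder bytes"}, tmp)
    restored = json.loads(path.read_text())
    copied = sorted(p.name for p in Path(tmp).iterdir())
```

Writing the plan to shared storage is what lets the Application Master, which starts later on some other machine, pick up exactly what the client validated.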
Application Components of Apex - StrAM
• Streaming Application Master
• Started by StrAMClient on a YarnContainer
• Tasks:
● Convert the logical plan to a physical plan
● Serialize operators to HDFS
● Request resources from the ResourceManager
● Start StrAMChild in YarnContainer(s)
● Monitor StrAMChild using the ContainerManager protocol
● Generate application statistics
● Host results on a WebService (dtManage)
● Checkpoint and commit application state
● Fault tolerance
● Support security
● Shut down the application
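The first task, converting a logical plan to a physical plan, amounts to expanding each logical operator into its parallel instances and deciding which container runs each one. The sketch below shows that idea under simplifying assumptions; the fixed "N operators per container" packing rule and the `name#index` labels are hypothetical and not how Apex actually places operators.

```python
def to_physical_plan(logical_ops, operators_per_container):
    """Expand logical operators into instances, then pack them into containers.

    logical_ops maps operator name -> number of parallel instances.
    """
    instances = [
        f"{name}#{i}"
        for name, partitions in logical_ops.items()
        for i in range(partitions)
    ]
    # Hypothetical packing policy: fill containers in declaration order.
    return [
        instances[i:i + operators_per_container]
        for i in range(0, len(instances), operators_per_container)
    ]


physical = to_physical_plan({"O1": 1, "O2": 2, "O3": 1}, operators_per_container=2)
```

The length of the result is the number of containers the master would then request from the ResourceManager.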
Application Components of Apex - StrAMChild
• Deployed on a YarnContainer
• Started by the NodeManager as instructed by StrAM
• Instance of StreamingContainer
• Contains Operators (compute-related)
• Contains BufferServer (stream-related)
• Tasks:
● Regularly send heartbeats to StrAM
● Execute commands from StrAM
● Shut down or kill itself if instructed
● Manage the lifecycle of an Operator
● Network communication using BufferServer
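The heartbeat and command tasks can be sketched as a pair of toy classes. This is not the Apex implementation: the command names, the dictionary-shaped heartbeat, the choice to return queued commands in the heartbeat response, and the explicit clock values are assumptions made to keep the example small and deterministic.

```python
class StramChild:
    """Toy worker container: reports heartbeats and obeys commands from the master."""

    def __init__(self, container_id):
        self.container_id = container_id
        self.running = True
        self.deployed = []

    def heartbeat(self, now):
        return {"container": self.container_id, "time": now,
                "operators": list(self.deployed)}

    def execute(self, command):
        kind, arg = command
        if kind == "deploy":
            self.deployed.append(arg)
        elif kind == "undeploy":
            self.deployed.remove(arg)
        elif kind == "shutdown":
            self.running = False


class Stram:
    """Toy master: tracks last-seen times and queues commands per container."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}
        self.pending = {}  # container id -> queued commands

    def on_heartbeat(self, beat):
        self.last_seen[beat["container"]] = beat["time"]
        # In this sketch, queued commands ride back on the heartbeat response.
        return self.pending.pop(beat["container"], [])

    def failed_containers(self, now):
        return [c for c, t in self.last_seen.items() if now - t > self.timeout]


master = Stram(timeout=30)
child = StramChild("container_2")
master.pending["container_2"] = [("deploy", "O1"), ("deploy", "O2")]
for command in master.on_heartbeat(child.heartbeat(now=0)):
    child.execute(command)
master.on_heartbeat(child.heartbeat(now=10))
```

A container whose last heartbeat is older than the timeout shows up in `failed_containers`, which is the hook a master would use to trigger recovery.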
Apex as YARN application
[Figure: The Apex CLI (StrAMClient) submits through a YarnClient to the ResourceManager (AsM + Scheduler) over ClientRMProtocol. StrAM (the AppMaster) talks to the ResourceManager over AMRMProtocol and to the NodeManagers (NM) over ContainerManagerProtocol. YarnContainers run StrAMChild instances holding operators O1, O2 and O3.]
Summary – Apex platform
• Enables YARN to be used for Streaming Applications
• Takes care of YARN-specific work
• User can focus on business logic defined in Operators
Q&A
Resources
• http://apex.apache.org/
• Learn more: http://apex.apache.org/docs.html
• Subscribe - http://apex.apache.org/community.html
• Download - http://apex.apache.org/downloads.html
• Follow @ApacheApex - https://twitter.com/apacheapex
• Meetups – http://www.meetup.com/pro/apacheapex/
• More examples: https://github.com/DataTorrent/examples
• Slideshare: http://www.slideshare.net/ApacheApex/presentations
• https://www.youtube.com/results?search_query=apache+apex
• Free Enterprise License for Startups - https://www.datatorrent.com/product/startup-accelerator/