Introduction to YARN
Chinmay Kolhatkar
Bhupesh Chawda
DataTorrent
Introduction to YARN: Next Gen Hadoop
Image Source: https://memegenerator.net/instance/64508420
Why YARN
Hadoop v1 (MR1) Architecture
● Job Tracker
○ Manages cluster resources
○ Job scheduling
○ Bottleneck
● Task Tracker
○ Per-node agent
○ Manages tasks
○ Map / Reduce task slots
[Diagram: MR1 architecture. Components: Client, JobTracker, TaskTracker, Task. Arrow labels: Job Submission, MapReduce Status.]
Limitations with MR1
• Scalability
Maximum cluster size: 4,000 nodes
Maximum concurrent tasks: 40,000
• Availability - Job Tracker is a single point of failure (SPOF)
• Resource Utilization - fixed Map / Reduce slots
• Runs only MapReduce applications
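The resource utilization point can be made concrete with a small calculation. The sketch below is a toy model, not Hadoop code: the function names and the slot counts (6 map slots, 4 reduce slots) are assumptions chosen for illustration. It compares typed slots, where a map task can only occupy a map slot, with a single pool in which any free unit of capacity can host any task.

```python
def run_with_typed_slots(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """MR1-style model: map tasks use only map slots, reduce tasks only reduce slots."""
    running_maps = min(map_slots, map_tasks)
    running_reduces = min(reduce_slots, reduce_tasks)
    return running_maps + running_reduces


def run_with_generic_pool(total_slots, map_tasks, reduce_tasks):
    """Pooled model: any free unit of capacity can host any pending task."""
    return min(total_slots, map_tasks + reduce_tasks)


# Hypothetical node with 10 units of capacity during the map-heavy phase of a job:
# 10 map tasks are pending and no reduce task is ready yet.
typed = run_with_typed_slots(map_slots=6, reduce_slots=4, map_tasks=10, reduce_tasks=0)
pooled = run_with_generic_pool(total_slots=10, map_tasks=10, reduce_tasks=0)

print(f"typed slots : {typed} of 10 busy")   # the 4 reduce slots sit idle
print(f"generic pool: {pooled} of 10 busy")
```

With these assumed numbers the typed-slot node runs 6 tasks while 4 reduce slots stay idle, whereas the pooled node runs all 10.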
Why YARN (Cont…)
Image Source: memegenerator.net
Introducing YARN
● YARN - Yet Another Resource Negotiator
● Framework that facilitates writing arbitrary distributed processing frameworks and applications.
● YARN applications/frameworks: e.g. MapReduce2, Apache Spark, Apache Giraph, Apache Apex etc.
Image Source: http://tm.durusau.net/?cat=1525
Hadoop beyond Batch
YARN for better resource utilization
More applications than MapReduce
Image Source: http://tm.durusau.net/?cat=1525
Comparing MapReduce with YARN
MapReduce ≈ YARN
Job Tracker ≈ Resource Manager + Application Master
Task Tracker ≈ Node Manager
Map Slot / Reduce Slot ≈ Container
Backward Compatibility Maintained!
● Existing MapReduce jobs run as-is on the YARN framework
● No Job Tracker or Task Tracker processes
Hadoop v2 (YARN) Architecture
• Resource Manager
Manages and allocates cluster resources
Application scheduling
Applications Manager
• Node Manager
Per-machine agent
Manages the life cycle of containers
Monitors resources
• Application Master
Per-application
Manages application scheduling and task execution
Image Source: hadoop.apache.org
Application Submission workflow
[Diagram: a YarnClient, the Resource Manager (Applications Manager + Scheduler) on one node, and Node Managers on other nodes hosting the Application Master and containers.]
Legend: RM = Resource Manager, NM = Node Manager, AM = Application Master; a separate line style denotes heartbeats.
1) Submit application
2) Launch Application Master
3) AM registers with RM
4) AM negotiates for containers
5) Launch container
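The five steps above can be walked through with a toy in-memory model. This is a sketch under stated assumptions, not the Hadoop API: the class names mirror the slide's components, but every method (`submit`, `allocate`, `register`, `launch`, `run`), the first-fit placement rule, and the memory figures are hypothetical. Heartbeats are left out.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-machine agent: tracks free memory and the containers it hosts."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch(self, label):
        self.containers.append(label)


class ResourceManager:
    """Toy RM: first-fit allocation over the known nodes (an assumed policy)."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.registered = []

    def allocate(self, mb):
        for node in self.nodes:
            if node.free_mb >= mb:
                node.free_mb -= mb
                return node
        return None  # no node has enough free memory

    def submit(self, app_name, am_mb):
        # Steps 1 and 2: accept the application, then launch its AM in a container.
        node = self.allocate(am_mb)
        if node is None:
            raise RuntimeError("no capacity for the Application Master")
        node.launch(f"{app_name}-AM")
        return ApplicationMaster(app_name, self)

    def register(self, app_name):
        self.registered.append(app_name)


class ApplicationMaster:
    def __init__(self, app_name, rm):
        self.app_name = app_name
        self.rm = rm

    def run(self, task_count, task_mb):
        self.rm.register(self.app_name)          # step 3: AM registers with RM
        placements = []
        for i in range(task_count):
            node = self.rm.allocate(task_mb)      # step 4: negotiate a container
            if node is None:
                break                             # cluster is full, stop asking
            node.launch(f"{self.app_name}-task{i}")  # step 5: launch container
            placements.append(node.name)
        return placements


nodes = [NodeManager("node1", 4096), NodeManager("node2", 4096)]
rm = ResourceManager(nodes)
am = rm.submit("wordcount", am_mb=1024)
placements = am.run(task_count=4, task_mb=2048)
print(placements)  # ['node1', 'node2', 'node2']
```

In this run the Application Master itself occupies 1024 MB on node1, so only three of the four requested 2048 MB containers fit. A real Application Master would keep its request pending and receive the container once capacity frees up.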
Application Masters - One for each Application Type
MapReduce Application → MapReduce Application Master (already provided by Hadoop as a backward compatibility option for MapReduce)
Apex Application → Apex Application Master (StrAM) (provided by Apache Apex)
Flink Application → Flink Application Master
Giraph Application → Giraph Application Master
Key Takeaways
● YARN enables non-MapReduce applications to run in a distributed fashion
● Each application first asks for a container for the Application Master
○ The Application Master then talks to YARN to get the resources needed by the application
○ Once YARN allocates the requested containers to the Application Master, the Application Master starts the application components in those containers.
● Hadoop is no longer just batch processing!
Image Source: memegenerator.net
References
● Simple YARN code example
○ https://github.com/hortonworks/simple-yarn-app
● Document references
○ https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
○ http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/
○ http://www.slideshare.net/
● Acknowledgements
○ Priyanka Gugale, DataTorrent - Some of the slides