
Page 1: YARN Essentials - DropPDF1.droppdf.com/files/4DwyU/yarn-essentials-amol-fasale-2015.pdfYARN essentials is about YARN—the modern operating system for Hadoop. This book contains all

YARN Essentials


Table of Contents

YARN Essentials
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders

Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions

1. Need for YARN
The redesign idea
Limitations of the classical MapReduce or Hadoop 1.x
YARN as the modern operating system of Hadoop
What are the design goals for YARN
Summary

2. YARN Architecture
Core components of YARN architecture
ResourceManager
ApplicationMaster (AM)
NodeManager (NM)
YARN scheduler policies
The FIFO (First In First Out) scheduler
The fair scheduler
The capacity scheduler
Recent developments in YARN architecture
Summary

3. YARN Installation
Single-node installation
Prerequisites
Platform
Software
Starting with the installation
The standalone mode (local mode)
The pseudo-distributed mode
The fully-distributed mode
HistoryServer
Slave files
Operating Hadoop and YARN clusters
Starting Hadoop and YARN clusters
Stopping Hadoop and YARN clusters
Web interfaces of the Ecosystem
Summary

4. YARN and Hadoop Ecosystems
The Hadoop 2 release
A short introduction to Hadoop 1.x and MRv1
MRv1 versus MRv2
Understanding where YARN fits into Hadoop
Old and new MapReduce APIs
Backward compatibility of MRv2 APIs
Binary compatibility of org.apache.hadoop.mapred APIs
Source compatibility of org.apache.hadoop.mapred APIs
Practical examples of MRv1 and MRv2
Preparing the input file(s)
Running the job
Result
Summary

5. YARN Administration
Container allocation
Container allocation to the application
Container configurations
YARN scheduling policies
The FIFO (First In First Out) scheduler
The capacity scheduler
Capacity scheduler configurations
The fair scheduler
Fair scheduler configurations
YARN multitenancy application support
Administration of YARN
Administrative tools
Adding and removing nodes from a YARN cluster
Administrating YARN jobs
MapReduce job configurations
YARN log management
YARN web user interface
Summary

6. Developing and Running a Simple YARN Application
Running sample examples on YARN
Running a sample Pi example
Monitoring YARN applications with web GUI
YARN’s MapReduce support
The MapReduce ApplicationMaster
Example YARN MapReduce settings
YARN’s compatibility with MapReduce applications
Developing YARN applications
The YARN application workflow
Writing the YARN client
Writing the YARN ApplicationMaster
Responsibilities of the ApplicationMaster
Summary

7. YARN Frameworks
Apache Samza
Writing a Kafka producer
Writing the hello-samza project
Starting a grid
Storm-YARN
Prerequisites
Hadoop YARN should be installed
Apache ZooKeeper should be installed
Setting up Storm-YARN
Getting the storm.yaml configuration of the launched Storm cluster
Building and running Storm-Starter examples
Apache Spark
Why run on YARN?
Apache Tez
Apache Giraph
HOYA (HBase on YARN)
KOYA (Kafka on YARN)
Summary

8. Failures in YARN
ResourceManager failures
ApplicationMaster failures
NodeManager failures
Container failures
Hardware Failures
Summary

9. YARN – Alternative Solutions
Mesos
Omega
Corona
Summary

10. YARN – Future and Support
What YARN means to the big data industry
Journey – present and future
Present on-going features
Future features
YARN-supported frameworks
Summary

Index


YARN Essentials

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2015

Production reference: 1190215

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78439-173-7

www.packtpub.com


Credits

Authors
Amol Fasale
Nirmal Kumar

Reviewers
Lakshmi Narasimhan
Swapnil Salunkhe
Jenny (Xiao) Zhang

Commissioning Editor
Taron Pereira

Acquisition Editor
James Jones

Content Development Editor
Arwa Manasawala

Technical Editor
Indrajit A. Das

Copy Editors
Karuna Narayanan
Laxmi Subramanian

Project Coordinator
Purav Motiwalla

Proofreaders
Safis Editing
Maria Gould

Indexer
Priya Sane

Graphics
Sheetal Aute
Valentina D’silva
Abhinash Sahu

Production Coordinator
Shantanu N. Zagade

Cover Work
Shantanu N. Zagade


About the Authors

Amol Fasale has more than 4 years of industry experience actively working in the fields of big data and distributed computing; he is also an active blogger in and contributor to the open source community. Amol works as a senior data system engineer at MakeMyTrip.com, a very well-known travel and hospitality portal in India, where he is responsible for real-time personalization of the online user experience with Apache Kafka, Apache Storm, Apache Hadoop, and many more. Amol also has hands-on experience in Java/J2EE, Spring Frameworks, Python, machine learning, Hadoop framework components, SQL, NoSQL, and graph databases.

You can follow Amol on Twitter at @amolfasale or on LinkedIn. Amol is very active on social media. You can catch him online for any technical assistance; he would be happy to help.

Amol completed his bachelor’s in engineering (electronics and telecommunication) at Pune University and a postgraduate diploma in computers at CDAC.

The gift of love is one of the greatest blessings from parents, and I am heartily thankful to my mom, dad, friends, and colleagues who have shown and continue to show their support in different ways. Finally, I owe much to James and Arwa, without whose direction and understanding I would not have completed this work.

Nirmal Kumar is a lead software engineer at iLabs, the R&D team at Impetus Infotech Pvt. Ltd. He has more than 8 years of experience in open source technologies such as Java, JEE, Spring, Hibernate, web services, Hadoop, Hive, Flume, Sqoop, Kafka, Storm, NoSQL databases such as HBase and Cassandra, and MPP databases such as Teradata.

You can follow him on Twitter at @nirmal___kumar. He spends most of his time reading about and playing with different technologies. He has also undertaken many tech talks and training sessions on big data technologies.

He has attained his master’s degree in computer applications from Harcourt Butler Technological Institute (HBTI), Kanpur, India and is currently part of the big data R&D team in iLabs at Impetus Infotech Pvt. Ltd.

I would like to thank my organization, especially iLabs, for supporting me in writing this book. Also, a special thanks to the Packt Publishing team; without you guys, this work would not have been possible.


About the Reviewers

Lakshmi Narasimhan is a full stack developer who has been working on big data and search since the early days of Lucene and was a part of the search team at Ask.com. He is a big advocate of open source and regularly contributes and consults on various technologies, most notably Drupal and technologies related to big data. Lakshmi is currently working as the curriculum designer for his own training company, http://www.readybrains.com. He blogs occasionally about his technical endeavors at http://www.lakshminp.com and can be contacted via his Twitter handle, @lakshminp.

It’s hard to find a ready reference or documentation for a subject like YARN. I’d like to thank the author for writing a book on YARN and hope the target audience finds it useful.

Swapnil Salunkhe is a passionate software developer who is keenly interested in learning and implementing new technologies. He has a passion for functional programming, machine learning, and working with data. He has experience working in the finance and telecom domains.

I’d like to thank Packt Publishing and its staff for an opportunity to contribute to this book.

Jenny (Xiao) Zhang is a technology professional in business analytics, KPIs, and big data. She helps businesses better manage, measure, report, and analyze data to answer critical business questions and drive business growth. She is an expert in SaaS business and has experience in a variety of industry domains such as telecom, oil and gas, and finance. She has written a number of blog posts at http://jennyxiaozhang.com on big data, Hadoop, and YARN. She also actively uses Twitter at @smallnaruto to share insights on big data and analytics.

I want to thank all my blog readers. It is the encouragement from them that motivates me to deep dive into the ocean of big data. I also want to thank my dad, Michael (Tiegang) Zhang, for providing technical insights in the process of reviewing the book. A special thanks to the Packt Publishing team for this great opportunity.


www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt’s online digital book library. Here, you can search, access, and read Packt’s entire library of books.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.


Preface

In a short span of time, YARN has attained a great deal of momentum and acceptance in the big data world.

YARN Essentials is about YARN, the modern operating system for Hadoop. This book contains all that you need to know about YARN, right from its inception to the present and future.

In the first part of the book, you will be introduced to the motivation behind the development of YARN and learn about its core architecture, installation, and administration. This part also talks about the architectural differences that YARN brings to Hadoop 2 with respect to Hadoop 1 and why this redesign was needed.

In the second part, you will learn how to write a YARN application, how to submit an application to YARN, and how to monitor the application. Next, you will learn about the various emerging open source frameworks that are developed to run on top of YARN. You will learn to develop and deploy some use case examples using Apache Samza and Storm YARN.

Finally, we will talk about the failures in YARN, some alternative solutions available on the market, and the future and support for YARN in the big data world.


What this book covers

Chapter 1, Need for YARN, discusses the motivation behind the development of YARN. This chapter discusses what YARN is and why it is needed.

Chapter 2, YARN Architecture, is a deep dive into YARN’s architecture. All the major components and their inner workings are explained in this chapter.

Chapter 3, YARN Installation, describes the steps required to set up a single-node and fully-distributed YARN cluster. It also talks about the important configurations/properties that you should be aware of while installing the YARN cluster.

Chapter 4, YARN and Hadoop Ecosystems, talks about Hadoop with respect to YARN. It gives a short introduction to the Hadoop 1.x version, the architectural differences between Hadoop 1.x and Hadoop 2.x, and where exactly YARN fits into Hadoop 2.x.

Chapter 5, YARN Administration, covers information on the administration of YARN clusters. It explains the administrative tools that are available in YARN, what they mean, and how to use them. This chapter covers various topics from YARN container allocation and configuration to various scheduling policies/configurations and in-built support for multitenancy.

Chapter 6, Developing and Running a Simple YARN Application, focuses on some real applications with YARN, with some hands-on examples. It explains how to write a YARN application, how to submit an application to YARN, and finally, how to monitor the application.

Chapter 7, YARN Frameworks, discusses the various emerging open source frameworks that are developed to run on top of YARN. The chapter then talks in detail about Apache Samza and Storm on YARN, where we will develop and run some sample applications using these frameworks.

Chapter 8, Failures in YARN, discusses the fault-tolerance aspect of YARN. This chapter focuses on various failures that can occur in the YARN framework, their causes, and how YARN gracefully handles those failures.

Chapter 9, YARN – Alternative Solutions, discusses other alternative solutions that are available on the market today. These systems, like YARN, share common inspiration/requirements and the high-level goal of improving scalability, latency, fault-tolerance, and programming model flexibility. This chapter highlights the key differences in the way these alternative solutions address the same features provided by YARN.

Chapter 10, YARN – Future and Support, talks about YARN’s journey and its present and future in the world of distributed computing.


What you need for this book

You will need a single Linux-based machine with JDK 1.6 or later installed. Any recent version of the Apache Hadoop 2 distribution will be sufficient to set up a YARN cluster and run some examples on top of YARN.

The code in this book has been tested on CentOS 6.4 but will run on other variants of Linux.


Who this book is for

This book is for big data enthusiasts who want to gain in-depth knowledge of YARN and know what really makes YARN the modern operating system for Hadoop. You will develop a good understanding of the architectural differences that YARN brings to Hadoop 2 with respect to Hadoop 1.

You will develop in-depth knowledge about the architecture and inner workings of the YARN framework.

After finishing this book, you will be able to install and administer YARN and develop YARN applications. This book tells you everything you need to know about YARN, right from its inception to its present and future in the big data industry.


Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: “The URL for NameNode is http://<namenode_host>:<port>/ and the default HTTP port is 50070.”

A block of code is set as follows:

<property>
  <name>io.file.buffer.size</name>
  <value>4096</value>
  <description>read and write buffer size of files</description>
</property>

Any command-line input or output is written as follows:

${path_to_your_input_dir}

${path_to_your_output_dir_old}

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: “Under the Tools section, you can find the YARN configuration file details, scheduling information, container configurations, local logs of the jobs, and a lot of other information on the cluster.”

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.


Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book’s title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.


Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.


Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.


Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.


Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.


Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.


Chapter 1. Need for YARN

YARN stands for Yet Another Resource Negotiator. YARN is a generic resource platform to manage resources in a typical cluster. YARN was introduced with Hadoop 2.0, which is an open source distributed processing framework from the Apache Software Foundation.

In 2012, YARN became one of the subprojects of the larger Apache Hadoop project. YARN is also known as MapReduce 2.0, because Apache Hadoop MapReduce was re-architected from the ground up into Apache Hadoop YARN.

Think of YARN as a generic computing fabric to support MapReduce and other application paradigms within the same Hadoop cluster; earlier, this was limited to batch processing using MapReduce. This really changed the game, recasting Apache Hadoop as a much more powerful data processing system. With the advent of YARN, Hadoop now looks very different compared to the way it was only a year ago.

YARN enables multiple applications to run simultaneously on the same shared cluster and allows applications to negotiate resources based on need. Therefore, resource allocation/management is central to YARN.

YARN has been thoroughly tested at Yahoo! since September 2012. It has been in production across 30,000 nodes and 325 PB of data since January 2013.

Recently, Apache Hadoop YARN won the Best Paper Award at the ACM Symposium on Cloud Computing (SoCC) in 2013!


The redesign idea

Initially, Hadoop was written solely as a MapReduce engine. Since it runs on a cluster, its cluster management components were also tightly coupled with the MapReduce programming paradigm.

The concepts of MapReduce and its programming paradigm were so deeply ingrained in Hadoop that one could not use it for anything else except MapReduce. MapReduce therefore became the base for Hadoop, and as a result, the only thing that could be run on Hadoop was a MapReduce job, batch processing. In Hadoop 1.x, there was a single JobTracker service that was overloaded with many things such as cluster resource management, scheduling jobs, managing computational resources, restarting failed tasks, monitoring TaskTrackers, and so on.

There was definitely a need to separate the MapReduce (specific programming model) part and the resource management infrastructure in Hadoop. YARN was the first attempt to perform this separation.


Limitations of the classical MapReduce or Hadoop 1.x

The main limitations of Hadoop 1.x can be categorized into the following areas:

Limited scalability:

Large Hadoop clusters reported some serious limitations on scalability. This is caused mainly by a single JobTracker service, which ultimately results in a serious deterioration of the overall cluster performance because of attempts to re-replicate data and overload live nodes, thus causing a network flood. According to Yahoo!, the practical limits of such a design are reached with a cluster of ~5,000 nodes and 40,000 tasks running concurrently. Therefore, it is recommended that you create smaller and less powerful clusters for such a design.

Low cluster resource utilization:

The resources in Hadoop 1.x on each slave node (data node) are divided in terms of a fixed number of map and reduce slots. Consider the scenario where a MapReduce job has already taken up all the available map slots and now wants more new map tasks to run. In this case, it cannot run new map tasks, even though all the reduce slots are still empty. This notion of a fixed number of slots has a serious drawback and results in poor cluster utilization.
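
The slot scenario described above can be illustrated with a small, self-contained simulation. This is only a sketch and does not use any Hadoop API: the `SlotNode` class, its method names, and the slot counts (4 map, 4 reduce) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SlotNode:
    """Hypothetical Hadoop 1.x style worker with fixed map and reduce slots."""
    map_slots: int
    reduce_slots: int
    running_maps: int = 0
    running_reduces: int = 0

    def try_start(self, task_type: str) -> bool:
        """Start a task only if a slot of the matching type is free."""
        if task_type == "map" and self.running_maps < self.map_slots:
            self.running_maps += 1
            return True
        if task_type == "reduce" and self.running_reduces < self.reduce_slots:
            self.running_reduces += 1
            return True
        # A free slot of the other type cannot be borrowed.
        return False

    def utilization(self) -> float:
        """Fraction of all slots (map plus reduce) currently in use."""
        used = self.running_maps + self.running_reduces
        return used / (self.map_slots + self.reduce_slots)


# Assumed slot counts: 4 map slots and 4 reduce slots on one node.
node = SlotNode(map_slots=4, reduce_slots=4)

# A job asks for 6 map tasks; only 4 fit, although 4 reduce slots are idle.
started = [node.try_start("map") for _ in range(6)]
print(started)             # [True, True, True, True, False, False]
print(node.utilization())  # 0.5
```

In this sketch, two map tasks have to wait while half of the node sits idle, which is the poor utilization the text refers to.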

Lack of support for alternative frameworks/paradigms:

The main focus of Hadoop right from the beginning was to perform computation on large datasets using parallel processing. Therefore, the only programming model it supported was MapReduce. With the current industry needs in terms of new use cases in the world of big data, many new and alternative programming models (such as Apache Giraph, Apache Spark, Storm, Tez, and so on) are coming into the picture each day. There is definitely an increasing demand to support multiple programming paradigms besides MapReduce, to support the varied use cases that the big data world is facing.


YARN as the modern operating system of Hadoop

The MapReduce programming model is, no doubt, great for many applications, but not for everything in the world of computation. Some use cases are well suited to MapReduce, but not all.

MapReduce is essentially batch-oriented, but support for real-time and near real-time processing is an emerging requirement in the field of big data.

YARN took cluster resource management capabilities from the MapReduce system so that new engines could use these generic cluster resource management capabilities. This lightened up the MapReduce system to focus on the data processing part, which it is good at and will ideally continue to be so.

YARN therefore turns into a data operating system for Hadoop 2.0, as it enables multiple applications to coexist in the same shared cluster. Refer to the following figure:

YARN as a modern OS for Hadoop


What are the design goals for YARN

This section talks about the core design goals of YARN:

Scalability:

Scalability is a key requirement for big data. Hadoop was primarily meant to work on a cluster of thousands of nodes with commodity hardware. Also, the cost of hardware is reducing year-on-year. YARN is therefore designed to perform efficiently on this network of a myriad of nodes.

High cluster utilization:

In Hadoop 1.x, the cluster resources were divided in terms of fixed-size slots for both map and reduce tasks. This means that there could be a scenario where map slots might be full while reduce slots are empty, or vice versa. This was definitely not an optimal utilization of resources, and it needed further optimization. YARN allocates fine-grained resources in terms of RAM, CPU, and disk (containers), leading to an optimal utilization of the available resources.
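
To contrast this with fixed slots, the following sketch models a node whose capacity is one shared pool of memory and CPU, carved into containers on request. It is a simplified illustration, not YARN's actual allocator: the `ContainerNode` class, the 8192 MB / 8 vcore capacity, and the 1024 MB / 1 vcore request size are assumptions, and disk is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class ContainerNode:
    """Hypothetical worker whose resources form one shared pool."""
    memory_mb: int
    vcores: int
    used_memory_mb: int = 0
    used_vcores: int = 0

    def try_allocate(self, memory_mb: int, vcores: int) -> bool:
        """Grant a container if both memory and CPU still fit on the node."""
        fits_memory = self.used_memory_mb + memory_mb <= self.memory_mb
        fits_cpu = self.used_vcores + vcores <= self.vcores
        if fits_memory and fits_cpu:
            self.used_memory_mb += memory_mb
            self.used_vcores += vcores
            return True
        return False

    def memory_utilization(self) -> float:
        """Fraction of the node's memory currently held by containers."""
        return self.used_memory_mb / self.memory_mb


# Assumed capacity: 8 GB of RAM and 8 virtual cores.
pool_node = ContainerNode(memory_mb=8192, vcores=8)

# Six tasks of any type each ask for a 1 GB, 1-vcore container.
granted = [pool_node.try_allocate(1024, 1) for _ in range(6)]
print(granted)                         # [True, True, True, True, True, True]
print(pool_node.memory_utilization())  # 0.75
```

Because containers are not typed as "map" or "reduce", all six requests fit in this sketch, and any task type can use whatever capacity remains.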

Locality awareness:

This is a key requirement for YARN when dealing with big data; moving computation is cheaper than moving data. This helps to minimize network congestion and increase the overall throughput of the system.
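
One way to picture locality awareness is a placement function that prefers a node already holding the data, then a node in the same rack, and only then any free node. The sketch below is an assumption-laden illustration, not YARN's scheduler code: `pick_node`, the node and rack names, and the three locality labels are all hypothetical.

```python
def pick_node(replica_nodes, free_nodes, rack_of):
    """Return (node, locality) for a task that reads one data block.

    replica_nodes: set of nodes that store a copy of the block
    free_nodes:    list of nodes with spare capacity, in preference order
    rack_of:       dict mapping each node name to its rack name
    """
    # 1. Prefer a free node that already stores the data.
    for node in free_nodes:
        if node in replica_nodes:
            return node, "node-local"
    # 2. Otherwise prefer a free node in the same rack as a replica.
    replica_racks = {rack_of[n] for n in replica_nodes}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # 3. Fall back to any free node; the data must cross racks.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "none"


# Hypothetical four-node cluster spread over two racks.
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}

print(pick_node({"n1", "n3"}, ["n2", "n3"], rack_of))  # ('n3', 'node-local')
print(pick_node({"n1"}, ["n2", "n4"], rack_of))        # ('n2', 'rack-local')
print(pick_node({"n1"}, ["n4"], rack_of))              # ('n4', 'off-rack')
```

Each step down the list moves more data over the network, which is why the placement tries the cheaper options first.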

Multitenancy:

WiththecoredevelopmentofHadoopatYahoo,primarilytosupportlarge-scalecomputation,HDFSalsoacquiredapermissionmodel,quotas,andotherfeaturestoimproveitsmultitenantoperation.YARNwasthereforedesignedtosupportmultitenancyinitscorearchitecture.Sinceclusterresourceallocation/managementisattheheartofYARN,sharingprocessingandstoragecapacityacrossclusterswascentraltothedesign.YARNhasthenotionofpluggableschedulersandtheCapacitySchedulerwithYARNhasbeenenhancedtoprovideaflexibleresourcemodel,elasticcomputing,applicationlimits,andothernecessaryfeaturesthatenablemultipletenantstosecurelysharetheclusterinanoptimizedway.

Supportforprogrammingmodel:

TheMapReduceprogrammingmodelisnodoubtgreatformanyapplications,butnotforeverythingintheworldofcomputation.Astheworldofbigdataisstillinitsinceptionphase,organizationsareheavilyinvestinginR&Dtodevelopnewandevolvingframeworkstosolveavarietyofproblemsthatbigdatabrings.

A flexible resource model:

Besides the mismatch with the emerging frameworks’ requirements, the fixed number of slots for resources had serious problems. It was straightforward for YARN to come up with a flexible and generic resource management model.

A secure and auditable operation:

As Hadoop continued to grow to manage more tenants with a myriad of use cases across different industries, the requirements for isolation became more demanding. Also, the authorization model lacked strong and scalable authentication. This is because Hadoop was designed with parallel processing in mind, with no comprehensive security. Security was an afterthought. YARN understands this and adds security-related requirements into its design.

Reliability/availability:

Although fault tolerance is in the core design, in reality maintaining a large Hadoop cluster is a tedious task. All issues related to high availability, failures, failures on restart, and reliability were therefore a core requirement for YARN.

Backward compatibility:

Hadoop 1.x has been in the picture for a while, with many successful production deployments across many industries. This massive installation base of MapReduce applications and the ecosystem of related projects, such as Hive, Pig, and so on, would not tolerate a radical redesign. Therefore, the new architecture reused as much code from the existing framework as possible, and no major surgery was conducted on it. This made MRv2 able to ensure satisfactory compatibility with MRv1 applications.


Summary

In this chapter, you learned what YARN is and how it has turned out to be the modern operating system for Hadoop, making it a multiapplication platform.

In Chapter 2, YARN Architecture, we will be talking about the architecture details of YARN.


Chapter 2. YARN Architecture

This chapter dives deep into YARN architecture, its core components, and how they interact to deliver optimal resource utilization, better performance, and manageability. It also focuses on some important terminology concerning YARN.

In this chapter, we will cover the following topics:

Core components of YARN architecture
Interaction and flow of YARN components
ResourceManager scheduling policies
Recent developments in YARN

The motivation behind the YARN architecture is to support more data processing models, such as Apache Spark, Apache Storm, Apache Giraph, Apache HAMA, and so on, than just MapReduce. YARN provides a platform to develop and execute distributed processing applications. It also improves efficiency and resource-sharing capabilities.

The design decision behind YARN architecture is to separate the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons, that is, a cluster-level ResourceManager (RM) and an application-specific ApplicationMaster (AM). YARN architecture follows a master-slave architectural model in which the ResourceManager is the master and the node-specific NodeManager (NM) is the slave. The global ResourceManager and the per-node NodeManagers build a generic, scalable, and simple platform for distributed application management. The ResourceManager is the supervisor component that manages the resources among the applications in the whole system. The per-application ApplicationMaster is the application-specific daemon that negotiates resources from the ResourceManager and works hand in hand with the NodeManagers to execute and monitor the application’s tasks.

The following diagram explains how the JobTracker is replaced by a global-level ResourceManager and ApplicationsManager together with an application-level ApplicationMaster, which takes over the per-job functions and responsibilities, while the per-node TaskTracker gives way to the per-node NodeManager. The JobTracker and TaskTracker only supported MapReduce applications, with less scalability and poor cluster utilization. Now, YARN supports multiple distributed data processing models with improved scalability and cluster utilization.

The ResourceManager has a cluster-level scheduler that is responsible for resource allocation to all the running tasks as per the ApplicationMaster’s requests. The primary responsibility of the ResourceManager is to allocate resources to the application(s). The ResourceManager is not responsible for tracking the status of an application or monitoring tasks. Also, it doesn’t guarantee restarting/balancing tasks in the case of application or hardware failure.

The application-level ApplicationMaster is responsible for negotiating resources, such as memory, CPU, disk, and so on, from the ResourceManager on application submission. It is also responsible for tracking an application’s status and monitoring application processes in coordination with the NodeManager.

Let’s have a look at the high-level architecture of Hadoop 2.0. As you can see, more applications can be supported by YARN than just the MapReduce application. The key component of Hadoop 2 is YARN, which provides better cluster resource management; the underlying file system remains the Hadoop Distributed File System (HDFS), as shown in the following image:

Here are some key concepts that we should know before exploring the YARN architecture in detail:

Application: This is the job submitted to the framework, for example a MapReduce job. It could also be a shell script.
Container: This is the basic unit of hardware allocation, for example a container that has 4 GB of RAM and one CPU. Containers allow optimized resource allocation; they replace the fixed map and reduce slots in the previous versions of Hadoop.

Core components of YARN architecture

Here are some core components of YARN architecture that we need to know:

ResourceManager
ApplicationMaster
NodeManager

ResourceManager

The ResourceManager acts as a global resource scheduler that is responsible for resource management and scheduling as per the ApplicationMaster’s requests for the resource requirements of the application(s). It is also responsible for the management of hierarchical job queues. The ResourceManager can be seen in the following figure:

The preceding diagram gives more details about the components of the ResourceManager. The Admin and Client service is responsible for client interactions, such as a job request submission, start, restart, and so on. The ApplicationsManager is responsible for the management of every application. The ApplicationMasterService interacts with every application’s ApplicationMaster regarding resource or container negotiation. The ResourceTrackerService coordinates between the NodeManagers and the ResourceManager. The ApplicationMasterLauncher service is responsible for launching a container for the ApplicationMaster on job submission from the client. The Scheduler and Security are the core parts of the ResourceManager. As already explained, the Scheduler is responsible for resource negotiation and allocation to the applications as per the request of the ApplicationMaster. There are three different scheduler policies, FIFO, Fair, and Capacity, which will be explained in detail later in this chapter. The security component is responsible for generating and delegating the ApplicationToken and ContainerToken to access the application and container, respectively.

ApplicationMaster (AM)

The ApplicationMaster operates at a per-application level. It is responsible for the application’s lifecycle management, for negotiating the appropriate resources from the Scheduler, and for tracking their status and monitoring progress. An example is the MapReduce ApplicationMaster.

NodeManager (NM)

The NodeManager acts as a per-machine agent and is responsible for managing the lifecycle of containers and for monitoring their resource usage. The core components of the NodeManager are shown in the following diagram:

The component responsible for communication between the NodeManager and ResourceManager is the NodeStatusUpdater. The ContainerManager is the core component of the NodeManager; it manages all the containers that run on the node. NodeHealthCheckerService is the service that monitors the node’s health and communicates the node’s heartbeat to the ResourceManager via the NodeStatusUpdater service. The ContainerExecutor is the process responsible for interacting with native hardware or software to start or stop the container process. Management of Access Control Lists (ACLs) and access token verification is performed by the Security component.

Let’s take a look at one scenario to understand YARN architecture in detail. Refer to the following diagram:

Say we have two client requests: one wants to execute a simple shell script, while another one wants to execute a complex MapReduce job. The shell script is represented in maroon, while the MapReduce job is represented in light green in the preceding diagram.

The ResourceManager has two main components, the ApplicationsManager and the Scheduler. The ApplicationsManager is responsible for accepting the client’s job submission requests, negotiating the first container to execute the application-specific ApplicationMaster, and providing the service to restart the ApplicationMaster on failure. The responsibility of the Scheduler is to allocate resources to the various running applications with respect to the application resource requirements and available resources. The Scheduler is a pure scheduler in the sense that it provides no monitoring or tracking functions for the application. Also, it doesn’t offer any guarantees for restarting a failed task, whether the failure is in the application or in the hardware. The Scheduler performs its scheduling tasks based on the resource requirements of the application(s); it does so based on the abstract notion of the resource container, which incorporates elements such as CPU, memory, disk, and so on.

The NodeManager is the per-machine framework daemon that is responsible for the containers’ lifecycles. It is also responsible for monitoring their resource usage, for example, memory, CPU, disk, network, and so on, and for reporting this to the ResourceManager accordingly. The application-level ApplicationMaster is responsible for negotiating the required resource containers from the scheduler, tracking their status, and monitoring progress. In the preceding diagram, you can see that both jobs, the shell script and MapReduce, have an individual ApplicationMaster that allocates resources for job execution and tracks/monitors the job execution status.

Now, take a look at the execution sequence of the application. Refer to the preceding application flow diagram.

A client submits the application to the ResourceManager. In the preceding diagram, client 1 submits a shell script request (maroon), and client 2 submits a MapReduce request (green):

1. Then, the ResourceManager allocates a container to start up the ApplicationMaster as per the application submitted by the client: one ApplicationMaster for the shell script and one for the MapReduce application.

2. While starting up, the ApplicationMaster registers the application with the ResourceManager.

3. After the startup of the ApplicationMaster, it negotiates with the ResourceManager for appropriate resources as per the application requirement.

4. Then, after resource allocation from the ResourceManager, the ApplicationMaster requests that the NodeManager launch the containers allocated by the ResourceManager.

5. On successful launching of the containers, the application code executes within the container, and the ApplicationMaster reports back to the ResourceManager with the execution status of the application.

6. During the execution of the application, the client can query the ApplicationMaster or the ResourceManager directly for the application status, progress updates, and so on.

7. On completion of the application, the ApplicationMaster unregisters from the ResourceManager and shuts down its own container process.
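The sequence above can be traced with a toy model. This is a minimal sketch, not YARN code: the classes `ResourceManager` and `ApplicationMaster` below are hypothetical stand-ins that only record who talks to whom, and the application names and container counts are assumptions chosen for illustration.

```python
class ResourceManager:
    """Toy ResourceManager that records each interaction in a log."""

    def __init__(self):
        self.log = []
        self.registered = set()

    def submit(self, app_id):
        self.log.append(f"client submits {app_id}")
        self.log.append(f"RM allocates AM container for {app_id}")
        return ApplicationMaster(app_id, self)

    def register(self, app_id):
        self.registered.add(app_id)
        self.log.append(f"AM for {app_id} registers with RM")

    def allocate(self, app_id, count):
        self.log.append(f"RM grants {count} container(s) to {app_id}")
        return [f"{app_id}-container-{i}" for i in range(1, count + 1)]

    def unregister(self, app_id):
        self.registered.discard(app_id)
        self.log.append(f"AM for {app_id} unregisters from RM")


class ApplicationMaster:
    """Toy per-application master: registers, asks for containers, then unregisters."""

    def __init__(self, app_id, rm):
        self.app_id, self.rm = app_id, rm
        rm.register(app_id)

    def run(self, containers_needed):
        for container in self.rm.allocate(self.app_id, containers_needed):
            self.rm.log.append(f"NM launches {container}")
        self.rm.log.append(f"AM reports {self.app_id} finished")
        self.rm.unregister(self.app_id)


rm = ResourceManager()
rm.submit("shell-script").run(containers_needed=1)
rm.submit("mapreduce-job").run(containers_needed=3)
print("\n".join(rm.log))
```

Each application gets its own ApplicationMaster, which is the point of the design: the ResourceManager only hands out containers, while per-job tracking lives with the job.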


YARN scheduler policies

As explained in the previous section, the ResourceManager acts as a pluggable global scheduler that manages and controls all the containers (resources). There are three different policies that can be applied over the scheduler, as per requirements and resource availability. They are as follows:

The FIFO scheduler
The Fair scheduler
The Capacity scheduler

The FIFO (First In First Out) scheduler

FIFO means First In First Out. As the name indicates, the job submitted first gets priority to execute; in other words, jobs run in the order of submission. FIFO is a queue-based scheduler. It is a very simple approach to scheduling and it does not guarantee performance efficiency, as each job would use the whole cluster for execution. So other jobs may be kept waiting for their turn, although a shared cluster has the capability to offer more than enough resources to many users.
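The waiting effect is easy to see in a small simulation. This is a sketch under the simplifying assumption that each job occupies the whole cluster until it finishes; the function name `fifo_schedule`, the job names, and the runtimes are hypothetical.

```python
from collections import deque


def fifo_schedule(jobs):
    """jobs: list of (name, runtime) in submission order.

    Each job gets the whole cluster, one after another.
    Returns {name: (start_time, finish_time)}.
    """
    queue = deque(jobs)
    clock = 0
    timeline = {}
    while queue:
        name, runtime = queue.popleft()
        timeline[name] = (clock, clock + runtime)
        clock += runtime
    return timeline


timeline = fifo_schedule([("long-etl", 60), ("short-query", 2)])
print(timeline)
```

In this run the short job cannot start until the long one has finished, even though it needs only a fraction of the time.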


The fair scheduler

Fair scheduling is a scheduling policy that assigns resources for the execution of applications so that all applications get an equal share of cluster resources over a period of time. For example, if a single job is running, it would get all the resources available in the cluster, and as the number of jobs increases, freed resources will be given to the new jobs so that each user gets a fair share of the cluster. If two users have submitted two different jobs, a short job that belongs to one user would complete in a small time span while a longer job submitted by the other user keeps running, so long jobs still make some progress.

In a Fair scheduling policy, all jobs are placed into job pools, specific to users; accordingly, each user gets their own job pool. A user who submits more jobs than another user will not, on average, get more resources than that user. You may even define your own customized job pools with specified configurations. Fair scheduling is preemptive: if a pool has not received its fair share of resources to run a particular task for a certain period of time, the scheduler will kill tasks in pools that run over capacity, to release resources to the pools that run under capacity.

In addition to fair scheduling, the Fair scheduler allocates a guaranteed minimum share of resources to the pools. This is always helpful for the users, groups, or applications, as they always get sufficient resources for execution.
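One way to picture equal shares combined with guaranteed minimums is the following calculation. It is a simplified sketch, not the Fair scheduler's actual algorithm: it ignores weights, demand, and preemption, and the function name `fair_shares`, the pool names, and the figures are assumptions.

```python
def fair_shares(total, min_shares):
    """Split `total` resources equally, but never give a pool less than its minimum.

    min_shares: {pool_name: guaranteed_minimum}. Returns {pool_name: share}.
    """
    if sum(min_shares.values()) > total:
        raise ValueError("minimum shares exceed cluster capacity")
    shares = {}
    remaining = dict(min_shares)
    remaining_total = total
    while remaining:
        level = remaining_total / len(remaining)
        # Pools whose guaranteed minimum is above the equal split keep their minimum.
        above = {pool: m for pool, m in remaining.items() if m > level}
        if not above:
            for pool in remaining:
                shares[pool] = level
            break
        for pool, minimum in above.items():
            shares[pool] = minimum
            remaining_total -= minimum
            del remaining[pool]
    return shares


print(fair_shares(100, {"alice": 0}))
print(fair_shares(100, {"alice": 0, "bob": 0}))
print(fair_shares(100, {"alice": 0, "bob": 0, "prod": 40}))
```

A single pool receives the whole cluster, two pools split it evenly, and a pool with a guaranteed minimum above the even split keeps that minimum while the rest is divided among the others.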


The capacity scheduler

The Capacity scheduler is designed to allow applications to share cluster resources in a predictable and simple fashion, through what are commonly known as “job queues”. The main idea behind capacity scheduling is to allocate available resources to the running applications, based on individual needs and requirements. There are additional benefits when running applications using capacity scheduling, as they can access the excess capacity resources that are not being used by any other applications.

The abstraction provided by the capacity scheduler is the queue. It provides capacity guarantees and support for multiple queues: a job is submitted to a queue, and each queue is allocated a capacity in the sense that a certain capacity of resources will be at its disposal. All the jobs submitted to a queue will access the resources allocated to that queue. Admins can control the capacity of each queue.

Here are some basic features of the capacity scheduler:

Security: Each queue has strict ACLs that control the authorization and authentication of users who can submit jobs to individual queues.
Elasticity: Free resources can be allocated to any queue beyond its capacity. If there is demand for these resources from queues that run below capacity, then as soon as the tasks scheduled on these resources have completed, they will be assigned to jobs on queues that run under capacity.
Operability: The admin can, at any point in time, change queue definitions and properties.
Multitenancy: A comprehensive set of limits is provided to prevent a single job, user, or queue from monopolizing the resources of the queue or cluster. This is to ensure that the system, specifically in previous versions of Hadoop, is not overwhelmed by too many tasks.
Resource-based scheduling: Resource-intensive jobs are supported, as jobs can specifically demand higher resource requirements than the default.
Job priorities: These job queues can support job priorities. Within a queue, jobs with high priority have access to resources before jobs with lower priority.
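Capacity guarantees and elasticity can be sketched as a two-pass allocation. This is an illustrative simplification, not the Capacity scheduler's real logic: the function `capacity_allocate`, the queue names, and the percentages are assumptions, and spare capacity is simply handed out in queue order.

```python
def capacity_allocate(total, queues):
    """queues: {name: (capacity_percent, demand)}. Returns {name: allocation}.

    Pass 1 gives each queue up to its guaranteed capacity.
    Pass 2 lends unused capacity to queues that still have demand (elasticity).
    """
    alloc = {}
    for name, (percent, demand) in queues.items():
        alloc[name] = min(demand, total * percent // 100)
    spare = total - sum(alloc.values())
    for name, (_, demand) in queues.items():
        extra = min(spare, demand - alloc[name])
        alloc[name] += extra
        spare -= extra
    return alloc


# "prod" is guaranteed 70 units but only needs 40; "dev" is guaranteed 30 but wants 50.
print(capacity_allocate(100, {"prod": (70, 40), "dev": (30, 50)}))
```

Here the dev queue borrows capacity that prod is not using. In a real cluster that borrowed capacity would be returned to prod as tasks complete, which this sketch does not model.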


Recent developments in YARN architecture

The ResourceManager is a single point of failure and may need to restart for various reasons: bugs, hardware failure, deliberate downtime for upgrading, and so on.

We already saw how crucial the role of the ResourceManager in YARN architecture is. The ResourceManager has become a single point of failure; if the ResourceManager in a cluster goes down, everything on that cluster will be lost.

So in a recent development of YARN, ResourceManager HA became a high priority. This recent development of YARN not only covers ResourceManager HA, but also provides transparency to users and does not require them to monitor such events explicitly and resubmit the jobs.

Recovery is overly complex in MRv1 because the JobTracker has to save too much metadata: both cluster state and per-application running state. This means that if the JobTracker dies, then all the applications in a running state will be lost.

The development of ResourceManager recovery will be done in two phases:

1. RM Restart Phase I: In this phase, all the applications will be killed while restarting the ResourceManager on failure. No state of the application can be stored. Development of this phase is almost completed.

2. RM Restart Phase II: In this phase, applications preserve their state on RM failure; this means that applications are not killed, and they report their running state back to the RM after the RM comes back up.

The ResourceManager will be used only to save an application’s submission metadata and cluster-level information. Application state persistence and the recovery of specific information will be managed by the application itself.

As shown in the preceding diagram, in the next version, we will get a pluggable state store, such as ZooKeeper or HDFS, that can store the state of the running applications. ResourceManager HA would consist of synchronized active-passive ResourceManagers managed by ZooKeeper; when one goes down, the other can take over cluster responsibility without halting or losing information.
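The idea of a pluggable state store can be sketched in a few lines. This is a conceptual model only: `InMemoryStateStore` is a hypothetical stand-in for a ZooKeeper- or HDFS-backed store, the toy `ResourceManager` class is not the YARN class, and the application ID and queue name are made up.

```python
class InMemoryStateStore:
    """Stand-in for a pluggable store; a real one would persist to ZooKeeper or HDFS."""

    def __init__(self):
        self._apps = {}

    def save(self, app_id, metadata):
        self._apps[app_id] = dict(metadata)

    def load_all(self):
        return {app_id: dict(meta) for app_id, meta in self._apps.items()}


class ResourceManager:
    """Toy RM that keeps only submission metadata and recovers it on startup."""

    def __init__(self, store):
        self.store = store
        self.apps = store.load_all()  # recover known submissions after a restart

    def submit(self, app_id, queue):
        self.apps[app_id] = {"queue": queue}
        self.store.save(app_id, self.apps[app_id])


store = InMemoryStateStore()
rm = ResourceManager(store)
rm.submit("app_0001", queue="default")

# Simulate an RM restart (or a standby taking over) against the same store.
restarted = ResourceManager(store)
print(restarted.apps)
```

Because the store outlives any single ResourceManager instance, a restarted or standby instance can pick up the submitted applications instead of asking users to resubmit them.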


Summary

In this chapter, we covered the architectural components of YARN, their responsibilities, and their interoperations. We also focused on some major development work going on in the community to overcome the drawbacks of the current release. In the next chapter, we will cover the installation steps of YARN.

Chapter 3. YARN Installation

In this section, we’ll cover the installation of Hadoop and YARN and their configuration for a single-node and single-cluster setup. Now, we will consider Hadoop as two different components: one is Hadoop Distributed File System (HDFS), the other is YARN. The YARN components take care of resource allocation and the scheduling of the jobs that run over the data stored in HDFS. We’ll cover most of the configurations to make YARN distributed computing more optimized and efficient.

In this chapter, we will cover the following topics:

Hadoop and YARN single-node installation
Hadoop and YARN fully-distributed mode installation
Operating Hadoop and YARN clusters

Single-node installation

Let’s start with the steps for Hadoop’s single-node installation, as it’s easy to understand and set up. This way, we can quickly perform simple operations using Hadoop MapReduce and the HDFS.

Prerequisites

Here are some prerequisites needed for Hadoop installations; make sure that the prerequisites are fulfilled to start working with Hadoop and YARN.

Platform

GNU/Linux is supported for Hadoop installation as a development as well as a production platform. The Windows platform is also supported for Hadoop installation, with some extra configurations. Now, we’ll focus more on Linux-based platforms, as Hadoop is more widely used with these platforms and works more efficiently with Linux compared to Windows systems. Here are the steps for single-node Hadoop installation for Linux systems. If you want to install it on Windows, refer to the Hadoop wiki page for the installation steps.

Software

Here’s some software; make sure that it is installed before installing Hadoop.

Java must be installed. Confirm whether the Java version is compatible with the Hadoop version that is to be installed by checking the Hadoop wiki page (http://wiki.apache.org/hadoop/HadoopJavaVersions).

SSH and SSHD must be installed and running, as they are used by Hadoop scripts to manage remote Hadoop daemons.

Now, download the recent stable release of the Hadoop distribution from Apache mirrors and archives using the following command:

$ wget http://mirrors.ibiblio.org/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz

Note that at the time of writing this book, Hadoop 2.6.0 is the most recent stable release. Now use the following commands:

$ mkdir -p /opt/yarn
$ cd /opt/yarn
$ tar xvzf /root/hadoop-2.6.0.tar.gz

Starting with the installation

Now, unzip the downloaded distribution under the /etc/ directory. Change the Hadoop environmental parameters as per the following configurations.

Set the JAVA_HOME environmental parameter to the Java root installed before:

$ export JAVA_HOME=/etc/java/latest

Set the Hadoop home to the Hadoop installation directory:

$ export HADOOP_HOME=/etc/hadoop

Try running the hadoop command. It should display the Hadoop usage documentation; this indicates a successful Hadoop configuration.

Now, our Hadoop single-node setup is ready to run in the following modes.

The standalone mode (local mode)

By default, Hadoop runs in standalone mode as a single Java process. This mode is useful for development and debugging.

The pseudo-distributed mode

Hadoop can run on a single node in pseudo-distributed mode, in which each daemon runs as a separate Java process. To run Hadoop in pseudo-distributed mode, follow these configuration instructions. First, navigate to /etc/hadoop/core-site.xml.

This configuration sets up the NameNode to run on localhost port 9000. You can set the following property for the NameNode:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Now navigate to /etc/hadoop/hdfs-site.xml.

By setting the following property, we ensure that the replication factor of each data block is 3 (by default, the replication factor is 3):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

Then, format the Hadoop filesystem using this command:

$ $HADOOP_HOME/bin/hdfs namenode -format

After formatting the filesystem, start the namenode and datanode daemons using the next command. You can see logs under the $HADOOP_HOME/logs directory by default:

$ $HADOOP_HOME/sbin/start-dfs.sh

Now, we can see the namenode UI on the web interface. Open http://localhost:50070/ in the browser.

Create the HDFS directories that are required to run MapReduce jobs:

$ $HADOOP_HOME/bin/hdfs dfs -mkdir /user
$ $HADOOP_HOME/bin/hdfs dfs -mkdir /user/{username}

To run a MapReduce job on YARN in pseudo-distributed mode, you need to configure and then start the ResourceManager and NodeManager daemons. First, navigate to /etc/hadoop/mapred-site.xml:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Navigate to /etc/hadoop/yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
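Before starting the daemons, it can help to confirm that the site files contain the properties you expect. The following sketch parses a Hadoop-style configuration document with Python's standard library; the helper name `read_site_properties` is hypothetical, and the XML is embedded as a string here only so the example is self-contained.

```python
import xml.etree.ElementTree as ET


def read_site_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style *-site.xml document."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name").strip(): (prop.findtext("value") or "").strip()
        for prop in root.iter("property")
        if prop.findtext("name")
    }


YARN_SITE = """<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>"""

props = read_site_properties(YARN_SITE)
print(props["yarn.nodemanager.aux-services"])
```

To check a file on disk instead, the same function can be fed the file's contents, for example the text read from yarn-site.xml in your configuration directory.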

Now, start the ResourceManager and NodeManager daemons by issuing this command:

$ $HADOOP_HOME/sbin/start-yarn.sh

By simply navigating to http://localhost:8088/ in your browser, you can see the web interface for the ResourceManager. From here, you can start, restart, or stop the jobs.

To stop the YARN daemons, you need to run the following command:

$ $HADOOP_HOME/sbin/stop-yarn.sh

This is how we can configure Hadoop and YARN on a single node in standalone and pseudo-distributed modes. Moving forward, we will focus on fully-distributed mode. As the basic configuration remains the same, we only need to do some extra configuration for fully-distributed mode. Single-node setup is mainly used for development and debugging of distributed applications, while fully-distributed mode is used for the production setup.

The fully-distributed mode

In the previous section, we highlighted the standalone Hadoop and YARN configurations, and in this section we’ll focus on the fully-distributed mode setup. This section describes how to install, configure, and manage Hadoop and YARN in fully-distributed mode, on very large clusters with thousands of nodes.

In order to start with fully-distributed mode, we first need to download the stable version of Hadoop from Apache mirrors. Installing Hadoop in distributed mode generally means unpacking the software distribution on each machine in the cluster or installing it from Red Hat Package Manager (RPM) packages. As Hadoop follows a master-slave architecture, one machine in the cluster is designated as the NameNode (NN), one as the ResourceManager (RM), and the rest of the machines, DataNodes (DN) and NodeManagers (NM), will typically act as slaves.

After the successful unpacking of the software distribution on each cluster machine or RPM installation, you need to take care of a very important part of the Hadoop installation phase, Hadoop configuration.

Hadoop typically has two types of configuration: one is the read-only default configuration (core-default.xml, hdfs-default.xml, yarn-default.xml, and mapred-default.xml), while the other is the site-specific configuration (core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml). All these files are found under the $HADOOP_HOME/conf directory.

In addition to the preceding configuration files, the Hadoop-environment and YARN-environment specific settings are found in conf/hadoop-env.sh and conf/yarn-env.sh. For the Hadoop and YARN cluster configuration, you need to set up an environment in which the Hadoop daemons can execute. The Hadoop/YARN daemons are the NameNode/ResourceManager (masters) and the DataNode/NodeManager (slaves).

First, make sure that JAVA_HOME is correctly specified on each node.

Here are some important configuration parameters with respect to each daemon:

NameNode: HADOOP_NAMENODE_OPTS
DataNode: HADOOP_DATANODE_OPTS
SecondaryNameNode: HADOOP_SECONDARYNAMENODE_OPTS
ResourceManager: YARN_RESOURCEMANAGER_OPTS
NodeManager: YARN_NODEMANAGER_OPTS
WebAppProxy: YARN_PROXYSERVER_OPTS
MapReduce Job History Server: HADOOP_JOB_HISTORYSERVER_OPTS

For example, to run the NameNode in parallel GC mode, the following line should be added into hadoop-env.sh:

export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"

Here are some important configuration parameters with respect to the daemon and its configuration files.

Navigate to conf/core-site.xml and configure it as follows:

fs.defaultFS: The NameNode URI, hdfs://<hdfshost>:<hdfsport>

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://<hdfshostname>:<hdfsport></value>
  <description>It is a NameNode hostname</description>
</property>

io.file.buffer.size: 4096, the read and write buffer size of files.

This is the buffer size for I/O (read/write) operations on sequence files stored in disk files; that is, it determines how much data is buffered in I/O pipes before being transferred to other operations during read/write operations. It should be a multiple of the OS filesystem block size.

<property>
  <name>io.file.buffer.size</name>
  <value>4096</value>
  <description>read and write buffer size of files</description>
</property>

Now navigate to conf/hdfs-site.xml. Here is the configuration for the NameNode:

Parameter | Description
dfs.namenode.name.dir | The path on the local filesystem where the NameNode stores the namespace and transaction logs.
dfs.namenode.hosts | The list of permitted DataNodes.
dfs.namenode.hosts.exclude | The list of excluded DataNodes.
dfs.blocksize | A value of 268435456 sets the HDFS block size to 256 MB for large file systems.
dfs.namenode.handler.count | A value of 100 provides more NameNode server threads to handle RPCs from a large number of DataNodes.

The configuration for the DataNode is as follows:

Parameter | Description
dfs.datanode.data.dir | A comma-delimited list of paths on the local filesystems where the DataNode stores its blocks.

Now navigate to conf/yarn-site.xml. We’ll take a look at the configurations related to the ResourceManager and NodeManager:

Parameter | Description
yarn.acl.enable | Values are true or false to enable or disable ACLs. The default value is false.
yarn.admin.acl | This is the ACL that sets the admins on the cluster. The default is *, which means anyone can do admin tasks. This could be a comma-delimited list of users and groups to set more than one admin.
yarn.log-aggregation-enable | This is true or false to enable or disable log aggregation.

Now, we will take a look at the configurations for the ResourceManager in the conf/yarn-site.xml file:

Parameter | Description
yarn.resourcemanager.address | This is the ResourceManager host:port for clients to submit jobs.
yarn.resourcemanager.scheduler.address | This is the ResourceManager host:port for ApplicationMasters to talk to the Scheduler to obtain resources.
yarn.resourcemanager.resource-tracker.address | This is the ResourceManager host:port for NodeManagers.
yarn.resourcemanager.admin.address | This is the ResourceManager host:port for administrative commands.
yarn.resourcemanager.webapp.address | This is the ResourceManager web UI host:port.
yarn.resourcemanager.scheduler.class | This is the ResourceManager Scheduler class. The values are CapacityScheduler, FairScheduler, and FifoScheduler.
yarn.scheduler.minimum-allocation-mb | This is the minimum limit of memory to allocate to each container request in the ResourceManager.
yarn.scheduler.maximum-allocation-mb | This is the maximum limit of memory to allocate to each container request in the ResourceManager.
yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path | This is the list of permitted/excluded NodeManagers. If necessary, use these files to control the list of permitted NodeManagers.

Now take a look at the configurations for the NodeManager in conf/yarn-site.xml:

Parameter Description

yarn.nodemanager.resource.memory-mb This refers to the available physical memory (in MB) for the NodeManager. It defines the total memory on the NodeManager that is made available to running containers.

yarn.nodemanager.vmem-pmem-ratio This refers to the maximum ratio by which the virtual memory usage of tasks may exceed physical memory.

yarn.nodemanager.local-dirs This refers to the list of directory paths on the local filesystem where intermediate data is written. This should be a comma-separated list.

yarn.nodemanager.log-dirs This refers to the path on the local filesystem where logs are written.

yarn.nodemanager.log.retain-seconds This refers to the time (in seconds) to retain log files on the NodeManager. The default value is 10800 seconds. This configuration is applicable only if log aggregation is disabled.

yarn.nodemanager.remote-app-log-dir This is the HDFS directory path to which logs are moved after application completion. The default path is /tmp/logs. This configuration is applicable only if log aggregation is enabled.

yarn.nodemanager.remote-app-log-dir-suffix This refers to the suffix appended to the remote log directory. This configuration is applicable only if log aggregation is enabled.

yarn.nodemanager.aux-services This refers to the shuffle service that needs to be set for MapReduce applications.
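To make the yarn.nodemanager.vmem-pmem-ratio description more concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class name, method names, and the sample values (a 2048 MB container and a ratio of 2.1) are assumptions chosen for illustration; this is only the arithmetic implied by the parameter description, not the NodeManager's actual enforcement code.

```java
// Illustrative arithmetic only; this is not NodeManager code.
class VmemRatioSketch {

    // Upper bound on virtual memory (in MB) for a container that was
    // allocated the given amount of physical memory.
    static long virtualMemoryLimitMb(long containerPhysicalMb, double vmemPmemRatio) {
        return (long) (containerPhysicalMb * vmemPmemRatio);
    }

    // True if the observed virtual memory usage is above the limit
    // implied by the ratio.
    static boolean exceedsVirtualLimit(long usedVirtualMb, long containerPhysicalMb,
            double vmemPmemRatio) {
        return usedVirtualMb > virtualMemoryLimitMb(containerPhysicalMb, vmemPmemRatio);
    }

    public static void main(String[] args) {
        long containerMb = 2048; // assumed physical memory of one container
        double ratio = 2.1;      // assumed yarn.nodemanager.vmem-pmem-ratio

        System.out.println("Virtual memory limit (MB): "
                + virtualMemoryLimitMb(containerMb, ratio));
        System.out.println("5000 MB over the limit? "
                + exceedsVirtualLimit(5000, containerMb, ratio));
    }
}
```

With these assumed values, a container may use roughly twice its physical allocation in virtual memory before it is considered over the limit.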

HistoryServer

The HistoryServer gives all YARN applications a central location in which to aggregate their completed jobs for historical reference and debugging. The settings for the MapReduce JobHistoryServer can be found in the mapred-default.xml file:

mapreduce.jobhistory.address: This is the MapReduce JobHistoryServer host:port. The default port is 10020.
mapreduce.jobhistory.webapp.address: This is the MapReduce JobHistoryServer Web UI host:port. The default port is 19888.
mapreduce.jobhistory.intermediate-done-dir: This is the directory (in HDFS) where history files are written by MapReduce jobs. The default is /mr-history/tmp.
mapreduce.jobhistory.done-dir: This is the directory (in HDFS) where history files are managed by the MR JobHistoryServer. The default is /mr-history/done.

Slave files

With respect to the Hadoop and YARN slave nodes, one node in the cluster is generally chosen as the NameNode (Hadoop master), another node as the ResourceManager (YARN master), and the rest of the machines act as both Hadoop slave DataNodes and YARN slave NodeManagers. List all the slaves in your Hadoop conf/slaves file, one hostname or IP address per line.
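The one-entry-per-line layout of the slaves file can be illustrated with a short plain Java sketch. The hostnames and the IP address are hypothetical, and skipping blank lines is an assumption of this sketch rather than documented behavior of the Hadoop scripts that read the file.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative reader for the one-host-per-line layout; not part of Hadoop.
class SlavesFileSketch {

    // Returns the hostnames or IP addresses listed in the file contents,
    // one per line, with surrounding whitespace removed.
    static List<String> parseSlaves(String fileContents) {
        List<String> hosts = new ArrayList<>();
        for (String line : fileContents.split("\\r?\\n")) {
            String host = line.trim();
            if (!host.isEmpty()) {
                hosts.add(host);
            }
        }
        return hosts;
    }

    public static void main(String[] args) {
        // Hypothetical contents of a conf/slaves file.
        String contents = "slave1.example.com\nslave2.example.com\n192.168.1.12\n";
        for (String host : parseSlaves(contents)) {
            System.out.println("DataNode and NodeManager host: " + host);
        }
    }
}
```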

Operating Hadoop and YARN clusters

This is the final stage of Hadoop and YARN cluster setup and configuration. Here are the commands that need to be used to start and stop the Hadoop and YARN clusters.

Starting Hadoop and YARN clusters

To start the Hadoop and YARN clusters, use the following procedure:

1. Format a Hadoop distributed filesystem:

$HADOOP_HOME/bin/hdfs namenode -format <cluster_name>

2. The following command is used to start HDFS. Run it on the NameNode:

$HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode

3. Run this command to start DataNodes on all slave nodes:

$HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode

4. Start YARN with the following command on the ResourceManager:

$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager

5. Execute this command to start NodeManagers on all slaves:

$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager

6. Start a standalone WebAppProxy server. This is used for load-balancing purposes on a multiserver cluster:

$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start proxyserver

7. Execute this command on the designated HistoryServer:

$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver

Stopping Hadoop and YARN clusters

To stop the Hadoop and YARN clusters, use the following procedure:

1. Use the following command on the NameNode to stop it:

$HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop namenode

2. Issue this command on all the slave nodes to stop DataNodes:

$HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop datanode

3. To stop the ResourceManager, issue the following command on the specified ResourceManager:

$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop resourcemanager

4. The following command is used to stop the NodeManager on all slave nodes:

$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop nodemanager

5. Stop the WebAppProxy server:

$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop proxyserver

6. Stop the MapReduce JobHistoryServer by running the following command on the HistoryServer:

$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR stop historyserver

Web interfaces of the Ecosystem

That covers Hadoop and YARN setup, configuration, and operation. Here are some web interfaces used by Hadoop and YARN administrators for admin tasks:

The URL for the NameNode is http://<namenode_host>:<port>/ and the default HTTP port is 50070.

The URL for the ResourceManager is http://<resourcemanager_host>:<port>/ and the default HTTP port is 8088.

The Web UI for the NameNode can be seen as follows:

The URL for the MapReduce JobHistoryServer is http://<jobhistoryserver_host>:<port>/ and the default HTTP port is 19888.
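As a small convenience, the URL pattern and default ports listed in this section can be captured in a few lines of plain Java. This is only a sketch: the class name and the example hostnames are hypothetical, and a real cluster may override any of these ports in its configuration files.

```java
// Builds web UI URLs from a host and the default HTTP ports listed above.
class WebUiUrlSketch {

    static final int NAMENODE_HTTP_PORT = 50070;
    static final int RESOURCEMANAGER_HTTP_PORT = 8088;
    static final int JOBHISTORY_HTTP_PORT = 19888;

    // Follows the http://<host>:<port>/ pattern used in this section.
    static String url(String host, int port) {
        return "http://" + host + ":" + port + "/";
    }

    public static void main(String[] args) {
        // Hypothetical hostnames; substitute the hosts of your own cluster.
        System.out.println("NameNode:         " + url("nn.example.com", NAMENODE_HTTP_PORT));
        System.out.println("ResourceManager:  " + url("rm.example.com", RESOURCEMANAGER_HTTP_PORT));
        System.out.println("JobHistoryServer: " + url("jhs.example.com", JOBHISTORY_HTTP_PORT));
    }
}
```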

Summary

In this chapter, we covered Hadoop and YARN single-node and fully distributed cluster setup and important configurations. We also covered the basic but important commands to administer Hadoop and YARN clusters. In the next chapter, we’ll look at the Hadoop and YARN components in more detail.

Chapter 4. YARN and Hadoop Ecosystems

This chapter discusses YARN with respect to Hadoop, since it is important to know exactly where YARN fits in Hadoop 2.

Hadoop 2 has undergone a complete change in terms of architecture and components compared to Hadoop 1.

In this chapter, we will cover the following topics:

A short introduction to Hadoop 1
The difference between MRv1 and MRv2
Where YARN fits in Hadoop 2
Old and new MapReduce APIs
Backward compatibility of MRv2 APIs
Practical examples of MRv1 and MRv2

The Hadoop 2 release

YARN came into the picture with the release of Hadoop 0.23 on November 11, 2011. This was the alpha version of the Hadoop 0.23 major release.

The major difference between 0.23 and pre-0.23 releases is that the 0.23 release underwent a complete revamp of the MapReduce engine and resource management. The 0.23 release separated resource management from application lifecycle management.

A short introduction to Hadoop 1.x and MRv1

We will briefly look at the basic Apache Hadoop 1.x and its processing framework, MRv1 (Classic), so that we can get a clear picture of the differences in Apache Hadoop 2.x MRv2 (YARN) in terms of architecture, components, and processing framework.

Apache Hadoop is a scalable, fault-tolerant distributed system for data storage and processing. The core programming model in Hadoop is MapReduce.

Since 2004, Hadoop has emerged as the de facto standard to store, process, and analyze hundreds of terabytes and even petabytes of data.

The major components in Hadoop 1.x are as follows:

NameNode: This keeps the metadata in the main memory.
DataNode: This is where the data resides in the form of blocks.
JobTracker: This assigns/reassigns MapReduce tasks to TaskTrackers in the cluster and tracks the status of each TaskTracker.
TaskTracker: This executes the task assigned by the JobTracker and sends the status of the task to the JobTracker.

The major components of Hadoop 1.x can be seen as follows:

A typical Hadoop 1.x cluster (shown in the preceding figure) can consist of thousands of nodes. It follows the master/slave pattern, where the NameNode and JobTracker are the masters and the DataNodes and TaskTrackers are the slaves.

The main data processing is distributed across the DataNodes in the cluster to increase parallel processing.

The master NameNode process (master for the slave DataNodes) manages the filesystem, and the master JobTracker process (master for the slave TaskTrackers) manages the tasks. The topology is seen as follows:

A Hadoop cluster can be considered to be mainly made up of two distinguishable parts:

HDFS: This is the underlying storage layer that acts as a filesystem for distributed data storage. You can put data of any format, schema, and type on it, such as structured, semi-structured, or unstructured data. This flexibility makes Hadoop fit for the data lake, which is sometimes called the bit bucket or the landing zone.
MapReduce: This is the execution layer and, in Hadoop 1.x, the only distributed data-processing framework.

Tip

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

MRv1 versus MRv2

MRv1 (MapReduce version 1) is part of Apache Hadoop 1.x and is an implementation of the MapReduce programming paradigm.

The MapReduce project itself can be broken into the following parts:

End-user MapReduce API: This is the API needed to develop the MapReduce application.
MapReduce framework: This is the runtime implementation of various phases, such as the map phase, the sort/shuffle/merge aggregation phase, and the reduce phase.
MapReduce system: This is the backend infrastructure required to run MapReduce applications and includes things such as cluster resource management, scheduling of jobs, and so on.

Hadoop 1.x was written solely as an MR engine. Since it ran on a cluster, its cluster management component was also tightly coupled with the MR programming paradigm. The only thing that could be run on Hadoop 1.x was an MR job.

In MRv1, the cluster was managed by a single JobTracker and multiple TaskTrackers running on the DataNodes.

In Hadoop 2.x, the old MRv1 framework was rewritten to run on top of YARN. This application was named MRv2, or MapReduce version 2. It is the familiar MapReduce execution underneath, except that each job now runs on YARN.

The core difference between MRv1 and MRv2 is the way the MapReduce jobs are executed.

With Hadoop 1.x, it was the JobTracker and TaskTrackers, but now with YARN on Hadoop 2.x, it’s the ResourceManager, ApplicationMaster, and NodeManagers.

However, the underlying concept, the MapReduce framework, remains the same.

Hadoop 2 has been redefined from HDFS-plus-MapReduce to HDFS-plus-YARN.

Referring to the following figure, YARN took control of the resource management and application lifecycle part of Hadoop 1.x.

YARN therefore increases the return on a Hadoop investment, in the sense that the same Hadoop 2.x cluster resources can now be used for multiple workloads, such as batch processing, real-time processing, SQL applications, and so on.

Earlier, running this variety of applications was not possible, and people had to use a separate Hadoop cluster for MapReduce and a separate one to do something else.

Understanding where YARN fits into Hadoop

If we refer to Hadoop 1.x in the first figure of this chapter, then it is clear that the responsibilities of the JobTracker mainly included the following:

Managing the computational resources in terms of map and reduce slots
Scheduling submitted jobs
Monitoring the executions of the TaskTrackers
Restarting failed tasks
Performing a speculative execution of tasks
Calculating the Job Counters

Clearly, the JobTracker alone handled many responsibilities and was overloaded with work.

This overloading led to the redesign of the JobTracker, and YARN reduced its responsibilities in the following ways:

Cluster resource management and scheduling responsibilities were moved to the global ResourceManager (RM)
The application lifecycle management, that is, job execution and monitoring, was moved into a per-application ApplicationMaster (AM)

The global ResourceManager is seen in the following image:

If you look at the preceding figure, you will clearly see the disappearance of the single centralized JobTracker; its place is taken by a global ResourceManager.

Also, for each job, a tiny, dedicated JobTracker is created, which monitors the tasks specific to its job. This tiny JobTracker runs on a slave node.

This tiny, dedicated JobTracker is termed an ApplicationMaster in the new framework (refer to the following figure).

Also, the TaskTrackers are referred to as NodeManagers in the new framework.

Finally, looking at the JobTracker redesign (in the following figure), we can clearly see that the JobTracker’s responsibilities are broken into a per-cluster ResourceManager and a per-application ApplicationMaster:

The ResourceManager topology can be seen as follows:

Old and new MapReduce APIs

The new API (which is also known as Context Objects) was primarily designed to make the API easier to evolve in the future and is type incompatible with the old one.

The new API was introduced in the 1.x release series, but it was only partially supported there, so the old API is recommended for the 1.x series:

Feature\Release          1.x      0.23
Old MapReduce API        Yes      Deprecated
New MapReduce API        Partial  Yes
MRv1 runtime (Classic)   Yes      No
MRv2 runtime (YARN)      No       Yes

The old and new APIs can be compared as follows:

Old API New API

The old API is in the org.apache.hadoop.mapred package and is still present.
The new API is in the org.apache.hadoop.mapreduce package.

The old API used interfaces for Mapper and Reducer.
The new API uses abstract classes for Mapper and Reducer.

The old API used the JobConf, OutputCollector, and Reporter objects to communicate with the MapReduce system.
The new API uses the context object to communicate with the MapReduce system.

In the old API, job control was done through the JobClient.
In the new API, job control is performed through the Job class.

In the old API, job configuration was done with a JobConf object.
In the new API, job configuration is done through the Configuration class, via some of the helper methods on Job.

In the old API, both the map and reduce outputs are named part-nnnnn.
In the new API, the map outputs are named part-m-nnnnn and the reduce outputs are named part-r-nnnnn.

In the old API, the reduce() method passes values as a java.util.Iterator.
In the new API, the reduce() method passes values as a java.lang.Iterable.

The old API controls mappers by writing a MapRunnable, but no equivalent exists for reducers.
The new API allows both mappers and reducers to control the execution flow by overriding the run() method.
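The difference between receiving values as an Iterator and as an Iterable is easy to see in plain Java, without any Hadoop classes. The following is a minimal sketch; the class and method names are hypothetical, and plain strings stand in for the Text values that a real reducer would receive.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Contrasts the two iteration styles used by the old and new reduce() methods.
class ValueIterationSketch {

    // Old-API style: the values arrive as an Iterator and are drained
    // explicitly with hasNext() and next().
    static String joinWithIterator(Iterator<String> values) {
        StringBuilder joined = new StringBuilder();
        while (values.hasNext()) {
            joined.append(values.next()).append(',');
        }
        return joined.toString();
    }

    // New-API style: the values arrive as an Iterable, so the enhanced
    // for loop can be used directly.
    static String joinWithIterable(Iterable<String> values) {
        StringBuilder joined = new StringBuilder();
        for (String value : values) {
            joined.append(value).append(',');
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("stop", "tops", "opts");
        System.out.println(joinWithIterator(words.iterator()));
        System.out.println(joinWithIterable(words));
    }
}
```

Both methods build the same string; only the loop shape changes, which is the same change that appears between the two reducer implementations in the practical examples.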

Backward compatibility of MRv2 APIs

This section discusses the scope and level of backward compatibility supported in Apache Hadoop MapReduce 2.x (MRv2).

Binary compatibility of org.apache.hadoop.mapred APIs

Binary compatibility here means that the compiled binaries should be able to run without any modification on the new framework.

Hadoop 1.x users who use the org.apache.hadoop.mapred APIs can run their MapReduce jobs on YARN simply by pointing them to their Apache Hadoop 2.x cluster via the configuration settings.

They will not need any recompilation. All they need to do is point their application to the YARN installation and point HADOOP_CONF_DIR to the corresponding configuration directory. The yarn-site.xml (configuration for YARN) and mapred-site.xml (configuration for MapReduce apps) files are present in the conf directory.

Also, mapred.job.tracker in mapred-site.xml is no longer necessary in Apache Hadoop 2.x. Instead, the following property needs to be added to the mapred-site.xml file to make MRv1 applications run on top of YARN:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

Source compatibility of org.apache.hadoop.mapred APIs

Source incompatibility means that some code changes are required for compilation. Source incompatibility is orthogonal to binary compatibility.

Binaries for an application that is binary compatible but not source compatible will continue to run fine on the new framework. However, code changes are required to regenerate these binaries.

Apache Hadoop 2.x does not ensure complete binary compatibility with applications that use the org.apache.hadoop.mapreduce APIs, as these APIs have evolved a lot since MRv1. However, it ensures source compatibility for the org.apache.hadoop.mapreduce APIs that break binary compatibility. In other words, you should recompile the applications that use the MapReduce APIs against the MRv2 JARs.

Existing applications that use the MapReduce APIs are source compatible and can run on YARN with no changes, with recompilation only, or with minor updates.

If an MRv1 MapReduce-based application fails to run on YARN, inspect its source code and check whether it refers to the MapReduce APIs. If it does, recompile the application against the MRv2 JARs that are shipped with Hadoop 2.

Practical examples of MRv1 and MRv2

We will now present a MapReduce example using both the old and new MapReduce APIs.

We will write a MapReduce program in Java that finds all the anagrams present in an input file and writes them to the output file. (An anagram is a word, phrase, or name formed by rearranging the letters of another, such as cinema, formed from iceman.)
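Before turning to the Hadoop classes, it may help to see the core idea in plain Java: every word is keyed by its letters in sorted order, so all anagrams of a word share one key. The sketch below is not part of the book's example code; the class and method names are hypothetical, and it omits the minimum word length that the reducers in this chapter apply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A single-process stand-in for the map and reduce steps of the anagram job.
class AnagramSketch {

    // "Map" step: the key for a word is its lowercase letters in sorted order.
    static String sortedKey(String word) {
        char[] letters = word.toLowerCase().toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    // "Reduce" step: group words by key and keep only the groups that
    // contain at least two words.
    static Map<String, List<String>> groupAnagrams(List<String> words) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String word : words) {
            groups.computeIfAbsent(sortedKey(word), key -> new ArrayList<>())
                    .add(word.toLowerCase());
        }
        groups.values().removeIf(group -> group.size() < 2);
        return groups;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Cinema", "iceman", "stool", "tools", "hello");
        System.out.println(groupAnagrams(words));
    }
}
```

In the MapReduce versions, the framework performs the grouping by key between the map and reduce phases, so the mapper only emits (sorted key, word) pairs and the reducer only filters and formats each group.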

Here is the AnagramMapperOldAPI.java class that uses the old MapReduce API:

import java.io.IOException;
import java.util.Arrays;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/**
 * The Anagram mapper class gets a line from the HDFS input, sorts the
 * letters in each word, and writes the result to the output collector as
 * Key: sorted word (letters in the word sorted)
 * Value: the word itself.
 * When the reducer runs, we can group anagrams together based on the
 * sorted key.
 */
public class AnagramMapperOldAPI extends MapReduceBase implements
        Mapper<Object, Text, Text, Text> {

    private Text sortedText = new Text();
    private Text originalText = new Text();

    @Override
    public void map(Object keyNotUsed, Text value,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Normalize the line: trim, lowercase, and drop commas.
        String line = value.toString().trim().toLowerCase().replace(",", "");
        System.out.println("LINE: " + line);
        // StringTokenizer splits on whitespace by default.
        StringTokenizer st = new StringTokenizer(line);
        System.out.println("---- Split by space ------");
        while (st.hasMoreTokens()) {
            String word = st.nextToken();
            // Sorting the letters gives the same key for all anagrams of a word.
            char[] wordChars = word.toCharArray();
            Arrays.sort(wordChars);
            String sortedWord = new String(wordChars);
            sortedText.set(sortedWord);
            originalText.set(word);
            System.out.println("\torig: " + word + "\tsorted: " + sortedWord);
            output.collect(sortedText, originalText);
        }
    }
}

Here is the AnagramReducerOldAPI.java class that uses the old MapReduce API:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class AnagramReducerOldAPI extends MapReduceBase implements
        Reducer<Text, Text, Text, Text> {

    private Text outputKey = new Text();
    private Text outputValue = new Text();

    @Override
    public void reduce(Text anagramKey, Iterator<Text> anagramValues,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String out = "";
        // Considering words with length > 2
        if (anagramKey.toString().length() > 2) {
            System.out.println("Reducer Key: " + anagramKey);
            // Join all words for this key, using ~ as a temporary separator.
            while (anagramValues.hasNext()) {
                out = out + anagramValues.next() + "~";
            }
            StringTokenizer outputTokenizer = new StringTokenizer(out, "~");
            // Emit only keys that have at least two words.
            if (outputTokenizer.countTokens() >= 2) {
                out = out.replace("~", ",");
                outputKey.set(anagramKey.toString() + "-->");
                outputValue.set(out);
                System.out.println("************ Writing reducer output: "
                        + anagramKey.toString() + "-->" + out);
                output.collect(outputKey, outputValue);
            }
        }
    }
}

Finally, to run the MapReduce program, we have the AnagramJobOldAPI.java class written using the old MapReduce API:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class AnagramJobOldAPI {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: Anagram <input path> <output path>");
            System.exit(-1);
        }
        // In the old API, JobConf carries both the configuration and the job setup.
        JobConf conf = new JobConf(AnagramJobOldAPI.class);
        conf.setJobName("AnagramJobOldAPI");
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapperClass(AnagramMapperOldAPI.class);
        conf.setReducerClass(AnagramReducerOldAPI.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        // Submits the job and waits for it to complete.
        JobClient.runJob(conf);
    }
}

Next, we will write the same Mapper, Reducer, and Job classes using the new MapReduce API.

Here is the AnagramMapper.java class that uses the new MapReduce API:

import java.io.IOException;
import java.util.Arrays;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AnagramMapper extends Mapper<Object, Text, Text, Text> {

    private Text sortedText = new Text();
    private Text originalText = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Normalize the line: trim, lowercase, and drop commas.
        String line = value.toString().trim().toLowerCase().replace(",", "");
        System.out.println("LINE: " + line);
        // StringTokenizer splits on whitespace by default.
        StringTokenizer st = new StringTokenizer(line);
        System.out.println("---- Split by space ------");
        while (st.hasMoreTokens()) {
            String word = st.nextToken();
            // Sorting the letters gives the same key for all anagrams of a word.
            char[] wordChars = word.toCharArray();
            Arrays.sort(wordChars);
            String sortedWord = new String(wordChars);
            sortedText.set(sortedWord);
            originalText.set(word);
            System.out.println("\torig: " + word + "\tsorted: " + sortedWord);
            // The Context object replaces OutputCollector and Reporter.
            context.write(sortedText, originalText);
        }
    }
}

Here is the AnagramReducer.java class that uses the new MapReduce API:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AnagramReducer extends Reducer<Text, Text, Text, Text> {

    private Text outputKey = new Text();
    private Text outputValue = new Text();

    @Override
    public void reduce(Text anagramKey, Iterable<Text> anagramValues,
            Context context) throws IOException, InterruptedException {
        String out = "";
        // Considering words with length > 2
        if (anagramKey.toString().length() > 2) {
            System.out.println("Reducer Key: " + anagramKey);
            // Values arrive as an Iterable, so a for-each loop can be used.
            for (Text anagram : anagramValues) {
                out = out + anagram.toString() + "~";
            }
            StringTokenizer outputTokenizer = new StringTokenizer(out, "~");
            // Emit only keys that have at least two words.
            if (outputTokenizer.countTokens() >= 2) {
                out = out.replace("~", ",");
                outputKey.set(anagramKey.toString() + "-->");
                outputValue.set(out);
                System.out.println("****** Writing reducer output: "
                        + anagramKey.toString() + "-->" + out);
                context.write(outputKey, outputValue);
            }
        }
    }
}

Finally, here is the AnagramJob.java class that uses the new MapReduce API:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AnagramJob {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: Anagram <input path> <output path>");
            System.exit(-1);
        }
        // Job.getInstance() replaces the deprecated new Job() constructor in Hadoop 2.x.
        Job job = Job.getInstance();
        job.setJarByClass(AnagramJob.class);
        job.setJobName("AnagramJob");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(AnagramMapper.class);
        job.setReducerClass(AnagramReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Submits the job, waits for completion, and exits with 0 on success.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Preparing the input file(s)

1. Create a ${Inputfile_1} file with the following contents:

The Project Gutenberg Etext of Moby Word II by Grady Ward
hello there draw ehllo lemons melons solemn
Also, bluest bluets bustle sublet subtle

2. Create another file, ${Inputfile_2}, with the following contents:

Cinema is anagram to iceman
Second is stop, tops, opts, pots, and spot
Stool and tools
Secure and rescue

3. Copy these files into ${path_to_your_input_dir}.

Running the job

Run the AnagramJobOldAPI.java class and pass the following as command-line args:

${path_to_your_input_dir}
${path_to_your_output_dir_old}

Now, run the AnagramJob.java class and pass the following as command-line args:

${path_to_your_input_dir}
${path_to_your_output_dir_new}

Result

The final output is written to ${path_to_your_output_dir_old} and ${path_to_your_output_dir_new}.

These are the contents that we will see in the output file:

aceimn--> cinema,iceman,
adn--> and,and,and,
adrw--> ward,draw,
belstu--> subtle,bustle,bluets,bluest,sublet,
ceersu--> rescue,secure,
ehllo--> hello,ehllo,
elmnos--> lemons,melons,solemn,
loost--> stool,tools,
opst--> pots,tops,stop,spot,opts,

Summary

In this chapter, we started with a brief history of Hadoop releases. Next, we covered the basics of Hadoop 1.x and MRv1. We then looked at the core differences between MRv1 and MRv2 and how YARN fits into a Hadoop environment. We also saw how the JobTracker’s responsibilities were broken down in Hadoop 2.x.

We also talked about the old and new MapReduce APIs, their origin, differences, and support in YARN. Finally, we concluded the chapter with some practical examples using the old and new MapReduce APIs.

In the next chapter, you will learn about the administration part of YARN.


Chapter 5. YARN Administration

In this chapter, we will focus on the administrative side of YARN and on the administrator’s roles and responsibilities. We will also gain a more detailed insight into the administration configuration settings and parameters, application container monitoring, and optimized resource allocations, as well as scheduling and multitenancy application support in YARN. We’ll also cover the basic administration tools and configuration options of YARN.

The following topics will be covered in this chapter:

YARN container allocation and configurations
Scheduling policies
YARN multitenancy application support
YARN administration and tools


Container allocation

At a very fundamental level, a container is a group of physical resources such as memory, disk, network, CPU, and so on. There can be one or more containers on a single machine; for example, if a machine has 16 GB of RAM and an 8-core processor, then a single container could be 1 CPU core and 2 GB of RAM. This means that there are a total of 8 containers on a single machine, or there could be a single large container that occupies all of the resources. So, a container is a physical notion of memory, CPU, network, disk, and so on in the cluster. The container’s life cycle is managed by the NodeManager, and the scheduling is done by the ResourceManager. The container allocation can be seen as follows:

YARN is designed to allocate resource containers to individual applications in a shared, secure, and multitenant manner. When any job or task is submitted to the YARN framework, the ResourceManager takes care of the resource allocations to the application, depending on the scheduling configurations and on the application’s needs and requirements, which are communicated via the ApplicationMaster. To achieve this goal, the central scheduler maintains metadata about all the applications’ resource requirements; this leads to efficient scheduling decisions for all the applications that run on the cluster.
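The per-node arithmetic from the container example (16 GB of RAM, 8 cores, containers of 2 GB and 1 core) can be checked with a small sketch. The function name and its parameters are hypothetical and only model equally sized containers; this is not a YARN API.

```python
def containers_per_node(node_mem_gb, node_cores, container_mem_gb, container_cores):
    """Return how many equally sized containers fit on one node (toy model)."""
    by_memory = node_mem_gb // container_mem_gb
    by_cpu = node_cores // container_cores
    # The scarcer of the two resources bounds the container count.
    return min(by_memory, by_cpu)


# Small containers: 2 GB and 1 core each on a 16 GB, 8-core machine.
small = containers_per_node(16, 8, 2, 1)
# One large container that occupies the whole machine.
large = containers_per_node(16, 8, 16, 8)
print(small, large)
```

In this simplified model, whichever resource runs out first decides how many containers the node can host.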

Let’s take a look at how container allocation happens in a traditional Hadoop setup. In the traditional Hadoop approach, each node has a predefined and fixed number of map slots and a predefined and fixed number of reduce slots. The map and reduce functions are unable to share slots, as they are predefined for specific operations only. This static allocation is not efficient. For example, suppose a cluster has a fixed total of 32 map slots and 32 reduce slots, and a MapReduce application running on it takes only 16 map slots but requires more than 32 slots for reduce operations. The reducers are unable to use the 16 free map slots, as these are reserved for mapper functionality only, so the reduce function has to wait until some reduce slots become free.

To overcome this problem, YARN has container slots. Irrespective of the application, all containers are able to run all applications. For example, if YARN has 64 available containers in the cluster and is running the same MapReduce application, and the mapper function takes only 16 slots while the reducer requires more resource slots, then all the other free resources in the cluster are allocated to the reducer operation. This makes the operation more efficient and productive.
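The difference between fixed slots and generic containers can be illustrated with a toy calculation. This is a sketch under simplifying assumptions: map and reduce tasks are treated as competing for resources at the same moment, the figure of 40 reduce tasks is a hypothetical stand-in for "more than 32", and the function names are made up for illustration.

```python
def run_with_slots(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """Hadoop 1.x style: map and reduce tasks can only use their own slot type."""
    running_maps = min(map_tasks, map_slots)
    running_reduces = min(reduce_tasks, reduce_slots)
    return {
        "running": running_maps + running_reduces,
        "waiting": (map_tasks - running_maps) + (reduce_tasks - running_reduces),
        "idle": (map_slots - running_maps) + (reduce_slots - running_reduces),
    }


def run_with_containers(containers, map_tasks, reduce_tasks):
    """YARN style: any free container can host any task."""
    demand = map_tasks + reduce_tasks
    running = min(demand, containers)
    return {"running": running, "waiting": demand - running, "idle": containers - running}


slots = run_with_slots(32, 32, map_tasks=16, reduce_tasks=40)
containers = run_with_containers(64, map_tasks=16, reduce_tasks=40)
print(slots)       # reduce tasks wait while map slots sit idle
print(containers)  # the same capacity serves every task
```

With the same total capacity of 64, the slot model leaves tasks waiting next to idle map slots, while the container model runs everything.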

Essentially, an application demands the required resources from the ResourceManager via the ApplicationMaster. The ResourceManager then responds to the application’s ResourceRequest by allocating the requested resources. The ResourceRequest contains the name of the resource that has been requested; the priority of the request relative to the other ResourceRequests of the same application; the resource requirement capabilities, such as RAM, disk, CPU, network, and so on; and the number of resources. Container allocation from the ResourceManager to the application means the successful fulfillment of the specific ResourceRequest.
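The four parts of a ResourceRequest listed above can be pictured as a small record. The classes below are hypothetical Python stand-ins written for illustration; they are not the YARN Java classes, and the capability is reduced to memory and virtual cores as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """Resources needed by one container (assumed here: memory and vcores only)."""
    memory_mb: int
    vcores: int


@dataclass(frozen=True)
class ToyResourceRequest:
    """Illustrative model of the fields described in the text."""
    resource_name: str      # where the resource is wanted, e.g. a host name
    priority: int           # priority among this application's own requests
    capability: Capability  # size of each requested container
    num_containers: int     # how many such containers are requested

    def total_memory_mb(self):
        return self.capability.memory_mb * self.num_containers


request = ToyResourceRequest("node-01.example.com", priority=1,
                             capability=Capability(memory_mb=2048, vcores=1),
                             num_containers=4)
print(request.total_memory_mb())
```

The host name and the sizes used here are made-up example values.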


Container allocation to the application

Now, take a look at the following sequence diagram:

The diagram shows how container allocation is done for applications via the ApplicationMaster. It can be explained as follows:

1. The client submits the application request to the ResourceManager.

2. The ResourceManager registers the application with the ApplicationManager, generates the ApplicationID, and responds to the client with the successfully registered ApplicationID.

3. Then, the ResourceManager starts the client’s ApplicationMaster in a separate available container. If no container is available, this request has to wait until a suitable container is found. Once started, the ApplicationMaster sends its registration request to the ResourceManager.

4. The ResourceManager shares all the minimum and maximum resource capabilities of the cluster with the ApplicationMaster. Then, the ApplicationMaster decides how to efficiently use the available resources to fulfill the application’s needs.

5. Depending on the resource capabilities shared by the ResourceManager, the ApplicationMaster requests that the ResourceManager allocate a number of containers on behalf of the application.

6. The ResourceManager responds to the ResourceRequest from the ApplicationMaster as per the scheduling policies and resource availability. Container allocation by the ResourceManager means the successful fulfillment of the ResourceRequest from the ApplicationMaster.

While running the job, the ApplicationMaster sends the heartbeat and job progress information of the application to the ResourceManager. During the runtime of the application, the ApplicationMaster may request the release of containers or the allocation of more containers from the ResourceManager. When the job finishes, the ApplicationMaster sends a container de-allocation request to the ResourceManager and exits, freeing the container it was running in.
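The allocation steps and the final release can be walked through with a toy model. Everything below is a hypothetical sketch: the class, its method names, and the ID format are invented for illustration, and where the text says a request waits for a free container, this model simply raises an error.

```python
class ToyResourceManager:
    """Minimal stand-in that tracks free containers per cluster (not a YARN API)."""

    def __init__(self, total_containers):
        self.free = total_containers
        self.apps = {}       # application ID -> number of containers held
        self._next_id = 1

    def submit_application(self):
        """Steps 1-3: register the application and start its ApplicationMaster."""
        if self.free < 1:
            raise RuntimeError("no container available for the ApplicationMaster yet")
        app_id = f"application_{self._next_id:04d}"
        self._next_id += 1
        self.free -= 1               # one container hosts the ApplicationMaster
        self.apps[app_id] = 1
        return app_id

    def allocate(self, app_id, requested):
        """Steps 5-6: grant as many containers as are currently available."""
        granted = min(requested, self.free)
        self.free -= granted
        self.apps[app_id] += granted
        return granted

    def finish(self, app_id):
        """Job end: every container of the application is released."""
        self.free += self.apps.pop(app_id)


rm = ToyResourceManager(total_containers=4)
app = rm.submit_application()
granted = rm.allocate(app, requested=5)
print(app, granted, rm.free)
rm.finish(app)
print(rm.free)
```

Only part of the request is granted here because the toy cluster has just three containers left after the ApplicationMaster starts.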


Container configurations

Here are some important configurations that are used to control resource containers.

To control the memory allocation to a container, the administrator needs to set the following three parameters in the yarn-site.xml configuration file:

Parameter    Description

yarn.nodemanager.resource.memory-mb    This is the amount of memory in MB that the NodeManager can use for the containers.

yarn.scheduler.minimum-allocation-mb    This is the smallest amount of memory in MB allocated to the container by the ResourceManager. The default value is 1024 MB.

yarn.scheduler.maximum-allocation-mb    This is the largest amount of memory in MB allocated to the container by the ResourceManager. The default value is 8192 MB.
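How the minimum and maximum allocation settings bound a container request can be sketched as follows. This is a simplified model that assumes requests are rounded up to a multiple of the minimum allocation and rejected above the maximum; the exact behavior depends on the scheduler and the Hadoop version, and the function name is hypothetical.

```python
import math


def normalize_memory_mb(requested_mb, min_mb=1024, max_mb=8192):
    """Toy model of fitting a memory request between the scheduler limits."""
    if requested_mb > max_mb:
        raise ValueError(f"{requested_mb} MB exceeds the maximum allocation of {max_mb} MB")
    # Round up to the next multiple of the minimum allocation (assumption).
    rounded = math.ceil(max(requested_mb, 1) / min_mb) * min_mb
    # Never go below the minimum or above the maximum.
    return min(max(rounded, min_mb), max_mb)


for request in (100, 1500, 8192):
    print(request, "->", normalize_memory_mb(request))
```

The defaults of 1024 MB and 8192 MB are the values given in the table above.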

The CPU core allocations to the container are controlled by setting the following properties in the yarn-site.xml configuration file:

Parameter    Description

yarn.scheduler.minimum-allocation-vcores    This is the minimum number of CPU cores that are allocated to the container.

yarn.scheduler.maximum-allocation-vcores    This is the maximum number of CPU cores that are allocated to the container.

yarn.nodemanager.resource.cpu-vcores    This is the number of CPU cores on the node that can be allocated to containers.


YARN scheduling policies

The YARN architecture has pluggable scheduling policies, which are chosen according to the application’s requirements and the use case defined for the running application. You can find the YARN scheduling configurations in the yarn-site.xml file. Here, you can specify the scheduling system as either FIFO, capacity, or fair scheduling as per the application’s needs. You can also find the running application scheduling information in the ResourceManager UI. Many components of the scheduling system are defined briefly there.

As already mentioned, there are three types of scheduling policies that the YARN scheduler follows:

FIFO scheduler
Capacity scheduler
Fair scheduler


The FIFO (First In First Out) scheduler

This is the scheduling policy that has been in the system since Hadoop 1.0, where the JobTracker used FIFO scheduling. As the name indicates, FIFO means First In First Out, that is, the job submitted first will execute first. The FIFO scheduler policy does not follow any application priorities; this policy might work efficiently for smaller jobs, but while executing larger jobs, FIFO works very inefficiently. So for heavily loaded clusters, this policy is not recommended. The FIFO scheduler can be seen as follows:

The FIFO (First In First Out) scheduler

Here is the configuration property for the FIFO scheduler. By specifying this in yarn-site.xml, you can enable the FIFO scheduling policy in your YARN cluster:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler</value>
</property>
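Why FIFO works poorly when large jobs are present can be seen in a toy timeline. The sketch assumes that each job occupies the whole cluster until it finishes, and the job names and durations are made-up values; the function is not part of YARN.

```python
def fifo_schedule(jobs):
    """jobs: list of (name, duration) in submission order.

    Returns a dict mapping each job name to its (start, finish) time,
    assuming one job runs at a time in strict submission order.
    """
    clock = 0
    timeline = {}
    for name, duration in jobs:
        timeline[name] = (clock, clock + duration)
        clock += duration
    return timeline


jobs = [("large_etl", 120), ("small_report", 2), ("small_query", 1)]
for name, (start, finish) in fifo_schedule(jobs).items():
    print(f"{name}: waits {start} min, finishes at {finish} min")
```

The two short jobs need only a few minutes of work, yet both wait for the large job ahead of them to finish.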


The capacity scheduler

The capacity scheduling policy is one of the best-known pluggable scheduler policies; it allows multiple applications or user groups to share the Hadoop cluster resources in a secure way. Nowadays, this scheduling policy runs successfully and efficiently on many of the largest Hadoop production clusters.

The capacity scheduling policy allows users or user groups to share cluster resources in such a way that each user or group of users is guaranteed a certain capacity of the cluster. To enable this policy, the cluster administrator configures one or more queues with some precalculated shares of the total cluster resource capacity; this assignment guarantees the minimum resource capacity allocation to each queue. The administrator can also configure the maximum and minimum constraints on the use of cluster resources (capacity) on each queue. Each queue has its own Access Control List (ACL) policies that manage which users have permission to submit applications to which queues. ACLs also manage the read and modify permissions at the queue level so that users cannot view or modify the applications submitted by other users.
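The idea of precalculated queue shares can be illustrated with a short calculation. The queue names, percentages, and cluster size below are hypothetical, and the function is only a sketch of the arithmetic, not the capacity scheduler's implementation.

```python
def guaranteed_capacity(cluster_memory_mb, queue_percentages):
    """Split the cluster memory among queues according to their percentage shares."""
    total = sum(queue_percentages.values())
    if total != 100:
        raise ValueError(f"queue shares must add up to 100, got {total}")
    return {queue: cluster_memory_mb * pct // 100
            for queue, pct in queue_percentages.items()}


shares = guaranteed_capacity(102400, {"prod": 60, "dev": 30, "adhoc": 10})
print(shares)
```

Each queue is guaranteed its share even when the other queues are busy, which is what gives every user group a predictable minimum.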

Capacity scheduler configurations

Capacity scheduler configurations come with Hadoop YARN by default. Sometimes, it is necessary to configure the policy in YARN configuration files. Here are the configuration properties that need to be specified in yarn-site.xml to enable the capacity scheduler policy:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

The capacity scheduler, by default, comes with its own configuration file named $HADOOP_CONF_DIR/capacity-scheduler.xml; this should be present in the classpath so that the ResourceManager is able to locate it and load its properties accordingly.


The fair scheduler

The fair scheduler is one of the best-known pluggable schedulers for large clusters. It enables memory-intensive applications to share cluster resources in a very efficient way. Fair scheduling is a policy that allocates resources to applications in such a way that all applications get, on average, an equal share of the cluster resources over a given period.

In a fair scheduling policy, if one application is running on the cluster, it might request all the cluster resources for its execution, if needed. If other applications are submitted, the policy can distribute the free resources among the applications in such a way that each application gets a fairly equal share of the cluster resources. The fair scheduler also supports preemption, where the ResourceManager might request resource containers back from the ApplicationMaster, depending on the job configurations. It might be a healthy or an unhealthy preemption.

In this scheduling model, every application is part of a queue, so resources are assigned to the queue. By default, all users share a single queue named “default”. The fair scheduler supports many features at the queue level, such as assigning a weight to the queue (a heavyweight queue would get a higher number of resources than lightweight queues), minimum and maximum shares that the queue would get, and a FIFO policy within the queue.
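The effect of queue weights can be sketched as a proportional split. This is a minimal illustration that ignores minimum and maximum shares and preemption; the queue names, weights, and function name are assumptions made for the example.

```python
def weighted_fair_shares(total_resources, queue_weights):
    """Divide resources among queues in proportion to their weights."""
    total_weight = sum(queue_weights.values())
    return {queue: total_resources * weight / total_weight
            for queue, weight in queue_weights.items()}


# 64 containers shared by a heavyweight and a lightweight queue.
print(weighted_fair_shares(64, {"heavy": 3.0, "light": 1.0}))
# With equal weights, every queue gets the same share.
print(weighted_fair_shares(64, {"a": 1.0, "b": 1.0}))
```

A queue with three times the weight receives three times the resources in this model.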

While submitting the application, users might specify the name of the queue the application wants to use resources from. For example, if the application requires a higher number of resources, it can specify the heavyweight queue so that it can get all the required resources that are available there.

The advantage of using the fair scheduling policy is that every queue would get a minimum share of the cluster resources. It is very important to note that when a queue contains applications that are waiting for resources, they would get the minimum resource share. On the other hand, if the queue’s resources are more than enough for the application, then the excess amount would be distributed equally among the running applications.

Fair scheduler configurations

To enable the fair scheduling policy in your YARN cluster, you need to specify the following property in the yarn-site.xml file:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>

The fair scheduler also has a specific configuration file for a more detailed configuration setup; you will find it at $HADOOP_CONF_DIR/fair-scheduler.xml.


YARN multitenancy application support

YARN comes with built-in multitenancy support. Now, let’s have a look at what multitenancy means. Consider a housing society that has multiple apartments in it: different families live in different apartments with security and privacy, but they all share the society’s common areas, such as the society gate, garden, play area, and other amenities. Their apartments also share common walls. The same concept is followed in YARN: the applications running on the cluster share the cluster resources in a multitenant way. They share cluster processing capacity, cluster storage capacity, data access securities, and so on. Multitenancy is achieved in the cluster by differentiating applications into multiple business units, for example, different queues and users for different types of applications.

Security and privacy can be achieved by configuring Linux and HDFS permissions to separate files and directories to create tenant boundaries. This can be achieved by integrating with LDAP or Active Directory. Security is used to enforce the tenant application boundaries, and this can be integrated with the Kerberos security model.

The following diagram explains how an application runs in the YARN cluster in a multitenant way:

In the preceding YARN cluster, you can see that two jobs are running: one is Storm, and the other is the MapReduce job. They are sharing the cluster scheduler, cluster processing capacity, HDFS storage, and cluster security. We can also see that the two applications are running on a single YARN cluster. The MapReduce and Storm jobs are running over YARN and sharing the common cluster infrastructure, CPU, RAM, and so on. The Storm ApplicationMaster, Storm Supervisor, MapRed ApplicationMaster, Mappers, and Reducers are running over the YARN cluster in a multitenant way by sharing cluster resources.


Administration of YARN

Now, we will take a look at some basic YARN administration configurations. YARN was introduced in Hadoop 2.0 and brought changes to the Hadoop configuration files. Hadoop and YARN have the following basic configuration files:

core-default.xml: This file contains properties related to the system.
hdfs-default.xml: This file contains HDFS-related configurations.
mapred-default.xml: This configuration file contains properties related to the YARN MapReduce framework.
yarn-default.xml: This file contains YARN-related properties.

You will find all these properties listed on the Apache website (http://hadoop.apache.org/docs/current/) in the configuration section, with detailed information on each property and its default and possible values.


Administrative tools

YARN has several administrative tools by default; you can find them using the rmadmin command. Here is a more detailed explanation of the ResourceManager admin command:

$ yarn rmadmin -help

The rmadmin command is the command to execute MapReduce administrative commands. The full syntax is:

hadoop rmadmin [-refreshQueues] [-refreshNodes]
    [-refreshSuperUserGroupsConfiguration] [-refreshUserToGroupsMappings]
    [-refreshAdminAcls] [-refreshServiceAcl] [-getGroups [username]]
    [-help [cmd]]

The preceding command accepts the following options:

-refreshQueues: Reloads the queues’ ACLs, states, and scheduler-specific properties. The ResourceManager will reload the mapred-queues configuration file.
-refreshNodes: Refreshes the hosts information at the ResourceManager.
-refreshUserToGroupsMappings: Refreshes user-to-groups mappings.
-refreshSuperUserGroupsConfiguration: Refreshes superuser proxy groups mappings.
-refreshAdminAcls: Refreshes ACLs for the administration of the ResourceManager.
-refreshServiceAcl: Reloads the service-level authorization policy file. The ResourceManager will reload the authorization policy file.
-getGroups [username]: Gets the groups that the given user belongs to.
-help [cmd]: Displays help for the given command, or all commands if none is specified.

The generic options supported are as follows:

-conf <configuration file>: This will specify an application configuration file.
-D <property=value>: This will use the value for the given property.
-fs <local|namenode:port>: This will specify a NameNode.
-jt <local|jobtracker:port>: This will specify a JobTracker.
-files <comma separated list of files>: This will specify comma-separated files to be copied to the MapReduce cluster.
-libjars <comma separated list of jars>: This will specify comma-separated JAR files to include in the classpath.
-archives <comma separated list of archives>: This will specify comma-separated archives to be unarchived on the compute machines.

The general command line syntax is:

bin/hadoop command [genericOptions] [commandOptions]


Adding and removing nodes from a YARN cluster

A YARN cluster is horizontally scalable; you can add worker nodes to the cluster or remove them from it without stopping it. Before a new node is added, all the required software and configurations must be set up on that node.

The following property is used to add a new node to the cluster:

yarn.resourcemanager.nodes.include-path

For removing a node from the cluster, the following property is used:

yarn.resourcemanager.nodes.exclude-path

The preceding two properties each take as their value the path of a local file that contains the list of nodes that need to be added to or removed from the cluster. This file contains either the hostnames or the IPs of the worker nodes, separated by a newline, tab, or space.

After adding or removing a node, the YARN cluster does not require a restart. It just needs to refresh the list of worker nodes so that the ResourceManager gets informed about the newly added or removed nodes:

$ yarn rmadmin -refreshNodes
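The format of the include and exclude files (hostnames or IPs separated by a newline, tab, or space) can be illustrated with a small parser. This is a sketch only: the function names and host names are hypothetical, and the way the ResourceManager combines the two lists is simplified here to "included minus excluded".

```python
def parse_host_list(text):
    """Split a node list on any whitespace: newlines, tabs, or spaces."""
    return set(text.split())


def effective_nodes(include_text, exclude_text):
    """Toy rule: a node is usable if it is included and not excluded."""
    return parse_host_list(include_text) - parse_host_list(exclude_text)


include_file = "node1\nnode2\tnode3 10.0.0.4\n"
exclude_file = "node2\n"
print(sorted(effective_nodes(include_file, exclude_file)))
```

Because the split is on any whitespace, the same parser handles files with one host per line as well as hosts separated by tabs or spaces.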


Administrating YARN jobs

The most important YARN admin task is administrating the running of YARN jobs. You can manage YARN jobs using the yarn application CLI command.

Using the yarn application command, the administrator can kill a job, list all jobs, and find out the status of a job. MapReduce jobs can be controlled by the mapred job command.

Here is the usage of the yarn application command:

usage: application
 -appTypes <Comma-separated list of application types>   Works with --list to filter applications based on their type.
 -help                                                   Displays help for all commands.
 -kill <Application ID>                                  Kills the application.
 -list                                                   Lists applications from the RM. Supports optional use of -appTypes to filter applications based on application type.
 -status <Application ID>                                Prints the status of the application.


MapReduce job configurations

As MapReduce jobs are now running on YARN containers instead of traditional MapReduce slots, it’s necessary to configure MapReduce properties in mapred-site.xml. Here are some properties of MapReduce jobs that could be configured to run MapReduce jobs on YARN containers:

Property    Description

mapred.child.java.opts    This property is used to set the Java heap size for child JVMs of maps, for example, -Xmx4096m.

mapreduce.map.memory.mb    This property is used to configure the resource limit for map functions, for example, 1536 MB.

mapreduce.reduce.memory.mb    This property is used to configure the resource limit for reducer functions, for example, 3072 MB.

mapreduce.reduce.java.opts    This property is used to set the Java heap size for child JVMs of reducers, for example, -Xmx4096m.
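When tuning these properties together, the JVM heap set through -Xmx generally needs to stay below the container memory limit, because the JVM also uses memory outside the heap. The helper below is a hypothetical sanity check, not a Hadoop tool; it assumes the heap option carries a k, m, or g suffix, and the values passed to it are example figures.

```python
import re


def parse_xmx_mb(java_opts):
    """Extract the -Xmx heap size from a JVM options string, in MB (or None)."""
    match = re.search(r"-Xmx(\d+)([kKmMgG])", java_opts)
    if match is None:
        return None
    value, unit = int(match.group(1)), match.group(2).lower()
    factor = {"k": 1 / 1024, "m": 1, "g": 1024}[unit]
    return value * factor


def heap_fits_container(java_opts, container_memory_mb):
    """True when a heap size is present and smaller than the container limit."""
    heap_mb = parse_xmx_mb(java_opts)
    return heap_mb is not None and heap_mb < container_memory_mb


print(heap_fits_container("-Xmx1024m", 1536))
print(heap_fits_container("-Xmx4096m", 3072))
```

A check like this can be run over a planned configuration before it is written to mapred-site.xml.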


YARN log management

The log management CLI tool is very useful for YARN application log management. The administrator can use the logs CLI command described here:

$ yarn logs
Retrieve logs for completed YARN applications.
usage: yarn logs -applicationId <application ID> [OPTIONS]

general options are:
 -appOwner <Application Owner>   AppOwner (assumed to be current user if not specified)
 -containerId <Container ID>     ContainerId (must be specified if node address is specified)
 -nodeAddress <Node Address>     NodeAddress in the format nodename:port (must be specified if container ID is specified)

Let’s take an example. If you want to print all the logs of a specific application, use the following command:

$ yarn logs -applicationId <applicationID>

This command will print all the logs related to the specified application ID on the console.


YARN web user interface

In the YARN web user interface (http://localhost:8088/cluster), you can find information on cluster nodes, containers configured on each node, and applications and their status. The YARN web interface is as follows:

Under the Scheduler section, you can see the scheduling information of all the applications that have been submitted, accepted by the scheduler, or are running, along with the total cluster capacity, the used and maximum capacity, and the resources allocated to the application queue. In the following screenshot, you can see the resources allocated to the default queue:

Under the Tools section, you can find the YARN configuration file details, scheduling information, container configurations, local logs of the jobs, and a lot of other information on the cluster.


Summary

In this chapter, we covered YARN container allocations and configurations, scheduling policies, and configurations. We also covered multitenancy application support in YARN and some basic YARN administrative tools and settings. In the next chapter, we will cover some useful practical examples about YARN and the ecosystem.


Chapter 6. Developing and Running a Simple YARN Application

In the previous chapters, we discussed the concepts of the YARN architecture, cluster setup, and administration. Now in this chapter, we will focus more on MapReduce applications with YARN and its ecosystems, with some hands-on examples. You previously learned what happens when a client submits an application request to the YARN cluster and how YARN registers the application, allocates the required containers for its execution, and monitors the application while it’s running. Now, we will see some practical use cases of YARN.

In this chapter, we will discuss:

Running sample applications on YARN
Developing YARN examples
Application monitoring and tracking

Now, let’s start by running some of the sample applications that come as a part of the YARN distribution bundle.


Running sample examples on YARN

Running the available sample MapReduce programs is a simple task with YARN. The Hadoop distribution ships with some basic MapReduce examples. You can find them inside $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-<HADOOP_VERSION>.jar. The location of the file may differ depending on your Hadoop installation folder structure.

Let’s include this in the YARN_EXAMPLES path:

$ export YARN_EXAMPLES=$HADOOP_HOME/share/hadoop/mapreduce

Now, we have all the sample examples in the YARN_EXAMPLES environment variable. You can access all the examples using this variable; to list all the available examples, try typing the following command on the console:

$ yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-2.4.0.2.1.1.0-385.jar

An example program must be given as the first argument.

The valid program names are as follows:

aggregatewordcount: This is an aggregate-based map/reduce program that counts the words in the input files
aggregatewordhist: This is an aggregate-based map/reduce program that computes the histogram of the words in the input files
bbp: This is a map/reduce program that uses Bailey-Borwein-Plouffe to compute the exact digits of Pi
dbcount: This is an example job that counts the pageview counts from a database
distbbp: This is a map/reduce program that uses a BBP-type formula to compute the exact bits of Pi
grep: This is a map/reduce program that counts the matches of a regex in the input
join: This is a job that effects a join over sorted, equally partitioned datasets
multifilewc: This is a job that counts words from several files
pentomino: This is a map/reduce tile-laying program to find solutions to pentomino problems
pi: This is a map/reduce program that estimates Pi using a quasi-Monte Carlo method
randomtextwriter: This is a map/reduce program that writes 10 GB of random textual data per node
randomwriter: This is a map/reduce program that writes 10 GB of random data per node
secondarysort: This is an example that defines a secondary sort to the reduce
sort: This is a map/reduce program that sorts the data written by the random writer
sudoku: This is a sudoku solver
teragen: This generates data for the terasort
terasort: This runs the terasort
teravalidate: This checks the results of terasort
wordcount: This is a map/reduce program that counts the words in the input files
wordmean: This is a map/reduce program that counts the average length of the words in the input files
wordmedian: This is a map/reduce program that counts the median length of the words in the input files
wordstandarddeviation: This is a map/reduce program that counts the standard deviation of the length of the words in the input files

These were the sample examples that come as part of the YARN distribution by default. Now, let’s try running some of the examples to showcase YARN capabilities.


Running a sample Pi example

To run any application on top of YARN, you need to follow this command syntax:

$ yarn jar <application_jar.jar> <arg0> <arg1>

To run a sample example to calculate the value of Pi with 16 maps and 10,000 samples, use the following command:

$ yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-2.4.0.2.1.1.0-385.jar pi 16 10000

Note that we are using hadoop-mapreduce-examples-2.4.0.2.1.1.0-385.jar here. The JAR version may change depending on your installed Hadoop distribution.

Once you run the preceding command on the console, you will see the logs generated by the application, as shown in the following output. By default, the logger writes to the console at the INFO level; you may change this by overriding the default logger settings, for example, by setting hadoop.root.logger=WARN,console in conf/log4j.properties:

Number of Maps = 16
Samples per Map = 10000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Wrote input for Map #10
Wrote input for Map #11
Wrote input for Map #12
Wrote input for Map #13
Wrote input for Map #14
Wrote input for Map #15
Starting Job
11/09/14 21:12:02 INFO mapreduce.Job: map 0% reduce 0%
11/09/14 21:12:09 INFO mapreduce.Job: map 25% reduce 0%
11/09/14 21:12:11 INFO mapreduce.Job: map 56% reduce 0%
11/09/14 21:12:12 INFO mapreduce.Job: map 100% reduce 0%
11/09/14 21:12:12 INFO mapreduce.Job: map 100% reduce 100%
11/09/14 21:12:12 INFO mapreduce.Job: Job job_1381790835497_0003 completed successfully
11/09/14 21:12:19 INFO mapreduce.Job: Counters: 44
  File System Counters
    FILE: Number of bytes read=358
    FILE: Number of bytes written=1365080
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=4214
    HDFS: Number of bytes written=215
    HDFS: Number of read operations=67
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=3
  Job Counters
    Launched map tasks=16
    Launched reduce tasks=1
    Data-local map tasks=14
    Rack-local map tasks=2
    Total time spent by all maps in occupied slots (ms)=184421
    Total time spent by all reduces in occupied slots (ms)=8542
  Map-Reduce Framework
    Map input records=16
    Map output records=32
    Map output bytes=288
    Map output materialized bytes=448
    Input split bytes=2326
    Combine input records=0
    Combine output records=0
    Reduce input groups=2
    Reduce shuffle bytes=448
    Reduce input records=32
    Reduce output records=0
    Spilled Records=64
    Shuffled Maps=16
    Failed Shuffles=0
    Merged Map outputs=16
    GC time elapsed (ms)=195
    CPU time spent (ms)=7740
    Physical memory (bytes) snapshot=6143396896
    Virtual memory (bytes) snapshot=23142254400
    Total committed heap usage (bytes)=43340769024
  Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
  File Input Format Counters
    Bytes Read=1848
  File Output Format Counters
    Bytes Written=98
Job Finished in 23.144 seconds
Estimated value of Pi is 3.14127500000000000000

You can compare the example that runs over Hadoop 1.x with the one that runs over YARN. You can hardly differentiate them by looking at the logs, but you can clearly identify the difference in performance. YARN is backward compatible with MapReduce 1.x applications, without any code change.
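For intuition about what the pi example computes, here is a single-process sketch of a quasi-Monte Carlo estimate split into "map" and "reduce" steps. It is not the code of the Hadoop example: the use of a Halton sequence with bases 2 and 3, the function names, and the way each map task gets its own slice of the sequence are assumptions made for illustration.

```python
def halton(index, base):
    """Return element `index` (1-based) of the Halton sequence for `base`."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        result += fraction * (index % base)
        index //= base
        fraction /= base
    return result


def map_task(offset, samples):
    """Count points of one slice that fall inside a circle of radius 0.5."""
    inside = 0
    for i in range(offset + 1, offset + samples + 1):
        x = halton(i, 2) - 0.5
        y = halton(i, 3) - 0.5
        if x * x + y * y <= 0.25:
            inside += 1
    return inside, samples - inside   # two values per map: inside and outside


def estimate_pi(num_maps, samples_per_map):
    """'Reduce' step: sum the per-map counts and scale the inside ratio by 4."""
    counts = [map_task(m * samples_per_map, samples_per_map) for m in range(num_maps)]
    inside = sum(count[0] for count in counts)
    return 4.0 * inside / (num_maps * samples_per_map)


if __name__ == "__main__":
    print(estimate_pi(16, 10000))
```

Each map task in this sketch emits two values, which is consistent with the 32 map output records reported for 16 maps in the log above.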


Monitoring YARN applications with web GUI

Now, we will look at the YARN web GUI to monitor the examples. You can monitor the application submission ID, the user who submitted the application, the name of the application, the queue in which the application is submitted, the start time and finish time in the case of finished applications, and the final status of the application, using the ResourceManager UI. The ResourceManager web UI differs from the UI of the Hadoop 1.x versions. The following screenshot shows the information we could get from the YARN web UI (http://localhost:8088).

Currently, the following web UI is showing information related to the PI example we ran in the previous section, exploring the YARN web UI:

The following screenshot shows the PI example running over the YARN framework and the PI example submitted by the root user into the default queue. An ApplicationMaster is assigned to it, which is currently in the running state. Similarly, you can also monitor all the submitted, accepted and running, finished, and failed jobs’ statuses from the ResourceManager web UI.


If you drill down further, you can see the application master-level information of the submitted application, such as the total containers allocated to the map and reduce functions and their running status. For example, the following screenshot shows that we already submitted a PI example with 16 mappers. So in the following screenshot, you can see that the total number of containers allocated to the map function is 16, out of which 8 are completed and 8 are in the running state. You can also track the containers allocated to the reduce function and its progress from the UI:

You can see all the information displayed over the console while running the job. The same information will also be displayed on the web UI in a tabular form and in a more sophisticated way:

All the mapper and reducer jobs and filesystem counters will be displayed under the counter section of the YARN application web GUI. You can also explore the configurations of the application in the configurations section:


The following screenshot shows the statistics of the finished job, such as the total number of mappers, reducers, start time, finish time, and so on:

The following screenshot of the YARN web UI gives scheduling information about the YARN cluster, such as the cluster resource capacity and containers allocated to the application or queue:


At the end, you will see the job summary page. You may also examine the logs by clicking on the logs link provided on the job summary page.

Once a user returns to the main cluster UI, chooses any finished application, and then selects a job we recently ran, the user will be able to see the summary page, as shown in the following screenshot:

There are a few things to note as we moved through the windows described earlier. First, as YARN manages applications, all input from YARN refers to an application. YARN has no data on the actual application. Data from the MapReduce job is provided by the MapReduce framework. Therefore, there are two clearly different data streams that are combined in the web GUI: YARN applications and MapReduce framework jobs. If the framework does not provide job information, then certain parts of the web GUI will have nothing to display.

A very important fact about YARN jobs is the dynamic nature of the container allocations to the mapper and reducer tasks. These are executed as YARN containers, and their respective number also changes dynamically as per the application’s needs and requirements. This feature provides much better cluster utilization due to the dynamic container (“slots” in traditional language) allocations.


YARN’s MapReduce support

MapReduce was the only use case on which the previous versions of Hadoop were developed. We know that MapReduce is mainly used for the efficient and effective processing of big data. It can be used, for example, to process a graph with millions of nodes and edges. Going forward with technology, to cater for the requirements of data location availability, fault-tolerant systems, and application priorities, YARN built support for everything from a simple shell script application to a complex MapReduce application.

For data location availability, the MapReduce ApplicationMaster has to find out the data block locations and allocate containers to process these blocks accordingly. A fault-tolerant system means the ability to handle failed tasks and act on them accordingly, such as handling failed map and reduce tasks and rerunning them in other containers if needed. Priorities are assigned to each application in the queue; the logic to handle complex intra-application priorities for map and reduce tasks has to be built into the ApplicationMaster. There is no need to start idle reducers before mappers finish enough data processing. Reducers are now under the control of the YARN ApplicationMaster and are not fixed as they had been in Hadoop version 1.


The MapReduce ApplicationMaster

The MapReduce ApplicationMaster service is made up of multiple loosely-coupled services; these services interact with each other via events. Every service gets triggered on an event and produces an output event that in turn triggers another service; this happens highly concurrently and without synchronization. All service components are registered with the central dispatcher service, and service information is shared between the multiple components via ApplicationContext (AppContext).

In Hadoop version 1, all the running and submitted jobs are purely dependent on the JobTracker, so the failure of the JobTracker results in a loss of all the running and submitted jobs. With YARN, however, the ApplicationMaster takes over the JobTracker's per-job role. The ApplicationMaster runs the application and negotiates resources for it. It may fail, but YARN has the capability to restart the ApplicationMaster a specified number of times and the capability to recover completed tasks. Much like the JobTracker, the ApplicationMaster keeps the metrics of the jobs currently running. The following settings in the configuration files enable MapReduce recovery in YARN.

To enable the restart of the ApplicationMaster, execute the following steps:

1. Inside yarn-site.xml, you can tune the yarn.resourcemanager.am.max-retries property (named yarn.resourcemanager.am.max-attempts in later Hadoop 2.x releases). The default is 2.

2. Inside mapred-site.xml, you can directly tune how many times a MapReduce ApplicationMaster should restart with the mapreduce.am.max-attempts property. The default is 2.

3. To enable recovery of completed tasks, look inside the mapred-site.xml file. The yarn.app.mapreduce.am.job.recovery.enable property enables the recovery of tasks. By default, it is true.
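The effect of these settings can be illustrated with a small simulation. This is only a sketch in plain Java, not Hadoop code: the class and method names are hypothetical, the "crash" is simulated, and it assumes that the cluster-wide limit in yarn-site.xml acts as an upper bound on the per-job limit in mapred-site.xml.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical simulation of ApplicationMaster restart and task recovery.
class AmRestartSketch {

    // Assumption: the cluster-wide limit caps the per-job limit.
    static int effectiveMaxAttempts(int clusterMax, int jobMax) {
        return Math.min(clusterMax, jobMax);
    }

    // Simulates a job with totalTasks tasks whose first AM attempt crashes
    // after completing crashAfter tasks. Returns the total number of task
    // executions needed, or -1 if the attempts run out before the job ends.
    static int taskExecutions(int totalTasks, int crashAfter,
                              int maxAttempts, boolean recoveryEnabled) {
        Set<Integer> completed = new HashSet<>();
        int executions = 0;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (!recoveryEnabled) {
                // Without recovery, a new attempt starts from scratch.
                completed.clear();
            }
            for (int task = 0; task < totalTasks; task++) {
                if (completed.contains(task)) {
                    continue; // recovered from the previous attempt
                }
                if (attempt == 1 && completed.size() == crashAfter) {
                    break; // simulated AM failure on the first attempt
                }
                completed.add(task);
                executions++;
            }
            if (completed.size() == totalTasks) {
                return executions;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int attempts = effectiveMaxAttempts(2, 2);
        System.out.println("with recovery: " + taskExecutions(16, 8, attempts, true));
        System.out.println("without recovery: " + taskExecutions(16, 8, attempts, false));
    }
}
```

With recovery enabled, the second attempt only runs the tasks that had not finished; without it, the completed work is repeated.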


Example YARN MapReduce settings

YARN has replaced the fixed slot architecture for mappers and reducers with flexible dynamic container allocation. There are some important parameters to run MapReduce efficiently, and they can be found in mapred-site.xml and yarn-site.xml. As an example, the following are some settings that have been used to run the MapReduce application on YARN:

Property                                Property file     Value
mapreduce.map.memory.mb                 mapred-site.xml   1536
mapreduce.reduce.memory.mb              mapred-site.xml   2560
mapreduce.map.java.opts                 mapred-site.xml   -Xmx1024m
mapreduce.reduce.java.opts              mapred-site.xml   -Xmx2048m
yarn.scheduler.minimum-allocation-mb    yarn-site.xml     512
yarn.scheduler.maximum-allocation-mb    yarn-site.xml     4096
yarn.nodemanager.resource.memory-mb     yarn-site.xml     36864
yarn.nodemanager.vmem-pmem-ratio        yarn-site.xml     2.1

This YARN configuration allows container sizes between 512 MB and 4 GB. With a virtual-to-physical memory ratio of 2.1, each map container (1536 MB) can use at most 3225.6 MB of virtual memory, and each reduce container (2560 MB) at most 5376 MB. So, a compute node configured for 36 GB of container space can support up to 24 maps or 14 reducers, or any combination of mappers and reducers allowed by the available resources on the node.
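The figures in this paragraph follow directly from the table values. The sketch below reproduces the arithmetic in plain Java; the class and method names are illustrative only and are not part of any Hadoop API.

```java
// Illustrative arithmetic only; names are hypothetical, values come from the table.
class ContainerMath {

    // Virtual memory ceiling = physical container size * vmem-pmem ratio.
    static double virtualLimitMb(int containerMb, double vmemPmemRatio) {
        return containerMb * vmemPmemRatio;
    }

    // Number of containers of one size that fit in the NodeManager's memory.
    static int containersPerNode(int nodeMemoryMb, int containerMb) {
        return nodeMemoryMb / containerMb; // integer division rounds down
    }

    public static void main(String[] args) {
        int nodeMb = 36864;   // yarn.nodemanager.resource.memory-mb
        int mapMb = 1536;     // mapreduce.map.memory.mb
        int reduceMb = 2560;  // mapreduce.reduce.memory.mb
        double ratio = 2.1;   // yarn.nodemanager.vmem-pmem-ratio

        System.out.printf("map vmem limit (MB): %.1f%n", virtualLimitMb(mapMb, ratio));
        System.out.printf("reduce vmem limit (MB): %.1f%n", virtualLimitMb(reduceMb, ratio));
        System.out.println("maps per node: " + containersPerNode(nodeMb, mapMb));
        System.out.println("reducers per node: " + containersPerNode(nodeMb, reduceMb));
    }
}
```

Note that the per-node counts are alternatives: a node running 24 map containers has no memory left for reducers, so a real mix falls somewhere in between.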


YARN’s compatibility with MapReduce applications

For a smooth transition from Hadoop v1 to YARN, application backward compatibility has been a major goal of the YARN implementation team, to ensure that existing MapReduce applications that were programmed using Hadoop v1 (MRv1) APIs and compiled against them can continue to run over YARN, with little enhancement.

YARN ensures full binary compatibility with Hadoop v1 (MRv1) APIs; applications that use the org.apache.hadoop.mapred APIs are fully compatible with the YARN framework, without recompilation. You can use your MapReduce JAR file and bin/hadoop to submit them directly to YARN.

YARN introduced new API changes for MapReduce applications on top of the YARN framework into org.apache.hadoop.mapreduce.

If an application is developed using org.apache.hadoop.mapreduce and compiled against the Hadoop v1 (MRv1) APIs, then unfortunately YARN doesn’t provide binary compatibility with it, as the org.apache.hadoop.mapreduce APIs have gone through a YARN transition; such an application should be recompiled against Hadoop v2 (MRv2) to run over YARN.


Developing YARN applications

To develop a YARN application, you need to keep the YARN architecture in mind. YARN is a platform that allows distributed applications to take full advantage of the resources that YARN has deployed. Currently, resources can be things such as CPU, memory, and data. Many developers who come from a server-side application-development background or from a MapReduce developer background may be accustomed to a certain flow in the development and deployment cycle.

In this section, we’ll describe the development life cycle of YARN applications. Also, we’ll focus on the key areas of YARN application development, such as how YARN applications can launch containers, how resource allocation has been done for the applications, and many other areas in detail.

The general workflow of the YARN application submission is that the YARN Client communicates with the ResourceManager through the ApplicationClientProtocol to generate a new ApplicationID. It then submits the application to the ResourceManager to run via the ApplicationClientProtocol. As a part of the protocol, the YARN Client has to provide all the required information to the ResourceManager to launch the application’s first container, that is, the ApplicationMaster. The YARN Client also needs to provide information details of the dependency JARs/files for the application via command-line arguments. You can also specify the dependency JARs/files in the environment variables.

The following are some interface protocols that the YARN framework will use for intercomponent communication:

ApplicationClientProtocol: This protocol is used by YARN for communication between the YARN Client and ResourceManager to launch a new application, check its status, or kill the application.

ApplicationMasterProtocol: This protocol is used by the YARN framework to communicate between the ApplicationMaster and ResourceManager. It is used by the ApplicationMaster to register/unregister itself to/from the ResourceManager and also for resource allocation/deallocation requests to the ResourceManager.

ContainerManagementProtocol: This protocol is used for communication between the ApplicationMaster and NodeManager to start and stop containers and to get their status updates.


The YARN application workflow

Now, take a look at the following sequence diagram that describes the YARN application workflow and also explains how container allocation is done for an application via the ApplicationMaster:

Refer to the preceding diagram for the following details:

The client submits the application request to the ResourceManager.

The ResourceManager registers the application with the ApplicationManager, generates the ApplicationID, and responds to the client with the successfully registered ApplicationID.

Then, the ResourceManager starts the client’s ApplicationMaster in a separate available container. If no container is available, this request has to wait until a suitable container is found, after which the application registration request is sent.

The ResourceManager shares all the minimum and maximum resource capabilities of the cluster with the ApplicationMaster.

Then, the ApplicationMaster decides how to efficiently use the available resources to fulfill application needs.

Depending on the resource capabilities shared by the ResourceManager, the ApplicationMaster requests the ResourceManager to allocate the number of containers on behalf of the application.

The ResourceManager responds to the ResourceRequest by the ApplicationMaster as per the scheduling policies and resource availabilities. Container allocation by the ResourceManager means successful fulfilling of the ResourceRequest by the ApplicationMaster.

While running the job, the ApplicationMaster sends the heartbeat and job progress information of the application to the ResourceManager. During the running time of the application, the ApplicationMaster can ask the ResourceManager to release containers or to allocate more. When the job finishes, the ApplicationMaster sends a container deallocation request to the ResourceManager and exits from its own running container.


Writing the YARN client

The YARN client is required to submit the job to the YARN framework. It is a plain Java class with main as its entry point. The main function of the YARN client is to submit the application to the YARN environment by instantiating the org.apache.hadoop.yarn.conf.YarnConfiguration object. The YarnConfiguration object depends on finding the yarn-default.xml and yarn-site.xml files in its class path. All these requirements need to be satisfied to run the YARN client application. The YARN client process is shown in the following image:

Once a YarnConfiguration object is instantiated in your YARN client, we have to create an object of org.apache.hadoop.yarn.client.api.YarnClient using the YarnConfiguration object that has already been instantiated. The newly-instantiated YarnClient object will be used to submit the applications to the YARN framework using the following steps:

1. Create an instance of a YarnClient object using YarnConfiguration.
2. Initialize the YarnClient and the YarnConfiguration object.
3. Start a YarnClient.
4. Get the YARN cluster, node, and queue information.
5. Get Access Control List information for the user running the client.
6. Create the client application.
7. Submit the application to the YARN ResourceManager.
8. Get application reports after submitting the application.

Also, the YarnClient will create a context for application submission and for the ApplicationMaster’s container launch. The runnable YarnClient will take the command-line arguments from the user who is required to run the job. We’ll see the simple code snippet for the YARN application client to get a better idea about it.

The first step of the YARN Client is to connect with the ResourceManager. The following is the code snippet for it:

// Declare ApplicationClientProtocol
ApplicationClientProtocol applicationsManager;

// Instantiate YarnConfiguration
YarnConfiguration yarnConf = new YarnConfiguration(conf);

// Get the ResourceManager IP address; if not provided, use the default
InetSocketAddress rmAddress =
    NetUtils.createSocketAddr(yarnConf.get(
        YarnConfiguration.RM_ADDRESS,
        YarnConfiguration.DEFAULT_RM_ADDRESS));
LOGGER.info("Connecting to ResourceManager at " + rmAddress);

Configuration appsManagerServerConf = new Configuration(conf);
appsManagerServerConf.setClass(
    YarnConfiguration.YARN_SECURITY_INFO,
    ClientRMSecurityInfo.class, SecurityInfo.class);

// Initialize the ApplicationManager handle
// (rpc is assumed to be a YarnRPC instance created earlier,
// for example with YarnRPC.create(yarnConf))
applicationsManager = ((ApplicationClientProtocol) rpc.getProxy(
    ApplicationClientProtocol.class, rmAddress,
    appsManagerServerConf));

Once the connection between the YARN Client and ResourceManager is established, the YARN Client needs to request the ApplicationID from the ResourceManager:

GetNewApplicationRequest newRequest =
    Records.newRecord(GetNewApplicationRequest.class);
GetNewApplicationResponse newResponse =
    applicationsManager.getNewApplication(newRequest);

The response from the ApplicationManager is the newly-generated ApplicationID for the application submitted by the YARN Client. You can also get the information related to the minimum and maximum resource capabilities of the cluster (using the GetNewApplicationResponse API). Using this information, developers can set the required resources for the ApplicationMaster container to launch.

The YARN Client needs to set up the following information for the ApplicationSubmissionContext initialization; this information includes all the required information needed by the ResourceManager to launch the ApplicationMaster, as mentioned here:

Application information, such as the ApplicationID generated by the previous step

Name of the application

Queue and priority information, such as in which queue the application needs to be submitted and the priorities assigned to the application

User information, that is, by whom the application is to be submitted

ContainerLaunchContext, that is, the information needed by the ApplicationMaster to launch local resources (such as JARs, binaries, and files)

It also contains the security-related information (security tokens) and environmental variables (class path settings) with the command to be executed via the ApplicationMaster:

// Create a new launch context for the AppMaster
ApplicationSubmissionContext appContext =
    Records.newRecord(ApplicationSubmissionContext.class);

// Set the ApplicationId
appContext.setApplicationId(appId);

// Set the application name
appContext.setApplicationName(appName);

// Create a new container launch context for the ApplicationMaster
ContainerLaunchContext amContainer =
    Records.newRecord(ContainerLaunchContext.class);

// Set the local resources required for the ApplicationMaster:
// local files or archives as needed (for example, JAR files)
Map<String, LocalResource> localResources =
    new HashMap<String, LocalResource>();

// Copy the ApplicationMaster JAR to the filesystem and create a
// local resource that points to the destination JAR path
FileSystem fs = FileSystem.get(conf);
Path src = new Path("AppMaster.jar");
String pathSuffix = appName + "/" + appId.getId() +
    "/AppMaster.jar";
Path dst = new Path(fs.getHomeDirectory(), pathSuffix);

// Copy the file from src to the destination on HDFS
fs.copyFromLocalFile(false, true, src, dst);

// Get the HDFS file status from the path where it was copied
FileStatus jarStatus = fs.getFileStatus(dst);
LocalResource amJarResource = Records.newRecord(LocalResource.class);

// Set the type of resource: file or archive
// (archives are untarred at the destination by the framework)
amJarResource.setType(LocalResourceType.FILE);

// Set the visibility of the resource,
// using the most private option
amJarResource.setVisibility(LocalResourceVisibility.APPLICATION);

// Set the location from which the resource is to be copied
amJarResource.setResource(ConverterUtils.getYarnUrlFromPath(dst));

// Set the timestamp and length of the file so that the framework
// can do basic sanity checks for the local resource
// after it has been copied over, to ensure it is the same
// resource the client intended to use with the application
amJarResource.setTimestamp(jarStatus.getModificationTime());
amJarResource.setSize(jarStatus.getLen());
localResources.put("AppMaster.jar", amJarResource);

// Set the local resources into the launch context
amContainer.setLocalResources(localResources);

// Set the security tokens as needed
// amContainer.setContainerTokens(containerToken);

// Set up the environment needed for the launch context in which the
// ApplicationMaster is to be run
Map<String, String> env = new HashMap<String, String>();

// For example, we could set up the class path needed.
// In the case of the shell script example, put the required resources
env.put(DSConstants.SCLOCATION, HdfsSCLocation);
env.put(DSConstants.SCTIMESTAMP, Long.toString(HdfsSCTimeStamp));
env.put(DSConstants.SCLENGTH, Long.toString(HdfsSCLength));

// Add the AppMaster.jar location to the class path.
// By default, all the Hadoop-specific class paths will already be
// available in $CLASSPATH, so we should be careful not to overwrite it.
StringBuilder classPathEnv = new StringBuilder("$CLASSPATH:./*");
for (String str :
    conf.get(YarnConfiguration.YARN_APPLICATION_CLASSPATH).split(",")) {
  classPathEnv.append(':');
  classPathEnv.append(str.trim());
}

// Add log4j properties to the env variable if required
classPathEnv.append(":./log4j.properties");
env.put("CLASSPATH", classPathEnv.toString());

// Set the environment variables into the container
amContainer.setEnvironment(env);

// Set the command needed to execute the ApplicationMaster
Vector<CharSequence> vargs = new Vector<CharSequence>(30);

// Set the Java executable command
vargs.add("${JAVA_HOME}" + "/bin/java");

// Set the memory (Xmx) based on the AM memory requirements
vargs.add("-Xmx" + amMemory + "m");

// Set the class name
vargs.add(amMasterMainClass);

// Set the parameters for the ApplicationMaster
vargs.add("--container_memory " + String.valueOf(containerMemory));
vargs.add("--num_containers " + String.valueOf(numContainers));
vargs.add("--priority " + String.valueOf(shellCmdPriority));
if (!shellCommand.isEmpty()) {
  vargs.add("--shell_command " + shellCommand + " ");
}
if (!shellArgs.isEmpty()) {
  vargs.add("--shell_args " + shellArgs + " ");
}
for (Map.Entry<String, String> entry : shellEnv.entrySet()) {
  vargs.add("--shell_env " + entry.getKey() + "=" +
      entry.getValue());
}
if (debugFlag) {
  vargs.add("--debug");
}
vargs.add("1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR +
    "/AppMaster.stdout");
vargs.add("2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR +
    "/AppMaster.stderr");

// Get the final command
StringBuilder command = new StringBuilder();
for (CharSequence str : vargs) {
  command.append(str).append(" ");
}
List<String> commands = new ArrayList<String>();
commands.add(command.toString());

// Set the command array into the container spec
amContainer.setCommands(commands);

// For launching an AM container, setting the user here is not
// needed
// amContainer.setUser(amUser);
Resource capability = Records.newRecord(Resource.class);

// For now, only memory is supported, so we set the memory
capability.setMemory(amMemory);
amContainer.setResource(capability);

// Set the container launch context into the
// ApplicationSubmissionContext
appContext.setAMContainerSpec(amContainer);
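The command that launches the ApplicationMaster is ultimately a single shell string, so the separators between a flag and its value matter. The following sketch builds such a string in plain Java, without Hadoop on the class path. The class name, method name, and main class are hypothetical, and the LOG_DIR constant is a local stand-in for ApplicationConstants.LOG_DIR_EXPANSION_VAR, which the NodeManager expands to the container's log directory.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that assembles an ApplicationMaster launch command.
class AmCommandSketch {

    // Local stand-in for ApplicationConstants.LOG_DIR_EXPANSION_VAR.
    static final String LOG_DIR = "<LOG_DIR>";

    static String buildCommand(String mainClass, int amMemoryMb,
                               int containerMemoryMb, int numContainers,
                               Map<String, String> shellEnv) {
        List<String> vargs = new ArrayList<>();
        vargs.add("${JAVA_HOME}/bin/java");
        vargs.add("-Xmx" + amMemoryMb + "m");
        vargs.add(mainClass);
        // Each flag is separated from its value by a space.
        vargs.add("--container_memory " + containerMemoryMb);
        vargs.add("--num_containers " + numContainers);
        for (Map.Entry<String, String> entry : shellEnv.entrySet()) {
            vargs.add("--shell_env " + entry.getKey() + "=" + entry.getValue());
        }
        // Redirect stdout and stderr into the container's log directory.
        vargs.add("1>" + LOG_DIR + "/AppMaster.stdout");
        vargs.add("2>" + LOG_DIR + "/AppMaster.stderr");
        return String.join(" ", vargs);
    }

    public static void main(String[] args) {
        Map<String, String> env = new LinkedHashMap<>();
        env.put("FOO", "bar");
        System.out.println(
            buildCommand("com.example.MyAppMaster", 512, 1024, 2, env));
    }
}
```

Joining the parts with String.join avoids the trailing space that a manual append loop leaves behind, although the shell tolerates either form.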

Now the setup process is complete, and our YARN Client is ready to submit the application to the ApplicationManager:

// Create the application request to send to the ApplicationsManager
SubmitApplicationRequest appRequest =
    Records.newRecord(SubmitApplicationRequest.class);
appRequest.setApplicationSubmissionContext(appContext);

// Submit the application to the ApplicationsManager.
// Ignore the response, as either a valid response object is
// returned on success or an exception is thrown to denote the failure
applicationsManager.submitApplication(appRequest);

During this process, the ResourceManager will accept all the requests of application submission and allocate containers to the ApplicationMaster to run. The progress of the task submitted by the client can be tracked by communicating with the ResourceManager and requesting an application status report via the ApplicationClientProtocol:

GetApplicationReportRequest reportRequest =
    Records.newRecord(GetApplicationReportRequest.class);
reportRequest.setApplicationId(appId);
GetApplicationReportResponse reportResponse =
    applicationsManager.getApplicationReport(reportRequest);
ApplicationReport report = reportResponse.getApplicationReport();

The response to the report request received from the ResourceManager contains general application information, such as the ApplicationID, the queue information in which the application is running, and information on the user who submitted the application. It also contains the ApplicationMaster details, the host on which the ApplicationMaster is running, and application-tracking information to monitor the progress of the application.


The application report also contains the application status information, such as SUBMITTED, RUNNING, FINISHED, and so on.
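A client typically asks for the report repeatedly until the application reaches a final state. The sketch below shows that polling pattern in plain Java. The AppState enum is a local stand-in that mirrors the state names of YarnApplicationState, and the Supplier stands in for the getApplicationReport call; both, along with the class and method names, are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

// Hypothetical polling loop over application states.
class ReportPollingSketch {

    // Local stand-in mirroring the names in YarnApplicationState.
    enum AppState {
        NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED;

        boolean isTerminal() {
            return this == FINISHED || this == FAILED || this == KILLED;
        }
    }

    // Polls reportSource until a terminal state is seen or maxPolls is
    // reached. Returns the last state observed (null if maxPolls is 0).
    static AppState waitForCompletion(Supplier<AppState> reportSource,
                                      int maxPolls, long sleepMs) {
        AppState last = null;
        for (int i = 0; i < maxPolls; i++) {
            last = reportSource.get();
            if (last.isTerminal()) {
                return last;
            }
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt
                return last;
            }
        }
        return last;
    }

    public static void main(String[] args) {
        // Canned sequence of states in place of real ResourceManager reports.
        Iterator<AppState> canned = Arrays.asList(
            AppState.SUBMITTED, AppState.ACCEPTED,
            AppState.RUNNING, AppState.FINISHED).iterator();
        System.out.println(waitForCompletion(canned::next, 10, 0));
    }
}
```

Bounding the number of polls keeps a client from waiting forever on an application that is stuck in the ACCEPTED state because the cluster has no free resources.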

Also, the client can directly query the ApplicationMaster to get report information via host:rpc_port obtained from the ApplicationReport.

Sometimes, the application may be wrongly submitted in another queue or may take longer than usual. In such cases, the client may want to kill the application. The ApplicationClientProtocol supports a forced kill operation that can send a kill signal to the ApplicationMaster via the ResourceManager:

KillApplicationRequest killRequest =
    Records.newRecord(KillApplicationRequest.class);
killRequest.setApplicationId(appId);
applicationsManager.forceKillApplication(killRequest);


Writing the YARN ApplicationMaster

This task is the heart of the whole process. The ApplicationMaster is launched by the ResourceManager, and all the necessary information is provided by the client. As the ApplicationMaster is launched in the first container allocated by the ResourceManager, several parameters are made available by the ResourceManager via the environment. These parameters include the container ID for the ApplicationMaster container, the application submission time, and details about the NodeManager and the host on which the ApplicationMaster is running. Interactions between the ApplicationMaster and the ResourceManager require the ApplicationAttemptID, which is obtained from the ApplicationMaster’s container ID:

Map<String, String> envs = System.getenv();
String containerIdString =
    envs.get(ApplicationConstants.AM_CONTAINER_ID_ENV);
if (containerIdString == null) {
  throw new IllegalArgumentException(
      "ContainerId not set in the environment");
}
ContainerId containerId =
    ConverterUtils.toContainerId(containerIdString);
ApplicationAttemptId appAttemptID =
    containerId.getApplicationAttemptId();
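The attempt ID can be derived from the container ID because the container ID string embeds it. The sketch below illustrates this with plain string handling. It assumes the textual layout container_<clusterTimestamp>_<appSequence>_<attemptSequence>_<containerSequence>, which is how such IDs commonly appear in Hadoop 2.x logs; later releases may add further fields, so real code should rely on ConverterUtils as shown above. The class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical string-level illustration; real code should use ConverterUtils.
class ContainerIdSketch {

    // Assumed layout: container_<clusterTs>_<appSeq>_<attemptSeq>_<containerSeq>
    static String attemptIdFromContainerId(String containerId) {
        String[] parts = containerId.split("_");
        if (parts.length != 5 || !parts[0].equals("container")) {
            throw new IllegalArgumentException(
                "Unexpected container ID: " + containerId);
        }
        long clusterTimestamp = Long.parseLong(parts[1]);
        int appSequence = Integer.parseInt(parts[2]);
        int attemptSequence = Integer.parseInt(parts[3]);
        // The container sequence number (parts[4]) is not part of the attempt ID.
        return String.format(Locale.ROOT, "appattempt_%d_%04d_%06d",
            clusterTimestamp, appSequence, attemptSequence);
    }

    public static void main(String[] args) {
        System.out.println(
            attemptIdFromContainerId("container_1410450250506_0001_01_000001"));
    }
}
```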

After the successful initialization of the ApplicationMaster, it needs to be registered with the ResourceManager via the ApplicationMasterProtocol. The ApplicationMaster and ResourceManager communicate via the Scheduler interface:

// Connect to the ResourceManager and return a handle to the RM
YarnConfiguration yarnConf = new YarnConfiguration(conf);
InetSocketAddress rmAddress =
    NetUtils.createSocketAddr(yarnConf.get(
        YarnConfiguration.RM_SCHEDULER_ADDRESS,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS));
LOG.info("Connecting to ResourceManager at " + rmAddress);
ApplicationMasterProtocol resourceManager =
    (ApplicationMasterProtocol)
        rpc.getProxy(ApplicationMasterProtocol.class, rmAddress, conf);

// Register the ApplicationMaster with the ResourceManager.
// Set the required info into the registration request:
// ApplicationAttemptId,
// host on which the app master is running,
// rpc port on which the app master accepts requests from the client,
// tracking URL for the client to track app master progress
RegisterApplicationMasterRequest appMasterRequest =
    Records.newRecord(RegisterApplicationMasterRequest.class);
appMasterRequest.setApplicationAttemptId(appAttemptID);
appMasterRequest.setHost(appMasterHostname);
appMasterRequest.setRpcPort(appMasterRpcPort);
appMasterRequest.setTrackingUrl(appMasterTrackingUrl);
RegisterApplicationMasterResponse response =
    resourceManager.registerApplicationMaster(appMasterRequest);

The ApplicationMaster sends status to the ResourceManager via heartbeat signals, and the timeout expiry intervals at the ResourceManager are defined by configuration settings in the YarnConfiguration. The ApplicationMasterProtocol communicates with the ResourceManager to send heartbeats and application progress information.

Depending on application requirements, the ApplicationMaster can request from the ResourceManager the number of container resources to be allocated. For this request, the ApplicationMaster will use the ResourceRequest API to define container specifications. The ResourceRequest will contain the hostname if the containers need to be hosted on specific hosts, or the * wildcard character, which implies that any host will do. It will contain the resource capabilities, such as the memory to be allocated to the container. It will also contain priorities, so that containers for specific tasks can be allocated at a higher priority. For example, in MapReduce, a higher priority is given to containers for the map tasks and a lower priority to containers for the reduce tasks:

// ResourceRequest
ResourceRequest request = Records.newRecord(ResourceRequest.class);

// Set up requirements for hosts:
// whether a particular rack/host is expected.
// Refer to the APIs under org.apache.hadoop.net for more details;
// using * means any host will do
request.setHostName("*");

// Set the number of containers
request.setNumContainers(numContainers);

// Set the priority for the request
Priority pri = Records.newRecord(Priority.class);
pri.setPriority(requestPriority);
request.setPriority(pri);

// Set up resource type requirements.
// For now, only memory is supported, so we set memory requirements
Resource capability = Records.newRecord(Resource.class);
capability.setMemory(containerMemory);
request.setCapability(capability);

After defining the container requests, the ApplicationMaster has to build an allocation request for the ResourceManager. The AllocateRequest consists of the requested containers, containers to be released, the ResponseID (the ID of the response that would be sent back from the allocate call), and progress update information:

List<ResourceRequest> requestedContainers;
List<ContainerId> releasedContainers;
AllocateRequest req = Records.newRecord(AllocateRequest.class);

// The response ID set in the request will be sent back in
// the response so that the ApplicationMaster can
// match it to its original ask and act appropriately.
req.setResponseId(rmRequestID);

// Set the ApplicationAttemptId
req.setApplicationAttemptId(appAttemptID);

// Add the list of containers being asked for by the AM
req.addAllAsks(requestedContainers);

// The ApplicationMaster can ask the ResourceManager to deallocate
// containers that it no longer requires.
req.addAllReleases(releasedContainers);

// The ApplicationMaster can report its progress by setting progress
req.setProgress(currentProgress);
AllocateResponse allocateResponse = resourceManager.allocate(req);

The response to the container allocation request from the ApplicationMaster to the ResourceManager contains the information on the containers allocated to the ApplicationMaster, the number of hosts available in the cluster, and many more such details.

Containers are not immediately assigned to the ApplicationMaster by the ResourceManager. However, when the container request is sent to the ResourceManager, the ApplicationMaster will eventually get the containers based on cluster capacity, priorities, and the cluster scheduling policy:

// Retrieve the list of allocated containers from the response
List<Container> allocatedContainers =
    allocateResponse.getAllocatedContainers();
for (Container allocatedContainer : allocatedContainers) {
  LOG.info("Launching shell command on a new container."
      + ", containerId=" + allocatedContainer.getId()
      + ", containerNode=" + allocatedContainer.getNodeId().getHost()
      + ":" + allocatedContainer.getNodeId().getPort()
      + ", containerNodeURI=" + allocatedContainer.getNodeHttpAddress()
      + ", containerState" + allocatedContainer.getState()
      + ", containerResourceMemory"
      + allocatedContainer.getResource().getMemory());
  LaunchContainerRunnable runnableLaunchContainer =
      new LaunchContainerRunnable(allocatedContainer);
  Thread launchThread = new Thread(runnableLaunchContainer);
  launchThreads.add(launchThread);
  launchThread.start();
}

// Check what the current available resources in the cluster are
Resource availableResources = allocateResponse.getAvailableResources();
LOG.info("Current available resources in the cluster " +
    availableResources);

// Based on this information, an ApplicationMaster can make
// appropriate decisions

// Check the completed containers
List<ContainerStatus> completedContainers =
    allocateResponse.getCompletedContainersStatuses();
for (ContainerStatus containerStatus : completedContainers) {
  LOG.info("Got container status for containerID="
      + containerStatus.getContainerId()
      + ", state=" + containerStatus.getState()
      + ", exitStatus=" + containerStatus.getExitStatus()
      + ", diagnostics=" + containerStatus.getDiagnostics());
  int exitStatus = containerStatus.getExitStatus();
  if (0 != exitStatus) {
    // container failed
    if (-100 != exitStatus) {
      // the application job on the container returned a non-zero
      // exit code; this counts as completed
      numCompletedContainers.incrementAndGet();
      numFailedContainers.incrementAndGet();
    }
    else {
      // something else bad happened
      // the app job did not complete for some reason
      // we should retry, as the container was lost for some reason
      numRequestedContainers.decrementAndGet();
      // we do not need to release the container, as that has
      // already been done by the ResourceManager/NodeManager.
    }
  }
  else {
    // nothing to do
    // container completed successfully
    numCompletedContainers.incrementAndGet();
    LOG.info("Container completed successfully."
        + ", containerId=" + containerStatus.getContainerId());
  }
}
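The counter bookkeeping in the loop above can be isolated into a small sketch that runs without any YARN classes. Everything in it is hypothetical and for illustration only: the class name ContainerAccounting, the method onContainerCompleted, and the constant ABORTED, which simply mirrors the -100 exit status that the snippet treats as a lost container.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper mirroring the counters used in the snippet above. */
final class ContainerAccounting {
    /** Assumed name for the -100 exit status that the snippet treats as a lost container. */
    static final int ABORTED = -100;

    final AtomicInteger numRequestedContainers;
    final AtomicInteger numCompletedContainers = new AtomicInteger();
    final AtomicInteger numFailedContainers = new AtomicInteger();

    ContainerAccounting(int requested) {
        this.numRequestedContainers = new AtomicInteger(requested);
    }

    /**
     * Updates the counters for one completed container.
     * Returns true when the container was lost and should be requested again.
     */
    boolean onContainerCompleted(int exitStatus) {
        if (exitStatus == 0) {
            // The job finished successfully.
            numCompletedContainers.incrementAndGet();
            return false;
        }
        if (exitStatus != ABORTED) {
            // The job itself failed: completed, but also counted as failed.
            numCompletedContainers.incrementAndGet();
            numFailedContainers.incrementAndGet();
            return false;
        }
        // The container was lost: lower the requested count so it is asked for again.
        numRequestedContainers.decrementAndGet();
        return true;
    }

    public static void main(String[] args) {
        ContainerAccounting acc = new ContainerAccounting(3);
        acc.onContainerCompleted(0);
        acc.onContainerCompleted(1);
        boolean retry = acc.onContainerCompleted(ABORTED);
        System.out.println("completed=" + acc.numCompletedContainers.get()
            + ", failed=" + acc.numFailedContainers.get()
            + ", requested=" + acc.numRequestedContainers.get()
            + ", retry=" + retry);
    }
}
```

Returning a boolean for the retry case is a design choice of this sketch; the original snippet only decrements the requested counter and leaves the new request to the surrounding allocation loop.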

After containers have been successfully allocated to the ApplicationMaster, it has to set up a ContainerLaunchContext for the task that will run in each container. Once the ContainerLaunchContext is set, the ApplicationMaster can request the ContainerManager to start the allocated container:

// Assume a Container that was obtained from the AllocateResponse
// and has already been initialized
Container container;
LOG.debug("Connecting to ContainerManager for containerid=" +
    container.getId());
// Connect to the ContainerManager on the allocated container
String cmIpPortStr = container.getNodeId().getHost() + ":"
    + container.getNodeId().getPort();
InetSocketAddress cmAddress = NetUtils.createSocketAddr(cmIpPortStr);
LOG.info("Connecting to ContainerManager at " + cmIpPortStr);
ContainerManager cm = ((ContainerManager)
    rpc.getProxy(ContainerManager.class, cmAddress, conf));


// Now we set up a ContainerLaunchContext
LOG.info("Setting up container launch context for containerid=" +
    container.getId());
ContainerLaunchContext ctx =
    Records.newRecord(ContainerLaunchContext.class);
ctx.setContainerId(container.getId());
ctx.setResource(container.getResource());
try {
  ctx.setUser(UserGroupInformation.getCurrentUser().getShortUserName());
} catch (IOException e) {
  LOG.info(
      "Getting current user failed when trying to launch the container: "
      + e.getMessage());
}
// Set the environment
Map<String, String> unixEnv;
// Set up the required env.
// Please note that the launched container does not inherit
// the environment of the ApplicationMaster, so all the
// necessary environment settings will need to be set up again
// for this allocated container.
ctx.setEnvironment(unixEnv);

// Set the local resources
Map<String, LocalResource> localResources =
    new HashMap<String, LocalResource>();
// Again, the local resources from the ApplicationMaster are not copied
// over by default to the allocated container. Thus, it is the
// responsibility of the ApplicationMaster to set up all the necessary
// local resources needed by the job that will be executed on the
// allocated container.
// Assume that we are executing a shell script on the allocated
// container and that the shell script's location in the filesystem
// is known to us.
Path shellScriptPath;
LocalResource shellRsrc = Records.newRecord(LocalResource.class);
shellRsrc.setType(LocalResourceType.FILE);
shellRsrc.setVisibility(LocalResourceVisibility.APPLICATION);
shellRsrc.setResource(
    ConverterUtils.getYarnUrlFromURI(shellScriptPath.toUri()));
shellRsrc.setTimestamp(shellScriptPathTimestamp);
shellRsrc.setSize(shellScriptPathLen);
localResources.put("MyExecShell.sh", shellRsrc);
ctx.setLocalResources(localResources);

// Set the necessary command to execute on the allocated container
String command = "/bin/sh ./MyExecShell.sh"
    + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
    + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr";
List<String> commands = new ArrayList<String>();
commands.add(command);
ctx.setCommands(commands);

// Send the start request to the ContainerManager
StartContainerRequest startReq =
    Records.newRecord(StartContainerRequest.class);
startReq.setContainerLaunchContext(ctx);
try {
  cm.startContainer(startReq);
} catch (YarnRemoteException e) {
  LOG.info("Start container failed for:" + ", containerId=" +
      container.getId());
  e.printStackTrace();
}
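The way the launch command is assembled can be tried out without a cluster. The following is a minimal sketch, not the YARN API: LaunchCommandSketch and buildCommands are hypothetical names, and LOG_DIR_PLACEHOLDER is a local stand-in for ApplicationConstants.LOG_DIR_EXPANSION_VAR whose literal value is assumed here. The point it illustrates is that the command is an ordinary shell line in which stdout and stderr are redirected into a log-directory placeholder, which is expected to be expanded to the container's real log directory when the container is launched.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of how the container launch command is put together. */
final class LaunchCommandSketch {
    /** Stand-in for ApplicationConstants.LOG_DIR_EXPANSION_VAR; the literal value is an assumption. */
    static final String LOG_DIR_PLACEHOLDER = "<LOG_DIR>";

    /** Builds the single-element command list for running the given script with /bin/sh. */
    static List<String> buildCommands(String script) {
        String command = "/bin/sh " + script
            + " 1>" + LOG_DIR_PLACEHOLDER + "/stdout"
            + " 2>" + LOG_DIR_PLACEHOLDER + "/stderr";
        List<String> commands = new ArrayList<String>();
        commands.add(command);
        return commands;
    }

    public static void main(String[] args) {
        // Prints the command line that would be handed to the launch context.
        System.out.println(buildCommands("./MyExecShell.sh").get(0));
    }
}
```

Note the spaces before 1> and 2>: without them the redirections would be glued to the script name and the shell would not parse the line as intended.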

The ApplicationMaster will get status information about its containers via the ApplicationMasterProtocol. It may also monitor a launched container actively by querying the ContainerManager for its status:

GetContainerStatusRequest statusReq =
    Records.newRecord(GetContainerStatusRequest.class);
statusReq.setContainerId(container.getId());
GetContainerStatusResponse statusResp;
try {
  statusResp = cm.getContainerStatus(statusReq);
  LOG.info("Container Status"
      + ", id=" + container.getId()
      + ", status=" + statusResp.getStatus());
} catch (YarnRemoteException e) {
  e.printStackTrace();
}

These code snippets explain how to write the YARN Client and ApplicationMaster in general. The ApplicationMaster is an application-specific entity; each application or framework that wants to run over YARN has a different ApplicationMaster, but the flow is the same. For more details on the YARN Client and ApplicationMaster for different frameworks, visit the Apache Foundation website.

Responsibilities of the ApplicationMaster

The ApplicationMaster is the application-specific library and is responsible for negotiating resources from the ResourceManager as per the client application’s requirements and needs. The ApplicationMaster works with the NodeManager to execute and monitor the containers and track the application’s progress. The ApplicationMaster itself runs in one of the containers allocated by the ResourceManager, and the ResourceManager tracks the progress of the ApplicationMaster.

The ApplicationMaster provides scalability to the YARN framework: because each ApplicationMaster provides functionality that is very similar to that of the traditional ResourceManager for its own application, the YARN cluster is able to scale with many hardware changes. Also, by moving all the application-specific code into the ApplicationMaster, YARN generalizes the system so that it can support multiple frameworks, just by writing a new ApplicationMaster.


Summary

In this chapter, you learned how to use the bundled applications that come with the YARN framework, how to develop the YARN Client and ApplicationMaster (the core parts of the YARN framework), how to submit an application to YARN, how to monitor an application, and the responsibilities of the ApplicationMaster.

In the next chapter, you will learn to write some real-time practical examples.


Chapter 7. YARN Frameworks

It’s the dawn of 2015, and big data is still in its booming stage. Many new start-ups and giants are investing huge amounts into developing POCs and new frameworks to cater to a new and emerging variety of problems. These frameworks are the new cutting-edge technologies or programming models that aim to solve problems across industries in the world of big data. As corporations try to use big data, they face a new and unique set of problems that they never faced before. Hence, to solve these new problems, many frameworks and programming models are coming onto the market.

YARN’s support for multiple programming models and frameworks makes it ideal for integration with these new and emerging frameworks or programming models. With YARN taking responsibility for resource management and other necessary things (scheduling jobs, fault tolerance, and so on), these new application frameworks can focus on solving the problems that they were specifically meant for.

At the time of writing this book, many new and emerging open source frameworks are already integrated with YARN.

In this chapter, we will cover the following frameworks that run on YARN:

Apache Samza
Storm on YARN
Apache Spark
Apache Tez
Apache Giraph
Hoya (HBase on YARN)
KOYA (Kafka on YARN)

We will talk in detail about Apache Samza and Storm on YARN, where we will develop and run some sample applications. For the other frameworks, we will have a brief discussion.


Apache Samza

Samza is an open source project from LinkedIn and is currently an incubation project at the Apache Software Foundation. Samza is a lightweight distributed stream-processing framework for real-time processing of data. The version that is available for download from the Apache website is not the production version that LinkedIn uses.

Samza is made up of the following three layers:

A streaming layer
An execution layer
A processing layer

Samza provides out-of-the-box support for all the preceding three layers:

Streaming: This layer is supported by Kafka (another open source project from LinkedIn)
Execution: This layer is supported by YARN
Processing: This layer is supported by the Samza API

The following three pieces fit together to form Samza:

The following architecture should be familiar to anyone who has used Hadoop:

Before going into each of these three layers in depth, it should be noted that Samza’s support is not limited to these systems. Both Samza’s execution and streaming layers are pluggable and allow developers to implement alternatives as required.

Samza is a stream-processing system that runs continuous computation on infinite streams of data.

Samza provides a system to process stream data from publish-subscribe systems such as Apache Kafka. The developer writes a stream-processing task and executes it as a Samza job. Samza then routes messages between the stream-processing tasks and the publish-subscribe systems that the messages are addressed to.

Samza works a lot like Storm, the Twitter-developed stream-processing technology, except that Samza runs on Kafka, LinkedIn’s own messaging system. Samza was developed with a pluggable architecture, enabling developers to use the software with other messaging systems.

Apache Samza is basically a combination of the following technologies:

Kafka: Samza uses Apache Kafka as its underlying message-passing system
Apache YARN: Samza also uses Apache YARN for task scheduling
ZooKeeper: Both YARN and Kafka, in turn, rely on Apache ZooKeeper for coordination

More information is available on the official site at http://samza.incubator.apache.org/.

We will use the hello-samza project to develop a sample application that performs some real-time stream processing.

We will write a Kafka producer using the Java Kafka APIs to publish a continuous stream of messages to a Kafka topic. Finally, we will write a Samza consumer using the Samza API to process these streams from the Kafka topic in real time. For simplicity, we will just print and record a message each time a message is received from the Kafka topic.


Writing a Kafka producer

Let’s first write a Kafka producer to publish messages to a Kafka topic (named storm-sentence):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintStream;
import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

/**
 * A simple Java class to publish messages into Kafka.
 *
 * @author nirmal.kumar
 */
public class KafkaStringProducerService {

  public Producer<String, String> producer;

  public Producer<String, String> getProducer() {
    return this.producer;
  }

  public void setProducer(Producer<String, String> producer) {
    this.producer = producer;
  }

  public KafkaStringProducerService(Properties prop) {
    setProducer(new Producer<String, String>(new ProducerConfig(prop)));
  }

  /**
   * Loads producer.properties. Change the location of the file passed in
   * from the main method as required. The file should define the
   * following properties:
   * kafka.zk.connect=192.xxx.xxx.xxx
   * serializer.class=kafka.serializer.StringEncoder
   * producer.type=async
   * queue.buffering.max.ms=5000000
   * queue.buffering.max.messages=1000000
   * metadata.broker.list=192.xxx.xxx.xxx:9092
   *
   * @param filepath the location of the producer.properties file
   * @return the loaded properties (empty if the file could not be read)
   */
  private static Properties getConfiguartionProperties(String filepath) {
    File path = new File(filepath);
    Properties properties = new Properties();
    // try-with-resources closes the stream even if loading fails
    try (FileInputStream in = new FileInputStream(path)) {
      properties.load(in);
    } catch (FileNotFoundException e) {
      e.printStackTrace();
    } catch (IOException e) {
      e.printStackTrace();
    }
    return properties;
  }

  /**
   * Publishes each message to Kafka.
   *
   * @param input the message to publish
   * @param ii the running count of published messages
   */
  public void execute(String input, int ii) {
    KeyedMessage<String, String> data =
        new KeyedMessage<String, String>("storm-sentence", input);
    this.producer.send(data);
    // Log the number of messages published to the console (every 100000)
    if ((ii != 0) && (ii % 100000 == 0))
      System.out.println("$$$$$$$ PUBLISHED " + ii + " messages @ "
          + System.currentTimeMillis());
  }

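  /**
   * Reads each line from the input message file.
   *
   * @param file the path of the file to read
   * @return the file content, with each line terminated by the line separator
   * @throws IOException if the file cannot be read
   */
  private static String readFile(String file) throws IOException {
    String line = null;
    StringBuilder stringBuilder = new StringBuilder();
    String ls = System.getProperty("line.separator");
    // try-with-resources closes the reader when reading is done
    try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
      while ((line = reader.readLine()) != null) {
        stringBuilder.append(line);
        stringBuilder.append(ls);
      }
    }
    return stringBuilder.toString();
  }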

  /**
   * Main method for invoking the Java application.
   * Two command-line arguments are needed: the number of messages to
   * publish, and the absolute path of the file containing the String
   * message.
   *
   * @param args the number of messages, followed by the message file path
   */
  public static void main(String[] args) {
    // Start at 1 so that exactly noOfMessages messages are published
    int ii = 1;
    int noOfMessages = Integer.parseInt(args[0]);
    String s = null;
    try {
      s = readFile(args[1]);
    } catch (IOException e) {
      e.printStackTrace();
      // Without a message there is nothing to publish
      return;
    }
    /**
     * Instantiate the main class.
     * Change the location of producer.properties accordingly.
     */
    KafkaStringProducerService service = new KafkaStringProducerService(
        getConfiguartionProperties("/home/cloud/producer.properties"));
    System.out.println("******** START: Publishing " + noOfMessages
        + " messages @ " + System.currentTimeMillis());
    while (ii <= noOfMessages) {
      // Invoke the execute method to publish messages into Kafka
      service.execute(s, ii);
      ii++;
    }
    System.out.println("####### END: Published " + noOfMessages
        + " messages @ " + System.currentTimeMillis());
    try {
      service.producer.close();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

Create the producer.properties file at /home/cloud/producer.properties (or at another location of your choice) and specify that location in the preceding Kafka producer Java class.

The producer.properties file will have the following information:
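The keys expected in this file are the ones listed in the Javadoc of the producer class. As a minimal sketch, the same settings can also be assembled in Java and inspected before being handed to the producer. The class name ProducerPropertiesSketch and its build method are hypothetical, and the two addresses are left as the masked placeholders used in the Javadoc; substitute the addresses of your own ZooKeeper and Kafka broker.

```java
import java.util.Properties;
import java.util.TreeSet;

/** Hypothetical helper that assembles the producer settings listed in the Javadoc. */
final class ProducerPropertiesSketch {

    /** Builds the six producer properties for the given ZooKeeper and broker addresses. */
    static Properties build(String zkConnect, String brokerList) {
        Properties props = new Properties();
        props.setProperty("kafka.zk.connect", zkConnect);
        props.setProperty("serializer.class", "kafka.serializer.StringEncoder");
        props.setProperty("producer.type", "async");
        props.setProperty("queue.buffering.max.ms", "5000000");
        props.setProperty("queue.buffering.max.messages", "1000000");
        props.setProperty("metadata.broker.list", brokerList);
        return props;
    }

    public static void main(String[] args) {
        // The masked addresses are placeholders, exactly as they appear in the Javadoc.
        Properties props = build("192.xxx.xxx.xxx", "192.xxx.xxx.xxx:9092");
        // Print the entries in key order, in the key=value form used by a .properties file.
        for (String key : new TreeSet<String>(props.stringPropertyNames())) {
            System.out.println(key + "=" + props.getProperty(key));
        }
    }
}
```

A Properties object built this way could be passed straight to the KafkaStringProducerService constructor instead of being loaded from disk, which can be convenient in tests.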


Writing the hello-samza project

Let’s now write a Samza consumer and package it with the hello-samza project:

1. Download and build the hello-samza project. Check out the hello-samza project:

git clone git://git.apache.org/incubator-samza-hello-samza.git hello-samza
cd hello-samza

The output of the preceding commands can be seen here:

2. Next, we will write a Samza consumer using the Samza API to process these N messages from a Kafka topic. Go to hello-samza/samza-wikipedia/src/main/java/samza/examples/wikipedia/task and write the YarnEssentialsSamzaConsumer.java file as follows:

3. After writing the Samza consumer class in the hello-samza project, you will need to build the project:

mvn clean package

4. Create a samza directory inside the deploy directory:

mkdir -p deploy/samza

5. Finally, extract the Samza job package into that directory:

tar -xvf ./samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz -C deploy/samza

6. For the Samza consumer properties, go to /home/cloud/hello-samza/deploy/samza/config.

7. Write a samza-test-consumer.properties file as follows:

This properties file will mainly contain the following information:

job.name: This is the name of the Samza job
yarn.package.path: This is the path of the Samza job package
task.class: This is the class of the actual Samza consumer
task.inputs: This is the name of the Kafka topic from which the published messages will be read
systems.kafka.consumer.zookeeper.connect: This is the ZooKeeper connection information
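Before submitting the job, it can help to confirm that the properties file actually defines the keys listed above. The following is a minimal sketch of such a check in plain Java. SamzaJobPropertiesCheck and missingKeys are hypothetical names, and every value set in the main method is an illustrative assumption rather than something taken from the book (the task class name is merely derived from the directory and file name given in step 2).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

/** Hypothetical pre-flight check for the Samza job properties described above. */
final class SamzaJobPropertiesCheck {
    /** The keys named in the list above. */
    static final List<String> REQUIRED_KEYS = Arrays.asList(
        "job.name",
        "yarn.package.path",
        "task.class",
        "task.inputs",
        "systems.kafka.consumer.zookeeper.connect");

    /** Returns the required keys that are absent or blank, in the order listed above. */
    static List<String> missingKeys(Properties props) {
        List<String> missing = new ArrayList<String>();
        for (String key : REQUIRED_KEYS) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // All values below are illustrative assumptions.
        props.setProperty("job.name", "yarn-essentials-samza-consumer");
        props.setProperty("task.class",
            "samza.examples.wikipedia.task.YarnEssentialsSamzaConsumer");
        props.setProperty("task.inputs", "kafka.storm-sentence");
        // Two of the five keys are deliberately left out to show the check at work.
        System.out.println("Missing keys: " + missingKeys(props));
    }
}
```

In practice the Properties object would be loaded from samza-test-consumer.properties with Properties.load rather than filled in by hand.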

Starting a grid

A Samza grid usually comprises three different systems: YARN, Kafka, and ZooKeeper. The hello-samza project comes with a script called grid to help you set up these systems. Start by running the following command:

bin/grid bootstrap

This command will download, install, and start ZooKeeper, Kafka, and YARN. It will also check out the latest version of Samza and build it. All the package files will be put in a subdirectory called deploy inside the hello-samza project’s root folder. The result of the preceding command is shown here:


The following screenshot shows that ZooKeeper, YARN, and Kafka are being started:

Once all the processes are up and running, you can check them, as shown in this screenshot:

The YARN ResourceManager web UI will look like this:


The YARN NodeManager web UI will look like this:

Now that we have started the grid, let’s deploy the Samza job to it:

deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file:/home/cloud/hello-samza/deploy/samza/config/samza-test-consumer.properties

Check the application processes and the RM UI. As you can see in the following screenshot, running the Samza job first creates a SamzaAppMaster and then a SamzaContainer to run the consumer that we wrote:

The ResourceManager web UI now shows the Samza application up and running:

The ApplicationMaster UI looks as follows:

The following screenshot shows the ApplicationMaster UI:

Now that our Samza consumer is up and running and listening for messages in the Kafka topic (named storm-sentence), let’s publish some messages to the Kafka topic using the Kafka producer we wrote initially. The following Java command is used to invoke the Kafka producer, which takes two command-line arguments:

N: This is the number of times the message is published into Kafka
{pathOfFileNameHavingMessage}: This is the path of the file that contains the actual string message
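The two arguments can be validated before any message is sent. The following is a minimal sketch, assuming the count comes first and the file path second, as the list above describes. ProducerArgs and parse are hypothetical names that are not part of the producer class, and the directory used in the main method is only an example.

```java
/** Hypothetical holder for the two command-line arguments of the producer. */
final class ProducerArgs {
    final int messageCount;
    final String messageFilePath;

    private ProducerArgs(int messageCount, String messageFilePath) {
        this.messageCount = messageCount;
        this.messageFilePath = messageFilePath;
    }

    /** Parses and validates the arguments; throws IllegalArgumentException on bad input. */
    static ProducerArgs parse(String[] args) {
        if (args == null || args.length != 2) {
            throw new IllegalArgumentException(
                "Usage: <N> <pathOfFileNameHavingMessage>");
        }
        int count;
        try {
            count = Integer.parseInt(args[0]);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("N must be an integer: " + args[0], e);
        }
        if (count < 1) {
            throw new IllegalArgumentException("N must be at least 1: " + count);
        }
        return new ProducerArgs(count, args[1]);
    }

    public static void main(String[] args) {
        // The directory is an example; only the file name comes from the text.
        ProducerArgs parsed = parse(new String[] {"10", "/home/cloud/strmsg10K.txt"});
        System.out.println(parsed.messageCount + " x " + parsed.messageFilePath);
    }
}
```

Failing fast with a usage message is a design choice of this sketch; it avoids an ArrayIndexOutOfBoundsException or NumberFormatException surfacing from deep inside the producer.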

Create any file containing a string message (strmsg10K.txt) and pass this file name and path as the second command-line argument to the Java command, as shown in the following screenshot:

As soon as these messages are published to the Kafka topic, the Samza consumer consumes them and prints the timestamp, as written in the Samza consumer code.

The result after checking the Samza consumer logs is as follows:


Storm-YARN

Apache Storm is an open source distributed real-time computation system from Twitter.

Storm helps in processing unbounded streams of data in a reliable manner. Storm can be used with any programming language. Some of the most common use cases of Storm are real-time analytics, real-time machine learning, continuous computation, ETL, and many more.

Storm-YARN is a project from Yahoo that enables a Storm cluster to be deployed and managed by YARN. Earlier, separate clusters were needed for Hadoop and Storm.

One major benefit that comes with this integration is elasticity. Batch processing (Hadoop MapReduce) is usually done on the basis of need, whereas real-time processing (Storm) is an ongoing process. When the Hadoop cluster is idle, you can leverage it for any real-time processing work.

In a typical real-time processing use case, constant and predictable loads are very rare. Storm, therefore, will need more resources during peak time, when the load is greater. At peak time, Storm can steal resources from the batch jobs and give them back when the load is lower.

This way, the overall resource utilization can scale up and down depending on the load and demand. This elasticity is, therefore, useful for sharing the available resources between real-time and batch processing on the basis of demand.

Another benefit is that this integration reduces the physical distance of data transfers between Storm and Hadoop. Many applications use both Storm and Hadoop on separate clusters while sharing data between them (MapReduce). For such a scenario, Storm-YARN reduces network transfers, and in turn the total cost of acquiring the data, as they share the same cluster, as shown in the following image:


Referring to the preceding diagram, Storm-YARN asks YARN’s ResourceManager to launch a Storm ApplicationMaster. The Storm ApplicationMaster then launches a Storm Nimbus server and a Storm UI server locally. It also uses YARN to allocate resources for the supervisors and finally launches them.

We will now install Storm-YARN on a Hadoop YARN cluster and deploy some Storm topologies to the cluster.


Prerequisites

The following are the prerequisites for Storm-YARN.

Hadoop YARN should be installed

Refer to the Hadoop YARN installation guide at http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/SingleCluster.html.

The Master Thrift service of Storm-on-YARN uses port 9000, and if Storm-YARN is launched from the NameNode, there will be a port clash.

In this case, you will need to change the port of the NameNode in your Hadoop installation. Typically, the following processes should be up and running in Hadoop:

Apache ZooKeeper should be installed

At the time of writing this book, the Storm-on-YARN ApplicationMaster implementation does not include running ZooKeeper on YARN. Therefore, it is presumed that there is a ZooKeeper cluster already running to enable communication between Nimbus and the workers.

There is an open issue about this, though, at https://github.com/yahoo/storm-yarn/issues/22.

Installing ZooKeeper is very straightforward and easy.

Refer to http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html.


Setting up Storm-YARN

Storm-YARN is basically an implementation of the YARN client and ApplicationMaster for Storm.

The client gets a new application ID for Storm and submits the application, and the ApplicationMaster sets up the Storm components (Nimbus, Supervisor, and so on) on YARN using the containers that the ApplicationMaster requests from the ResourceManager.

Note that Storm-on-YARN is not a new implementation of Storm that works on YARN. Frameworks (that is, Samza, Storm, Spark, Tez, and so on) themselves do not need to be modified to be able to run on YARN. Only the ApplicationMaster and the YARN client code need to be written for each of the frameworks so that they run on YARN as an application just like any other. Now, proceed with the following steps:

1. Clone the Storm-YARN repository from Git:

cd storm-on-yarn-poc/
git clone https://github.com/yahoo/storm-yarn.git
cd storm-yarn

The Storm client machine refers to the machine that will submit the YARN client and ApplicationMaster to the ResourceManager.

As of now, there is a single release of Storm-on-YARN from Yahoo that contains both Storm-YARN and Storm versions (0.9.0-wip21). The Storm release is present in the lib directory of the extracted Storm-on-YARN release.

2. Build Storm-YARN using Maven:

mvn package

or

mvn package -DskipTests

3. We will get the following output:

[INFO] Scanning for projects...
[INFO]
[INFO] Using the builder org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder with a thread count of 1
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building storm-yarn 1.0-alpha
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] Compiling 5 source files to /home/nirmal/storm-on-yarn-poc/storm-yarn-master/target/test-classes
[INFO]
[INFO] --- maven-jar-plugin:2.4:jar (default) @ storm-yarn ---
[INFO]
[INFO] --- maven-surefire-plugin:2.10:test (default-test) @ storm-yarn ---
[INFO] Tests are skipped.
[INFO]
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ storm-yarn ---
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10.153s
[INFO] Finished at: 2014-11-12T15:57:49+05:30
[INFO] Final Memory: 10M/118M
[INFO] ------------------------------------------------------------------------
[INFO] Final Memory: 14M/152M
[INFO] ----------------------------------------------------

4. Next, you will need to copy the storm.zip file from storm-yarn/lib to HDFS. This is because Storm-on-YARN deploys a copy of the Storm code to all the nodes of the YARN cluster using HDFS. However, the location from which this copy of the Storm code is fetched is hardcoded into the Storm-on-YARN client. First, create the target directory in HDFS using the following command:

hdfs dfs -mkdir -p /lib/storm/0.9.0-wip21

Alternatively, you can also use the following command:

hadoop fs -mkdir -p /lib/storm/0.9.0-wip21

Then, copy the storm.zip file to HDFS:

hdfs dfs -put /home/nirmal/storm-on-yarn-poc/storm-yarn-master/lib/storm.zip /lib/storm/0.9.0-wip21/storm.zip

You can also use the following command:

hadoop fs -put /home/nirmal/storm-on-yarn-poc/storm-yarn-master/lib/storm.zip /lib/storm/0.9.0-wip21/storm.zip

The exact version of Storm might differ, in your case, from 0.9.0-wip21.

5. Create a directory to hold our Storm configuration:

mkdir -p /home/nirmal/storm-on-yarn-poc/storm-data/
cp /home/nirmal/storm-on-yarn-poc/storm-yarn-master/lib/storm.zip /home/nirmal/storm-on-yarn-poc/storm-data/
cd /home/nirmal/storm-on-yarn-poc/storm-data
unzip storm.zip

6. Add the following configuration to the storm.yaml file located at /home/nirmal/storm-on-yarn-poc/storm-data/storm-0.9.0-wip21/conf. You can change the following values as per your setup:

storm.zookeeper.servers: localhost
nimbus.host: localhost
master.initial-num-supervisors: 2
master.container.size-mb: 1024


7. Add the Storm and storm-yarn bin folders to your PATH variable:

export PATH=$PATH:/home/nirmal/storm-on-yarn-poc/storm-data/storm-0.9.0-wip21/bin:/home/nirmal/storm-on-yarn-poc/storm-yarn-master/bin

8. Finally, launch Storm-YARN using the following command:

storm-yarn launch /home/nirmal/storm-on-yarn-poc/storm-data/storm-0.9.0-wip21/conf/storm.yaml

Launching Storm-YARN executes the Storm-YARN client, which gets an app ID from YARN’s ResourceManager and starts running the Storm-YARN ApplicationMaster. The ApplicationMaster then starts the Nimbus, Workers, and Supervisor services. You will get an output similar to the one shown in the following screenshot:

9. We can retrieve the status of our application using the following YARN command:

yarn application -list

We will get the status of our application as follows:

10. You can also see Storm-YARN running on the ResourceManager web UI at http://localhost:8088/cluster/:

11. Nimbus should also be running now, and you should be able to see it through the Nimbus web UI at http://localhost:7070/. This looks as follows:

12. The following processes should be up and running:


Getting the storm.yaml configuration of the launched Storm cluster

The machine that will use the Storm client command to submit a new topology to Storm needs the storm.yaml configuration file of the Storm cluster launched on YARN to be stored in /home/nirmal/.storm/storm.yaml.

Normally, when Storm is not run on YARN, this configuration file is edited manually, so you need to know the IP addresses of the Storm components. However, since the location where the Storm components run on YARN depends on the location of the allocated containers, Storm-on-YARN is responsible for generating storm.yaml for us. You can fetch this storm.yaml file from the running Storm-on-YARN instance. In the last command, replace <appId> with the application ID shown on the YARN application UI at port 8088:

$ cd
$ mkdir .storm/
$ storm-yarn getStormConfig -appId <appId> -output /home/nirmal/.storm/storm.yaml


Building and running Storm-Starter examples

In this section, we will see how to get the example code from GitHub, build it using Maven, and finally, run the examples. To perform these tasks, you’ll have to execute the following steps:

1. Get the code from GitHub. We will use the storm-starter project from GitHub:

git clone https://github.com/nathanmarz/storm-starter
Cloning into 'storm-starter'...
remote: Counting objects: 756, done.
remote: Total 756 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (756/756), 171.81 KiB | 56.00 KiB/s, done.
Resolving deltas: 100% (274/274), done.
Checking connectivity... done

2. Next, go to the downloaded storm-starter directory:

cd storm-starter/

3. Check the content using the following command:

ls -ltr
-rw-r--r-- 1 nirmal nirmal 171 Nov 12 12:58 README.markdown
-rw-r--r-- 1 nirmal nirmal 5047 Nov 12 12:58 m2-pom.xml
drwxr-xr-x 3 nirmal nirmal 4096 Nov 12 12:58 multilang
-rw-r--r-- 1 nirmal nirmal 580 Nov 12 12:58 LICENSE
drwxr-xr-x 4 nirmal nirmal 4096 Nov 12 12:58 src
-rw-r--r-- 1 nirmal nirmal 929 Nov 12 12:58 project.clj
drwxr-xr-x 3 nirmal nirmal 4096 Nov 12 12:58 test
-rw-r--r-- 1 nirmal nirmal 8042 Nov 12 12:58 storm-starter.iml

4. Build the storm-starter project using Maven:

mvn -f m2-pom.xml package

or

mvn -f m2-pom.xml package -DskipTests

5. You will see output similar to the following:

[INFO] Scanning for projects…
[INFO] Using the builder org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder with a thread count of 1
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building storm-starter 0.0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] META-INF/MANIFEST.MF already added, skipping
[INFO] META-INF/ already added, skipping
[INFO] META-INF/maven/ already added, skipping
[INFO] Building jar: /home/nirmal/storm-on-yarn-poc/storm-starter/target/storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar
[INFO] META-INF/MANIFEST.MF already added, skipping
[INFO] META-INF/ already added, skipping
[INFO] META-INF/maven/ already added, skipping
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 05:21 min
[INFO] Finished at: 2014-11-12T13:05:40+05:30
[INFO] Final Memory: 30M/191M
[INFO] ------------------------------------------------------------------------

6. After the build is successful, you will see the following JAR file being created under the target directory:

storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar

7. Run the Storm topology example on the Storm-YARN cluster:

storm jar storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar storm.starter.WordCountTopology word-count-topology

The output can be seen in the following screenshot:

8. Click on the topology, as shown in the following screenshot:


Apache Spark

Apache Spark is a fast and general engine for large-scale data processing. It was originally developed in 2009 in UC Berkeley’s AMPLab and open sourced in 2010.

The main features of Spark are as follows:

Speed: Spark enables applications in Hadoop clusters to run up to 100x faster in memory and 10x faster even when running on disk.
Ease of use: Spark lets you quickly write applications in Java, Scala, or Python. You can use it interactively to query big datasets from the Scala and Python shells.
Runs everywhere: Spark runs in its standalone cluster mode, on EC2 or elsewhere in the cloud, on Hadoop YARN, or on Apache Mesos. It can access diverse data sources, including HDFS, Cassandra, HBase, S3, and any Hadoop data source.
Generality: Spark powers a stack of high-level tools, including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these frameworks seamlessly in the same application.


Why run on YARN?

YARN enables Spark to run in a single cluster alongside other frameworks, such as Tez, Storm, HBase, and others. This avoids the need to create and manage separate and dedicated Spark clusters.

Typically, customers want to run multiple workloads on a single dataset in a single cluster. YARN, as a generic resource management and single data platform for all different frameworks/engines, makes it happen.

YARN’s built-in multitenancy support allows dynamic and optimal sharing of the same cluster resources between different frameworks that run on YARN.

YARN has pluggable schedulers to categorize, isolate, and prioritize workloads.


Apache Tez

Apache Tez is part of the Stinger initiative led by Hortonworks to make Hive enterprise ready and suitable for interactive SQL queries. The Tez design is based on research done by Microsoft on parallel and distributed computing.

Tez entered the Apache Incubator in February 2013 and graduated to a top-level project in July 2014.

Tez is basically an embeddable and extensible framework to build high-performance batch and interactive data-processing applications that need to integrate easily with YARN.

Confusion often arises when Tez is thought of as an engine. Tez is not a general-purpose engine, but more of a framework on which tools build engines for their own needs. Tez, for example, enables Hive, Pig, and others to build their own purpose-built engines and embed them in those technologies. Projects such as Hive, Pig, and Cascading now have significant improvements in response times when they use Tez instead of MapReduce.

Tez generalizes the MapReduce paradigm to a more powerful framework based on expressing computations as a dataflow graph. Tez exists to address some of the limitations of MapReduce. For example, a typical MapReduce job stores a lot of temporary data (such as each mapper’s output, which is written to disk), and this disk I/O is an overhead. In the case of Tez, this disk I/O of temporary data is avoided, which results in higher performance compared to the MapReduce model.

Also, Tez can adjust the parallelism of reduce tasks at runtime, depending on the actual data size coming out of the previous task. In MapReduce, on the other hand, the number of reducers is static and has to be decided by the user before the job is submitted to the cluster.
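The contrast between a static and a runtime-chosen reducer count can be sketched in a few lines of Python. This is only an illustration of the idea and not Tez's actual algorithm: the function name choose_reducer_count, the target of 256 MB per reducer, and the cap of 100 reducers are all assumptions made up for the example.

```python
import math

def choose_reducer_count(upstream_output_bytes,
                         bytes_per_reducer=256 * 1024 * 1024,
                         max_reducers=100):
    """Pick a reducer count from the data size actually produced upstream.

    Hypothetical helper: the 256 MB target and the cap of 100 are
    illustrative values, not Tez defaults.
    """
    if upstream_output_bytes <= 0:
        return 1
    needed = math.ceil(upstream_output_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, needed))

# MapReduce style: the user fixes the number before the job is submitted.
static_reducers = 50

# Runtime style: the number is derived from the observed output size.
small_output = 300 * 1024 * 1024          # 300 MB came out of the previous task
large_output = 40 * 1024 * 1024 * 1024    # 40 GB came out of the previous task
print(choose_reducer_count(small_output))  # 2
print(choose_reducer_count(large_output))  # 100
```

With a fixed count of 50, the small input would leave most reducers nearly idle, while the runtime choice uses only two.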

The processing done by multiple MapReduce jobs can now be done by a single Tez job, as follows:


Referring to the preceding diagram, earlier (with Pig/Hive), we needed multiple MapReduce jobs to do some processing. Now, with Tez, a single job does the same; that is, the reducers (the green boxes) of the previous step feed the mappers (the blue boxes) of the next step directly.

The preceding image is taken from http://www.infoq.com/articles/apache-tez-saha-murthy.

Tez is not meant directly for end users; in fact, it enables developers to build end-user applications with much better performance and flexibility. Traditionally, Hadoop has been a batch-processing platform to process large amounts of data. However, there are a lot of use cases for near-real-time performance of query processing. There are also several workloads, such as machine learning, that do not fit into the MapReduce paradigm. Tez helps Hadoop address these use cases.

Tez provides an expressive dataflow-definition API that lets developers create their own unique data-processing graphs (DAGs) to represent their applications’ data-processing flows. Once the developer defines a flow, Tez provides additional APIs to inject custom business logic that will run in that flow. These APIs combine inputs (that read data), outputs (that write data), and processors (that process data) to process the flow.
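To make the idea of a data-processing graph concrete, here is a minimal Python sketch of a DAG whose vertices are processors and whose edges carry one vertex's output to the next vertex's input. It is not the Tez API (which is Java); run_dag and the vertex names are hypothetical and exist only for this illustration.

```python
from collections import Counter
from graphlib import TopologicalSorter  # Python 3.9+

def run_dag(vertices, edges, source):
    """Run each vertex's processor once all of its upstream vertices are done.

    vertices: name -> function taking a list of upstream results
    edges:    name -> list of upstream vertex names
    source:   the initial input handed to vertices with no upstream
    """
    results = {}
    # static_order() yields every vertex after all of its predecessors.
    for name in TopologicalSorter(edges).static_order():
        upstream = [results[u] for u in edges.get(name, [])]
        results[name] = vertices[name](upstream if upstream else [source])
    return results

# A three-vertex word count: each stage feeds the next one directly,
# with no separate job (and no intermediate files) between the stages.
vertices = {
    "tokenize": lambda ins: [w for line in ins[0] for w in line.split()],
    "count": lambda ins: Counter(ins[0]),
    "top": lambda ins: ins[0].most_common(1),
}
edges = {"tokenize": [], "count": ["tokenize"], "top": ["count"]}

results = run_dag(vertices, edges, ["a b a", "b a"])
print(results["top"])  # [('a', 3)]
```

Expressed as chained MapReduce jobs, the same flow would need one job per stage, each writing its output before the next could start.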

Tez can also run any existing MR job without any modification. For more information on Tez, refer to http://tez.apache.org/.


Apache Giraph

Apache Giraph is a graph-processing system that uses the MapReduce model to process graphs. It started in the Apache Incubator and has been a top-level project of the Apache Software Foundation since 2012.

It is based on Google’s Pregel, which is used to calculate page rank.

Currently, Giraph is being used by Facebook, Twitter, and LinkedIn to create social graphs of their users. Both Giraph and Pregel are based on the Bulk Synchronous Parallel (BSP) model of distributed computation, which was introduced by Leslie Valiant.

Support for YARN is available from release 1.1.0. For more information, refer to the official site at http://giraph.apache.org/.


HOYA (HBase on YARN)

Hoya is basically HBase running on YARN. It is currently hosted on GitHub, but there are plans to move it to the Apache Foundation.

Hoya creates HBase clusters on top of YARN. It does this with a client application called the Hoya client; this application creates the persistent configuration files, sets up the HBase cluster XML files, and then asks YARN to create an ApplicationMaster, which is the Hoya AM here.

For more information, refer to https://github.com/hortonworks/hoya, http://hortonworks.com/blog/introducing-hoya-hbase-on-yarn/ and http://hortonworks.com/blog/hoya-hbase-on-yarn-application-architecture/.


KOYA (Kafka on YARN)

On November 5, 2014, DataTorrent, a company founded by ex-Yahoo! employees, announced a new project to bring the fault-tolerant, high-performance, scalable Apache Kafka messaging system to YARN.

The so-called Kafka on YARN (KOYA) project plans to leverage YARN for Kafka broker management, automatic broker recovery, and more. Planned features include a fully HA ApplicationMaster, sticky allocation of containers (so that a restart can access local data), a web interface for Kafka, and more.

The expected release to the open source community is sometime in Q2 2015.

More information is available at https://www.datatorrent.com/introducing-koya-apache-kafka-on-apache-hadoop-2-0-yarn/.


Summary

This chapter talked about the different frameworks and programming models that can be run on YARN. We discussed Apache Samza and Storm on YARN in detail.

With the wide acceptance of YARN in the industry, more and more frameworks will support YARN, taking complete advantage of YARN’s generic features.

We looked at the existing frameworks that are integrated with YARN at the moment.

There is a lot more work going on in the industry to make existing and new applications run on YARN.

In Chapter 8, Failures in YARN, we will discuss how faults and failures at various levels are handled in YARN.


Chapter 8. Failures in YARN

Dealing with failures in distributed systems is challenging and time consuming. Also, the Hadoop and YARN frameworks run on commodity hardware, and cluster sizes nowadays vary from several nodes to several thousand nodes. So handling failure scenarios and dealing with ever-growing scaling issues is very important. In this chapter, we will focus on failures in the YARN framework: the causes of failures and how to overcome them.

In this chapter, we will cover the following topics:

ResourceManager failures
ApplicationMaster failures
NodeManager failures
Container failures
Hardware failures

We will be dealing with the root causes of these failures and the solutions to them.


ResourceManager failures

In the initial versions of the YARN framework, a ResourceManager failure meant a total cluster failure, as the ResourceManager was a single point of failure. The ResourceManager stores the state of the cluster, such as the metadata of the submitted applications, information on cluster resource containers, information on the cluster’s general configuration, and so on. Therefore, if the ResourceManager went down because of some hardware failure, there was no way to avoid manually debugging the cluster and restarting the ResourceManager. While the ResourceManager was down, the cluster was unavailable, and once it was restarted, all jobs needed a restart, so half-completed jobs lost their data and had to be run again. In short, a restart of the ResourceManager used to restart all the running ApplicationMasters.

The latest versions of YARN address this problem in two ways. One way is by creating an active-passive ResourceManager architecture, so that when one ResourceManager goes down, another becomes active and takes responsibility for the cluster. The ResourceManager state can be seen in the following image:

Another way is by using a ZooKeeper ResourceManager quorum, so that the ResourceManager state is stored externally in ZooKeeper. One ResourceManager is in an active state and one or more ResourceManagers are in passive mode, waiting for an event that brings them to an active state. The ResourceManager’s state can be seen in the following image:


In the preceding diagram, you can see that the ResourceManager’s state is managed by ZooKeeper. Whenever the active ResourceManager fails, its stored state is made available to the passive ResourceManager(s), one of which changes to an active state and takes over responsibility for the cluster with minimal downtime.
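The active-passive arrangement with an external state store can be illustrated with a small Python simulation. It is a sketch under loud assumptions: StateStore stands in for ZooKeeper, elect_active is a deliberately naive election, and none of the class or function names come from YARN itself.

```python
class StateStore:
    """Stands in for the external store (ZooKeeper in the text)."""
    def __init__(self):
        self.applications = {}   # application metadata survives an RM crash
        self.active_rm = None    # name of the RM currently holding the active role

class ResourceManager:
    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.alive = True

    def is_active(self):
        return self.alive and self.store.active_rm == self.name

    def submit(self, app_id, metadata):
        if not self.is_active():
            raise RuntimeError(f"{self.name} is not the active ResourceManager")
        # State goes to the shared store, not to this process's memory.
        self.store.applications[app_id] = metadata

def elect_active(rms, store):
    """Promote the first live RM if the current active one is gone (naive election)."""
    current = next((rm for rm in rms if rm.name == store.active_rm), None)
    if current is None or not current.alive:
        store.active_rm = next((rm.name for rm in rms if rm.alive), None)
    return store.active_rm

store = StateStore()
rm1, rm2 = ResourceManager("rm1", store), ResourceManager("rm2", store)
elect_active([rm1, rm2], store)          # rm1 becomes active
rm1.submit("app_1", {"user": "nirmal"})

rm1.alive = False                        # the active RM crashes
elect_active([rm1, rm2], store)          # rm2 takes over
print(store.active_rm, store.applications)
```

Because the application metadata lives in the store rather than in rm1, rm2 sees app_1 as soon as it becomes active.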


ApplicationMaster failures

Recovering the application’s state after a restart caused by an ApplicationMaster failure is the responsibility of the ApplicationMaster itself. When the ApplicationMaster fails, the ResourceManager simply starts another container with a new ApplicationMaster running in it for another application attempt. It is the responsibility of the new ApplicationMaster to recover the state of the older ApplicationMaster, and this is possible only when ApplicationMasters persist their state in an external location so that it can be used for future reference. Alternatively, an ApplicationMaster can simply rerun the application from scratch instead of recovering its state.

For example, an ApplicationMaster can recover its completed jobs. However, jobs that were still running when the ApplicationMaster failed, or that were halted for some reason during the recovery timeframe, have their state discarded, and the ApplicationMaster simply reruns them from scratch.

The YARN framework is capable of rerunning the ApplicationMaster a specified number of times and recovering the completed tasks.
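The retry-and-recover behavior can be sketched as follows. This is a toy model, not YARN code: run_application and crash_plan are hypothetical names, and the checkpoint set plays the role of the external location where an ApplicationMaster persists its state. The max_attempts parameter loosely corresponds to the limit that YARN reads from the yarn.resourcemanager.am.max-attempts property.

```python
def run_application(tasks, max_attempts, checkpoint, crash_plan):
    """Run tasks across ApplicationMaster attempts.

    checkpoint: a set kept outside the AM, so it survives an AM crash
    crash_plan: attempt number -> task at which that attempt crashes
                (a hook for this demo only)
    Returns the attempt number that finished, or None if attempts ran out.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            for task in tasks:
                if task in checkpoint:
                    continue  # completed by an earlier attempt: recovered, not rerun
                if crash_plan.get(attempt) == task:
                    raise RuntimeError(f"AM attempt {attempt} died at {task}")
                checkpoint.add(task)
            return attempt
        except RuntimeError:
            continue  # the RM starts a new attempt in a fresh container
    return None

checkpoint = set()
finished_by = run_application(["t1", "t2", "t3"], max_attempts=2,
                              checkpoint=checkpoint, crash_plan={1: "t2"})
print(finished_by, sorted(checkpoint))  # 2 ['t1', 't2', 't3']
```

The second attempt skips t1 because it is already recorded in the checkpoint, and reruns only t2 (which was in flight when the first attempt died) and t3.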


NodeManager failures

Almost all nodes in the cluster run a NodeManager service daemon. The NodeManager takes care of executing a certain part of a YARN job on its individual machine, while other parts are executed on other nodes. For a 1000-node YARN cluster, there are probably around 999 NodeManagers running. So a NodeManager is indeed a per-node agent that takes care of an individual node in the cluster.

If a NodeManager fails, the ResourceManager detects this failure using a time-out (that is, it stops receiving heartbeats from the NodeManager). The ResourceManager then removes the NodeManager from its pool of available NodeManagers. It also kills all the containers running on that node and reports the failure to all running AMs. The AMs are then responsible for reacting to the node failure by redoing the work done by any containers that were running on that node during the fault.

If the fault causing the time-out is transient, the NodeManager will resynchronize with the ResourceManager. Similarly, if a new NodeManager joins the cluster, the ResourceManager notifies all ApplicationMasters about the availability of new resources.
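The time-out based detection described above can be sketched with a small liveness monitor. The class and method names are hypothetical and the 600-second expiry is only an illustrative value; time is passed in explicitly so that the example does not depend on a real clock.

```python
class LivenessMonitor:
    def __init__(self, expiry_seconds):
        self.expiry_seconds = expiry_seconds
        self.last_heartbeat = {}   # node -> time of the last heartbeat
        self.containers = {}       # node -> list of (container_id, app_id)

    def heartbeat(self, node, now):
        self.last_heartbeat[node] = now
        self.containers.setdefault(node, [])

    def launch(self, node, container_id, app_id):
        self.containers[node].append((container_id, app_id))

    def expire(self, now):
        """Drop timed-out nodes and return {app_id: [lost container ids]}.

        The returned mapping is what each ApplicationMaster would be told,
        so that it can redo the lost work elsewhere.
        """
        lost = {}
        for node, seen in list(self.last_heartbeat.items()):
            if now - seen > self.expiry_seconds:
                for container_id, app_id in self.containers.pop(node, []):
                    lost.setdefault(app_id, []).append(container_id)
                del self.last_heartbeat[node]   # removed from the pool
        return lost

monitor = LivenessMonitor(expiry_seconds=600)
monitor.heartbeat("node1", now=0)
monitor.heartbeat("node2", now=0)
monitor.launch("node1", "container_1", "app_1")
monitor.heartbeat("node2", now=500)      # node2 keeps reporting; node1 goes silent
lost = monitor.expire(now=700)
print(lost)  # {'app_1': ['container_1']}
```

A node that heartbeats again after being expired is simply added back by heartbeat(), which mirrors the resynchronization after a transient fault.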


Container failures

Whenever a container finishes, the ApplicationMaster is informed of this event by the ResourceManager. The ApplicationMaster then interprets the container status received through the ResourceManager as a success or a failure, based on the container’s exit status. The ApplicationMaster handles the failures of the job containers.

It is the responsibility of the application frameworks to manage container failures, and the responsibility of the YARN framework to provide the necessary information to the application framework. As part of the allocate API’s response, the ResourceManager returns information on finished containers to the corresponding ApplicationMaster. It is the responsibility of the ApplicationMaster to validate the container’s status, exit code, and diagnostic information and to take appropriate action. For example, the MapReduce ApplicationMaster retries map and reduce tasks by requesting new containers, until the configured number of task failures is reached for a single job.

To address container allocation failure scenarios, the ApplicationMaster requests containers through the allocate call, and a given AllocateResponse often does not return any containers. The allocate call should therefore be made periodically until all the requested containers are assigned. When a container arrives, the framework is guaranteed to have sufficient resources for it, and the ApplicationMaster will not receive more containers than it asked for. Also, the ApplicationMaster can make separate container requests (ResourceRequests), typically one per second.
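How an ApplicationMaster might act on completed-container reports can be sketched as below. The function handle_completed is hypothetical, exit status 0 is assumed to mean success, and the limit of 4 failed attempts per task is just an example value.

```python
def handle_completed(completed, attempts, max_attempts):
    """Decide what the AM does with each finished container.

    completed: list of (task_id, exit_status) pairs from the latest report
    attempts:  task_id -> failed attempts so far (updated in place)
    Returns (tasks to request new containers for, True if the job should fail).
    """
    retry = []
    for task_id, exit_status in completed:
        if exit_status == 0:
            continue                      # success: nothing to do
        attempts[task_id] = attempts.get(task_id, 0) + 1
        if attempts[task_id] >= max_attempts:
            return retry, True            # task exhausted its attempts: fail the job
        retry.append(task_id)             # ask for a new container and rerun
    return retry, False

attempts = {}
retry, failed = handle_completed([("map_0", 0), ("map_1", 1)], attempts, max_attempts=4)
print(retry, failed)  # ['map_1'] False

# map_1 keeps failing on every rerun.
for _ in range(3):
    retry, failed = handle_completed([("map_1", 1)], attempts, max_attempts=4)
print(retry, failed)  # [] True
```

YARN only delivers the exit status; the retry policy itself lives entirely in the application framework, which is why different frameworks can choose different limits.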


Hardware failures

As the Hadoop and YARN frameworks use commodity hardware for the cluster setup and scale from several nodes to several thousand nodes, all the components of Hadoop and YARN are designed on the assumption that hardware failures are very common. Therefore, these failures are handled automatically by the framework so that important data is not lost because of them. For this, Hadoop replicates data across nodes and racks, so that even if a whole rack fails, the data can be recovered from another node on another rack, and jobs can be restarted over another replica of the dataset to compute the results.
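Why replication across racks protects against the loss of a whole rack can be shown with a short sketch. It assumes a simplified rack-aware policy with three replicas (first copy on the writer's node, the other two on different nodes of one remote rack); the function names are hypothetical, and the sketch assumes at least two racks with at least two nodes in the remote one.

```python
def place_replicas(topology, writer_node):
    """Return three nodes for a block, spread over two racks.

    topology: rack name -> list of node names
    """
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    remote_rack = next(r for r in topology if r != local_rack)
    remote_nodes = topology[remote_rack]
    return [writer_node, remote_nodes[0], remote_nodes[1]]

def surviving_replicas(replicas, topology, failed_rack):
    """Replicas that are still readable after every node in failed_rack is lost."""
    return [n for n in replicas if n not in topology[failed_rack]]

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
replicas = place_replicas(topology, writer_node="n1")
print(replicas)                                         # ['n1', 'n3', 'n4']
print(surviving_replicas(replicas, topology, "rack1"))  # ['n3', 'n4']
print(surviving_replicas(replicas, topology, "rack2"))  # ['n1']
```

Whichever rack fails, at least one replica survives, so a restarted task can read the block from the remaining copy.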


Summary

In this chapter, we discussed YARN failure scenarios and how these are addressed in the YARN framework. In the next chapter, we will be focusing on alternative solutions for the YARN framework. We will also see a brief overview of the most common frameworks that are closely related to YARN.


Chapter 9. YARN – Alternative Solutions

During the development of YARN, many other organizations simultaneously identified the limitations of Hadoop 1.x and were actively involved in developing alternative solutions.

This chapter will briefly talk about such alternative solutions and compare them to YARN. Among the most common frameworks that are closely related to YARN are:

Mesos
Omega
Corona


Mesos

Mesos was originally developed at the University of California at Berkeley and later became open source under the Apache Software Foundation.

Mesos can be thought of as a highly available and fault-tolerant operating system kernel for your clusters. It’s a cluster resource manager that provides efficient resource isolation and sharing across multiple diverse cluster-computing frameworks.

Mesos can be compared to YARN in some aspects, but a complete quantitative comparison is not really possible.

We will talk about the architecture of Mesos and compare some of the architectural differences with respect to YARN. This way, we will have a high-level understanding of the main differences between the two frameworks.

The preceding figure shows the main components of Mesos. It basically consists of a master process that manages slave processes running on each cluster node, and Mesos applications (also called frameworks) that run tasks on these slaves.

For more information, please refer to the official site at http://mesos.apache.org/.

Here are the high-level differences between Mesos and YARN:

| Mesos | YARN |
|---|---|
| Mesos uses Linux container groups (http://lxc.sourceforge.net). Linux container groups are a stronger isolation but may have some additional overhead. | YARN uses simple Unix processes. |
| Mesos is primarily written in C++. | YARN is primarily written in Java with bits of native code. |
| Mesos supports both memory and CPU scheduling. | Currently, YARN only supports memory scheduling (for example, you request x containers of y MB each), but there are plans to extend it to other resources such as network and disk I/O resources. |
| Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them. | YARN has a request-based approach. It allows the ApplicationMaster to ask for resources based on various criteria, including locations, and also allows the requester to modify future requests based on what was given and on the current usage. |
| Mesos leverages a pool of central schedulers (for example, classic Hadoop or MPI). | YARN, on the other hand, has a per-job scheduler. Although YARN enables late binding of containers to tasks, where each individual job can perform local optimizations, the per-job ApplicationMaster might result in greater overhead than the Mesos approach. |
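The difference between offer-based and request-based allocation can be sketched in Python. Both functions are toy models that track a single resource (memory in MB); the names mesos_style and yarn_style, and the framework names in the usage, are made up for this illustration and do not correspond to any real API.

```python
def mesos_style(free, frameworks):
    """Offer-based: the master offers what is free to each framework in turn,
    and the framework decides how much of the offer to accept."""
    granted = {}
    for name, accept in frameworks.items():
        take = min(accept(free), free)   # framework-side decision
        granted[name] = take
        free -= take
    return granted

def yarn_style(free, requests):
    """Request-based: each ApplicationMaster states what it wants,
    and the scheduler decides how much of each request to grant."""
    granted = {}
    for name, wanted in requests.items():
        give = min(wanted, free)         # scheduler-side decision
        granted[name] = give
        free -= give
    return granted

offers = mesos_style(8192, {
    "spark": lambda offer: 4096 if offer >= 4096 else 0,  # all-or-nothing framework
    "mpi": lambda offer: offer,                           # takes whatever is offered
})
requests = yarn_style(8192, {"am1": 6144, "am2": 4096})
print(offers)    # {'spark': 4096, 'mpi': 4096}
print(requests)  # {'am1': 6144, 'am2': 2048}
```

In the first model the placement logic sits in the accept callbacks owned by each framework; in the second it sits in the central scheduler, which sees every request.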


Omega

Omega is Google’s next-generation cluster management system.

Omega is specifically focused on a cluster scheduling architecture that uses parallelism, shared state, and optimistic concurrency control.

From past experience, Google noticed that as clusters and their workloads grow, the scheduler is at risk of becoming a scalability bottleneck.

Google’s production job scheduler has experienced all of this. Over the years, it has evolved into a complicated, sophisticated system that is hard to change.

A schematic overview of the scheduling architectures can be seen in the following figure:


Google identified the following two prevalent scheduler architectures shown in the preceding figure:

Monolithic schedulers: These use a single, centralized scheduling algorithm for all jobs (Google’s existing scheduler is one of these). They do not make it easy to add new policies and specialized implementations, and may not scale up to the cluster sizes planned for the future.
Two-level schedulers: These have a single active resource manager that offers compute resources to multiple parallel, independent scheduler frameworks, as in Mesos and Hadoop On Demand (HOD). Their architectures do appear to provide flexibility and parallelism, but in practice their conservative resource visibility and locking algorithms limit both, and make it hard to place difficult-to-schedule “picky” jobs or to make decisions that require access to the state of the entire cluster.

The solution is Omega, a new parallel scheduler architecture built around shared state, using lock-free optimistic concurrency control, to achieve both implementation extensibility and performance scalability.
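Optimistic concurrency control over shared state can be sketched as follows. This is a generic illustration of the technique and not Omega's implementation: CellState, the per-machine version counters, and schedule are hypothetical names, and the example tracks only free CPUs.

```python
class CellState:
    """Shared cluster state with a version counter per machine."""
    def __init__(self, free_cpus):
        self.free = dict(free_cpus)
        self.version = {m: 0 for m in free_cpus}

    def snapshot(self):
        # Each scheduler works on its own private copy; no locks are taken.
        return dict(self.free), dict(self.version)

    def commit(self, machine, cpus, seen_version):
        """Apply a claim only if nobody changed the machine since the snapshot."""
        if self.version[machine] != seen_version or self.free[machine] < cpus:
            return False   # conflict: the caller must re-read and retry
        self.free[machine] -= cpus
        self.version[machine] += 1
        return True

def schedule(cell, cpus, max_retries=3):
    """Place a task needing `cpus`, retrying on conflicts."""
    for _ in range(max_retries):
        free, version = cell.snapshot()
        candidates = [m for m, c in free.items() if c >= cpus]
        if not candidates:
            return None
        machine = candidates[0]
        if cell.commit(machine, cpus, version[machine]):
            return machine
    return None

cell = CellState({"m1": 4, "m2": 4})

# Two schedulers read the state at the same time and both pick m1.
_, seen_by_a = cell.snapshot()
_, seen_by_b = cell.snapshot()
first = cell.commit("m1", 3, seen_by_a["m1"])    # succeeds
second = cell.commit("m1", 3, seen_by_b["m1"])   # rejected: m1 changed meanwhile
print(first, second)       # True False
print(schedule(cell, 3))   # m2 (the retry sees fresh state)
```

Every scheduler sees the whole cluster and proceeds in parallel; the cost is that a losing scheduler has to redo its decision when a conflict is detected at commit time.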

Omega’s approach reflects a greater focus on scalability, but makes it harder to enforce global properties, such as capacity, fairness, and deadlines.

For more information, refer to http://research.google.com/pubs/pub41684.html.


Corona

Corona is another work from Facebook, which is now open sourced and hosted on GitHub at https://github.com/facebookarchive/hadoop-20/tree/master/src/contrib/corona.

Facebook, with its huge peta-scale quantity of data, suffered serious performance-related issues with the classic MapReduce framework because of the single JobTracker taking care of thousands of jobs and doing a lot of work alone.

In order to solve these issues, Facebook created Corona, which separated cluster resource management from job coordination.

In Hadoop Corona, the cluster resources are tracked by a central ClusterManager. Each job gets its own Corona JobTracker, which tracks just that particular job.

Corona entirely redesigned the MapReduce architecture to bring better cluster utilization and job scheduling, just like YARN did.

Facebook’s goals in rewriting the Hadoop scheduling framework were not the same as YARN’s. Facebook wanted quick improvements in MapReduce, but only in the part that they were using. They had no interest in running multiple heterogeneous frameworks as YARN does, or in other key design considerations of YARN.

For Facebook, doing a quick rewrite of the scheduler seemed feasible and low risk, compared to going with YARN, getting features that were not needed, understanding it, fixing its problems, and then ending up with something that didn’t address the primary goal of lowering latency.

The following are some of the key differences:

Corona does push-based scheduling and has an event-driven, callback-oriented message flow. This was critical to achieving fast, low-latency scheduling. Polling is a big part of why the Hadoop scheduler is slow and has scalability issues. YARN does not use a callback-based message flow.
In Corona, the JobTracker can run in the same JVM as the job client (for example, Hive). Facebook had fat client machines with tons of RAM and CPU. To reduce latency, doing as much processing as possible on the client machine is preferred. In YARN, the JobTracker equivalent (the ApplicationMaster) has to be scheduled within the cluster. This means that there’s one extra step between starting a query and getting it running.
Corona is structured as a contrib project to the Hadoop 0.20 branch and is not a very large codebase.
Corona is integrated with the fair scheduler. YARN is more interested in the capacity scheduler.

For more information on Corona, refer to https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920.


Summary

We talked about various works related to YARN that are available on the market today. These systems share common inspiration/requirements, and the high-level goal of improving scalability, latency, fault tolerance, and programming-model flexibility. The architectural differences are due to diverse design priorities. In the next chapter, we will talk about YARN’s future and support in the industry.


Chapter 10. YARN – Future and Support

YARN is the new modern data operating system for Hadoop 2. YARN acts as a central orchestrator to support mixed workloads/programming models, running multiple engines, and multiple access patterns such as batch processing, interactive, streaming, and real-time, in Hadoop 2.

In this chapter, we will talk about YARN’s journey and its present and future in the big data industry.


What YARN means to the big data industry

It can be said that YARN is a boon to the big data industry. Without YARN, the entire big data industry would have been at serious risk. As the industry started playing with big data, new and emerging varieties of problems came into the picture and, with them, new frameworks.

YARN’s support for running these new and emerging frameworks allows them to focus on solving the problems for which they were specifically meant, while YARN takes care of resource management and other necessary things (resource allocation, scheduling jobs, fault tolerance, and so on).

Had there been no YARN, these frameworks would have had to do all the resource management on their own. There are many big data projects that failed in the past due to unrealistic expectations of immature technologies.

YARN is the enabler for porting mature and enterprise-class technologies directly onto Hadoop. Without YARN, the only option in Hadoop was to use MapReduce.


Journey – present and future

YARN was introduced with the Hadoop 0.23 release on November 11, 2011.

Since then, there has been no looking back, and a number of releases have followed.

Finally, on October 15, 2013, Apache Hadoop 2.2.0 became the GA (General Availability) release of Apache Hadoop 2.x.

In October 2013, Apache Hadoop YARN won the Best Paper award at ACM SoCC (Symposium on Cloud Computing) 2013.

Apache Hadoop 2.x, powered by YARN, is no doubt the best platform for all of the Hadoop ecosystem components, such as MapReduce, Apache Hive, Apache Pig, and so on, that use HDFS as the underlying data storage.

YARN has also been embraced by other open source communities, for frameworks such as Apache Giraph, Apache Tez, Apache Spark, Apache Flink, and many others.

Vendors such as HP, Microsoft, SAS, Teradata, SAP, Red Hat, and many others are moving towards YARN to run their existing products and services on Hadoop.

People willing to modify their applications can already use YARN directly, but many customers and vendors don’t want to modify their existing applications. For them, there is Apache Slider, another open source project from Hortonworks, which can deploy existing distributed applications without requiring them to be ported to YARN.

Apache Slider allows you to bridge existing always-on services and makes sure they work well on top of YARN, without having to modify the application itself.

Slider facilitates running many long-running services and applications, such as Apache Storm, Apache HBase, and Apache Accumulo, on YARN.

This initiative will definitely expand the spectrum of applications and use cases that one can run with Hadoop and YARN in the future.


Present on-going features

Now, let’s discuss the work currently in progress in YARN.

Long Running Applications on Secure Clusters (YARN-896)

Support long-lived applications and long-lived containers. Refer to https://issues.apache.org/jira/browse/YARN-896.

Application Timeline Server (YARN-321, YARN-1530)

Currently, we have a JobHistory server for MapReduce history. The MapReduce JobHistory server needs to be deployed as a trusted server in sync with the MapReduce runtime. Every new application would need a similar application history server. Having to deploy O(T*V) trusted servers (where T is the number of application types and V is the number of application versions) is clearly not scalable.

This JIRA is to create only one trusted application history server, which can have a generic UI. Refer to the following links for more information:

https://issues.apache.org/jira/browse/YARN-321
https://issues.apache.org/jira/browse/YARN-1530
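To make the O(T*V) argument concrete, the following sketch counts trusted history servers under both designs. It is only an illustration: the application types and version strings are made-up values, and `per_framework_servers` and `shared_servers` are hypothetical helper names, not part of YARN.

```python
def per_framework_servers(versions_by_type):
    """One trusted history server per (application type, version) pair."""
    return sum(len(versions) for versions in versions_by_type.values())


def shared_servers(versions_by_type):
    """A single generic history server, regardless of types and versions."""
    return 1 if versions_by_type else 0


# Hypothetical cluster: three application types, each with a few live versions.
deployed = {
    "mapreduce": ["2.2", "2.4", "2.6"],
    "spark": ["1.1", "1.2"],
    "tez": ["0.5"],
}

print(per_framework_servers(deployed))  # 6
print(shared_servers(deployed))         # 1
```

The first count grows with every new framework and every new version kept alive, while the second stays constant, which is the motivation for a single generic server.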

Disk scheduling (YARN-2139)

Support for disk as a resource in YARN. YARN should consider disk as another resource for scheduling tasks on nodes, isolation at runtime, and spindle locality. Refer to https://issues.apache.org/jira/browse/YARN-2139.

Reservation-based scheduling (YARN-1051)

Extend the YARN RM to handle time explicitly, allowing users to reserve capacity over time. This is an important step towards SLAs, long-running services, and workflows, and it helps with gang scheduling.
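The idea of reserving capacity over time can be sketched with a toy model. This is not the YARN reservation API: `ReservationPlan`, its discrete time slots, and the capacity numbers are all assumptions made for illustration. A request is accepted only if every slot in its window can hold it, which is the all-or-nothing property that gang scheduling relies on.

```python
class ReservationPlan:
    """Toy model of capacity reserved over discrete time slots (not a YARN API)."""

    def __init__(self, total_capacity, num_slots):
        self.total_capacity = total_capacity
        self.reserved = [0] * num_slots  # capacity already promised per slot

    def try_reserve(self, start, end, amount):
        """Reserve `amount` for slots [start, end) only if every slot can fit it."""
        if not (0 <= start < end <= len(self.reserved)) or amount <= 0:
            return False
        window = range(start, end)
        if any(self.reserved[t] + amount > self.total_capacity for t in window):
            return False  # reject the whole request; nothing is partially reserved
        for t in window:
            self.reserved[t] += amount
        return True


plan = ReservationPlan(total_capacity=100, num_slots=6)
print(plan.try_reserve(0, 4, 60))  # True
print(plan.try_reserve(2, 5, 60))  # False: slots 2 and 3 would exceed 100
print(plan.try_reserve(4, 6, 60))  # True
```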


Future features

Let’s discuss the work planned for YARN.

Container Resizing (YARN-1197)

The current YARN resource management logic assumes that the resources allocated to a container are fixed during its lifetime. When users want to change the resources of an allocated container, the only way is to release it and allocate a new container of the expected size. Allowing runtime changes to the resources of an allocated container will give us better control of resource usage on the application side. Refer to https://issues.apache.org/jira/browse/YARN-1197.
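The difference between the two approaches can be shown with a small simulation. This is a minimal sketch under stated assumptions: `Node`, its methods, and the memory figures are hypothetical and do not correspond to YARN classes or to the design in the JIRA.

```python
class Node:
    """Toy node that tracks free memory in MB (not a YARN API)."""

    def __init__(self, memory_mb):
        self.free_mb = memory_mb
        self.containers = {}  # container id -> allocated MB
        self._next_id = 0

    def allocate(self, size_mb):
        """Create a new container, or return None if the node lacks memory."""
        if size_mb > self.free_mb:
            return None
        self._next_id += 1
        self.containers[self._next_id] = size_mb
        self.free_mb -= size_mb
        return self._next_id

    def release(self, container_id):
        self.free_mb += self.containers.pop(container_id)

    def resize_by_reallocating(self, container_id, new_size_mb):
        """Behaviour described above: release, then allocate a new container."""
        self.release(container_id)  # work in the old container must be restarted
        return self.allocate(new_size_mb)

    def resize_in_place(self, container_id, new_size_mb):
        """Proposed behaviour: keep the same container and adjust its allocation."""
        delta = new_size_mb - self.containers[container_id]
        if delta > self.free_mb:
            return False
        self.containers[container_id] = new_size_mb
        self.free_mb -= delta
        return True


node = Node(8192)
cid = node.allocate(2048)
new_cid = node.resize_by_reallocating(cid, 4096)
print(new_cid != cid)                                     # True: a different container
print(node.resize_in_place(new_cid, 6144), node.free_mb)  # True 2048
```

In the sketch, reallocation hands back a new container identifier, whereas the in-place variant keeps the identifier and only changes the accounting.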

Admin labels (YARN-796)

Support for admins to specify labels for nodes. Examples of labels are the OS, the processor architecture, and so on. Refer to https://issues.apache.org/jira/browse/YARN-796.
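As an illustration of how such labels could be used for placement, the sketch below selects nodes whose labels include everything a request asks for. The node names, the label values, and the `nodes_matching` helper are hypothetical; YARN's actual label matching rules may differ.

```python
def nodes_matching(node_labels, required):
    """Return the names of nodes whose labels include every required label."""
    required = set(required)
    return sorted(
        name for name, labels in node_labels.items() if required <= set(labels)
    )


# Hypothetical labels an admin might attach to nodes (OS, processor architecture).
cluster = {
    "node1": {"linux", "x86_64"},
    "node2": {"linux", "arm64"},
    "node3": {"windows", "x86_64"},
}

print(nodes_matching(cluster, ["linux"]))            # ['node1', 'node2']
print(nodes_matching(cluster, ["linux", "x86_64"]))  # ['node1']
print(nodes_matching(cluster, []))                   # ['node1', 'node2', 'node3']
```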

Container Delegation (YARN-1488)

Allow containers to delegate resources to another container. This would allow external frameworks to share not just YARN’s resource-management capabilities, but also its workload-management capabilities.

This also shows that YARN is focused not only on the Apache Hadoop ecosystem components, but also on any existing external non-Hadoop products and services that want to use Hadoop.

Work is also under way to bring together the worlds of data and PaaS by using Docker, Google Kubernetes, and Red Hat OpenShift on YARN, so that resources can be managed in a common way across data and PaaS workloads.


YARN-supported frameworks

The following is the current list of frameworks that run on top of YARN, and this list will keep growing in the future:

Apache Hadoop MapReduce and its ecosystem components
Apache HAMA
OpenMPI
Apache S4
Apache Spark
Apache Tez
Impala
Storm
HOYA (HBase on YARN)
Apache Samza
Apache Giraph
Apache Accumulo
Apache Flink
KOYA (Kafka on YARN)
Solr


Summary

In this chapter, we briefly talked about YARN’s journey since its inception. YARN has completely changed Hadoop from the way it was in the Hadoop 1.x version. YARN is now a first-class resource management framework for supporting mixed workloads and processing frameworks.

From what can be seen and predicted, YARN is surely a hit in the big data industry and has many more new and promising features to come. Currently, YARN handles memory and CPU, and it will coordinate additional resources such as disk and network I/O in the future.


Index

A

Access Control List (ACL)
    about / NodeManager (NM), The capacity scheduler
administrative tools
    about / Administrative tools
    commands / Administrative tools
    generic options, supporting / Administrative tools
    / Administrative tools
anagrams / Practical examples of MRv1 and MRv2
Apache Giraph
    about / Apache Giraph
    URL / Apache Giraph
Apache Hadoop 2.2.0
    about / Journey – present and future
Apache Samza
    about / Apache Samza
    Kafka / Apache Samza
    Apache YARN / Apache Samza
    ZooKeeper / Apache Samza
    Kafka producer, writing / Writing a Kafka producer
    hello-samza project, writing / Writing the hello-samza project
Apache Samza, layers
    processing layer / Apache Samza
    streaming layer / Apache Samza
    execution layer / Apache Samza
Apache Slider
    about / Journey – present and future
Apache Software Foundation
    about / Mesos
Apache Spark
    about / Apache Spark
    features / Apache Spark
    running, on YARN / Why run on YARN?
Apache Tez
    about / Apache Tez
    URL / Apache Tez
ApplicationContext (AppContext) / The MapReduce ApplicationMaster
ApplicationMaster
    about / The MapReduce ApplicationMaster
ApplicationMaster (AM) / ApplicationMaster (AM)
    restarting / The MapReduce ApplicationMaster


    writing / Writing the YARN ApplicationMaster
    responsibilities / Responsibilities of the ApplicationMaster
    failures / ApplicationMaster failures
ApplicationMasterLauncher service
    about / ResourceManager
ApplicationMasterService
    about / ResourceManager
ApplicationsManager
    about / ResourceManager


B

backward compatibility, MRv2 APIs
    about / Backward compatibility of MRv2 APIs
    binary compatibility, of org.apache.hadoop.mapred APIs / Binary compatibility of org.apache.hadoop.mapred APIs
    source compatibility, of org.apache.hadoop.mapred APIs / Source compatibility of org.apache.hadoop.mapred APIs
Bulk Synchronous Parallel (BSP)
    about / Apache Giraph


C

capacity scheduler
    about / The capacity scheduler, The capacity scheduler
    benefits / The capacity scheduler
    features / The capacity scheduler
    configurations / Capacity scheduler configurations
cluster scheduling architecture
    about / Omega
configuration parameters
    about / The fully-distributed mode
container
    failures / Container failures
container allocation
    about / Container allocation
    to application / Container allocation to the application
container configurations
    about / Container configurations
    parameters / Container configurations
ContainerExecutor
    about / NodeManager (NM)
ContainerManager
    about / NodeManager (NM)
Context Objects / Old and new MapReduce APIs
Corona
    about / Corona
    and Facebook, differences / Corona
    URL / Corona


D

data-processing graphs (DAGs)
    about / Apache Tez
DataNodes (DN) / The fully-distributed mode
    configuring / The fully-distributed mode
Docker
    about / Future features


E

EcoSystem
    web interfaces / Web interfaces of the Ecosystem


F

Facebook
    about / Corona
    and Corona, differences / Corona
Fair scheduler / The fair scheduler
    about / The fair scheduler
    configurations / Fair scheduler configurations
FIFO scheduler / The FIFO (First In First Out) scheduler
    about / The FIFO (First In First Out) scheduler
    configurations / The FIFO (First In First Out) scheduler
fully-distributed mode
    about / The fully-distributed mode
    HistoryServer / HistoryServer
    slave files / Slave files


G

Google Kubernetes
    about / Future features
grid
    starting / Starting a grid


H

Hadoop
    URL / Software
    YARN, using in / Understanding where YARN fits into Hadoop
Hadoop 0.23
    about / Journey – present and future
Hadoop 1.x
    about / A short introduction to Hadoop 1.x and MRv1
    components / A short introduction to Hadoop 1.x and MRv1
Hadoop 2 release
    about / The Hadoop 2 release
Hadoop and YARN cluster
    operating / Operating Hadoop and YARN clusters
    starting / Starting Hadoop and YARN clusters
    stopping / Stopping Hadoop and YARN clusters
Hadoop cluster
    HDFS / A short introduction to Hadoop 1.x and MRv1
    MapReduce / A short introduction to Hadoop 1.x and MRv1
Hadoop On Demand (HOD) / Omega
hello-samza project
    writing / Writing the hello-samza project
    properties / Writing the hello-samza project
    grid, starting / Starting a grid
HistoryServer / HistoryServer
HOYA (HBase on YARN)
    about / HOYA (HBase on YARN)
    URL / HOYA (HBase on YARN)


K

Kafka producer
    writing / Writing a Kafka producer
KOYA (Kafka on YARN)
    about / KOYA (Kafka on YARN)
    URL / KOYA (Kafka on YARN)


M

MapReduce, YARN
    about / YARN’s MapReduce support
    ApplicationMaster / The MapReduce ApplicationMaster
    settings, example / Example YARN MapReduce settings
    YARN applications, developing / Developing YARN applications
MapReduce applications
    YARN, compatible with / YARN’s compatibility with MapReduce applications
MapReduce job
    configurations / MapReduce job configurations
    properties / MapReduce job configurations
MapReduce JobHistoryServer
    settings / HistoryServer
MapReduce project
    End-user MapReduce API / MRv1 versus MRv2
    MapReduce framework / MRv1 versus MRv2
    MapReduce system / MRv1 versus MRv2
Mesos
    about / Mesos
    and YARN, difference between / Mesos
    URL / Mesos
modern operating system, of Hadoop
    YARN, used as / YARN as the modern operating system of Hadoop
monolithic schedulers / Omega
MRv1
    about / A short introduction to Hadoop 1.x and MRv1
    versus MRv2 / MRv1 versus MRv2
    examples / Practical examples of MRv1 and MRv2, Running the job
MRv2
    versus MRv1 / MRv1 versus MRv2
    examples / Practical examples of MRv1 and MRv2, Preparing the input file(s)


N

NameNode (NN) / The fully-distributed mode
    configuring / The fully-distributed mode
new MapReduce API
    about / Old and new MapReduce APIs
    versus old MapReduce API / Old and new MapReduce APIs
NodeHealthCheckerService
    about / NodeManager (NM)
NodeManager (NM) / NodeManager (NM)
    configuring / The fully-distributed mode
    parameters / The fully-distributed mode
NodeManagers (NM) / The fully-distributed mode
NodeStatusUpdater
    about / NodeManager (NM)


O

old MapReduce API
    about / Old and new MapReduce APIs
    versus new MapReduce API / Old and new MapReduce APIs
Omega
    about / Omega


P

Pi example
    running / Running a sample Pi example
prerequisites, single-node installation
    platform / Platform
    softwares / Software
prerequisites, Storm-YARN
    Hadoop YARN, installing / Hadoop YARN should be installed
    Apache ZooKeeper, installing / Apache ZooKeeper should be installed
program names
    aggregatewordcount / Running sample examples on YARN
    aggregatewordhist / Running sample examples on YARN
    bbp / Running sample examples on YARN
    dbcount / Running sample examples on YARN
    distbbp / Running sample examples on YARN
    grep / Running sample examples on YARN
    join / Running sample examples on YARN
    multifilewc / Running sample examples on YARN
    pentomino / Running sample examples on YARN
    pi / Running sample examples on YARN
    randomtextwriter / Running sample examples on YARN
    randomwriter / Running sample examples on YARN
    secondarysort / Running sample examples on YARN
    sort / Running sample examples on YARN
    sudoku / Running sample examples on YARN
    teragen / Running sample examples on YARN
    terasort / Running sample examples on YARN
    teravalidate / Running sample examples on YARN
    wordcount / Running sample examples on YARN
    wordmean / Running sample examples on YARN
    wordmedian / Running sample examples on YARN
    wordstandarddeviation / Running sample examples on YARN
pseudo-distributed mode / The pseudo-distributed mode
push-based scheduling / Corona


R

redesign idea
    about / The redesign idea
    MapReduce, limitations / Limitations of the classical MapReduce or Hadoop 1.x
    Hadoop 1.x, limitations / Limitations of the classical MapReduce or Hadoop 1.x
Red Hat OpenShift
    about / Future features
Red Hat Package Managers (RPMs) / The fully-distributed mode
ResourceManager / ResourceManager
ResourceManager (RM)
    scheduler / ResourceManager
    security / ResourceManager
    RM Restart Phase I / Recent developments in YARN architecture
    RM Restart Phase II / Recent developments in YARN architecture
    about / The fully-distributed mode
    configuring / The fully-distributed mode
    parameters / The fully-distributed mode
    failures / ResourceManager failures
ResourceManager (RM), components
    ApplicationManager / NodeManager (NM)
    Scheduler / NodeManager (NM)


S

scheduler architectures
    monolithic schedulers / Omega
    two-level schedulers / Omega
single-node installation
    about / Single-node installation
    prerequisites / Prerequisites
    starting / Starting with the installation
    standalone mode (local mode) / The standalone mode (local mode)
    pseudo-distributed mode / The pseudo-distributed mode
slave files / Slave files
standalone mode (local mode) / The standalone mode (local mode)
Storm-Starter examples
    building / Building and running Storm-Starter examples
    running / Building and running Storm-Starter examples
Storm-YARN
    about / Storm-YARN
    prerequisites / Prerequisites
    setting up / Setting up Storm-YARN
    storm.yaml configuration, obtaining / Getting the storm.yaml configuration of the launched Storm cluster
    Storm-Starter examples, building / Building and running Storm-Starter examples
    Storm-Starter examples, running / Building and running Storm-Starter examples
storm.yaml configuration
    obtaining / Getting the storm.yaml configuration of the launched Storm cluster


T

two-level schedulers / Omega


W

web GUI
    YARN applications, monitoring with / Monitoring YARN applications with web GUI


Y

YARN
    used, as modern operating system of Hadoop / YARN as the modern operating system of Hadoop
    design goals / What are the design goals for YARN
    used, in Hadoop / Understanding where YARN fits into Hadoop
    multitenancy application support / YARN multitenancy application support
    sample examples, running on / Running sample examples on YARN
    sample Pi example, running / Running a sample Pi example
    compatibility, with MapReduce applications / YARN’s compatibility with MapReduce applications
    Apache Spark, running on / Why run on YARN?
    and Mesos, difference between / Mesos
    importance, to Big Data industry / What YARN means to the big data industry
    present / Journey – present and future
    future / Journey – present and future
    present on-going features / Present on-going features
    future features / Future features
YARN, features
    Long Running Applications on Secure Clusters (YARN-896) / Present on-going features
    Application Timeline Server (YARN-321, YARN-1530) / Present on-going features
    Disk scheduling (YARN-2139) / Present on-going features
    Reservation-based scheduling (YARN-1051) / Present on-going features
    Container Resizing (YARN-1197) / Future features
    Admin labels (YARN-796) / Future features
    Container Delegation (YARN-1488) / Future features
YARN-321
    URL / Present on-going features
YARN-796
    URL / Future features
YARN-896
    URL / Present on-going features
YARN-1197
    URL / Future features
YARN-1530
    URL / Present on-going features
YARN-2139
    URL / Present on-going features
YARN-supported frameworks
    about / YARN-supported frameworks
YARN administrations


    about / Administration of YARN
    configuration files / Administration of YARN
    administrative tools / Administrative tools
    nodes, adding from YARN cluster / Adding and removing nodes from a YARN cluster
    nodes, removing from YARN cluster / Adding and removing nodes from a YARN cluster
    YARN jobs, administrating / Administrating YARN jobs
    MapReduce job, configurations / MapReduce job configurations
    YARN log management / YARN log management
    YARN web user interface / YARN web user interface
YARN applications
    monitoring, with web GUI / Monitoring YARN applications with web GUI
    developing / Developing YARN applications
    ApplicationClientProtocol / Developing YARN applications
    ApplicationMasterProtocol / Developing YARN applications
    ContainerManagerProtocol / Developing YARN applications
YARN application workflow
    about / The YARN application workflow
    YARN client, writing / Writing the YARN client
    ApplicationMaster, writing / Writing the YARN ApplicationMaster
YARN architecture
    components / Core components of YARN architecture
    development / Recent developments in YARN architecture
YARN architecture, components
    ResourceManager / ResourceManager
    ApplicationMaster (AM) / ApplicationMaster (AM)
    NodeManager (NM) / NodeManager (NM)
YARN client
    writing / Writing the YARN client
YARN cluster
    nodes, adding from / Adding and removing nodes from a YARN cluster
    nodes, removing from / Adding and removing nodes from a YARN cluster
YARN jobs
    administrating / Administrating YARN jobs
YARN log management / YARN log management
YARN MapReduce settings
    example / Example YARN MapReduce settings
    properties / Example YARN MapReduce settings
YARN scheduler policies
    about / YARN scheduler policies
    FIFO scheduler / The FIFO (First In First Out) scheduler
    Fair scheduler / The fair scheduler
    capacity scheduler / The capacity scheduler


YARN scheduling policies
    about / YARN scheduling policies
    FIFO scheduler / The FIFO (First In First Out) scheduler
    capacity scheduler / The capacity scheduler
    Fair scheduler / The fair scheduler
YARN web user interface / YARN web user interface


Z

Zookeeper
    URL / Apache ZooKeeper should be installed