cloudera_developer_training.pdf

Cloudera Developer Training for Apache Hadoop (201301)
© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

TRANSCRIPT

  • 01-1: Cloudera Developer Training for Apache Hadoop (201301)

  • 01-2: Chapter 1: Introduction

  • 01-3: Course Chapters

    Course Introduction
      - Introduction (this chapter)

    Introduction to Apache Hadoop and its Ecosystem
      - The Motivation for Hadoop
      - Hadoop: Basic Concepts
      - The Hadoop Ecosystem

    Basic Programming with the Hadoop Core API
      - Writing a MapReduce Program
      - Unit Testing MapReduce Programs
      - Delving Deeper into the Hadoop API
      - Practical Development Tips and Techniques
      - Data Input and Output

    Problem Solving with MapReduce
      - Common MapReduce Algorithms
      - Joining Data Sets in MapReduce Jobs
      - Integrating Hadoop into the Enterprise Workflow
      - Machine Learning and Mahout
      - An Introduction to Hive and Pig
      - An Introduction to Oozie

    Course Conclusion and Appendices
      - Conclusion
      - Appendix: Cloudera Enterprise
      - Appendix: Graph Manipulation in MapReduce

  • 01-4: Chapter Topics (Course Introduction: Introduction)
      - About this course
      - About Cloudera
      - Course logistics

  • 01-5: Course Objectives

    During this course, you will learn:
      - The core technologies of Hadoop
      - How HDFS and MapReduce work
      - How to develop MapReduce applications
      - How to unit test MapReduce applications
      - How to use MapReduce combiners, partitioners, and the distributed cache
      - Best practices for developing and debugging MapReduce applications
      - How to implement data input and output in MapReduce applications

  • 01-6: Course Objectives (cont'd)
      - Algorithms for common MapReduce tasks
      - How to join data sets in MapReduce
      - How Hadoop integrates into the data center
      - How to use Mahout's machine learning algorithms
      - How Hive and Pig can be used for rapid application development
      - How to create large workflows using Oozie

  • 01-7: Chapter Topics (as on slide 01-4; current topic: About Cloudera)

  • 01-8: About Cloudera
      - Founded by leading experts on Hadoop from Facebook, Google, Oracle,
        and Yahoo
      - Provides consulting and training services for Hadoop users
      - Staff includes committers to virtually all Hadoop projects
      - Many authors of industry-standard books on Apache Hadoop projects:
        Lars George, Tom White, Eric Sammer, etc.

  • 01-9: Cloudera Software
      - Cloudera's Distribution, including Apache Hadoop (CDH)
          - A set of easy-to-install packages built from the Apache Hadoop
            core repository, integrated with several additional open-source
            Hadoop ecosystem projects
          - Includes a stable version of Hadoop, plus critical bug fixes and
            solid new features from the development version
          - 100% open source
      - Cloudera Manager, Free Edition
          - The easiest way to deploy a Hadoop cluster
          - Automates installation of Hadoop software
          - Installation, monitoring, and configuration are performed from a
            central machine
          - Manages up to 50 nodes
          - Completely free

  • 01-10: Cloudera Enterprise
      - Cloudera Enterprise Core
          - Complete package of software and support
          - Built on top of CDH
          - Includes the full version of Cloudera Manager: install, manage,
            and maintain a cluster of any size; LDAP integration; resource
            consumption tracking; proactive health checks; alerting;
            configuration change audit trails; and more
      - Cloudera Enterprise RTD
          - Includes support for Apache HBase

  • 01-11: Cloudera Services
      - Provides consultancy and support services to many key users of
        Hadoop, including eBay, JPMorganChase, Experian, Groupon, Morgan
        Stanley, Nokia, Orbitz, National Cancer Institute, RIM, and The Walt
        Disney Company
      - Solutions Architects are experts in Hadoop and related technologies;
        many are committers to the Apache Hadoop and ecosystem projects
      - Provides training in key areas of Hadoop administration and
        development
          - Courses include System Administrator training, Developer
            training, Hive and Pig training, HBase training, and Essentials
            for Managers
          - Custom course development available
          - Both public and on-site training available

  • 01-12: Chapter Topics (as on slide 01-4; current topic: Course logistics)

  • 01-13: Logistics
      - Course start and end times
      - Lunch
      - Breaks
      - Restrooms
      - Can I come in early/stay late?
      - Certification

  • 01-14: Introductions
      - About your instructor
      - About you
          - Experience with Hadoop?
          - Experience as a developer?
          - What programming languages do you use?
          - Expectations from the course?

  • 02-1: Chapter 2: The Motivation for Hadoop

  • 02-2: Course Chapters (as on slide 01-3; current chapter: The Motivation
    for Hadoop, in the section "Introduction to Apache Hadoop and its
    Ecosystem")

  • 02-3: The Motivation for Hadoop

    In this chapter you will learn:
      - What problems exist with traditional large-scale computing systems
      - What requirements an alternative approach should have
      - How Hadoop addresses those requirements

  • 02-4: Chapter Topics (Introduction to Apache Hadoop and its Ecosystem:
    The Motivation for Hadoop)
      - Problems with traditional large-scale systems
      - Requirements for a new approach
      - Introducing Hadoop
      - Conclusion

  • 02-5: Traditional Large-Scale Computation
      - Traditionally, computation has been processor-bound
          - Relatively small amounts of data
          - Significant amount of complex processing performed on that data
      - For decades, the primary push was to increase the computing power of
        a single machine: faster processor, more RAM
      - Distributed systems evolved to allow developers to use multiple
        machines for a single job: MPI, PVM, Condor

    MPI: Message Passing Interface; PVM: Parallel Virtual Machine

  • 02-6: Distributed Systems: Problems
      - Programming for traditional distributed systems is complex
          - Data exchange requires synchronization
          - Finite bandwidth is available
          - Temporal dependencies are complicated
          - It is difficult to deal with partial failures of the system
      - Ken Arnold, CORBA designer: "Failure is the defining difference
        between distributed and local programming, so you have to design
        distributed systems with the expectation of failure." Developers
        spend more time designing for failure than they do actually working
        on the problem itself.

    CORBA: Common Object Request Broker Architecture

  • 02-7: Distributed Systems: Data Storage
      - Typically, data for a distributed system is stored on a SAN
      - At compute time, data is copied to the compute nodes
      - Fine for relatively limited amounts of data

    SAN: Storage Area Network

  • 02-8: The Data-Driven World
      - Modern systems have to deal with far more data than was the case in
        the past
          - Organizations are generating huge amounts of data
          - That data has inherent value, and cannot be discarded
      - Examples: Facebook, over 70PB of data; eBay, over 5PB of data
      - Many organizations are generating data at a rate of terabytes per day

  • 02-9: Data Becomes the Bottleneck
      - Moore's Law has held firm for over 40 years
          - Processing power doubles every two years
          - Processing speed is no longer the problem
      - Getting the data to the processors becomes the bottleneck
      - Quick calculation
          - Typical disk data transfer rate: 75MB/sec
          - Time taken to transfer 100GB of data to the processor:
            approximately 22 minutes, assuming sustained reads
          - Actual time will be worse, since most servers have less than
            100GB of RAM available
      - A new approach is needed
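The quick calculation on this slide can be checked directly; a minimal sketch of the arithmetic:

```python
# Verify the transfer-time estimate: reading 100 GB from a single disk
# at a sustained 75 MB/sec.
data_mb = 100 * 1024          # 100 GB expressed in MB
rate_mb_per_sec = 75          # typical sustained disk transfer rate

minutes = data_mb / rate_mb_per_sec / 60
print(f"{minutes:.1f} minutes")   # prints "22.8 minutes"
```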

  • 02-10: Chapter Topics (as on slide 02-4; current topic: Requirements for
    a new approach)

  • 02-11: Partial Failure Support
      - The system must support partial failure: failure of a component
        should result in a graceful degradation of application performance,
        not complete failure of the entire system

  • 02-12: Data Recoverability
      - If a component of the system fails, its workload should be assumed by
        still-functioning units in the system
      - Failure should not result in the loss of any data

  • 02-13: Component Recovery
      - If a component of the system fails and then recovers, it should be
        able to rejoin the system without requiring a full restart of the
        entire system

  • 02-14: Consistency
      - Component failures during execution of a job should not affect the
        outcome of the job

  • 02-15: Scalability
      - Adding load to the system should result in a graceful decline in
        performance of individual jobs, not failure of the system
      - Increasing resources should support a proportional increase in load
        capacity

  • 02-16: Chapter Topics (as on slide 02-4; current topic: Introducing
    Hadoop)

  • 02-17: Hadoop's History
      - Hadoop is based on work done by Google in the late 1990s/early 2000s
          - Specifically, on papers describing the Google File System (GFS),
            published in 2003, and MapReduce, published in 2004
      - This work takes a radical new approach to the problem of distributed
        computing, meeting all the requirements we have for reliability and
        scalability
      - Core concept: distribute the data as it is initially stored in the
        system
          - Individual nodes can work on data local to those nodes
          - No data transfer over the network is required for initial
            processing

  • 02-18: Core Hadoop Concepts
      - Applications are written in high-level code
          - Developers need not worry about network programming, temporal
            dependencies, or low-level infrastructure
      - Nodes talk to each other as little as possible
          - Developers should not write code which communicates between nodes
          - "Shared nothing" architecture
      - Data is spread among machines in advance
          - Computation happens where the data is stored, wherever possible
          - Data is replicated multiple times on the system for increased
            availability and reliability

  • 02-19: Hadoop: Very High-Level Overview
      - When data is loaded into the system, it is split into blocks,
        typically 64MB or 128MB
      - Map tasks (the first part of the MapReduce system) work on relatively
        small portions of data, typically a single block
      - A master program allocates work to nodes such that a Map task will
        work on a block of data stored locally on that node whenever possible
          - Many nodes work in parallel, each on their own part of the
            overall dataset

  • 02-20: Fault Tolerance
      - If a node fails, the master will detect that failure and re-assign
        the work to a different node on the system
      - Restarting a task does not require communication with nodes working
        on other portions of the data
      - If a failed node restarts, it is automatically added back to the
        system and assigned new tasks
      - If a node appears to be running slowly, the master can redundantly
        execute another instance of the same task
          - Results from the first to finish will be used
          - Known as "speculative execution"

  • 02-21: Chapter Topics (as on slide 02-4; current topic: Conclusion)

  • 02-22: Conclusion

    In this chapter you have learned:
      - What problems exist with traditional large-scale computing systems
      - What requirements an alternative approach should have
      - How Hadoop addresses those requirements

  • 03-1: Chapter 3: Hadoop: Basic Concepts

  • 03-2: Course Chapters (as on slide 01-3; current chapter: Hadoop: Basic
    Concepts, in the section "Introduction to Apache Hadoop and its
    Ecosystem")

  • 03-3: Hadoop: Basic Concepts

    In this chapter you will learn:
      - What Hadoop is
      - What features the Hadoop Distributed File System (HDFS) provides
      - The concepts behind MapReduce
      - How a Hadoop cluster operates
      - What other Hadoop Ecosystem projects exist

  • 03-4: Chapter Topics (Introduction to Apache Hadoop and its Ecosystem:
    Hadoop: Basic Concepts)
      - The Hadoop project and Hadoop components
      - The Hadoop Distributed File System (HDFS)
      - Hands-On Exercise: Using HDFS
      - How MapReduce works
      - Hands-On Exercise: Running a MapReduce Job
      - How a Hadoop cluster operates
      - Other Hadoop ecosystem components
      - Conclusion

  • 03-5: The Hadoop Project
      - Hadoop is an open-source project overseen by the Apache Software
        Foundation
      - Originally based on papers published by Google in 2003 and 2004
      - Hadoop committers work at several different organizations, including
        Cloudera, Yahoo!, Facebook, and LinkedIn

  • 03-6: Hadoop Components
      - Hadoop consists of two core components: the Hadoop Distributed File
        System (HDFS) and MapReduce
      - There are many other projects based around core Hadoop
          - Often referred to as the "Hadoop Ecosystem": Pig, Hive, HBase,
            Flume, Oozie, Sqoop, etc.
          - Many are discussed later in the course
      - A set of machines running HDFS and MapReduce is known as a Hadoop
        cluster
          - Individual machines are known as nodes
          - A cluster can have as few as one node, as many as several
            thousand
          - More nodes = better performance!

  • 03-7: Hadoop Components: HDFS
      - HDFS, the Hadoop Distributed File System, is responsible for storing
        data on the cluster
      - Data is split into blocks and distributed across multiple nodes in
        the cluster; each block is typically 64MB or 128MB in size
      - Each block is replicated multiple times
          - Default is to replicate each block three times
          - Replicas are stored on different nodes
          - This ensures both reliability and availability

  • 03-8: Hadoop Components: MapReduce
      - MapReduce is the system used to process data in the Hadoop cluster
      - Consists of two phases: Map, and then Reduce; between the two is a
        stage known as the shuffle and sort
      - Each Map task operates on a discrete portion of the overall dataset,
        typically one HDFS block of data
      - After all Maps are complete, the MapReduce system distributes the
        intermediate data to nodes which perform the Reduce phase
          - Much more on this later!

  • 03-9: Chapter Topics (as on slide 03-4; current topic: The Hadoop
    Distributed File System (HDFS))

  • 03-10: HDFS Basic Concepts
      - HDFS is a filesystem written in Java, based on Google's GFS
      - Sits on top of a native filesystem, such as ext3, ext4, or xfs
      - Provides redundant storage for massive amounts of data, using
        readily-available, industry-standard computers

  • 03-11: HDFS Basic Concepts (cont'd)
      - HDFS performs best with a modest number of large files
          - Millions, rather than billions, of files
          - Each file typically 100MB or more
      - Files in HDFS are "write once": no random writes to files are allowed
      - HDFS is optimized for large, streaming reads of files, rather than
        random reads

  • 03-12: How Files Are Stored
      - Files are split into blocks; each block is usually 64MB or 128MB
      - Data is distributed across many machines at load time
          - Different blocks from the same file will be stored on different
            machines
          - This provides for efficient MapReduce processing (see later)
      - Blocks are replicated across multiple machines, known as DataNodes
          - Default replication is three-fold, meaning that each block exists
            on three different machines
      - A master node called the NameNode keeps track of which blocks make up
        a file, and where those blocks are located, known as the metadata
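The splitting and replication described above can be sketched in a few lines. This is an illustrative toy only: the round-robin placement and DataNode names (dn1 through dn5) are invented for the example and are not HDFS's actual (rack-aware) placement policy.

```python
# Toy model of HDFS block placement: split a file into fixed-size
# blocks and assign each block three replicas on distinct DataNodes.
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB
REPLICATION = 3                 # HDFS default replication factor

def place_blocks(file_size, datanodes):
    n_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
    placement = {}
    for b in range(n_blocks):
        # spread the three replicas over distinct DataNodes, round-robin
        placement[f"blk_{b:03d}"] = [
            datanodes[(b + r) % len(datanodes)] for r in range(REPLICATION)
        ]
    return placement

# A 150 MB file needs 3 blocks (64 + 64 + 22 MB)
layout = place_blocks(150 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4", "dn5"])
print(len(layout))           # 3
print(layout["blk_000"])     # ['dn1', 'dn2', 'dn3']
```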

  • 03-13: How Files Are Stored: Example
      - The NameNode holds metadata for the two files (Foo.txt and Bar.txt):

            Foo.txt: blk_001, blk_002, blk_003
            Bar.txt: blk_004, blk_005

      - The DataNodes hold the actual blocks
          - Each block will be 64MB or 128MB in size
          - Each block is replicated three times on the cluster

    [Diagram: five DataNodes, each storing a subset of blk_001 through
    blk_005, with every block present on three different DataNodes]

  • 03-14: More on the HDFS NameNode
      - The NameNode daemon must be running at all times
          - If the NameNode stops, the cluster becomes inaccessible
          - Your system administrator will take care to ensure that the
            NameNode hardware is reliable!
      - The NameNode holds all of its metadata in RAM for fast access; it
        keeps a record of changes on disk for crash recovery
      - A separate daemon known as the Secondary NameNode takes care of some
        housekeeping tasks for the NameNode
          - Be careful: the Secondary NameNode is not a backup NameNode!

  • 03-15: NameNode High Availability in CDH4
      - CDH4 introduced High Availability for the NameNode
      - Instead of a single NameNode, there are now two: an Active NameNode
        and a Standby NameNode
      - If the Active NameNode fails, the Standby NameNode can automatically
        take over
      - The Standby NameNode does the work performed by the Secondary
        NameNode in "classic" HDFS; HA HDFS does not run a Secondary NameNode
        daemon
      - Your system administrator will choose whether to set the cluster up
        with NameNode High Availability or not

  • 03-16: HDFS: Points to Note
      - Although files are split into 64MB or 128MB blocks, if a file is
        smaller than this the full 64MB/128MB will not be used
      - Blocks are stored as standard files on the DataNodes, in a set of
        directories specified in Hadoop's configuration files; this will be
        set by the system administrator
      - Without the metadata on the NameNode, there is no way to access the
        files in the HDFS cluster
      - When a client application wants to read a file:
          - It communicates with the NameNode to determine which blocks make
            up the file, and which DataNodes those blocks reside on
          - It then communicates directly with the DataNodes to read the data
          - The NameNode will not be a bottleneck

  • 03-17: Accessing HDFS
      - Applications can read and write HDFS files directly via the Java API,
        covered later in the course
      - Typically, files are created on a local filesystem and must be moved
        into HDFS
      - Likewise, files stored in HDFS may need to be moved to a machine's
        local filesystem
      - Access to HDFS from the command line is achieved with the hadoop fs
        command

  • 03-18: hadoop fs Examples
      - Copy file foo.txt from local disk to the user's directory in HDFS
        (this will copy the file to /user/username/foo.txt):

            hadoop fs -put foo.txt foo.txt

      - Get a directory listing of the user's home directory in HDFS:

            hadoop fs -ls

      - Get a directory listing of the HDFS root directory:

            hadoop fs -ls /

  • 03-19: hadoop fs Examples (cont'd)
      - Display the contents of the HDFS file /user/fred/bar.txt:

            hadoop fs -cat /user/fred/bar.txt

      - Move that file to the local disk, named as baz.txt:

            hadoop fs -get /user/fred/bar.txt baz.txt

      - Create a directory called input under the user's home directory:

            hadoop fs -mkdir input

    Note: copyFromLocal is a synonym for put; copyToLocal is a synonym for
    get

  • 03-20: hadoop fs Examples (cont'd)
      - Delete the directory input_old and all its contents:

            hadoop fs -rm -r input_old

  • 03-21: Chapter Topics (as on slide 03-4; current topic: Hands-On
    Exercise: Using HDFS)

  • 03-22: Aside: The Training Virtual Machine
      - During this course, you will perform numerous hands-on exercises
        using the Cloudera Training Virtual Machine (VM)
      - The VM has Hadoop installed in pseudo-distributed mode
          - This essentially means that it is a cluster comprised of a single
            node
          - Using a pseudo-distributed cluster is the typical way to test
            your code before you run it on your full cluster
          - It operates almost exactly like a real cluster; a key difference
            is that the data replication factor is set to 1, not 3

  • 03-23: Hands-On Exercise: Using HDFS
      - In this Hands-On Exercise you will gain familiarity with manipulating
        files in HDFS
      - Please refer to the Hands-On Exercise Manual

  • 03-24: Chapter Topics (as on slide 03-4; current topic: How MapReduce
    works)

  • 03-25: What Is MapReduce?
      - MapReduce is a method for distributing a task across multiple nodes
      - Each node processes data stored on that node, where possible
      - Consists of two phases: Map and Reduce
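The two phases, and the shuffle and sort between them, can be modeled in a toy, single-process sketch; real Hadoop distributes this across nodes. The word-count job here is the traditional illustration, not part of the deck:

```python
# Toy, single-process model of MapReduce word count.
from collections import defaultdict

def map_phase(line):                 # Map: emit a (word, 1) pair per word
    for word in line.split():
        yield (word.lower(), 1)

def shuffle_and_sort(pairs):         # group values by key, sort by key
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return sorted(groups.items())

def reduce_phase(key, values):       # Reduce: sum the counts for one key
    return (key, sum(values))

lines = ["the cat sat", "the cat ran"]
pairs = [kv for line in lines for kv in map_phase(line)]
result = dict(reduce_phase(k, vs) for k, vs in shuffle_and_sort(pairs))
print(result)   # {'cat': 2, 'ran': 1, 'sat': 1, 'the': 2}
```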

  • 03-26: Features of MapReduce
      - Automatic parallelization and distribution
      - Fault-tolerance
      - Status and monitoring tools
      - A clean abstraction for programmers
          - MapReduce programs are usually written in Java
          - Can be written in any language using Hadoop Streaming (see later)
          - All of Hadoop is written in Java
      - MapReduce abstracts all the "housekeeping" away from the developer,
        who can concentrate simply on writing the Map and Reduce functions

  • 03-27: MapReduce: The Big Picture

  • 03-28: MapReduce: The JobTracker
      - MapReduce jobs are controlled by a software daemon known as the
        JobTracker
      - The JobTracker resides on a "master node"
          - Clients submit MapReduce jobs to the JobTracker
          - The JobTracker assigns Map and Reduce tasks to other nodes on the
            cluster
          - These nodes each run a software daemon known as the TaskTracker
          - The TaskTracker is responsible for actually instantiating the Map
            or Reduce task, and reporting progress back to the JobTracker

  • 03-29: Aside: MapReduce Version 2
      - CDH4 contains standard MapReduce (MR1)
      - CDH4 also includes MapReduce version 2 (MR2)
          - Also known as YARN (Yet Another Resource Negotiator)
          - A complete rewrite of the Hadoop MapReduce framework
      - MR2 is not yet considered production-ready; it is included in CDH4 as
        a technology preview
      - Existing code will work with no modification on MR2 clusters when the
        technology matures; code will need to be re-compiled, but the API
        remains identical
      - For production use, we strongly recommend using MR1

  • 03-30: MapReduce: Terminology
      - A job is a "full program": a complete execution of Mappers and
        Reducers over a dataset
      - A task is the execution of a single Mapper or Reducer over a slice of
        data
      - A task attempt is a particular instance of an attempt to execute a
        task; there will be at least as many task attempts as there are tasks
          - If a task attempt fails, another will be started by the
            JobTracker
          - Speculative execution (see later) can also result in more task
            attempts than completed tasks

  • 03-31: MapReduce: The Mapper
      - Hadoop attempts to ensure that Mappers run on nodes which hold their
        portion of the data locally, to avoid network traffic
          - Multiple Mappers run in parallel, each processing a portion of
            the input data
      - The Mapper reads data in the form of key/value pairs
      - It outputs zero or more key/value pairs (pseudo-code):

            map(in_key, in_value) -> (inter_key, inter_value) list


!The Mapper may use or completely ignore the input key
 For example, a standard pattern is to read a line of a file at a time
 The key is the byte offset into the file at which the line starts
 The value is the contents of the line itself
 Typically the key is considered irrelevant

!If the Mapper writes anything out, the output must be in the form of key/value pairs

MapReduce: The Mapper (cont'd)


!Turn input into upper case (pseudo-code):

Example Mapper: Upper Case Mapper

let map(k, v) = emit(k.toUpper(), v.toUpper())

('foo', 'bar') -> ('FOO', 'BAR')
('foo', 'other') -> ('FOO', 'OTHER')
('baz', 'more data') -> ('BAZ', 'MORE DATA')


!Output each input character separately (pseudo-code):

Example Mapper: Explode Mapper

    let map(k, v) = foreach char c in v: emit (k, c)

    ('foo', 'bar') -> ('foo', 'b'), ('foo', 'a'), ('foo', 'r')

    ('baz', 'other') -> ('baz', 'o'), ('baz', 't'), ('baz', 'h'), ('baz', 'e'), ('baz', 'r')


!Only output key/value pairs where the input value is a prime number (pseudo-code):

Example Mapper: Filter Mapper

let map(k, v) = if (isPrime(v)) then emit(k, v)

('foo', 7) -> ('foo', 7)
('baz', 10) -> nothing


!The key output by the Mapper does not need to be identical to the input key

!Output the word length as the key (pseudo-code):

Example Mapper: Changing Keyspaces

let map(k, v) = emit(v.length(), v)

('foo', 'bar') -> (3, 'bar')
('baz', 'other') -> (5, 'other')
('foo', 'abracadabra') -> (11, 'abracadabra')


!After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list

!This list is given to a Reducer
 There may be a single Reducer, or multiple Reducers
 This is specified as part of the job configuration (see later)
 All values associated with a particular intermediate key are guaranteed to go to the same Reducer
 The intermediate keys, and their value lists, are passed to the Reducer in sorted key order
 This step is known as the shuffle and sort

!The Reducer outputs zero or more final key/value pairs
 These are written to HDFS
 In practice, the Reducer usually emits a single key/value pair for each input key

MapReduce: The Reducer
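The shuffle and sort step described above can be sketched in plain Java (a simplified, single-machine illustration, not Hadoop's actual implementation): a TreeMap keeps the intermediate keys in sorted order, and each key accumulates the list of values that will be handed to the Reducer.

```java
import java.util.*;

public class ShuffleSortSketch {
    // Group intermediate (key, value) pairs by key, with keys in sorted order,
    // mimicking what the shuffle and sort phase delivers to the Reducer.
    static SortedMap<String, List<Integer>> shuffleAndSort(
            List<Map.Entry<String, Integer>> mapperOutput) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapperOutput = List.of(
            Map.entry("the", 1), Map.entry("cat", 1), Map.entry("the", 1));
        // Keys arrive at the Reducer in sorted order: cat before the
        System.out.println(shuffleAndSort(mapperOutput));
    }
}
```

In a real cluster this grouping happens across machines, with a Partitioner deciding which Reducer receives each key; the sketch only shows the per-key value lists and the sorted key order.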


!Add up all the values associated with each intermediate key (pseudo-code):

Example Reducer: Sum Reducer

let reduce(k, vals) =
    sum = 0
    foreach int i in vals:
        sum += i
    emit(k, sum)

('bar', [9, 3, -17, 44]) -> ('bar', 39)
('foo', [123, 100, 77]) -> ('foo', 300)


!The Identity Reducer is very common (pseudo-code):

Example Reducer: Identity Reducer

let reduce(k, vals) =
    foreach v in vals:
        emit(k, v)

    ('bar', [123, 100, 77]) -> ('bar', 123), ('bar', 100), ('bar', 77)

    ('foo', [9, 3, -17, 44]) -> ('foo', 9), ('foo', 3), ('foo', -17), ('foo', 44)


!Count the number of occurrences of each word in a large amount of input data
 This is the "hello world" of MapReduce programming

MapReduce Example: Word Count

map(String input_key, String input_value)
    foreach word w in input_value:
        emit(w, 1)

reduce(String output_key, Iterator intermediate_vals)
    set count = 0
    foreach v in intermediate_vals:
        count += v
    emit(output_key, count)
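The word-count pseudo-code above can be run end-to-end in plain Java as a single-process sketch of the map, shuffle-and-sort, and reduce steps (this is an illustration only; real Hadoop code comes in the next chapter):

```java
import java.util.*;

public class WordCountSketch {
    // Map phase: emit (word, 1) for each word in each input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Shuffle and sort, then reduce: sum the values for each word.
    static SortedMap<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("the cat sat on the mat",
                                     "the aardvark sat on the sofa");
        System.out.println(reduce(map(input)));
        // {aardvark=1, cat=1, mat=1, on=2, sat=2, sofa=1, the=4}
    }
}
```

Running this on the two sample lines from the following slides produces the same final output shown there: on=2, sat=2, the=4, and the remaining words counted once.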


!Input to the Mapper:

(3414, 'the cat sat on the mat')
(3437, 'the aardvark sat on the sofa')

!Output from the Mapper:

('the', 1), ('cat', 1), ('sat', 1), ('on', 1), ('the', 1), ('mat', 1),
('the', 1), ('aardvark', 1), ('sat', 1), ('on', 1), ('the', 1), ('sofa', 1)

MapReduce Example: Word Count (cont'd)


!Intermediate data sent to the Reducer:

('aardvark', [1])
('cat', [1])
('mat', [1])
('on', [1, 1])
('sat', [1, 1])
('sofa', [1])
('the', [1, 1, 1, 1])

!Final Reducer output:

('aardvark', 1)
('cat', 1)
('mat', 1)
('on', 2)
('sat', 2)
('sofa', 1)
('the', 4)

MapReduce Example: Word Count (cont'd)


!Whenever possible, Hadoop will attempt to ensure that a Map task on a node is working on a block of data stored locally on that node via HDFS

!If this is not possible, the Map task will have to transfer the data across the network as it processes that data

!Once the Map tasks have finished, data is then transferred across the network to the Reducers
 Although the Reducers may run on the same physical machines as the Map tasks, there is no concept of data locality for the Reducers
 All Mappers will, in general, have to communicate with all Reducers

MapReduce: Data Locality


!It appears that the shuffle and sort phase is a bottleneck
 The reduce method in the Reducers cannot start until all Mappers have finished

!In practice, Hadoop will start to transfer data from Mappers to Reducers as the Mappers finish work
 This mitigates against a huge amount of data transfer starting as soon as the last Mapper finishes
 Note that this behavior is configurable
 The developer can specify the percentage of Mappers which should finish before Reducers start retrieving data
 The developer's reduce method still does not start until all intermediate data has been transferred and sorted

MapReduce: Is Shuffle and Sort a Bottleneck?


!It is possible for one Map task to run more slowly than the others
 Perhaps due to faulty hardware, or just a very slow machine

!It would appear that this would create a bottleneck
 The reduce method in the Reducer cannot start until every Mapper has finished

!Hadoop uses speculative execution to mitigate against this
 If a Mapper appears to be running significantly more slowly than the others, a new instance of the Mapper will be started on another machine, operating on the same data
 The results of the first Mapper to finish will be used
 Hadoop will kill off the Mapper which is still running

MapReduce: Is a Slow Mapper a Bottleneck?


!Write the Mapper and Reducer classes

!Write a Driver class that configures the job and submits it to the cluster
 Driver classes are covered in the next chapter

!Compile the Mapper, Reducer, and Driver classes
 Example: javac -classpath `hadoop classpath` *.java

!Create a jar file with the Mapper, Reducer, and Driver classes
 Example: jar cvf foo.jar *.class

!Run the hadoop jar command to submit the job to the Hadoop cluster
 Example: hadoop jar foo.jar Foo in_dir out_dir

Creating and Running a MapReduce Job


    Chapter"Topics"

    Introduc/on%to%Apache%Hadoop%and%its%Ecosystem%Hadoop:%Basic%Concepts%

    ! The"Hadoop"project"and"Hadoop"components"! The"Hadoop"Distributed"File"System"(HDFS)"! Hands/On"Exercise:"Using"HDFS"! How"MapReduce"works"! Hands#On%Exercise:%Running%a%MapReduce%Job%! How"a"Hadoop"cluster"operates"!Other"Hadoop"ecosystem"components"! Conclusion"


!In this Hands-On Exercise, you will run a MapReduce job on your pseudo-distributed Hadoop cluster

!Please refer to the Hands-On Exercise Manual

Hands-On Exercise: Running A MapReduce Job


    Chapter"Topics"

    Introduc/on%to%Apache%Hadoop%and%its%Ecosystem%Hadoop:%Basic%Concepts%

    ! The"Hadoop"project"and"Hadoop"components"! The"Hadoop"Distributed"File"System"(HDFS)"! Hands/On"Exercise:"Using"HDFS"! How"MapReduce"works"! Hands/On"Exercise:"Running"a"MapReduce"Job"! How%a%Hadoop%cluster%operates%!Other"Hadoop"ecosystem"components"! Conclusion"


!Cluster installation is usually performed by the system administrator, and is outside the scope of this course
 Cloudera offers a training course for System Administrators specifically aimed at those responsible for commissioning and maintaining Hadoop clusters

!However, it's very useful to understand how the component parts of the Hadoop cluster work together

!Typically, a developer will configure their machine to run in pseudo-distributed mode
 This effectively creates a single-machine cluster
 All five Hadoop daemons are running on the same machine
 Very useful for testing code before it is deployed to the real cluster

Installing A Hadoop Cluster


!Easiest way to download and install Hadoop, either for a full cluster or in pseudo-distributed mode, is by using Cloudera's Distribution, including Apache Hadoop (CDH)
 Vanilla Hadoop plus many patches, backports, bugfixes
 Supplied as a Debian package (for Linux distributions such as Ubuntu), an RPM (for CentOS/RedHat Enterprise Linux), and as a tarball
 Full documentation available at http://cloudera.com/

Installing A Hadoop Cluster (cont'd)


!Hadoop is comprised of five separate daemons

!NameNode
 Holds the metadata for HDFS

!Secondary NameNode
 Performs housekeeping functions for the NameNode
 Is not a backup or hot standby for the NameNode!

!DataNode
 Stores actual HDFS data blocks

!JobTracker
 Manages MapReduce jobs, distributes individual tasks to machines running the TaskTracker

!TaskTracker
 Instantiates and monitors individual Map and Reduce tasks

The Five Hadoop Daemons


!Each daemon runs in its own Java Virtual Machine (JVM)

!No node on a real cluster will run all five daemons
 Although this is technically possible

!We can consider nodes to be in two different categories:
 Master Nodes
  Run the NameNode, Secondary NameNode, JobTracker daemons
  Only one of each of these daemons runs on the cluster
 Slave Nodes
  Run the DataNode and TaskTracker daemons
  A slave node will run both of these daemons

The Five Hadoop Daemons (cont'd)


!On very small clusters, the NameNode, JobTracker and Secondary NameNode daemons can all reside on a single machine
 It is typical to put them on separate machines as the cluster grows beyond 20-30 nodes

!Each daemon runs in a separate Java Virtual Machine (JVM)

Basic Cluster Configuration


!When a client submits a job, its configuration information is packaged into an XML file

!This file, along with the .jar file containing the actual program code, is handed to the JobTracker
 The JobTracker then parcels out individual tasks to TaskTracker nodes
 When a TaskTracker receives a request to run a task, it instantiates a separate JVM for that task
 TaskTracker nodes can be configured to run multiple tasks at the same time
  If the node has enough processing power and memory

Submitting A Job


!The intermediate data is held on the TaskTracker's local disk

!As Reducers start up, the intermediate data is distributed across the network to the Reducers

!Reducers write their final output to HDFS

!Once the job has completed, the TaskTracker can delete the intermediate data from its local disk
 Note that the intermediate data is not deleted until the entire job completes

Submitting A Job (cont'd)


    Chapter"Topics"

    Introduc/on%to%Apache%Hadoop%and%its%Ecosystem%Hadoop:%Basic%Concepts%

    ! The"Hadoop"project"and"Hadoop"components"! The"Hadoop"Distributed"File"System"(HDFS)"! Hands/On"Exercise:"Using"HDFS"! How"MapReduce"works"! Hands/On"Exercise:"Running"a"MapReduce"Job"! How"a"Hadoop"cluster"operates"!Other%Hadoop%ecosystem%components%! Conclusion"


!The term "Hadoop core" refers to HDFS and MapReduce

!Many other projects exist which use Hadoop core
 Either both HDFS and MapReduce, or just HDFS

!Most are Apache projects or Apache Incubator projects
 Some others are not hosted by the Apache Software Foundation
 These are often hosted on GitHub or a similar repository

!We will investigate many of these projects later in the course

!Following is an introduction to some of the most significant projects

Other Ecosystem Projects: Introduction


!Hive is an abstraction on top of MapReduce

!Allows users to query data in the Hadoop cluster without knowing Java or MapReduce

!Uses the HiveQL language
 Very similar to SQL

!The Hive Interpreter runs on a client machine
 Turns HiveQL queries into MapReduce jobs
 Submits those jobs to the cluster

!Note: this does not turn the cluster into a relational database server!
 It is still simply running MapReduce jobs
 Those jobs are created by the Hive Interpreter

Hive


!Sample Hive query:

SELECT stock.product, SUM(orders.purchases)
FROM stock JOIN orders
ON (stock.id = orders.stock_id)
WHERE orders.quarter = 'Q1'
GROUP BY stock.product;

!We will investigate Hive in greater detail later in the course

Hive (cont'd)


!Pig is an alternative abstraction on top of MapReduce

!Uses a dataflow scripting language
 Called PigLatin

!The Pig interpreter runs on the client machine
 Takes the PigLatin script and turns it into a series of MapReduce jobs
 Submits those jobs to the cluster

!As with Hive, nothing magical happens on the cluster
 It is still simply running MapReduce jobs

Pig


!Sample Pig script:

stock = LOAD '/user/fred/stock' AS (id, item);
orders = LOAD '/user/fred/orders' AS (id, cost);
grpd = GROUP orders BY id;
totals = FOREACH grpd GENERATE group, SUM(orders.cost) AS t;
result = JOIN stock BY id, totals BY group;
DUMP result;

!We will investigate Pig in more detail later in the course

Pig (cont'd)


!Impala is an open-source project created by Cloudera

!Facilitates real-time queries of data in HDFS

!Does not use MapReduce
 Uses its own daemon, running on each slave node
 Queries data stored in HDFS

!Uses a language very similar to HiveQL
 But produces results much, much faster
 Typically between five and 40 times faster than Hive

!Currently in beta
 Although being used in production by multiple organizations

Impala


!Flume provides a method to import data into HDFS as it is generated
 Instead of batch-processing the data later
 For example, log files from a Web server

!Sqoop provides a method to import data from tables in a relational database into HDFS
 Does this very efficiently via a Map-only MapReduce job
 Can also go the other way
  Populate database tables from files in HDFS

!We will investigate Flume and Sqoop later in the course

Flume and Sqoop


!Oozie allows developers to create a workflow of MapReduce jobs
 Including dependencies between jobs

!The Oozie server submits the jobs to the cluster in the correct sequence

!We will investigate Oozie later in the course

Oozie


!HBase is "the Hadoop database"

!A NoSQL datastore

!Can store massive amounts of data
 Gigabytes, terabytes, and even petabytes of data in a table

!Scales to provide very high write throughput
 Hundreds of thousands of inserts per second

!Copes well with sparse data
 Tables can have many thousands of columns
  Even if most columns are empty for any given row

!Has a very constrained access model
 Insert a row, retrieve a row, do a full or partial table scan
 Only one column (the "row key") is indexed

HBase


    HBase"vs"TradiDonal"RDBMSs"

    RDBMS% HBase%Data%layout% Row/oriented" Column/oriented"

    Transac/ons% Yes" Single"row"only"

    Query%language% SQL" get/put/scan"

    Security% AuthenDcaDon/AuthorizaDon" Kerberos"

    Indexes% On"arbitrary"columns" Row/key"only"

    Max%data%size% TBs" PB+"

    Read/write%throughput%limits%

    1000s"queries/second" Millions"of"queries/second"


    Chapter"Topics"

    Introduc/on%to%Apache%Hadoop%and%its%Ecosystem%Hadoop:%Basic%Concepts%

    ! The"Hadoop"project"and"Hadoop"components"! The"Hadoop"Distributed"File"System"(HDFS)"! Hands/On"Exercise:"Using"HDFS"! How"MapReduce"works"! Hands/On"Exercise:"Running"a"MapReduce"Job"! How"a"Hadoop"cluster"operates"!Other"Hadoop"ecosystem"components"! Conclusion%


In this chapter you have learned

!What Hadoop is

!What features the Hadoop Distributed File System (HDFS) provides

!The concepts behind MapReduce

!How a Hadoop cluster operates

!What other Hadoop Ecosystem projects exist

Conclusion


    WriAng"a"MapReduce"Program"Chapter"4"


    Course"Chapters"

    ! IntroducAon"

    !Wri*ng%a%MapReduce%Program%! Unit"TesAng"MapReduce"Programs"! Delving"Deeper"into"the"Hadoop"API"! PracAcal"Development"Tips"and"Techniques"! Data"Input"and"Output"! Common"MapReduce"Algorithms"! Joining"Data"Sets"in"MapReduce"Jobs"

    ! Conclusion"! Appendix:"Cloudera"Enterprise"! Appendix:"Graph"ManipulaAon"in"MapReduce"

    ! IntegraAng"Hadoop"into"the"Enterprise"Workow"!Machine"Learning"and"Mahout"! An"IntroducAon"to"Hive"and"Pig"! An"IntroducAon"to"Oozie"

    IntroducAon"to"Apache"Hadoop""and"its"Ecosystem"

    Basic%Programming%with%the%Hadoop%Core%API%

    Problem"Solving"with"MapReduce"

    Course"Conclusion"and"Appendices"

    Course"IntroducAon"

    The"Hadoop"Ecosystem"

    ! The"MoAvaAon"for"Hadoop"! Hadoop:"Basic"Concepts"


In this chapter you will learn

!The MapReduce flow

!Basic MapReduce API concepts

!How to write MapReduce drivers, Mappers, and Reducers in Java

!How to write Mappers and Reducers in other languages using the Streaming API

!How to speed up your Hadoop development by using Eclipse

!The differences between the old and new MapReduce APIs

Writing a MapReduce Program


    Chapter"Topics"

    Basic%Programming%with%the%%Hadoop%Core%API%Wri*ng%a%MapReduce%Program%

    ! The%MapReduce%ow%! Basic"MapReduce"API"concepts"!WriAng"MapReduce"applicaAons"in"Java" The"driver" The"Mapper" The"Reducer"

    !WriAng"Mappers"and"Reducers"in"other"languages"with"the"Streaming"API"! Speeding"up"Hadoop"development"by"using"Eclipse"! Hands/On"Exercise:"WriAng"a"MapReduce"Program"! Dierences"between"the"Old"and"New"MapReduce"APIs"! Conclusion"


!In the previous chapter, you ran a sample MapReduce program
 WordCount, which counted the number of occurrences of each unique word in a set of files

!In this chapter, we will examine the code for WordCount
 This will demonstrate the Hadoop API

!We will also investigate Hadoop Streaming
 Allows you to write MapReduce programs in (virtually) any language

A Sample MapReduce Program: Introduction


!On the following slides we show the MapReduce flow

!Each of the portions (RecordReader, Mapper, Partitioner, Reducer, etc.) can be created by the developer

!We will cover each of these as we move through the course

!You will always create at least a Mapper, Reducer, and driver code
 Those are the portions we will investigate in this chapter

The MapReduce Flow: Introduction


    The"MapReduce"Flow:"The"Mapper"


    The"MapReduce"Flow:"Shue"and"Sort"


    The"MapReduce"Flow:"Reducers"to"Outputs"


    Chapter"Topics"

    Basic%Programming%with%the%%Hadoop%Core%API%Wri*ng%a%MapReduce%Program%

    ! The"MapReduce"ow"! Basic%MapReduce%API%concepts%!WriAng"MapReduce"applicaAons"in"Java" The"driver" The"Mapper" The"Reducer"

    !WriAng"Mappers"and"Reducers"in"other"languages"with"the"Streaming"API"! Speeding"up"Hadoop"development"by"using"Eclipse"! Hands/On"Exercise:"WriAng"a"MapReduce"Program"! Dierences"between"the"Old"and"New"MapReduce"APIs"! Conclusion"


!To investigate the API, we will dissect the WordCount program you ran in the previous chapter

!This consists of three portions
 The driver code
  Code that runs on the client to configure and submit the job
 The Mapper
 The Reducer

!Before we look at the code, we need to cover some basic Hadoop API concepts

Our MapReduce Program: WordCount


!The data passed to the Mapper is specified by an InputFormat
 Specified in the driver code
 Defines the location of the input data
  A file or directory, for example
 Determines how to split the input data into input splits
  Each Mapper deals with a single input split
 InputFormat is a factory for RecordReader objects to extract (key, value) records from the input source

Getting Data to the Mapper


    Gecng"Data"to"the"Mapper"(contd)"


!FileInputFormat
 The base class used for all file-based InputFormats

!TextInputFormat
 The default
 Treats each \n-terminated line of a file as a value
 Key is the byte offset within the file of that line

!KeyValueTextInputFormat
 Maps \n-terminated lines as "key SEP value"
 By default, separator is a tab

!SequenceFileInputFormat
 Binary file of (key, value) pairs with some additional metadata

!SequenceFileAsTextInputFormat
 Similar, but maps (key.toString(), value.toString())

Some Standard InputFormats


!Keys and values in Hadoop are Objects

!Values are objects which implement Writable

!Keys are objects which implement WritableComparable

Keys and Values are Objects


!Hadoop defines its own "box classes" for strings, integers and so on
 IntWritable for ints
 LongWritable for longs
 FloatWritable for floats
 DoubleWritable for doubles
 Text for strings
 Etc.

!The Writable interface makes serialization quick and easy for Hadoop

!Any value's type must implement the Writable interface

What is Writable?


!A WritableComparable is a Writable which is also Comparable
 Two WritableComparables can be compared against each other to determine their order
 Keys must be WritableComparables because they are passed to the Reducer in sorted order
 We will talk more about WritableComparables later

!Note that despite their names, all Hadoop box classes implement both Writable and WritableComparable
 For example, IntWritable is actually a WritableComparable

What is WritableComparable?


    Chapter"Topics"

    Basic%Programming%with%the%%Hadoop%Core%API%Wri*ng%a%MapReduce%Program%

    ! The"MapReduce"ow"! Basic"MapReduce"API"concepts"!Wri*ng%MapReduce%applica*ons%in%Java% The%driver% The"Mapper" The"Reducer"

    !WriAng"Mappers"and"Reducers"in"other"languages"with"the"Streaming"API"! Speeding"up"Hadoop"development"by"using"Eclipse"! Hands/On"Exercise:"WriAng"a"MapReduce"Program"! Dierences"between"the"Old"and"New"MapReduce"APIs"! Conclusion"


!The driver code runs on the client machine

!It configures the job, then submits it to the cluster

The Driver Code: Introduction


    The"Driver:"Complete"Code"

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class WordCount {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.printf("Usage: WordCount <input dir> <output dir>\n");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);


    The"Driver:"Complete"Code"(contd)"

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}


    The"Driver:"Import"Statements"

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

    You"will"typically"import"these"classes"into"every"MapReduce"job"you"write."We"will"omit"the"import statements"in"future"slides"for"brevity.""


    The"Driver:"Main"Code"

public class WordCount {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.printf("Usage: WordCount <input dir> <output dir>\n");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}


    The"Driver"Class:"main"Method"

public class WordCount {

  public static void main(String[] args) throws Exception {
    ...
  }
}

    The main method accepts two command-line arguments: the input and output directories.


    Sanity Checking the Job's Invocation

    if (args.length != 2) {
      System.out.printf("Usage: WordCount <input dir> <output dir>\n");
      System.exit(-1);
    }

    The first step is to ensure that we have been given two command-line arguments. If not, print a help message and exit.


    Configuring the Job With the Job Object

    Job job = new Job();
    job.setJarByClass(WordCount.class);

    To configure the job, create a new Job object and specify the class which will be called to run the job.


    ! The Job class allows you to set configuration options for your MapReduce job
      – The classes to be used for your Mapper and Reducer
      – The input and output directories
      – Many other options

    ! Any options not explicitly set in your driver code will be read from your Hadoop configuration files
      – Usually located in /etc/hadoop/conf

    ! Any options not specified in your configuration files will receive Hadoop's default values

    ! You can also use the Job object to submit the job, control its execution, and query its state

    Creating a New Job Object


    Naming the Job

    job.setJobName("Word Count");

    Give the job a meaningful name.


    Specifying Input and Output Directories

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    Next, specify the input directory from which data will be read, and the output directory to which final output will be written.


    ! The default InputFormat (TextInputFormat) will be used unless you specify otherwise

    ! To use an InputFormat other than the default, use e.g. job.setInputFormatClass(KeyValueTextInputFormat.class)

    Specifying the InputFormat


    ! By default, FileInputFormat.setInputPaths() will read all files from a specified directory and send them to Mappers
      – Exceptions: items whose names begin with a period (.) or underscore (_)
      – Globs can be specified to restrict input
      – For example, /2010/*/01/*

    ! Alternatively, FileInputFormat.addInputPath() can be called multiple times, specifying a single file or directory each time

    ! More advanced filtering can be performed by implementing a PathFilter
      – Interface with a method named accept
      – Takes a path to a file, returns true or false depending on whether or not the file should be processed

    Determining Which Files To Read
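The accept method described above is just a predicate on a path. A minimal sketch of that predicate logic, using a plain-Java stand-in so it runs without a Hadoop dependency (the real interface is org.apache.hadoop.fs.PathFilter, whose accept takes a Hadoop Path object rather than a String; the class names here are illustrative):

```java
// Stand-in mirroring the shape of org.apache.hadoop.fs.PathFilter.
interface SimplePathFilter {
  boolean accept(String path);
}

// Illustrative filter: process only files whose names end in ".log".
class LogFileFilter implements SimplePathFilter {
  public boolean accept(String path) {
    return path.endsWith(".log");
  }
}

public class PathFilterSketch {
  public static void main(String[] args) {
    SimplePathFilter filter = new LogFileFilter();
    System.out.println(filter.accept("/2010/05/01/app.log"));  // true
    System.out.println(filter.accept("/2010/05/01/app.tmp"));  // false
  }
}
```

In a real driver, a filter implementing Hadoop's PathFilter interface is registered with FileInputFormat.setInputPathFilter().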


    ! FileOutputFormat.setOutputPath() specifies the directory to which the Reducers will write their final output

    ! The driver can also specify the format of the output data
      – Default is a plain text file
      – Could be explicitly written as job.setOutputFormatClass(TextOutputFormat.class)

    ! We will discuss OutputFormats in more depth in a later chapter

    Specifying Final Output With OutputFormat


    Specify the Classes for Mapper and Reducer

    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);

    Give the Job object information about which classes are to be instantiated as the Mapper and Reducer.


    Specify the Intermediate Data Types

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    Specify the types for the intermediate output key and value produced by the Mapper.


    Specify the Final Output Data Types

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    Specify the types for the Reducer's output key and value.


    Running the Job

    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);

    Start the job, wait for it to complete, and exit with a return code.


    ! There are two ways to run your MapReduce job:
      – job.waitForCompletion()
        – Blocks (waits for the job to complete before continuing)
      – job.submit()
        – Does not block (driver code continues as the job is running)

    ! The Job determines the proper division of input data into InputSplits, and then sends the job information to the JobTracker daemon on the cluster

    Running the Job (cont'd)


    Reprise: Driver Code

    public class WordCount {

      public static void main(String[] args) throws Exception {

        if (args.length != 2) {
          System.out.printf("Usage: WordCount <input dir> <output dir>\n");
          System.exit(-1);
        }

        Job job = new Job();
        job.setJarByClass(WordCount.class);
        job.setJobName("Word Count");

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
      }
    }


    Chapter Topics

    Basic Programming with the Hadoop Core API
    Writing a MapReduce Program

    ! The MapReduce flow
    ! Basic MapReduce API concepts
    ! Writing MapReduce applications in Java
      – The driver
      – The Mapper
      – The Reducer
    ! Writing Mappers and Reducers in other languages with the Streaming API
    ! Speeding up Hadoop development by using Eclipse
    ! Hands-On Exercise: Writing a MapReduce Program
    ! Differences between the Old and New MapReduce APIs
    ! Conclusion


    The Mapper: Complete Code

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

      @Override
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {

        String line = value.toString();
        for (String word : line.split("\\W+")) {
          if (word.length() > 0) {
            context.write(new Text(word), new IntWritable(1));
          }
        }
      }
    }


    The Mapper: import Statements

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    You will typically import java.io.IOException, and the org.apache.hadoop classes shown, in every Mapper you write. We will omit the import statements in future slides for brevity.


    The Mapper: Main Code

    public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

      @Override
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {

        String line = value.toString();
        for (String word : line.split("\\W+")) {
          if (word.length() > 0) {
            context.write(new Text(word), new IntWritable(1));
          }
        }
      }
    }


    public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      // ...
    }

    The Mapper: Main Code (cont'd)

    Your Mapper class should extend the Mapper class. The Mapper class expects four generics, which define the types of the input and output key/value pairs. The first two parameters define the input key and value types; the second two define the output key and value types.
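The role of the four type parameters can be sketched with a toy stand-in in plain Java (the interface and class names here are illustrative, not Hadoop's API; output pairs are collected in a list rather than written to a Context):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy stand-in for Hadoop's Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.
interface SimpleMapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  void map(KEYIN key, VALUEIN value, List<Map.Entry<KEYOUT, VALUEOUT>> output);
}

// For WordCount: input key = byte offset (Long), input value = line (String),
// output key = word (String), output value = count (Integer).
class WordMapperSketch implements SimpleMapper<Long, String, String, Integer> {
  public void map(Long key, String value, List<Map.Entry<String, Integer>> output) {
    for (String word : value.split("\\W+")) {
      if (word.length() > 0) {
        output.add(Map.entry(word, 1));
      }
    }
  }
}

public class MapperGenericsDemo {
  public static void main(String[] args) {
    List<Map.Entry<String, Integer>> output = new ArrayList<>();
    new WordMapperSketch().map(0L, "the cat", output);
    System.out.println(output.size());           // 2
    System.out.println(output.get(0).getKey());  // the
  }
}
```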


    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // ...
    }

    The map Method

    The map method's signature looks like this. It will be passed a key, a value, and a Context object. The Context is used to write the intermediate data. It also contains information about the job's configuration (see later).


    String line = value.toString();

    The map Method: Processing the Line

    value is a Text object, so we retrieve the string it contains.


    for (String word : line.split("\\W+")) {
      if (word.length() > 0) {
        // ...
      }
    }

    The map Method: Processing the Line (cont'd)

    We split the string up into words using a regular expression with non-alphanumeric characters as the delimiter, and then loop through the words.
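The split behavior, including the empty leading token that motivates the word.length() > 0 check, can be tried in plain Java with no Hadoop dependency (the class name is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
  // Tokenize a line the same way the Mapper does: split on runs of
  // non-alphanumeric characters, discarding any empty tokens.
  public static List<String> tokens(String line) {
    List<String> result = new ArrayList<>();
    for (String word : line.split("\\W+")) {
      if (word.length() > 0) {
        result.add(word);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(tokens("The cat sat, then ran!"));
    // [The, cat, sat, then, ran]

    // A line beginning with punctuation yields an empty first token from
    // split(), which is exactly what the length check filters out:
    System.out.println(tokens("...hello"));  // [hello]
  }
}
```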


    context.write(new Text(word), new IntWritable(1));

    Outputting Intermediate Data

    To emit a (key, value) pair, we call the write method of our Context object. The key will be the word itself; the value will be the number 1. Recall that the output key must be a WritableComparable, and the value must be a Writable.


    Reprise: The Map Method

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {

      String line = value.toString();
      for (String word : line.split("\\W+")) {
        if (word.length() > 0) {
          context.write(new Text(word), new IntWritable(1));
        }
      }
    }


    Chapter Topics

    Basic Programming with the Hadoop Core API
    Writing a MapReduce Program

    ! The MapReduce flow
    ! Basic MapReduce API concepts
    ! Writing MapReduce applications in Java
      – The driver
      – The Mapper
      – The Reducer
    ! Writing Mappers and Reducers in other languages with the Streaming API
    ! Speeding up Hadoop development by using Eclipse
    ! Hands-On Exercise: Writing a MapReduce Program
    ! Differences between the Old and New MapReduce APIs
    ! Conclusion


    The Reducer: Complete Code

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {

        int wordCount = 0;
        for (IntWritable value : values) {
          wordCount += value.get();
        }

        context.write(key, new IntWritable(wordCount));
      }
    }


    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    The Reducer: Import Statements

    As with the Mapper, you will typically import java.io.IOException, and the org.apache.hadoop classes shown, in every Reducer you write. We will omit the import statements in future slides for brevity.


    The Reducer: Main Code

    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {

        int wordCount = 0;
        for (IntWritable value : values) {
          wordCount += value.get();
        }

        context.write(key, new IntWritable(wordCount));
      }
    }


    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      // ...
    }

    The Reducer: Main Code (cont'd)

    Your Reducer class should extend Reducer. The Reducer class expects four generics, which define the types of the input and output key/value pairs. The first two parameters define the intermediate key and value types; the second two define the final output key and value types. The keys are WritableComparables; the values are Writables.
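To see how the Mapper's (word, 1) pairs and the Reducer's summing loop fit together, the whole flow can be simulated in plain Java (a HashMap of lists stands in for Hadoop's shuffle-and-sort phase, and plain ints stand in for IntWritable; this is a sketch of the data flow, not Hadoop's actual machinery):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSimulation {
  public static Map<String, Integer> countWords(String[] lines) {
    // Map phase + shuffle stand-in: group an emitted 1 under each word.
    Map<String, List<Integer>> grouped = new HashMap<>();
    for (String line : lines) {
      for (String word : line.split("\\W+")) {
        if (word.length() > 0) {
          grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
      }
    }

    // Reduce phase: the same summing loop as SumReducer, per key.
    Map<String, Integer> counts = new HashMap<>();
    for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
      int wordCount = 0;
      for (int value : entry.getValue()) {
        wordCount += value;
      }
      counts.put(entry.getKey(), wordCount);
    }
    return counts;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts =
        countWords(new String[] { "the cat sat", "the cat ran" });
    System.out.println(counts.get("the"));  // 2
    System.out.println(counts.get("sat"));  // 1
  }
}
```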


    public class SumReducer extends Reducer { @Override public void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException { int wordCount