cloudera_developer_training.pdf
TRANSCRIPT
-
Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
Cloudera Developer Training for Apache Hadoop
201301"
-
Introduction (Chapter 1)
-
Course Chapters

Course Introduction
- Introduction (this chapter)

Introduction to Apache Hadoop and its Ecosystem
- The Motivation for Hadoop
- Hadoop: Basic Concepts
- The Hadoop Ecosystem

Basic Programming with the Hadoop Core API
- Writing a MapReduce Program
- Unit Testing MapReduce Programs
- Delving Deeper into the Hadoop API
- Practical Development Tips and Techniques
- Data Input and Output

Problem Solving with MapReduce
- Common MapReduce Algorithms
- Joining Data Sets in MapReduce Jobs
- Integrating Hadoop into the Enterprise Workflow
- Machine Learning and Mahout
- An Introduction to Hive and Pig
- An Introduction to Oozie

Course Conclusion and Appendices
- Conclusion
- Appendix: Cloudera Enterprise
- Appendix: Graph Manipulation in MapReduce
-
Chapter Topics

Course Introduction: Introduction
- About this course (this section)
- About Cloudera
- Course logistics
-
During this course, you will learn:
- The core technologies of Hadoop
- How HDFS and MapReduce work
- How to develop MapReduce applications
- How to unit test MapReduce applications
- How to use MapReduce combiners, partitioners, and the distributed cache
- Best practices for developing and debugging MapReduce applications
- How to implement data input and output in MapReduce applications
Course Objectives
-
- Algorithms for common MapReduce tasks
- How to join data sets in MapReduce
- How Hadoop integrates into the data center
- How to use Mahout's machine learning algorithms
- How Hive and Pig can be used for rapid application development
- How to create large workflows using Oozie
Course Objectives (cont'd)
-
Chapter Topics

Course Introduction: Introduction
- About this course
- About Cloudera (this section)
- Course logistics
-
- Founded by leading experts on Hadoop from Facebook, Google, Oracle, and Yahoo
- Provides consulting and training services for Hadoop users
- Staff includes committers to virtually all Hadoop projects
- Many authors of industry-standard books on Apache Hadoop projects
  - Lars George, Tom White, Eric Sammer, etc.
About Cloudera
-
- Cloudera's Distribution, including Apache Hadoop (CDH)
  - A set of easy-to-install packages built from the Apache Hadoop core repository, integrated with several additional open source Hadoop ecosystem projects
  - Includes a stable version of Hadoop, plus critical bug fixes and solid new features from the development version
  - 100% open source
- Cloudera Manager, Free Edition
  - The easiest way to deploy a Hadoop cluster
  - Automates installation of Hadoop software
  - Installation, monitoring, and configuration are performed from a central machine
  - Manages up to 50 nodes
  - Completely free
Cloudera Software
-
- Cloudera Enterprise Core
  - Complete package of software and support
  - Built on top of CDH
  - Includes full version of Cloudera Manager
    - Install, manage, and maintain a cluster of any size
    - LDAP integration
    - Resource consumption tracking
    - Proactive health checks
    - Alerting
    - Configuration change audit trails
    - And more
- Cloudera Enterprise RTD
  - Includes support for Apache HBase
Cloudera Enterprise
-
- Provides consultancy and support services to many key users of Hadoop
  - Including eBay, JPMorganChase, Experian, Groupon, Morgan Stanley, Nokia, Orbitz, National Cancer Institute, RIM, The Walt Disney Company
- Solutions Architects are experts in Hadoop and related technologies
  - Many are committers to the Apache Hadoop and ecosystem projects
- Provides training in key areas of Hadoop administration and development
  - Courses include System Administrator training, Developer training, Hive and Pig training, HBase training, Essentials for Managers
  - Custom course development available
  - Both public and on-site training available
Cloudera Services
-
Chapter Topics

Course Introduction: Introduction
- About this course
- About Cloudera
- Course logistics (this section)
-
- Course start and end times
- Lunch
- Breaks
- Restrooms
- Can I come in early/stay late?
- Certification
Logistics
-
- About your instructor
- About you
  - Experience with Hadoop?
  - Experience as a developer?
  - What programming languages do you use?
  - Expectations from the course?
Introductions
-
The Motivation for Hadoop (Chapter 2)
-
Course Chapters

Course Introduction
- Introduction

Introduction to Apache Hadoop and its Ecosystem
- The Motivation for Hadoop (this chapter)
- Hadoop: Basic Concepts
- The Hadoop Ecosystem

Basic Programming with the Hadoop Core API
- Writing a MapReduce Program
- Unit Testing MapReduce Programs
- Delving Deeper into the Hadoop API
- Practical Development Tips and Techniques
- Data Input and Output

Problem Solving with MapReduce
- Common MapReduce Algorithms
- Joining Data Sets in MapReduce Jobs
- Integrating Hadoop into the Enterprise Workflow
- Machine Learning and Mahout
- An Introduction to Hive and Pig
- An Introduction to Oozie

Course Conclusion and Appendices
- Conclusion
- Appendix: Cloudera Enterprise
- Appendix: Graph Manipulation in MapReduce
-
In this chapter you will learn
- What problems exist with traditional large-scale computing systems
- What requirements an alternative approach should have
- How Hadoop addresses those requirements
The Motivation for Hadoop
-
Chapter Topics

Introduction to Apache Hadoop and its Ecosystem: The Motivation for Hadoop
- Problems with traditional large-scale systems (this section)
- Requirements for a new approach
- Introducing Hadoop
- Conclusion
-
- Traditionally, computation has been processor-bound
  - Relatively small amounts of data
  - Significant amount of complex processing performed on that data
- For decades, the primary push was to increase the computing power of a single machine
  - Faster processor, more RAM
- Distributed systems evolved to allow developers to use multiple machines for a single job
  - MPI, PVM, Condor
Traditional Large-Scale Computation

MPI: Message Passing Interface; PVM: Parallel Virtual Machine
-
- Programming for traditional distributed systems is complex
  - Data exchange requires synchronization
  - Finite bandwidth is available
  - Temporal dependencies are complicated
  - It is difficult to deal with partial failures of the system
- Ken Arnold, CORBA designer: "Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure."
  - Developers spend more time designing for failure than they do actually working on the problem itself
Distributed Systems: Problems

CORBA: Common Object Request Broker Architecture
-
- Typically, data for a distributed system is stored on a SAN
- At compute time, data is copied to the compute nodes
- Fine for relatively limited amounts of data
Distributed Systems: Data Storage

SAN: Storage Area Network
-
- Modern systems have to deal with far more data than was the case in the past
  - Organizations are generating huge amounts of data
  - That data has inherent value, and cannot be discarded
- Examples
  - Facebook: over 70PB of data
  - eBay: over 5PB of data
- Many organizations are generating data at a rate of terabytes per day
The Data-Driven World
-
- Moore's Law has held firm for over 40 years
  - Processing power doubles every two years
  - Processing speed is no longer the problem
- Getting the data to the processors becomes the bottleneck
- Quick calculation
  - Typical disk data transfer rate: 75MB/sec
  - Time taken to transfer 100GB of data to the processor: approx. 22 minutes!
    - Assuming sustained reads
    - Actual time will be worse, since most servers have less than 100GB of RAM available
- A new approach is needed
Data Becomes the Bottleneck
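The 22-minute figure above is easy to verify with back-of-the-envelope arithmetic (a plain calculation for illustration, not Hadoop code):

```python
# Verify the transfer-time claim: 100GB read sequentially at 75MB/sec.
disk_rate_mb_per_sec = 75        # typical sustained disk transfer rate
data_size_mb = 100 * 1000        # 100GB, using 1GB = 1000MB

seconds = data_size_mb / disk_rate_mb_per_sec
minutes = seconds / 60
print(f"{minutes:.1f} minutes")  # roughly 22 minutes
```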
-
Chapter Topics

Introduction to Apache Hadoop and its Ecosystem: The Motivation for Hadoop
- Problems with traditional large-scale systems
- Requirements for a new approach (this section)
- Introducing Hadoop
- Conclusion
-
- The system must support partial failure
  - Failure of a component should result in a graceful degradation of application performance
  - Not complete failure of the entire system
Partial Failure Support
-
- If a component of the system fails, its workload should be assumed by still-functioning units in the system
  - Failure should not result in the loss of any data
Data Recoverability
-
- If a component of the system fails and then recovers, it should be able to rejoin the system
  - Without requiring a full restart of the entire system
Component Recovery
-
- Component failures during execution of a job should not affect the outcome of the job
Consistency
-
- Adding load to the system should result in a graceful decline in performance of individual jobs
  - Not failure of the system
- Increasing resources should support a proportional increase in load capacity
Scalability
-
Chapter Topics

Introduction to Apache Hadoop and its Ecosystem: The Motivation for Hadoop
- Problems with traditional large-scale systems
- Requirements for a new approach
- Introducing Hadoop (this section)
- Conclusion
-
- Hadoop is based on work done by Google in the late 1990s/early 2000s
  - Specifically, on papers describing the Google File System (GFS), published in 2003, and MapReduce, published in 2004
- This work takes a radical new approach to the problem of distributed computing
  - Meets all the requirements we have for reliability and scalability
- Core concept: distribute the data as it is initially stored in the system
  - Individual nodes can work on data local to those nodes
  - No data transfer over the network is required for initial processing
Hadoop's History
-
- Applications are written in high-level code
  - Developers need not worry about network programming, temporal dependencies, or low-level infrastructure
- Nodes talk to each other as little as possible
  - Developers should not write code which communicates between nodes
  - "Shared nothing" architecture
- Data is spread among machines in advance
  - Computation happens where the data is stored, wherever possible
  - Data is replicated multiple times on the system for increased availability and reliability
Core Hadoop Concepts
-
- When data is loaded into the system, it is split into blocks
  - Typically 64MB or 128MB
- Map tasks (the first part of the MapReduce system) work on relatively small portions of data
  - Typically a single block
- A master program allocates work to nodes such that a Map task will work on a block of data stored locally on that node whenever possible
  - Many nodes work in parallel, each on their own part of the overall dataset
Hadoop: Very High-Level Overview
-
- If a node fails, the master will detect that failure and re-assign the work to a different node on the system
- Restarting a task does not require communication with nodes working on other portions of the data
- If a failed node restarts, it is automatically added back to the system and assigned new tasks
- If a node appears to be running slowly, the master can redundantly execute another instance of the same task
  - Results from the first to finish will be used
  - Known as "speculative execution"
Fault Tolerance
-
Chapter Topics

Introduction to Apache Hadoop and its Ecosystem: The Motivation for Hadoop
- Problems with traditional large-scale systems
- Requirements for a new approach
- Introducing Hadoop
- Conclusion (this section)
-
In this chapter you have learned
- What problems exist with traditional large-scale computing systems
- What requirements an alternative approach should have
- How Hadoop addresses those requirements
Conclusion
-
Hadoop: Basic Concepts (Chapter 3)
-
Course Chapters

Course Introduction
- Introduction

Introduction to Apache Hadoop and its Ecosystem
- The Motivation for Hadoop
- Hadoop: Basic Concepts (this chapter)
- The Hadoop Ecosystem

Basic Programming with the Hadoop Core API
- Writing a MapReduce Program
- Unit Testing MapReduce Programs
- Delving Deeper into the Hadoop API
- Practical Development Tips and Techniques
- Data Input and Output

Problem Solving with MapReduce
- Common MapReduce Algorithms
- Joining Data Sets in MapReduce Jobs
- Integrating Hadoop into the Enterprise Workflow
- Machine Learning and Mahout
- An Introduction to Hive and Pig
- An Introduction to Oozie

Course Conclusion and Appendices
- Conclusion
- Appendix: Cloudera Enterprise
- Appendix: Graph Manipulation in MapReduce
-
In this chapter you will learn
- What Hadoop is
- What features the Hadoop Distributed File System (HDFS) provides
- The concepts behind MapReduce
- How a Hadoop cluster operates
- What other Hadoop Ecosystem projects exist
Hadoop: Basic Concepts
-
Chapter Topics

Introduction to Apache Hadoop and its Ecosystem: Hadoop: Basic Concepts
- The Hadoop project and Hadoop components (this section)
- The Hadoop Distributed File System (HDFS)
- Hands-On Exercise: Using HDFS
- How MapReduce works
- Hands-On Exercise: Running a MapReduce Job
- How a Hadoop cluster operates
- Other Hadoop ecosystem components
- Conclusion
-
- Hadoop is an open-source project overseen by the Apache Software Foundation
- Originally based on papers published by Google in 2003 and 2004
- Hadoop committers work at several different organizations
  - Including Cloudera, Yahoo!, Facebook, LinkedIn
The Hadoop Project
-
- Hadoop consists of two core components
  - The Hadoop Distributed File System (HDFS)
  - MapReduce
- There are many other projects based around core Hadoop
  - Often referred to as the "Hadoop Ecosystem"
  - Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.
  - Many are discussed later in the course
- A set of machines running HDFS and MapReduce is known as a Hadoop Cluster
  - Individual machines are known as nodes
  - A cluster can have as few as one node, or as many as several thousand
  - More nodes = better performance!
Hadoop Components
-
- HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster
- Data is split into blocks and distributed across multiple nodes in the cluster
  - Each block is typically 64MB or 128MB in size
- Each block is replicated multiple times
  - Default is to replicate each block three times
  - Replicas are stored on different nodes
  - This ensures both reliability and availability
Hadoop Components: HDFS
-
- MapReduce is the system used to process data in the Hadoop cluster
- Consists of two phases: Map, and then Reduce
  - Between the two is a stage known as the shuffle and sort
- Each Map task operates on a discrete portion of the overall dataset
  - Typically one HDFS block of data
- After all Maps are complete, the MapReduce system distributes the intermediate data to nodes which perform the Reduce phase
  - Much more on this later!
Hadoop Components: MapReduce
-
Chapter Topics

Introduction to Apache Hadoop and its Ecosystem: Hadoop: Basic Concepts
- The Hadoop project and Hadoop components
- The Hadoop Distributed File System (HDFS) (this section)
- Hands-On Exercise: Using HDFS
- How MapReduce works
- Hands-On Exercise: Running a MapReduce Job
- How a Hadoop cluster operates
- Other Hadoop ecosystem components
- Conclusion
-
- HDFS is a filesystem written in Java
  - Based on Google's GFS
- Sits on top of a native filesystem
  - Such as ext3, ext4, or xfs
- Provides redundant storage for massive amounts of data
  - Using readily-available, industry-standard computers
HDFS Basic Concepts
-
- HDFS performs best with a modest number of large files
  - Millions, rather than billions, of files
  - Each file typically 100MB or more
- Files in HDFS are "write once"
  - No random writes to files are allowed
- HDFS is optimized for large, streaming reads of files
  - Rather than random reads
HDFS Basic Concepts (cont'd)
-
- Files are split into blocks
  - Each block is usually 64MB or 128MB
- Data is distributed across many machines at load time
  - Different blocks from the same file will be stored on different machines
  - This provides for efficient MapReduce processing (see later)
- Blocks are replicated across multiple machines, known as DataNodes
  - Default replication is three-fold
  - Meaning that each block exists on three different machines
- A master node called the NameNode keeps track of which blocks make up a file, and where those blocks are located
  - Known as the metadata
How Files Are Stored
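The block and replication arithmetic above can be sketched in a few lines of plain Python (an illustration of the calculation only; `hdfs_block_count` and `raw_storage_mb` are invented names, not HDFS APIs):

```python
import math

def hdfs_block_count(file_size_mb, block_size_mb=64):
    """Number of blocks a file is split into: each full 64MB (or 128MB)
    chunk is one block, and any final partial chunk is a block too."""
    return math.ceil(file_size_mb / block_size_mb)

def raw_storage_mb(file_size_mb, replication=3):
    """Raw cluster storage consumed: every block is replicated
    (three-fold by default), so usage is file size times replication."""
    return file_size_mb * replication

# A 200MB file with 64MB blocks occupies 4 blocks (3 full + 1 partial)
print(hdfs_block_count(200))   # 4
# and consumes 600MB of raw storage at the default replication factor
print(raw_storage_mb(200))     # 600
```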
-
- NameNode holds metadata for the two files (Foo.txt and Bar.txt)
- DataNodes hold the actual blocks
  - Each block will be 64MB or 128MB in size
  - Each block is replicated three times on the cluster
How Files Are Stored: Example

NameNode metadata:
  Foo.txt: blk_001, blk_002, blk_003
  Bar.txt: blk_004, blk_005

[Diagram: a row of DataNodes, each holding a subset of blk_001 through blk_005, with every block present on three different DataNodes]
-
- The NameNode daemon must be running at all times
  - If the NameNode stops, the cluster becomes inaccessible
  - Your system administrator will take care to ensure that the NameNode hardware is reliable!
- The NameNode holds all of its metadata in RAM for fast access
  - It keeps a record of changes on disk for crash recovery
- A separate daemon known as the Secondary NameNode takes care of some housekeeping tasks for the NameNode
  - Be careful: the Secondary NameNode is not a backup NameNode!
More on the HDFS NameNode
-
- CDH4 introduced High Availability for the NameNode
- Instead of a single NameNode, there are now two
  - An Active NameNode
  - A Standby NameNode
- If the Active NameNode fails, the Standby NameNode can automatically take over
- The Standby NameNode does the work performed by the Secondary NameNode in "classic" HDFS
  - HA HDFS does not run a Secondary NameNode daemon
- Your system administrator will choose whether to set the cluster up with NameNode High Availability or not
NameNode High Availability in CDH4
-
- Although files are split into 64MB or 128MB blocks, if a file is smaller than this the full 64MB/128MB will not be used
- Blocks are stored as standard files on the DataNodes, in a set of directories specified in Hadoop's configuration files
  - This will be set by the system administrator
- Without the metadata on the NameNode, there is no way to access the files in the HDFS cluster
- When a client application wants to read a file:
  - It communicates with the NameNode to determine which blocks make up the file, and which DataNodes those blocks reside on
  - It then communicates directly with the DataNodes to read the data
  - The NameNode will not be a bottleneck
HDFS: Points to Note
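The two-step read path just described can be mimicked with ordinary dictionaries (a toy single-process simulation; the names `namenode_metadata`, `block_locations`, and `datanodes` are invented for illustration and are not Hadoop APIs):

```python
# Toy model of an HDFS read: one metadata lookup on the NameNode,
# then direct block fetches from the DataNodes.

# NameNode metadata: which blocks make up each file,
# and which DataNode holds each block.
namenode_metadata = {"Foo.txt": ["blk_001", "blk_002"]}
block_locations = {"blk_001": "node1", "blk_002": "node2"}

# DataNodes: the actual block contents live here, not on the NameNode.
datanodes = {
    "node1": {"blk_001": b"first block "},
    "node2": {"blk_002": b"second block"},
}

def read_file(name):
    # Step 1: ask the NameNode which blocks make up the file.
    blocks = namenode_metadata[name]
    # Step 2: fetch each block directly from the DataNode holding it;
    # the file data itself never passes through the NameNode.
    return b"".join(datanodes[block_locations[b]][b] for b in blocks)

print(read_file("Foo.txt"))  # b'first block second block'
```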
-
- Applications can read and write HDFS files directly via the Java API
  - Covered later in the course
- Typically, files are created on a local filesystem and must be moved into HDFS
- Likewise, files stored in HDFS may need to be moved to a machine's local filesystem
- Access to HDFS from the command line is achieved with the hadoop fs command
Accessing HDFS
-
- Copy file foo.txt from local disk to the user's directory in HDFS:

  hadoop fs -put foo.txt foo.txt

  This will copy the file to /user/username/foo.txt

- Get a directory listing of the user's home directory in HDFS:

  hadoop fs -ls

- Get a directory listing of the HDFS root directory:

  hadoop fs -ls /

hadoop fs Examples
-
- Display the contents of the HDFS file /user/fred/bar.txt:

  hadoop fs -cat /user/fred/bar.txt

- Copy that file to the local disk, named as baz.txt:

  hadoop fs -get /user/fred/bar.txt baz.txt

- Create a directory called input under the user's home directory:

  hadoop fs -mkdir input

Note: copyFromLocal is a synonym for put; copyToLocal is a synonym for get

hadoop fs Examples (cont'd)
-
- Delete the directory input_old and all its contents:

  hadoop fs -rm -r input_old

hadoop fs Examples (cont'd)
-
Chapter Topics

Introduction to Apache Hadoop and its Ecosystem: Hadoop: Basic Concepts
- The Hadoop project and Hadoop components
- The Hadoop Distributed File System (HDFS)
- Hands-On Exercise: Using HDFS (this section)
- How MapReduce works
- Hands-On Exercise: Running a MapReduce Job
- How a Hadoop cluster operates
- Other Hadoop ecosystem components
- Conclusion
-
- During this course, you will perform numerous Hands-On Exercises using the Cloudera Training Virtual Machine (VM)
- The VM has Hadoop installed in pseudo-distributed mode
  - This essentially means that it is a cluster comprised of a single node
  - Using a pseudo-distributed cluster is the typical way to test your code before you run it on your full cluster
  - It operates almost exactly like a "real" cluster
  - A key difference is that the data replication factor is set to 1, not 3
Aside: The Training Virtual Machine
-
- In this Hands-On Exercise you will gain familiarity with manipulating files in HDFS
- Please refer to the Hands-On Exercise Manual
Hands-On Exercise: Using HDFS
-
Chapter Topics

Introduction to Apache Hadoop and its Ecosystem: Hadoop: Basic Concepts
- The Hadoop project and Hadoop components
- The Hadoop Distributed File System (HDFS)
- Hands-On Exercise: Using HDFS
- How MapReduce works (this section)
- Hands-On Exercise: Running a MapReduce Job
- How a Hadoop cluster operates
- Other Hadoop ecosystem components
- Conclusion
-
- MapReduce is a method for distributing a task across multiple nodes
- Each node processes data stored on that node
  - Where possible
- Consists of two phases: Map and Reduce
What Is MapReduce?
-
- Automatic parallelization and distribution
- Fault-tolerance
- Status and monitoring tools
- A clean abstraction for programmers
  - MapReduce programs are usually written in Java
    - Can be written in any language using Hadoop Streaming (see later)
    - All of Hadoop is written in Java
- MapReduce abstracts all the "housekeeping" away from the developer
  - Developers can concentrate simply on writing the Map and Reduce functions
Features of MapReduce
-
MapReduce: The Big Picture
-
- MapReduce jobs are controlled by a software daemon known as the JobTracker
- The JobTracker resides on a "master node"
  - Clients submit MapReduce jobs to the JobTracker
  - The JobTracker assigns Map and Reduce tasks to other nodes on the cluster
  - These nodes each run a software daemon known as the TaskTracker
  - The TaskTracker is responsible for actually instantiating the Map or Reduce task, and reporting progress back to the JobTracker
MapReduce: The JobTracker
-
- CDH4 contains "standard" MapReduce (MR1)
- CDH4 also includes MapReduce version 2 (MR2)
  - Also known as YARN (Yet Another Resource Negotiator)
  - A complete rewrite of the Hadoop MapReduce framework
- MR2 is not yet considered production-ready
  - Included in CDH4 as a technology preview
- Existing code will work with no modification on MR2 clusters when the technology matures
  - Code will need to be recompiled, but the API remains identical
- For production use, we strongly recommend using MR1
Aside: MapReduce Version 2
-
- A job is a "full program"
  - A complete execution of Mappers and Reducers over a dataset
- A task is the execution of a single Mapper or Reducer over a slice of data
- A task attempt is a particular instance of an attempt to execute a task
  - There will be at least as many task attempts as there are tasks
  - If a task attempt fails, another will be started by the JobTracker
  - Speculative execution (see later) can also result in more task attempts than completed tasks
MapReduce: Terminology
-
- Hadoop attempts to ensure that Mappers run on nodes which hold their portion of the data locally, to avoid network traffic
  - Multiple Mappers run in parallel, each processing a portion of the input data
- The Mapper reads data in the form of key/value pairs
- It outputs zero or more key/value pairs (pseudo-code):

MapReduce: The Mapper

map(in_key, in_value) -> (inter_key, inter_value) list
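In plain Python, that signature might look like the following (a hand-rolled sketch using a generator, not the Hadoop API; `map_func` and the length-keyed example are invented for illustration):

```python
def map_func(in_key, in_value):
    # A Mapper maps one (in_key, in_value) pair to zero or more
    # (inter_key, inter_value) pairs -- here, one pair per token in
    # the value, keyed by the token's length.  The input key is
    # ignored, which is itself a common pattern.
    for token in in_value.split():
        yield (len(token), token)

# Each input pair independently produces its own intermediate pairs.
print(list(map_func(0, "see spot run")))
# [(3, 'see'), (4, 'spot'), (3, 'run')]
```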
-
- The Mapper may use or completely ignore the input key
  - For example, a standard pattern is to read a line of a file at a time
    - The key is the byte offset into the file at which the line starts
    - The value is the contents of the line itself
    - Typically the key is considered irrelevant
- If the Mapper writes anything out, the output must be in the form of key/value pairs
MapReduce: The Mapper (cont'd)
-
- Turn input into upper case (pseudo-code):

Example Mapper: Upper Case Mapper

let map(k, v) = emit(k.toUpper(), v.toUpper())

('foo', 'bar')       -> ('FOO', 'BAR')
('foo', 'other')     -> ('FOO', 'OTHER')
('baz', 'more data') -> ('BAZ', 'MORE DATA')
-
- Output each input character separately (pseudo-code):

Example Mapper: Explode Mapper

let map(k, v) = foreach char c in v: emit(k, c)

('foo', 'bar')   -> ('foo', 'b'), ('foo', 'a'), ('foo', 'r')
('baz', 'other') -> ('baz', 'o'), ('baz', 't'), ('baz', 'h'), ('baz', 'e'), ('baz', 'r')
-
- Only output key/value pairs where the input value is a prime number (pseudo-code):

Example Mapper: Filter Mapper

let map(k, v) = if (isPrime(v)) then emit(k, v)

('foo', 7)  -> ('foo', 7)
('baz', 10) -> nothing
-
- The key output by the Mapper does not need to be identical to the input key
- Output the word length as the key (pseudo-code):

Example Mapper: Changing Keyspaces

let map(k, v) = emit(v.length(), v)

('foo', 'bar')         -> (3, 'bar')
('baz', 'other')       -> (5, 'other')
('foo', 'abracadabra') -> (11, 'abracadabra')
-
! After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list
! This list is given to a Reducer
  There may be a single Reducer, or multiple Reducers
  This is specified as part of the job configuration (see later)
  All values associated with a particular intermediate key are guaranteed to go to the same Reducer
  The intermediate keys, and their value lists, are passed to the Reducer in sorted key order
  This step is known as the "shuffle and sort"
! The Reducer outputs zero or more final key/value pairs
  These are written to HDFS
  In practice, the Reducer usually emits a single key/value pair for each input key
MapReduce: The Reducer
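The grouping step described above can be sketched in plain Java (our own illustration, not Hadoop code): collect the Mappers' intermediate (key, value) pairs into per-key lists in sorted key order, which is exactly what the shuffle and sort delivers to the Reducers.

```java
import java.util.*;

public class ShuffleSortSketch {
    // Group intermediate (key, value) pairs by key, in sorted key order,
    // mimicking what the shuffle and sort phase delivers to the Reducers.
    public static SortedMap<String, List<Integer>> shuffleAndSort(
            List<Map.Entry<String, Integer>> mapperOutput) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapperOutput = List.of(
            Map.entry("the", 1), Map.entry("cat", 1), Map.entry("the", 1));
        System.out.println(shuffleAndSort(mapperOutput)); // {cat=[1], the=[1, 1]}
    }
}
```

In real Hadoop this grouping happens across machines, but the observable contract is the same: each key appears once, with all its values, in sorted order.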
-
! Add up all the values associated with each intermediate key (pseudo-code):
Example Reducer: Sum Reducer

let reduce(k, vals) =
    sum = 0
    foreach int i in vals:
        sum += i
    emit(k, sum)

('bar', [9, 3, -17, 44]) -> ('bar', 39)
('foo', [123, 100, 77]) -> ('foo', 300)
-
! The Identity Reducer is very common (pseudo-code):
Example Reducer: Identity Reducer

let reduce(k, vals) = foreach v in vals: emit(k, v)

('bar', [123, 100, 77]) -> ('bar', 123), ('bar', 100), ('bar', 77)
('foo', [9, 3, -17, 44]) -> ('foo', 9), ('foo', 3), ('foo', -17), ('foo', 44)
-
! Count the number of occurrences of each word in a large amount of input data
  This is the "hello world" of MapReduce programming
MapReduce Example: Word Count

map(String input_key, String input_value)
    foreach word w in input_value:
        emit(w, 1)

reduce(String output_key, Iterator intermediate_vals)
    set count = 0
    foreach v in intermediate_vals:
        count += v
    emit(output_key, count)
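The pseudo-code above can be mirrored as a self-contained Java sketch (class and method names are ours; no Hadoop involved). It collapses the map, shuffle, and reduce steps into one in-memory pass: emit (word, 1) for each word, and sum the 1s per key.

```java
import java.util.*;

public class WordCountSketch {
    // Simulate WordCount's map and reduce phases in memory.
    public static Map<String, Integer> wordCount(List<String> lines) {
        // TreeMap stands in for the shuffle and sort: keys come out sorted.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {   // map: split into words
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum); // reduce: sum the 1s
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("the cat sat on the mat",
                                     "the aardvark sat on the sofa");
        System.out.println(wordCount(input).get("the")); // 4
    }
}
```

A real MapReduce job distributes the map calls across input splits and the reduce calls across Reducers; the logic per key is what is shown here.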
-
! Input to the Mapper:
! Output from the Mapper:
MapReduce Example: Word Count (cont'd)

(3414, 'the cat sat on the mat')
(3437, 'the aardvark sat on the sofa')

('the', 1), ('cat', 1), ('sat', 1), ('on', 1), ('the', 1), ('mat', 1),
('the', 1), ('aardvark', 1), ('sat', 1), ('on', 1), ('the', 1), ('sofa', 1)
-
! Intermediate data sent to the Reducer:
('aardvark', [1]) ('cat', [1]) ('mat', [1]) ('on', [1, 1]) ('sat', [1, 1]) ('sofa', [1]) ('the', [1, 1, 1, 1])
! Final Reducer output:
('aardvark', 1) ('cat', 1) ('mat', 1) ('on', 2) ('sat', 2) ('sofa', 1) ('the', 4)
MapReduce Example: Word Count (cont'd)
-
! Whenever possible, Hadoop will attempt to ensure that a Map task on a node is working on a block of data stored locally on that node via HDFS
! If this is not possible, the Map task will have to transfer the data across the network as it processes that data
! Once the Map tasks have finished, data is then transferred across the network to the Reducers
  Although the Reducers may run on the same physical machines as the Map tasks, there is no concept of data locality for the Reducers
  All Mappers will, in general, have to communicate with all Reducers
MapReduce: Data Locality
-
! It appears that the shuffle and sort phase is a bottleneck
  The reduce method in the Reducers cannot start until all Mappers have finished
! In practice, Hadoop will start to transfer data from Mappers to Reducers as the Mappers finish work
  This mitigates against a huge amount of data transfer starting as soon as the last Mapper finishes
  Note that this behavior is configurable
  The developer can specify the percentage of Mappers which should finish before Reducers start retrieving data
  The developer's reduce method still does not start until all intermediate data has been transferred and sorted
MapReduce: Is Shuffle and Sort a Bottleneck?
-
! It is possible for one Map task to run more slowly than the others
  Perhaps due to faulty hardware, or just a very slow machine
! It would appear that this would create a bottleneck
  The reduce method in the Reducer cannot start until every Mapper has finished
! Hadoop uses speculative execution to mitigate against this
  If a Mapper appears to be running significantly more slowly than the others, a new instance of the Mapper will be started on another machine, operating on the same data
  The results of the first Mapper to finish will be used
  Hadoop will kill off the Mapper which is still running
MapReduce: Is a Slow Mapper a Bottleneck?
-
! Write the Mapper and Reducer classes
! Write a Driver class that configures the job and submits it to the cluster
  Driver classes are covered in the next chapter
! Compile the Mapper, Reducer, and Driver classes
  Example: javac -classpath `hadoop classpath` *.java
! Create a jar file with the Mapper, Reducer, and Driver classes
  Example: jar cvf foo.jar *.class
! Run the hadoop jar command to submit the job to the Hadoop cluster
  Example: hadoop jar foo.jar Foo in_dir out_dir
Creating and Running a MapReduce Job
-
Chapter Topics
Introduction to Apache Hadoop and its Ecosystem
Hadoop: Basic Concepts
! The Hadoop project and Hadoop components
! The Hadoop Distributed File System (HDFS)
! Hands-On Exercise: Using HDFS
! How MapReduce works
! Hands-On Exercise: Running a MapReduce Job
! How a Hadoop cluster operates
! Other Hadoop ecosystem components
! Conclusion
-
! In this Hands-On Exercise, you will run a MapReduce job on your pseudo-distributed Hadoop cluster
! Please refer to the Hands-On Exercise Manual
Hands-On Exercise: Running A MapReduce Job
-
Chapter Topics
Introduction to Apache Hadoop and its Ecosystem
Hadoop: Basic Concepts
! The Hadoop project and Hadoop components
! The Hadoop Distributed File System (HDFS)
! Hands-On Exercise: Using HDFS
! How MapReduce works
! Hands-On Exercise: Running a MapReduce Job
! How a Hadoop cluster operates
! Other Hadoop ecosystem components
! Conclusion
-
! Cluster installation is usually performed by the system administrator, and is outside the scope of this course
  Cloudera offers a training course for System Administrators specifically aimed at those responsible for commissioning and maintaining Hadoop clusters
! However, it's very useful to understand how the component parts of the Hadoop cluster work together
! Typically, a developer will configure their machine to run in pseudo-distributed mode
  This effectively creates a single-machine cluster
  All five Hadoop daemons are running on the same machine
  Very useful for testing code before it is deployed to the real cluster
Installing A Hadoop Cluster
-
! Easiest way to download and install Hadoop, either for a full cluster or in pseudo-distributed mode, is by using Cloudera's Distribution, including Apache Hadoop (CDH)
  Vanilla Hadoop plus many patches, backports, bugfixes
  Supplied as a Debian package (for Linux distributions such as Ubuntu), an RPM (for CentOS/RedHat Enterprise Linux), and as a tarball
  Full documentation available at http://cloudera.com/
Installing A Hadoop Cluster (cont'd)
-
! Hadoop is comprised of five separate daemons
! NameNode
  Holds the metadata for HDFS
! Secondary NameNode
  Performs housekeeping functions for the NameNode
  Is not a backup or hot standby for the NameNode!
! DataNode
  Stores actual HDFS data blocks
! JobTracker
  Manages MapReduce jobs, distributes individual tasks to machines running the TaskTracker
! TaskTracker
  Instantiates and monitors individual Map and Reduce tasks
The Five Hadoop Daemons
-
! Each daemon runs in its own Java Virtual Machine (JVM)
! No node on a real cluster will run all five daemons
  Although this is technically possible
! We can consider nodes to be in two different categories:
  Master Nodes
    Run the NameNode, Secondary NameNode, JobTracker daemons
    Only one of each of these daemons runs on the cluster
  Slave Nodes
    Run the DataNode and TaskTracker daemons
    A slave node will run both of these daemons
The Five Hadoop Daemons (cont'd)
-
! On very small clusters, the NameNode, JobTracker and Secondary NameNode daemons can all reside on a single machine
  It is typical to put them on separate machines as the cluster grows beyond 20-30 nodes
! Each daemon runs in a separate Java Virtual Machine (JVM)
Basic Cluster Configuration
-
! When a client submits a job, its configuration information is packaged into an XML file
! This file, along with the .jar file containing the actual program code, is handed to the JobTracker
  The JobTracker then parcels out individual tasks to TaskTracker nodes
  When a TaskTracker receives a request to run a task, it instantiates a separate JVM for that task
  TaskTracker nodes can be configured to run multiple tasks at the same time
    If the node has enough processing power and memory
Submitting A Job
-
! The intermediate data is held on the TaskTracker's local disk
! As Reducers start up, the intermediate data is distributed across the network to the Reducers
! Reducers write their final output to HDFS
! Once the job has completed, the TaskTracker can delete the intermediate data from its local disk
  Note that the intermediate data is not deleted until the entire job completes
Submitting A Job (cont'd)
-
Chapter Topics
Introduction to Apache Hadoop and its Ecosystem
Hadoop: Basic Concepts
! The Hadoop project and Hadoop components
! The Hadoop Distributed File System (HDFS)
! Hands-On Exercise: Using HDFS
! How MapReduce works
! Hands-On Exercise: Running a MapReduce Job
! How a Hadoop cluster operates
! Other Hadoop ecosystem components
! Conclusion
-
! The term "Hadoop core" refers to HDFS and MapReduce
! Many other projects exist which use Hadoop core
  Either both HDFS and MapReduce, or just HDFS
! Most are Apache projects or Apache Incubator projects
  Some others are not hosted by the Apache Software Foundation
  These are often hosted on GitHub or a similar repository
! We will investigate many of these projects later in the course
! Following is an introduction to some of the most significant projects
Other Ecosystem Projects: Introduction
-
! Hive is an abstraction on top of MapReduce
! Allows users to query data in the Hadoop cluster without knowing Java or MapReduce
! Uses the HiveQL language
  Very similar to SQL
! The Hive Interpreter runs on a client machine
  Turns HiveQL queries into MapReduce jobs
  Submits those jobs to the cluster
! Note: this does not turn the cluster into a relational database server!
  It is still simply running MapReduce jobs
  Those jobs are created by the Hive Interpreter
Hive
-
! Sample Hive query:
! We will investigate Hive in greater detail later in the course
Hive (cont'd)

SELECT stock.product, SUM(orders.purchases)
FROM stock JOIN orders
ON (stock.id = orders.stock_id)
WHERE orders.quarter = 'Q1'
GROUP BY stock.product;
-
! Pig is an alternative abstraction on top of MapReduce
! Uses a dataflow scripting language
  Called PigLatin
! The Pig interpreter runs on the client machine
  Takes the PigLatin script and turns it into a series of MapReduce jobs
  Submits those jobs to the cluster
! As with Hive, nothing "magical" happens on the cluster
  It is still simply running MapReduce jobs
Pig
-
! Sample Pig script:
! We will investigate Pig in more detail later in the course
Pig (cont'd)

stock = LOAD '/user/fred/stock' AS (id, item);
orders = LOAD '/user/fred/orders' AS (id, cost);
grpd = GROUP orders BY id;
totals = FOREACH grpd GENERATE group, SUM(orders.cost) AS t;
result = JOIN stock BY id, totals BY group;
DUMP result;
-
! Impala is an open-source project created by Cloudera
! Facilitates real-time queries of data in HDFS
! Does not use MapReduce
  Uses its own daemon, running on each slave node
  Queries data stored in HDFS
! Uses a language very similar to HiveQL
  But produces results much, much faster
  Typically between five and 40 times faster than Hive
! Currently in beta
  Although being used in production by multiple organizations
Impala
-
! Flume provides a method to import data into HDFS as it is generated
  Instead of batch-processing the data later
  For example, log files from a Web server
! Sqoop provides a method to import data from tables in a relational database into HDFS
  Does this very efficiently via a Map-only MapReduce job
  Can also go the other way
    Populate database tables from files in HDFS
! We will investigate Flume and Sqoop later in the course
Flume and Sqoop
-
! Oozie allows developers to create a workflow of MapReduce jobs
  Including dependencies between jobs
! The Oozie server submits the jobs to the cluster in the correct sequence
! We will investigate Oozie later in the course
Oozie
-
! HBase is "the Hadoop database"
! A "NoSQL" datastore
! Can store massive amounts of data
  Gigabytes, terabytes, and even petabytes of data in a table
! Scales to provide very high write throughput
  Hundreds of thousands of inserts per second
! Copes well with sparse data
  Tables can have many thousands of columns
  Even if most columns are empty for any given row
! Has a very constrained access model
  Insert a row, retrieve a row, do a full or partial table scan
  Only one column (the "row key") is indexed
HBase
-
HBase vs Traditional RDBMSs

                                RDBMS                          HBase
Data layout                     Row-oriented                   Column-oriented
Transactions                    Yes                            Single row only
Query language                  SQL                            get/put/scan
Security                        Authentication/Authorization   Kerberos
Indexes                         On arbitrary columns           Row-key only
Max data size                   TBs                            PB+
Read/write throughput limits    1000s of queries/second        Millions of queries/second
-
Chapter Topics
Introduction to Apache Hadoop and its Ecosystem
Hadoop: Basic Concepts
! The Hadoop project and Hadoop components
! The Hadoop Distributed File System (HDFS)
! Hands-On Exercise: Using HDFS
! How MapReduce works
! Hands-On Exercise: Running a MapReduce Job
! How a Hadoop cluster operates
! Other Hadoop ecosystem components
! Conclusion
-
In this chapter you have learned
! What Hadoop is
! What features the Hadoop Distributed File System (HDFS) provides
! The concepts behind MapReduce
! How a Hadoop cluster operates
! What other Hadoop Ecosystem projects exist
Conclusion
-
Writing a MapReduce Program
Chapter 4
-
Course Chapters
Course Introduction
! Introduction
The Hadoop Ecosystem
! The Motivation for Hadoop
! Hadoop: Basic Concepts
Introduction to Apache Hadoop and its Ecosystem
Basic Programming with the Hadoop Core API
! Writing a MapReduce Program
! Unit Testing MapReduce Programs
! Delving Deeper into the Hadoop API
! Practical Development Tips and Techniques
! Data Input and Output
Problem Solving with MapReduce
! Common MapReduce Algorithms
! Joining Data Sets in MapReduce Jobs
! Integrating Hadoop into the Enterprise Workflow
! Machine Learning and Mahout
! An Introduction to Hive and Pig
! An Introduction to Oozie
Course Conclusion and Appendices
! Conclusion
! Appendix: Cloudera Enterprise
! Appendix: Graph Manipulation in MapReduce
-
In this chapter you will learn
! The MapReduce flow
! Basic MapReduce API concepts
! How to write MapReduce drivers, Mappers, and Reducers in Java
! How to write Mappers and Reducers in other languages using the Streaming API
! How to speed up your Hadoop development by using Eclipse
! The differences between the old and new MapReduce APIs
Writing a MapReduce Program
-
Chapter Topics
Basic Programming with the Hadoop Core API
Writing a MapReduce Program
! The MapReduce flow
! Basic MapReduce API concepts
! Writing MapReduce applications in Java: the driver, the Mapper, the Reducer
! Writing Mappers and Reducers in other languages with the Streaming API
! Speeding up Hadoop development by using Eclipse
! Hands-On Exercise: Writing a MapReduce Program
! Differences between the Old and New MapReduce APIs
! Conclusion
-
! In the previous chapter, you ran a sample MapReduce program
  WordCount, which counted the number of occurrences of each unique word in a set of files
! In this chapter, we will examine the code for WordCount
  This will demonstrate the Hadoop API
! We will also investigate Hadoop Streaming
  Allows you to write MapReduce programs in (virtually) any language
A Sample MapReduce Program: Introduction
-
! On the following slides we show the MapReduce flow
! Each of the portions (RecordReader, Mapper, Partitioner, Reducer, etc.) can be created by the developer
! We will cover each of these as we move through the course
! You will always create at least a Mapper, Reducer, and driver code
  Those are the portions we will investigate in this chapter
The MapReduce Flow: Introduction
-
The MapReduce Flow: The Mapper
-
The MapReduce Flow: Shuffle and Sort
-
The MapReduce Flow: Reducers to Outputs
-
Chapter Topics
Basic Programming with the Hadoop Core API
Writing a MapReduce Program
! The MapReduce flow
! Basic MapReduce API concepts
! Writing MapReduce applications in Java: the driver, the Mapper, the Reducer
! Writing Mappers and Reducers in other languages with the Streaming API
! Speeding up Hadoop development by using Eclipse
! Hands-On Exercise: Writing a MapReduce Program
! Differences between the Old and New MapReduce APIs
! Conclusion
-
! To investigate the API, we will dissect the WordCount program you ran in the previous chapter
! This consists of three portions
  The driver code
    Code that runs on the client to configure and submit the job
  The Mapper
  The Reducer
! Before we look at the code, we need to cover some basic Hadoop API concepts
Our MapReduce Program: WordCount
-
! The data passed to the Mapper is specified by an InputFormat
  Specified in the driver code
  Defines the location of the input data
    A file or directory, for example
  Determines how to split the input data into input splits
    Each Mapper deals with a single input split
  InputFormat is a factory for RecordReader objects to extract (key, value) records from the input source
Getting Data to the Mapper
-
Getting Data to the Mapper (cont'd)
-
! FileInputFormat
  The base class used for all file-based InputFormats
! TextInputFormat
  The default
  Treats each \n-terminated line of a file as a value
  Key is the byte offset within the file of that line
! KeyValueTextInputFormat
  Maps \n-terminated lines as 'key SEP value'
  By default, separator is a tab
! SequenceFileInputFormat
  Binary file of (key, value) pairs with some additional metadata
! SequenceFileAsTextInputFormat
  Similar, but maps (key.toString(), value.toString())
Some Standard InputFormats
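To illustrate how TextInputFormat keys its records, here is a plain-Java sketch (our own code, not Hadoop's; it assumes single-byte ASCII text) that pairs each line with the byte offset at which it starts:

```java
import java.util.*;

public class ByteOffsetSketch {
    // Produce (byte offset, line) records the way TextInputFormat keys its
    // input: the key is the offset of the first byte of each line.
    public static Map<Long, String> recordsOf(String fileContents) {
        // LinkedHashMap preserves the order the records are produced in
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            records.put(offset, line);
            offset += line.length() + 1; // +1 for the '\n'; assumes ASCII
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(recordsOf("the cat\nsat on\n")); // {0=the cat, 8=sat on}
    }
}
```

This is why the Word Count example earlier showed Mapper input keys like 3414 and 3437: they are byte offsets, and are usually ignored.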
-
! Keys and values in Hadoop are Objects
! Values are objects which implement Writable
! Keys are objects which implement WritableComparable
Keys and Values are Objects
-
! Hadoop defines its own "box" classes for strings, integers and so on
  IntWritable for ints
  LongWritable for longs
  FloatWritable for floats
  DoubleWritable for doubles
  Text for strings
  Etc.
! The Writable interface makes serialization quick and easy for Hadoop
! Any value's type must implement the Writable interface
What is Writable?
-
! A WritableComparable is a Writable which is also Comparable
  Two WritableComparables can be compared against each other to determine their "order"
  Keys must be WritableComparables because they are passed to the Reducer in sorted order
  We will talk more about WritableComparables later
! Note that despite their names, all Hadoop box classes implement both Writable and WritableComparable
  For example, IntWritable is actually a WritableComparable
What is WritableComparable?
-
Chapter Topics
Basic Programming with the Hadoop Core API
Writing a MapReduce Program
! The MapReduce flow
! Basic MapReduce API concepts
! Writing MapReduce applications in Java: the driver, the Mapper, the Reducer
! Writing Mappers and Reducers in other languages with the Streaming API
! Speeding up Hadoop development by using Eclipse
! Hands-On Exercise: Writing a MapReduce Program
! Differences between the Old and New MapReduce APIs
! Conclusion
-
! The driver code runs on the client machine
! It configures the job, then submits it to the cluster
The Driver Code: Introduction
-
The Driver: Complete Code

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class WordCount {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.printf("Usage: WordCount <input dir> <output dir>\n");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);
-
The Driver: Complete Code (cont'd)

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}
-
The Driver: Import Statements

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

You will typically import these classes into every MapReduce job you write. We will omit the import statements in future slides for brevity.
-
The Driver: Main Code

public class WordCount {

  public static void main(String[] args) throws Exception {
    // (body as shown on the previous slides)
  }
}
-
The Driver Class: main Method

public static void main(String[] args) throws Exception {

The main method accepts two command-line arguments: the input and output directories.
-
Sanity Checking The Job's Invocation

if (args.length != 2) {
  System.out.printf("Usage: WordCount <input dir> <output dir>\n");
  System.exit(-1);
}

The first step is to ensure that we have been given two command-line arguments. If not, print a help message and exit.
-
Configuring The Job With the Job Object

Job job = new Job();
job.setJarByClass(WordCount.class);

To configure the job, create a new Job object and specify the class which will be called to run the job.
-
! The Job class allows you to set configuration options for your MapReduce job
  The classes to be used for your Mapper and Reducer
  The input and output directories
  Many other options
! Any options not explicitly set in your driver code will be read from your Hadoop configuration files
  Usually located in /etc/hadoop/conf
! Any options not specified in your configuration files will receive Hadoop's default values
! You can also use the Job object to submit the job, control its execution, and query its state
Creating a New Job Object
-
Naming The Job

job.setJobName("Word Count");

Give the job a meaningful name.
-
04-29
Specifying Input and Output Directories

public class WordCount {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.printf("Usage: WordCount <input dir> <output dir>\n");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}

Next, specify the input directory from which data will be read, and the output directory to which final output will be written.
-
04-30
! The default InputFormat (TextInputFormat) will be used unless you specify otherwise
! To use an InputFormat other than the default, use e.g. job.setInputFormatClass(KeyValueTextInputFormat.class)
Specifying the InputFormat
-
04-31
! By default, FileInputFormat.setInputPaths() will read all files from a specified directory and send them to Mappers
– Exceptions: items whose names begin with a period (.) or underscore (_)
– Globs can be specified to restrict input
– For example, /2010/*/01/*
! Alternatively, FileInputFormat.addInputPath() can be called multiple times, specifying a single file or directory each time
! More advanced filtering can be performed by implementing a PathFilter
– Interface with a method named accept
– accept takes a path to a file and returns true or false depending on whether or not the file should be processed
Determining Which Files To Read
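The glob and accept-style filtering above can be sketched in plain Java using java.nio glob matching, which shares the key property of Hadoop globs that a single * matches within one path segment only. The accept helper below is a hypothetical stand-in for a PathFilter's accept method, not part of the course code:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobFilterDemo {

    // Hypothetical stand-in for a PathFilter's accept() method:
    // returns true if the path matches the glob, meaning the file
    // would be sent to a Mapper.
    static boolean accept(String glob, String path) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        // '*' matches within a single path segment, so this selects
        // any file from the 01 subdirectory of any month of 2010.
        System.out.println(accept("2010/*/01/*", "2010/05/01/data.log"));
    }
}
```

In a real driver you would instead implement org.apache.hadoop.fs.PathFilter and register it with FileInputFormat.setInputPathFilter().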
-
04-32
! FileOutputFormat.setOutputPath() specifies the directory to which the Reducers will write their final output
! The driver can also specify the format of the output data
– Default is a plain text file
– Could be explicitly written as job.setOutputFormatClass(TextOutputFormat.class)
! We will discuss OutputFormats in more depth in a later chapter
Specifying Final Output With OutputFormat
-
04-33
Specify The Classes for Mapper and Reducer

public class WordCount {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.printf("Usage: WordCount <input dir> <output dir>\n");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}

Give the Job object information about which classes are to be instantiated as the Mapper and Reducer.
-
04-34
Specify The Intermediate Data Types

public class WordCount {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.printf("Usage: WordCount <input dir> <output dir>\n");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}

Specify the types for the intermediate output key and value produced by the Mapper.
-
04-35
Specify The Final Output Data Types

public class WordCount {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.printf("Usage: WordCount <input dir> <output dir>\n");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}

Specify the types for the Reducer's output key and value.
-
04-36
Running The Job

public class WordCount {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.printf("Usage: WordCount <input dir> <output dir>\n");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}

Start the job, wait for it to complete, and exit with a return code.
-
04-37
! There are two ways to run your MapReduce job
– job.waitForCompletion() blocks (waits for the job to complete before continuing)
– job.submit() does not block (driver code continues as the job is running)
! The job determines the proper division of input data into InputSplits, and then sends the job information to the JobTracker daemon on the cluster
Running The Job (cont'd)
-
04-38
Reprise: Driver Code

public class WordCount {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.printf("Usage: WordCount <input dir> <output dir>\n");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}
-
04-39
Chapter Topics

Basic Programming with the Hadoop Core API
Writing a MapReduce Program

! The MapReduce flow
! Basic MapReduce API concepts
! Writing MapReduce applications in Java
– The driver
– The Mapper
– The Reducer
! Writing Mappers and Reducers in other languages with the Streaming API
! Speeding up Hadoop development by using Eclipse
! Hands-On Exercise: Writing a MapReduce Program
! Differences between the Old and New MapReduce APIs
! Conclusion
-
04-40
The Mapper: Complete Code

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    for (String word : line.split("\\W+")) {
      if (word.length() > 0) {
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}
-
04-41
The Mapper: import Statements

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    for (String word : line.split("\\W+")) {
      if (word.length() > 0) {
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}

You will typically import java.io.IOException, and the org.apache.hadoop classes shown, in every Mapper you write. We will omit the import statements in future slides for brevity.
-
04-42
The Mapper: Main Code

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    for (String word : line.split("\\W+")) {
      if (word.length() > 0) {
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}
-
04-43
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    for (String word : line.split("\\W+")) {
      if (word.length() > 0) {
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}

The Mapper: Main Code (cont'd)

Your Mapper class should extend the Mapper class. The Mapper class expects four generics, which define the types of the input and output key/value pairs. The first two parameters define the input key and value types, the second two define the output key and value types.
-
04-44
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    for (String word : line.split("\\W+")) {
      if (word.length() > 0) {
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}

The map Method

The map method's signature looks like this. It will be passed a key, a value, and a Context object. The Context is used to write the intermediate data. It also contains information about the job's configuration (see later).
-
04-45
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    for (String word : line.split("\\W+")) {
      if (word.length() > 0) {
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}

The map Method: Processing The Line

value is a Text object, so we retrieve the string it contains.
-
04-46
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    for (String word : line.split("\\W+")) {
      if (word.length() > 0) {
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}

The map Method: Processing The Line (cont'd)

We split the string up into words using a regular expression with non-alphanumeric characters as the delimiter, and then loop through the words.
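The split-and-filter loop can be tried outside Hadoop as ordinary Java. The tokenize helper below is an illustrative name, not part of the course code; it shows why the length check matters: a line that starts with punctuation or whitespace yields a leading empty token from split("\\W+"):

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizeDemo {

    // Reproduces the Mapper's loop: split on runs of non-word
    // characters ("\\W+") and drop any empty token, such as the one
    // produced when a line begins with punctuation or whitespace.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String word : line.split("\\W+")) {
            if (word.length() > 0) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("  The cat sat, then ran!"));
    }
}
```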
-
04-47
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    for (String word : line.split("\\W+")) {
      if (word.length() > 0) {
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}

Outputting Intermediate Data

To emit a (key, value) pair, we call the write method of our Context object. The key will be the word itself, the value will be the number 1. Recall that the output key must be a WritableComparable, and the value must be a Writable.
-
04-48
Reprise: The Map Method

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    for (String word : line.split("\\W+")) {
      if (word.length() > 0) {
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}
-
04-49
Chapter Topics

Basic Programming with the Hadoop Core API
Writing a MapReduce Program

! The MapReduce flow
! Basic MapReduce API concepts
! Writing MapReduce applications in Java
– The driver
– The Mapper
– The Reducer
! Writing Mappers and Reducers in other languages with the Streaming API
! Speeding up Hadoop development by using Eclipse
! Hands-On Exercise: Writing a MapReduce Program
! Differences between the Old and New MapReduce APIs
! Conclusion
-
04-50
The Reducer: Complete Code

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {

    int wordCount = 0;
    for (IntWritable value : values) {
      wordCount += value.get();
    }
    context.write(key, new IntWritable(wordCount));
  }
}
-
04-51
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {

    int wordCount = 0;
    for (IntWritable value : values) {
      wordCount += value.get();
    }
    context.write(key, new IntWritable(wordCount));
  }
}

The Reducer: Import Statements

As with the Mapper, you will typically import java.io.IOException, and the org.apache.hadoop classes shown, in every Reducer you write. We will omit the import statements in future slides for brevity.
-
04-52
The Reducer: Main Code

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {

    int wordCount = 0;
    for (IntWritable value : values) {
      wordCount += value.get();
    }
    context.write(key, new IntWritable(wordCount));
  }
}
-
04-53
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {

    int wordCount = 0;
    for (IntWritable value : values) {
      wordCount += value.get();
    }
    context.write(key, new IntWritable(wordCount));
  }
}

The Reducer: Main Code (cont'd)

Your Reducer class should extend Reducer. The Reducer class expects four generics, which define the types of the input and output key/value pairs. The first two parameters define the intermediate key and value types, the second two define the final output key and value types. The keys are WritableComparables, the values are Writables.
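Stripped of the Hadoop types, the reduce logic is an ordinary summation over the values for one key. The sketch below (SumDemo and sum are illustrative names, not part of the course code) mirrors the loop using plain Integers in place of IntWritables:

```java
import java.util.List;

public class SumDemo {

    // The heart of SumReducer: add up the values (each a 1 emitted
    // by the Mapper) that arrive for a single key.
    static int sum(Iterable<Integer> values) {
        int wordCount = 0;
        for (int value : values) {
            wordCount += value;
        }
        return wordCount;
    }

    public static void main(String[] args) {
        // For key "cat" with intermediate values (1, 1, 1),
        // the Reducer would emit ("cat", 3).
        System.out.println(sum(List.of(1, 1, 1)));
    }
}
```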
-
04-54
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {

    int wordCount