mapreduce - cornell university
TRANSCRIPT
![Page 1: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/1.jpg)
MapReduce SimplifiedDataProcessingonLargeClusters
(WithouttheAgonizingPain)
PresentedbyAaronNathan
![Page 2: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/2.jpg)
TheProblem
• Massiveamountsofdata– >100TB(theinternet)– Needssimpleprocessing
• Computersaren’tperfect– Slow– Unreliable– Misconfigured
• Requirescomplex(i.e.bugprone)code
![Page 3: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/3.jpg)
MapReducetotheRescue!
• CommonFuncKonalProgrammingModel– MapStep
map (in_key, in_value) -> list(out_key, intermediate_value) • Splitaproblemintoalotofsmallersubproblems
– ReduceStepreduce (out_key, list(intermediate_value)) -> list(out_value) • Combinetheoutputsofthesubproblemstogivetheoriginalproblem’sanswer
• Each“funcKon”isindependent• HighlyParallelizable
![Page 4: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/4.jpg)
Answer!Answer!
AlgorithmPicture
MAP
REDUCE
Answer!
MAPMAP MAP MAP
DATA
K1:vK1:vK2:v K2:v K1:v K2:vK3:v K3:v
K1:v,v,v K2:v,v,v K3:v,v,v
REDUCE REDUCE
aggregator
![Page 5: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/5.jpg)
SomeExampleCodemap(String input_key, String input_value): // input_key: document name
// input_value: document contents
for each word w in input_value: EmitIntermediate(w, "1");
reduce(String output_key, Iterator intermediate_values): // output_key: a word
// output_values: a list of counts int result = 0; for each v in intermediate_values:
result += ParseInt(v);
Emit(AsString(result));
![Page 6: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/6.jpg)
SomeExampleApplicaKons
• DistributedGrep• URLAccessFrequencyCounter• ReverseWebLinkGraph
• Term‐VectorperHost
• DistributedSort• InvertedIndex
![Page 7: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/7.jpg)
TheImplementaKon
• GoogleClusters– 100s‐1000sDualCorex86CommodityMachines
– CommodityNetworking(100mbps/1Gbps)– GFS
• GoogleJobScheduler• Librarylinkedinc++
![Page 8: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/8.jpg)
ExecuKon
![Page 9: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/9.jpg)
TheMaster
• MaintainsthestateandidenKfyofallworkers• Managesintermediatevalues
• ReceivessignalsfromMapworkersuponcompleKon
• BroadcastssignalstoReduceworkersastheywork
• CanretaskcompletedMapworkerstoReduceworkers.
![Page 10: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/10.jpg)
InCaseofFailure
• PeriodicPingsfromMaster‐>Workers– Onfailureresetsstateofassignedtaskofdeadworker
• Simplesystemprovesresilient– Worksincaseofa80simultaneousmachinefailures!
• Masterfailureisunhandled.• WorkerFailuredoesn’teffectoutput
(outputidenKcalwhetherfailureoccursornot)– Eachmapwritestolocaldiskonly– Ifamapperislost,thedataisjustreprocessed– Non‐determinisKcmapfuncKonsaren’tguaranteed
![Page 11: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/11.jpg)
PreservingBandwidth
• Machinesareinrackswithsmallinterconnects– UselocaKoninformaKonfromGFS
– Ahemptstoputtasksforworkersandinputslicesonthesamerack
– UsuallyresultsinLOCALreads!
![Page 12: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/12.jpg)
BackupExecuKonTasks
• Whatifonemachineisslow?• CandelaythecompleKonoftheenKreMROperaKon!
• Answer:Backup(Redundant)ExecuKons– Whoeverfinishesfirstcompletesthetask!
– Enabledtowardstheendofprocessing
![Page 13: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/13.jpg)
ParKKoning
• M=numberofMapTasks(thenumberofinputsplits)
• R=numberofReduceTasks(thenumberofintermediatekeysplits)
• W=numberofworkercomputers• InGeneral:
– M=sizeof(Input)/64MB– R=W*n(wherenisasmallnumber)
• TypicalScenario:InputSize=12TB,M=200,000,R=5000W=2000
![Page 14: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/14.jpg)
CustomParKKoning
• DefaultParKKonedonintermediatekey– Hash(intermediate_key)modR
• Whatifuserhasaprioriknowledgeaboutthekey?– Allowforuser‐definedhashingfuncKon– Ex.Hash(Hostname(url_key))
![Page 15: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/15.jpg)
TheCombiner
• IfreducerisassociaKveandcommuniviKve– (2+5)+4=11or2+(5+4)=11– (15+x)+2=2+(15+x)
• Repeatedintermediatekeyscanbemerged– Savesnetworkbandwidth– EssenKallylikealocalreducetask
![Page 16: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/16.jpg)
I/OAbstracKons
• HowtogetiniKalkeyvaluepairstomap?– Defineaninput“format”
– Makesuresplitsoccurinreasonableplaces– Ex:Text
• Eachlineisakey/pair• CancomefromGFS,bigTable,oranywherereally!
– Outputworksanalogously
![Page 17: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/17.jpg)
SkippingBadRecords
• Whatifausermakesamistakeinmap/reduce• Andonlyapparentonfewjobs..
• WorkersendsmessagetoMaster
• Skiprecordon>1workerfailureandtellotherstoignorethisrecord
![Page 18: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/18.jpg)
RemovingUnnecessaryDevelopmentPain
• LocalMapReduceImplementaKonthatrunsondevelopmentmachine
• MasterhasHTTPpagewithstatusofenKreoperaKon– Shows“badrecords”
• ProvideaCounterFacility– Masteraggregates“counts”anddisplayedonMasterHTTPpage
![Page 19: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/19.jpg)
AlookattheUI(in1994)
h6p://labs.google.com/papers/mapreduce‐osdi04‐slides/index‐auto‐0013.html
![Page 20: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/20.jpg)
PerformanceBenchmarks
SorBng AND Searching
![Page 21: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/21.jpg)
Search(Grep)
• Scanthrough1010100byterecords(1TB)
• M=15000,R=1
• StartupKme– GFSLocalizaKon– ProgramPropagaKon
• Peak‐>30GB/sec!
![Page 22: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/22.jpg)
Sort
• 50linesofcode• Map‐>key+textline
• Reduce‐>IdenKty• M=15000,R=4000
– ParKKononinitbytesofintermediatekey
• Sortsin891sec!
![Page 23: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/23.jpg)
WhataboutBackupTasks?
![Page 24: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/24.jpg)
Andwait…it’suseful!
NB:August2004
![Page 25: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/25.jpg)
OpenSourceImplementaKon
• Hadoop• hhp://hadoop.apache.org/core/
– ReliesonHDFS– AllinterfaceslookalmostexactlylikeMapReducepaper
• Thereisevenatalkaboutittoday!– 4:15B17CSColloquium:MikeCafarella(Uwash)
![Page 26: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/26.jpg)
AcBveDisksforLarge‐ScaleDataProcessing
![Page 27: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/27.jpg)
TheConcept
• Useaggregateprocessingpower– Networkeddisksallowforhigherthroughput
• WhynotmovepartoftheapplicaKonontothediskdevice?– Reducedatatraffic
– Increaseparallelismfurther
![Page 28: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/28.jpg)
ShrinkingSupportHardware…
![Page 29: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/29.jpg)
ExampleApplicaKons
• MediaDatabase– Findsimilarmediadataby“fingerprint”
• RealTimeApplicaKons– CollectmulKplesensordataquickly
• DataMining– POSAnalysisrequiredadhocdatabasequeries
![Page 30: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/30.jpg)
Approach
• Leveragetheparallelismavailableinsystemswithmanydisks
• Operatewithasmallamountofstate,processingdataasitstreamsoffthedisk
• ExecuterelaKvelyfewinstrucKonsperbyteofdata
![Page 31: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/31.jpg)
Results‐NearestNeighborSearch
• Problem:DeterminekitemsclosesttoaparKculariteminadatabase– Performcomparisonsonthedrive– Returnsthedisksclosestmatches– Serverdoesfinalmerge
![Page 32: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/32.jpg)
MediaMiningExample
• Performlowlevelimagetasksonthedisk!
• EdgeDetecKonperformedondisk– Senttoserverasedgeimage
– Serverdoeshigherlevelprocessing
![Page 33: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/33.jpg)
WhynotjustuseabunchofPC’s?
• Theperformanceinceaseissimilar• Infact,thepaperessenKallyusedthissetuptoactuallybenchmarktheirresults!
• Supposedlythiscouldbecheaper• Thepaperdoesn’treallygiveagoodargumentforthis…– PossiblyreducedbandwidthondiskIOchannel– Butwhocares?
![Page 34: MapReduce - Cornell University](https://reader030.vdocuments.us/reader030/viewer/2022012523/61970259bfca16100f2f22b3/html5/thumbnails/34.jpg)
SomeQuesKons
• Whatcouldadiskpossiblydobeherthanthehostprocessor?
• WhataddedcostisassociatedwiththismediocreprocessorontheHDD?
• Arenewdependenciesareintroducedonhardwareandsosware?
• Perhapsother(beher)placestodothistypeoflocalparallelprocessing?
• Maybein2001thismademoresense?