csinparallel.org
Using Map-Reduce to Teach Parallel Programming Concepts
Dick Brown, St. Olaf College
Libby Shoop, Macalester College
Joel Adams, Calvin College

Workshop site
CSinParallel.org -> Workshops -> WMR Workshop
See also workshop handout

Introductory comments
– Role of undergraduate researchers: There would be no workshop without them!
– Thanks to Amazon Web Services for providing credits to host our WMR instance
– Disclaimer: We are not proposing map-reduce as the only approach to introducing parallelism, concurrency
– A value: Honor thy neighbor's curricular approach

Goals
– Introduce map-reduce computing, using the WebMapReduce (WMR) simplified interface to Hadoop
• Why use map-reduce in the curriculum?
– Hands-on exercises with WMR for foundation courses

Goals
Part 1 – Introduction
– Map-reduce computing, and the WebMapReduce (WMR) simplified interface to Hadoop
– Hands-on exercises with WMR for foundation courses
Part 2 – Teaching with WMR
– Why use map-reduce in the curriculum?
– Use of WMR for intermediate and advanced courses
– Hands-on exercises for more advanced use
Part 3 (optional) – What's under the hood?

Sneak Preview: Materials available
(In case you already know your map-reduce…)
• CSinParallel module: Map-reduce Computing for Introductory Students using WebMapReduce
– See csinparallel.org

Part 1: Introduction to Map-Reduce Computing and WMR

Introduction to Map-Reduce Computing

History
– The computational model of using map and reduce operations was developed decades ago, for LISP
– Google developed the MapReduce system for its search engine; published (Dean and Ghemawat, 2004)
– Hadoop, an open-source implementation (under Apache), was developed largely at Yahoo!; Java mappers and reducers

Map-Reduce: The 2-minute overview
What if you wanted to count the frequencies of all words in 1,000,000 books?
1. Break up the lines of text: generate one labelled piece per word
• Use that word as label; value 1 for each piece
2. Group the pieces according to label (word)
3. Add up the 1's in each group

Map-Reduce Concept

The map-reduce computational model
• Map-reduce is a two-stage process with a "shuffle twist" between the stages.
• Stages are controlled by functions: mapper(), reducer()

The map-reduce computational model
• mapper() function:
– Argument is one line of input from a file
– Produces (key, value) pairs
• Example: word-count mapper()
"the cat in the hat" --> [mapper for this line]
("the","1"), ("cat","1"), ("in","1"), ("the","1"), ("hat","1")
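The word-count mapper above can be sketched in plain Python. This is only an illustration of the idea: the function name and the use of `yield` are assumptions, and WMR's own per-language mapper interface is not shown here.

```python
def mapper(line):
    """Word-count mapper: yield one (word, "1") pair per word in a line."""
    for word in line.split():
        yield (word, "1")

pairs = list(mapper("the cat in the hat"))
print(pairs)
```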

The map-reduce computational model
• Shuffle stage:
– Group together all mappers' (key, value) pairs that have the same key, and feed each group to its own call of reducer()
– Input: all (key, value) pairs from all mappers
– Output: those pairs rearranged, sent to calls of reducer() according to key
• Note: Shuffle also sorts (optimization)
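A sequential sketch of what the shuffle stage does, assuming all mapper output fits in one in-memory list (a real system does this across machines). The function name `shuffle` is illustrative only.

```python
from itertools import groupby
from operator import itemgetter

def shuffle(pairs):
    """Sort (key, value) pairs by key, then group the values for each key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]

mapped = [("the", "1"), ("cat", "1"), ("in", "1"), ("the", "1"), ("hat", "1")]
grouped = shuffle(mapped)
print(grouped)
```

Each (key, values) entry in the result corresponds to one call of reducer().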

The map-reduce computational model
• reducer() function:
– Receives all key-value pairs for one key
– Produces an aggregate result
• Example: word-count reducer()
("the","1"), ("the","1") --> [reducer for "the"]
("the","2")
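A matching plain-Python sketch of the word-count reducer. The signature (a key plus a list of string values) is an assumption made for illustration, not WMR's actual interface.

```python
def reducer(key, values):
    """Word-count reducer: add up the string counts received for one key."""
    total = sum(int(v) for v in values)
    return (key, str(total))

result = reducer("the", ["1", "1"])
print(result)
```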

The map-reduce computational model
– In map-reduce, a programmer codes only two functions (plus config information)
• A model for future parallel-programming frameworks
– Underlying map-reduce system reuses code for
• Partitioning the data into chunks and lines
• Running mappers/reducers where the chunks are local
• Moving data between mappers and reducers
• Auto-recovering from any crashes that may occur
• ...
– Optimized, Distributed, Fault-tolerant, Scalable
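To make the division of labor concrete, here is a minimal sequential stand-in for the framework. `run_map_reduce`, `wc_mapper`, and `wc_reducer` are hypothetical names; the sketch does none of the partitioning, locality, or crash recovery listed above, and only shows that the problem-specific code is two small functions.

```python
from collections import defaultdict

def run_map_reduce(lines, mapper, reducer):
    """Sequential stand-in for the framework: map, shuffle, then reduce."""
    groups = defaultdict(list)
    for line in lines:                      # map stage: one call per input line
        for key, value in mapper(line):
            groups[key].append(value)       # shuffle: collect values by key
    return [reducer(key, groups[key]) for key in sorted(groups)]  # reduce stage

# The only problem-specific code: two small functions.
def wc_mapper(line):
    return [(word, 1) for word in line.split()]

def wc_reducer(key, values):
    return (key, sum(values))

counts = run_map_reduce(["the cat in the hat", "the hat"], wc_mapper, wc_reducer)
print(counts)
```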

The map-reduce computational model
• Optimized, Distributed, Fault-tolerant, Scalable
[Diagram labels: 1. mappers, 2. shuffle, 3. reducers; Local I/O; Global I/O]

Demo of WMR
cumulus.cs.stolaf.edu/wmr
Intro module

Materials available
• CSinParallel module: Map-reduce Computing for Introductory Students using WebMapReduce
– See csinparallel.org

Overview of suggested exercises
Available on the csinparallel.org site
– Run word count (provided), with small and large data
– Modify, run variations on word count: strip punctuation; case insensitive; etc.
– Alternative exercises
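One possible shape for the "strip punctuation; case insensitive" variation, sketched in plain Python. The name `normalizing_mapper` and the choice to strip punctuation only at the edges of each token are assumptions for illustration.

```python
import string

def normalizing_mapper(line):
    """Word-count mapper variant: lowercase words and strip edge punctuation."""
    for token in line.split():
        word = token.strip(string.punctuation).lower()
        if word:                      # skip tokens that were only punctuation
            yield (word, "1")

pairs = list(normalizing_mapper('The cat, the "Hat" -- and THE end.'))
print(pairs)
```

The reducer is unchanged; only the keys the mapper emits are different.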

Additional exercises
Beyond your first simple exercises, consider exploring the following:
• Various data sets
– Note: Please avoid large Gutenberg "groups" for this workshop
• Extended set of exercises for CS1 (text analysis)

Hands-on exploration of WMR

Part 2: Teaching with WMR
Why map-reduce? Why WMR?
Teaching WMR in CS1; in other courses

Why teach map-reduce?

Why map-reduce for teaching notions of parallelism/concurrency?
– Concepts:
• data parallelism;
• task parallelism;
• locality;
• effects of scale;
• example effective parallel programming model;
• distributed data with redundancy for fault tolerance;
...

Why map-reduce for teaching notions of parallelism/concurrency?
– Real-World: Hadoop widely used
– Exciting: the appeal of Google, Facebook, etc.
– Useful: for appropriate applications
– Powerful: scalability to large clusters, large data

Why WMR?
– Introduce concepts of parallelism
– Low bar for entry, feasible for CS1 (and beyond)
– Capture the imaginations of students
• Supports rapid introduction of concepts of parallelism for every CS student
– Intro module designed for 1-3 days of class

WebMapReduce (WMR)
– Simplified web interface for Hadoop computations
– Goals:
• Strategically Simple: suitable for CS1, but not a toy
• Configurable: write mappers/reducers in any language
• Accessible: web application
• Multi-platform, front-end and back-end

WMR Features (Briefly)
– Testing interface
• Error feedback
• Bypasses Hadoop -- small data only!
– Students enter the following information:
• choice of language
• data to process
• definition of mapper in that language
• definition of reducer in that language

WMR system information
– Languages currently supported: Java, C++, Python, Scheme, C, C#, Javascript
• R coming soon
– Back ends to date: local cluster, Amazon EC2 cloud images
• Version limits and more back ends coming soon
More details about the system in (optional) Part 3 of the workshop

Teaching map-reduce with WMR in the introductory sequence

Kinesthetic student activity
• Visualizations of map-reduce computations are enough for some students, but not all
• An in-class activity to act out the map/shuffle/reduce process helps others
• Also helpful: images of clusters; sequential versions; context of well-known web services

WMR in advanced courses
Example: PDC elective
• CS1 module
• Map-reduce programming techniques
– Features of WMR
– Context forwarding
– Structured values; structured keys
– Multi-case mappers; multi-case reducers
– Broadcasting data values

Examples for Text Processing Techniques
• Combining data within a mapper
– Mapper: Tally counts of words before sending to reducer
• Computational linguistics:
– words that are co-located
• Find and count pairs
Example In: the cat in the cat hat
Emits: 1 cat|in, 1 in|the, 1 cat|hat, 2 the|cat
• Use combining procedure to find 'stripes'
Example In: the cat and the dog fought over the dog bone
Emits: (the, {cat:1, dog:2})
Thanks to: Data-Intensive Text Processing with MapReduce, by Jimmy Lin and Chris Dyer
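The two examples above can be sketched as plain-Python mappers that tally within one line before emitting. This assumes "co-located" means immediately adjacent words, which is consistent with both slide examples; the function names are illustrative.

```python
from collections import Counter, defaultdict

def pairs_mapper(line):
    """Tally adjacent word pairs within one line before emitting them."""
    words = line.split()
    tally = Counter(f"{a}|{b}" for a, b in zip(words, words[1:]))
    return list(tally.items())          # [(pair, count), ...]

def stripes_mapper(line):
    """For each word, tally the words that immediately follow it (a 'stripe')."""
    words = line.split()
    stripes = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        stripes[a][b] += 1
    return [(word, dict(stripe)) for word, stripe in stripes.items()]

pair_counts = dict(pairs_mapper("the cat in the cat hat"))
stripe_for = dict(stripes_mapper("the cat and the dog fought over the dog bone"))
print(pair_counts)
print(stripe_for["the"])
```

In the pairs approach the key is the pair itself; in the stripes approach the key is the first word and the value is a structured tally, so fewer but larger key-value pairs reach the shuffle.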

Application ideas
• Examples in the introductory module
• Big data sets people care about
• Especially for unstructured data
• Convenient for certain kinds of projects
– E.g., most common medical terminology

WMR Hands-on, continued
Module exercises
Extended exercise set
Data sets available
/shared/MovieLens2/movieRatings
/shared/gutenberg/WarAndPeace.txt
/shared/gutenberg/CompleteShakespeare.txt

Part 3: What's under the hood

About WMR
• WMR and its architecture
• Obtaining and installing WMR
– WebMapReduce.sf.com
[Architecture diagram labels: User Browser, User Browser, Web Server, Cluster Head Node, Cluster]

Basic Hadoop components
• Internals:
– Job management (per cluster)
– Task management (per computation node)
• Some components visible to the user:
– Hadoop API: Java, or arbitrary executables ("Streaming")
– Hadoop Distributed File System (HDFS)
– Support tools, including hadoop command
– Limited job monitoring…

Goals
– Introduce map-reduce computing, using the WebMapReduce (WMR) simplified interface to Hadoop
• Why use map-reduce in the curriculum?
– Hands-on exercises with WMR for foundation courses
– Use of WMR for intermediate and advanced courses
• What's under the hood with WMR
• A peek at Hadoop…
– Hands-on exercises for more advanced use

WMR in advanced courses

Inverting
"Chapter 1: Call me Ishmael. Some…"
"Chapter 2: I stuffed a shirt or two..."
"Chapter 3: Entering that gable-ended..."
--> [mapper]
("call","1"), ("me","1"), ..., ("i","2"), ("stuffed","2"), ..., ("entering","3"), ...
--> [reducer]
"a" "1,1,1,1,...,2,2,2,..."
"aback" "3,7,7,8,..."
...
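A small sketch of the inverting (index-building) computation above, in plain Python. It assumes each input line has the form "Chapter N: text", as in the slide; the function names and the tiny sequential driver are illustrative only.

```python
from collections import defaultdict
import re

def index_mapper(line):
    """Emit (word, chapter_number) for each word in a 'Chapter N: text' line."""
    match = re.match(r"Chapter (\d+):(.*)", line)
    if not match:
        return
    chapter, text = match.groups()
    for word in re.findall(r"[a-z]+", text.lower()):
        yield (word, chapter)

def index_reducer(word, chapters):
    """Join the chapter numbers seen for one word into a comma-separated list."""
    return (word, ",".join(sorted(chapters, key=int)))

lines = [
    "Chapter 1: Call me Ishmael. Some",
    "Chapter 2: I stuffed a shirt or two",
    "Chapter 3: Entering that gable-ended",
]
groups = defaultdict(list)              # sequential stand-in for the shuffle
for line in lines:
    for word, chapter in index_mapper(line):
        groups[word].append(chapter)
index = dict(index_reducer(w, c) for w, c in groups.items())
print(index["call"], index["stuffed"], index["entering"])
```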

When is map-reduce appropriate?
• Massive, unstructured or irregularly structured "big data" (terascale and upward)
– Raw text
– Web pages
– XML
– Unstructured streams of data
• Other approaches may fit structured "big data"
– Scalable databases
– Large-scale statistical approaches

Using Hadoop directly (Java)
WordCount.java example

Direct Hadoop Examples
• Word count – Java

Quick questions/comments so far?

Hands-on

Overview of suggested exercises
– Computations with MovieLens2 data; multiple map-reduce cycles
– Traffic data analysis
– Network analysis using Flixter data
– The Million Song dataset

Discussion

Evaluations!
Links at: CSinParallel.org -> Workshops -> WMR Workshop (end of the page)

Some considerations with Hadoop
– Numbers of mappers and reducers
– DFS
– Fault tolerance
– I/O formats
– Note: we have further slides with additional information about these aspects, for you to look at on your own.

Direct Hadoop exercise setup
– Edit your own files, locally
– scp to cluster's admin node (on cloud)
– ssh to compile, launch job
– Percentage progress output is provided
– Multiple cycle support via DFS
– (Clean up)

Additional Details about Hadoop

The hadoop project documentation
• http://hadoop.apache.org/common/docs/current/index.html

How many mappers?
• The Hadoop Map/Reduce framework spawns one mapper task for each InputSplit generated by the InputFormat for the job.
• The number of mappers is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
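As a rough back-of-the-envelope illustration of "total number of blocks", the sketch below counts blocks per file and adds them up. The file sizes and the 64 MB block size are hypothetical values chosen for the example; the actual number of splits depends on the InputFormat and the cluster's configuration.

```python
import math

def estimated_mappers(file_sizes_bytes, block_size_bytes):
    """Estimate mapper tasks as the total number of blocks across input files."""
    return sum(math.ceil(size / block_size_bytes) for size in file_sizes_bytes)

MB = 1024 * 1024
# Hypothetical inputs: three files and a 64 MB block size (an assumed setting).
n = estimated_mappers([200 * MB, 64 * MB, 10 * MB], block_size_bytes=64 * MB)
print(n)
```

Here the 200 MB file spans 4 blocks and the other two files span 1 block each.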

How many reducers?
• The number of reducers for the job is set by the user via JobConf.setNumReduceTasks(int)
• The size of your eventual output may dictate how many reducers you choose.

HDFS
• Fault-tolerant distributed file system modeled after the Google File System
– we've had students read the original GFS paper in an advanced course
• http://hadoop.apache.org/hdfs/docs/current/index.html
• Note the section about the file system commands you can run from the command line:
hadoop fs -ls
hadoop fs -get or -put

HDFS Assumptions and Goals
• Hardware failure
– Hardware failure is the norm rather than the exception.
• Streaming data access
– Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems.
• Large Data Sets
• Simple coherency model
– Read many, write once
• Moving computations is simpler than moving data
• Portability across various hardware/software

Input/Output formats
– Input into mappers is interpreted using classes implementing the interface InputFormat, and output from reducers is written using classes implementing the interface OutputFormat.
– In WMR, mapper input and reducer output are both handled as key-value pairs. This corresponds to using the classes KeyValueTextInputFormat and TextOutputFormat.
– In direct Hadoop, the default input format is TextInputFormat, in which values are lines of the file and keys are positions within that file.
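The difference between the two input formats can be mimicked in plain Python. This is only a model of how each format presents records to a mapper, not Hadoop code; the function names are illustrative, and the sketch assumes UTF-8 text with a tab as the key-value separator.

```python
def text_input_records(data):
    """TextInputFormat-style view: key = byte offset of the line, value = the line."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield (offset, raw.rstrip(b"\r\n").decode("utf-8"))
        offset += len(raw)

def key_value_records(data, separator="\t"):
    """KeyValueTextInputFormat-style view: split each line at the first separator."""
    for line in data.decode("utf-8").splitlines():
        key, _, value = line.partition(separator)
        yield (key, value)

data = b"the\t2\ncat\t1\n"
by_offset = list(text_input_records(data))
by_key = list(key_value_records(data))
print(by_offset)
print(by_key)
```

The key-value view is what makes it straightforward to feed one map-reduce cycle's output into the next cycle's mappers.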

Some further features of Hadoop
– Combiner, an optimization: perform some "reduction" during the map phase, after mapper() and before shuffle
– Sorting control
• Note: hard to sort on secondary key
– Three programming interfaces: Java; pipes (C++); streaming (executables)
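The effect of a combiner can be sketched in plain Python for word count: the output of one mapper is pre-summed locally, so fewer pairs have to travel through the shuffle. The function names are illustrative and this is not Hadoop's combiner interface.

```python
from collections import Counter

def mapper(line):
    """Word-count mapper emitting integer counts."""
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    """Pre-sum counts for identical keys from one mapper's output."""
    tally = Counter()
    for key, value in pairs:
        tally[key] += value
    return list(tally.items())

raw = mapper("the cat in the hat the end")
combined = combiner(raw)
print(len(raw), len(combined))
```

This works for word count because addition can be applied in stages; the reducer still sums whatever partial counts it receives.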

Pagerank algorithm ideas
• Original data: one web page per line
– mapper produces ("dest", "1/k Pn") for each link in page Pn, where k links appear within that page Pn
– reducer produces ("dest", "weight_0 P1 P2 P2 P3 P4 ...") where weight is sum of the weights from key-value pairs emitted by P1, P2, ...
– Subsequent mappers and reducers produce refined weights that take into account deeper chains of pages pointing to pages
– Final reducer delivers ("dest", "weight_k") [drop Pns]
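A minimal sketch of the first cycle only, in plain Python. It assumes a simplified line format "Pn dest1 dest2 ..." (page name followed by the pages it links to), which is not specified in the slides; the function names and the three-page data set are hypothetical, and the later refinement cycles are not shown.

```python
from collections import defaultdict

def pr_mapper(line):
    """Line format (assumed): 'Pn dest1 dest2 ...'. Emit (dest, (1/k, Pn))."""
    page, *links = line.split()
    k = len(links)
    return [(dest, (1.0 / k, page)) for dest in links]

def pr_reducer(dest, values):
    """Sum incoming weights and keep the list of pages that link to dest."""
    weight = sum(w for w, _ in values)
    sources = sorted(p for _, p in values)
    return (dest, (weight, sources))

lines = ["A B C", "B C", "C A"]         # hypothetical three-page web
groups = defaultdict(list)              # sequential stand-in for the shuffle
for line in lines:
    for dest, value in pr_mapper(line):
        groups[dest].append(value)
round0 = dict(pr_reducer(d, v) for d, v in groups.items())
print(round0)
```

Keeping the source pages alongside the weight is what lets a subsequent cycle refine the weights; the final cycle can drop them.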