Pattern Language for Parallel Programming, 2004
TRANSCRIPT
"Ifyoubuildit,theywillcome."
Andsowebuiltthem.Multiprocessorworkstations,massivelyparallelsupercomputers,aclusterineverydepartment...andtheyhaven'tcome.Programmershaven'tcometoprogramthesewonderfulmachines.Oh,afewprogrammersinlovewiththechallengehaveshownthatmosttypesofproblemscanbeforcefitontoparallelcomputers,butgeneralprogrammers,especiallyprofessionalprogrammerswho"havelives",ignoreparallelcomputers.
Andtheydosoattheirownperil.Parallelcomputersaregoingmainstream.Multithreadedmicroprocessors,multicoreCPUs,multiprocessorPCs,clusters,parallelgameconsoles...parallelcomputersaretakingovertheworldofcomputing.Thecomputerindustryisreadytofloodthemarketwithhardwarethatwillonlyrunatfullspeedwithparallelprograms.Butwhowillwritetheseprograms?
Thisisanoldproblem.Evenintheearly1980s,whenthe"killermicros"startedtheirassaultontraditionalvectorsupercomputers,weworriedendlesslyabouthowtoattractnormalprogrammers.Wetriedeverythingwecouldthinkof:highlevelhardwareabstractions,implicitlyparallelprogramminglanguages,parallellanguageextensions,andportablemessagepassinglibraries.Butaftermanyyearsofhardwork,thefactofthematteristhat"they"didn'tcome.Theoverwhelmingmajorityofprogrammerswillnotinvesttheefforttowriteparallelsoftware.
Acommonviewisthatyoucan'tteacholdprogrammersnewtricks,sotheproblemwillnotbesolveduntiltheoldprogrammersfadeawayandanewgenerationtakesover.
Butwedon'tbuyintothatdefeatistattitude.Programmershaveshownaremarkableabilitytoadoptnewsoftwaretechnologiesovertheyears.LookathowmanyoldFortranprogrammersarenowwritingelegantJavaprogramswithsophisticatedobjectorienteddesigns.Theproblemisn'twitholdprogrammers.Theproblemiswitholdparallelcomputingexpertsandthewaythey'vetriedtocreateapoolofcapableparallelprogrammers.
Andthat'swherethisbookcomesin.Wewanttocapturetheessenceofhowexpertparallelprogrammersthinkaboutparallelalgorithmsandcommunicatethatessentialunderstandinginawayprofessionalprogrammerscanreadilymaster.Thetechnologywe'veadoptedtoaccomplishthistaskisapatternlanguage.Wemadethischoicenotbecausewestartedtheprojectasdevoteesofdesignpatternslookingforanewfieldtoconquer,butbecausepatternshavebeenshowntoworkinwaysthatwouldbeapplicableinparallelprogramming.Forexample,patternshavebeenveryeffectiveinthefieldofobjectorienteddesign.Theyhaveprovidedacommonlanguageexpertscanusetotalkabouttheelementsofdesignandhavebeenextremelyeffectiveathelpingprogrammersmasterobjectorienteddesign.
This book contains our pattern language for parallel programming. The book opens with a couple of chapters to introduce the key concepts in parallel computing. These chapters focus on the parallel computing concepts and jargon used in the pattern language as opposed to being an exhaustive introduction to the field.

The pattern language itself is presented in four parts corresponding to the four phases of creating a parallel program:

* Finding Concurrency. The programmer works in the problem domain to identify the available concurrency and expose it for use in the algorithm design.
* Algorithm Structure. The programmer works with high-level structures for organizing a parallel algorithm.
* Supporting Structures. We shift from algorithms to source code and consider how the parallel program will be organized and the techniques used to manage shared data.
* Implementation Mechanisms. The final step is to look at specific software constructs for implementing a parallel program.

The patterns making up these four design spaces are tightly linked. You start at the top (Finding Concurrency), work through the patterns, and by the time you get to the bottom (Implementation Mechanisms), you will have a detailed design for your parallel program.

If the goal is a parallel program, however, you need more than just a parallel algorithm. You also need a programming environment and a notation for expressing the concurrency within the program's source code. Programmers used to be confronted by a large and confusing array of parallel programming environments. Fortunately, over the years the parallel programming community has converged around three programming environments.
* OpenMP. A simple language extension to C, C++, or Fortran to write parallel programs for shared-memory computers.
* MPI. A message-passing library used on clusters and other distributed-memory computers.
* Java. An object-oriented programming language with language features supporting parallel programming on shared-memory computers and standard class libraries supporting distributed computing.

Many readers will already be familiar with one or more of these programming notations, but for readers completely new to parallel computing, we've included a discussion of these programming environments in the appendixes.
In closing, we have been working for many years on this pattern language. Presenting it as a book so people can start using it is an exciting development for us. But we don't see this as the end of this effort. We expect that others will have their own ideas about new and better patterns for parallel programming. We've assuredly missed some important features that really belong in this pattern language. We embrace change and look forward to engaging with the larger parallel computing community to iterate on this language. Over time, we'll update and improve the pattern language until it truly represents the consensus view of the parallel programming community. Then our real work will begin: using the pattern language to guide the creation of better parallel programming environments and helping people to use these technologies to write parallel software. We won't rest until the day sequential software is rare.
ACKNOWLEDGMENTS
We started working together on this pattern language in 1998. It's been a long and twisted road, starting with a vague idea about a new way to think about parallel algorithms and finishing with this book. We couldn't have done this without a great deal of help.
Mani Chandy, who thought we would make a good team, introduced Tim to Beverly and Berna. The National Science Foundation, Intel Corp., and Trinity University have supported this research at various times over the years. Help with the patterns themselves came from the people at the Pattern Languages of Programs (PLoP) workshops held in Illinois each summer. The format of these workshops and the resulting review process was challenging and sometimes difficult, but without them we would have never finished this pattern language. We would also like to thank the reviewers who carefully read early manuscripts and pointed out countless errors and ways to improve the book.
Finally, we thank our families. Writing a book is hard on the authors, but that is to be expected. What we didn't fully appreciate was how hard it would be on our families. We are grateful to Beverly's family (Daniel and Steve), Tim's family (Noah, August, and Martha), and Berna's family (Billie) for the sacrifices they've made to support this project.
Tim Mattson, Olympia, Washington, April 2004
Beverly Sanders, Gainesville, Florida, April 2004
Berna Massingill, San Antonio, Texas, April 2004
Chapter 1. A Pattern Language for Parallel Programming
  Section 1.1. INTRODUCTION
  Section 1.2. PARALLEL PROGRAMMING
  Section 1.3. DESIGN PATTERNS AND PATTERN LANGUAGES
  Section 1.4. A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING
Chapter 2. Background and Jargon of Parallel Computing
  Section 2.1. CONCURRENCY IN PARALLEL PROGRAMS VERSUS OPERATING SYSTEMS
  Section 2.2. PARALLEL ARCHITECTURES: A BRIEF INTRODUCTION
  Section 2.3. PARALLEL PROGRAMMING ENVIRONMENTS
  Section 2.4. THE JARGON OF PARALLEL COMPUTING
  Section 2.5. A QUANTITATIVE LOOK AT PARALLEL COMPUTATION
  Section 2.6. COMMUNICATION
  Section 2.7. SUMMARY
Chapter 3. The Finding Concurrency Design Space
  Section 3.1. ABOUT THE DESIGN SPACE
  Section 3.2. THE TASK DECOMPOSITION PATTERN
  Section 3.3. THE DATA DECOMPOSITION PATTERN
  Section 3.4. THE GROUP TASKS PATTERN
  Section 3.5. THE ORDER TASKS PATTERN
  Section 3.6. THE DATA SHARING PATTERN
  Section 3.7. THE DESIGN EVALUATION PATTERN
  Section 3.8. SUMMARY
Chapter 4. The Algorithm Structure Design Space
  Section 4.1. INTRODUCTION
  Section 4.2. CHOOSING AN ALGORITHM STRUCTURE PATTERN
  Section 4.3. EXAMPLES
  Section 4.4. THE TASK PARALLELISM PATTERN
  Section 4.5. THE DIVIDE AND CONQUER PATTERN
  Section 4.6. THE GEOMETRIC DECOMPOSITION PATTERN
  Section 4.7. THE RECURSIVE DATA PATTERN
  Section 4.8. THE PIPELINE PATTERN
  Section 4.9. THE EVENT-BASED COORDINATION PATTERN
Chapter 5. The Supporting Structures Design Space
  Section 5.1. INTRODUCTION
  Section 5.2. FORCES
  Section 5.3. CHOOSING THE PATTERNS
  Section 5.4. THE SPMD PATTERN
  Section 5.5. THE MASTER/WORKER PATTERN
  Section 5.6. THE LOOP PARALLELISM PATTERN
  Section 5.7. THE FORK/JOIN PATTERN
  Section 5.8. THE SHARED DATA PATTERN
  Section 5.9. THE SHARED QUEUE PATTERN
  Section 5.10. THE DISTRIBUTED ARRAY PATTERN
  Section 5.11. OTHER SUPPORTING STRUCTURES
Chapter 6. The Implementation Mechanisms Design Space
  Section 6.1. OVERVIEW
  Section 6.2. UE MANAGEMENT
  Section 6.3. SYNCHRONIZATION
  Section 6.4. COMMUNICATION
  Endnotes
Appendix A: A Brief Introduction to OpenMP
  Section A.1. CORE CONCEPTS
  Section A.2. STRUCTURED BLOCKS AND DIRECTIVE FORMATS
  Section A.3. WORKSHARING
  Section A.4. DATA ENVIRONMENT CLAUSES
  Section A.5. THE OpenMP RUNTIME LIBRARY
  Section A.6. SYNCHRONIZATION
  Section A.7. THE SCHEDULE CLAUSE
  Section A.8. THE REST OF THE LANGUAGE
Appendix B: A Brief Introduction to MPI
  Section B.1. CONCEPTS
  Section B.2. GETTING STARTED
  Section B.3. BASIC POINT-TO-POINT MESSAGE PASSING
  Section B.4. COLLECTIVE OPERATIONS
  Section B.5. ADVANCED POINT-TO-POINT MESSAGE PASSING
  Section B.6. MPI AND FORTRAN
  Section B.7. CONCLUSION
Appendix C: A Brief Introduction to Concurrent Programming in Java
  Section C.1. CREATING THREADS
  Section C.2. ATOMICITY, MEMORY SYNCHRONIZATION, AND THE volatile KEYWORD
  Section C.3. SYNCHRONIZED BLOCKS
  Section C.4. WAIT AND NOTIFY
  Section C.5. LOCKS
  Section C.6. OTHER SYNCHRONIZATION MECHANISMS AND SHARED DATA STRUCTURES
  Section C.7. INTERRUPTS
Glossary
Bibliography
About the Authors
Index
Chapter 1. A Pattern Language for Parallel Programming
1.1 INTRODUCTION
1.2 PARALLEL PROGRAMMING
1.3 DESIGN PATTERNS AND PATTERN LANGUAGES
1.4 A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING
1.1. INTRODUCTION

Computers are used to model physical systems in many fields of science, medicine, and engineering. Modelers, whether trying to predict the weather or render a scene in the next blockbuster movie, can usually use whatever computing power is available to make ever more detailed simulations. Vast amounts of data, whether customer shopping patterns, telemetry data from space, or DNA sequences, require analysis. To deliver the required power, computer designers combine multiple processing elements into a single larger system. These so-called parallel computers run multiple tasks simultaneously and solve bigger problems in less time.
Traditionally, parallel computers were rare and available for only the most critical problems. Since the mid-1990s, however, the availability of parallel computers has changed dramatically. With multithreading support built into the latest microprocessors and the emergence of multiple processor cores on a single silicon die, parallel computers are becoming ubiquitous. Now, almost every university computer science department has at least one parallel computer. Virtually all oil companies, automobile manufacturers, drug development companies, and special effects studios use parallel computing.
For example, in computer animation, rendering is the step where information from the animation files, such as lighting, textures, and shading, is applied to 3D models to generate the 2D image that makes up a frame of the film. Parallel computing is essential to generate the needed number of frames (24 per second) for a feature-length film. Toy Story, the first completely computer-generated feature-length film, released by Pixar in 1995, was processed on a "render farm" consisting of 100 dual-processor machines [PS00]. By 1999, for Toy Story 2, Pixar was using a 1,400-processor system with the improvement in processing power fully reflected in the improved details in textures, clothing, and atmospheric effects. Monsters, Inc. (2001) used a system of 250 enterprise servers, each containing 14 processors, for a total of 3,500 processors. It is interesting that the amount of time required to generate a frame has remained relatively constant: as computing power (both the number of processors and the speed of each processor) has increased, it has been exploited to improve the quality of the animation.
The biological sciences have taken dramatic leaps forward with the availability of DNA sequence information from a variety of organisms, including humans. One approach to sequencing, championed and used with success by Celera Corp., is called the whole genome shotgun algorithm. The idea is to break the genome into small segments, experimentally determine the DNA sequences of the segments, and then use a computer to construct the entire sequence from the segments by finding overlapping areas. The computing facilities used by Celera to sequence the human genome included 150 four-way servers plus a server with 16 processors and 64 GB of memory. The calculation involved 500 million trillion base-to-base comparisons [Ein00].
The SETI@home project [SET, ACK+02] provides a fascinating example of the power of parallel computing. The project seeks evidence of extraterrestrial intelligence by scanning the sky with the world's largest radio telescope, the Arecibo Telescope in Puerto Rico. The collected data is then analyzed for candidate signals that might indicate an intelligent source. The computational task is beyond even the largest supercomputer, and certainly beyond the capabilities of the facilities available to the SETI@home project. The problem is solved with public-resource computing, which turns PCs around the world into a huge parallel computer connected by the Internet. Data is broken up into work units and distributed over the Internet to client computers whose owners donate spare computing time to support the project. Each client periodically connects with the SETI@home server, downloads the data to analyze, and then sends the results back to the server. The client program is typically implemented as a screen saver so that it will devote CPU cycles to the SETI problem only when the computer is otherwise idle. A work unit currently requires an average of between seven and eight hours of CPU time on a client. More than 205,000,000 work units have been processed since the start of the project. More recently, similar technology to that demonstrated by SETI@home has been used for a variety of public-resource computing projects as well as internal projects within large companies utilizing their idle PCs to solve problems ranging from drug screening to chip design validation.
Although computing in less time is beneficial, and may enable problems to be solved that couldn't be otherwise, it comes at a cost. Writing software to run on parallel computers can be difficult. Only a small minority of programmers have experience with parallel programming. If all these computers designed to exploit parallelism are going to achieve their potential, more programmers need to learn how to write parallel programs.

This book addresses this need by showing competent programmers of sequential machines how to design programs that can run on parallel computers. Although many excellent books show how to use particular parallel programming environments, this book is unique in that it focuses on how to think about and design parallel algorithms. To accomplish this goal, we will be using the concept of a pattern language. This highly structured representation of expert design experience has been heavily used in the object-oriented design community.
The book opens with two introductory chapters. The first gives an overview of the parallel computing landscape and background needed to understand and use the pattern language. This is followed by a more detailed chapter in which we lay out the basic concepts and jargon used by parallel programmers. The book then moves into the pattern language itself.
1.2. PARALLEL PROGRAMMING

The key to parallel computing is exploitable concurrency. Concurrency exists in a computational problem when the problem can be decomposed into subproblems that can safely execute at the same time. To be of any use, however, it must be possible to structure the code to expose and later exploit the concurrency and permit the subproblems to actually run concurrently; that is, the concurrency must be exploitable.
Most large computational problems contain exploitable concurrency. A programmer works with exploitable concurrency by creating a parallel algorithm and implementing the algorithm using a parallel programming environment. When the resulting parallel program is run on a system with multiple processors, the amount of time we have to wait for the results of the computation is reduced. In addition, multiple processors may allow larger problems to be solved than could be done on a single-processor system.
As a simple example, suppose part of a computation involves computing the summation of a large set of values. If multiple processors are available, instead of adding the values together sequentially, the set can be partitioned and the summations of the subsets computed simultaneously, each on a different processor. The partial sums are then combined to get the final answer. Thus, using multiple processors to compute in parallel may allow us to obtain a solution sooner. Also, if each processor has its own memory, partitioning the data between the processors may allow larger problems to be handled than could be handled on a single processor.
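To make the idea concrete, here is a minimal sketch of the partitioned summation in C with OpenMP (one of the environments used throughout this book); the array size and contents are placeholders chosen only for illustration.

    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double values[N];
        double sum = 0.0;

        for (int i = 0; i < N; i++)    /* placeholder data */
            values[i] = 1.0;

        /* Each thread sums a subset of the values; the reduction
           clause combines the per-thread partial sums into the
           final answer, exactly as described above. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += values[i];

        printf("sum = %f\n", sum);
        return 0;
    }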
This simple example shows the essence of parallel computing. The goal is to use multiple processors to solve problems in less time and/or to solve bigger problems than would be possible on a single processor. The programmer's task is to identify the concurrency in the problem, structure the algorithm so that this concurrency can be exploited, and then implement the solution using a suitable programming environment. The final step is to solve the problem by executing the code on a parallel system.
Parallel programming presents unique challenges. Often, the concurrent tasks making up the problem include dependencies that must be identified and correctly managed. The order in which the tasks execute may change the answers of the computations in nondeterministic ways. For example, in the parallel summation described earlier, a partial sum cannot be combined with others until its own computation has completed. The algorithm imposes a partial order on the tasks (that is, they must complete before the sums can be combined). More subtly, the numerical value of the summations may change slightly depending on the order of the operations within the sums because floating-point arithmetic is nonassociative. A good parallel programmer must take care to ensure that nondeterministic issues such as these do not affect the quality of the final answer. Creating safe parallel programs can take considerable effort from the programmer.
Even when a parallel program is "correct", it may fail to deliver the anticipated performance improvement from exploiting concurrency. Care must be taken to ensure that the overhead incurred by managing the concurrency does not overwhelm the program runtime. Also, partitioning the work among the processors in a balanced way is often not as easy as the summation example suggests. The effectiveness of a parallel algorithm depends on how well it maps onto the underlying parallel computer, so a parallel algorithm could be very effective on one parallel architecture and a disaster on another.
We will revisit these issues and provide a more quantitative view of parallel computation in the next chapter.
1.3. DESIGN PATTERNS AND PATTERN LANGUAGES

A design pattern describes a good solution to a recurring problem in a particular context. The pattern follows a prescribed format that includes the pattern name, a description of the context, the forces (goals and constraints), and the solution. The idea is to record the experience of experts in a way that can be used by others facing a similar problem. In addition to the solution itself, the name of the pattern is important and can form the basis for a domain-specific vocabulary that can significantly enhance communication between designers in the same area.
Design patterns were first proposed by Christopher Alexander. The domain was city planning and architecture [AIS77]. Design patterns were originally introduced to the software engineering community by Beck and Cunningham [BC87] and became prominent in the area of object-oriented programming with the publication of the book by Gamma, Helm, Johnson, and Vlissides [GHJV95], affectionately known as the GoF (Gang of Four) book. This book gives a large collection of design patterns for object-oriented programming. To give one example, the Visitor pattern describes a way to structure classes so that the code implementing a heterogeneous data structure can be kept separate from the code to traverse it. Thus, what happens in a traversal depends on both the type of each node and the class that implements the traversal. This allows multiple functionality for data structure traversals, and significant flexibility as new functionality can be added without having to change the data structure class. The patterns in the GoF book have entered the lexicon of object-oriented programming; references to its patterns are found in the academic literature, trade publications, and system documentation. These patterns have by now become part of the expected knowledge of any competent software engineer.
An educational nonprofit organization called the Hillside Group [Hil] was formed in 1993 to promote the use of patterns and pattern languages and, more generally, to improve human communication about computers "by encouraging people to codify common programming and design practice". To develop new patterns and help pattern writers hone their skills, the Hillside Group sponsors an annual Pattern Languages of Programs (PLoP) workshop and several spinoffs in other parts of the world, such as ChiliPLoP (in the western United States), KoalaPLoP (Australia), EuroPLoP (Europe), and MensorePLoP (Japan). The proceedings of these workshops [Pat] provide a rich source of patterns covering a vast range of application domains in software development and have been used as a basis for several books [CS95, VCK96, MRB97, HFR99].
In his original work on patterns, Alexander provided not only a catalog of patterns, but also a pattern language that introduced a new approach to design. In a pattern language, the patterns are organized into a structure that leads the user through the collection of patterns in such a way that complex systems can be designed using the patterns. At each decision point, the designer selects an appropriate pattern. Each pattern leads to other patterns, resulting in a final design in terms of a web of patterns. Thus, a pattern language embodies a design methodology and provides domain-specific advice to the application designer. (In spite of the overlapping terminology, a pattern language is not a programming language.)
1.4. A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING

This book describes a pattern language for parallel programming that provides several benefits. The immediate benefits are a way to disseminate the experience of experts by providing a catalog of good solutions to important problems, an expanded vocabulary, and a methodology for the design of parallel programs. We hope to lower the barrier to parallel programming by providing guidance through the entire process of developing a parallel program. The programmer brings to the process a good understanding of the actual problem to be solved and then works through the pattern language, eventually obtaining a detailed parallel design or possibly working code. In the longer term, we hope that this pattern language can provide a basis for both a disciplined approach to the qualitative evaluation of different programming models and the development of parallel programming tools.
The pattern language is organized into four design spaces (Finding Concurrency, Algorithm Structure, Supporting Structures, and Implementation Mechanisms) which form a linear hierarchy, with Finding Concurrency at the top and Implementation Mechanisms at the bottom, as shown in Fig. 1.1.
Figure 1.1. Overview of the pattern language
The Finding Concurrency design space is concerned with structuring the problem to expose exploitable concurrency. The designer working at this level focuses on high-level algorithmic issues and reasons about the problem to expose potential concurrency. The Algorithm Structure design space is concerned with structuring the algorithm to take advantage of potential concurrency. That is, the designer working at this level reasons about how to use the concurrency exposed in working with the Finding Concurrency patterns. The Algorithm Structure patterns describe overall strategies for exploiting concurrency. The Supporting Structures design space represents an intermediate stage between the Algorithm Structure and Implementation Mechanisms design spaces. Two important groups of patterns in this space are those that represent program-structuring approaches and those that represent commonly used shared data structures. The Implementation Mechanisms design space is concerned with how the patterns of the higher-level spaces are mapped into particular programming environments. We use it to provide descriptions of common mechanisms for process/thread management (for example, creating or destroying processes/threads) and process/thread interaction (for example, semaphores, barriers, or message passing). The items in this design space are not presented as patterns because in many cases they map directly onto elements within particular parallel programming environments. They are included in the pattern language anyway, however, to provide a complete path from problem description to code.
Chapter 2. Background and Jargon of Parallel Computing
2.1 CONCURRENCY IN PARALLEL PROGRAMS VERSUS OPERATING SYSTEMS
2.2 PARALLEL ARCHITECTURES: A BRIEF INTRODUCTION
2.3 PARALLEL PROGRAMMING ENVIRONMENTS
2.4 THE JARGON OF PARALLEL COMPUTING
2.5 A QUANTITATIVE LOOK AT PARALLEL COMPUTATION
2.6 COMMUNICATION
2.7 SUMMARY
In this chapter, we give an overview of the parallel programming landscape, and define any specialized parallel computing terminology that we will use in the patterns. Because many terms in computing are overloaded, taking different meanings in different contexts, we suggest that even readers familiar with parallel programming at least skim this chapter.
2.1. CONCURRENCY IN PARALLEL PROGRAMS VERSUS OPERATING SYSTEMS

Concurrency was first exploited in computing to better utilize or share resources within a computer. Modern operating systems support context switching to allow multiple tasks to appear to execute concurrently, thereby allowing useful work to occur while the processor is stalled on one task. This application of concurrency, for example, allows the processor to stay busy by swapping in a new task to execute while another task is waiting for I/O. By quickly swapping tasks in and out, giving each task a "slice" of the processor time, the operating system can allow multiple users to use the system as if each were using it alone (but with degraded performance).
Most modern operating systems can use multiple processors to increase the throughput of the system. The UNIX shell uses concurrency along with a communication abstraction known as pipes to provide a powerful form of modularity: Commands are written to accept a stream of bytes as input (the consumer) and produce a stream of bytes as output (the producer). Multiple commands can be chained together with a pipe connecting the output of one command to the input of the next, allowing complex commands to be built from simple building blocks. Each command is executed in its own process, with all processes executing concurrently. Because the producer blocks if buffer space in the pipe is not available, and the consumer blocks if data is not available, the job of managing the stream of results moving between commands is greatly simplified. More recently, with operating systems with windows that invite users to do more than one thing at a time, and the Internet, which often introduces I/O delays perceptible to the user, almost every program that contains a GUI incorporates concurrency.
Although the fundamental concepts for safely handling concurrency are the same in parallel programs and operating systems, there are some important differences. For an operating system, the problem is not finding concurrency; the concurrency is inherent in the way the operating system functions in managing a collection of concurrently executing processes (representing users, applications, and background activities such as print spooling) and providing synchronization mechanisms so resources can be safely shared. However, an operating system must support concurrency in a robust and secure way: Processes should not be able to interfere with each other (intentionally or not), and the entire system should not crash if something goes wrong with one process. In a parallel program, finding and exploiting concurrency can be a challenge, while isolating processes from each other is not the critical concern it is with an operating system. Performance goals are different as well. In an operating system, performance goals are normally related to throughput or response time, and it may be acceptable to sacrifice some efficiency to maintain robustness and fairness in resource allocation. In a parallel program, the goal is to minimize the running time of a single program.
2.2. PARALLEL ARCHITECTURES: A BRIEF INTRODUCTION

There are dozens of different parallel architectures, among them networks of workstations, clusters of off-the-shelf PCs, massively parallel supercomputers, tightly coupled symmetric multiprocessors, and multiprocessor workstations. In this section, we give an overview of these systems, focusing on the characteristics relevant to the programmer.
2.2.1. Flynn's Taxonomy

By far the most common way to characterize these architectures is Flynn's taxonomy [Fly72]. He categorizes all computers according to the number of instruction streams and data streams they have, where a stream is a sequence of instructions or data on which a computer operates. In Flynn's taxonomy, there are four possibilities: SISD, SIMD, MISD, and MIMD.
Single Instruction, Single Data (SISD). In a SISD system, one stream of instructions processes a single stream of data, as shown in Fig. 2.1. This is the common von Neumann model used in virtually all single-processor computers.
Figure 2.1. The Single Instruction, Single Data (SISD) architecture
Single Instruction, Multiple Data (SIMD). In a SIMD system, a single instruction stream is concurrently broadcast to multiple processors, each with its own data stream (as shown in Fig. 2.2). The original systems from Thinking Machines and MasPar can be classified as SIMD. The CPP DAP Gamma II and Quadrics Apemille are more recent examples; these are typically deployed in specialized applications, such as digital signal processing, that are suited to fine-grained parallelism and require little interprocess communication. Vector processors, which operate on vector data in a pipelined fashion, can also be categorized as SIMD. Exploiting this parallelism is usually done by the compiler.
Figure 2.2. The Single Instruction, Multiple Data (SIMD) architecture
Multiple Instruction, Single Data (MISD). No well-known systems fit this designation. It is mentioned for the sake of completeness.
Multiple Instruction, Multiple Data (MIMD). In a MIMD system, each processing element has its own stream of instructions operating on its own data. This architecture, shown in Fig. 2.3, is the most general of the architectures in that each of the other cases can be mapped onto the MIMD architecture. The vast majority of modern parallel systems fit into this category.
Figure 2.3. The Multiple Instruction, Multiple Data (MIMD) architecture
2.2.2. A Further Breakdown of MIMD

The MIMD category of Flynn's taxonomy is too broad to be useful on its own; this category is typically decomposed according to memory organization.
Shared memory. In a shared-memory system, all processes share a single address space and communicate with each other by writing and reading shared variables.
One class of shared-memory systems is called SMPs (symmetric multiprocessors). As shown in Fig. 2.4, all processors share a connection to a common memory and access all memory locations at equal speeds. SMP systems are arguably the easiest parallel systems to program because programmers do not need to distribute data structures among processors. Because increasing the number of processors increases contention for the memory, the processor/memory bandwidth is typically a limiting factor. Thus, SMP systems do not scale well and are limited to small numbers of processors.
Figure 2.4. The Symmetric Multiprocessor (SMP) architecture
The other main class of shared-memory systems is called NUMA (nonuniform memory access). As shown in Fig. 2.5, the memory is shared; that is, it is uniformly addressable from all processors, but some blocks of memory may be physically more closely associated with some processors than others. This reduces the memory bandwidth bottleneck and allows systems with more processors; however, as a result, the access time from a processor to a memory location can be significantly different depending on how "close" the memory location is to the processor. To mitigate the effects of nonuniform access, each processor has a cache, along with a protocol to keep cache entries coherent. Hence, another name for these architectures is cache-coherent nonuniform memory access systems (ccNUMA). Logically, programming a ccNUMA system is the same as programming an SMP, but to obtain the best performance, the programmer will need to be more careful about locality issues and cache effects.
Figure 2.5. An example of the nonuniform memory access (NUMA) architecture
Distributed memory. In a distributed-memory system, each process has its own address space and communicates with other processes by message passing (sending and receiving messages). A schematic representation of a distributed-memory computer is shown in Fig. 2.6.
Figure 2.6. The distributed-memory architecture
Depending on the topology and technology used for the processor interconnection, communication speed can range from almost as fast as shared memory (in tightly integrated supercomputers) to orders of magnitude slower (for example, in a cluster of PCs interconnected with an Ethernet network). The programmer must explicitly program all the communication between processors and be concerned with the distribution of data.
Distributed-memory computers are traditionally divided into two classes: MPP (massively parallel processors) and clusters. In an MPP, the processors and the network infrastructure are tightly coupled and specialized for use in a parallel computer. These systems are extremely scalable, in some cases supporting the use of many thousands of processors in a single system [MSW96, IBM02].
Clusters are distributed-memory systems composed of off-the-shelf computers connected by an off-the-shelf network. When the computers are PCs running the Linux operating system, these clusters are called Beowulf clusters. As off-the-shelf networking technology improves, systems of this type are becoming more common and much more powerful. Clusters provide an inexpensive way for an organization to obtain parallel computing capabilities [Beo]. Preconfigured clusters are now available from many vendors. One frugal group even reported constructing a useful parallel system by using a cluster to harness the combined power of obsolete PCs that otherwise would have been discarded [HHS01].
Hybrid systems. These systems are clusters of nodes with separate address spaces in which each node contains several processors that share memory.
According to van der Steen and Dongarra's "Overview of Recent Supercomputers" [vdSD03], which contains a brief description of the supercomputers currently or soon to be commercially available, hybrid systems formed from clusters of SMPs connected by a fast network are currently the dominant trend in high-performance computing. For example, in late 2003, four of the five fastest computers in the world were hybrid systems [Top].
Grids. Grids are systems that use distributed, heterogeneous resources connected by LANs and/or WANs [FK03]. Often the interconnection network is the Internet. Grids were originally envisioned as a way to link multiple supercomputers to enable larger problems to be solved, and thus could be viewed as a special type of distributed-memory or hybrid MIMD machine. More recently, the idea of grid computing has evolved into a general way to share heterogeneous resources, such as computation servers, storage, application servers, information services, or even scientific instruments. Grids differ from clusters in that the various resources in the grid need not have a common point of administration. In most cases, the resources on a grid are owned by different organizations that maintain control over the policies governing use of the resources. This affects the way these systems are used, the middleware created to manage them, and most importantly for this discussion, the overhead incurred when communicating between resources within the grid.
2.2.3. Summary

We have classified these systems according to the characteristics of the hardware. These characteristics typically influence the native programming model used to express concurrency on a system; however, this is not always the case. It is possible for a programming environment for a shared-memory machine to provide the programmer with the abstraction of distributed memory and message passing. Virtual distributed shared memory systems contain middleware to provide the opposite: the abstraction of shared memory on a distributed-memory machine.
2.3. PARALLEL PROGRAMMING ENVIRONMENTS

Parallel programming environments provide the basic tools, language features, and application programming interfaces (APIs) needed to construct a parallel program. A programming environment implies a particular abstraction of the computer system called a programming model. Traditional sequential computers use the well-known von Neumann model. Because all sequential computers use this model, software designers can design software to a single abstraction and reasonably expect it to map onto most, if not all, sequential computers.
Unfortunately, there are many possible models for parallel computing, reflecting the different ways processors can be interconnected to construct a parallel system. The most common models are based on one of the widely deployed parallel architectures: shared memory, distributed memory with message passing, or a hybrid combination of the two.
Programming models too closely aligned to a particular parallel system lead to programs that are not portable between parallel computers. Because the effective lifespan of software is longer than that of hardware, many organizations have more than one type of parallel computer, and most programmers insist on programming environments that allow them to write portable parallel programs. Also, explicitly managing large numbers of resources in a parallel computer is difficult, suggesting that higher-level abstractions of the parallel computer might be useful. The result is that as of the mid-1990s, there was a veritable glut of parallel programming environments. A partial list of these is shown in Table 2.1. This created a great deal of confusion for application developers and hindered the adoption of parallel computing for mainstream applications.
Table 2.1. Some Parallel Programming Environments from the Mid-1990s
"C*inC CUMULVS JavaRMI PRIO Quake
ABCPL DAGGER javaPG P3L Quark
ACE DAPPLE JAVAR P4Linda QuickThreads
ACT++ DataParallelC JavaSpaces Pablo Sage++
ADDAP DC++ JIDL PADE SAM
Adl DCE++ Joyce PADRE SCANDAL
Adsmith DDD Karma Panda SCHEDULE
-
AFAPI DICE Khoros Papers SciTL
ALWAN DIPC KOAN/FortranS Para++ SDDA
AM DistributedSmalltalk
LAM Paradigm SHMEM
AMDC DOLIB Legion Parafrase2 SIMPLE
Amoeba DOME Lilac Paralation Sina
AppLeS DOSMOS Linda Parallaxis SISAL
ARTS DRL LiPS ParallelHaskell
SMI
AthapascanOb DSMThreads Locust ParallelC++ SONiC
Aurora Ease Lparx ParC SplitC
Automap ECO Lucid ParLib++ SR
bb_threads Eilean Maisie ParLin Sthreads
Blaze Emerald Manifold Parlog Strand
BlockComm EPL Mentat Parmacs SUIF
BSP Excalibur MetaChaos Parti SuperPascal
C* Express Midway pC Synergy
C** Falcon Millipede pC++ TCGMSG
C4 Filaments Mirage PCN Telegraphos
CarlOS FLASH Modula2* PCP: TheFORCE
Cashmere FM ModulaP PCU Threads.h++
CC++ Fork MOSIX PEACE TRAPPER
Charlotte FortranM MpC PENNY TreadMarks
Charm FX MPC++ PET UC
Charm++ GA MPI PETSc uC++
Chu GAMMA Multipol PH UNITY
Cid Glenda Munin Phosphorus V
Cilk GLU NanoThreads POET Vic*
CMFortran GUARD NESL Polaris VisifoldVNUS
Code HAsL NetClasses++ POOLT VPE
-
ConcurrentML HORUS Nexus POOMA Win32threads
Converse HPC Nimrod POSYBL WinPar
COOL HPC++ NOW PRESTO WWWinda
CORRELATE HPF ObjectiveLinda Prospero XENOOPS
CparPar IMPACT Occam Proteus XPC
CPS ISETLLinda Omega PSDM Zounds
CRL ISIS OOF90 PSI ZPL
CSP JADA Orca PVM
Cthreads JADE P++ QPC++
Fortunately, by the late 1990s, the parallel programming community converged predominantly on two environments for parallel programming: OpenMP [OMP] for shared memory and MPI [Mesb] for message passing.
OpenMP is a set of language extensions implemented as compiler directives. Implementations are currently available for Fortran, C, and C++. OpenMP is frequently used to incrementally add parallelism to sequential code. By adding a compiler directive around a loop, for example, the compiler can be instructed to generate code to execute the iterations of the loop in parallel. The compiler takes care of most of the details of thread creation and management. OpenMP programs tend to work very well on SMPs, but because its underlying programming model does not include a notion of nonuniform memory access times, it is less ideal for ccNUMA and distributed-memory machines.
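As an illustration of this incremental style, the sketch below (our example; the function and its arguments are hypothetical) parallelizes a loop by adding a single directive; the loop body is unchanged from the sequential version.

    #include <math.h>

    /* Apply a transformation to each element of an array. The loop
       iterations are independent, so the single directive below is
       enough to ask the compiler to divide them among threads. */
    void scale_array(int n, double *a)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] = sqrt(a[i]);
    }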
MPI is a set of library routines that provide for process management, message passing, and some collective communication operations (these are operations that involve all the processes involved in a program, such as barrier, broadcast, and reduction). MPI programs can be difficult to write because the programmer is responsible for data distribution and explicit interprocess communication using messages. Because the programming model assumes distributed memory, MPI is a good choice for MPPs and other distributed-memory machines.
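For readers who have not seen MPI, here is a minimal sketch of explicit message passing between two processes (run with at least two processes; the payload value is a placeholder of ours):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                      /* placeholder payload */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("process 1 received %d from process 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }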
Neither OpenMP nor MPI is an ideal fit for hybrid architectures that combine multiprocessor nodes, each with multiple processes and a shared memory, into a larger system with separate address spaces for each node: The OpenMP model does not recognize nonuniform memory access times, so its data allocation can lead to poor performance on machines that are not SMPs, while MPI does not include constructs to manage data structures residing in a shared memory. One solution is a hybrid model in which OpenMP is used on each shared-memory node and MPI is used between the nodes. This works well, but it requires the programmer to work with two different programming models within a single program. Another option is to use MPI on both the shared-memory and distributed-memory portions of the algorithm and give up the advantages of a shared-memory programming model, even when the hardware directly supports it.
New high-level programming environments that simplify portable parallel programming and more accurately reflect the underlying parallel architectures are topics of current research [Cen]. Another approach, more popular in the commercial sector, is to extend MPI and OpenMP. In the mid-1990s, the MPI Forum defined an extended MPI called MPI 2.0, although implementations were not widely available at the time this was written. It is a large, complex extension to MPI that includes dynamic process creation, parallel I/O, and many other features. Of particular interest to programmers of modern hybrid architectures is the inclusion of one-sided communication. One-sided communication mimics some of the features of a shared-memory system by letting one process write into or read from the memory regions of other processes. The term "one-sided" refers to the fact that the read or write is launched by the initiating process without the explicit involvement of the other participating process. A more sophisticated abstraction of one-sided communication is available as part of the Global Arrays [NHL96, NHK+02, Gloa] package. Global Arrays works together with MPI to help a programmer manage distributed array data. After the programmer defines the array and how it is laid out in memory, the program executes "puts" or "gets" into the array without needing to explicitly manage which MPI process "owns" the particular section of the array. In essence, the global array provides an abstraction of a globally shared array. This only works for arrays, but arrays are such common data structures in parallel computing that this package, although limited, can be very useful.
Just as MPI has been extended to mimic some of the benefits of a shared-memory environment, OpenMP has been extended to run in distributed-memory environments. The annual WOMPAT (Workshop on OpenMP Applications and Tools) workshops contain many papers discussing various approaches and experiences with OpenMP in clusters and ccNUMA environments.
MPI is implemented as a library of routines to be called from programs written in a sequential programming language, whereas OpenMP is a set of extensions to sequential programming languages. They represent two of the possible categories of parallel programming environments (libraries and language extensions), and these two particular environments account for the overwhelming majority of parallel computing being done today. There is, however, one more category of parallel programming environments, namely languages with built-in features to support parallel programming. Java is such a language. Rather than being designed to support high-performance computing, Java is an object-oriented, general-purpose programming environment with features for explicitly specifying concurrent processing with shared memory. In addition, the standard I/O and network packages provide classes that make it easy for Java to perform interprocess communication between machines, thus making it possible to write programs based on both the shared-memory and the distributed-memory models. The newer java.nio packages support I/O in a way that is less convenient for the programmer, but gives significantly better performance, and Java 2 1.5 includes new support for concurrent programming, most significantly in the java.util.concurrent.* packages. Additional packages that support different approaches to parallel computing are widely available.
Although there have been other general-purpose languages, both prior to Java and more recent (for example, C#), that contained constructs for specifying concurrency, Java is the first to become widely used. As a result, it may be the first exposure for many programmers to concurrent and parallel programming. Although Java provides software engineering benefits, currently the performance of parallel Java programs cannot compete with OpenMP or MPI programs for typical scientific computing applications. The Java design has also been criticized for several deficiencies that matter in this domain (for example, a floating-point model that emphasizes portability and more reproducible results over exploiting the available floating-point hardware to the fullest, inefficient handling of arrays, and lack of a lightweight mechanism to handle complex numbers). The performance difference between Java and other alternatives can be expected to decrease, especially for symbolic or other nonnumeric problems, as compiler technology for Java improves and as new packages and language extensions become available. The Titanium project [Tita] is an example of a Java dialect designed for high-performance computing in a ccNUMA environment.
For the purposes of this book, we have chosen OpenMP, MPI, and Java as the three environments we will use in our examples: OpenMP and MPI for their popularity, and Java because it is likely to be many programmers' first exposure to concurrent programming. A brief overview of each can be found in the appendixes.
2.4. THE JARGON OF PARALLEL COMPUTING

In this section, we define some terms that are frequently used throughout the pattern language. Additional definitions can be found in the glossary.
Task. The first step in designing a parallel program is to break the problem up into tasks. A task is a sequence of instructions that operate together as a group. This group corresponds to some logical part of an algorithm or program. For example, consider the multiplication of two order-N matrices. Depending on how we construct the algorithm, the tasks could be (1) the multiplication of subblocks of the matrices, (2) inner products between rows and columns of the matrices, or (3) individual iterations of the loops involved in the matrix multiplication. These are all legitimate ways to define tasks for matrix multiplication; that is, the task definition follows from the way the algorithm designer thinks about the problem.
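To make the three alternatives concrete, here is the familiar triple loop for matrix multiplication (a sketch of ours, assuming row-major storage), annotated with where each of the task definitions above lives:

    /* Multiply N x N matrices: C = A * B. */
    void matmul(int n, const double *A, const double *B, double *C)
    {
        /* (3) One task per iteration of the loops below. */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                /* (2) One task per inner product of row i and column j. */
                double dot = 0.0;
                for (int k = 0; k < n; k++)
                    dot += A[i*n + k] * B[k*n + j];
                C[i*n + j] = dot;
            }
        /* (1) Alternatively, partition A, B, and C into subblocks and
           treat each subblock multiplication as a task. */
    }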
Unit of execution (UE). To be executed, a task needs to be mapped to a UE such as a process or thread. A process is a collection of resources that enables the execution of program instructions. These resources can include virtual memory, I/O descriptors, a runtime stack, signal handlers, user and group IDs, and access control tokens. A more high-level view is that a process is a "heavyweight" unit of execution with its own address space. A thread is the fundamental UE in modern operating systems. A thread is associated with a process and shares the process's environment. This makes threads lightweight (that is, a context switch between threads takes only a small amount of time). A more high-level view is that a thread is a "lightweight" UE that shares an address space with other threads.
We will use unit of execution or UE as a generic term for one of a collection of possibly concurrently executing entities, usually either processes or threads. This is convenient in the early stages of program design when the distinctions between processes and threads are less important.
Processing element (PE). We use the term processing element (PE) as a generic term for a hardware element that executes a stream of instructions. The unit of hardware considered to be a PE depends on the context. For example, some programming environments view each workstation in a cluster of SMP workstations as executing a single instruction stream; in this situation, the PE would be the workstation. A different programming environment running on the same hardware, however, might view each processor of each workstation as executing an individual instruction stream; in this case, the PE is the individual processor, and each workstation contains several PEs.
Load balance and load balancing. To execute a parallel program, the tasks must be mapped to UEs, and the UEs to PEs. How the mappings are done can have a significant impact on the overall performance of a parallel algorithm. It is crucial to avoid the situation in which a subset of the PEs is doing most of the work while others are idle. Load balance refers to how well the work is distributed among PEs. Load balancing is the process of allocating work to PEs, either statically or dynamically, so that the work is distributed as evenly as possible.
Synchronization. In a parallel program, due to the nondeterminism of task scheduling and other factors, events in the computation might not always occur in the same order. For example, in one run, a task might read variable x before another task reads variable y; in the next run with the same input, the events might occur in the opposite order. In many cases, the order in which two events occur does not matter. In other situations, the order does matter, and to ensure that the program is correct, the programmer must introduce synchronization to enforce the necessary ordering constraints. The primitives provided for this purpose in our selected environments are discussed in the Implementation Mechanisms design space (Section 6.3).
Synchronous versus asynchronous. We use these two terms to qualitatively refer to how tightly coupled in time two events are. If two events must happen at the same time, they are synchronous; otherwise they are asynchronous. For example, message passing (that is, communication between UEs by sending and receiving messages) is synchronous if a message sent must be received before the sender can continue. Message passing is asynchronous if the sender can continue its computation regardless of what happens at the receiver, or if the receiver can continue computations while waiting for a receive to complete.
Race conditions. A race condition is a kind of error peculiar to parallel programs. It occurs when the outcome of a program changes as the relative scheduling of UEs varies. Because the operating system and not the programmer controls the scheduling of the UEs, race conditions result in programs that potentially give different answers even when run on the same system with the same data. Race conditions are particularly difficult errors to debug because by their nature they cannot be reliably reproduced. Testing helps, but is not as effective as with sequential programs: A program may run correctly the first thousand times and then fail catastrophically on the thousand-and-first execution, and then run again correctly when the programmer attempts to reproduce the error as the first step in debugging.
Race conditions result from errors in synchronization. If multiple UEs read and write shared variables, the programmer must protect access to these shared variables so the reads and writes occur in a valid order regardless of how the tasks are interleaved. When many variables are shared or when they are accessed through multiple levels of indirection, verifying by inspection that no race conditions exist can be very difficult. Tools are available that help detect and fix race conditions, such as Thread Checker from Intel Corporation, and the problem remains an area of active and important research [NM92].
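As a minimal illustration (a sketch of ours in OpenMP C, not taken from the text), the following loop contains exactly this kind of race on a shared counter:

    #include <stdio.h>

    int main(void)
    {
        int counter = 0;

        /* BUG (deliberate): counter++ is a read-modify-write, so two
           threads can read the same old value and one update is lost.
           The final result may be less than 1000000 and can vary from
           run to run. */
        #pragma omp parallel for
        for (int i = 0; i < 1000000; i++)
            counter++;

        /* A fix is to protect the update, for example by placing
           "#pragma omp atomic" immediately before counter++. */
        printf("counter = %d (expected 1000000)\n", counter);
        return 0;
    }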
Deadlocks. Deadlocks are another type of error peculiar to parallel programs. A deadlock occurs when there is a cycle of tasks in which each task is blocked waiting for another to proceed. Because all are waiting for another task to do something, they will all be blocked forever. As a simple example, consider two tasks in a message-passing environment. Task A attempts to receive a message from task B, after which A will reply by sending a message of its own to task B. Meanwhile, task B attempts to receive a message from task A, after which B will send a message to A. Because each task is waiting for the other to send it a message first, both tasks will be blocked forever. Fortunately, deadlocks are not difficult to discover, as the tasks will stop at the point of the deadlock.
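The two-task example reads as follows in MPI (a sketch of ours for exactly two processes; the variable names are illustrative):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, in, out;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        out = rank;

        /* DEADLOCK (deliberate): both processes block in MPI_Recv,
           each waiting for the other to send first, so neither ever
           reaches its MPI_Send. */
        int other = 1 - rank;    /* partner rank: 0 <-> 1 */
        MPI_Recv(&in, 1, MPI_INT, other, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&out, 1, MPI_INT, other, 0, MPI_COMM_WORLD);

        /* One fix: have one rank send before it receives, or use
           MPI_Sendrecv, which pairs the two operations safely. */
        MPI_Finalize();
        return 0;
    }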
2.5. A QUANTITATIVE LOOK AT PARALLEL COMPUTATION

The two main reasons for implementing a parallel program are to obtain better performance and to solve larger problems. Performance can be both modeled and measured, so in this section we will take another look at parallel computations by giving some simple analytical models that illustrate some of the factors that influence the performance of a parallel program.
Consider a computation consisting of three parts: a setup section, a computation section, and a finalization section. The total running time of this program on one PE is then given as the sum of the times for the three parts.

Equation 2.1

    T_{total}(1) = T_{setup} + T_{compute}(1) + T_{finalization}
What happens when we run this computation on a parallel computer with multiple PEs? Suppose that the setup and finalization sections cannot be carried out concurrently with any other activities, but that the computation section could be divided into tasks that would run independently on as many PEs as are available, with the same total number of computation steps as in the original computation. The time for the full computation on P PEs can therefore be given by

Equation 2.2

    T_{total}(P) = T_{setup} + \frac{T_{compute}(1)}{P} + T_{finalization}

Of course, Eq. 2.2 describes a very idealized situation. However, the idea that computations have a serial part (for which additional PEs are useless) and a parallelizable part (for which more PEs decrease the running time) is realistic. Thus, this simple model captures an important relationship.
An important measure of how much additional PEs help is the relative speedup S, which describes how much faster a problem runs in a way that normalizes away the actual running time.

Equation 2.3

    S(P) = \frac{T_{total}(1)}{T_{total}(P)}
A related measure is the efficiency E, which is the speedup normalized by the number of PEs.

Equation 2.4

    E(P) = \frac{S(P)}{P}

Equation 2.5

    E(P) = \frac{T_{total}(1)}{P \, T_{total}(P)}
Ideally, we would want the speedup to be equal to P, the number of PEs. This is sometimes called perfect linear speedup. Unfortunately, this is an ideal that can rarely be achieved because times for setup and finalization are not improved by adding more PEs, limiting the speedup. The terms that cannot be run concurrently are called the serial terms. Their running times represent some fraction of the total, called the serial fraction, denoted γ.

Equation 2.6

    \gamma = \frac{T_{setup} + T_{finalization}}{T_{total}(1)}
The fraction of time spent in the parallelizable part of the program is then (1 - γ). We can thus rewrite the expression for total computation time with P PEs as

Equation 2.7

    T_{total}(P) = \left( \gamma + \frac{1 - \gamma}{P} \right) T_{total}(1)
Now, rewriting S in terms of the new expression for T_total(P), we obtain the famous Amdahl's law:

Equation 2.8

    S(P) = \frac{T_{total}(1)}{\left( \gamma + \frac{1 - \gamma}{P} \right) T_{total}(1)}

Equation 2.9

    S(P) = \frac{1}{\gamma + \frac{1 - \gamma}{P}}
Thus, in an ideal parallel algorithm with no overhead in the parallel part, the speedup should follow Eq. 2.9. What happens to the speedup if we take our ideal parallel algorithm and use a very large number of processors? Taking the limit as P goes to infinity in our expression for S yields

Equation 2.10

    \lim_{P \to \infty} S(P) = \frac{1}{\gamma}
Eq. 2.10 thus gives an upper bound on the speedup obtainable in an algorithm whose serial part represents γ of the total computation. These concepts are vital to the parallel algorithm designer. In designing a parallel algorithm, it is important to understand the value of the serial fraction so that realistic expectations can be set for performance. It may not make sense to implement a complex, arbitrarily scalable parallel algorithm if 10% or more of the algorithm is serial, and 10% is fairly common.
Of course, Amdahl's law is based on assumptions that may or may not be true in practice. In real life, a number of factors may make the actual running time longer than this formula implies. For example, creating additional parallel tasks may increase overhead and the chances of contention for shared resources. On the other hand, if the original serial computation is limited by resources other than the availability of CPU cycles, the actual performance could be much better than Amdahl's law would predict. For example, a large parallel machine may allow bigger problems to be held in memory, thus reducing virtual memory paging, or multiple processors, each with its own cache, may allow much more of the problem to remain in the cache. Amdahl's law also rests on the assumption that for any given input, the parallel and serial implementations perform exactly the same number of computational steps. If the serial algorithm being used in the formula is not the best possible algorithm for the problem, then a clever parallel algorithm that structures the computation differently can reduce the total number of computational steps.
It has also been observed [Gus88] that the exercise underlying Amdahl's law, namely running exactly the same problem with varying numbers of processors, is artificial in some circumstances. If, say, the parallel application were a weather simulation, then when new processors were added, one would most likely increase the problem size by adding more details to the model while keeping the total execution time constant. If this is the case, then Amdahl's law, or fixed-size speedup, gives a pessimistic view of the benefits of additional processors.
To see this, we can reformulate the equation to give the speedup in terms of performance on a P-processor system. Earlier, in Eq. 2.2, we obtained the execution time for P processors, T_total(P), from the execution time of the serial terms and the execution time of the parallelizable part when executed on one processor. Here, we do the opposite and obtain T_total(1) from the serial and parallel terms when executed on P processors.

Equation 2.11

    T_{total}(1) = T_{setup} + P \, T_{compute}(P) + T_{finalization}
Now, we define the so-called scaled serial fraction, denoted γ_scaled, as

Equation 2.12

    \gamma_{scaled} = \frac{T_{setup} + T_{finalization}}{T_{total}(P)}
and then

Equation 2.13

    T_{total}(1) = \gamma_{scaled} \, T_{total}(P) + P (1 - \gamma_{scaled}) \, T_{total}(P)
Rewriting the equation for speedup (Eq. 2.3) and simplifying, we obtain the scaled (or fixed-time) speedup.[1]

[1] This equation, sometimes known as Gustafson's law, was attributed in [Gus88] to E. Barsis.

Equation 2.14

    S(P) = \gamma_{scaled} + P (1 - \gamma_{scaled})
This gives exactly the same speedup as Amdahl's law, but allows a different question to be asked when the number of processors is increased. Since γ_scaled depends on P, the result of taking the limit isn't immediately obvious, but would give the same result as the limit in Amdahl's law. However, suppose we take the limit in P while holding T_compute and thus γ_scaled constant. The interpretation is that we are increasing the size of the problem so that the total running time remains constant when more processors are added. (This contains the implicit assumption that the execution time of the serial terms does not change as the problem size grows.) In this case, the speedup is linear in P. Thus, while adding more processors to solve a fixed problem may hit the speedup limits of Amdahl's law with a relatively small number of processors, if the problem grows as more processors are added, Amdahl's law will be pessimistic. These two models of speedup, along with a fixed-memory version of speedup, are discussed in [SN90].
2.6. COMMUNICATION
2.6.1. Latency and Bandwidth

A simple but useful model characterizes the total time for message transfer as the sum of a fixed cost plus a variable cost that depends on the length of the message.

Equation 2.15

    T_{message\text{-}transfer}(N) = \alpha + \frac{N}{\beta}

The fixed cost α is called latency and is essentially the time it takes to send an empty message over the communication medium, from the time the send routine is called to the time the data is received by the recipient. Latency (given in some appropriate time unit) includes overhead due to software and network hardware plus the time it takes for the message to traverse the communication medium. The bandwidth β (given in some measure of bytes per time unit) is a measure of the capacity of the communication medium. N is the length of the message.
The latency and bandwidth can vary significantly between systems depending on both the hardware used and the quality of the software implementing the communication protocols. Because these values can be measured with fairly simple benchmarks [DD97], it is sometimes worthwhile to measure values for α and β, as these can help guide optimizations to improve communication performance. For example, in a system in which α is relatively large, it might be worthwhile to try to restructure a program that sends many small messages to aggregate the communication into a few large messages instead. Data for several recent systems has been presented in [BBC+03].
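To see why aggregation helps, take hypothetical values (ours, for illustration only) of α = 50 µs and β = 10^8 bytes/s. Sending one megabyte as 1000 separate 1 KB messages versus a single 1 MB message gives

    T_{1000 \times 1\,\mathrm{KB}} = 1000 \left( 50\,\mu\mathrm{s} + \frac{10^{3}}{10^{8}}\,\mathrm{s} \right) = 1000 \times 60\,\mu\mathrm{s} = 60\,\mathrm{ms}

    T_{1 \times 1\,\mathrm{MB}} = 50\,\mu\mathrm{s} + \frac{10^{6}}{10^{8}}\,\mathrm{s} \approx 10.05\,\mathrm{ms}

so the single large message pays the latency cost once rather than 1000 times and is roughly six times faster under these assumed values.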
2.6.2. Overlapping Communication and Computation and Latency Hiding

If we look more closely at the computation time within a single task on a single processor, it can roughly be decomposed into computation time, communication time, and idle time. The communication time is the time spent sending and receiving messages (and thus only applies to distributed-memory machines), whereas the idle time is time that no work is being done because the task is waiting for an event, such as the release of a resource held by another task.
A common situation in which a task may be idle is when it is waiting for a message to be transmitted through the system. This can occur when sending a message (as the UE waits for a reply before proceeding) or when receiving a message. Sometimes it is possible to eliminate this wait by restructuring the task to send the message and/or post the receive (that is, indicate that it wants to receive a message) and then continue the computation. This allows the programmer to overlap communication and computation. We show an example of this technique in Fig. 2.7, and a code sketch follows the figure caption below. This style of message passing is more complicated for the programmer, because the programmer must take care to wait for the receive to complete after any work that can be overlapped with communication is completed.
Figure 2.7. Communication without (left) and with (right) support for overlapping communication and computation. Although UE 0 in the computation on the right still has some idle time waiting for the reply from UE 1, the idle time is reduced and the computation requires less total time because of UE 1's earlier start.
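In MPI, this restructuring uses the nonblocking operations; the following sketch (our illustration, with hypothetical argument names) posts a receive, computes on independent data while the message is in flight, and only then waits:

    #include <mpi.h>

    /* Post a nonblocking receive, overlap it with independent work,
       then wait for the receive before using the received data. */
    void exchange_and_work(double *incoming, int count, int neighbor,
                           double *local, int n)
    {
        MPI_Request req;

        MPI_Irecv(incoming, count, MPI_DOUBLE, neighbor, 0,
                  MPI_COMM_WORLD, &req);

        /* ... computation that does not touch incoming ... */
        for (int i = 0; i < n; i++)
            local[i] *= 2.0;        /* placeholder work */

        /* The receive must complete before incoming is read. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }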
Another technique used on many parallel computers is to assign multiple UEs to each PE, so that when one UE is waiting for communication, it will be possible to context switch to another UE and keep the processor busy. This is an example of latency hiding. It is increasingly being used on modern high-performance computing systems, the most famous example being the MTA system from Cray Research [ACC+90].
2.7. SUMMARY
This chapter has given a brief overview of some of the concepts and vocabulary used in parallel computing. Additional terms are defined in the glossary. We also discussed the major programming environments in use for parallel computing: OpenMP, MPI, and Java. Throughout the book, we will use these three programming environments for our examples. More details about OpenMP, MPI, and Java and how to use them to write parallel programs are provided in the appendixes.
Chapter 3. The Finding Concurrency Design Space
3.1 ABOUT THE DESIGN SPACE
3.2 THE TASK DECOMPOSITION PATTERN
3.3 THE DATA DECOMPOSITION PATTERN
3.4 THE GROUP TASKS PATTERN
3.5 THE ORDER TASKS PATTERN
3.6 THE DATA SHARING PATTERN
3.7 THE DESIGN EVALUATION PATTERN
3.8 SUMMARY
3.1. ABOUT THE DESIGN SPACE
The software designer works in a number of domains. The design process starts in the problem domain with design elements directly relevant to the problem being solved (for example, fluid flows, decision trees, atoms, etc.). The ultimate aim of the design is software, so at some point, the design elements change into ones relevant to a program (for example, data structures and software modules). We call this the program domain. Although it is often tempting to move into the program domain as soon as possible, a designer who moves out of the problem domain too soon may miss valuable design options.
This is particularly relevant in parallel programming. Parallel programs attempt to solve bigger problems in less time by simultaneously solving different parts of the problem on different processing elements. This can only work, however, if the problem contains exploitable concurrency, that is, multiple activities or tasks that can execute at the same time. After a problem has been mapped onto the program domain, however, it can be difficult to see opportunities to exploit concurrency.
Hence, programmers should start their design of a parallel solution by analyzing the problem within the problem domain to expose exploitable concurrency. We call the design space in which this analysis is carried out the Finding Concurrency design space. The patterns in this design space will help identify and analyze the exploitable concurrency in a problem. After this is done, one or more patterns from the Algorithm Structure space can be chosen to help design the appropriate algorithm structure to exploit the identified concurrency.
An overview of this design space and its place in the pattern language is shown in Fig. 3.1.
Figure 3.1. Overview of the Finding Concurrency design space and its place in the pattern language
Experienced designers working in a familiar domain may see the exploitable concurrency immediately and could move directly to the patterns in the Algorithm Structure design space.
3.1.1. Overview
Before starting to work with the patterns in this design space, the algorithm designer must first consider the problem to be solved and make sure the effort to create a parallel program will be justified: Is the problem large enough and the results significant enough to justify expending effort to solve it faster? If so, the next step is to make sure the key features and data elements within the problem are well understood. Finally, the designer needs to understand which parts of the problem are most computationally intensive, because the effort to parallelize the problem should be focused on those parts.
After this analysis is complete, the patterns in the Finding Concurrency design space can be used to start designing a parallel algorithm. The patterns in this design space can be organized into three groups.
Decomposition Patterns. The two decomposition patterns, Task Decomposition and Data Decomposition, are used to decompose the problem into pieces that can execute concurrently.
Dependency Analysis Patterns. This group contains three patterns that help group the tasks and analyze the dependencies among them: Group Tasks, Order Tasks, and Data Sharing. Nominally, the patterns are applied in this order. In practice, however, it is often necessary to work back and forth between them, or possibly even revisit the decomposition patterns.
Design Evaluation Pattern. The final pattern in this space guides the algorithm designer through an analysis of what has been done so far before moving on to the patterns in the Algorithm Structure design space. This pattern is important because it often happens that the best design is not found on the first attempt, and the earlier design flaws are identified, the easier they are to correct. In general, working through the patterns in this space is an iterative process.
3.1.2. Using the Decomposition Patterns
The first step in designing a parallel algorithm is to decompose the problem into elements that can execute concurrently. We can think of this decomposition as occurring in two dimensions.
The task-decomposition dimension views the problem as a stream of instructions that can be broken into sequences called tasks that can execute simultaneously. For the computation to be efficient, the operations that make up the task should be largely independent of the operations taking place inside other tasks.
The data-decomposition dimension focuses on the data required by the tasks and how it can be decomposed into distinct chunks. The computation associated with the data chunks will only be efficient if the data chunks can be operated upon relatively independently.
Viewing the problem decomposition in terms of two distinct dimensions is somewhat artificial. A task decomposition implies a data decomposition and vice versa; hence, the two decompositions are really different facets of the same fundamental decomposition. We divide them into separate dimensions, however, because a problem decomposition usually proceeds most naturally by emphasizing one dimension of the decomposition over the other. By making them distinct, we make this design emphasis explicit and easier for the designer to understand.
3.1.3. Background for Examples
In this section, we give background information on some of the examples that are used in several patterns. It can be skipped for the time being and revisited later when reading a pattern that refers to one of the examples.
Medical imaging
PET (Positron Emission Tomography) scans provide an important diagnostic tool by allowing physicians to observe how a radioactive substance propagates through a patient's body. Unfortunately, the images formed from the distribution of emitted radiation are of low resolution, due in part to the scattering of the radiation as it passes through the body. It is also difficult to reason from the absolute radiation intensities, because different pathways through the body attenuate the radiation differently.
To solve this problem, models of how radiation propagates through the body are used to correct the images. A common approach is to build a Monte Carlo model, as described by Ljungberg and King [LK98]. Randomly selected points within the body are assumed to emit radiation (usually a gamma ray), and the trajectory of each ray is followed. As a particle (ray) passes through the body, it is attenuated by the different organs it traverses, continuing until the particle leaves the body and hits a camera model, thereby defining a full trajectory. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.
This problem can be parallelized in two ways. Because each trajectory is independent, it is possible to parallelize the application by associating each trajectory with a task. This approach is discussed in the Examples section of the Task Decomposition pattern. Another approach would be to partition the body into sections and assign different sections to different processing elements. This approach is discussed in the Examples section of the Data Decomposition pattern.
Linear algebra
Linear algebra is an important tool in applied mathematics: It provides the machinery required to analyze solutions of large systems of linear equations. The classic linear algebra problem asks, for matrix A and vector b, what values for x will solve the equation
Equation 3.1
$A \cdot x = b$
The matrix A in Eq. 3.1 takes on a central role in linear algebra. Many problems are expressed in terms of transformations of this matrix. These transformations are applied by means of a matrix multiplication
Equation 3.2
$C = T \cdot A$
If T, A, and C are square matrices of order N, matrix multiplication is defined such that each element of the resulting matrix C is
Equation 3.3
$C_{i,j} = \sum_{k=1}^{N} T_{i,k} \cdot A_{k,j}$
where the subscripts denote particular elements of the matrices. In other words, the element of the product matrix C in row i and column j is the dot product of the ith row of T and the jth column of A. Hence, computing each of the N² elements of C requires N multiplications and N − 1 additions, making the overall complexity of matrix multiplication O(N³).
There are many ways to parallelize a matrix multiplication operation. It can be parallelized using either a task-based decomposition (as discussed in the Examples section of the Task Decomposition pattern) or a data-based decomposition (as discussed in the Examples section of the Data Decomposition pattern).
Molecular dynamics
Molecular dynamics is used to simulate the motions of a large molecular system. For example, molecular dynamics simulations show how a large protein moves around and how differently shaped drugs might interact with the protein. Not surprisingly, molecular dynamics is extremely important in the pharmaceutical industry. It is also a useful test problem for computer scientists working on parallel computing: It is straightforward to understand, relevant to science at large, and difficult to parallelize effectively. As a result, it has been the subject of much research [Mat94, PH95, Pli95].
The basic idea is to treat a molecule as a large collection of balls connected by springs. The balls represent the atoms in the molecule, while the springs represent the chemical bonds between the atoms. The molecular dynamics simulation itself is an explicit time-stepping process. At each time step, the force on each atom is computed and then standard classical mechanics techniques are used to compute how the force moves the atoms. This process is carried out repeatedly to step through time and compute a trajectory for the molecular system.
The forces due to the chemical bonds (the "springs") are relatively simple to compute. These correspond to the vibrations and rotations of the chemical bonds themselves. These are short-range forces that can be computed with knowledge of the handful of atoms that share chemical bonds. The major difficulty arises because the atoms have partial electrical charges. Hence, while atoms only interact with a small neighborhood of atoms through their chemical bonds, the electrical charges cause every atom to apply a force to every other atom.
This is the famous N-body problem. On the order of N² terms must be computed to find these non-bonded forces. Because N is large (tens or hundreds of thousands) and the number of time steps in a simulation is huge (tens of thousands), the time required to compute these non-bonded forces dominates the computation. Several ways have been proposed to reduce the effort required to solve the N-body problem. We are only going to discuss the simplest one: the cutoff method.
The idea is simple. Even though each atom exerts a force on every other atom, this force decreases with the square of the distance between the atoms. Hence, it should be possible to pick a distance beyond which the force contribution is so small that it can be ignored. By ignoring the atoms that exceed this cutoff, the problem is reduced to one that scales as O(N × n), where n is the number of atoms within the cutoff volume, usually hundreds. The computation is still huge, and it dominates the overall runtime for the simulation, but at least the problem is tractable.
There are a host of details, but the basic simulation can be summarized as in Fig. 3.2.
The primary data structures hold the atomic positions (atoms), the velocities of each atom (velocities), the forces exerted on each atom (forces), and lists of atoms within the cutoff distance of each atom (neighbors). The program itself is a time-stepping loop, in which each iteration computes the short-range force terms, updates the neighbor lists, and then finds the non-bonded forces. After the force on each atom has been computed, a simple ordinary differential equation is solved to update the positions and velocities. Physical properties based on atomic motions are then updated, and we go to the next time step.
There are many ways to parallelize the molecular dynamics problem. We consider the most common approach, starting with the task decomposition (discussed in the Task Decomposition pattern) and following with the associated data decomposition (discussed in the Data Decomposition pattern). This example shows how the two decompositions fit together to guide the design of the parallel algorithm.
Figure 3.2. Pseudocode for the molecular dynamics example
Int const N // number of atoms
Array of Real :: atoms (3,N)      //3D coordinates
Array of Real :: velocities (3,N) //velocity vector
Array of Real :: forces (3,N)     //force in each dimension
Array of List :: neighbors(N)     //atoms in cutoff volume

loop over time steps
   vibrational_forces (N, atoms, forces)
   rotational_forces (N, atoms, forces)
   neighbor_list (N, atoms, neighbors)
   non_bonded_forces (N, atoms, neighbors, forces)
   update_atom_positions_and_velocities(
          N, atoms, velocities, forces)
   physical_properties ( ... Lots of stuff ... )
end loop
3.2. THE TASK DECOMPOSITION PATTERN
Problem
How can a problem be decomposed into tasks that can execute concurrently?
Context
Every parallel algorithm design starts from the same point, namely a good understanding of the problem being solved. The programmer must understand which are the computationally intensive parts of the problem, the key data structures, and how the data is used as the problem's solution unfolds.
The next step is to define the tasks that make up the problem and the data decomposition implied by the tasks. Fundamentally, every parallel algorithm involves a collection of tasks that can execute concurrently. The challenge is to find these tasks and craft an algorithm that lets them run concurrently.
In some cases, the problem will naturally break down into a collection of independent (or nearly independent) tasks, and it is easiest to start with a task-based decomposition. In other cases, the tasks are difficult to isolate and the decomposition of the data (as discussed in the Data Decomposition pattern) is a better starting point. It is not always clear which approach is best, and often the algorithm designer needs to consider both.
Regardless of whether the starting point is a task-based or a data-based decomposition, however, a parallel algorithm ultimately needs tasks that will execute concurrently, so these tasks must be identified.
Forces
The main forces influencing the design at this point are flexibility, efficiency, and simplicity.
Flexibility. Flexibility in the design will allow it to be adapted to different implementation requirements. For example, it is usually not a good idea to narrow the options to a single computer system or style of programming at this stage of the design.
Efficiency. A parallel program is only useful if it scales efficiently with the size of the parallel computer (in terms of reduced runtime and/or memory utilization). For a task decomposition, this means we need enough tasks to keep all the PEs busy, with enough work per task to compensate for overhead incurred to manage dependencies. However, the drive for efficiency can lead to complex decompositions that lack flexibility.
Simplicity. The task decomposition needs to be complex enough to get the job done, but simple enough to let the program be debugged and maintained with reasonable effort.
Solution
The key to an effective task decomposition is to ensure that the tasks are sufficiently independent so that managing dependencies takes only a small fraction of the program's overall execution time. It is also important to ensure that the execution of the tasks can be evenly distributed among the ensemble of PEs (the load-balancing problem).
In an ideal world, the compiler would find the tasks for the programmer. Unfortunately, this almost never happens. Instead, it must usually be done by hand based on knowledge of the problem and the code required to solve it. In some cases, it might be necessary to completely recast the problem into a form that exposes relatively independent tasks.
In a task-based decomposition, we look at the problem as a collection of distinct tasks, paying particular attention to
The actions that are carried out to solve the problem. (Are there enough of them to keep the processing elements on the target machines busy?)
Whether these actions are distinct and relatively independent.
As a first pass, we try to identify as many tasks as possible; it is much easier to start with too many tasks and merge them later on than to start with too few tasks and later try to split them.
Tasks can be found in many different places.
In some cases, each task corresponds to a distinct call to a function. Defining a task for each function call leads to what is sometimes called a functional decomposition.
Another place to find tasks is in distinct iterations of the loops within an algorithm. If the iterations are independent and there are enough of them, then it might work well to base a task decomposition on mapping each iteration onto a task. This style of task-based decomposition leads to what are sometimes called loop-splitting algorithms (see the sketch after this list).
Tasks also play a key role in data-driven decompositions. In this case, a large data structure is decomposed and multiple units of execution concurrently update different chunks of the data structure; the tasks are then those updates on individual chunks.
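As an illustration of loop splitting, the following minimal sketch in C with OpenMP, one of the book's three programming environments, treats each iteration of an independent loop as a task and lets the runtime map iterations onto UEs (the array and function names are illustrative, not from the text):

#include <omp.h>

void scale_all(int n, double *data, double factor) {
    /* Each iteration is independent, so each can be treated as a
     * task; OpenMP splits the iterations among the threads. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        data[i] *= factor;
    }
}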
Also keep in mind the forces given in the Forces section:
Flexibility. The design needs to be flexible in the number of tasks generated. Usually this is done by parameterizing the number and size of tasks on some appropriate dimension. This will let the design be adapted to a wide range of parallel computers with different numbers of processors.
Efficiency. There are two major efficiency issues to consider in the task decomposition. First, each task must include enough work to compensate for the overhead incurred by creating the tasks and managing their dependencies. Second, the number of tasks should be large enough so that all the units of execution are busy with useful work throughout the computation.
Simplicity. Tasks should be defined in a way that makes debugging and maintenance simple. When possible, tasks should be defined so they reuse code from existing sequential programs that solve related problems.
After the tasks have been identified, the next step is to look at the data decomposition implied by the tasks. The Data Decomposition pattern may help with this analysis.
Examples
Medical imaging
Consider the medical imaging problem described in Sec. 3.1.3. In this application, a point inside a model of the body is selected randomly, a radioactive decay is allowed to occur at this point, and the trajectory of the emitted particle is followed. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.
It is natural to associate a task with each trajectory. These tasks are particularly simple to manage concurrently because they are completely independent. Furthermore, there are large numbers of trajectories, so there will be many tasks, making this decomposition suitable for a large range of computer systems, from a shared-memory system with a small number of processing elements to a large cluster with hundreds of processing elements.
With the basic tasks defined, we now consider the corresponding data decomposition; that is, we define the data associated with each task. Each task needs to hold the information defining the trajectory. But that is not all: The tasks need access to the model of the body as well. Although it might not be apparent from our description of the problem, the body model can be extremely large. Because it is a read-only model, this is no problem if there is an effective shared-memory system; each task can read data as needed. If the target platform is based on a distributed-memory architecture, however, the body model will need to be replicated on each PE. This can be very time-consuming and can waste a great deal of memory. For systems with small memories per PE and/or with slow networks between PEs, a decomposition of the problem based on the body model might be more effective.
This is a common situation in parallel programming: Many problems can be decomposed primarily in terms of data or primarily in terms of tasks. If a task-based decomposition avoids the need to break up and distribute complex data structures, it will be a much simpler program to write and debug. On the other hand, if memory and/or network bandwidth is a limiting factor, a decomposition that focuses on the data might be more effective. It is not so much a matter of one approach being "better" than another as a matter of balancing the needs of the machine with the needs of the programmer. We discuss this in more detail in the Data Decomposition pattern.
Matrix multiplication
Consider the multiplication of two matrices (C = A · B), as described in Sec. 3.1.3. We can produce a task-based decomposition of this problem by considering the calculation of each element of the product matrix as a separate task. Each task needs access to one row of A and one column of B. This decomposition has the advantage that all the tasks are independent, and because all the data that is shared among tasks (A and B) is read-only, it will be straightforward to implement in a shared-memory environment.
The performance of this algorithm, however, would be poor. Consider the case where the three matrices are square and of order N. For each element of C, N elements from A and N elements from B would be required, resulting in 2N memory references for N multiply/add operations. Memory access time is slow compared to floating-point arithmetic, so the bandwidth of the memory subsystem would limit the performance.
A better approach would be to design an algorithm that maximizes reuse of data loaded into a processor's caches. We can arrive at this algorithm in two different ways. First, we could group together the element-wise tasks we defined earlier so the tasks that use similar elements of the A and B matrices run on the same UE (see the Group Tasks pattern). Alternatively, we could start with the data decomposition and design the algorithm from the beginning around the way the matrices fit into the caches. We discuss this example further in the Examples section of the Data Decomposition pattern.
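One way to see the cache-reuse argument in code is a block-oriented multiplication. The following is a minimal sketch in C, not the algorithm developed in the Data Decomposition pattern; the block size BS is an assumed tuning parameter, n is taken to be a multiple of BS for brevity, and C is assumed to start zeroed:

#define BS 64  /* block size; tune to the cache (assumed value) */

/* Compute C += A * B for square matrices of order n (row-major),
 * with n assumed to be a multiple of BS. */
void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += BS)
      for (int jj = 0; jj < n; jj += BS)
        for (int kk = 0; kk < n; kk += BS)
          /* One block update: reuses a BS x BS block of A and B
           * roughly BS times each while it is cache-resident. */
          for (int i = ii; i < ii + BS; i++)
            for (int j = jj; j < jj + BS; j++) {
              double sum = C[i*n + j];
              for (int k = kk; k < kk + BS; k++)
                sum += A[i*n + k] * B[k*n + j];
              C[i*n + j] = sum;
            }
}

Relative to the element-wise version, each block of A and B is loaded once and reused on the order of BS times, cutting memory traffic accordingly.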
Molecular dynamics
Consider the molecular dynamics problem described in Sec. 3.1.3. Pseudocode for this example is shown again in Fig. 3.3.
Before performing the task decomposition, we need to better understand some details of the problem. First, the neighbor_list() computation is time-consuming. The gist of the computation is a loop over each atom, inside of which every other atom is checked to determine whether it falls within the indicated cutoff volume. Fortunately, the time steps are very small, and the atoms don't move very much in any given time step. Hence, this time-consuming computation is only carried out every 10 to 100 steps.
Figure 3.3. Pseudocode for the molecular dynamics example
Int const N // number of atoms
Array of Real :: atoms (3,N)      //3D coordinates
Array of Real :: velocities (3,N) //velocity vector
Array of Real :: forces (3,N)     //force in each dimension
Array of List :: neighbors(N)     //atoms in cutoff volume

loop over time steps
   vibrational_forces (N, atoms, forces)
   rotational_forces (N, atoms, forces)
   neighbor_list (N, atoms, neighbors)
   non_bonded_forces (N, atoms, neighbors, forces)
   update_atom_positions_and_velocities(
          N, atoms, velocities, forces)
   physical_properties ( ... Lots of stuff ... )
end loop
Second, the physical_properties() function computes energies, correlation coefficients, and a host of interesting physical properties. These computations, however, are simple and do not significantly affect the program's overall runtime, so we will ignore them in this discussion.
Because the bulk of the computation time will be in non_bonded_forces(), we must pick a problem decomposition that makes that computation run efficiently in parallel. The problem is made easier by the fact that each of the functions inside the time loop has a similar structure: In the sequential version, each function includes a loop over atoms to compute contributions to the force vector. Thus, a natural task definition is the update required by each atom, which corresponds to a loop iteration in the sequential version. After performing the task decomposition, therefore, we obtain the following tasks.
Tasks that find the vibrational forces on an atom
Tasks that find the rotational forces on an atom
Tasks that find the non-bonded forces on an atom
Tasks that update the position and velocity of an atom
A task to update the neighbor list for all the atoms (which we will leave sequential)
With our collection of tasks in hand, we can consider the accompanying data decomposition. The key data structures are the neighbor list, the atomic coordinates, the atomic velocities, and the force vector. Every iteration that updates the force vector needs the coordinates of a neighborhood of atoms. The computation of non-bonded forces, however, potentially needs the coordinates of all the atoms, because the molecule being simulated might fold back on itself in unpredictable ways. We will use this information to carry out the data decomposition (in the Data Decomposition pattern) and the data-sharing analysis (in the Data Sharing pattern).
Known uses
Task-based decompositions are extremely common in parallel computing. For example, the distance geometry code DGEOM [Mat96] uses a task-based decomposition, as does the parallel WESDYN molecular dynamics program [MR95].
3.3. THE DATA DECOMPOSITION PATTERN
Problem
How can a problem's data be decomposed into units that can be operated on relatively independently?
Context
The parallel algorithm designer must have a detailed understanding of the problem being solved. In addition, the designer should identify the most computationally intensive parts of the problem, the key data structures required to solve the problem, and how data is used as the problem's solution unfolds.
After the basic problem is understood, the parallel algorithm designer should consider the tasks that make up the problem and the data decomposition implied by the tasks. Both the task and data decompositions need to be addressed to create a parallel algorithm. The question is not which decomposition to do. The question is which one to start with. A data-based decomposition is a good starting point if the following is true.
The most computationally intensive part of the problem is organized around the manipulation of a large data structure.
Similar operations are being applied to different parts of the data structure, in such a way that the different parts can be operated on relatively independently.
For example, many linear algebra problems update large matrices, applying a similar set of operations to each element of the matrix. In these cases, it is straightforward to drive the parallel algorithm design by looking at how the matrix can be broken up into blocks that are updated concurrently. The task definitions then follow from how the blocks are defined and mapped onto the processing elements of the parallel computer.
Forces
The main forces influencing the design at this point are flexibility, efficiency, and simplicity.
Flexibility. Flexibility will allow the design to be adapted to different implementation requirements. For example, it is usually not a good idea to narrow the options to a single computer system or style of programming at this stage of the design.
Efficiency. A parallel program is only useful if it scales efficiently with the size of the parallel computer (in terms of reduced runtime and/or memory utilization).
Simplicity. The decomposition needs to be complex enough to get the job done, but simple enough to let the program be debugged and maintained with reasonable effort.
Solution
In shared-memory programming environments such as OpenMP, the data decomposition will frequently be implied by the task decomposition. In most cases, however, the decomposition will need to be done by hand, because the memory is physically distributed, because data dependencies are too complex without explicitly decomposing the data, or to achieve acceptable efficiency on a NUMA computer.
If a task-based decomposition has already been done, the data decomposition is driven by the needs of each task. If well-defined and distinct data can be associated with each task, the decomposition should be simple.
When starting with a data decomposition, however, we need to look not at the tasks, but at the central data structures defining the problem, and consider whether they can be broken down into chunks that can be operated on concurrently. A few common examples include the following.
Array-based computations. Concurrency can be defined in terms of updates of different segments of the array. If the array is multidimensional, it can be decomposed in a variety of ways (rows, columns, or blocks of varying shapes); see the sketch after this list.
Recursive data structures. We can think of, for example, decomposing the parallel update of a large tree data structure by decomposing the data structure into subtrees that can be updated concurrently.
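For the array case, the decomposition often reduces to computing each UE's chunk bounds from its ID. A minimal sketch in C for a contiguous row-block scheme (the function and variable names are illustrative); it divides n rows among num_ues UEs, handling the remainder so block sizes differ by at most one:

/* Compute the half-open range [first, last) of rows owned by UE 'id'
 * when n rows are divided into contiguous blocks among num_ues UEs. */
void row_block(int n, int num_ues, int id, int *first, int *last) {
    int base = n / num_ues;   /* minimum rows per UE                */
    int rem  = n % num_ues;   /* first 'rem' UEs get one extra row  */
    *first = id * base + (id < rem ? id : rem);
    *last  = *first + base + (id < rem ? 1 : 0);
}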
Regardless of the nature of the underlying data structure, if the data decomposition is the primary factor driving the solution to the problem, it serves as the organizing principle of the parallel algorithm.
When considering how to decompose the problem's data structures, keep in mind the competing forces.
Flexibility. The size and number of data chunks should be flexible to support the widest range of parallel systems. One approach is to define chunks whose size and number are controlled by a small number of parameters. These parameters define granularity knobs that can be varied to modify the size of the data chunks to match the needs of the underlying hardware. (Note, however, that many designs are not infinitely adaptable with respect to granularity.)
The easiest place to see the impact of granularity on the data decomposition is in the overhead required to manage dependencies between chunks. The time required to manage dependencies must be small compared to the overall runtime. In a good data decomposition, the dependencies scale at a lower dimension than the computational effort associated with each chunk. For example, in many finite difference programs, the cells at the boundaries between chunks, that is, the surfaces of the chunks, must be shared. The size of the set of dependent cells scales as the surface area, while the effort required in the computation scales as the volume of the chunk. This means that the computational effort can be scaled (based on the chunk's volume) to offset overheads associated with data dependencies (based on the surface area of the chunk).
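To make the scaling concrete, consider a cubic chunk of n × n × n cells (an illustrative shape):

$$\frac{\text{dependent (surface) cells}}{\text{computed (volume) cells}} = \frac{6n^2}{n^3} = \frac{6}{n}$$

so growing the chunks makes the dependency-management overhead an ever-smaller fraction of the useful work.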
Efficiency. It is important that the data chunks be large enough that the amount of work to update the chunk offsets the overhead of managing dependencies. A more subtle issue to consider is how the chunks map onto UEs. An effective parallel algorithm must balance the load between UEs. If this isn't done well, some PEs might have a disproportionate amount of work, and the overall scalability will suffer. This may require clever ways to break up the problem. For example, if the problem clears the columns in a matrix from left to right, a column mapping of the matrix will cause problems as the UEs with the leftmost columns will finish their work before the others. A row-based block decomposition or even a block-cyclic decomposition (in which rows are assigned cyclically to PEs) would do a much better job of keeping all the processors fully occupied. These issues are discussed in more detail in the Distributed Array pattern.
Simplicity. Overly complex data decompositions can be very difficult to debug. A data decomposition will usually require a mapping of a global index space onto a task-local index space. Making this mapping abstract allows it to be easily isolated and tested.
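Such a mapping can be captured in a pair of small functions that are easy to isolate and unit-test. A minimal sketch in C for the block-cyclic row distribution mentioned under Efficiency (the function names and the block-size parameter bs are illustrative assumptions):

/* Block-cyclic distribution of global row indices over num_ues UEs:
 * rows are dealt out in blocks of 'bs' rows, round-robin. */
int owner_of(int global_row, int bs, int num_ues) {
    return (global_row / bs) % num_ues;
}

/* Local index of a global row on the UE that owns it. */
int local_index(int global_row, int bs, int num_ues) {
    int block = global_row / bs;      /* which block, globally      */
    return (block / num_ues) * bs     /* full local blocks before   */
           + global_row % bs;         /* offset within the block    */
}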
After the data has been decomposed, if it has not already been done, the next step is to look at the task decomposition implied by the data decomposition. The Task Decomposition pattern may help with this analysis.
Examples
Medical imaging
Consider the medical imaging problem described in Sec. 3.1.3. In this application, a point inside a model of the body is selected randomly, a radioactive decay is allowed to occur at this point, and the trajectory of the emitted particle is followed. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.
In a data-based decomposition of this problem, the body model is the large central data structure around which the computation can be organized. The model is broken into segments, and one or more segments are associated with each processing element. The body segments are only read, not written, during the trajectory computations, so there are no data dependencies created by the decomposition of the body model.
After the data has been decomposed, we need to look at the tasks associated with each data segment. In this case, each trajectory passing through the data segment defines a task. The trajectories are initiated and propagated within a segment. When a segment boundary is encountered, the trajectory must be passed between segments. It is this transfer that defines the dependencies between data chunks.
On the other hand, in a task-based approach to this problem (as discussed in the Task Decomposition pattern), the trajectories for each particle drive the algorithm design. Each PE potentially needs to access the full body model to service its set of trajectories. In a shared-memory environment, this is easy because the body model is a read-only data set. In a distributed-memory environment, however, this would require substantial startup overhead as the body model is broadcast across the system.
This is a common situation in parallel programming: Different points of view lead to different algorithms with potentially very different performance characteristics. The task-based algorithm is simple, but it only works if each processing element has access to a large memory and if the overhead incurred loading the data into memory is insignificant compared to the program's runtime. An algorithm driven by a data decomposition, on the other hand, makes efficient use of memory and (in distributed-memory environments) less use of network bandwidth, but it incurs more communication overhead during the concurrent part of the computation and is significantly more complex. Choosing which is the appropriate approach can be difficult and is discussed further in the Design Evaluation pattern.
Matrix multiplication
Consider the standard multiplication of two matrices