A Pattern Language for Parallel Programming, 2004


  • "Ifyoubuildit,theywillcome."

And so we built them. Multiprocessor workstations, massively parallel supercomputers, a cluster in every department ... and they haven't come. Programmers haven't come to program these wonderful machines. Oh, a few programmers in love with the challenge have shown that most types of problems can be force-fit onto parallel computers, but general programmers, especially professional programmers who "have lives", ignore parallel computers.

And they do so at their own peril. Parallel computers are going mainstream. Multithreaded microprocessors, multicore CPUs, multiprocessor PCs, clusters, parallel game consoles ... parallel computers are taking over the world of computing. The computer industry is ready to flood the market with hardware that will only run at full speed with parallel programs. But who will write these programs?

This is an old problem. Even in the early 1980s, when the "killer micros" started their assault on traditional vector supercomputers, we worried endlessly about how to attract normal programmers. We tried everything we could think of: high-level hardware abstractions, implicitly parallel programming languages, parallel language extensions, and portable message-passing libraries. But after many years of hard work, the fact of the matter is that "they" didn't come. The overwhelming majority of programmers will not invest the effort to write parallel software.

A common view is that you can't teach old programmers new tricks, so the problem will not be solved until the old programmers fade away and a new generation takes over.

But we don't buy into that defeatist attitude. Programmers have shown a remarkable ability to adopt new software technologies over the years. Look at how many old Fortran programmers are now writing elegant Java programs with sophisticated object-oriented designs. The problem isn't with old programmers. The problem is with old parallel computing experts and the way they've tried to create a pool of capable parallel programmers.

And that's where this book comes in. We want to capture the essence of how expert parallel programmers think about parallel algorithms and communicate that essential understanding in a way professional programmers can readily master. The technology we've adopted to accomplish this task is a pattern language. We made this choice not because we started the project as devotees of design patterns looking for a new field to conquer, but because patterns have been shown to work in ways that would be applicable in parallel programming. For example, patterns have been very effective in the field of object-oriented design. They have provided a common language experts can use to talk about the elements of design and have been extremely effective at helping programmers master object-oriented design.

This book contains our pattern language for parallel programming. The book opens with a couple of chapters to introduce the key concepts in parallel computing. These chapters focus on the parallel computing concepts and jargon used in the pattern language as opposed to being an exhaustive introduction to the field.

The pattern language itself is presented in four parts corresponding to the four phases of creating a parallel program:

* Finding Concurrency. The programmer works in the problem domain to identify the available concurrency and expose it for use in the algorithm design.

* Algorithm Structure. The programmer works with high-level structures for organizing a parallel algorithm.

* Supporting Structures. We shift from algorithms to source code and consider how the parallel program will be organized and the techniques used to manage shared data.

* Implementation Mechanisms. The final step is to look at specific software constructs for implementing a parallel program.

The patterns making up these four design spaces are tightly linked. You start at the top (Finding Concurrency), work through the patterns, and by the time you get to the bottom (Implementation Mechanisms), you will have a detailed design for your parallel program.

If the goal is a parallel program, however, you need more than just a parallel algorithm. You also need a programming environment and a notation for expressing the concurrency within the program's source code. Programmers used to be confronted by a large and confusing array of parallel programming environments. Fortunately, over the years the parallel programming community has converged around three programming environments.

* OpenMP. A simple language extension to C, C++, or Fortran to write parallel programs for shared-memory computers.

* MPI. A message-passing library used on clusters and other distributed-memory computers.

* Java. An object-oriented programming language with language features supporting parallel programming on shared-memory computers and standard class libraries supporting distributed computing.

Many readers will already be familiar with one or more of these programming notations, but for readers completely new to parallel computing, we've included a discussion of these programming environments in the appendixes.

In closing, we have been working for many years on this pattern language. Presenting it as a book so people can start using it is an exciting development for us. But we don't see this as the end of this effort. We expect that others will have their own ideas about new and better patterns for parallel programming. We've assuredly missed some important features that really belong in this pattern language. We embrace change and look forward to engaging with the larger parallel computing community to iterate on this language. Over time, we'll update and improve the pattern language until it truly represents the consensus view of the parallel programming community. Then our real work will begin: using the pattern language to guide the creation of better parallel programming environments and helping people to use these technologies to write parallel software. We won't rest until the day sequential software is rare.

    ACKNOWLEDGMENTS

We started working together on this pattern language in 1998. It's been a long and twisted road, starting with a vague idea about a new way to think about parallel algorithms and finishing with this book. We couldn't have done this without a great deal of help.

Mani Chandy, who thought we would make a good team, introduced Tim to Beverly and Berna. The National Science Foundation, Intel Corp., and Trinity University have supported this research at various times over the years. Help with the patterns themselves came from the people at the Pattern Languages of Programs (PLoP) workshops held in Illinois each summer. The format of these workshops and the resulting review process was challenging and sometimes difficult, but without them we would have never finished this pattern language. We would also like to thank the reviewers who carefully read early manuscripts and pointed out countless errors and ways to improve the book.

Finally, we thank our families. Writing a book is hard on the authors, but that is to be expected. What we didn't fully appreciate was how hard it would be on our families. We are grateful to Beverly's family (Daniel and Steve), Tim's family (Noah, August, and Martha), and Berna's family (Billie) for the sacrifices they've made to support this project.

Tim Mattson, Olympia, Washington, April 2004

Beverly Sanders, Gainesville, Florida, April 2004

Berna Massingill, San Antonio, Texas, April 2004

Chapter 1. A Pattern Language for Parallel Programming
    Section 1.1. INTRODUCTION
    Section 1.2. PARALLEL PROGRAMMING
    Section 1.3. DESIGN PATTERNS AND PATTERN LANGUAGES
    Section 1.4. A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING

Chapter 2. Background and Jargon of Parallel Computing
    Section 2.1. CONCURRENCY IN PARALLEL PROGRAMS VERSUS OPERATING SYSTEMS
    Section 2.2. PARALLEL ARCHITECTURES: A BRIEF INTRODUCTION
    Section 2.3. PARALLEL PROGRAMMING ENVIRONMENTS
    Section 2.4. THE JARGON OF PARALLEL COMPUTING
    Section 2.5. A QUANTITATIVE LOOK AT PARALLEL COMPUTATION
    Section 2.6. COMMUNICATION
    Section 2.7. SUMMARY

Chapter 3. The Finding Concurrency Design Space
    Section 3.1. ABOUT THE DESIGN SPACE
    Section 3.2. THE TASK DECOMPOSITION PATTERN
    Section 3.3. THE DATA DECOMPOSITION PATTERN
    Section 3.4. THE GROUP TASKS PATTERN
    Section 3.5. THE ORDER TASKS PATTERN
    Section 3.6. THE DATA SHARING PATTERN
    Section 3.7. THE DESIGN EVALUATION PATTERN
    Section 3.8. SUMMARY

Chapter 4. The Algorithm Structure Design Space
    Section 4.1. INTRODUCTION
    Section 4.2. CHOOSING AN ALGORITHM STRUCTURE PATTERN
    Section 4.3. EXAMPLES
    Section 4.4. THE TASK PARALLELISM PATTERN
    Section 4.5. THE DIVIDE AND CONQUER PATTERN
    Section 4.6. THE GEOMETRIC DECOMPOSITION PATTERN
    Section 4.7. THE RECURSIVE DATA PATTERN
    Section 4.8. THE PIPELINE PATTERN
    Section 4.9. THE EVENT-BASED COORDINATION PATTERN

Chapter 5. The Supporting Structures Design Space
    Section 5.1. INTRODUCTION
    Section 5.2. FORCES
    Section 5.3. CHOOSING THE PATTERNS
    Section 5.4. THE SPMD PATTERN
    Section 5.5. THE MASTER/WORKER PATTERN
    Section 5.6. THE LOOP PARALLELISM PATTERN
    Section 5.7. THE FORK/JOIN PATTERN
    Section 5.8. THE SHARED DATA PATTERN
    Section 5.9. THE SHARED QUEUE PATTERN
    Section 5.10. THE DISTRIBUTED ARRAY PATTERN
    Section 5.11. OTHER SUPPORTING STRUCTURES

Chapter 6. The Implementation Mechanisms Design Space
    Section 6.1. OVERVIEW
    Section 6.2. UE MANAGEMENT
    Section 6.3. SYNCHRONIZATION
    Section 6.4. COMMUNICATION
    Endnotes

Appendix A: A Brief Introduction to OpenMP
    Section A.1. CORE CONCEPTS
    Section A.2. STRUCTURED BLOCKS AND DIRECTIVE FORMATS
    Section A.3. WORKSHARING
    Section A.4. DATA ENVIRONMENT CLAUSES
    Section A.5. THE OpenMP RUNTIME LIBRARY
    Section A.6. SYNCHRONIZATION
    Section A.7. THE SCHEDULE CLAUSE
    Section A.8. THE REST OF THE LANGUAGE

Appendix B: A Brief Introduction to MPI
    Section B.1. CONCEPTS
    Section B.2. GETTING STARTED
    Section B.3. BASIC POINT-TO-POINT MESSAGE PASSING
    Section B.4. COLLECTIVE OPERATIONS
    Section B.5. ADVANCED POINT-TO-POINT MESSAGE PASSING
    Section B.6. MPI AND FORTRAN
    Section B.7. CONCLUSION

Appendix C: A Brief Introduction to Concurrent Programming in Java
    Section C.1. CREATING THREADS
    Section C.2. ATOMICITY, MEMORY SYNCHRONIZATION, AND THE volatile KEYWORD
    Section C.3. SYNCHRONIZED BLOCKS
    Section C.4. WAIT AND NOTIFY
    Section C.5. LOCKS
    Section C.6. OTHER SYNCHRONIZATION MECHANISMS AND SHARED DATA STRUCTURES
    Section C.7. INTERRUPTS

Glossary

Bibliography

About the Authors

Index


    Chapter 1. A Pattern Language for Parallel Programming

1.1 INTRODUCTION

1.2 PARALLEL PROGRAMMING

1.3 DESIGN PATTERNS AND PATTERN LANGUAGES

1.4 A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING

1.1. INTRODUCTION

Computers are used to model physical systems in many fields of science, medicine, and engineering. Modelers, whether trying to predict the weather or render a scene in the next blockbuster movie, can usually use whatever computing power is available to make ever more detailed simulations. Vast amounts of data, whether customer shopping patterns, telemetry data from space, or DNA sequences, require analysis. To deliver the required power, computer designers combine multiple processing elements into a single larger system. These so-called parallel computers run multiple tasks simultaneously and solve bigger problems in less time.

Traditionally, parallel computers were rare and available for only the most critical problems. Since the mid-1990s, however, the availability of parallel computers has changed dramatically. With multithreading support built into the latest microprocessors and the emergence of multiple processor cores on a single silicon die, parallel computers are becoming ubiquitous. Now, almost every university computer science department has at least one parallel computer. Virtually all oil companies, automobile manufacturers, drug development companies, and special effects studios use parallel computing.

For example, in computer animation, rendering is the step where information from the animation files, such as lighting, textures, and shading, is applied to 3D models to generate the 2D image that makes up a frame of the film. Parallel computing is essential to generate the needed number of frames (24 per second) for a feature-length film. Toy Story, the first completely computer-generated feature-length film, released by Pixar in 1995, was processed on a "render farm" consisting of 100 dual-processor machines [PS00]. By 1999, for Toy Story 2, Pixar was using a 1,400-processor system with the improvement in processing power fully reflected in the improved details in textures, clothing, and atmospheric effects. Monsters, Inc. (2001) used a system of 250 enterprise servers, each containing 14 processors, for a total of 3,500 processors. It is interesting that the amount of time required to generate a frame has remained relatively constant; as computing power (both the number of processors and the speed of each processor) has increased, it has been exploited to improve the quality of the animation.

The biological sciences have taken dramatic leaps forward with the availability of DNA sequence information from a variety of organisms, including humans. One approach to sequencing, championed and used with success by Celera Corp., is called the whole genome shotgun algorithm. The idea is to break the genome into small segments, experimentally determine the DNA sequences of the segments, and then use a computer to construct the entire sequence from the segments by finding overlapping areas. The computing facilities used by Celera to sequence the human genome included 150 four-way servers plus a server with 16 processors and 64 GB of memory. The calculation involved 500 million trillion base-to-base comparisons [Ein00].

The SETI@home project [SET, ACK+02] provides a fascinating example of the power of parallel computing. The project seeks evidence of extraterrestrial intelligence by scanning the sky with the world's largest radio telescope, the Arecibo Telescope in Puerto Rico. The collected data is then analyzed for candidate signals that might indicate an intelligent source. The computational task is beyond even the largest supercomputer, and certainly beyond the capabilities of the facilities available to the SETI@home project. The problem is solved with public resource computing, which turns PCs around the world into a huge parallel computer connected by the Internet. Data is broken up into work units and distributed over the Internet to client computers whose owners donate spare computing time to support the project. Each client periodically connects with the SETI@home server, downloads the data to analyze, and then sends the results back to the server. The client program is typically implemented as a screen saver so that it will devote CPU cycles to the SETI problem only when the computer is otherwise idle. A work unit currently requires an average of between seven and eight hours of CPU time on a client. More than 205,000,000 work units have been processed since the start of the project. More recently, technology similar to that demonstrated by SETI@home has been used for a variety of public resource computing projects as well as internal projects within large companies utilizing their idle PCs to solve problems ranging from drug screening to chip design validation.

Although computing in less time is beneficial, and may enable problems to be solved that couldn't be otherwise, it comes at a cost. Writing software to run on parallel computers can be difficult. Only a small minority of programmers have experience with parallel programming. If all these computers designed to exploit parallelism are going to achieve their potential, more programmers need to learn how to write parallel programs.

This book addresses this need by showing competent programmers of sequential machines how to design programs that can run on parallel computers. Although many excellent books show how to use particular parallel programming environments, this book is unique in that it focuses on how to think about and design parallel algorithms. To accomplish this goal, we will be using the concept of a pattern language. This highly structured representation of expert design experience has been heavily used in the object-oriented design community.

The book opens with two introductory chapters. The first gives an overview of the parallel computing landscape and background needed to understand and use the pattern language. This is followed by a more detailed chapter in which we lay out the basic concepts and jargon used by parallel programmers. The book then moves into the pattern language itself.

1.2. PARALLEL PROGRAMMING

The key to parallel computing is exploitable concurrency. Concurrency exists in a computational problem when the problem can be decomposed into subproblems that can safely execute at the same time. To be of any use, however, it must be possible to structure the code to expose and later exploit the concurrency and permit the subproblems to actually run concurrently; that is, the concurrency must be exploitable.

Most large computational problems contain exploitable concurrency. A programmer works with exploitable concurrency by creating a parallel algorithm and implementing the algorithm using a parallel programming environment. When the resulting parallel program is run on a system with multiple processors, the amount of time we have to wait for the results of the computation is reduced. In addition, multiple processors may allow larger problems to be solved than could be done on a single-processor system.

As a simple example, suppose part of a computation involves computing the summation of a large set of values. If multiple processors are available, instead of adding the values together sequentially, the set can be partitioned and the summations of the subsets computed simultaneously, each on a different processor. The partial sums are then combined to get the final answer. Thus, using multiple processors to compute in parallel may allow us to obtain a solution sooner. Also, if each processor has its own memory, partitioning the data between the processors may allow larger problems to be handled than could be handled on a single processor. A minimal sketch of this partition-and-combine idea is shown below.
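The sketch below is our own illustration (not an example from the book) of the parallel summation in C using OpenMP, one of the programming environments adopted later in the book; the array size and data values are arbitrary. Each thread accumulates a private partial sum over its share of the values, and the reduction clause combines the partial sums into the final answer.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double values[N];
    for (int i = 0; i < N; i++)
        values[i] = 1.0;            /* arbitrary data */

    double sum = 0.0;

    /* Each thread sums a subset of the values into a private partial
       sum; OpenMP combines the partial sums when the loop finishes. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += values[i];

    printf("sum = %f\n", sum);
    return 0;
}
```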

This simple example shows the essence of parallel computing. The goal is to use multiple processors to solve problems in less time and/or to solve bigger problems than would be possible on a single processor. The programmer's task is to identify the concurrency in the problem, structure the algorithm so that this concurrency can be exploited, and then implement the solution using a suitable programming environment. The final step is to solve the problem by executing the code on a parallel system.

Parallel programming presents unique challenges. Often, the concurrent tasks making up the problem include dependencies that must be identified and correctly managed. The order in which the tasks execute may change the answers of the computations in nondeterministic ways. For example, in the parallel summation described earlier, a partial sum cannot be combined with others until its own computation has completed. The algorithm imposes a partial order on the tasks (that is, they must complete before the sums can be combined). More subtly, the numerical value of the summations may change slightly depending on the order of the operations within the sums because floating-point arithmetic is nonassociative. A good parallel programmer must take care to ensure that nondeterministic issues such as these do not affect the quality of the final answer. Creating safe parallel programs can take considerable effort from the programmer.

Even when a parallel program is "correct", it may fail to deliver the anticipated performance improvement from exploiting concurrency. Care must be taken to ensure that the overhead incurred by managing the concurrency does not overwhelm the program runtime. Also, partitioning the work among the processors in a balanced way is often not as easy as the summation example suggests. The effectiveness of a parallel algorithm depends on how well it maps onto the underlying parallel computer, so a parallel algorithm could be very effective on one parallel architecture and a disaster on another.

We will revisit these issues and provide a more quantitative view of parallel computation in the next chapter.

1.3. DESIGN PATTERNS AND PATTERN LANGUAGES

A design pattern describes a good solution to a recurring problem in a particular context. The pattern follows a prescribed format that includes the pattern name, a description of the context, the forces (goals and constraints), and the solution. The idea is to record the experience of experts in a way that can be used by others facing a similar problem. In addition to the solution itself, the name of the pattern is important and can form the basis for a domain-specific vocabulary that can significantly enhance communication between designers in the same area.

Design patterns were first proposed by Christopher Alexander. The domain was city planning and architecture [AIS77]. Design patterns were originally introduced to the software engineering community by Beck and Cunningham [BC87] and became prominent in the area of object-oriented programming with the publication of the book by Gamma, Helm, Johnson, and Vlissides [GHJV95], affectionately known as the GoF (Gang of Four) book. This book gives a large collection of design patterns for object-oriented programming. To give one example, the Visitor pattern describes a way to structure classes so that the code implementing a heterogeneous data structure can be kept separate from the code to traverse it. Thus, what happens in a traversal depends on both the type of each node and the class that implements the traversal. This allows multiple functionality for data structure traversals, and significant flexibility as new functionality can be added without having to change the data structure class. The patterns in the GoF book have entered the lexicon of object-oriented programming; references to its patterns are found in the academic literature, trade publications, and system documentation. These patterns have by now become part of the expected knowledge of any competent software engineer.

An educational nonprofit organization called the Hillside Group [Hil] was formed in 1993 to promote the use of patterns and pattern languages and, more generally, to improve human communication about computers "by encouraging people to codify common programming and design practice". To develop new patterns and help pattern writers hone their skills, the Hillside Group sponsors an annual Pattern Languages of Programs (PLoP) workshop and several spinoffs in other parts of the world, such as ChiliPLoP (in the western United States), KoalaPLoP (Australia), EuroPLoP (Europe), and MensorePLoP (Japan). The proceedings of these workshops [Pat] provide a rich source of patterns covering a vast range of application domains in software development and have been used as a basis for several books [CS95, VCK96, MRB97, HFR99].

In his original work on patterns, Alexander provided not only a catalog of patterns, but also a pattern language that introduced a new approach to design. In a pattern language, the patterns are organized into a structure that leads the user through the collection of patterns in such a way that complex systems can be designed using the patterns. At each decision point, the designer selects an appropriate pattern. Each pattern leads to other patterns, resulting in a final design in terms of a web of patterns. Thus, a pattern language embodies a design methodology and provides domain-specific advice to the application designer. (In spite of the overlapping terminology, a pattern language is not a programming language.)

1.4. A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING

This book describes a pattern language for parallel programming that provides several benefits. The immediate benefits are a way to disseminate the experience of experts by providing a catalog of good solutions to important problems, an expanded vocabulary, and a methodology for the design of parallel programs. We hope to lower the barrier to parallel programming by providing guidance through the entire process of developing a parallel program. The programmer brings to the process a good understanding of the actual problem to be solved and then works through the pattern language, eventually obtaining a detailed parallel design or possibly working code. In the longer term, we hope that this pattern language can provide a basis for both a disciplined approach to the qualitative evaluation of different programming models and the development of parallel programming tools.

The pattern language is organized into four design spaces (Finding Concurrency, Algorithm Structure, Supporting Structures, and Implementation Mechanisms), which form a linear hierarchy, with Finding Concurrency at the top and Implementation Mechanisms at the bottom, as shown in Fig. 1.1.

    Figure 1.1. Overview of the pattern language

The Finding Concurrency design space is concerned with structuring the problem to expose exploitable concurrency. The designer working at this level focuses on high-level algorithmic issues and reasons about the problem to expose potential concurrency. The Algorithm Structure design space is concerned with structuring the algorithm to take advantage of potential concurrency. That is, the designer working at this level reasons about how to use the concurrency exposed in working with the Finding Concurrency patterns. The Algorithm Structure patterns describe overall strategies for exploiting concurrency. The Supporting Structures design space represents an intermediate stage between the Algorithm Structure and Implementation Mechanisms design spaces. Two important groups of patterns in this space are those that represent program-structuring approaches and those that represent commonly used shared data structures. The Implementation Mechanisms design space is concerned with how the patterns of the higher-level spaces are mapped into particular programming environments. We use it to provide descriptions of common mechanisms for process/thread management (for example, creating or destroying processes/threads) and process/thread interaction (for example, semaphores, barriers, or message passing). The items in this design space are not presented as patterns because in many cases they map directly onto elements within particular parallel programming environments. They are included in the pattern language anyway, however, to provide a complete path from problem description to code.

    Chapter 2. Background and Jargon of Parallel Computing

2.1 CONCURRENCY IN PARALLEL PROGRAMS VERSUS OPERATING SYSTEMS

2.2 PARALLEL ARCHITECTURES: A BRIEF INTRODUCTION

2.3 PARALLEL PROGRAMMING ENVIRONMENTS

2.4 THE JARGON OF PARALLEL COMPUTING

2.5 A QUANTITATIVE LOOK AT PARALLEL COMPUTATION

2.6 COMMUNICATION

2.7 SUMMARY

In this chapter, we give an overview of the parallel programming landscape and define any specialized parallel computing terminology that we will use in the patterns. Because many terms in computing are overloaded, taking different meanings in different contexts, we suggest that even readers familiar with parallel programming at least skim this chapter.

2.1. CONCURRENCY IN PARALLEL PROGRAMS VERSUS OPERATING SYSTEMS

Concurrency was first exploited in computing to better utilize or share resources within a computer. Modern operating systems support context switching to allow multiple tasks to appear to execute concurrently, thereby allowing useful work to occur while the processor is stalled on one task. This application of concurrency, for example, allows the processor to stay busy by swapping in a new task to execute while another task is waiting for I/O. By quickly swapping tasks in and out, giving each task a "slice" of the processor time, the operating system can allow multiple users to use the system as if each were using it alone (but with degraded performance).

Most modern operating systems can use multiple processors to increase the throughput of the system. The UNIX shell uses concurrency along with a communication abstraction known as pipes to provide a powerful form of modularity: Commands are written to accept a stream of bytes as input (the consumer) and produce a stream of bytes as output (the producer). Multiple commands can be chained together with a pipe connecting the output of one command to the input of the next, allowing complex commands to be built from simple building blocks. Each command is executed in its own process, with all processes executing concurrently. Because the producer blocks if buffer space in the pipe is not available, and the consumer blocks if data is not available, the job of managing the stream of results moving between commands is greatly simplified. More recently, with operating systems with windows that invite users to do more than one thing at a time, and the Internet, which often introduces I/O delays perceptible to the user, almost every program that contains a GUI incorporates concurrency.

Although the fundamental concepts for safely handling concurrency are the same in parallel programs and operating systems, there are some important differences. For an operating system, the problem is not finding concurrency; the concurrency is inherent in the way the operating system functions in managing a collection of concurrently executing processes (representing users, applications, and background activities such as print spooling) and providing synchronization mechanisms so resources can be safely shared. However, an operating system must support concurrency in a robust and secure way: Processes should not be able to interfere with each other (intentionally or not), and the entire system should not crash if something goes wrong with one process. In a parallel program, finding and exploiting concurrency can be a challenge, while isolating processes from each other is not the critical concern it is with an operating system. Performance goals are different as well. In an operating system, performance goals are normally related to throughput or response time, and it may be acceptable to sacrifice some efficiency to maintain robustness and fairness in resource allocation. In a parallel program, the goal is to minimize the running time of a single program.

2.2. PARALLEL ARCHITECTURES: A BRIEF INTRODUCTION

There are dozens of different parallel architectures, among them networks of workstations, clusters of off-the-shelf PCs, massively parallel supercomputers, tightly coupled symmetric multiprocessors, and multiprocessor workstations. In this section, we give an overview of these systems, focusing on the characteristics relevant to the programmer.

2.2.1. Flynn's Taxonomy

By far the most common way to characterize these architectures is Flynn's taxonomy [Fly72]. He categorizes all computers according to the number of instruction streams and data streams they have, where a stream is a sequence of instructions or data on which a computer operates. In Flynn's taxonomy, there are four possibilities: SISD, SIMD, MISD, and MIMD.

Single Instruction, Single Data (SISD). In a SISD system, one stream of instructions processes a single stream of data, as shown in Fig. 2.1. This is the common von Neumann model used in virtually all single-processor computers.

    Figure 2.1. The Single Instruction, Single Data (SISD) architecture

Single Instruction, Multiple Data (SIMD). In a SIMD system, a single instruction stream is concurrently broadcast to multiple processors, each with its own data stream (as shown in Fig. 2.2). The original systems from Thinking Machines and MasPar can be classified as SIMD. The CPP DAP Gamma II and Quadrics Apemille are more recent examples; these are typically deployed in specialized applications, such as digital signal processing, that are suited to fine-grained parallelism and require little interprocess communication. Vector processors, which operate on vector data in a pipelined fashion, can also be categorized as SIMD. Exploiting this parallelism is usually done by the compiler.

    Figure 2.2. The Single Instruction, Multiple Data (SIMD) architecture

Multiple Instruction, Single Data (MISD). No well-known systems fit this designation. It is mentioned for the sake of completeness.

Multiple Instruction, Multiple Data (MIMD). In a MIMD system, each processing element has its own stream of instructions operating on its own data. This architecture, shown in Fig. 2.3, is the most general of the architectures in that each of the other cases can be mapped onto the MIMD architecture. The vast majority of modern parallel systems fit into this category.

    Figure 2.3. The Multiple Instruction, Multiple Data (MIMD) architecture

2.2.2. A Further Breakdown of MIMD

The MIMD category of Flynn's taxonomy is too broad to be useful on its own; this category is typically decomposed according to memory organization.

Shared memory. In a shared-memory system, all processes share a single address space and communicate with each other by writing and reading shared variables.

One class of shared-memory systems is called SMPs (symmetric multiprocessors). As shown in Fig. 2.4, all processors share a connection to a common memory and access all memory locations at equal speeds. SMP systems are arguably the easiest parallel systems to program because programmers do not need to distribute data structures among processors. Because increasing the number of processors increases contention for the memory, the processor/memory bandwidth is typically a limiting factor. Thus, SMP systems do not scale well and are limited to small numbers of processors.

    Figure 2.4. The Symmetric Multiprocessor (SMP) architecture

The other main class of shared-memory systems is called NUMA (nonuniform memory access). As shown in Fig. 2.5, the memory is shared; that is, it is uniformly addressable from all processors, but some blocks of memory may be physically more closely associated with some processors than others. This reduces the memory bandwidth bottleneck and allows systems with more processors; however, as a result, the access time from a processor to a memory location can be significantly different depending on how "close" the memory location is to the processor. To mitigate the effects of nonuniform access, each processor has a cache, along with a protocol to keep cache entries coherent. Hence, another name for these architectures is cache-coherent nonuniform memory access systems (ccNUMA). Logically, programming a ccNUMA system is the same as programming an SMP, but to obtain the best performance, the programmer will need to be more careful about locality issues and cache effects.

    Figure 2.5. An example of the nonuniform memory access (NUMA) architecture

Distributed memory. In a distributed-memory system, each process has its own address space and communicates with other processes by message passing (sending and receiving messages). A schematic representation of a distributed-memory computer is shown in Fig. 2.6.

    Figure 2.6. The distributed-memory architecture

Depending on the topology and technology used for the processor interconnection, communication speed can range from almost as fast as shared memory (in tightly integrated supercomputers) to orders of magnitude slower (for example, in a cluster of PCs interconnected with an Ethernet network). The programmer must explicitly program all the communication between processors and be concerned with the distribution of data.

Distributed-memory computers are traditionally divided into two classes: MPP (massively parallel processors) and clusters. In an MPP, the processors and the network infrastructure are tightly coupled and specialized for use in a parallel computer. These systems are extremely scalable, in some cases supporting the use of many thousands of processors in a single system [MSW96, IBM02].

Clusters are distributed-memory systems composed of off-the-shelf computers connected by an off-the-shelf network. When the computers are PCs running the Linux operating system, these clusters are called Beowulf clusters. As off-the-shelf networking technology improves, systems of this type are becoming more common and much more powerful. Clusters provide an inexpensive way for an organization to obtain parallel computing capabilities [Beo]. Preconfigured clusters are now available from many vendors. One frugal group even reported constructing a useful parallel system by using a cluster to harness the combined power of obsolete PCs that otherwise would have been discarded [HHS01].

Hybrid systems. These systems are clusters of nodes with separate address spaces in which each node contains several processors that share memory.

According to van der Steen and Dongarra's "Overview of Recent Supercomputers" [vdSD03], which contains a brief description of the supercomputers currently or soon to be commercially available, hybrid systems formed from clusters of SMPs connected by a fast network are currently the dominant trend in high-performance computing. For example, in late 2003, four of the five fastest computers in the world were hybrid systems [Top].

Grids. Grids are systems that use distributed, heterogeneous resources connected by LANs and/or WANs [FK03]. Often the interconnection network is the Internet. Grids were originally envisioned as a way to link multiple supercomputers to enable larger problems to be solved, and thus could be viewed as a special type of distributed-memory or hybrid MIMD machine. More recently, the idea of grid computing has evolved into a general way to share heterogeneous resources, such as computation servers, storage, application servers, information services, or even scientific instruments. Grids differ from clusters in that the various resources in the grid need not have a common point of administration. In most cases, the resources on a grid are owned by different organizations that maintain control over the policies governing use of the resources. This affects the way these systems are used, the middleware created to manage them, and most importantly for this discussion, the overhead incurred when communicating between resources within the grid.

2.2.3. Summary

We have classified these systems according to the characteristics of the hardware. These characteristics typically influence the native programming model used to express concurrency on a system; however, this is not always the case. It is possible for a programming environment for a shared-memory machine to provide the programmer with the abstraction of distributed memory and message passing. Virtual distributed shared memory systems contain middleware to provide the opposite: the abstraction of shared memory on a distributed-memory machine.

2.3. PARALLEL PROGRAMMING ENVIRONMENTS

Parallel programming environments provide the basic tools, language features, and application programming interfaces (APIs) needed to construct a parallel program. A programming environment implies a particular abstraction of the computer system called a programming model. Traditional sequential computers use the well-known von Neumann model. Because all sequential computers use this model, software designers can design software to a single abstraction and reasonably expect it to map onto most, if not all, sequential computers.

Unfortunately, there are many possible models for parallel computing, reflecting the different ways processors can be interconnected to construct a parallel system. The most common models are based on one of the widely deployed parallel architectures: shared memory, distributed memory with message passing, or a hybrid combination of the two.

Programming models too closely aligned to a particular parallel system lead to programs that are not portable between parallel computers. Because the effective lifespan of software is longer than that of hardware, many organizations have more than one type of parallel computer, and most programmers insist on programming environments that allow them to write portable parallel programs. Also, explicitly managing large numbers of resources in a parallel computer is difficult, suggesting that higher-level abstractions of the parallel computer might be useful. The result was that as of the mid-1990s, there was a veritable glut of parallel programming environments. A partial list of these is shown in Table 2.1. This created a great deal of confusion for application developers and hindered the adoption of parallel computing for mainstream applications.

    Table 2.1. Some Parallel Programming Environments from the Mid-1990s

    "C*inC CUMULVS JavaRMI PRIO Quake

    ABCPL DAGGER javaPG P3L Quark

    ACE DAPPLE JAVAR P4Linda QuickThreads

    ACT++ DataParallelC JavaSpaces Pablo Sage++

    ADDAP DC++ JIDL PADE SAM

    Adl DCE++ Joyce PADRE SCANDAL

    Adsmith DDD Karma Panda SCHEDULE

AFAPI DICE Khoros Papers SciTL

    ALWAN DIPC KOAN/FortranS Para++ SDDA

AM DistributedSmalltalk LAM Paradigm SHMEM

    AMDC DOLIB Legion Parafrase2 SIMPLE

    Amoeba DOME Lilac Paralation Sina

    AppLeS DOSMOS Linda Parallaxis SISAL

ARTS DRL LiPS ParallelHaskell SMI

    AthapascanOb DSMThreads Locust ParallelC++ SONiC

    Aurora Ease Lparx ParC SplitC

    Automap ECO Lucid ParLib++ SR

    bb_threads Eilean Maisie ParLin Sthreads

    Blaze Emerald Manifold Parlog Strand

    BlockComm EPL Mentat Parmacs SUIF

    BSP Excalibur MetaChaos Parti SuperPascal

    C* Express Midway pC Synergy

    C** Falcon Millipede pC++ TCGMSG

    C4 Filaments Mirage PCN Telegraphos

    CarlOS FLASH Modula2* PCP: TheFORCE

    Cashmere FM ModulaP PCU Threads.h++

    CC++ Fork MOSIX PEACE TRAPPER

    Charlotte FortranM MpC PENNY TreadMarks

    Charm FX MPC++ PET UC

    Charm++ GA MPI PETSc uC++

    Chu GAMMA Multipol PH UNITY

    Cid Glenda Munin Phosphorus V

    Cilk GLU NanoThreads POET Vic*

    CMFortran GUARD NESL Polaris VisifoldVNUS

    Code HAsL NetClasses++ POOLT VPE

ConcurrentML HORUS Nexus POOMA Win32threads

    Converse HPC Nimrod POSYBL WinPar

    COOL HPC++ NOW PRESTO WWWinda

    CORRELATE HPF ObjectiveLinda Prospero XENOOPS

    CparPar IMPACT Occam Proteus XPC

    CPS ISETLLinda Omega PSDM Zounds

    CRL ISIS OOF90 PSI ZPL

    CSP JADA Orca PVM

    Cthreads JADE P++ QPC++

Fortunately, by the late 1990s, the parallel programming community converged predominantly on two environments for parallel programming: OpenMP [OMP] for shared memory and MPI [Mesb] for message passing.

OpenMP is a set of language extensions implemented as compiler directives. Implementations are currently available for Fortran, C, and C++. OpenMP is frequently used to incrementally add parallelism to sequential code. By adding a compiler directive around a loop, for example, the compiler can be instructed to generate code to execute the iterations of the loop in parallel. The compiler takes care of most of the details of thread creation and management. OpenMP programs tend to work very well on SMPs, but because its underlying programming model does not include a notion of nonuniform memory access times, it is less ideal for ccNUMA and distributed-memory machines.
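As a small sketch of our own (not code from the book), the fragment below shows this incremental style in C: without the directive it is an ordinary sequential loop, and the single pragma asks the compiler to split the iterations among a team of threads. The function name and arguments are arbitrary.

```c
#include <omp.h>

/* Scale every element of a vector. The directive tells the compiler
   to run the loop iterations in parallel; thread creation and
   management are handled by the OpenMP runtime. */
void scale(double *x, int n, double a)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        x[i] = a * x[i];
}
```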

MPI is a set of library routines that provide for process management, message passing, and some collective communication operations (these are operations that involve all the processes involved in a program, such as barrier, broadcast, and reduction). MPI programs can be difficult to write because the programmer is responsible for data distribution and explicit interprocess communication using messages. Because the programming model assumes distributed memory, MPI is a good choice for MPPs and other distributed-memory machines.
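For readers who have not seen MPI, the following minimal sketch (ours, not the book's) shows the flavor of the library: each process discovers its rank within the communicator, and rank 1 sends a single integer to rank 0. The payload value is arbitrary.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        value = 42;                       /* arbitrary payload */
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 0 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```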

Neither OpenMP nor MPI is an ideal fit for hybrid architectures that combine multiprocessor nodes, each with multiple processors and a shared memory, into a larger system with separate address spaces for each node: The OpenMP model does not recognize nonuniform memory access times, so its data allocation can lead to poor performance on machines that are not SMPs, while MPI does not include constructs to manage data structures residing in a shared memory. One solution is a hybrid model in which OpenMP is used on each shared-memory node and MPI is used between the nodes. This works well, but it requires the programmer to work with two different programming models within a single program. Another option is to use MPI on both the shared-memory and distributed-memory portions of the algorithm and give up the advantages of a shared-memory programming model, even when the hardware directly supports it.

New high-level programming environments that simplify portable parallel programming and more accurately reflect the underlying parallel architectures are topics of current research [Cen]. Another approach, more popular in the commercial sector, is to extend MPI and OpenMP. In the mid-1990s, the MPI Forum defined an extended MPI called MPI 2.0, although implementations were not widely available at the time this was written. It is a large, complex extension to MPI that includes dynamic process creation, parallel I/O, and many other features. Of particular interest to programmers of modern hybrid architectures is the inclusion of one-sided communication. One-sided communication mimics some of the features of a shared-memory system by letting one process write into or read from the memory regions of other processes. The term "one-sided" refers to the fact that the read or write is launched by the initiating process without the explicit involvement of the other participating process. A more sophisticated abstraction of one-sided communication is available as part of the Global Arrays [NHL96, NHK+02, Gloa] package. Global Arrays works together with MPI to help a programmer manage distributed array data. After the programmer defines the array and how it is laid out in memory, the program executes "puts" or "gets" into the array without needing to explicitly manage which MPI process "owns" the particular section of the array. In essence, the global array provides an abstraction of a globally shared array. This only works for arrays, but arrays are such common data structures in parallel computing that this package, although limited, can be very useful.

Just as MPI has been extended to mimic some of the benefits of a shared-memory environment, OpenMP has been extended to run in distributed-memory environments. The annual WOMPAT (Workshop on OpenMP Applications and Tools) workshops contain many papers discussing various approaches and experiences with OpenMP in clusters and ccNUMA environments.

MPI is implemented as a library of routines to be called from programs written in a sequential programming language, whereas OpenMP is a set of extensions to sequential programming languages. They represent two of the possible categories of parallel programming environments (libraries and language extensions), and these two particular environments account for the overwhelming majority of parallel computing being done today. There is, however, one more category of parallel programming environments, namely languages with built-in features to support parallel programming. Java is such a language. Rather than being designed to support high-performance computing, Java is an object-oriented, general-purpose programming environment with features for explicitly specifying concurrent processing with shared memory. In addition, the standard I/O and network packages provide classes that make it easy for Java to perform interprocess communication between machines, thus making it possible to write programs based on both the shared-memory and the distributed-memory models. The newer java.nio packages support I/O in a way that is less convenient for the programmer, but gives significantly better performance, and Java 2 1.5 includes new support for concurrent programming, most significantly in the java.util.concurrent.* packages. Additional packages that support different approaches to parallel computing are widely available.

Although there have been other general-purpose languages, both prior to Java and more recent (for example, C#), that contained constructs for specifying concurrency, Java is the first to become widely used. As a result, it may be the first exposure for many programmers to concurrent and parallel programming. Although Java provides software engineering benefits, currently the performance of parallel Java programs cannot compete with OpenMP or MPI programs for typical scientific computing applications. The Java design has also been criticized for several deficiencies that matter in this domain (for example, a floating-point model that emphasizes portability and more reproducible results over exploiting the available floating-point hardware to the fullest, inefficient handling of arrays, and lack of a lightweight mechanism to handle complex numbers). The performance difference between Java and other alternatives can be expected to decrease, especially for symbolic or other nonnumeric problems, as compiler technology for Java improves and as new packages and language extensions become available. The Titanium project [Tita] is an example of a Java dialect designed for high-performance computing in a ccNUMA environment.

For the purposes of this book, we have chosen OpenMP, MPI, and Java as the three environments we will use in our examples: OpenMP and MPI for their popularity and Java because it is likely to be many programmers' first exposure to concurrent programming. A brief overview of each can be found in the appendixes.

2.4. THE JARGON OF PARALLEL COMPUTING

In this section, we define some terms that are frequently used throughout the pattern language. Additional definitions can be found in the glossary.

Task. The first step in designing a parallel program is to break the problem up into tasks. A task is a sequence of instructions that operate together as a group. This group corresponds to some logical part of an algorithm or program. For example, consider the multiplication of two order-N matrices. Depending on how we construct the algorithm, the tasks could be (1) the multiplication of subblocks of the matrices, (2) inner products between rows and columns of the matrices, or (3) individual iterations of the loops involved in the matrix multiplication. These are all legitimate ways to define tasks for matrix multiplication; that is, the task definition follows from the way the algorithm designer thinks about the problem.

Unit of execution (UE). To be executed, a task needs to be mapped to a UE such as a process or thread. A process is a collection of resources that enables the execution of program instructions. These resources can include virtual memory, I/O descriptors, a runtime stack, signal handlers, user and group IDs, and access control tokens. A more high-level view is that a process is a "heavyweight" unit of execution with its own address space. A thread is the fundamental UE in modern operating systems. A thread is associated with a process and shares the process's environment. This makes threads lightweight (that is, a context switch between threads takes only a small amount of time). A more high-level view is that a thread is a "lightweight" UE that shares an address space with other threads.

We will use unit of execution or UE as a generic term for one of a collection of possibly concurrently executing entities, usually either processes or threads. This is convenient in the early stages of program design when the distinctions between processes and threads are less important.

Processing element (PE). We use the term processing element (PE) as a generic term for a hardware element that executes a stream of instructions. The unit of hardware considered to be a PE depends on the context. For example, some programming environments view each workstation in a cluster of SMP workstations as executing a single instruction stream; in this situation, the PE would be the workstation. A different programming environment running on the same hardware, however, might view each processor of each workstation as executing an individual instruction stream; in this case, the PE is the individual processor, and each workstation contains several PEs.

Load balance and load balancing. To execute a parallel program, the tasks must be mapped to UEs, and the UEs to PEs. How the mappings are done can have a significant impact on the overall performance of a parallel algorithm. It is crucial to avoid the situation in which a subset of the PEs is doing most of the work while others are idle. Load balance refers to how well the work is distributed among PEs. Load balancing is the process of allocating work to PEs, either statically or dynamically, so that the work is distributed as evenly as possible.

Synchronization. In a parallel program, due to the nondeterminism of task scheduling and other factors, events in the computation might not always occur in the same order. For example, in one run, a task might read variable x before another task reads variable y; in the next run with the same input, the events might occur in the opposite order. In many cases, the order in which two events occur does not matter. In other situations, the order does matter, and to ensure that the program is correct, the programmer must introduce synchronization to enforce the necessary ordering constraints. The primitives provided for this purpose in our selected environments are discussed in the Implementation Mechanisms design space (Section 6.3).

Synchronous versus asynchronous. We use these two terms to qualitatively refer to how tightly coupled in time two events are. If two events must happen at the same time, they are synchronous; otherwise they are asynchronous. For example, message passing (that is, communication between UEs by sending and receiving messages) is synchronous if a message sent must be received before the sender can continue. Message passing is asynchronous if the sender can continue its computation regardless of what happens at the receiver, or if the receiver can continue computations while waiting for a receive to complete.

Race conditions. A race condition is a kind of error peculiar to parallel programs. It occurs when the outcome of a program changes as the relative scheduling of UEs varies. Because the operating system and not the programmer controls the scheduling of the UEs, race conditions result in programs that potentially give different answers even when run on the same system with the same data. Race conditions are particularly difficult errors to debug because by their nature they cannot be reliably reproduced. Testing helps, but is not as effective as with sequential programs: A program may run correctly the first thousand times and then fail catastrophically on the thousand-and-first execution, and then run again correctly when the programmer attempts to reproduce the error as the first step in debugging.

Race conditions result from errors in synchronization. If multiple UEs read and write shared variables, the programmer must protect access to these shared variables so the reads and writes occur in a valid order regardless of how the tasks are interleaved. When many variables are shared or when they are accessed through multiple levels of indirection, verifying by inspection that no race conditions exist can be very difficult. Tools are available that help detect and fix race conditions, such as Thread Checker from Intel Corporation, and the problem remains an area of active and important research [NM92].
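As a concrete (and deliberately broken) sketch of our own, the C fragment below lets several threads increment a shared counter without any synchronization; because the read-modify-write of counter++ can interleave between threads, updates are lost and the printed total varies from run to run.

```c
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4
#define NITERS   100000

long counter = 0;                 /* shared variable, deliberately unprotected */

void *work(void *arg)
{
    for (int i = 0; i < NITERS; i++)
        counter++;                /* read-modify-write: not atomic, so a race */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, work, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    /* Often prints less than NTHREADS * NITERS because updates are lost. */
    printf("counter = %ld (expected %d)\n", counter, NTHREADS * NITERS);
    return 0;
}
```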

Deadlocks. Deadlocks are another type of error peculiar to parallel programs. A deadlock occurs when there is a cycle of tasks in which each task is blocked waiting for another to proceed. Because all are waiting for another task to do something, they will all be blocked forever. As a simple example, consider two tasks in a message-passing environment. Task A attempts to receive a message from task B, after which A will reply by sending a message of its own to task B. Meanwhile, task B attempts to receive a message from task A, after which B will send a message to A. Because each task is waiting for the other to send it a message first, both tasks will be blocked forever. Fortunately, deadlocks are not difficult to discover, as the tasks will stop at the point of the deadlock.
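The two-task scenario above can be written down directly; the sketch below (ours, using MPI's blocking calls) deadlocks when run with two ranks, because both block in MPI_Recv waiting for the other to send first, so neither ever reaches its MPI_Send.

```c
#include <mpi.h>

/* Run with exactly two ranks. Both block in MPI_Recv before either
   reaches MPI_Send: a deadlock. */
int main(int argc, char **argv)
{
    int rank, other, in = 0, out = 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;

    MPI_Recv(&in, 1, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Send(&out, 1, MPI_INT, other, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```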

2.5. A QUANTITATIVE LOOK AT PARALLEL COMPUTATION

The two main reasons for implementing a parallel program are to obtain better performance and to solve larger problems. Performance can be both modeled and measured, so in this section we will take another look at parallel computations by giving some simple analytical models that illustrate some of the factors that influence the performance of a parallel program.

Consider a computation consisting of three parts: a setup section, a computation section, and a finalization section. The total running time of this program on one PE is then given as the sum of the times for the three parts.

Equation 2.1

$$T_{\text{total}}(1) = T_{\text{setup}} + T_{\text{compute}} + T_{\text{finalization}}$$

What happens when we run this computation on a parallel computer with multiple PEs? Suppose that the setup and finalization sections cannot be carried out concurrently with any other activities, but that the computation section could be divided into tasks that would run independently on as many PEs as are available, with the same total number of computation steps as in the original computation. The time for the full computation on P PEs can therefore be given by Eq. 2.2.

Equation 2.2

$$T_{\text{total}}(P) = T_{\text{setup}} + \frac{T_{\text{compute}}(1)}{P} + T_{\text{finalization}}$$

Of course, Eq. 2.2 describes a very idealized situation. However, the idea that computations have a serial part (for which additional PEs are useless) and a parallelizable part (for which more PEs decrease the running time) is realistic. Thus, this simple model captures an important relationship.

An important measure of how much additional PEs help is the relative speedup S, which describes how much faster a problem runs in a way that normalizes away the actual running time.

Equation 2.3

$$S(P) = \frac{T_{\text{total}}(1)}{T_{\text{total}}(P)}$$

A related measure is the efficiency E, which is the speedup normalized by the number of PEs.

Equation 2.4

$$E(P) = \frac{S(P)}{P}$$

Equation 2.5

$$E(P) = \frac{T_{\text{total}}(1)}{P \, T_{\text{total}}(P)}$$

Ideally, we would want the speedup to be equal to P, the number of PEs. This is sometimes called perfect linear speedup. Unfortunately, this is an ideal that can rarely be achieved because times for setup and finalization are not improved by adding more PEs, limiting the speedup. The terms that cannot be run concurrently are called the serial terms. Their running times represent some fraction of the total, called the serial fraction, denoted γ.

Equation 2.6

$$\gamma = \frac{T_{\text{setup}} + T_{\text{finalization}}}{T_{\text{total}}(1)}$$

The fraction of time spent in the parallelizable part of the program is then (1 - γ). We can thus rewrite the expression for total computation time with P PEs as

Equation 2.7

$$T_{\text{total}}(P) = \left(\gamma + \frac{1-\gamma}{P}\right) T_{\text{total}}(1)$$

Now, rewriting S in terms of the new expression for Ttotal(P), we obtain the famous Amdahl's law:

Equation 2.8

$$S(P) = \frac{T_{\text{total}}(1)}{\left(\gamma + \frac{1-\gamma}{P}\right) T_{\text{total}}(1)}$$

Equation 2.9

$$S(P) = \frac{1}{\gamma + \frac{1-\gamma}{P}}$$

Thus, in an ideal parallel algorithm with no overhead in the parallel part, the speedup should follow Eq. 2.9. What happens to the speedup if we take our ideal parallel algorithm and use a very large number of processors? Taking the limit as P goes to infinity in our expression for S yields

Equation 2.10

$$\lim_{P \to \infty} S(P) = \frac{1}{\gamma}$$

Eq. 2.10 thus gives an upper bound on the speedup obtainable in an algorithm whose serial part represents γ of the total computation. These concepts are vital to the parallel algorithm designer. In designing a parallel algorithm, it is important to understand the value of the serial fraction so that realistic expectations can be set for performance. It may not make sense to implement a complex, arbitrarily scalable parallel algorithm if 10% or more of the algorithm is serial, and 10% is fairly common.
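To make the bound concrete, here is a small worked example of our own using Eq. 2.9 and Eq. 2.10 with an assumed serial fraction of 10%.

```latex
% Assume a serial fraction of gamma = 0.1 (10% of the work is serial).
% Amdahl's law (Eq. 2.9) with P = 10 and P = 100 processors:
S(10)  = \frac{1}{0.1 + 0.9/10}  = \frac{1}{0.19}  \approx 5.3
\qquad
S(100) = \frac{1}{0.1 + 0.9/100} = \frac{1}{0.109} \approx 9.2
% The limit (Eq. 2.10) caps the speedup at 1/gamma = 10,
% no matter how many processors are added.
```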

Of course, Amdahl's law is based on assumptions that may or may not be true in practice. In real life, a number of factors may make the actual running time longer than this formula implies. For example, creating additional parallel tasks may increase overhead and the chances of contention for shared resources. On the other hand, if the original serial computation is limited by resources other than the availability of CPU cycles, the actual performance could be much better than Amdahl's law would predict. For example, a large parallel machine may allow bigger problems to be held in memory, thus reducing virtual memory paging, or multiple processors, each with its own cache, may allow much more of the problem to remain in the cache. Amdahl's law also rests on the assumption that for any given input, the parallel and serial implementations perform exactly the same number of computational steps. If the serial algorithm being used in the formula is not the best possible algorithm for the problem, then a clever parallel algorithm that structures the computation differently can reduce the total number of computational steps.

It has also been observed [Gus88] that the exercise underlying Amdahl's law, namely running exactly the same problem with varying numbers of processors, is artificial in some circumstances. If, say, the parallel application were a weather simulation, then when new processors were added, one would most likely increase the problem size by adding more details to the model while keeping the total execution time constant. If this is the case, then Amdahl's law, or fixed-size speedup, gives a pessimistic view of the benefits of additional processors.

To see this, we can reformulate the equation to give the speedup in terms of performance on a P-processor system. Earlier, in Eq. 2.2, we obtained the execution time for P processors, Ttotal(P), from the execution time of the serial terms and the execution time of the parallelizable part when executed on one processor. Here, we do the opposite and obtain Ttotal(1) from the serial and parallel terms when executed on P processors.

Equation 2.11

$$T_{\text{total}}(1) = T_{\text{setup}} + P \cdot T_{\text{compute}}(P) + T_{\text{finalization}}$$

Now, we define the so-called scaled serial fraction, denoted γ_scaled, as

Equation 2.12

$$\gamma_{\text{scaled}} = \frac{T_{\text{setup}} + T_{\text{finalization}}}{T_{\text{total}}(P)}$$

and then

Equation 2.13

$$T_{\text{total}}(1) = \left(\gamma_{\text{scaled}} + P\,(1 - \gamma_{\text{scaled}})\right) T_{\text{total}}(P)$$

Rewriting the equation for speedup (Eq. 2.3) and simplifying, we obtain the scaled (or fixed-time) speedup.[1]

[1] This equation, sometimes known as Gustafson's law, was attributed in [Gus88] to E. Barsis.

Equation 2.14

$$S(P) = \gamma_{\text{scaled}} + P\,(1 - \gamma_{\text{scaled}})$$

This gives exactly the same speedup as Amdahl's law, but allows a different question to be asked when the number of processors is increased. Since γ_scaled depends on P, the result of taking the limit isn't immediately obvious, but would give the same result as the limit in Amdahl's law. However, suppose we take the limit in P while holding Tcompute, and thus γ_scaled, constant. The interpretation is that we are increasing the size of the problem so that the total running time remains constant when more processors are added. (This contains the implicit assumption that the execution time of the serial terms does not change as the problem size grows.) In this case, the speedup is linear in P. Thus, while adding more processors to solve a fixed problem may hit the speedup limits of Amdahl's law with a relatively small number of processors, if the problem grows as more processors are added, Amdahl's law will be pessimistic. These two models of speedup, along with a fixed-memory version of speedup, are discussed in [SN90].

    2.6. COMMUNICATION

2.6.1. Latency and Bandwidth

A simple but useful model characterizes the total time for message transfer as the sum of a fixed cost plus a variable cost that depends on the length of the message.

Equation 2.15

$$T_{\text{message transfer}} = \alpha + \frac{N}{\beta}$$

The fixed cost α is called latency and is essentially the time it takes to send an empty message over the communication medium, from the time the send routine is called to the time the data is received by the recipient. Latency (given in some appropriate time unit) includes overhead due to software and network hardware plus the time it takes for the message to traverse the communication medium. The bandwidth β (given in some measure of bytes per time unit) is a measure of the capacity of the communication medium. N is the length of the message.

The latency and bandwidth can vary significantly between systems depending on both the hardware used and the quality of the software implementing the communication protocols. Because these values can be measured with fairly simple benchmarks [DD97], it is sometimes worthwhile to measure values for α and β, as these can help guide optimizations to improve communication performance. For example, in a system in which α is relatively large, it might be worthwhile to try to restructure a program that sends many small messages to aggregate the communication into a few large messages instead. Data for several recent systems has been presented in [BBC+03].
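As a quick worked illustration of Eq. 2.15 with assumed (not measured) numbers, take α = 50 μs and β = 100 MB/s:

```latex
% Assumed values, for illustration only: alpha = 50 microseconds, beta = 100 MB/s.
% A tiny message is dominated by latency:
T(1\ \text{byte}) \approx 50\,\mu s + \frac{1\ \text{byte}}{100\ \text{MB/s}} \approx 50\,\mu s
% A large message is dominated by bandwidth:
T(1\ \text{MB}) \approx 50\,\mu s + \frac{1\ \text{MB}}{100\ \text{MB/s}} = 50\,\mu s + 10{,}000\,\mu s \approx 10\ \text{ms}
% Sending the same 1 MB as 1000 separate 1 KB messages pays the latency 1000 times:
T \approx 1000 \times 50\,\mu s + 10{,}000\,\mu s = 60\ \text{ms}
```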

2.6.2. Overlapping Communication and Computation and Latency Hiding

If we look more closely at the computation time within a single task on a single processor, it can roughly be decomposed into computation time, communication time, and idle time. The communication time is the time spent sending and receiving messages (and thus only applies to distributed-memory machines), whereas the idle time is time that no work is being done because the task is waiting for an event, such as the release of a resource held by another task.

    Acommonsituationinwhichataskmaybeidleiswhenitiswaitingforamessagetobetransmittedthroughthesystem.Thiscanoccurwhensendingamessage(astheUEwaitsforareplybeforeproceeding)orwhenreceivingamessage.Sometimesitispossibletoeliminatethiswaitbyrestructuringthetasktosendthemessageand/orpostthereceive(thatis,indicatethatitwantstoreceiveamessage)andthencontinuethecomputation.Thisallowstheprogrammertooverlapcommunicationandcomputation.WeshowanexampleofthistechniqueinFig.2.7.Thisstyleofmessagepassingismorecomplicatedfortheprogrammer,becausetheprogrammermusttakecaretowaitforthereceivetocompleteafteranyworkthatcanbeoverlappedwithcommunicationiscompleted.

Figure 2.7. Communication without (left) and with (right) support for overlapping communication and computation. Although UE 0 in the computation on the right still has some idle time waiting for the reply from UE 1, the idle time is reduced and the computation requires less total time because of UE 1's earlier start.
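
In MPI, for example, this kind of overlap is typically expressed with nonblocking operations. The following C sketch is not taken from the book; the routines local_work and work_needing_remote_data are hypothetical placeholders for whatever computation can and cannot proceed before the message arrives.

    #include <mpi.h>

    void local_work(void);                       /* work independent of the incoming message */
    void work_needing_remote_data(const double *buf, int n);

    /* Exchange n doubles with a neighboring UE while doing useful work. */
    void exchange_and_compute(double *send_buf, double *recv_buf, int n,
                              int neighbor, MPI_Comm comm)
    {
        MPI_Request reqs[2];

        /* Post the receive and the send; both calls return immediately. */
        MPI_Irecv(recv_buf, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
        MPI_Isend(send_buf, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

        local_work();                            /* overlapped with the message transfer */

        /* Wait for both transfers before touching recv_buf or reusing send_buf. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        work_needing_remote_data(recv_buf, n);
    }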

Another technique used on many parallel computers is to assign multiple UEs to each PE, so that when one UE is waiting for communication, it will be possible to context switch to another UE and keep the processor busy. This is an example of latency hiding. It is increasingly being used on modern high-performance computing systems, the most famous example being the MTA system from Cray Research [ACC+90].

2.7. SUMMARY

This chapter has given a brief overview of some of the concepts and vocabulary used in parallel computing. Additional terms are defined in the glossary. We also discussed the major programming environments in use for parallel computing: OpenMP, MPI, and Java. Throughout the book, we will use these three programming environments for our examples. More details about OpenMP, MPI, and Java and how to use them to write parallel programs are provided in the appendixes.

    Chapter 3. The Finding Concurrency Design Space

3.1 ABOUT THE DESIGN SPACE

3.2 THE TASK DECOMPOSITION PATTERN

3.3 THE DATA DECOMPOSITION PATTERN

3.4 THE GROUP TASKS PATTERN

3.5 THE ORDER TASKS PATTERN

3.6 THE DATA SHARING PATTERN

3.7 THE DESIGN EVALUATION PATTERN

3.8 SUMMARY

3.1. ABOUT THE DESIGN SPACE

The software designer works in a number of domains. The design process starts in the problem domain with design elements directly relevant to the problem being solved (for example, fluid flows, decision trees, atoms, etc.). The ultimate aim of the design is software, so at some point, the design elements change into ones relevant to a program (for example, data structures and software modules). We call this the program domain. Although it is often tempting to move into the program domain as soon as possible, a designer who moves out of the problem domain too soon may miss valuable design options.

This is particularly relevant in parallel programming. Parallel programs attempt to solve bigger problems in less time by simultaneously solving different parts of the problem on different processing elements. This can only work, however, if the problem contains exploitable concurrency, that is, multiple activities or tasks that can execute at the same time. After a problem has been mapped onto the program domain, however, it can be difficult to see opportunities to exploit concurrency.

Hence, programmers should start their design of a parallel solution by analyzing the problem within the problem domain to expose exploitable concurrency. We call the design space in which this analysis is carried out the Finding Concurrency design space. The patterns in this design space will help identify and analyze the exploitable concurrency in a problem. After this is done, one or more patterns from the Algorithm Structure space can be chosen to help design the appropriate algorithm structure to exploit the identified concurrency.

An overview of this design space and its place in the pattern language is shown in Fig. 3.1.

    Figure 3.1. Overview of the Finding Concurrency design space and its place in the pattern language

Experienced designers working in a familiar domain may see the exploitable concurrency immediately and could move directly to the patterns in the Algorithm Structure design space.

3.1.1. Overview

Before starting to work with the patterns in this design space, the algorithm designer must first consider the problem to be solved and make sure the effort to create a parallel program will be justified: Is the problem large enough and the results significant enough to justify expending effort to solve it faster? If so, the next step is to make sure the key features and data elements within the problem are well understood. Finally, the designer needs to understand which parts of the problem are most computationally intensive, because the effort to parallelize the problem should be focused on those parts.

After this analysis is complete, the patterns in the Finding Concurrency design space can be used to start designing a parallel algorithm. The patterns in this design space can be organized into three groups.

Decomposition Patterns. The two decomposition patterns, Task Decomposition and Data Decomposition, are used to decompose the problem into pieces that can execute concurrently.

Dependency Analysis Patterns. This group contains three patterns that help group the tasks and analyze the dependencies among them: Group Tasks, Order Tasks, and Data Sharing. Nominally, the patterns are applied in this order. In practice, however, it is often necessary to work back and forth between them, or possibly even revisit the decomposition patterns.

Design Evaluation Pattern. The final pattern in this space guides the algorithm designer through an analysis of what has been done so far before moving on to the patterns in the Algorithm Structure design space. This pattern is important because it often happens that the best design is not found on the first attempt, and the earlier design flaws are identified, the easier they are to correct. In general, working through the patterns in this space is an iterative process.

3.1.2. Using the Decomposition Patterns

The first step in designing a parallel algorithm is to decompose the problem into elements that can execute concurrently. We can think of this decomposition as occurring in two dimensions.

The task-decomposition dimension views the problem as a stream of instructions that can be broken into sequences called tasks that can execute simultaneously. For the computation to be efficient, the operations that make up the task should be largely independent of the operations taking place inside other tasks.

The data-decomposition dimension focuses on the data required by the tasks and how it can be decomposed into distinct chunks. The computation associated with the data chunks will only be efficient if the data chunks can be operated upon relatively independently.

Viewing the problem decomposition in terms of two distinct dimensions is somewhat artificial. A task decomposition implies a data decomposition and vice versa; hence, the two decompositions are really different facets of the same fundamental decomposition. We divide them into separate dimensions, however, because a problem decomposition usually proceeds most naturally by emphasizing one dimension of the decomposition over the other. By making them distinct, we make this design emphasis explicit and easier for the designer to understand.

3.1.3. Background for Examples

In this section, we give background information on some of the examples that are used in several patterns. It can be skipped for the time being and revisited later when reading a pattern that refers to one of the examples.

    Medical imaging

PET (Positron Emission Tomography) scans provide an important diagnostic tool by allowing physicians to observe how a radioactive substance propagates through a patient's body. Unfortunately, the images formed from the distribution of emitted radiation are of low resolution, due in part to the scattering of the radiation as it passes through the body. It is also difficult to reason from the absolute radiation intensities, because different pathways through the body attenuate the radiation differently.

To solve this problem, models of how radiation propagates through the body are used to correct the images. A common approach is to build a Monte Carlo model, as described by Ljungberg and King [LK98]. Randomly selected points within the body are assumed to emit radiation (usually a gamma ray), and the trajectory of each ray is followed. As a particle (ray) passes through the body, it is attenuated by the different organs it traverses, continuing until the particle leaves the body and hits a camera model, thereby defining a full trajectory. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.

This problem can be parallelized in two ways. Because each trajectory is independent, it is possible to parallelize the application by associating each trajectory with a task. This approach is discussed in the Examples section of the Task Decomposition pattern. Another approach would be to partition the body into sections and assign different sections to different processing elements. This approach is discussed in the Examples section of the Data Decomposition pattern.

    Linear algebra

Linear algebra is an important tool in applied mathematics: It provides the machinery required to analyze solutions of large systems of linear equations. The classic linear algebra problem asks, for matrix A and vector b, what values for x will solve the equation

Equation 3.1

    Ax = b

The matrix A in Eq. 3.1 takes on a central role in linear algebra. Many problems are expressed in terms of transformations of this matrix. These transformations are applied by means of a matrix multiplication

Equation 3.2

    C = TA

If T, A, and C are square matrices of order N, matrix multiplication is defined such that each element of the resulting matrix C is

Equation 3.3

    C_ij = Σ_{k=1..N} T_ik · A_kj

where the subscripts denote particular elements of the matrices. In other words, the element of the product matrix C in row i and column j is the dot product of the ith row of T and the jth column of A. Hence, computing each of the N^2 elements of C requires N multiplications and N-1 additions, making the overall complexity of matrix multiplication O(N^3).
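
In C-like code, Eq. 3.3 corresponds to the familiar triple loop. The sketch below is only an illustration (not code from the text), assuming square matrices of order N stored as ordinary two-dimensional arrays.

    /* Serial matrix multiplication C = T * A for order-N square matrices. */
    void matmul(int N, double T[N][N], double A[N][N], double C[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)      /* dot product of row i of T and column j of A */
                    sum += T[i][k] * A[k][j];
                C[i][j] = sum;                   /* N multiplications, N-1 additions per element */
            }
    }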

There are many ways to parallelize a matrix multiplication operation. It can be parallelized using either a task-based decomposition (as discussed in the Examples section of the Task Decomposition pattern) or a data-based decomposition (as discussed in the Examples section of the Data Decomposition pattern).

    Molecular dynamics

Molecular dynamics is used to simulate the motions of a large molecular system. For example, molecular dynamics simulations show how a large protein moves around and how differently shaped drugs might interact with the protein. Not surprisingly, molecular dynamics is extremely important in the pharmaceutical industry. It is also a useful test problem for computer scientists working on parallel computing: It is straightforward to understand, relevant to science at large, and difficult to parallelize effectively. As a result, it has been the subject of much research [Mat94, PH95, Pli95].

The basic idea is to treat a molecule as a large collection of balls connected by springs. The balls represent the atoms in the molecule, while the springs represent the chemical bonds between the atoms. The molecular dynamics simulation itself is an explicit time-stepping process. At each time step, the force on each atom is computed and then standard classical mechanics techniques are used to compute how the force moves the atoms. This process is carried out repeatedly to step through time and compute a trajectory for the molecular system.

The forces due to the chemical bonds (the "springs") are relatively simple to compute. These correspond to the vibrations and rotations of the chemical bonds themselves. These are short-range forces that can be computed with knowledge of the handful of atoms that share chemical bonds. The major difficulty arises because the atoms have partial electrical charges. Hence, while atoms only interact with a small neighborhood of atoms through their chemical bonds, the electrical charges cause every atom to apply a force on every other atom.

This is the famous N-body problem. On the order of N^2 terms must be computed to find these nonbonded forces. Because N is large (tens or hundreds of thousands) and the number of time steps in a simulation is huge (tens of thousands), the time required to compute these nonbonded forces dominates the computation. Several ways have been proposed to reduce the effort required to solve the N-body problem. We are only going to discuss the simplest one: the cutoff method.

The idea is simple. Even though each atom exerts a force on every other atom, this force decreases with the square of the distance between the atoms. Hence, it should be possible to pick a distance beyond which the force contribution is so small that it can be ignored. By ignoring the atoms that exceed this cutoff, the problem is reduced to one that scales as O(N × n), where n is the number of atoms within the cutoff volume, usually hundreds. The computation is still huge, and it dominates the overall runtime for the simulation, but at least the problem is tractable.

There are a host of details, but the basic simulation can be summarized as in Fig. 3.2.

The primary data structures hold the atomic positions (atoms), the velocities of each atom (velocities), the forces exerted on each atom (forces), and lists of atoms within the cutoff distance of each atom (neighbors). The program itself is a time-stepping loop, in which each iteration computes the short-range force terms, updates the neighbor lists, and then finds the nonbonded forces. After the force on each atom has been computed, a simple ordinary differential equation is solved to update the positions and velocities. Physical properties based on atomic motions are then updated, and we go to the next time step.

There are many ways to parallelize the molecular dynamics problem. We consider the most common approach, starting with the task decomposition (discussed in the Task Decomposition pattern) and following with the associated data decomposition (discussed in the Data Decomposition pattern). This example shows how the two decompositions fit together to guide the design of the parallel algorithm.

Figure 3.2. Pseudocode for the molecular dynamics example

Int const N // number of atoms

Array of Real :: atoms (3,N)      //3D coordinates
Array of Real :: velocities (3,N) //velocity vector
Array of Real :: forces (3,N)     //force in each dimension
Array of List :: neighbors(N)     //atoms in cutoff volume

loop over time steps
   vibrational_forces (N, atoms, forces)
   rotational_forces (N, atoms, forces)
   neighbor_list (N, atoms, neighbors)
   non_bonded_forces (N, atoms, neighbors, forces)
   update_atom_positions_and_velocities (N, atoms, velocities, forces)
   physical_properties ( ... Lots of stuff ... )
end loop

    3.2. THE TASK DECOMPOSITION PATTERN

Problem

How can a problem be decomposed into tasks that can execute concurrently?

Context

Every parallel algorithm design starts from the same point, namely a good understanding of the problem being solved. The programmer must understand which are the computationally intensive parts of the problem, the key data structures, and how the data is used as the problem's solution unfolds.

The next step is to define the tasks that make up the problem and the data decomposition implied by the tasks. Fundamentally, every parallel algorithm involves a collection of tasks that can execute concurrently. The challenge is to find these tasks and craft an algorithm that lets them run concurrently.

In some cases, the problem will naturally break down into a collection of independent (or nearly independent) tasks, and it is easiest to start with a task-based decomposition. In other cases, the tasks are difficult to isolate and the decomposition of the data (as discussed in the Data Decomposition pattern) is a better starting point. It is not always clear which approach is best, and often the algorithm designer needs to consider both.

Regardless of whether the starting point is a task-based or a data-based decomposition, however, a parallel algorithm ultimately needs tasks that will execute concurrently, so these tasks must be identified.

Forces

The main forces influencing the design at this point are flexibility, efficiency, and simplicity.

Flexibility. Flexibility in the design will allow it to be adapted to different implementation requirements. For example, it is usually not a good idea to narrow the options to a single computer system or style of programming at this stage of the design.

Efficiency. A parallel program is only useful if it scales efficiently with the size of the parallel computer (in terms of reduced runtime and/or memory utilization). For a task decomposition, this means we need enough tasks to keep all the PEs busy, with enough work per task to compensate for overhead incurred to manage dependencies. However, the drive for efficiency can lead to complex decompositions that lack flexibility.

Simplicity. The task decomposition needs to be complex enough to get the job done, but simple enough to let the program be debugged and maintained with reasonable effort.

Solution

The key to an effective task decomposition is to ensure that the tasks are sufficiently independent so that managing dependencies takes only a small fraction of the program's overall execution time. It is also important to ensure that the execution of the tasks can be evenly distributed among the ensemble of PEs (the load-balancing problem).

In an ideal world, the compiler would find the tasks for the programmer. Unfortunately, this almost never happens. Instead, it must usually be done by hand based on knowledge of the problem and the code required to solve it. In some cases, it might be necessary to completely recast the problem into a form that exposes relatively independent tasks.

In a task-based decomposition, we look at the problem as a collection of distinct tasks, paying particular attention to

The actions that are carried out to solve the problem. (Are there enough of them to keep the processing elements on the target machines busy?)

Whether these actions are distinct and relatively independent.

As a first pass, we try to identify as many tasks as possible; it is much easier to start with too many tasks and merge them later on than to start with too few tasks and later try to split them.

Tasks can be found in many different places.

In some cases, each task corresponds to a distinct call to a function. Defining a task for each function call leads to what is sometimes called a functional decomposition.

Another place to find tasks is in distinct iterations of the loops within an algorithm. If the iterations are independent and there are enough of them, then it might work well to base a task decomposition on mapping each iteration onto a task. This style of task-based decomposition leads to what are sometimes called loop-splitting algorithms.

Tasks also play a key role in data-driven decompositions. In this case, a large data structure is decomposed and multiple units of execution concurrently update different chunks of the data structure. The tasks are then the updates on the individual chunks.

Also keep in mind the forces given in the Forces section:

Flexibility. The design needs to be flexible in the number of tasks generated. Usually this is done by parameterizing the number and size of tasks on some appropriate dimension. This will let the design be adapted to a wide range of parallel computers with different numbers of processors.

Efficiency. There are two major efficiency issues to consider in the task decomposition. First, each task must include enough work to compensate for the overhead incurred by creating the tasks and managing their dependencies. Second, the number of tasks should be large enough so that all the units of execution are busy with useful work throughout the computation.

Simplicity. Tasks should be defined in a way that makes debugging and maintenance simple. When possible, tasks should be defined so they reuse code from existing sequential programs that solve related problems.

After the tasks have been identified, the next step is to look at the data decomposition implied by the tasks. The Data Decomposition pattern may help with this analysis.

    Examples

    Medical imaging

Consider the medical imaging problem described in Sec. 3.1.3. In this application, a point inside a model of the body is selected randomly, a radioactive decay is allowed to occur at this point, and the trajectory of the emitted particle is followed. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.

It is natural to associate a task with each trajectory. These tasks are particularly simple to manage concurrently because they are completely independent. Furthermore, there are large numbers of trajectories, so there will be many tasks, making this decomposition suitable for a large range of computer systems, from a shared-memory system with a small number of processing elements to a large cluster with hundreds of processing elements.
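
A minimal OpenMP sketch of this task decomposition follows. It is not the book's code; the types BodyModel, Trajectory, and Image and the helper routines are hypothetical stand-ins for the application's real data structures.

    #include <omp.h>

    typedef struct BodyModel BodyModel;                 /* read-only model of the body */
    typedef struct Image Image;                         /* accumulated detector image  */
    typedef struct { double pos[3], dir[3], energy; } Trajectory;

    Trajectory start_random_trajectory(const BodyModel *model, long seed);
    void follow_trajectory(const BodyModel *model, Trajectory *t);
    void record_hit(Image *image, const Trajectory *t); /* must be made thread-safe */

    void simulate(long num_trajectories, const BodyModel *model, Image *image)
    {
        /* One task per trajectory; the iterations are completely independent. */
        #pragma omp parallel for schedule(dynamic)
        for (long t = 0; t < num_trajectories; t++) {
            Trajectory traj = start_random_trajectory(model, t);
            follow_trajectory(model, &traj);
            record_hit(image, &traj);
        }
    }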

With the basic tasks defined, we now consider the corresponding data decomposition; that is, we define the data associated with each task. Each task needs to hold the information defining the trajectory. But that is not all: The tasks need access to the model of the body as well. Although it might not be apparent from our description of the problem, the body model can be extremely large. Because it is a read-only model, this is no problem if there is an effective shared-memory system; each task can read data as needed. If the target platform is based on a distributed-memory architecture, however, the body model will need to be replicated on each PE. This can be very time-consuming and can waste a great deal of memory. For systems with small memories per PE and/or with slow networks between PEs, a decomposition of the problem based on the body model might be more effective.

This is a common situation in parallel programming: Many problems can be decomposed primarily in terms of data or primarily in terms of tasks. If a task-based decomposition avoids the need to break up and distribute complex data structures, it will be a much simpler program to write and debug. On the other hand, if memory and/or network bandwidth is a limiting factor, a decomposition that focuses on the data might be more effective. It is not so much a matter of one approach being "better" than another as a matter of balancing the needs of the machine with the needs of the programmer. We discuss this in more detail in the Data Decomposition pattern.

    Matrix multiplication

Consider the multiplication of two matrices (C = AB), as described in Sec. 3.1.3. We can produce a task-based decomposition of this problem by considering the calculation of each element of the product matrix as a separate task. Each task needs access to one row of A and one column of B. This decomposition has the advantage that all the tasks are independent, and because all the data that is shared among tasks (A and B) is read-only, it will be straightforward to implement in a shared-memory environment.

The performance of this algorithm, however, would be poor. Consider the case where the three matrices are square and of order N. For each element of C, N elements from A and N elements from B would be required, resulting in 2N memory references for N multiply/add operations. Memory access time is slow compared to floating-point arithmetic, so the bandwidth of the memory subsystem would limit the performance.

A better approach would be to design an algorithm that maximizes reuse of data loaded into a processor's caches. We can arrive at this algorithm in two different ways. First, we could group together the element-wise tasks we defined earlier so the tasks that use similar elements of the A and B matrices run on the same UE (see the Group Tasks pattern). Alternatively, we could start with the data decomposition and design the algorithm from the beginning around the way the matrices fit into the caches. We discuss this example further in the Examples section of the Data Decomposition pattern.
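
One simple way to group the element-wise tasks, sketched below with OpenMP (not code from the book), is to assign all the elements of a row of C to the same UE, so the corresponding row of A is reused from that processor's cache while the row is computed.

    #include <omp.h>

    /* C = A * B for order-N square matrices; the element tasks are grouped by row. */
    void matmul_by_rows(int N, double A[N][N], double B[N][N], double C[N][N])
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)              /* one group of element tasks per row of C */
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[i][k] * B[k][j];    /* row i of A is reused across all j */
                C[i][j] = sum;
            }
    }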

    Molecular dynamics

Consider the molecular dynamics problem described in Sec. 3.1.3. Pseudocode for this example is shown again in Fig. 3.3.

Before performing the task decomposition, we need to better understand some details of the problem. First, the neighbor_list() computation is time-consuming. The gist of the computation is a loop over each atom, inside of which every other atom is checked to determine whether it falls within the indicated cutoff volume. Fortunately, the time steps are very small, and the atoms don't move very much in any given time step. Hence, this time-consuming computation is only carried out every 10 to 100 steps.

    Figure 3.3. Pseudocode for the molecular dynamics example

Int const N // number of atoms

Array of Real :: atoms (3,N)      //3D coordinates
Array of Real :: velocities (3,N) //velocity vector
Array of Real :: forces (3,N)     //force in each dimension
Array of List :: neighbors(N)     //atoms in cutoff volume

loop over time steps
   vibrational_forces (N, atoms, forces)
   rotational_forces (N, atoms, forces)
   neighbor_list (N, atoms, neighbors)
   non_bonded_forces (N, atoms, neighbors, forces)
   update_atom_positions_and_velocities (N, atoms, velocities, forces)
   physical_properties ( ... Lots of stuff ... )
end loop

Second, the physical_properties() function computes energies, correlation coefficients, and a host of interesting physical properties. These computations, however, are simple and do not significantly affect the program's overall runtime, so we will ignore them in this discussion.

Because the bulk of the computation time will be in non_bonded_forces(), we must pick a problem decomposition that makes that computation run efficiently in parallel. The problem is made easier by the fact that each of the functions inside the time loop has a similar structure: In the sequential version, each function includes a loop over atoms to compute contributions to the force vector. Thus, a natural task definition is the update required by each atom, which corresponds to a loop iteration in the sequential version. After performing the task decomposition, therefore, we obtain the following tasks.

Tasks that find the vibrational forces on an atom

Tasks that find the rotational forces on an atom

Tasks that find the nonbonded forces on an atom

Tasks that update the position and velocity of an atom

A task to update the neighbor list for all the atoms (which we will leave sequential)

With our collection of tasks in hand, we can consider the accompanying data decomposition. The key data structures are the neighbor list, the atomic coordinates, the atomic velocities, and the force vector. Every iteration that updates the force vector needs the coordinates of a neighborhood of atoms. The computation of nonbonded forces, however, potentially needs the coordinates of all the atoms, because the molecule being simulated might fold back on itself in unpredictable ways. We will use this information to carry out the data decomposition (in the Data Decomposition pattern) and the data-sharing analysis (in the Data Sharing pattern).
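
To make the per-atom tasks concrete, here is a minimal OpenMP sketch of the nonbonded force loop. It is only an illustration of the task decomposition, not the book's implementation; the NeighborList layout and the pairwise_force helper are assumptions, and only forces[i] is written inside iteration i so that the iterations stay independent.

    #include <omp.h>

    typedef struct { int count; int *index; } NeighborList;   /* assumed layout */

    void pairwise_force(const double a[3], const double b[3], double f[3]);

    void non_bonded_forces(int N, const double (*atoms)[3],
                           const NeighborList *neighbors, double (*forces)[3])
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {                 /* one task per atom i */
            for (int k = 0; k < neighbors[i].count; k++) {
                int j = neighbors[i].index[k];
                double f[3];
                pairwise_force(atoms[i], atoms[j], f);
                forces[i][0] += f[0];                 /* only atom i's force is updated here */
                forces[i][1] += f[1];
                forces[i][2] += f[2];
            }
        }
    }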

    Known uses

Task-based decompositions are extremely common in parallel computing. For example, the distance geometry code DGEOM [Mat96] uses a task-based decomposition, as does the parallel WESDYN molecular dynamics program [MR95].

    3.3. THE DATA DECOMPOSITION PATTERN

Problem

How can a problem's data be decomposed into units that can be operated on relatively independently?

Context

The parallel algorithm designer must have a detailed understanding of the problem being solved. In addition, the designer should identify the most computationally intensive parts of the problem, the key data structures required to solve the problem, and how data is used as the problem's solution unfolds.

After the basic problem is understood, the parallel algorithm designer should consider the tasks that make up the problem and the data decomposition implied by the tasks. Both the task and data decompositions need to be addressed to create a parallel algorithm. The question is not which decomposition to do. The question is which one to start with. A data-based decomposition is a good starting point if the following is true.

The most computationally intensive part of the problem is organized around the manipulation of a large data structure.

Similar operations are being applied to different parts of the data structure, in such a way that the different parts can be operated on relatively independently.

For example, many linear algebra problems update large matrices, applying a similar set of operations to each element of the matrix. In these cases, it is straightforward to drive the parallel algorithm design by looking at how the matrix can be broken up into blocks that are updated concurrently. The task definitions then follow from how the blocks are defined and mapped onto the processing elements of the parallel computer.

Forces

The main forces influencing the design at this point are flexibility, efficiency, and simplicity.

Flexibility. Flexibility will allow the design to be adapted to different implementation requirements. For example, it is usually not a good idea to narrow the options to a single computer system or style of programming at this stage of the design.

Efficiency. A parallel program is only useful if it scales efficiently with the size of the parallel computer (in terms of reduced runtime and/or memory utilization).

Simplicity. The decomposition needs to be complex enough to get the job done, but simple enough to let the program be debugged and maintained with reasonable effort.

Solution

In shared-memory programming environments such as OpenMP, the data decomposition will frequently be implied by the task decomposition. In most cases, however, the decomposition will need to be done by hand, because the memory is physically distributed, because data dependencies are too complex without explicitly decomposing the data, or to achieve acceptable efficiency on a NUMA computer.

If a task-based decomposition has already been done, the data decomposition is driven by the needs of each task. If well-defined and distinct data can be associated with each task, the decomposition should be simple.

When starting with a data decomposition, however, we need to look not at the tasks, but at the central data structures defining the problem, and consider whether they can be broken down into chunks that can be operated on concurrently. A few common examples include the following.

Array-based computations. Concurrency can be defined in terms of updates of different segments of the array. If the array is multidimensional, it can be decomposed in a variety of ways (rows, columns, or blocks of varying shapes).

Recursive data structures. We can think of, for example, decomposing the parallel update of a large tree data structure by decomposing the data structure into subtrees that can be updated concurrently.

Regardless of the nature of the underlying data structure, if the data decomposition is the primary factor driving the solution to the problem, it serves as the organizing principle of the parallel algorithm.

When considering how to decompose the problem's data structures, keep in mind the competing forces.

Flexibility. The size and number of data chunks should be flexible to support the widest range of parallel systems. One approach is to define chunks whose size and number are controlled by a small number of parameters. These parameters define granularity knobs that can be varied to modify the size of the data chunks to match the needs of the underlying hardware. (Note, however, that many designs are not infinitely adaptable with respect to granularity.)
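
For a one-dimensional array, such a granularity knob can be as simple as the number of chunks. The helper below is a small illustrative sketch (not from the text) that computes the half-open index range [lo, hi) owned by a given chunk, handling the case where the array size is not evenly divisible.

    /* Divide n elements into nchunks contiguous blocks; chunk sizes differ by at most one. */
    void chunk_bounds(long n, int nchunks, int chunk, long *lo, long *hi)
    {
        long base = n / nchunks;
        long rem  = n % nchunks;              /* the first rem chunks get one extra element */
        *lo = chunk * base + (chunk < rem ? chunk : rem);
        *hi = *lo + base + (chunk < rem ? 1 : 0);
    }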

The easiest place to see the impact of granularity on the data decomposition is in the overhead required to manage dependencies between chunks. The time required to manage dependencies must be small compared to the overall runtime. In a good data decomposition, the dependencies scale at a lower dimension than the computational effort associated with each chunk. For example, in many finite difference programs, the cells at the boundaries between chunks, that is, the surfaces of the chunks, must be shared. The size of the set of dependent cells scales as the surface area, while the effort required in the computation scales as the volume of the chunk. This means that the computational effort can be scaled (based on the chunk's volume) to offset overheads associated with data dependencies (based on the surface area of the chunk).
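
As a rough illustration with hypothetical numbers: a cubic chunk of 10 × 10 × 10 cells requires on the order of 10^3 = 1,000 cell updates per step but shares on the order of 6 × 10^2 = 600 surface cells, a ratio of about 0.6; growing the chunk to 100 × 100 × 100 cells gives about 1,000,000 updates against roughly 60,000 surface cells, a ratio of about 0.06. Making the chunks larger therefore makes the dependency-management overhead relatively smaller.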

Efficiency. It is important that the data chunks be large enough that the amount of work to update the chunk offsets the overhead of managing dependencies. A more subtle issue to consider is how the chunks map onto UEs. An effective parallel algorithm must balance the load between UEs. If this isn't done well, some PEs might have a disproportionate amount of work, and the overall scalability will suffer. This may require clever ways to break up the problem. For example, if the problem clears the columns in a matrix from left to right, a column mapping of the matrix will cause problems as the UEs with the leftmost columns will finish their work before the others. A row-based block decomposition or even a block-cyclic decomposition (in which rows are assigned cyclically to PEs) would do a much better job of keeping all the processors fully occupied. These issues are discussed in more detail in the Distributed Array pattern.

Simplicity. Overly complex data decompositions can be very difficult to debug. A data decomposition will usually require a mapping of a global index space onto a task-local index space. Making this mapping abstract allows it to be easily isolated and tested.

After the data has been decomposed, if it has not already been done, the next step is to look at the task decomposition implied by the data decomposition. The Task Decomposition pattern may help with this analysis.

    Examples

    Medical imaging

Consider the medical imaging problem described in Sec. 3.1.3. In this application, a point inside a model of the body is selected randomly, a radioactive decay is allowed to occur at this point, and the trajectory of the emitted particle is followed. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.

In a data-based decomposition of this problem, the body model is the large central data structure around which the computation can be organized. The model is broken into segments, and one or more segments are associated with each processing element. The body segments are only read, not written, during the trajectory computations, so there are no data dependencies created by the decomposition of the body model.

After the data has been decomposed, we need to look at the tasks associated with each data segment. In this case, each trajectory passing through the data segment defines a task. The trajectories are initiated and propagated within a segment. When a segment boundary is encountered, the trajectory must be passed between segments. It is this transfer that defines the dependencies between data chunks.

On the other hand, in a task-based approach to this problem (as discussed in the Task Decomposition pattern), the trajectories for each particle drive the algorithm design. Each PE potentially needs to access the full body model to service its set of trajectories. In a shared-memory environment, this is easy because the body model is a read-only data set. In a distributed-memory environment, however, this would require substantial startup overhead as the body model is broadcast across the system.

This is a common situation in parallel programming: Different points of view lead to different algorithms with potentially very different performance characteristics. The task-based algorithm is simple, but it only works if each processing element has access to a large memory and if the overhead incurred loading the data into memory is insignificant compared to the program's runtime. An algorithm driven by a data decomposition, on the other hand, makes efficient use of memory and (in distributed-memory environments) less use of network bandwidth, but it incurs more communication overhead during the concurrent part of computation and is significantly more complex. Choosing which is the appropriate approach can be difficult and is discussed further in the Design Evaluation pattern.

    Matrix multiplication

Consider the standard multiplication of two matrices