EECS 570 Lecture 2: Message Passing & Shared Memory
TRANSCRIPT
Lecture 2 Slide 1EECS 570
EECS 570 Lecture 2: Message Passing & Shared Memory, Winter 2018
Prof. Satish Narayanasamy
http://www.eecs.umich.edu/courses/eecs570/
Slides developed in part by Drs. Adve, Falsafi, Martin, Musuvathi, Narayanasamy, Nowatzyk, Wenisch, Sarkar, Mikko Lipasti, Jim Smith, John Shen, Mark Hill, David Wood, Guri Sohi, Natalie Enright Jerger, Michel Dubois, Murali Annavaram, Per Stenström, and probably others
Intel Paragon XP/S
Lecture 2 Slide 2EECS 570
Announcements
• Discussion this Friday
❒ Will discuss the course project
Lecture 2 Slide 3EECS 570
Readings
For today:
❒ David Wood and Mark Hill. “Cost-Effective Parallel Computing,” IEEE Computer, 1995.
❒ Mark Hill et al. “21st Century Computer Architecture.” CCC White Paper, 2012.
For Wednesday 1/13:
❒ Seiler et al. “Larrabee: A Many-Core x86 Architecture for Visual Computing.” SIGGRAPH 2008.
Lecture 2 Slide 4EECS 570
Sequential Merge Sort
16 MB input (32-bit integers)
[Figure: sequential execution over time: recurse (left); recurse (right); merge to scratch array; copy back to input array.]
Lecture 2 Slide 5EECS 570
Parallel Merge Sort (as Parallel Directed Acyclic Graph)
16 MB input (32-bit integers)
[Figure: parallel execution: recurse (left) and recurse (right) proceed in parallel, then merge to scratch array, then copy back to input array.]
Lecture 2 Slide 6EECS 570
Parallel DAG for Merge Sort (2-core)
[Figure: two sequential sorts run in parallel, followed by a single merge.]
Lecture 2 Slide 7EECS 570
Parallel DAG for Merge Sort (4-core)
Lecture 2 Slide 8EECS 570
Parallel DAG for Merge Sort (8-core)
Lecture 2 Slide 9EECS 570
The DAG Execution Model of a Parallel Computation
• Given an input, dynamically create a DAG
• Nodes represent sequential computation
❒ Weighted by the amount of work
• Edges represent dependencies:
❒ Node A → Node B means that B cannot be scheduled unless A is finished
(A fork-join code sketch of this structure follows below.)
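For concreteness, here is a minimal fork-join sketch of the merge sort DAG above, written with OpenMP tasks. This is an illustrative sketch rather than code from the course; the CUTOFF constant and the input size are assumptions chosen to mirror the 16 MB example.

/* Minimal parallel merge sort sketch using OpenMP tasks (illustrative only;
 * the recursion matches the DAG above: recurse left and right in parallel,
 * then merge sequentially). Compile with: gcc -fopenmp msort.c            */
#include <stdlib.h>
#include <string.h>

#define CUTOFF 4096   /* below this size, recurse sequentially (assumed constant) */

static void merge(int *a, int *tmp, int lo, int mid, int hi) {
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(int));   /* copy back to input array */
}

static void msort(int *a, int *tmp, int lo, int hi) {
    if (hi - lo < 2) return;
    int mid = lo + (hi - lo) / 2;
    if (hi - lo < CUTOFF) {              /* small node: plain sequential recursion */
        msort(a, tmp, lo, mid);
        msort(a, tmp, mid, hi);
    } else {
        #pragma omp task shared(a, tmp)
        msort(a, tmp, lo, mid);          /* recurse (left)  */
        msort(a, tmp, mid, hi);          /* recurse (right) */
        #pragma omp taskwait             /* join: both children must finish */
    }
    merge(a, tmp, lo, mid, hi);          /* merge to scratch array, copy back */
}

int main(void) {
    int n = 1 << 22;                     /* 16 MB of 32-bit integers */
    int *a = malloc(n * sizeof(int)), *tmp = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) a[i] = rand();
    #pragma omp parallel
    #pragma omp single
    msort(a, tmp, 0, n);
    free(a); free(tmp);
    return 0;
}

With this structure the two recursive calls are independent DAG nodes, while the taskwait plus merge form the join; the work is unchanged, but the depth shrinks to one recursion chain plus the chain of merges.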
Lecture 2 Slide 10EECS 570
Sorting 16 elements on four cores
Lecture 2 Slide 11EECS 570
Sorting 16 elements on four cores (4-element arrays sorted in constant time)
[Figure: the parallel DAG annotated with node weights: weight 1 for each constant-time split/leaf-sort step, weight 8 for each merge of 8 elements, and weight 16 for the final merge of 16 elements.]
Lecture 2 Slide 12EECS 570
Performance Measures
• Given a graph G, a scheduler S, and P processors
• TP(S): time on P processors using scheduler S
• TP: time on P processors using the best scheduler
• T1: time on a single processor (sequential cost)
• T∞: time assuming infinite resources
Lecture 2 Slide 13EECS 570
Work and Depth
• T1 = Work
❒ The total number of operations executed by a computation
• T∞ = Depth
❒ The longest chain of sequential dependencies (the critical path) in the parallel DAG
Lecture 2 Slide 14EECS 570
T∞ (Depth): Critical Path Length (Sequential Bottleneck)
Lecture 2 Slide 15EECS 570
T1 (work): Time to Run Sequentially
Lecture 2 Slide 16EECS 570
Sorting 16 elements on four cores (4-element arrays sorted in constant time)
[Figure: the same parallel DAG annotated with node weights: weight 1 for each constant-time step, weight 8 for each merge of 8 elements, and weight 16 for the final merge of 16 elements.]
Work = ?   Depth = ?
Lecture 2 Slide 17EECS 570
Some Useful Theorems
Lecture 2 Slide 18EECS 570
Work Law
• “You cannot avoid work by parallelizing”
T1 / P ≤ TP
Lecture 2 Slide 19EECS 570
Work Law
• “You cannot avoid work by parallelizing”
T1 / P ≤ TP
Speedup = T1 / TP
Lecture 2 Slide 20EECS 570
Work Law
• “You cannot avoid work by parallelizing”
• Can speedup be more than 2 when we go from 1 core to 2 cores in practice?
T1 / P ≤ TP
Speedup = T1 / TP
Lecture 2 Slide 21EECS 570
Depth Law
• More resources should make things faster
• You are limited by the sequential bottleneck
TP ≥ T∞
Lecture 2 Slide 22EECS 570
Amount of Parallelism
Parallelism = T1 / T∞
Lecture 2 Slide 23EECS 570
Maximum Speedup Possible
“Speedup is bounded above by available parallelism”
Speedup = T1 / TP ≤ T1 / T∞ = Parallelism
Lecture 2 Slide 24EECS 570
Greedy Scheduler
• If more than P nodes can be scheduled, pick any subset of size P
• If fewer than P nodes can be scheduled, schedule them all
(A small simulation of this rule appears below.)
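As a concrete illustration (not from the slides), the following sketch simulates the greedy rule on a small, hypothetical unit-work DAG with P = 2 processors; the DAG, its edge list, and the processor count are made up for the example.

/* Simulate a greedy scheduler on a small DAG with unit-work nodes. */
#include <stdio.h>

#define N 7   /* hypothetical DAG size    */
#define P 2   /* hypothetical processors  */

int main(void) {
    /* Hypothetical DAG: edges[i][j] = 1 means node i must finish before j. */
    int edges[N][N] = {0};
    edges[0][1] = edges[0][2] = 1;
    edges[1][3] = edges[1][4] = 1;
    edges[2][5] = 1;
    edges[3][6] = edges[4][6] = edges[5][6] = 1;

    int done[N] = {0}, remaining = N, steps = 0;
    while (remaining > 0) {
        /* Collect every ready node: not done, all predecessors done. */
        int ready[N], nready = 0;
        for (int v = 0; v < N; v++) {
            if (done[v]) continue;
            int ok = 1;
            for (int u = 0; u < N; u++)
                if (edges[u][v] && !done[u]) ok = 0;
            if (ok) ready[nready++] = v;
        }
        /* Greedy step: run up to P ready nodes in this time step. */
        int run = nready < P ? nready : P;
        for (int i = 0; i < run; i++) done[ready[i]] = 1;
        remaining -= run;
        steps++;
    }
    printf("T_%d (greedy) = %d time steps\n", P, steps);
    return 0;
}

Each loop iteration is one time step: the scheduler gathers every ready node and runs up to P of them, which is exactly the rule stated above.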
Lecture 2 Slide 25EECS 570
More reading: http://www.cs.cmu.edu/afs/cs/academic/class/15492-f07/www/scribe/lec4/lecture4.pdf
Lecture 2 Slide 26EECS 570
Performance of the Greedy Scheduler
Work law: T1 / P ≤ TP
Depth law: T∞ ≤ TP
TP(Greedy) ≤ T1 / P + T∞
(At each time step a greedy scheduler either keeps all P processors busy, which can happen at most T1 / P times, or it runs every ready node and shortens the remaining critical path, which can happen at most T∞ times.)
Lecture 2 Slide 27EECS 570
Greedy is optimal within factor of 2
Work law: T1 / P ≤ TP
Depth law: T∞ ≤ TP
TP ≤ TP(Greedy) ≤ 2 TP
Lecture 2 Slide 28EECS 570
Work/Depth of Merge Sort (Sequential Merge)
• Work T1: O(n log n)
• Depth T∞: O(n)
❒ Takes O(n) time to merge n elements
❒ The two recursive sorts run in parallel, but each level’s merge is sequential, so T∞(n) = T∞(n/2) + O(n) = O(n)
• Parallelism:
❒ T1 / T∞ = O(log n) → really bad!
Lecture 2 Slide 29EECS 570
Main Message
• Analyze the Work and Depth of your algorithm
• Parallelism is Work / Depth
• Try to decrease Depth
❒ the critical path
❒ a sequential bottleneck
• If you increase Depth
❒ better increase Work by a lot more!
Lecture 2 Slide 30EECS 570
Amdahl’s law
• Sorting takes 70% of the execution time of a sequential program
• You replace the sorting algorithm with one that scales perfectly on multi-core hardware
• How many cores do you need to get a 4× speed-up on the program?
Lecture 2 Slide 31EECS 570
Amdahl’s law, 𝑓 = 70%
f = the parallel portion of execution
1 - f = the sequential portion of execution
c = number of cores used
Speedup(f, c) = 1 / ((1 - f) + f / c)
(This formula is evaluated in the sketch below.)
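A small sketch (not from the slides) that evaluates this formula also answers the question from the previous slide: with f = 0.70 the limit as c → ∞ is 1 / (1 - f) ≈ 3.33, so no number of cores reaches a 4× speedup.

/* Evaluate Amdahl's law: Speedup(f, c) = 1 / ((1 - f) + f / c). */
#include <stdio.h>

static double amdahl(double f, double c) {
    return 1.0 / ((1.0 - f) + f / c);
}

int main(void) {
    double f = 0.70;                       /* parallel fraction from the example */
    for (int c = 1; c <= 16; c *= 2)
        printf("c = %2d  speedup = %.2f\n", c, amdahl(f, c));
    printf("limit (c -> infinity) = %.2f\n", 1.0 / (1.0 - f));
    /* The limit is ~3.33x, so no core count yields the desired 4x. */
    return 0;
}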
Lecture 2 Slide 32EECS 570
Amdahl’s law, 𝑓 = 70%
[Plot: speedup vs. number of cores (1 to 16). Speedup achieved with perfect scaling on the 70% parallel portion stays below the desired 4× speedup line.]
Lecture 2 Slide 33EECS 570
Amdahl’s law, 𝑓 = 70%
[Plot: same speedup curve as the previous slide, annotated with its asymptote: limit as c → ∞ = 1 / (1 - f) = 3.33.]
Lecture 2 Slide 34EECS 570
Amdahl’s law, 𝑓 = 10%
[Plot: speedup vs. number of cores (1 to 16) for f = 10%. Speedup achieved with perfect scaling approaches the Amdahl’s law limit of just 1.11×.]
Lecture 2 Slide 35EECS 570
Amdahl’s law, 𝑓 = 98%
[Plot: speedup vs. number of cores (1 to 128) for f = 98%; the Amdahl’s law limit is 1 / (1 - f) = 50×.]
Lecture 2 Slide 36EECS 570
Lesson
• Speedup is limited by sequential code
• Even a small percentage of sequential code can greatly limit potential speedup
Lecture 2 Slide 37EECS 570
21st Century Computer Architecture
A CCC community white paper: http://cra.org/ccc/docs/init/21stcenturyarchitecturewhitepaper.pdf
Slides from M. Hill, HPCA 2014 Keynote
Lecture 2 Slide 38EECS 570
20th Century ICT Set Up
• Information & Communication Technology (ICT) has changed our world
❒ <long list omitted>
• Required innovations in algorithms, applications, programming languages, …, & system software
• Key (invisible) enablers of (cost-)performance gains
❒ Semiconductor technology (“Moore’s Law”)
❒ Computer architecture (~80× per Danowitz et al.)
Lecture 2 Slide 39EECS 570
Enablers: Technology + Architecture
[Figure: Danowitz et al., CACM 04/2012, Figure 1, with the gains labeled Technology and Architecture.]
Lecture 2 Slide 40EECS 570
21st Century ICT Promises More
Data-centric personalized healthcare; computation-driven scientific discovery; human network analysis; much more, known & unknown
Lecture 2 Slide 41EECS 570
21st Century App Characteristics
BIG DATA, ALWAYS ONLINE, SECURE/PRIVATE
Whither enablers of future (cost-)performance gains?
Lecture 2 Slide 42EECS 570
Technology’s Challenges 1/2
Late 20th Century → The New Reality:
• Moore’s Law (2× transistors/chip) → Transistor count still 2×, BUT…
• Dennard Scaling (~constant power/chip) → Gone. Can’t repeatedly double power/chip
Lecture 2 Slide 43EECS 570
Technology’s Challenges 2/2
Late 20th Century → The New Reality:
• Moore’s Law (2× transistors/chip) → Transistor count still 2×, BUT…
• Dennard Scaling (~constant power/chip) → Gone. Can’t repeatedly double power/chip
• Modest (hidden) transistor unreliability → Increasing transistor unreliability can’t be hidden
• Focus on computation over communication → Communication (energy) more expensive than computation
• One-time costs amortized via mass market → One-time cost much worse & want specialized platforms
How should architects step up as technology falters?
Lecture 2 Slide 44EECS 570
“Timeline” from DARPA ISAT
[Figure: system capability (log scale) plotted from the 1980s through the 2050s, with a “Fallow Period” marked.]
Source: Advancing Computer Systems without Technology Progress, ISAT Outbrief (http://www.cs.wisc.edu/~markhill/papers/isat2012_ACSWTP.pdf)
Mark D. Hill and Christos Kozyrakis, DARPA/ISAT Workshop, March 26-27, 2012. Approved for Public Release, Distribution Unlimited.
The views expressed are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government.
Lecture 2 Slide 49EECS 570
21st Century Comp Architecture
20th Century → 21st Century:
• Single-chip in stand-alone computer → Architecture as infrastructure: spanning sensors to clouds; performance + security, privacy, availability, programmability, …
• Performance via invisible instruction-level parallelism → Energy first: parallelism, specialization, cross-layer design
• Predictable technologies (CMOS, DRAM, & disks) → New technologies (non-volatile memory, near-threshold, 3D, photonics, …); rethink memory & storage, reliability, communication
• Cross-cutting: break current layers with new interfaces
Lecture 2 Slide 50EECS 570
Cost-Effective Computing [Wood & Hill, IEEE Computer 1995]
• Premise: isn’t Speedup(P) < P inefficient?
❒ If only throughput matters, use P computers instead…
• Key observation: much of a computer’s cost is NOT the CPU
Let Costup(P) = Cost(P) / Cost(1)
Parallel computing is cost-effective if:
❒ Speedup(P) > Costup(P)
• E.g., for an SGI PowerChallenge w/ 500 MB
❒ Costup(32) = 8.6, so 32 processors are cost-effective whenever Speedup(32) > 8.6
(A small worked example with hypothetical costs follows below.)
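To make the criterion concrete, here is a tiny sketch with hypothetical costs: the $40,000 non-CPU cost and $2,000 per-processor cost below are invented for illustration and are not the SGI figures from the paper.

/* Wood & Hill's cost-effectiveness criterion with made-up cost numbers:
 * parallel computing pays off whenever Speedup(P) > Costup(P).          */
#include <stdio.h>

#define BASE_COST 40000.0   /* hypothetical non-CPU cost (memory, I/O, ...) */
#define CPU_COST   2000.0   /* hypothetical cost per processor             */

static double costup(int p) {
    double cost1 = BASE_COST + CPU_COST;
    double costp = BASE_COST + p * CPU_COST;
    return costp / cost1;   /* Costup(P) = Cost(P) / Cost(1) */
}

int main(void) {
    for (int p = 2; p <= 32; p *= 2)
        printf("P = %2d  Costup = %.2f  (cost-effective if Speedup > Costup)\n",
               p, costup(p));
    return 0;
}

With these made-up numbers Costup(32) is only about 2.5, so even a modest Speedup(32) of 3 would already be cost-effective.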
Lecture 2 Slide 51EECS 570
Parallel Programming Models and Interfaces
Lecture 2 Slide 52EECS 570
Programming Models
• High-level paradigm for expressing an algorithm
❒ Examples:
❍ Functional
❍ Sequential, procedural
❍ Shared memory
❍ Message passing
• Embodied in languages that support concurrent execution
❒ Incorporated into language constructs
❒ Incorporated as libraries added to an existing sequential language
• Top-level features (for conventional models: shared memory, message passing):
❒ Multiple threads are conceptually visible to the programmer
❒ Communication/synchronization are visible to the programmer
Lecture 2 Slide 53EECS 570
An Incomplete Taxonomy
MP Systems:
❒ Shared Memory: bus-based or switching-fabric-based; incoherent or coherent
❒ Message Passing: COTS-based or switching-fabric-based
Emerging & Lunatic Fringe: VLIW, SIMD, Vector, DataFlow, GPU, MapReduce/WSC, Systolic Array, Reconfigurable/FPGA, …
Lecture 2 Slide 54EECS 570
Programming Model Elements
• For both Shared Memory and Message Passing
• Processes and threads
❒ Process: a shared address space and one or more threads of control
❒ Thread: a program sequencer and private address space
❒ Task: less formal term; part of an overall job
❒ Created, terminated, scheduled, etc.
• Communication
❒ Passing of data
• Synchronization
❒ Communicating control information
❒ To assure reliable, deterministic communication
Lecture 2 Slide 55EECS 570
Historical View
[Figure: three machine organizations, each built from processors (P), memories (M), and I/O; they differ in where the parts join and in the programming model used:]
❒ Join at I/O (network) → program with message passing
❒ Join at memory → program with shared memory
❒ Join at processor → program with dataflow, SIMD, VLIW, CUDA, other data parallel
Lecture 2 Slide 56EECS 570
Message Passing Programming Model
Lecture 2 Slide 57EECS 570
Message Passing Programming Model
• User-level send/receive abstraction
❒ Match via local buffer (x, y), process (Q, P), and tag (t)
❒ Need naming/synchronization conventions
[Figure: process P issues send x,Q,t from address x in its local address space; process Q issues recv y,P,t into address y in its local address space; the send and receive are matched by process and tag.]
Lecture 2 Slide 58EECS 570
Message Passing Architectures
• Cannot directly access memory of another node
• IBM SP-2, Intel Paragon, Myrinet, Quadrics QSW
• Cluster of workstations (e.g., MPI on the Flux cluster)
[Figure: nodes, each with CPU ($), memory, and NIC, connected by an interconnect.]
Lecture 2 Slide 59EECS 570
MPI: Message Passing Interface API
• A widely used standard
❒ For a variety of distributed-memory systems
• SMP clusters, workstation clusters, MPPs, heterogeneous systems
• Also works on shared-memory MPs
❒ Easy to emulate distributed memory on shared-memory HW
• Can be used with a number of high-level languages
• Available on the Flux cluster at Michigan
(A minimal MPI example appears below.)
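As a minimal illustration of the send/receive abstraction, here is a sketch of an MPI point-to-point program in C using standard MPI calls; the tag and payload values are arbitrary.

/* Minimal MPI point-to-point example: rank 0 sends an integer to rank 1.
 * Typical build/run: mpicc ping.c -o ping && mpirun -np 2 ./ping          */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value, tag = 42;           /* tag value is arbitrary */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 570;
        MPI_Send(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

The send on rank 0 is matched with the receive on rank 1 by source/destination rank and tag, exactly the naming convention described above.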
Lecture 2 Slide 60EECS 570
Processes and Threads
• Lots of flexibility (advantage of message passing)
1. Multiple threads sharing an address space
2. Multiple processes sharing an address space
3. Multiple processes with different address spaces
• and different OSes
• 1 and 2 easily implemented on shared-memory HW (with a single OS)
❒ Process and thread creation/management similar to shared memory
• 3 probably more common in practice
❒ Process creation often external to execution environment, e.g., shell script
❒ Hard for a user process on one system to create a process on another OS
Lecture 2 Slide 61EECS 570
Communication and Synchronization
• Combined in the message passing paradigm
❒ Synchronization of messages is part of the communication semantics
• Point-to-point communication
❒ From one process to another
• Collective communication
❒ Involves groups of processes
❒ e.g., broadcast
Lecture 2 Slide 62EECS 570
Message Passing: Send()
• Send(<what>, <where-to>, <how>)
• What:
❒ A data structure or object in user space
❒ A buffer allocated from special memory
❒ A word or signal
• Where-to:
❒ A specific processor
❒ A set of specific processors
❒ A queue, dispatcher, scheduler
• How:
❒ Asynchronously vs. synchronously
❒ Typed
❒ In-order vs. out-of-order
❒ Prioritized
Lecture 2 Slide 63EECS 570
Message Passing: Receive()
• Receive(<data>, <info>, <what>, <how>)
• Data: mechanism to return message content
❒ A buffer allocated in the user process
❒ Memory allocated elsewhere
• Info: meta-info about the message
❒ Sender-ID
❒ Type, size, priority
❒ Flow control information
• What: receive only certain messages
❒ Sender-ID, type, priority
• How:
❒ Blocking vs. non-blocking
Lecture 2 Slide 64EECS 570
Synchronous vs Asynchronous
• Synchronous Send
❒ Stall until message has actually been received
❒ Implies a message acknowledgement from receiver to sender
• Synchronous Receive
❒ Stall until message has actually been received
• Asynchronous Send and Receive
❒ Sender and receiver can proceed regardless
❒ Returns a request handle that can be tested to see if the message has been sent/received
(A non-blocking MPI sketch follows below.)
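A hedged sketch of the asynchronous case using MPI’s non-blocking calls: MPI_Isend/MPI_Irecv return immediately with a request handle, and MPI_Test (or MPI_Wait) later checks it for completion; the tag and payload below are arbitrary.

/* Non-blocking send/receive: both calls return a request handle that is
 * later tested, so sender and receiver can overlap other work.           */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, out = 570, in = 0, done = 0;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0 || rank == 1) {
        if (rank == 0)
            MPI_Isend(&out, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        else
            MPI_Irecv(&in, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);

        while (!done) {
            /* ... overlap other useful work here ... */
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);   /* poll the handle */
        }
        if (rank == 1) printf("rank 1 got %d\n", in);
    }
    MPI_Finalize();
    return 0;
}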
Lecture 2 Slide 65EECS 570
Deadlock
• Blocking communications may deadlock
• Requires careful (safe) ordering of sends/receives
Safe ordering:
<Process 0>                    <Process 1>
Send(Process1, Message);       Receive(Process0, Message);
Receive(Process1, Message);    Send(Process0, Message);
Potential deadlock (both processes send first, so with blocking sends neither reaches its receive):
<Process 0>                    <Process 1>
Send(Process1, Message);       Send(Process0, Message);
Receive(Process1, Message);    Receive(Process0, Message);
Lecture 2 Slide 66EECS 570
Message Passing Paradigm Summary
Programming Model (Software) point of view:
• Disjoint, separate name spaces
• “Shared nothing”
• Communication via explicit, typed messages: send & receive
Lecture 2 Slide 67EECS 570
Message Passing Paradigm Summary
Computer Engineering (Hardware) point of view:
• Treat inter-process communication as an I/O device
• Critical issues:
❒ How to optimize API overhead
❒ Minimize communication latency
❒ Buffer management: how to deal with early/unsolicited messages, message typing, high-level flow control
❒ Event signaling & synchronization
❒ Library support for common functions (barrier synchronization, task distribution, scatter/gather, data structure maintenance)
Lecture 2 Slide 68EECS 570
Shared Memory Programming Model
Lecture 2 Slide 69EECS 570
Shared-Memory Model
[Figure: processors P1–P4 connected to a single memory system.]
❒ Multiple execution contexts sharing a single address space
❍ Multiple programs (MIMD)
❍ Or, more frequently: multiple copies of one program (SPMD)
❒ Implicit (automatic) communication via loads and stores
❒ Theoretical foundation: PRAM model
Lecture 2 Slide 70EECS 570
Global Shared Physical Address Space
• Communication, sharing, synchronization via loads/stores to shared variables
• Facilities for address translation between local/global address spaces
• Requires OS support to maintain this mapping
[Figure: each process’s address space (P0 private … Pn private) has a private portion and a shared portion mapped onto a common physical address space; a store by P0 to a shared location can be read by a load from Pn.]
Lecture 2 Slide 71EECS 570
Why Shared Memory?
Pluses
❒ For applications, looks like a multitasking uniprocessor
❒ For the OS, only evolutionary extensions required
❒ Easy to do communication without the OS
❒ Software can worry about correctness first, then performance
Minuses
❒ Proper synchronization is complex
❒ Communication is implicit, so harder to optimize
❒ Hardware designers must implement
Result
❒ Traditionally bus-based Symmetric Multiprocessors (SMPs), and now CMPs, are the most successful parallel machines ever
❒ And the first with multi-billion-dollar markets
Lecture 2 Slide 72EECS 570
Thread-Level Parallelism
• Thread-level parallelism (TLP)
❒ Collection of asynchronous tasks: not started and stopped together
❒ Data shared loosely, dynamically
• Example: database/web server (each query is a thread)
❒ accts is shared, can’t register allocate even if it were scalar
❒ id and amt are private variables, register allocated to r1, r2

struct acct_t { int bal; };
shared struct acct_t accts[MAX_ACCT];
int id, amt;
if (accts[id].bal >= amt) {
    accts[id].bal -= amt;
    spew_cash();
}

0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash
Lecture 2 Slide 73EECS 570
Synchronization
• Mutual exclusion: locks, …
• Order: barriers, signal-wait, …
• Implemented using read/write/modify to a shared location
❒ Language-level:
❍ libraries (e.g., locks in pthreads)
❍ Programmers can write custom synchronizations
❒ Hardware ISA
❍ E.g., test-and-set
• OS provides support for managing threads
❒ scheduling, fork, join, futex signal/wait
We’ll cover synchronization in more detail in a few weeks.
(A pthread mutex sketch follows below.)
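As a minimal sketch (not from the slides), the following pthreads code uses a mutex for mutual exclusion around the shared account balance from the earlier withdrawal example; the balance and withdrawal amounts are illustrative.

/* Two threads each withdraw from a shared balance; the pthread mutex makes
 * the read-modify-write sequence atomic (mutual exclusion).
 * Compile with: gcc -pthread withdraw.c                                   */
#include <stdio.h>
#include <pthread.h>

static int bal = 500;                                    /* shared data   */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *withdraw(void *arg) {
    int amt = *(int *)arg;                               /* private data  */
    pthread_mutex_lock(&lock);                           /* enter critical section */
    if (bal >= amt)
        bal -= amt;
    pthread_mutex_unlock(&lock);                         /* leave critical section */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    int amt = 100;
    pthread_create(&t0, NULL, withdraw, &amt);
    pthread_create(&t1, NULL, withdraw, &amt);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("final balance = %d\n", bal);                 /* always 300    */
    return 0;
}

Without the lock, the two read-modify-write sequences could interleave and lose one withdrawal.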
Lecture 2 Slide 74EECS 570
Paired vs. Separate Processor/Memory?
• Separate processor/memory
❒ Uniform memory access (UMA): equal latency to all memory
+ Simple software: doesn’t matter where you put data
– Lower peak performance
❒ Bus-based UMAs common: symmetric multi-processors (SMP)
• Paired processor/memory
❒ Non-uniform memory access (NUMA): faster to local memory
– More complex software: where you put data matters
+ Higher peak performance: assuming proper data placement
[Figure: left, CPUs ($) and memories attached separately to a shared bus (UMA); right, paired CPU ($)/memory nodes connected through routers (R) in a point-to-point network (NUMA).]
Lecture 2 Slide 75EECS 570
Shared vs. Point-to-Point Networks
• Shared network: e.g., bus (left)
+ Low latency
– Low bandwidth: doesn’t scale beyond ~16 processors
+ Shared property simplifies cache coherence protocols (later)
• Point-to-point network: e.g., mesh or ring (right)
– Longer latency: may need multiple “hops” to communicate
+ Higher bandwidth: scales to 1000s of processors
– Cache coherence protocols are complex
[Figure: left, CPU ($)/memory nodes on a shared bus; right, CPU ($)/memory nodes each with a router (R) in a point-to-point network.]
Lecture 2 Slide 76EECS 570
Implementation #1: Snooping Bus MP
• Two basic implementations
• Bus-based systems
❒ Typically small: 2–8 (maybe 16) processors
❒ Typically processors split from memories (UMA)
❍ Sometimes multiple processors on a single chip (CMP)
❍ Symmetric multiprocessors (SMPs)
❍ Common; I use one every day
[Figure: four CPUs ($) and four memories sharing a bus.]
Lecture 2 Slide 77EECS 570
Implementation #2: Scalable MP
• General point-to-point network-based systems
❒ Typically processor/memory/router blocks (NUMA)
❍ Glueless MP: no need for additional “glue” chips
❒ Can be arbitrarily large: 1000’s of processors
❍ Massively parallel processors (MPPs)
❒ In reality only government (DoD) has MPPs…
❍ Companies have much smaller systems: 32–64 processors
❍ Scalable multi-processors
[Figure: four CPU ($)/memory/router (R) nodes in a point-to-point network.]
Lecture 2 Slide 78EECS 570
Cache Coherence
• Two $100 withdrawals from account #241 at two ATMs
❒ Each transaction maps to a thread on a different processor
❒ Track accts[241].bal (address is in r3)
Processor 0 and Processor 1 each execute:
0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash
[Figure: CPU0 and CPU1 share a memory.]
Lecture 2 Slide 79EECS 570
No-Cache, No-Problem
• Scenario I: processors have no caches
❒ No problem
Both processors run the withdrawal sequence directly against memory:
[Trace: accts[241].bal in memory starts at 500, becomes 400 after the first store, and 300 after the second, as expected.]
Lecture 2 Slide 80EECS 570
Cache Incoherence
• Scenario II: processors have write-back caches
❒ Potentially 3 copies of accts[241].bal: memory, P0 $, P1 $
❒ Can get incoherent (inconsistent)
[Trace (P0 cache, memory, P1 cache): P0 loads 500 (V:500) and writes 400 in its cache (D:400) while memory still holds 500; P1 then loads the stale 500 from memory (V:500) and writes 400 in its own cache (D:400); one withdrawal is lost.]
Lecture 2 Slide 81EECS 570
Snooping Cache-Coherence Protocols
Bus provides a serialization point
Each cache controller “snoops” all bus transactions
❒ takes action to ensure coherence
❍ invalidate
❍ update
❍ supply value
❒ depends on the state of the block and the protocol
Lecture 2 Slide 82EECS 570
Scalable Cache Coherence
• Scalable cache coherence: two-part solution
• Part I: bus bandwidth
❒ Replace non-scalable bandwidth substrate (bus)…
❒ …with a scalable-bandwidth one (point-to-point network, e.g., mesh)
• Part II: processor snooping bandwidth
❒ Interesting: most snoops result in no action
❒ Replace non-scalable broadcast protocol (spam everyone)…
❒ …with a scalable directory protocol (only spam processors that care)
• We will cover this in Unit 3
Lecture 2 Slide 83EECS 570
Shared Memory Summary
• Shared-memory multiprocessors
+ Simple software: easy data sharing, handles both DLP and TLP
– Complex hardware: must provide illusion of a global address space
• Two basic implementations
❒ Symmetric (UMA) multi-processors (SMPs)
❍ Underlying communication network: bus (ordered)
+ Low-latency, simple protocols that rely on global order
– Low-bandwidth, poor scalability
❒ Scalable (NUMA) multi-processors (MPPs)
❍ Underlying communication network: point-to-point (unordered)
+ Scalable bandwidth
– Higher-latency, complex protocols