EECS 570 Lecture 2: Message Passing & Shared Memory
TRANSCRIPT
Lecture 2 Slide 1EECS 570
EECS 570 Lecture 2: Message Passing & Shared Memory, Winter 2018
Prof. Satish Narayanasamy
http://www.eecs.umich.edu/courses/eecs570/
Slides developed in part by Drs. Adve, Falsafi, Martin, Musuvathi, Narayanasamy, Nowatzyk, Wenisch, Sarkar, Mikko Lipasti, Jim Smith, John Shen, Mark Hill, David Wood, Guri Sohi, Natalie Enright Jerger, Michel Dubois, Murali Annavaram, Per Stenström, and probably others
Intel Paragon XP/S
Lecture 2 Slide 2EECS 570
Announcements
• Discussion this Friday
❒ Will discuss the course project
Lecture 2 Slide 3EECS 570
Readings
For today:
❒ David Wood and Mark Hill. “Cost-Effective Parallel Computing,” IEEE Computer, 1995.
❒ Mark Hill et al. “21st Century Computer Architecture.” CCC White Paper, 2012.
For Wednesday 1/13:
❒ Seiler et al. “Larrabee: A Many-Core x86 Architecture for Visual Computing.” SIGGRAPH 2008.
Lecture 2 Slide 4EECS 570
Sequential Merge Sort
16 MB input (32-bit integers)
[Figure: sequential execution over time: recurse (left); recurse (right); merge to scratch array; copy back to input array.]
Lecture 2 Slide 5EECS 570
Parallel Merge Sort (as Parallel Directed Acyclic Graph)
16 MB input (32-bit integers)
[Figure: parallel execution: recurse (left) and recurse (right) proceed in parallel, then merge to scratch array, then copy back to input array.]
Lecture 2 Slide 6EECS 570
Parallel DAG for Merge Sort (2-core)
[Figure: two sequential sorts run in parallel, followed by a single merge.]
Lecture 2 Slide 7EECS 570
Parallel DAG for Merge Sort (4-core)
Lecture 2 Slide 8EECS 570
Parallel DAG for Merge Sort (8-core)
Lecture 2 Slide 9EECS 570
The DAG Execution Model of a Parallel Computation
• Given an input, dynamically create a DAG
• Nodes represent sequential computation
❒ Weighted by the amount of work
• Edges represent dependencies:
❒ Node A → Node B means that B cannot be scheduled unless A is finished
(A fork-join code sketch of this structure follows below.)
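For concreteness, here is a minimal fork-join sketch of the merge sort DAG above, written with OpenMP tasks. This is an illustrative sketch rather than code from the course; the CUTOFF constant and the input size are assumptions chosen to mirror the 16 MB example.

/* Minimal parallel merge sort sketch using OpenMP tasks (illustrative only;
 * the recursion matches the DAG above: recurse left and right in parallel,
 * then merge sequentially). Compile with: gcc -fopenmp msort.c            */
#include <stdlib.h>
#include <string.h>

#define CUTOFF 4096   /* below this size, recurse sequentially (assumed constant) */

static void merge(int *a, int *tmp, int lo, int mid, int hi) {
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(int));   /* copy back to input array */
}

static void msort(int *a, int *tmp, int lo, int hi) {
    if (hi - lo < 2) return;
    int mid = lo + (hi - lo) / 2;
    if (hi - lo < CUTOFF) {              /* small node: plain sequential recursion */
        msort(a, tmp, lo, mid);
        msort(a, tmp, mid, hi);
    } else {
        #pragma omp task shared(a, tmp)
        msort(a, tmp, lo, mid);          /* recurse (left)  */
        msort(a, tmp, mid, hi);          /* recurse (right) */
        #pragma omp taskwait             /* join: both children must finish */
    }
    merge(a, tmp, lo, mid, hi);          /* merge to scratch array, copy back */
}

int main(void) {
    int n = 1 << 22;                     /* 16 MB of 32-bit integers */
    int *a = malloc(n * sizeof(int)), *tmp = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) a[i] = rand();
    #pragma omp parallel
    #pragma omp single
    msort(a, tmp, 0, n);
    free(a); free(tmp);
    return 0;
}

With this structure the two recursive calls are independent DAG nodes, while the taskwait plus merge form the join; the work is unchanged, but the depth shrinks to one recursion chain plus the chain of merges.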
Lecture 2 Slide 10EECS 570
Sorting 16 elements on four cores
Lecture 2 Slide 11EECS 570
Sorting 16 elements on four cores (4-element arrays sorted in constant time)
[Figure: the parallel DAG annotated with node weights: weight 1 for each constant-time split/leaf-sort step, weight 8 for each merge of 8 elements, and weight 16 for the final merge of 16 elements.]
Lecture 2 Slide 12EECS 570
Performance Measures
• Given a graph G, a scheduler S, and P processors
• TP(S): time on P processors using scheduler S
• TP: time on P processors using the best scheduler
• T1: time on a single processor (sequential cost)
• T∞: time assuming infinite resources
Lecture 2 Slide 13EECS 570
Work and Depth
• T1 = Work
❒ The total number of operations executed by a computation
• T∞ = Depth
❒ The longest chain of sequential dependencies (the critical path) in the parallel DAG
Lecture 2 Slide 14EECS 570
T∞ (Depth): Critical Path Length (Sequential Bottleneck)
Lecture 2 Slide 15EECS 570
T1 (work): Time to Run Sequentially
Lecture 2 Slide 16EECS 570
Sorting 16 elements on four cores (4-element arrays sorted in constant time)
[Figure: the same parallel DAG annotated with node weights: weight 1 for each constant-time step, weight 8 for each merge of 8 elements, and weight 16 for the final merge of 16 elements.]
Work = ?   Depth = ?
Lecture 2 Slide 17EECS 570
Some Useful Theorems
Lecture 2 Slide 18EECS 570
Work Law
• “You cannot avoid work by parallelizing”
T1 / P ≤ TP
Lecture 2 Slide 19EECS 570
Work Law
• “You cannot avoid work by parallelizing”
T1 / P ≤ TP
Speedup = T1 / TP
Lecture 2 Slide 20EECS 570
Work Law
• “You cannot avoid work by parallelizing”
• Can speedup be more than 2 when we go from 1 core to 2 cores in practice?
T1 / P ≤ TP
Speedup = T1 / TP
Lecture 2 Slide 21EECS 570
Depth Law
• More resources should make things faster
• You are limited by the sequential bottleneck
TP ≥ T∞
Lecture 2 Slide 22EECS 570
Amount of Parallelism
Parallelism = T1 / T∞
Lecture 2 Slide 23EECS 570
Maximum Speedup Possible
“Speedup is bounded above by available parallelism”
Speedup = T1 / TP ≤ T1 / T∞ = Parallelism
Lecture 2 Slide 24EECS 570
Greedy Scheduler
• If more than P nodes can be scheduled, pick any subset of size P
• If fewer than P nodes can be scheduled, schedule them all
(A small simulation of this rule appears below.)
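As a concrete illustration (not from the slides), the following sketch simulates the greedy rule on a small, hypothetical unit-work DAG with P = 2 processors; the DAG, its edge list, and the processor count are made up for the example.

/* Simulate a greedy scheduler on a small DAG with unit-work nodes. */
#include <stdio.h>

#define N 7   /* hypothetical DAG size    */
#define P 2   /* hypothetical processors  */

int main(void) {
    /* Hypothetical DAG: edges[i][j] = 1 means node i must finish before j. */
    int edges[N][N] = {0};
    edges[0][1] = edges[0][2] = 1;
    edges[1][3] = edges[1][4] = 1;
    edges[2][5] = 1;
    edges[3][6] = edges[4][6] = edges[5][6] = 1;

    int done[N] = {0}, remaining = N, steps = 0;
    while (remaining > 0) {
        /* Collect every ready node: not done, all predecessors done. */
        int ready[N], nready = 0;
        for (int v = 0; v < N; v++) {
            if (done[v]) continue;
            int ok = 1;
            for (int u = 0; u < N; u++)
                if (edges[u][v] && !done[u]) ok = 0;
            if (ok) ready[nready++] = v;
        }
        /* Greedy step: run up to P ready nodes in this time step. */
        int run = nready < P ? nready : P;
        for (int i = 0; i < run; i++) done[ready[i]] = 1;
        remaining -= run;
        steps++;
    }
    printf("T_%d (greedy) = %d time steps\n", P, steps);
    return 0;
}

Each loop iteration is one time step: the scheduler gathers every ready node and runs up to P of them, which is exactly the rule stated above.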
Lecture 2 Slide 25EECS 570
More reading: http://www.cs.cmu.edu/afs/cs/academic/class/15492-f07/www/scribe/lec4/lecture4.pdf
Lecture 2 Slide 26EECS 570
Performance of the Greedy Scheduler
Work law: T1 / P ≤ TP
Depth law: T∞ ≤ TP
TP(Greedy) ≤ T1 / P + T∞
(At each time step a greedy scheduler either keeps all P processors busy, which can happen at most T1 / P times, or it runs every ready node and shortens the remaining critical path, which can happen at most T∞ times.)
Lecture 2 Slide 27EECS 570
Greedy is optimal within factor of 2
Work law: T1 / P ≤ TP
Depth law: T∞ ≤ TP
TP ≤ TP(Greedy) ≤ 2 TP
Lecture 2 Slide 28EECS 570
Work/Depth of Merge Sort (Sequential Merge)
• Work T1: O(n log n)
• Depth T∞: O(n)
❒ Takes O(n) time to merge n elements
❒ The two recursive sorts run in parallel, but each level’s merge is sequential, so T∞(n) = T∞(n/2) + O(n) = O(n)
• Parallelism:
❒ T1 / T∞ = O(log n) → really bad!
Lecture 2 Slide 29EECS 570
Main Message
• Analyze the Work and Depth of your algorithm
• Parallelism is Work / Depth
• Try to decrease Depth
❒ the critical path
❒ a sequential bottleneck
• If you increase Depth
❒ better increase Work by a lot more!
Lecture 2 Slide 30EECS 570
Amdahl’s law
• Sorting takes 70% of the execution time of a sequential program
• You replace the sorting algorithm with one that scales perfectly on multi-core hardware
• How many cores do you need to get a 4× speed-up on the program?
Lecture 2 Slide 31EECS 570
Amdahl’s law, 𝑓 = 70%
f = the parallel portion of execution
1 - f = the sequential portion of execution
c = number of cores used
Speedup(f, c) = 1 / ((1 - f) + f / c)
(This formula is evaluated in the sketch below.)
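A small sketch (not from the slides) that evaluates this formula also answers the question from the previous slide: with f = 0.70 the limit as c → ∞ is 1 / (1 - f) ≈ 3.33, so no number of cores reaches a 4× speedup.

/* Evaluate Amdahl's law: Speedup(f, c) = 1 / ((1 - f) + f / c). */
#include <stdio.h>

static double amdahl(double f, double c) {
    return 1.0 / ((1.0 - f) + f / c);
}

int main(void) {
    double f = 0.70;                       /* parallel fraction from the example */
    for (int c = 1; c <= 16; c *= 2)
        printf("c = %2d  speedup = %.2f\n", c, amdahl(f, c));
    printf("limit (c -> infinity) = %.2f\n", 1.0 / (1.0 - f));
    /* The limit is ~3.33x, so no core count yields the desired 4x. */
    return 0;
}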
Lecture 2 Slide 32EECS 570
Amdahl’s law, 𝑓 = 70%
[Plot: speedup vs. number of cores (1 to 16). Speedup achieved with perfect scaling on the 70% parallel portion stays below the desired 4× speedup line.]
Lecture 2 Slide 33EECS 570
Amdahl’s law, 𝑓 = 70%
[Plot: same speedup curve as the previous slide, annotated with its asymptote: limit as c → ∞ = 1 / (1 - f) = 3.33.]
Lecture 2 Slide 34EECS 570
Amdahl’s law, 𝑓 = 10%
[Plot: speedup vs. number of cores (1 to 16) for f = 10%. Speedup achieved with perfect scaling approaches the Amdahl’s law limit of just 1.11×.]
Lecture 2 Slide 35EECS 570
Amdahl’s law, 𝑓 = 98%
[Plot: speedup vs. number of cores (1 to 128) for f = 98%; the Amdahl’s law limit is 1 / (1 - f) = 50×.]
Lecture 2 Slide 36EECS 570
Lesson
• Speedup is limited by sequential code
• Even a small percentage of sequential code can greatly limit potential speedup
Lecture 2 Slide 37EECS 570
21st Century Computer Architecture
A CCC community white paper: http://cra.org/ccc/docs/init/21stcenturyarchitecturewhitepaper.pdf
Slides from M. Hill, HPCA 2014 Keynote
Lecture 2 Slide 38EECS 570
20th Century ICT Set Up
• Information & Communication Technology (ICT) has changed our world
❒ <long list omitted>
• Required innovations in algorithms, applications, programming languages, …, & system software
• Key (invisible) enablers of (cost-)performance gains
❒ Semiconductor technology (“Moore’s Law”)
❒ Computer architecture (~80× per Danowitz et al.)
Lecture 2 Slide 39EECS 570
Enablers: Technology + Architecture
[Figure: Danowitz et al., CACM 04/2012, Figure 1, with the gains labeled Technology and Architecture.]
Lecture 2 Slide 40EECS 570
21st Century ICT Promises More
Data-centric personalized healthcare; computation-driven scientific discovery; human network analysis; much more, known & unknown
Lecture 2 Slide 41EECS 570
21st Century App Characteristics
BIG DATA, ALWAYS ONLINE, SECURE/PRIVATE
Whither enablers of future (cost-)performance gains?
Lecture 2 Slide 42EECS 570
Technology’s Challenges 1/2
Late 20th Century → The New Reality:
• Moore’s Law (2× transistors/chip) → Transistor count still 2×, BUT…
• Dennard Scaling (~constant power/chip) → Gone. Can’t repeatedly double power/chip
Lecture 2 Slide 43EECS 570
Technology’s Challenges 2/2
Late 20th Century → The New Reality:
• Moore’s Law (2× transistors/chip) → Transistor count still 2×, BUT…
• Dennard Scaling (~constant power/chip) → Gone. Can’t repeatedly double power/chip
• Modest (hidden) transistor unreliability → Increasing transistor unreliability can’t be hidden
• Focus on computation over communication → Communication (energy) more expensive than computation
• One-time costs amortized via mass market → One-time cost much worse & want specialized platforms
How should architects step up as technology falters?
Lecture 2 Slide 44EECS 570
“Timeline” from DARPA ISAT
[Figure: system capability (log scale) plotted from the 1980s through the 2050s, with a “Fallow Period” marked.]
Source: Advancing Computer Systems without Technology Progress, ISAT Outbrief (http://www.cs.wisc.edu/~markhill/papers/isat2012_ACSWTP.pdf)
Mark D. Hill and Christos Kozyrakis, DARPA/ISAT Workshop, March 26-27, 2012. Approved for Public Release, Distribution Unlimited.
The views expressed are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government.
Lecture 2 Slide 49EECS 570
21st Century Comp Architecture
20th Century → 21st Century:
• Single-chip in stand-alone computer → Architecture as infrastructure: spanning sensors to clouds; performance + security, privacy, availability, programmability, …
• Performance via invisible instruction-level parallelism → Energy first: parallelism, specialization, cross-layer design
• Predictable technologies (CMOS, DRAM, & disks) → New technologies (non-volatile memory, near-threshold, 3D, photonics, …); rethink memory & storage, reliability, communication
• Cross-cutting: break current layers with new interfaces
Lecture 2 Slide 50EECS 570
Cost-Effective Computing [Wood & Hill, IEEE Computer 1995]
• Premise: isn’t Speedup(P) < P inefficient?
❒ If only throughput matters, use P computers instead…
• Key observation: much of a computer’s cost is NOT the CPU
Let Costup(P) = Cost(P) / Cost(1)
Parallel computing is cost-effective if:
❒ Speedup(P) > Costup(P)
• E.g., for an SGI PowerChallenge w/ 500 MB
❒ Costup(32) = 8.6, so 32 processors are cost-effective whenever Speedup(32) > 8.6
(A small worked example with hypothetical costs follows below.)
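To make the criterion concrete, here is a tiny sketch with hypothetical costs: the $40,000 non-CPU cost and $2,000 per-processor cost below are invented for illustration and are not the SGI figures from the paper.

/* Wood & Hill's cost-effectiveness criterion with made-up cost numbers:
 * parallel computing pays off whenever Speedup(P) > Costup(P).          */
#include <stdio.h>

#define BASE_COST 40000.0   /* hypothetical non-CPU cost (memory, I/O, ...) */
#define CPU_COST   2000.0   /* hypothetical cost per processor             */

static double costup(int p) {
    double cost1 = BASE_COST + CPU_COST;
    double costp = BASE_COST + p * CPU_COST;
    return costp / cost1;   /* Costup(P) = Cost(P) / Cost(1) */
}

int main(void) {
    for (int p = 2; p <= 32; p *= 2)
        printf("P = %2d  Costup = %.2f  (cost-effective if Speedup > Costup)\n",
               p, costup(p));
    return 0;
}

With these made-up numbers Costup(32) is only about 2.5, so even a modest Speedup(32) of 3 would already be cost-effective.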
Lecture 2 Slide 51EECS 570
Parallel Programming Models and Interfaces
Lecture 2 Slide 52EECS 570
Programming Models
• High-level paradigm for expressing an algorithm
❒ Examples:
❍ Functional
❍ Sequential, procedural
❍ Shared memory
❍ Message passing
• Embodied in languages that support concurrent execution
❒ Incorporated into language constructs
❒ Incorporated as libraries added to an existing sequential language
• Top-level features (for conventional models: shared memory, message passing):
❒ Multiple threads are conceptually visible to the programmer
❒ Communication/synchronization are visible to the programmer
Lecture 2 Slide 53EECS 570
An Incomplete Taxonomy
MP Systems:
❒ Shared Memory: bus-based or switching-fabric-based; incoherent or coherent
❒ Message Passing: COTS-based or switching-fabric-based
Emerging & Lunatic Fringe: VLIW, SIMD, Vector, DataFlow, GPU, MapReduce/WSC, Systolic Array, Reconfigurable/FPGA, …
Lecture 2 Slide 54EECS 570
Programming Model Elements
• For both Shared Memory and Message Passing
• Processes and threads
❒ Process: a shared address space and one or more threads of control
❒ Thread: a program sequencer and private address space
❒ Task: less formal term; part of an overall job
❒ Created, terminated, scheduled, etc.
• Communication
❒ Passing of data
• Synchronization
❒ Communicating control information
❒ To assure reliable, deterministic communication
Lecture 2 Slide 55EECS 570
Historical View
[Figure: three machine organizations, each built from processors (P), memories (M), and I/O; they differ in where the parts join and in the programming model used:]
❒ Join at I/O (network) → program with message passing
❒ Join at memory → program with shared memory
❒ Join at processor → program with dataflow, SIMD, VLIW, CUDA, other data parallel
Lecture 2 Slide 56EECS 570
Message Passing Programming Model
Lecture 2 Slide 57EECS 570
Message Passing Programming Model
• User-level send/receive abstraction
❒ Match via local buffer (x, y), process (Q, P), and tag (t)
❒ Need naming/synchronization conventions
[Figure: process P issues send x,Q,t from address x in its local address space; process Q issues recv y,P,t into address y in its local address space; the send and receive are matched by process and tag.]
Lecture 2 Slide 58EECS 570
Message Passing Architectures
• Cannot directly access memory of another node
• IBM SP-2, Intel Paragon, Myrinet, Quadrics QSW
• Cluster of workstations (e.g., MPI on the Flux cluster)
[Figure: nodes, each with CPU ($), memory, and NIC, connected by an interconnect.]
Lecture 2 Slide 59EECS 570
MPI: Message Passing Interface API
• A widely used standard
❒ For a variety of distributed-memory systems
• SMP clusters, workstation clusters, MPPs, heterogeneous systems
• Also works on shared-memory MPs
❒ Easy to emulate distributed memory on shared-memory HW
• Can be used with a number of high-level languages
• Available on the Flux cluster at Michigan
(A minimal MPI example appears below.)
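As a minimal illustration of the send/receive abstraction, here is a sketch of an MPI point-to-point program in C using standard MPI calls; the tag and payload values are arbitrary.

/* Minimal MPI point-to-point example: rank 0 sends an integer to rank 1.
 * Typical build/run: mpicc ping.c -o ping && mpirun -np 2 ./ping          */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value, tag = 42;           /* tag value is arbitrary */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 570;
        MPI_Send(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

The send on rank 0 is matched with the receive on rank 1 by source/destination rank and tag, exactly the naming convention described above.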
Lecture 2 Slide 60EECS 570
Processes and Threads
• Lots of flexibility (advantage of message passing)
1. Multiple threads sharing an address space
2. Multiple processes sharing an address space
3. Multiple processes with different address spaces
• and different OSes
• 1 and 2 easily implemented on shared-memory HW (with a single OS)
❒ Process and thread creation/management similar to shared memory
• 3 probably more common in practice
❒ Process creation often external to execution environment, e.g., shell script
❒ Hard for a user process on one system to create a process on another OS
Lecture 2 Slide 61EECS 570
Communication and Synchronization
• Combined in the message passing paradigm
❒ Synchronization of messages is part of the communication semantics
• Point-to-point communication
❒ From one process to another
• Collective communication
❒ Involves groups of processes
❒ e.g., broadcast
Lecture 2 Slide 62EECS 570
Message Passing: Send()
• Send(<what>, <where-to>, <how>)
• What:
❒ A data structure or object in user space
❒ A buffer allocated from special memory
❒ A word or signal
• Where-to:
❒ A specific processor
❒ A set of specific processors
❒ A queue, dispatcher, scheduler
• How:
❒ Asynchronously vs. synchronously
❒ Typed
❒ In-order vs. out-of-order
❒ Prioritized
Lecture 2 Slide 63EECS 570
Message Passing: Receive()
• Receive(<data>, <info>, <what>, <how>)
• Data: mechanism to return message content
❒ A buffer allocated in the user process
❒ Memory allocated elsewhere
• Info: meta-info about the message
❒ Sender-ID
❒ Type, size, priority
❒ Flow control information
• What: receive only certain messages
❒ Sender-ID, type, priority
• How:
❒ Blocking vs. non-blocking
Lecture 2 Slide 64EECS 570
Synchronous vs Asynchronous
• Synchronous Send
❒ Stall until message has actually been received
❒ Implies a message acknowledgement from receiver to sender
• Synchronous Receive
❒ Stall until message has actually been received
• Asynchronous Send and Receive
❒ Sender and receiver can proceed regardless
❒ Returns a request handle that can be tested to see if the message has been sent/received
(A non-blocking MPI sketch follows below.)
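A hedged sketch of the asynchronous case using MPI’s non-blocking calls: MPI_Isend/MPI_Irecv return immediately with a request handle, and MPI_Test (or MPI_Wait) later checks it for completion; the tag and payload below are arbitrary.

/* Non-blocking send/receive: both calls return a request handle that is
 * later tested, so sender and receiver can overlap other work.           */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, out = 570, in = 0, done = 0;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0 || rank == 1) {
        if (rank == 0)
            MPI_Isend(&out, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        else
            MPI_Irecv(&in, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);

        while (!done) {
            /* ... overlap other useful work here ... */
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);   /* poll the handle */
        }
        if (rank == 1) printf("rank 1 got %d\n", in);
    }
    MPI_Finalize();
    return 0;
}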
Lecture 2 Slide 65EECS 570
Deadlock
• Blocking communications may deadlock
• Requires careful (safe) ordering of sends/receives
Safe ordering:
<Process 0>                    <Process 1>
Send(Process1, Message);       Receive(Process0, Message);
Receive(Process1, Message);    Send(Process0, Message);
Potential deadlock (both processes send first, so with blocking sends neither reaches its receive):
<Process 0>                    <Process 1>
Send(Process1, Message);       Send(Process0, Message);
Receive(Process1, Message);    Receive(Process0, Message);
Lecture 2 Slide 66EECS 570
Message Passing Paradigm Summary
Programming Model (Software) point of view:
• Disjoint, separate name spaces
• “Shared nothing”
• Communication via explicit, typed messages: send & receive
Lecture 2 Slide 67EECS 570
Message Passing Paradigm Summary
Computer Engineering (Hardware) point of view:
• Treat inter-process communication as an I/O device
• Critical issues:
❒ How to optimize API overhead
❒ Minimize communication latency
❒ Buffer management: how to deal with early/unsolicited messages, message typing, high-level flow control
❒ Event signaling & synchronization
❒ Library support for common functions (barrier synchronization, task distribution, scatter/gather, data structure maintenance)
Lecture 2 Slide 68EECS 570
Shared Memory Programming Model
Lecture 2 Slide 69EECS 570
Shared-Memory Model
[Figure: processors P1–P4 connected to a single memory system.]
❒ Multiple execution contexts sharing a single address space
❍ Multiple programs (MIMD)
❍ Or, more frequently: multiple copies of one program (SPMD)
❒ Implicit (automatic) communication via loads and stores
❒ Theoretical foundation: PRAM model
Lecture 2 Slide 70EECS 570
Global Shared Physical Address Space
• Communication, sharing, synchronization via loads/stores to shared variables
• Facilities for address translation between local/global address spaces
• Requires OS support to maintain this mapping
[Figure: each process’s address space (P0 private … Pn private) has a private portion and a shared portion mapped onto a common physical address space; a store by P0 to a shared location can be read by a load from Pn.]
Lecture 2 Slide 71EECS 570
Why Shared Memory?
Pluses
❒ For applications, looks like a multitasking uniprocessor
❒ For the OS, only evolutionary extensions required
❒ Easy to do communication without the OS
❒ Software can worry about correctness first, then performance
Minuses
❒ Proper synchronization is complex
❒ Communication is implicit, so harder to optimize
❒ Hardware designers must implement
Result
❒ Traditionally bus-based Symmetric Multiprocessors (SMPs), and now CMPs, are the most successful parallel machines ever
❒ And the first with multi-billion-dollar markets
Lecture 2 Slide 72EECS 570
Thread-Level Parallelism
• Thread-level parallelism (TLP)
❒ Collection of asynchronous tasks: not started and stopped together
❒ Data shared loosely, dynamically
• Example: database/web server (each query is a thread)
❒ accts is shared, can’t register allocate even if it were scalar
❒ id and amt are private variables, register allocated to r1, r2

struct acct_t { int bal; };
shared struct acct_t accts[MAX_ACCT];
int id, amt;
if (accts[id].bal >= amt) {
    accts[id].bal -= amt;
    spew_cash();
}

0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash
Lecture 2 Slide 73EECS 570
Synchronization
• Mutual exclusion: locks, …
• Order: barriers, signal-wait, …
• Implemented using read/write/modify to a shared location
❒ Language-level:
❍ libraries (e.g., locks in pthreads)
❍ Programmers can write custom synchronizations
❒ Hardware ISA
❍ E.g., test-and-set
• OS provides support for managing threads
❒ scheduling, fork, join, futex signal/wait
We’ll cover synchronization in more detail in a few weeks.
(A pthread mutex sketch follows below.)
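As a minimal sketch (not from the slides), the following pthreads code uses a mutex for mutual exclusion around the shared account balance from the earlier withdrawal example; the balance and withdrawal amounts are illustrative.

/* Two threads each withdraw from a shared balance; the pthread mutex makes
 * the read-modify-write sequence atomic (mutual exclusion).
 * Compile with: gcc -pthread withdraw.c                                   */
#include <stdio.h>
#include <pthread.h>

static int bal = 500;                                    /* shared data   */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *withdraw(void *arg) {
    int amt = *(int *)arg;                               /* private data  */
    pthread_mutex_lock(&lock);                           /* enter critical section */
    if (bal >= amt)
        bal -= amt;
    pthread_mutex_unlock(&lock);                         /* leave critical section */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    int amt = 100;
    pthread_create(&t0, NULL, withdraw, &amt);
    pthread_create(&t1, NULL, withdraw, &amt);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("final balance = %d\n", bal);                 /* always 300    */
    return 0;
}

Without the lock, the two read-modify-write sequences could interleave and lose one withdrawal.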
Lecture 2 Slide 74EECS 570
Paired vs. Separate Processor/Memory?
• Separate processor/memory
❒ Uniform memory access (UMA): equal latency to all memory
+ Simple software: doesn’t matter where you put data
– Lower peak performance
❒ Bus-based UMAs common: symmetric multi-processors (SMP)
• Paired processor/memory
❒ Non-uniform memory access (NUMA): faster to local memory
– More complex software: where you put data matters
+ Higher peak performance: assuming proper data placement
[Figure: left, CPUs ($) and memories attached separately to a shared bus (UMA); right, paired CPU ($)/memory nodes connected through routers (R) in a point-to-point network (NUMA).]
Lecture 2 Slide 75EECS 570
Shared vs. Point-to-Point Networks
• Shared network: e.g., bus (left)
+ Low latency
– Low bandwidth: doesn’t scale beyond ~16 processors
+ Shared property simplifies cache coherence protocols (later)
• Point-to-point network: e.g., mesh or ring (right)
– Longer latency: may need multiple “hops” to communicate
+ Higher bandwidth: scales to 1000s of processors
– Cache coherence protocols are complex
[Figure: left, CPU ($)/memory nodes on a shared bus; right, CPU ($)/memory nodes each with a router (R) in a point-to-point network.]
Lecture 2 Slide 76EECS 570
Implementation #1: Snooping Bus MP
• Two basic implementations
• Bus-based systems
❒ Typically small: 2–8 (maybe 16) processors
❒ Typically processors split from memories (UMA)
❍ Sometimes multiple processors on a single chip (CMP)
❍ Symmetric multiprocessors (SMPs)
❍ Common; I use one every day
[Figure: four CPUs ($) and four memories sharing a bus.]
Lecture 2 Slide 77EECS 570
Implementation #2: Scalable MP
• General point-to-point network-based systems
❒ Typically processor/memory/router blocks (NUMA)
❍ Glueless MP: no need for additional “glue” chips
❒ Can be arbitrarily large: 1000’s of processors
❍ Massively parallel processors (MPPs)
❒ In reality only government (DoD) has MPPs…
❍ Companies have much smaller systems: 32–64 processors
❍ Scalable multi-processors
[Figure: four CPU ($)/memory/router (R) nodes in a point-to-point network.]
Lecture 2 Slide 78EECS 570
Cache Coherence
• Two $100 withdrawals from account #241 at two ATMs
❒ Each transaction maps to a thread on a different processor
❒ Track accts[241].bal (address is in r3)
Processor 0 and Processor 1 each execute:
0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash
[Figure: CPU0 and CPU1 share a memory.]
Lecture 2 Slide 79EECS 570
No-Cache, No-Problem
• Scenario I: processors have no caches
❒ No problem
Both processors run the withdrawal sequence directly against memory:
[Trace: accts[241].bal in memory starts at 500, becomes 400 after the first store, and 300 after the second, as expected.]
Lecture 2 Slide 80EECS 570
Cache Incoherence
• Scenario II: processors have write-back caches
❒ Potentially 3 copies of accts[241].bal: memory, P0 $, P1 $
❒ Can get incoherent (inconsistent)
[Trace (P0 cache, memory, P1 cache): P0 loads 500 (V:500) and writes 400 in its cache (D:400) while memory still holds 500; P1 then loads the stale 500 from memory (V:500) and writes 400 in its own cache (D:400); one withdrawal is lost.]
Lecture 2 Slide 81EECS 570
Snooping Cache-Coherence Protocols
Bus provides a serialization point
Each cache controller “snoops” all bus transactions
❒ takes action to ensure coherence
❍ invalidate
❍ update
❍ supply value
❒ depends on the state of the block and the protocol
Lecture 2 Slide 82EECS 570
Scalable Cache Coherence
• Scalable cache coherence: two-part solution
• Part I: bus bandwidth
❒ Replace non-scalable bandwidth substrate (bus)…
❒ …with a scalable-bandwidth one (point-to-point network, e.g., mesh)
• Part II: processor snooping bandwidth
❒ Interesting: most snoops result in no action
❒ Replace non-scalable broadcast protocol (spam everyone)…
❒ …with a scalable directory protocol (only spam processors that care)
• We will cover this in Unit 3
Lecture 2 Slide 83EECS 570
Shared Memory Summary
• Shared-memory multiprocessors
+ Simple software: easy data sharing, handles both DLP and TLP
– Complex hardware: must provide illusion of a global address space
• Two basic implementations
❒ Symmetric (UMA) multi-processors (SMPs)
❍ Underlying communication network: bus (ordered)
+ Low-latency, simple protocols that rely on global order
– Low-bandwidth, poor scalability
❒ Scalable (NUMA) multi-processors (MPPs)
❍ Underlying communication network: point-to-point (unordered)
+ Scalable bandwidth
– Higher-latency, complex protocols