EECS 570 Lecture 2: Message Passing & Shared Memory


Lecture 2 Slide 1 EECS 570

EECS 570: Lecture 2, Message Passing & Shared Memory, Winter 2018

Prof. Satish Narayanasamy

http://www.eecs.umich.edu/courses/eecs570/

Slides developed in part by Drs. Adve, Falsafi, Martin, Musuvathi, Narayanasamy, Nowatzyk, Wenisch, Sarkar, Mikko Lipasti, Jim Smith, John Shen, Mark Hill, David Wood, Guri Sohi, Natalie Enright Jerger, Michel Dubois, Murali Annavaram, Per Stenström, and probably others

[Photo: Intel Paragon XP/S]

Lecture 2 Slide 2 EECS 570

Announcements
• Discussion this Friday: will discuss the course project

Lecture 2 Slide 3 EECS 570

Readings
For today:
❒ David Wood and Mark Hill. "Cost-Effective Parallel Computing," IEEE Computer, 1995.
❒ Mark Hill et al. "21st Century Computer Architecture." CCC White Paper, 2012.
For Wednesday 1/13:
❒ Seiler et al. "Larrabee: A Many-Core x86 Architecture for Visual Computing." SIGGRAPH 2008.

Lecture 2 Slide 4 EECS 570

Sequential Merge Sort
[Figure: sequential execution timeline for merge sort on a 16 MB input of 32-bit integers; each step recurses on the left half, recurses on the right half, merges to a scratch array, then copies back to the input array.]

Lecture 2 Slide 5 EECS 570

Parallel Merge Sort (as a Parallel Directed Acyclic Graph)
[Figure: the same 16 MB merge sort drawn as a DAG; the left and right recursions run in parallel, followed by the merge to the scratch array and the copy back to the input array.]

Lecture 2 Slide 6 EECS 570

Parallel DAG for Merge Sort (2-core)
[Figure: two sequential sorts run in parallel, followed by a single merge.]

Lecture 2 Slide 7 EECS 570

Parallel DAG for Merge Sort (4-core)

Lecture 2 Slide 8 EECS 570

Parallel DAG for Merge Sort (8-core)

Lecture 2 Slide 9 EECS 570

The DAG Execution Model of a Parallel Computation
• Given an input, dynamically create a DAG
• Nodes represent sequential computation
❒ Weighted by the amount of work
• Edges represent dependencies:
❒ Node A → Node B means that B cannot be scheduled unless A is finished

Lecture 2 Slide 10 EECS 570

Sorting 16 elements in four cores

Lecture 2 Slide 11 EECS 570

Sorting 16 elements in four cores (4-element arrays sorted in constant time)
[Figure: the task DAG, with node weights of 1 for the constant-time leaf sorts, 8 for the two intermediate merges, and 16 for the final merge.]

Lecture 2 Slide 12 EECS 570

Performance Measures
• Given a graph G, a scheduler S, and P processors
• TP(S): time on P processors using scheduler S
• TP: time on P processors using the best scheduler
• T1: time on a single processor (sequential cost)
• T∞: time assuming infinite resources

Lecture 2 Slide 13 EECS 570

Work and Depth
• T1 = Work
❒ The total number of operations executed by a computation
• T∞ = Depth
❒ The longest chain of sequential dependencies (critical path) in the parallel DAG

Lecture 2 Slide 14 EECS 570

T∞ (Depth): Critical Path Length (Sequential Bottleneck)

Lecture 2 Slide 15 EECS 570

T1 (Work): Time to Run Sequentially

Lecture 2 Slide 16 EECS 570

Sorting 16 elements in four cores (4-element arrays sorted in constant time)
[Figure: the same task DAG as before, with blanks to fill in: Work = ___   Depth = ___]

Lecture 2 Slide 17 EECS 570

Some Useful Theorems

Lecture 2 Slides 18-20 EECS 570

Work Law
• "You cannot avoid work by parallelizing"
T1 / P ≤ TP
Speedup = T1 / TP
• Can speedup be more than 2 when we go from 1 core to 2 cores in practice?

Lecture 2 Slide 21 EECS 570

Depth Law
• More resources should make things faster
• You are limited by the sequential bottleneck
TP ≥ T∞

Lecture 2 Slide 22 EECS 570

Amount of Parallelism
Parallelism = T1 / T∞

Lecture 2 Slide 23 EECS 570

Maximum Speedup Possible
• "Speedup is bounded above by available parallelism"
Speedup = T1 / TP ≤ T1 / T∞ = Parallelism (follows from the Depth Law, TP ≥ T∞)

Lecture 2 Slide 24 EECS 570

Greedy Scheduler
• If more than P nodes can be scheduled, pick any subset of size P
• If fewer than P nodes can be scheduled, schedule them all
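
The two rules above can be simulated directly. Below is a minimal sketch in C (illustrative only; the unit-work DAG, names, and layout are made up, not from the course): at every time step it schedules up to P ready nodes, or all of them if fewer than P are ready.

    #include <stdio.h>

    #define N 7   /* hypothetical DAG: node 0 splits, 1-4 sort, 5-6 merge; unit work each */

    /* dep[i][j] = 1 means there is an edge j -> i (i cannot run until j is done) */
    static const int dep[N][N] = {
        [1] = {1}, [2] = {1}, [3] = {1}, [4] = {1},   /* 1-4 depend on 0 */
        [5] = {0, 1, 1},                              /* 5 depends on 1 and 2 */
        [6] = {0, 0, 0, 1, 1, 1},                     /* 6 depends on 3, 4, 5 */
    };

    /* Returns the number of time steps the greedy scheduler needs with P processors. */
    static int greedy_schedule(int P) {
        int done[N] = {0}, finished = 0, steps = 0;
        while (finished < N) {
            int ready[N], r = 0;
            for (int i = 0; i < N; i++) {             /* collect ready nodes */
                if (done[i]) continue;
                int ok = 1;
                for (int j = 0; j < N; j++)
                    if (dep[i][j] && !done[j]) ok = 0;
                if (ok) ready[r++] = i;
            }
            int run = (r < P) ? r : P;                /* the greedy rule from the slide */
            for (int k = 0; k < run; k++) { done[ready[k]] = 1; finished++; }
            steps++;
        }
        return steps;
    }

    int main(void) {
        /* For a unit-work DAG: P=1 gives T1 (= #nodes), large P gives T_inf (critical path). */
        printf("T1=%d  T2=%d  Tinf=%d\n",
               greedy_schedule(1), greedy_schedule(2), greedy_schedule(N));
        return 0;
    }

For this toy DAG it prints T1=7, T2=5, Tinf=4, consistent with the work and depth laws on the following slides.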

Lecture 2 Slide 25 EECS 570

More reading: http://www.cs.cmu.edu/afs/cs/academic/class/15492-f07/www/scribe/lec4/lecture4.pdf

Lecture 2 Slide 26 EECS 570

Performance of the Greedy Scheduler
Work law: T1 / P ≤ TP
Depth law: T∞ ≤ TP
TP(Greedy) ≤ T1 / P + T∞
(At each step a greedy scheduler either keeps all P processors busy, and there can be at most T1 / P such steps, or it runs every ready node, which shortens the remaining critical path, and there can be at most T∞ such steps.)

Lecture 2 Slide 27 EECS 570

Greedy is optimal within a factor of 2
Work law: T1 / P ≤ TP
Depth law: T∞ ≤ TP
TP ≤ TP(Greedy) ≤ 2 TP
(The second inequality follows because TP(Greedy) ≤ T1 / P + T∞ ≤ TP + TP.)

Lecture 2 Slide 28 EECS 570

Work/Depth of Merge Sort (Sequential Merge)
• Work T1: O(n log n)
• Depth T∞: O(n)
❒ It takes O(n) time to merge n elements
• Parallelism:
❒ T1 / T∞ = O(log n) → really bad!
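
To make the work and depth terms concrete, here is a minimal divide-and-conquer sketch in C with OpenMP tasks (illustrative code, not from the course; the array and function names are made up). The two recursive calls become independent DAG nodes, while the O(n) sequential merge is the chain that keeps the depth at O(n).

    #include <string.h>

    /* Sequential O(n) merge of a[lo,mid) and a[mid,hi): this is the depth bottleneck. */
    static void merge(int *a, int *tmp, int lo, int mid, int hi) {
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi)
            tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        while (i < mid) tmp[k++] = a[i++];
        while (j < hi)  tmp[k++] = a[j++];
        memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof(int));   /* copy back */
    }

    static void msort(int *a, int *tmp, int lo, int hi) {
        if (hi - lo < 2) return;
        int mid = lo + (hi - lo) / 2;
        #pragma omp task shared(a, tmp)      /* Recurse(left) and Recurse(right) are */
        msort(a, tmp, lo, mid);              /* independent nodes in the DAG         */
        msort(a, tmp, mid, hi);
        #pragma omp taskwait                 /* edge: the merge depends on both halves */
        merge(a, tmp, lo, mid, hi);
    }

    /* usage sketch: call once from inside a parallel region, e.g.
     *   #pragma omp parallel
     *   #pragma omp single
     *   msort(a, tmp, 0, n);
     */

With the sequential merge() above the work is O(n log n) and the depth O(n), matching the slide; replacing merge() with a parallel merge is what reduces the depth to polylogarithmic.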

Lecture 2 Slide 29 EECS 570

Main Message
• Analyze the Work and Depth of your algorithm
• Parallelism is Work / Depth
• Try to decrease Depth
❒ the critical path
❒ a sequential bottleneck
• If you increase Depth
❒ you had better increase Work by a lot more!

Lecture 2 Slide 30 EECS 570

Amdahl's Law
• Sorting takes 70% of the execution time of a sequential program
• You replace the sorting algorithm with one that scales perfectly on multi-core hardware
• How many cores do you need to get a 4x speedup on the program?

Lecture 2 Slide 31 EECS 570

Amdahl's Law, f = 70%
f = the parallel portion of execution
1 - f = the sequential portion of execution
c = number of cores used

Speedup(f, c) = 1 / ((1 - f) + f / c)
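
As a quick check on the 4x question from the previous slide, here is a small illustrative C program (not course code) that evaluates the formula above for f = 0.70; the limit as c grows is 1 / (1 - f) ≈ 3.33, so no core count reaches 4x.

    #include <stdio.h>

    /* Amdahl's law as on the slide: f = parallel fraction, c = number of cores. */
    static double speedup(double f, double c) {
        return 1.0 / ((1.0 - f) + f / c);
    }

    int main(void) {
        double f = 0.70;
        for (int c = 1; c <= 16; c *= 2)
            printf("f=%.2f  c=%2d  speedup=%.2f\n", f, c, speedup(f, c));
        printf("limit as c->infinity: %.2f\n", 1.0 / (1.0 - f));   /* about 3.33x */
        return 0;
    }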

Lecture 2 Slides 32-33 EECS 570

Amdahl's Law, f = 70%
[Chart: speedup vs. number of cores (1 to 16). The curve "speedup achieved (perfect scaling on 70%)" rises but stays below the "desired 4x speedup" line.]
Limit as c → ∞ = 1 / (1 - f) = 3.33

Lecture 2 Slide 34 EECS 570

Amdahl's Law, f = 10%
[Chart: speedup vs. number of cores (1 to 16). Even with perfect scaling of the 10% parallel portion, the speedup barely exceeds 1x; the Amdahl's law limit is just 1.11x.]

Lecture 2 Slide 35 EECS 570

Amdahl's Law, f = 98%
[Chart: speedup vs. number of cores (1 to about 128). With perfect scaling of the 98% parallel portion, the speedup climbs toward the Amdahl's law limit of 1 / (1 - f) = 50x.]

Lecture 2 Slide 36 EECS 570

Lesson
• Speedup is limited by sequential code
• Even a small percentage of sequential code can greatly limit potential speedup

Lecture 2 Slide 37 EECS 570

21st Century Computer Architecture
A CCC community white paper
http://cra.org/ccc/docs/init/21stcenturyarchitecturewhitepaper.pdf
Slides from M. Hill, HPCA 2014 keynote

Lecture 2 Slide 38 EECS 570

20th Century ICT Set Up
• Information & Communication Technology (ICT) has changed our world
❒ <long list omitted>
• Required innovations in algorithms, applications, programming languages, …, & system software
• Key (invisible) enablers of (cost-)performance gains
❒ Semiconductor technology ("Moore's Law")
❒ Computer architecture (~80x per Danowitz et al.)

Lecture 2 Slide 39 EECS 570

Enablers: Technology + Architecture
[Figure: Danowitz et al., CACM 04/2012, Figure 1, separating the gains due to technology from those due to architecture.]

Lecture 2 Slide 40 EECS 570

21st Century ICT Promises More
• Data-centric personalized health care
• Computation-driven scientific discovery
• Human network analysis
• Much more: known & unknown

Lecture 2 Slide 41 EECS 570

21st Century App Characteristics
• BIG DATA
• ALWAYS ONLINE
• SECURE/PRIVATE
Whither the enablers of future (cost-)performance gains?

Lecture 2 Slide 42 EECS 570

Technology's Challenges 1/2 (Late 20th Century → The New Reality)
• Moore's Law, 2x transistors/chip → transistor count still 2x, BUT…
• Dennard Scaling, ~constant power/chip → gone; can't repeatedly double power/chip

Lecture 2 Slide 43 EECS 570

Technology's Challenges 2/2 (Late 20th Century → The New Reality)
• Moore's Law, 2x transistors/chip → transistor count still 2x, BUT…
• Dennard Scaling, ~constant power/chip → gone; can't repeatedly double power/chip
• Modest (hidden) transistor unreliability → increasing transistor unreliability can't be hidden
• Focus on computation over communication → communication (energy) more expensive than computation
• One-time costs amortized via mass market → one-time cost much worse & want specialized platforms
How should architects step up as technology falters?

Lecture 2 Slide 44 EECS 570

"Timeline" from DARPA ISAT
[Figure: system capability (log scale) plotted against decades from the 80s to the 50s, with a "Fallow Period" marked ahead.]
Source: Advancing Computer Systems without Technology Progress, ISAT Outbrief (http://www.cs.wisc.edu/~markhill/papers/isat2012_ACSWTP.pdf), Mark D. Hill and Christos Kozyrakis, DARPA/ISAT Workshop, March 26-27, 2012. Approved for Public Release, Distribution Unlimited. The views expressed are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government.

Lecture 2 Slides 45-49 EECS 570

21st Century Computer Architecture (20th Century → 21st Century)
• Single-chip in stand-alone computer → Architecture as infrastructure: spanning sensors to clouds; performance + security, privacy, availability, programmability, …
• Cross-cutting: break current layers with new interfaces
• Performance via invisible instruction-level parallelism → Energy first: parallelism, specialization, cross-layer design
• Predictable technologies (CMOS, DRAM, & disks) → New technologies (non-volatile memory, near-threshold, 3D, photonics, …); rethink memory & storage, reliability, communication

Lecture 2 Slide 50 EECS 570

Cost-Effective Computing [Wood & Hill, IEEE Computer 1995]
• Premise: isn't Speedup(P) < P inefficient?
❒ If only throughput matters, use P computers instead…
• Key observation: much of a computer's cost is NOT the CPU
• Let Costup(P) = Cost(P) / Cost(1). Parallel computing is cost-effective if:
❒ Speedup(P) > Costup(P)
• E.g., for an SGI PowerChallenge with 500 MB of memory:
❒ Costup(32) = 8.6, so a 32-processor run is cost-effective whenever its speedup exceeds 8.6, not 32

Lecture 2 Slide 51 EECS 570

Parallel Programming Models and Interfaces

Lecture 2 Slide 52 EECS 570

Programming Models
• High-level paradigm for expressing an algorithm
❒ Examples:
❍ Functional
❍ Sequential, procedural
❍ Shared memory
❍ Message passing
• Embodied in languages that support concurrent execution
❒ Incorporated into language constructs
❒ Incorporated as libraries added to an existing sequential language
• Top-level features (for conventional models: shared memory, message passing):
❒ Multiple threads are conceptually visible to the programmer
❒ Communication/synchronization is visible to the programmer

Lecture 2 Slide 53 EECS 570

An Incomplete Taxonomy
MP systems:
• Shared memory
❒ Bus-based
❒ Switching-fabric-based (incoherent or coherent)
• Message passing
❒ COTS-based
❒ Switching-fabric-based
• Emerging & lunatic fringe
❒ VLIW
❒ SIMD
❒ Vector
❒ Data flow
❒ GPU
❒ MapReduce / WSC
❒ Systolic array
❒ Reconfigurable / FPGA
❒ …

Lecture 2 Slide 54 EECS 570

Programming Model Elements
• For both shared memory and message passing
• Processes and threads
❒ Process: a shared address space and one or more threads of control
❒ Thread: a program sequencer and private address space
❒ Task: less formal term, part of an overall job
❒ Created, terminated, scheduled, etc.
• Communication
❒ Passing of data
• Synchronization
❒ Communicating control information
❒ To assure reliable, deterministic communication

Lecture 2 Slide 55 EECS 570

Historical View
[Figure: three machine organizations, each built from processor (P), memory (M), and I/O (IO) nodes, differing in where the nodes are joined.]
• Join at I/O (network) → program with: message passing
• Join at memory → program with: shared memory
• Join at processor → program with: dataflow, SIMD, VLIW, CUDA, other data parallel

Lecture 2 Slide 56 EECS 570

Message Passing Programming Model

Lecture 2 Slide 57 EECS 570

Message Passing Programming Model
• User-level send/receive abstraction
❒ Match via local buffer (x, y), process (Q, P), and tag (t)
❒ Need naming/synchronization conventions
[Figure: process P's local address space holds address x; process Q's local address space holds address y; "send x, Q, t" in P matches "recv y, P, t" in Q.]

Lecture 2 Slide 58 EECS 570

Message Passing Architectures
• Cannot directly access the memory of another node
• IBM SP-2, Intel Paragon, Myrinet, Quadrics QSW
• Clusters of workstations (e.g., MPI on the Flux cluster)
[Figure: four nodes, each a CPU with cache ($), memory, and a NIC, connected by an interconnect.]

Lecture 2 Slide 59 EECS 570

MPI: Message Passing Interface API
• A widely used standard
❒ For a variety of distributed-memory systems
• SMP clusters, workstation clusters, MPPs, heterogeneous systems
• Also works on shared-memory MPs
❒ It is easy to emulate distributed memory on shared-memory HW
• Can be used with a number of high-level languages
• Available in the Flux cluster at Michigan
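
As a concrete example of the send/receive matching sketched a few slides back, here is a minimal MPI program in C (illustrative only, not course-provided code): rank 0 sends one integer to rank 1, and the receive matches it by communicator, source, and tag.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, x = 42, y = 0, tag = 7;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* send x to process 1 with tag t */
            MPI_Send(&x, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* receive into y from process 0 with the same tag */
            MPI_Recv(&y, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", y);
        }
        MPI_Finalize();
        return 0;
    }

Typically compiled with mpicc and launched with something like mpirun -np 2 ./a.out; the exact commands depend on the cluster.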

Lecture 2 Slide 60 EECS 570

Processes and Threads
• Lots of flexibility (an advantage of message passing):
1. Multiple threads sharing an address space
2. Multiple processes sharing an address space
3. Multiple processes with different address spaces
• and different OSes
• 1 and 2 are easily implemented on shared-memory HW (with a single OS)
❒ Process and thread creation/management is similar to shared memory
• 3 is probably more common in practice
❒ Process creation is often external to the execution environment, e.g., a shell script
❒ It is hard for a user process on one system to create a process on another OS

Lecture 2 Slide 61 EECS 570

Communication and Synchronization
• Combined in the message passing paradigm
❒ Synchronization of messages is part of the communication semantics
• Point-to-point communication
❒ From one process to another
• Collective communication
❒ Involves groups of processes
❒ e.g., broadcast

Lecture 2 Slide 62 EECS 570

Message Passing: Send()
• Send(<what>, <where-to>, <how>)
• What:
❒ A data structure or object in user space
❒ A buffer allocated from special memory
❒ A word or signal
• Where-to:
❒ A specific processor
❒ A set of specific processors
❒ A queue, dispatcher, scheduler
• How:
❒ Asynchronously vs. synchronously
❒ Typed
❒ In-order vs. out-of-order
❒ Prioritized

Lecture 2 Slide 63 EECS 570

Message Passing: Receive()
• Receive(<data>, <info>, <what>, <how>)
• Data: mechanism to return message content
❒ A buffer allocated in the user process
❒ Memory allocated elsewhere
• Info: meta-info about the message
❒ Sender ID
❒ Type, size, priority
❒ Flow control information
• What: receive only certain messages
❒ Sender ID, type, priority
• How:
❒ Blocking vs. non-blocking

Lecture 2 Slide 64 EECS 570

Synchronous vs. Asynchronous
• Synchronous Send
❒ Stall until the message has actually been received
❒ Implies a message acknowledgement from receiver to sender
• Synchronous Receive
❒ Stall until the message has actually been received
• Asynchronous Send and Receive
❒ Sender and receiver can proceed regardless
❒ Returns a request handle that can be tested for message receipt
❒ The request handle can be tested to see if the message has been sent/received
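
In MPI terms, the asynchronous case looks roughly like the sketch below (illustrative, assuming exactly two ranks and caller-provided buffers): MPI_Isend and MPI_Irecv return immediately with request handles, and the program blocks only when the data is actually needed.

    #include <mpi.h>

    /* Asynchronous exchange of one int with the other rank (assumes 2 ranks total). */
    void exchange(int rank, int *sendbuf, int *recvbuf) {
        MPI_Request reqs[2];
        int peer = 1 - rank;
        MPI_Isend(sendbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(recvbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[1]);
        /* ... overlap useful computation here ... */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* wait on the request handles */
    }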

Lecture 2 Slide 65 EECS 570

Deadlock
• Blocking communications may deadlock
• Requires careful (safe) ordering of sends/receives

Safe ordering:
    <Process 0>                      <Process 1>
    Send(Process1, Message);         Receive(Process0, Message);
    Receive(Process1, Message);      Send(Process0, Message);

May deadlock (both processes block in Send if sends are synchronous or unbuffered):
    <Process 0>                      <Process 1>
    Send(Process1, Message);         Send(Process0, Message);
    Receive(Process1, Message);      Receive(Process0, Message);
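
One common way out of the second pattern, shown as a sketch below (assuming exactly two ranks; not the course's prescribed fix), is to let the library pair the send and the receive so neither process has to complete its send before the other posts its receive.

    #include <mpi.h>

    /* Deadlock-free exchange: the library handles the send/receive pairing. */
    void safe_exchange(int rank, int *out, int *in) {
        int peer = 1 - rank;
        MPI_Sendrecv(out, 1, MPI_INT, peer, 0,
                     in,  1, MPI_INT, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

Non-blocking sends and receives (previous slide) are the other standard fix.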

Lecture 2 Slide 66 EECS 570

Message Passing Paradigm Summary
Programming model (software) point of view:
• Disjoint, separate name spaces
• "Shared nothing"
• Communication via explicit, typed messages: send & receive

Lecture 2 Slide 67 EECS 570

Message Passing Paradigm Summary
Computer engineering (hardware) point of view:
• Treat inter-process communication as an I/O device
• Critical issues:
❒ How to optimize API overhead
❒ Minimize communication latency
❒ Buffer management: how to deal with early/unsolicited messages, message typing, high-level flow control
❒ Event signaling & synchronization
❒ Library support for common functions (barrier synchronization, task distribution, scatter/gather, data structure maintenance)

Lecture 2 Slide 68 EECS 570

Shared Memory Programming Model

Lecture 2 Slide 69 EECS 570

Shared-Memory Model
[Figure: processors P1-P4 all connected to a single memory system.]
❒ Multiple execution contexts sharing a single address space
❍ Multiple programs (MIMD)
❍ Or, more frequently: multiple copies of one program (SPMD)
❒ Implicit (automatic) communication via loads and stores
❒ Theoretical foundation: PRAM model
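
A minimal pthreads sketch of this model (illustrative, not course code): several threads run the same worker function, SPMD style, and communicate implicitly through an ordinary shared array instead of explicit messages.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    static int partial[NTHREADS];            /* shared: written with ordinary stores */

    static void *worker(void *arg) {
        long id = (long)arg;
        partial[id] = (int)id * 10;          /* implicit communication via shared memory */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);        /* join orders the writes before the reads below */
        for (int i = 0; i < NTHREADS; i++)
            printf("partial[%d] = %d\n", i, partial[i]);
        return 0;
    }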

Lecture 2 Slide 70 EECS 570

Global Shared Physical Address Space
• Communication, sharing, and synchronization via loads/stores to shared variables
• Facilities for address translation between local/global address spaces
• Requires OS support to maintain this mapping
[Figure: the private portions of each process's address space (P0 … Pn) and a shared portion all map into a common physical address space; a store by P0 to a shared location is seen by a load from Pn.]

Lecture 2 Slide 71 EECS 570

Why Shared Memory?
Pluses:
❒ To applications it looks like a multitasking uniprocessor
❒ For the OS, only evolutionary extensions are required
❒ Easy to do communication without the OS
❒ Software can worry about correctness first, then performance
Minuses:
❒ Proper synchronization is complex
❒ Communication is implicit, so it is harder to optimize
❒ Hardware designers must implement it
Result:
❒ Traditionally bus-based symmetric multiprocessors (SMPs), and now CMPs, are the most successful parallel machines ever
❒ And the first with multi-billion-dollar markets

Lecture 2 Slide 72 EECS 570

Thread-Level Parallelism
• Thread-level parallelism (TLP)
❒ Collection of asynchronous tasks: not started and stopped together
❒ Data shared loosely, dynamically
• Example: database/web server (each query is a thread)
❒ accts is shared, so it can't be register allocated even if it were scalar
❒ id and amt are private variables, register allocated to r1, r2

    struct acct_t { int bal; };
    shared struct acct_t accts[MAX_ACCT];
    int id, amt;
    if (accts[id].bal >= amt) {
      accts[id].bal -= amt;
      spew_cash();
    }

    0: addi r1,accts,r3
    1: ld 0(r3),r4
    2: blt r4,r2,6
    3: sub r4,r2,r4
    4: st r4,0(r3)
    5: call spew_cash

Lecture 2 Slide 73 EECS 570

Synchronization
• Mutual exclusion: locks, …
• Order: barriers, signal-wait, …
• Implemented using read/write/modify of a shared location
❒ Language level:
❍ libraries (e.g., locks in pthreads)
❍ Programmers can write custom synchronizations
❒ Hardware ISA:
❍ e.g., test-and-set
• The OS provides support for managing threads
❒ scheduling, fork, join, futex signal/wait
We'll cover synchronization in more detail in a few weeks.
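
For the account example on the previous slides, mutual exclusion with pthreads locks might look like the following minimal sketch (illustrative, not from the course; spew_cash is the slide's placeholder and is left commented out). The lock makes the read-test-update of accts[id].bal atomic with respect to other threads.

    #include <pthread.h>

    #define MAX_ACCT 1024
    struct acct_t { int bal; };
    static struct acct_t accts[MAX_ACCT];                 /* shared data */
    static pthread_mutex_t acct_lock[MAX_ACCT];           /* one lock per account */

    void init_locks(void) {
        for (int i = 0; i < MAX_ACCT; i++)
            pthread_mutex_init(&acct_lock[i], NULL);
    }

    void withdraw(int id, int amt) {
        pthread_mutex_lock(&acct_lock[id]);               /* enter critical section */
        if (accts[id].bal >= amt) {
            accts[id].bal -= amt;
            /* spew_cash(); */
        }
        pthread_mutex_unlock(&acct_lock[id]);             /* leave critical section */
    }

A real server might embed the lock in acct_t instead; a single coarse lock also works but serializes all accounts.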

Lecture 2 Slide 74 EECS 570

Paired vs. Separate Processor/Memory?
• Separate processor/memory
❒ Uniform memory access (UMA): equal latency to all memory
+ Simple software; it doesn't matter where you put data
– Lower peak performance
❒ Bus-based UMAs are common: symmetric multiprocessors (SMPs)
• Paired processor/memory
❒ Non-uniform memory access (NUMA): faster to local memory
– More complex software: where you put data matters
+ Higher peak performance, assuming proper data placement
[Figure: left, CPUs (with caches) and memories attached separately to a shared bus; right, CPU+memory pairs connected through routers (R).]

Lecture 2 Slide 75 EECS 570

Shared vs. Point-to-Point Networks
• Shared network: e.g., bus (left)
+ Low latency
– Low bandwidth: doesn't scale beyond ~16 processors
+ The shared property simplifies cache coherence protocols (later)
• Point-to-point network: e.g., mesh or ring (right)
– Longer latency: may need multiple "hops" to communicate
+ Higher bandwidth: scales to 1000s of processors
– Cache coherence protocols are complex
[Figure: left, CPU+memory nodes on a shared bus; right, CPU+memory nodes connected point-to-point through routers (R).]

Lecture 2 Slide 76 EECS 570

Implementation #1: Snooping Bus MP
• Two basic implementations
• Bus-based systems
❒ Typically small: 2-8 (maybe 16) processors
❒ Typically processors split from memories (UMA)
❍ Sometimes multiple processors on a single chip (CMP)
❍ Symmetric multiprocessors (SMPs)
❍ Common; I use one every day
[Figure: four CPUs (with caches) and four memories attached to a shared bus.]

Lecture 2 Slide 77 EECS 570

Implementation #2: Scalable MP
• General point-to-point network-based systems
❒ Typically processor/memory/router blocks (NUMA)
❍ Glueless MP: no need for additional "glue" chips
❒ Can be arbitrarily large: 1000's of processors
❍ Massively parallel processors (MPPs)
❒ In reality only the government (DoD) has MPPs…
❍ Companies have much smaller systems: 32-64 processors
❍ Scalable multi-processors
[Figure: CPU+memory nodes connected point-to-point through routers (R).]

Lecture 2 Slide 78 EECS 570

Cache Coherence
• Two $100 withdrawals from account #241 at two ATMs
❒ Each transaction maps to a thread on a different processor
❒ Track accts[241].bal (its address is in r3)

Processor 0 and Processor 1 each run:
    0: addi r1,accts,r3
    1: ld 0(r3),r4
    2: blt r4,r2,6
    3: sub r4,r2,r4
    4: st r4,0(r3)
    5: call spew_cash
[Figure: CPU0 and CPU1 connected to a shared memory.]

Lecture 2 Slide 79 EECS 570

No-Cache, No-Problem
• Scenario I: processors have no caches
❒ No problem
Both processors run the code from the previous slide. The memory value of accts[241].bal goes 500 → 400 → 300: P0 loads 500 and stores 400, then P1 loads 400 and stores 300, so both withdrawals are reflected correctly.

Lecture 2 Slide 80 EECS 570

Cache Incoherence
• Scenario II: processors have write-back caches
❒ Potentially 3 copies of accts[241].bal: memory, P0$, P1$
❒ They can become incoherent (inconsistent)
Both processors run the code from before. The state (P0 cache | memory | P1 cache) evolves as:
    -     | 500 | -
    V:500 | 500 | -        (P0 loads)
    D:400 | 500 | -        (P0 stores)
    D:400 | 500 | V:500    (P1 loads the stale 500 from memory)
    D:400 | 500 | D:400    (P1 stores; one withdrawal is lost)

Lecture 2 Slide 81 EECS 570

Snooping Cache-Coherence Protocols
The bus provides a serialization point.
Each cache controller "snoops" all bus transactions
❒ and takes action to ensure coherence
❍ invalidate
❍ update
❍ supply value
❒ The action depends on the state of the block and the protocol.

Lecture 2 Slide 82 EECS 570

Scalable Cache Coherence
• Scalable cache coherence: a two-part solution
• Part I: bus bandwidth
❒ Replace the non-scalable bandwidth substrate (bus)…
❒ …with a scalable-bandwidth one (point-to-point network, e.g., mesh)
• Part II: processor snooping bandwidth
❒ Interesting: most snoops result in no action
❒ Replace the non-scalable broadcast protocol (spam everyone)…
❒ …with a scalable directory protocol (only spam processors that care)
• We will cover this in Unit 3

Lecture 2 Slide 83 EECS 570

Shared Memory Summary
• Shared-memory multiprocessors
+ Simple software: easy data sharing, handles both DLP and TLP
– Complex hardware: must provide the illusion of a global address space
• Two basic implementations
❒ Symmetric (UMA) multi-processors (SMPs)
❍ Underlying communication network: bus (ordered)
+ Low latency, simple protocols that rely on global order
– Low bandwidth, poor scalability
❒ Scalable (NUMA) multi-processors (MPPs)
❍ Underlying communication network: point-to-point (unordered)
+ Scalable bandwidth
– Higher latency, complex protocols
