
Page 1: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 1 EECS 570

EECS 570 Lecture 2: Message Passing & Shared Memory, Winter 2018

Prof. Satish Narayanasamy

http://www.eecs.umich.edu/courses/eecs570/

Slides developed in part by Drs. Adve, Falsafi, Martin, Musuvathi, Narayanasamy, Nowatzyk, Wenisch, Sarkar, Mikko Lipasti, Jim Smith, John Shen, Mark Hill, David Wood, Guri Sohi, Natalie Enright Jerger, Michel Dubois, Murali Annavaram, Per Stenström, and probably others

Intel Paragon XP/S

Page 2: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 2 EECS 570

Announcements
• Discussion this Friday.
• Will discuss course project

Page 3: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 3 EECS 570

Readings

For today:
❒ David Wood and Mark Hill. “Cost-Effective Parallel Computing,” IEEE Computer, 1995.
❒ Mark Hill et al. “21st Century Computer Architecture.” CCC White Paper, 2012.

For Wednesday 1/13:
❒ Seiler et al. “Larrabee: A Many-Core x86 Architecture for Visual Computing.” SIGGRAPH 2008.

Page 4: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 4 EECS 570

Sequential Merge Sort

[Figure: sequential execution timeline on a 16MB input (32-bit integers): recurse (left), recurse (right), merge to scratch array, copy back to input array]

Page 5: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 5 EECS 570

Parallel Merge Sort (as Parallel Directed Acyclic Graph)

[Figure: same 16MB input (32-bit integers); recurse (left) and recurse (right) run in parallel, followed by merge to scratch array and copy back to input array; the time axis shows the shorter parallel execution]

Page 6: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 6 EECS 570

Parallel DAG for Merge Sort (2-core)

[Figure: two Sequential Sort nodes running in parallel, both feeding a Merge node, plotted against time]

Page 7: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 7 EECS 570

Parallel DAG for Merge Sort (4-core)

Page 8: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 8 EECS 570

Parallel DAG for Merge Sort (8-core)

Page 9: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 9 EECS 570

The DAG Execution Model of a Parallel Computation

• Given an input, dynamically create a DAG

• Nodes represent sequential computation
❒ Weighted by the amount of work

• Edges represent dependencies:
❒ Node A → Node B means that B cannot be scheduled unless A is finished

(A fork-join sketch of this model for merge sort follows below.)
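As a concrete (if deliberately naive) rendering of this DAG model, here is a fork-join merge sort in C with POSIX threads. It is my own sketch, not course code; it forks one thread per recursive call purely to mirror the DAG nodes, and the array size and cutoff are arbitrary.

/* Fork-join merge sort: each pthread_create/pthread_join pair is a
 * fork/join edge in the DAG; the merge and copy-back form the join node
 * that cannot start until both children finish.  Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define N 16

typedef struct { int *a, *tmp, lo, hi; } job_t;

static void merge_sort(int *a, int *tmp, int lo, int hi);

static void *worker(void *p) {
    job_t *j = (job_t *)p;
    merge_sort(j->a, j->tmp, j->lo, j->hi);
    return NULL;
}

static void merge_sort(int *a, int *tmp, int lo, int hi) {
    if (hi - lo <= 1) return;                    /* leaf node of the DAG */
    int mid = lo + (hi - lo) / 2;

    job_t left = { a, tmp, lo, mid };
    pthread_t t;
    pthread_create(&t, NULL, worker, &left);     /* fork: recurse(left)        */
    merge_sort(a, tmp, mid, hi);                 /* this thread: recurse(right) */
    pthread_join(t, NULL);                       /* join: both children done    */

    /* merge to scratch array, then copy back to the input array */
    int i = lo, k = mid, o = lo;
    while (i < mid && k < hi) tmp[o++] = a[i] <= a[k] ? a[i++] : a[k++];
    while (i < mid) tmp[o++] = a[i++];
    while (k < hi)  tmp[o++] = a[k++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(int));
}

int main(void) {
    int a[N] = {9,3,7,1,15,4,12,8,2,14,6,10,0,13,5,11}, tmp[N];
    merge_sort(a, tmp, 0, N);
    for (int i = 0; i < N; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}

Forking a thread per node is wasteful in practice (a real implementation would stop spawning below a cutoff), but it makes the correspondence to the DAG on the previous slides explicit.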

Page 10: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 10 EECS 570

Sorting 16 elements in four cores

Page 11: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 11 EECS 570

Sorting 16 elements on four cores (4-element arrays sorted in constant time)

[Figure: the merge-sort DAG annotated with node weights; the leaf sorts have weight 1, the intermediate merges weight 8, and the final merge weight 16]

Page 12: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 12 EECS 570

Performance Measures
• Given a graph G, a scheduler S, and P processors

• TP(S): time on P processors using scheduler S
• TP: time on P processors using the best scheduler
• T1: time on a single processor (sequential cost)
• T∞: time assuming infinite resources

Page 13: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 13 EECS 570

Work and Depth

• T1 = Work
❒ The total number of operations executed by a computation

• T∞ = Depth
❒ The longest chain of sequential dependencies (critical path) in the parallel DAG

Page 14: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 14 EECS 570

T∞ (Depth): Critical Path Length (Sequential Bottleneck)

Page 15: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 15 EECS 570

T1 (Work): Time to Run Sequentially

Page 16: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 16 EECS 570

Sorting 16 elements on four cores (4-element arrays sorted in constant time)

[Same weighted DAG figure as slide 11, with leaf sorts of weight 1, intermediate merges of weight 8, and a final merge of weight 16]

Work = ?
Depth = ?

Page 17: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 17 EECS 570

Some Useful Theorems

Page 18: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 18 EECS 570

Work Law
• “You cannot avoid work by parallelizing”

T1 / P ≤ TP

Page 19: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 19 EECS 570

Work Law
• “You cannot avoid work by parallelizing”

T1 / P ≤ TP

Speedup = T1 / TP

Page 20: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 20 EECS 570

Work Law
• “You cannot avoid work by parallelizing”

• Can speedup be more than 2 when we go from 1 core to 2 cores in practice?

T1 / P ≤ TP

Speedup = T1 / TP

Page 21: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 21 EECS 570

Depth Law
• More resources should make things faster
• You are limited by the sequential bottleneck

TP ≥ T∞

Page 22: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 22 EECS 570

Amount of Parallelism

Parallelism = T1 / T∞

Page 23: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 23 EECS 570

Maximum Speedup Possible

“Speedup is bounded above by available parallelism”

Speedup = T1 / TP ≤ T1 / T∞ = Parallelism

Page 24: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 24 EECS 570

Greedy Scheduler

• If more than P nodes can be scheduled, pick any subset of size P
• If fewer than P nodes can be scheduled, schedule them all

(A small list-scheduler sketch follows below.)
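A minimal sketch of the greedy rule on an explicit DAG, in C. This is my own example; the 7-node DAG, the unit weights, and P = 2 are made up purely for illustration.

/* Greedy (level-by-level) list scheduler on a small DAG. */
#include <stdio.h>

#define N 7   /* number of DAG nodes  */
#define P 2   /* number of processors */

/* succ[i][j] != 0 means an edge i -> j (j depends on i).
 * A 7-node fork-join DAG: 0 -> {1,2}, 1 -> {3,4}, 2 -> {5}, {3,4,5} -> 6. */
static const int succ[N][N] = {
    [0] = {0,1,1,0,0,0,0},
    [1] = {0,0,0,1,1,0,0},
    [2] = {0,0,0,0,0,1,0},
    [3] = {0,0,0,0,0,0,1},
    [4] = {0,0,0,0,0,0,1},
    [5] = {0,0,0,0,0,0,1},
};

int main(void) {
    int indeg[N] = {0}, done[N] = {0}, finished = 0, step = 0;

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            indeg[j] += succ[i][j];

    while (finished < N) {
        int ready[N], nready = 0;
        for (int i = 0; i < N; i++)
            if (!done[i] && indeg[i] == 0) ready[nready++] = i;

        /* Greedy rule: run min(#ready, P) ready nodes this step. */
        int run = nready < P ? nready : P;
        printf("step %d:", ++step);
        for (int k = 0; k < run; k++) {
            int v = ready[k];
            printf(" node %d", v);
            done[v] = 1;
            finished++;
            for (int j = 0; j < N; j++)
                if (succ[v][j]) indeg[j]--;   /* release successors */
        }
        printf("\n");
    }
    return 0;
}

At every step the scheduler runs min(#ready, P) ready nodes, which is exactly the two cases in the bullets above.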

Page 25: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 25 EECS 570

More reading: http://www.cs.cmu.edu/afs/cs/academic/class/15492-f07/www/scribe/lec4/lecture4.pdf

Page 26: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 26 EECS 570

Performance of the Greedy Scheduler

Work law: T1 / P ≤ TP
Depth law: T∞ ≤ TP

TP(Greedy) ≤ T1 / P + T∞

Page 27: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 27 EECS 570

Greedy is optimal within a factor of 2

Work law: T1 / P ≤ TP
Depth law: T∞ ≤ TP

TP ≤ TP(Greedy) ≤ 2 TP

(A one-line derivation follows below.)
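The factor-of-2 claim follows in one line from the two laws above (my own phrasing of the standard argument):

TP ≤ TP(Greedy) ≤ T1 / P + T∞ ≤ TP + TP = 2 TP

The leftmost inequality holds because no scheduler, greedy included, can beat the best scheduler; the last step substitutes the work law (T1 / P ≤ TP) and the depth law (T∞ ≤ TP) into the greedy bound from the previous slide.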

Page 28: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 28 EECS 570

Work/Depth of Merge Sort (Sequential Merge)

• Work T1: O(n log n)
• Depth T∞: O(n)
❒ Takes O(n) time to merge n elements

• Parallelism:
❒ T1 / T∞ = O(log n) → really bad!

(The recurrences behind these bounds are sketched below.)

Page 29: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 29 EECS 570

Main Message
• Analyze the Work and Depth of your algorithm
• Parallelism is Work / Depth
• Try to decrease Depth
❒ the critical path
❒ a sequential bottleneck

• If you increase Depth
❒ better increase Work by a lot more!

Page 30: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 30 EECS 570

Amdahl’s Law
• Sorting takes 70% of the execution time of a sequential program

• You replace the sorting algorithm with one that scales perfectly on multi-core hardware

• How many cores do you need to get a 4× speedup on the program?

Page 31: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 31 EECS 570

Amdahl’s law, f = 70%

f = the parallel portion of execution
1 − f = the sequential portion of execution
c = number of cores used

Speedup(f, c) = 1 / ((1 − f) + f / c)

(The slide-30 question is worked below.)
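Working the slide-30 question with this formula (my own arithmetic, not on the slide): a 4× speedup with f = 0.7 would need

1 / ((1 − 0.7) + 0.7 / c) = 4  ⇒  0.3 + 0.7 / c = 0.25  ⇒  0.7 / c = −0.05

which no positive c satisfies: the 30% sequential portion alone caps the speedup at 1 / 0.3 ≈ 3.33×, as the next two slides plot. No number of cores gives 4×.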

Page 32: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 32 EECS 570

Amdahl’s law, f = 70%

[Plot: speedup vs. # cores (1–16); the curve achieved with perfect scaling on the 70% parallel portion stays below the desired 4× speedup line]

Page 33: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 33 EECS 570

Amdahl’s law, f = 70%

[Same plot: speedup vs. # cores (1–16), perfect scaling on the 70% parallel portion vs. the desired 4× speedup]

Limit as c → ∞ = 1 / (1 − f) = 3.33

Page 34: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 34 EECS 570

Amdahl’s law, f = 10%

[Plot: speedup vs. # cores (1–16) with perfect scaling on the 10% parallel portion; the Amdahl’s law limit is just 1.11×]

Page 35: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 35 EECS 570

Amdahl’s law, f = 98%

[Plot: speedup vs. # cores (1–127) with perfect scaling on the 98% parallel portion, climbing toward the 1 / (1 − f) = 50× limit]

Page 36: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 36 EECS 570

Lesson
• Speedup is limited by sequential code
• Even a small percentage of sequential code can greatly limit potential speedup

Page 37: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 37 EECS 570

21st Century Computer Architecture

A CCC community white paper
http://cra.org/ccc/docs/init/21stcenturyarchitecturewhitepaper.pdf

Slides from M. Hill, HPCA 2014 Keynote

Page 38: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 38 EECS 570

20th Century ICT Set Up

• Information & Communication Technology (ICT) has changed our world
❒ <long list omitted>

• Required innovations in algorithms, applications, programming languages, …, & system software

• Key (invisible) enablers of (cost-)performance gains:
❒ Semiconductor technology (“Moore’s Law”)
❒ Computer architecture (~80× per Danowitz et al.)

Page 39: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 39 EECS 570

Enablers: Technology + Architecture

[Figure: Danowitz et al., CACM 04/2012, Figure 1, showing gains attributed to Technology vs. Architecture]

Page 40: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 40 EECS 570

21st Century ICT Promises More

Data-centric personalized health care. Computation-driven scientific discovery. Human network analysis. Much more: known & unknown.

Page 41: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 41 EECS 570

21st Century App Characteristics

BIG DATA. ALWAYS ONLINE. SECURE/PRIVATE.

Whither enablers of future (cost-)performance gains?

Page 42: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 42 EECS 570

Technology’s Challenges 1/2

Late 20th Century → The New Reality
• Moore’s Law (2× transistors/chip) → Transistor count still 2×, BUT…
• Dennard Scaling (~constant power/chip) → Gone. Can’t repeatedly double power/chip

Page 43: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 43 EECS 570

Technology’s Challenges 2/2

Late 20th Century → The New Reality
• Moore’s Law (2× transistors/chip) → Transistor count still 2×, BUT…
• Dennard Scaling (~constant power/chip) → Gone. Can’t repeatedly double power/chip
• Modest (hidden) transistor unreliability → Increasing transistor unreliability can’t be hidden
• Focus on computation over communication → Communication (energy) more expensive than computation
• One-time costs amortized via mass market → One-time cost much worse & want specialized platforms

How should architects step up as technology falters?

Page 44: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 44 EECS 570

“Timeline” from DARPA ISAT

[Figure: system capability (log scale) vs. decade, 80s through 50s, with a “Fallow Period” marked]

Source: Advancing Computer Systems without Technology Progress, ISAT Outbrief (http://www.cs.wisc.edu/~markhill/papers/isat2012_ACSWTP.pdf)
Mark D. Hill and Christos Kozyrakis, DARPA/ISAT Workshop, March 26-27, 2012. Approved for Public Release, Distribution Unlimited.
The views expressed are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government.

Page 45: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 45 EECS 570

21st Century Comp Architecture

20th Century → 21st Century
• Single-chip in generic computer → Architecture as infrastructure: spanning sensors to clouds; performance + security, privacy, availability, programmability, …
• Performance via invisible instruction-level parallelism → Energy first: parallelism, specialization, cross-layer design
• Predictable technologies: CMOS, DRAM, & disks → New technologies (non-volatile memory, near-threshold, 3D, photonics, …); rethink memory & storage, reliability, communication
• Cross-cutting: break current layers with new interfaces

Page 46: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 46 EECS 570

[Repeat of the slide-45 table (presentation build)]

Page 47: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 47 EECS 570

[Repeat of the slide-45 table (presentation build)]

Page 48: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 48 EECS 570

[Repeat of the slide-45 table (presentation build)]

Page 49: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 49 EECS 570

[Final build of the slide-45 table; the 20th-century entry reads “Single-chip in stand-alone computer”]

Page 50: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 50 EECS 570

Cost-Effective Computing [Wood & Hill, IEEE Computer 1995]

• Premise: isn’t Speedup(P) < P inefficient?
❒ If only throughput matters, use P computers instead…

• Key observation: much of a computer’s cost is NOT the CPU
• Let Costup(P) = Cost(P) / Cost(1)
• Parallel computing is cost-effective if:
❒ Speedup(P) > Costup(P)

• E.g., for an SGI PowerChallenge w/ 500MB:
❒ Costup(32) = 8.6 (worked below)

Page 51: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 51 EECS 570

Parallel Programming Models and Interfaces

Page 52: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 52 EECS 570

Programming Models
• High-level paradigm for expressing an algorithm
❒ Examples:
❍ Functional
❍ Sequential, procedural
❍ Shared memory
❍ Message passing

• Embodied in languages that support concurrent execution
❒ Incorporated into language constructs
❒ Incorporated as libraries added to an existing sequential language

• Top-level features (for conventional models: shared memory, message passing):
❒ Multiple threads are conceptually visible to the programmer
❒ Communication/synchronization are visible to the programmer

Page 53: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 53 EECS 570

An Incomplete Taxonomy

[Tree figure] MP Systems:
• Shared Memory
❒ Bus-based
❒ Switching-fabric-based
❒ (Incoherent vs. coherent)
• Message Passing
❒ COTS-based
❒ Switching-fabric-based
• Emerging & Lunatic Fringe
❒ VLIW, SIMD, Vector, Dataflow, GPU, MapReduce/WSC, Systolic Array, Reconfigurable/FPGA, …

Page 54: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 54 EECS 570

Programming Model Elements
• For both shared memory and message passing
• Processes and threads
❒ Process: a shared address space and one or more threads of control
❒ Thread: a program sequencer and private address space
❒ Task: less formal term, part of an overall job
❒ Created, terminated, scheduled, etc.

• Communication
❒ Passing of data

• Synchronization
❒ Communicating control information
❒ To assure reliable, deterministic communication

Page 55: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 55 EECS 570

Historical View

[Diagram: three ways to join processor (P), memory (M), and I/O (IO) nodes]
• Join at I/O (network) → program with message passing
• Join at memory → program with shared memory
• Join at processor → program with dataflow, SIMD, VLIW, CUDA, other data-parallel models

Page 56: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 56 EECS 570

Message Passing Programming Model

Page 57: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 57 EECS 570

Message Passing Programming Model

• User-level send/receive abstraction
❒ Match via local buffer (x, y), process (Q, P), and tag (t)
❒ Need naming/synchronization conventions

[Diagram: Process P’s local address space holds address x; Process Q’s local address space holds address y; "send x, Q, t" in P matches "recv y, P, t" in Q]

Page 58: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 58 EECS 570

Message Passing Architectures

• Cannot directly access memory of another node

• IBM SP-2, Intel Paragon, Myrinet, Quadrics QSW

• Clusters of workstations (e.g., MPI on the Flux cluster)

[Diagram: CPU($) + Mem + NIC nodes connected by an interconnect]

Page 59: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 59 EECS 570

MPI – Message Passing Interface API

• A widely used standard
❒ For a variety of distributed-memory systems

• SMP clusters, workstation clusters, MPPs, heterogeneous systems

• Also works on shared-memory MPs
❒ Easy to emulate distributed memory on shared-memory HW

• Can be used with a number of high-level languages

• Available in the Flux cluster at Michigan

(A minimal send/receive example follows below.)
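A minimal MPI point-to-point sketch in C (my own example, not course material), matching the send/receive abstraction from slide 57: rank 0 sends one integer to rank 1 with tag 0.

/* Build with mpicc, run with mpirun -np 2. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* send <what> = &value, <where-to> = rank 1, tag 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        /* receive into a local buffer, matching on sender 0 and tag 0 */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

The receive matches on source, tag, and communicator, mirroring the buffer/process/tag matching described two slides earlier.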

Page 60: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 60 EECS 570

Processes and Threads
• Lots of flexibility (advantage of message passing):
1. Multiple threads sharing an address space
2. Multiple processes sharing an address space
3. Multiple processes with different address spaces
• and different OSes

• 1 and 2 easily implemented on shared-memory HW (with a single OS)
❒ Process and thread creation/management similar to shared memory

• 3 probably more common in practice
❒ Process creation often external to the execution environment, e.g., a shell script
❒ Hard for a user process on one system to create a process on another OS

Page 61: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 61 EECS 570

Communication and Synchronization
• Combined in the message-passing paradigm
❒ Synchronization of messages is part of the communication semantics

• Point-to-point communication
❒ From one process to another

• Collective communication
❒ Involves groups of processes
❒ e.g., broadcast

Page 62: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 62 EECS 570

Message Passing: Send()

• Send(<what>, <where-to>, <how>)

• What:
❒ A data structure or object in user space
❒ A buffer allocated from special memory
❒ A word or signal

• Where-to:
❒ A specific processor
❒ A set of specific processors
❒ A queue, dispatcher, scheduler

• How:
❒ Asynchronously vs. synchronously
❒ Typed
❒ In-order vs. out-of-order
❒ Prioritized

Page 63: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 63 EECS 570

Message Passing: Receive()

• Receive(<data>, <info>, <what>, <how>)

• Data: mechanism to return message content
❒ A buffer allocated in the user process
❒ Memory allocated elsewhere

• Info: meta-info about the message
❒ Sender ID
❒ Type, size, priority
❒ Flow-control information

• What: receive only certain messages
❒ Sender ID, type, priority

• How:
❒ Blocking vs. non-blocking

Page 64: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 64 EECS 570

Synchronous vs. Asynchronous

• Synchronous Send
❒ Stall until the message has actually been received
❒ Implies a message acknowledgement from receiver to sender

• Synchronous Receive
❒ Stall until the message has actually been received

• Asynchronous Send and Receive
❒ Sender and receiver can proceed regardless
❒ Returns a request handle that can be tested to see whether the message has been sent/received

(A non-blocking MPI sketch follows below.)
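A sketch of the asynchronous case using MPI's non-blocking calls (my own example): MPI_Isend/MPI_Irecv return immediately with a request handle, and MPI_Wait (or MPI_Test) later completes or checks it, as the bullets above describe.

/* Run with two ranks; computation can overlap the message in flight. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, out = 7, in = 0;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Isend(&out, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* ... do useful work here while the message is in flight ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* complete the send */
    } else if (rank == 1) {
        MPI_Irecv(&in, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        /* ... other work ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* block until data arrives */
        printf("rank 1 got %d\n", in);
    }

    MPI_Finalize();
    return 0;
}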

Page 65: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 65 EECS 570

Deadlock
• Blocking communications may deadlock
• Requires careful (safe) ordering of sends/receives

Safe:
<Process 0>                      <Process 1>
Send(Process1, Message);         Receive(Process0, Message);
Receive(Process1, Message);      Send(Process0, Message);

Unsafe (with synchronous semantics, both processes block in Send and neither reaches its Receive):
<Process 0>                      <Process 1>
Send(Process1, Message);         Send(Process0, Message);
Receive(Process1, Message);      Receive(Process0, Message);

(An MPI alternative is sketched below.)
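One common way to sidestep this ordering problem in MPI (my addition, not on the slide) is the combined send-receive call, which lets the library pair up the transfers and avoid the circular wait:

/* Deadlock-free symmetric exchange using MPI_Sendrecv (my own sketch):
 * each of two ranks sends its rank number to the other and receives the
 * peer's in one combined call, so neither manual ordering is needed. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, peer, got;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;                      /* assumes exactly two ranks */

    MPI_Sendrecv(&rank, 1, MPI_INT, peer, 0,
                 &got,  1, MPI_INT, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %d\n", rank, got);
    MPI_Finalize();
    return 0;
}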

Page 66: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 66 EECS 570

Message Passing Paradigm Summary

Programming model (software) point of view:

• Disjoint, separate name spaces

• “Shared nothing”

• Communication via explicit, typed messages: send & receive

Page 67: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 67 EECS 570

Message Passing Paradigm Summary

Computer engineering (hardware) point of view:

• Treat inter-process communication as an I/O device

• Critical issues:
❒ How to optimize API overhead
❒ Minimize communication latency
❒ Buffer management: how to deal with early/unsolicited messages, message typing, high-level flow control
❒ Event signaling & synchronization
❒ Library support for common functions (barrier synchronization, task distribution, scatter/gather, data structure maintenance)

Page 68: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 68 EECS 570

Shared Memory Programming Model

Page 69: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 69 EECS 570

Shared-Memory Model

[Diagram: P1–P4 connected to a single Memory System]

❒ Multiple execution contexts sharing a single address space
❍ Multiple programs (MIMD)
❍ Or, more frequently: multiple copies of one program (SPMD)
❒ Implicit (automatic) communication via loads and stores
❒ Theoretical foundation: PRAM model

Page 70: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 70 EECS 570

Global Shared Physical Address Space

• Communication, sharing, and synchronization via loads/stores to shared variables
• Facilities for address translation between local/global address spaces
• Requires OS support to maintain this mapping

[Diagram: each process P0…Pn has a private portion and a shared portion of its address space, all mapped onto a common physical address space; a store by P0 and a load by Pn can reference the same shared location]

Page 71: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 71 EECS 570

Why Shared Memory?

Pluses
❒ For applications, looks like a multitasking uniprocessor
❒ For the OS, only evolutionary extensions required
❒ Easy to do communication without the OS
❒ Software can worry about correctness first, then performance

Minuses
❒ Proper synchronization is complex
❒ Communication is implicit, so it is harder to optimize
❒ Hardware designers must implement it

Result
❒ Traditionally bus-based Symmetric Multiprocessors (SMPs), and now CMPs, are the most successful parallel machines ever
❒ And the first with multi-billion-dollar markets

Page 72: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 72 EECS 570

Thread-Level Parallelism

• Thread-level parallelism (TLP)
❒ Collection of asynchronous tasks: not started and stopped together
❒ Data shared loosely, dynamically

• Example: database/web server (each query is a thread)
❒ accts is shared; can’t register-allocate it even if it were scalar
❒ id and amt are private variables, register-allocated to r1, r2

struct acct_t { int bal; };
shared struct acct_t accts[MAX_ACCT];
int id, amt;
if (accts[id].bal >= amt) {
  accts[id].bal -= amt;
  spew_cash();
}

0: addi r1,accts,r3    // r3 = &accts[id]
1: ld   0(r3),r4       // r4 = accts[id].bal
2: blt  r4,r2,6        // skip withdrawal if bal < amt
3: sub  r4,r2,r4       // r4 = bal - amt
4: st   r4,0(r3)       // write back the new balance
5: call spew_cash

Page 73: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 73 EECS 570

Synchronization
• Mutual exclusion: locks, …
• Order: barriers, signal-wait, …

• Implemented using read/write/modify to a shared location
❒ Language level:
❍ libraries (e.g., locks in pthreads)
❍ Programmers can write custom synchronizations
❒ Hardware ISA
❍ e.g., test-and-set

• OS provides support for managing threads
❒ scheduling, fork, join, futex signal/wait

We’ll cover synchronization in more detail in a few weeks.
(A pthreads lock sketch for the account example follows below.)
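A sketch of the mutual-exclusion case for the account example from slide 72, using a pthreads lock. This is my own code: the single global mutex, the MAX_ACCT value, and the printf stand-in for spew_cash() are illustrative choices; in the two-ATM scenario, each ATM thread would call withdraw() concurrently.

#include <pthread.h>
#include <stdio.h>

#define MAX_ACCT 1024

struct acct_t { int bal; };
static struct acct_t accts[MAX_ACCT];            /* shared */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void withdraw(int id, int amt) {          /* id, amt are private */
    pthread_mutex_lock(&lock);                   /* enter critical section */
    if (accts[id].bal >= amt) {
        accts[id].bal -= amt;                    /* check + update now atomic */
        printf("dispensing %d\n", amt);          /* stand-in for spew_cash() */
    }
    pthread_mutex_unlock(&lock);                 /* leave critical section */
}

int main(void) {
    accts[241].bal = 500;
    withdraw(241, 100);
    withdraw(241, 100);
    printf("balance = %d\n", accts[241].bal);    /* prints 300 */
    return 0;
}

Holding the lock across the check and the update makes the load/subtract/store sequence atomic with respect to the other thread, which is exactly what the unsynchronized assembly on slide 72 lacks.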

Page 74: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 74 EECS 570

Paired vs. Separate Processor/Memory?
• Separate processor/memory
❒ Uniform memory access (UMA): equal latency to all memory
+ Simple software: doesn’t matter where you put data
– Lower peak performance
❒ Bus-based UMAs common: symmetric multiprocessors (SMP)

• Paired processor/memory
❒ Non-uniform memory access (NUMA): faster to local memory
– More complex software: where you put data matters
+ Higher peak performance, assuming proper data placement

[Diagrams: UMA (CPU($) nodes and memories on a shared bus) vs. NUMA (CPU($)+Mem nodes joined by routers R)]

Page 75: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 75 EECS 570

Shared vs. Point-to-Point Networks
• Shared network: e.g., bus (left)
+ Low latency
– Low bandwidth: doesn’t scale beyond ~16 processors
+ Shared property simplifies cache coherence protocols (later)

• Point-to-point network: e.g., mesh or ring (right)
– Longer latency: may need multiple “hops” to communicate
+ Higher bandwidth: scales to 1000s of processors
– Cache coherence protocols are complex

[Diagrams: bus-connected CPU($)+Mem nodes (left) vs. router-connected CPU($)+Mem nodes (right)]

Page 76: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 76 EECS 570

Implementation #1: Snooping Bus MP

• Two basic implementations

• Bus-based systems
❒ Typically small: 2–8 (maybe 16) processors
❒ Typically processors split from memories (UMA)
❍ Sometimes multiple processors on a single chip (CMP)
❍ Symmetric multiprocessors (SMPs)
❍ Common; I use one every day

[Diagram: CPU($) nodes and memories on a shared bus]

Page 77: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 77 EECS 570

Implementation #2: Scalable MP

• General point-to-point network-based systems
❒ Typically processor/memory/router blocks (NUMA)
❍ Glueless MP: no need for additional “glue” chips
❒ Can be arbitrarily large: 1000s of processors
❍ Massively parallel processors (MPPs)
❒ In reality, only the government (DoD) has MPPs…
❍ Companies have much smaller systems: 32–64 processors
❍ Scalable multi-processors

[Diagram: CPU($)+Mem+router (R) nodes in a point-to-point network]

Page 78: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 78 EECS 570

Cache Coherence

• Two $100 withdrawals from account #241 at two ATMs
❒ Each transaction maps to a thread on a different processor
❒ Track accts[241].bal (address is in r3)

Processor 0                      Processor 1
0: addi r1,accts,r3              0: addi r1,accts,r3
1: ld   0(r3),r4                 1: ld   0(r3),r4
2: blt  r4,r2,6                  2: blt  r4,r2,6
3: sub  r4,r2,r4                 3: sub  r4,r2,r4
4: st   r4,0(r3)                 4: st   r4,0(r3)
5: call spew_cash                5: call spew_cash

[Diagram: CPU0 and CPU1 sharing Mem]

Page 79: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 79 EECS 570

No-Cache, No-Problem

• Scenario I: processors have no caches
❒ No problem

[Trace: both threads run the withdrawal code directly against memory; accts[241].bal goes 500 → 400 → 300, so both withdrawals are reflected in this interleaving]

Page 80: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 80 EECS 570

Cache Incoherence

• Scenario II: processors have write-back caches
❒ Potentially 3 copies of accts[241].bal: memory, P0 $, P1 $
❒ Can get incoherent (inconsistent)

[Trace (P0 $ / memory / P1 $): memory starts at 500; P0 loads (V:500) and stores (D:400) in its cache while memory stays 500; P1 then loads the stale 500 from memory (V:500) and stores D:400 in its own cache; both caches end dirty at 400 and one withdrawal is lost]

Page 81: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 81 EECS 570

Snooping Cache-Coherence Protocols

The bus provides a serialization point.

Each cache controller “snoops” all bus transactions
❒ takes action to ensure coherence:
❍ invalidate
❍ update
❍ supply value
❒ the action depends on the state of the block and the protocol

Page 82: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 82 EECS 570

Scalable Cache Coherence
• Scalable cache coherence: two-part solution

• Part I: bus bandwidth
❒ Replace the non-scalable bandwidth substrate (bus)…
❒ …with a scalable-bandwidth one (point-to-point network, e.g., mesh)

• Part II: processor snooping bandwidth
❒ Interesting: most snoops result in no action
❒ Replace the non-scalable broadcast protocol (spam everyone)…
❒ …with a scalable directory protocol (only spam processors that care)

• We will cover this in Unit 3

Page 83: EECS 570 Lecture 2 Message Passing & Shared Memory

Lecture 2 Slide 83 EECS 570

Shared Memory Summary
• Shared-memory multiprocessors
+ Simple software: easy data sharing, handles both DLP and TLP
– Complex hardware: must provide the illusion of a global address space

• Two basic implementations
❒ Symmetric (UMA) multi-processors (SMPs)
❍ Underlying communication network: bus (ordered)
+ Low latency, simple protocols that rely on global order
– Low bandwidth, poor scalability
❒ Scalable (NUMA) multi-processors (MPPs)
❍ Underlying communication network: point-to-point (unordered)
+ Scalable bandwidth
– Higher latency, complex protocols