EECS 570 Lecture 2: Message Passing & Shared Memory


Lecture 2 Slide 1 EECS 570

EECS 570: Lecture 2, Message Passing & Shared Memory, Winter 2018

Prof. Satish Narayanasamy

http://www.eecs.umich.edu/courses/eecs570/

Slides developed in part by Drs. Adve, Falsafi, Martin, Musuvathi, Narayanasamy, Nowatzyk, Wenisch, Sarkar, Mikko Lipasti, Jim Smith, John Shen, Mark Hill, David Wood, Guri Sohi, Natalie Enright Jerger, Michel Dubois, Murali Annavaram, Per Stenström, and probably others

[Photo: Intel Paragon XP/S]

Lecture 2 Slide 2 EECS 570

Announcements
• Discussion this Friday: will discuss the course project

Lecture 2 Slide 3 EECS 570

Readings
For today:
❒ David Wood and Mark Hill. "Cost-Effective Parallel Computing," IEEE Computer, 1995.
❒ Mark Hill et al. "21st Century Computer Architecture." CCC White Paper, 2012.
For Wednesday 1/13:
❒ Seiler et al. "Larrabee: A Many-Core x86 Architecture for Visual Computing." SIGGRAPH 2008.

Lecture 2 Slide 4 EECS 570

Sequential Merge Sort
[Figure: sequential execution timeline for merge sort on a 16 MB input of 32-bit integers; each step recurses on the left half, recurses on the right half, merges to a scratch array, then copies back to the input array.]

Lecture 2 Slide 5 EECS 570

Parallel Merge Sort (as a Parallel Directed Acyclic Graph)
[Figure: the same 16 MB merge sort drawn as a DAG; the left and right recursions run in parallel, followed by the merge to the scratch array and the copy back to the input array.]

Lecture 2 Slide 6 EECS 570

Parallel DAG for Merge Sort (2-core)
[Figure: two sequential sorts run in parallel, followed by a single merge.]

Lecture 2 Slide 7 EECS 570

Parallel DAG for Merge Sort (4-core)

Lecture 2 Slide 8 EECS 570

Parallel DAG for Merge Sort (8-core)

Lecture 2 Slide 9 EECS 570

The DAG Execution Model of a Parallel Computation
• Given an input, dynamically create a DAG
• Nodes represent sequential computation
❒ Weighted by the amount of work
• Edges represent dependencies:
❒ Node A → Node B means that B cannot be scheduled unless A is finished

Lecture 2 Slide 10 EECS 570

Sorting 16 elements in four cores

Lecture 2 Slide 11 EECS 570

Sorting 16 elements in four cores (4-element arrays sorted in constant time)
[Figure: the task DAG, with node weights of 1 for the constant-time leaf sorts, 8 for the two intermediate merges, and 16 for the final merge.]

Lecture 2 Slide 12 EECS 570

Performance Measures
• Given a graph G, a scheduler S, and P processors
• TP(S): time on P processors using scheduler S
• TP: time on P processors using the best scheduler
• T1: time on a single processor (sequential cost)
• T∞: time assuming infinite resources

Lecture 2 Slide 13 EECS 570

Work and Depth
• T1 = Work
❒ The total number of operations executed by a computation
• T∞ = Depth
❒ The longest chain of sequential dependencies (critical path) in the parallel DAG

Lecture 2 Slide 14 EECS 570

T∞ (Depth): Critical Path Length (Sequential Bottleneck)

Lecture 2 Slide 15 EECS 570

T1 (Work): Time to Run Sequentially

Lecture 2 Slide 16 EECS 570

Sorting 16 elements in four cores (4-element arrays sorted in constant time)
[Figure: the same task DAG as before, with blanks to fill in: Work = ___   Depth = ___]

Lecture 2 Slide 17 EECS 570

Some Useful Theorems

Lecture 2 Slides 18-20 EECS 570

Work Law
• "You cannot avoid work by parallelizing"
T1 / P ≤ TP
Speedup = T1 / TP
• Can speedup be more than 2 when we go from 1 core to 2 cores in practice?

Lecture 2 Slide 21 EECS 570

Depth Law
• More resources should make things faster
• You are limited by the sequential bottleneck
TP ≥ T∞

Lecture 2 Slide 22 EECS 570

Amount of Parallelism
Parallelism = T1 / T∞

Lecture 2 Slide 23 EECS 570

Maximum Speedup Possible
• "Speedup is bounded above by available parallelism"
Speedup = T1 / TP ≤ T1 / T∞ = Parallelism (follows from the Depth Law, TP ≥ T∞)

Lecture 2 Slide 24 EECS 570

Greedy Scheduler
• If more than P nodes can be scheduled, pick any subset of size P
• If fewer than P nodes can be scheduled, schedule them all
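
The two rules above can be simulated directly. Below is a minimal sketch in C (illustrative only; the unit-work DAG, names, and layout are made up, not from the course): at every time step it schedules up to P ready nodes, or all of them if fewer than P are ready.

    #include <stdio.h>

    #define N 7   /* hypothetical DAG: node 0 splits, 1-4 sort, 5-6 merge; unit work each */

    /* dep[i][j] = 1 means there is an edge j -> i (i cannot run until j is done) */
    static const int dep[N][N] = {
        [1] = {1}, [2] = {1}, [3] = {1}, [4] = {1},   /* 1-4 depend on 0 */
        [5] = {0, 1, 1},                              /* 5 depends on 1 and 2 */
        [6] = {0, 0, 0, 1, 1, 1},                     /* 6 depends on 3, 4, 5 */
    };

    /* Returns the number of time steps the greedy scheduler needs with P processors. */
    static int greedy_schedule(int P) {
        int done[N] = {0}, finished = 0, steps = 0;
        while (finished < N) {
            int ready[N], r = 0;
            for (int i = 0; i < N; i++) {             /* collect ready nodes */
                if (done[i]) continue;
                int ok = 1;
                for (int j = 0; j < N; j++)
                    if (dep[i][j] && !done[j]) ok = 0;
                if (ok) ready[r++] = i;
            }
            int run = (r < P) ? r : P;                /* the greedy rule from the slide */
            for (int k = 0; k < run; k++) { done[ready[k]] = 1; finished++; }
            steps++;
        }
        return steps;
    }

    int main(void) {
        /* For a unit-work DAG: P=1 gives T1 (= #nodes), large P gives T_inf (critical path). */
        printf("T1=%d  T2=%d  Tinf=%d\n",
               greedy_schedule(1), greedy_schedule(2), greedy_schedule(N));
        return 0;
    }

For this toy DAG it prints T1=7, T2=5, Tinf=4, consistent with the work and depth laws on the following slides.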

Lecture 2 Slide 25 EECS 570

More reading: http://www.cs.cmu.edu/afs/cs/academic/class/15492-f07/www/scribe/lec4/lecture4.pdf

Lecture 2 Slide 26 EECS 570

Performance of the Greedy Scheduler
Work law: T1 / P ≤ TP
Depth law: T∞ ≤ TP
TP(Greedy) ≤ T1 / P + T∞
(At each step a greedy scheduler either keeps all P processors busy, and there can be at most T1 / P such steps, or it runs every ready node, which shortens the remaining critical path, and there can be at most T∞ such steps.)

Lecture 2 Slide 27 EECS 570

Greedy is optimal within a factor of 2
Work law: T1 / P ≤ TP
Depth law: T∞ ≤ TP
TP ≤ TP(Greedy) ≤ 2 TP
(The second inequality follows because TP(Greedy) ≤ T1 / P + T∞ ≤ TP + TP.)

Lecture 2 Slide 28 EECS 570

Work/Depth of Merge Sort (Sequential Merge)
• Work T1: O(n log n)
• Depth T∞: O(n)
❒ It takes O(n) time to merge n elements
• Parallelism:
❒ T1 / T∞ = O(log n) → really bad!
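
To make the work and depth terms concrete, here is a minimal divide-and-conquer sketch in C with OpenMP tasks (illustrative code, not from the course; the array and function names are made up). The two recursive calls become independent DAG nodes, while the O(n) sequential merge is the chain that keeps the depth at O(n).

    #include <string.h>

    /* Sequential O(n) merge of a[lo,mid) and a[mid,hi): this is the depth bottleneck. */
    static void merge(int *a, int *tmp, int lo, int mid, int hi) {
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi)
            tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        while (i < mid) tmp[k++] = a[i++];
        while (j < hi)  tmp[k++] = a[j++];
        memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof(int));   /* copy back */
    }

    static void msort(int *a, int *tmp, int lo, int hi) {
        if (hi - lo < 2) return;
        int mid = lo + (hi - lo) / 2;
        #pragma omp task shared(a, tmp)      /* Recurse(left) and Recurse(right) are */
        msort(a, tmp, lo, mid);              /* independent nodes in the DAG         */
        msort(a, tmp, mid, hi);
        #pragma omp taskwait                 /* edge: the merge depends on both halves */
        merge(a, tmp, lo, mid, hi);
    }

    /* usage sketch: call once from inside a parallel region, e.g.
     *   #pragma omp parallel
     *   #pragma omp single
     *   msort(a, tmp, 0, n);
     */

With the sequential merge() above the work is O(n log n) and the depth O(n), matching the slide; replacing merge() with a parallel merge is what reduces the depth to polylogarithmic.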

Lecture 2 Slide 29 EECS 570

Main Message
• Analyze the Work and Depth of your algorithm
• Parallelism is Work / Depth
• Try to decrease Depth
❒ the critical path
❒ a sequential bottleneck
• If you increase Depth
❒ you had better increase Work by a lot more!

Lecture 2 Slide 30 EECS 570

Amdahl's Law
• Sorting takes 70% of the execution time of a sequential program
• You replace the sorting algorithm with one that scales perfectly on multi-core hardware
• How many cores do you need to get a 4x speedup on the program?

Lecture 2 Slide 31 EECS 570

Amdahl's Law, f = 70%
f = the parallel portion of execution
1 - f = the sequential portion of execution
c = number of cores used

Speedup(f, c) = 1 / ((1 - f) + f / c)
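
As a quick check on the 4x question from the previous slide, here is a small illustrative C program (not course code) that evaluates the formula above for f = 0.70; the limit as c grows is 1 / (1 - f) ≈ 3.33, so no core count reaches 4x.

    #include <stdio.h>

    /* Amdahl's law as on the slide: f = parallel fraction, c = number of cores. */
    static double speedup(double f, double c) {
        return 1.0 / ((1.0 - f) + f / c);
    }

    int main(void) {
        double f = 0.70;
        for (int c = 1; c <= 16; c *= 2)
            printf("f=%.2f  c=%2d  speedup=%.2f\n", f, c, speedup(f, c));
        printf("limit as c->infinity: %.2f\n", 1.0 / (1.0 - f));   /* about 3.33x */
        return 0;
    }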

Lecture 2 Slides 32-33 EECS 570

Amdahl's Law, f = 70%
[Chart: speedup vs. number of cores (1 to 16). The curve "speedup achieved (perfect scaling on 70%)" rises but stays below the "desired 4x speedup" line.]
Limit as c → ∞ = 1 / (1 - f) = 3.33

Lecture 2 Slide 34 EECS 570

Amdahl's Law, f = 10%
[Chart: speedup vs. number of cores (1 to 16). Even with perfect scaling of the 10% parallel portion, the speedup barely exceeds 1x; the Amdahl's law limit is just 1.11x.]

Lecture 2 Slide 35 EECS 570

Amdahl's Law, f = 98%
[Chart: speedup vs. number of cores (1 to about 128). With perfect scaling of the 98% parallel portion, the speedup climbs toward the Amdahl's law limit of 1 / (1 - f) = 50x.]

Lecture 2 Slide 36 EECS 570

Lesson
• Speedup is limited by sequential code
• Even a small percentage of sequential code can greatly limit potential speedup

Lecture 2 Slide 37 EECS 570

21st Century Computer Architecture
A CCC community white paper
http://cra.org/ccc/docs/init/21stcenturyarchitecturewhitepaper.pdf
Slides from M. Hill, HPCA 2014 keynote

Lecture 2 Slide 38 EECS 570

20th Century ICT Set Up
• Information & Communication Technology (ICT) has changed our world
❒ <long list omitted>
• Required innovations in algorithms, applications, programming languages, …, & system software
• Key (invisible) enablers of (cost-)performance gains
❒ Semiconductor technology ("Moore's Law")
❒ Computer architecture (~80x per Danowitz et al.)

Lecture 2 Slide 39 EECS 570

Enablers: Technology + Architecture
[Figure: Danowitz et al., CACM 04/2012, Figure 1, separating the gains due to technology from those due to architecture.]

Lecture 2 Slide 40 EECS 570

21st Century ICT Promises More
• Data-centric personalized health care
• Computation-driven scientific discovery
• Human network analysis
• Much more: known & unknown

Lecture 2 Slide 41 EECS 570

21st Century App Characteristics
• BIG DATA
• ALWAYS ONLINE
• SECURE/PRIVATE
Whither the enablers of future (cost-)performance gains?

Lecture 2 Slide 42 EECS 570

Technology's Challenges 1/2 (Late 20th Century → The New Reality)
• Moore's Law, 2x transistors/chip → transistor count still 2x, BUT…
• Dennard Scaling, ~constant power/chip → gone; can't repeatedly double power/chip

Lecture 2 Slide 43 EECS 570

Technology's Challenges 2/2 (Late 20th Century → The New Reality)
• Moore's Law, 2x transistors/chip → transistor count still 2x, BUT…
• Dennard Scaling, ~constant power/chip → gone; can't repeatedly double power/chip
• Modest (hidden) transistor unreliability → increasing transistor unreliability can't be hidden
• Focus on computation over communication → communication (energy) more expensive than computation
• One-time costs amortized via mass market → one-time cost much worse & want specialized platforms
How should architects step up as technology falters?

Lecture 2 Slide 44 EECS 570

"Timeline" from DARPA ISAT
[Figure: system capability (log scale) plotted against decades from the 80s to the 50s, with a "Fallow Period" marked ahead.]
Source: Advancing Computer Systems without Technology Progress, ISAT Outbrief (http://www.cs.wisc.edu/~markhill/papers/isat2012_ACSWTP.pdf), Mark D. Hill and Christos Kozyrakis, DARPA/ISAT Workshop, March 26-27, 2012. Approved for Public Release, Distribution Unlimited. The views expressed are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government.

Lecture 2 Slides 45-49 EECS 570

21st Century Computer Architecture (20th Century → 21st Century)
• Single-chip in stand-alone computer → Architecture as infrastructure: spanning sensors to clouds; performance + security, privacy, availability, programmability, …
• Cross-cutting: break current layers with new interfaces
• Performance via invisible instruction-level parallelism → Energy first: parallelism, specialization, cross-layer design
• Predictable technologies (CMOS, DRAM, & disks) → New technologies (non-volatile memory, near-threshold, 3D, photonics, …); rethink memory & storage, reliability, communication

Lecture 2 Slide 50 EECS 570

Cost-Effective Computing [Wood & Hill, IEEE Computer 1995]
• Premise: isn't Speedup(P) < P inefficient?
❒ If only throughput matters, use P computers instead…
• Key observation: much of a computer's cost is NOT the CPU
• Let Costup(P) = Cost(P) / Cost(1). Parallel computing is cost-effective if:
❒ Speedup(P) > Costup(P)
• E.g., for an SGI PowerChallenge with 500 MB of memory:
❒ Costup(32) = 8.6, so a 32-processor run is cost-effective whenever its speedup exceeds 8.6, not 32

Lecture 2 Slide 51 EECS 570

Parallel Programming Models and Interfaces

Lecture 2 Slide 52 EECS 570

Programming Models
• High-level paradigm for expressing an algorithm
❒ Examples:
❍ Functional
❍ Sequential, procedural
❍ Shared memory
❍ Message passing
• Embodied in languages that support concurrent execution
❒ Incorporated into language constructs
❒ Incorporated as libraries added to an existing sequential language
• Top-level features (for conventional models: shared memory, message passing):
❒ Multiple threads are conceptually visible to the programmer
❒ Communication/synchronization is visible to the programmer

Lecture 2 Slide 53 EECS 570

An Incomplete Taxonomy
MP systems:
• Shared memory
❒ Bus-based
❒ Switching-fabric-based (incoherent or coherent)
• Message passing
❒ COTS-based
❒ Switching-fabric-based
• Emerging & lunatic fringe
❒ VLIW
❒ SIMD
❒ Vector
❒ Data flow
❒ GPU
❒ MapReduce / WSC
❒ Systolic array
❒ Reconfigurable / FPGA
❒ …

Lecture 2 Slide 54 EECS 570

Programming Model Elements
• For both shared memory and message passing
• Processes and threads
❒ Process: a shared address space and one or more threads of control
❒ Thread: a program sequencer and private address space
❒ Task: less formal term, part of an overall job
❒ Created, terminated, scheduled, etc.
• Communication
❒ Passing of data
• Synchronization
❒ Communicating control information
❒ To assure reliable, deterministic communication

Lecture 2 Slide 55 EECS 570

Historical View
[Figure: three machine organizations, each built from processor (P), memory (M), and I/O (IO) nodes, differing in where the nodes are joined.]
• Join at I/O (network) → program with: message passing
• Join at memory → program with: shared memory
• Join at processor → program with: dataflow, SIMD, VLIW, CUDA, other data parallel

Lecture 2 Slide 56 EECS 570

Message Passing Programming Model

Lecture 2 Slide 57 EECS 570

Message Passing Programming Model
• User-level send/receive abstraction
❒ Match via local buffer (x, y), process (Q, P), and tag (t)
❒ Need naming/synchronization conventions
[Figure: process P's local address space holds address x; process Q's local address space holds address y; "send x, Q, t" in P matches "recv y, P, t" in Q.]

Lecture 2 Slide 58 EECS 570

Message Passing Architectures
• Cannot directly access the memory of another node
• IBM SP-2, Intel Paragon, Myrinet, Quadrics QSW
• Clusters of workstations (e.g., MPI on the Flux cluster)
[Figure: four nodes, each a CPU with cache ($), memory, and a NIC, connected by an interconnect.]

Lecture 2 Slide 59 EECS 570

MPI: Message Passing Interface API
• A widely used standard
❒ For a variety of distributed-memory systems
• SMP clusters, workstation clusters, MPPs, heterogeneous systems
• Also works on shared-memory MPs
❒ It is easy to emulate distributed memory on shared-memory HW
• Can be used with a number of high-level languages
• Available in the Flux cluster at Michigan
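
As a concrete example of the send/receive matching sketched a few slides back, here is a minimal MPI program in C (illustrative only, not course-provided code): rank 0 sends one integer to rank 1, and the receive matches it by communicator, source, and tag.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, x = 42, y = 0, tag = 7;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* send x to process 1 with tag t */
            MPI_Send(&x, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* receive into y from process 0 with the same tag */
            MPI_Recv(&y, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", y);
        }
        MPI_Finalize();
        return 0;
    }

Typically compiled with mpicc and launched with something like mpirun -np 2 ./a.out; the exact commands depend on the cluster.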

Lecture 2 Slide 60 EECS 570

Processes and Threads
• Lots of flexibility (an advantage of message passing):
1. Multiple threads sharing an address space
2. Multiple processes sharing an address space
3. Multiple processes with different address spaces
• and different OSes
• 1 and 2 are easily implemented on shared-memory HW (with a single OS)
❒ Process and thread creation/management is similar to shared memory
• 3 is probably more common in practice
❒ Process creation is often external to the execution environment, e.g., a shell script
❒ It is hard for a user process on one system to create a process on another OS

Lecture 2 Slide 61 EECS 570

Communication and Synchronization
• Combined in the message passing paradigm
❒ Synchronization of messages is part of the communication semantics
• Point-to-point communication
❒ From one process to another
• Collective communication
❒ Involves groups of processes
❒ e.g., broadcast

Lecture 2 Slide 62 EECS 570

Message Passing: Send()
• Send(<what>, <where-to>, <how>)
• What:
❒ A data structure or object in user space
❒ A buffer allocated from special memory
❒ A word or signal
• Where-to:
❒ A specific processor
❒ A set of specific processors
❒ A queue, dispatcher, scheduler
• How:
❒ Asynchronously vs. synchronously
❒ Typed
❒ In-order vs. out-of-order
❒ Prioritized

Lecture 2 Slide 63 EECS 570

Message Passing: Receive()
• Receive(<data>, <info>, <what>, <how>)
• Data: mechanism to return message content
❒ A buffer allocated in the user process
❒ Memory allocated elsewhere
• Info: meta-info about the message
❒ Sender ID
❒ Type, size, priority
❒ Flow control information
• What: receive only certain messages
❒ Sender ID, type, priority
• How:
❒ Blocking vs. non-blocking

Lecture 2 Slide 64 EECS 570

Synchronous vs. Asynchronous
• Synchronous Send
❒ Stall until the message has actually been received
❒ Implies a message acknowledgement from receiver to sender
• Synchronous Receive
❒ Stall until the message has actually been received
• Asynchronous Send and Receive
❒ Sender and receiver can proceed regardless
❒ Returns a request handle that can be tested for message receipt
❒ The request handle can be tested to see if the message has been sent/received
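
In MPI terms, the asynchronous case looks roughly like the sketch below (illustrative, assuming exactly two ranks and caller-provided buffers): MPI_Isend and MPI_Irecv return immediately with request handles, and the program blocks only when the data is actually needed.

    #include <mpi.h>

    /* Asynchronous exchange of one int with the other rank (assumes 2 ranks total). */
    void exchange(int rank, int *sendbuf, int *recvbuf) {
        MPI_Request reqs[2];
        int peer = 1 - rank;
        MPI_Isend(sendbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(recvbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[1]);
        /* ... overlap useful computation here ... */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* wait on the request handles */
    }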

Lecture 2 Slide 65 EECS 570

Deadlock
• Blocking communications may deadlock
• Requires careful (safe) ordering of sends/receives

Safe ordering:
    <Process 0>                      <Process 1>
    Send(Process1, Message);         Receive(Process0, Message);
    Receive(Process1, Message);      Send(Process0, Message);

May deadlock (both processes block in Send if sends are synchronous or unbuffered):
    <Process 0>                      <Process 1>
    Send(Process1, Message);         Send(Process0, Message);
    Receive(Process1, Message);      Receive(Process0, Message);
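
One common way out of the second pattern, shown as a sketch below (assuming exactly two ranks; not the course's prescribed fix), is to let the library pair the send and the receive so neither process has to complete its send before the other posts its receive.

    #include <mpi.h>

    /* Deadlock-free exchange: the library handles the send/receive pairing. */
    void safe_exchange(int rank, int *out, int *in) {
        int peer = 1 - rank;
        MPI_Sendrecv(out, 1, MPI_INT, peer, 0,
                     in,  1, MPI_INT, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

Non-blocking sends and receives (previous slide) are the other standard fix.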

Lecture 2 Slide 66 EECS 570

Message Passing Paradigm Summary
Programming model (software) point of view:
• Disjoint, separate name spaces
• "Shared nothing"
• Communication via explicit, typed messages: send & receive

Lecture 2 Slide 67 EECS 570

Message Passing Paradigm Summary
Computer engineering (hardware) point of view:
• Treat inter-process communication as an I/O device
• Critical issues:
❒ How to optimize API overhead
❒ Minimize communication latency
❒ Buffer management: how to deal with early/unsolicited messages, message typing, high-level flow control
❒ Event signaling & synchronization
❒ Library support for common functions (barrier synchronization, task distribution, scatter/gather, data structure maintenance)

Lecture 2 Slide 68 EECS 570

Shared Memory Programming Model

Lecture 2 Slide 69 EECS 570

Shared-Memory Model
[Figure: processors P1-P4 all connected to a single memory system.]
❒ Multiple execution contexts sharing a single address space
❍ Multiple programs (MIMD)
❍ Or, more frequently: multiple copies of one program (SPMD)
❒ Implicit (automatic) communication via loads and stores
❒ Theoretical foundation: PRAM model
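
A minimal pthreads sketch of this model (illustrative, not course code): several threads run the same worker function, SPMD style, and communicate implicitly through an ordinary shared array instead of explicit messages.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    static int partial[NTHREADS];            /* shared: written with ordinary stores */

    static void *worker(void *arg) {
        long id = (long)arg;
        partial[id] = (int)id * 10;          /* implicit communication via shared memory */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);        /* join orders the writes before the reads below */
        for (int i = 0; i < NTHREADS; i++)
            printf("partial[%d] = %d\n", i, partial[i]);
        return 0;
    }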

Lecture 2 Slide 70 EECS 570

Global Shared Physical Address Space
• Communication, sharing, and synchronization via loads/stores to shared variables
• Facilities for address translation between local/global address spaces
• Requires OS support to maintain this mapping
[Figure: the private portions of each process's address space (P0 … Pn) and a shared portion all map into a common physical address space; a store by P0 to a shared location is seen by a load from Pn.]

Lecture 2 Slide 71 EECS 570

Why Shared Memory?
Pluses:
❒ To applications it looks like a multitasking uniprocessor
❒ For the OS, only evolutionary extensions are required
❒ Easy to do communication without the OS
❒ Software can worry about correctness first, then performance
Minuses:
❒ Proper synchronization is complex
❒ Communication is implicit, so it is harder to optimize
❒ Hardware designers must implement it
Result:
❒ Traditionally bus-based symmetric multiprocessors (SMPs), and now CMPs, are the most successful parallel machines ever
❒ And the first with multi-billion-dollar markets

Lecture 2 Slide 72 EECS 570

Thread-Level Parallelism
• Thread-level parallelism (TLP)
❒ Collection of asynchronous tasks: not started and stopped together
❒ Data shared loosely, dynamically
• Example: database/web server (each query is a thread)
❒ accts is shared, so it can't be register allocated even if it were scalar
❒ id and amt are private variables, register allocated to r1, r2

    struct acct_t { int bal; };
    shared struct acct_t accts[MAX_ACCT];
    int id, amt;
    if (accts[id].bal >= amt) {
      accts[id].bal -= amt;
      spew_cash();
    }

    0: addi r1,accts,r3
    1: ld 0(r3),r4
    2: blt r4,r2,6
    3: sub r4,r2,r4
    4: st r4,0(r3)
    5: call spew_cash

Lecture 2 Slide 73 EECS 570

Synchronization
• Mutual exclusion: locks, …
• Order: barriers, signal-wait, …
• Implemented using read/write/modify of a shared location
❒ Language level:
❍ libraries (e.g., locks in pthreads)
❍ Programmers can write custom synchronizations
❒ Hardware ISA:
❍ e.g., test-and-set
• The OS provides support for managing threads
❒ scheduling, fork, join, futex signal/wait
We'll cover synchronization in more detail in a few weeks.
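
For the account example on the previous slides, mutual exclusion with pthreads locks might look like the following minimal sketch (illustrative, not from the course; spew_cash is the slide's placeholder and is left commented out). The lock makes the read-test-update of accts[id].bal atomic with respect to other threads.

    #include <pthread.h>

    #define MAX_ACCT 1024
    struct acct_t { int bal; };
    static struct acct_t accts[MAX_ACCT];                 /* shared data */
    static pthread_mutex_t acct_lock[MAX_ACCT];           /* one lock per account */

    void init_locks(void) {
        for (int i = 0; i < MAX_ACCT; i++)
            pthread_mutex_init(&acct_lock[i], NULL);
    }

    void withdraw(int id, int amt) {
        pthread_mutex_lock(&acct_lock[id]);               /* enter critical section */
        if (accts[id].bal >= amt) {
            accts[id].bal -= amt;
            /* spew_cash(); */
        }
        pthread_mutex_unlock(&acct_lock[id]);             /* leave critical section */
    }

A real server might embed the lock in acct_t instead; a single coarse lock also works but serializes all accounts.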

Lecture 2 Slide 74 EECS 570

Paired vs. Separate Processor/Memory?
• Separate processor/memory
❒ Uniform memory access (UMA): equal latency to all memory
+ Simple software; it doesn't matter where you put data
– Lower peak performance
❒ Bus-based UMAs are common: symmetric multiprocessors (SMPs)
• Paired processor/memory
❒ Non-uniform memory access (NUMA): faster to local memory
– More complex software: where you put data matters
+ Higher peak performance, assuming proper data placement
[Figure: left, CPUs (with caches) and memories attached separately to a shared bus; right, CPU+memory pairs connected through routers (R).]

Lecture 2 Slide 75 EECS 570

Shared vs. Point-to-Point Networks
• Shared network: e.g., bus (left)
+ Low latency
– Low bandwidth: doesn't scale beyond ~16 processors
+ The shared property simplifies cache coherence protocols (later)
• Point-to-point network: e.g., mesh or ring (right)
– Longer latency: may need multiple "hops" to communicate
+ Higher bandwidth: scales to 1000s of processors
– Cache coherence protocols are complex
[Figure: left, CPU+memory nodes on a shared bus; right, CPU+memory nodes connected point-to-point through routers (R).]

Lecture 2 Slide 76 EECS 570

Implementation #1: Snooping Bus MP
• Two basic implementations
• Bus-based systems
❒ Typically small: 2-8 (maybe 16) processors
❒ Typically processors split from memories (UMA)
❍ Sometimes multiple processors on a single chip (CMP)
❍ Symmetric multiprocessors (SMPs)
❍ Common; I use one every day
[Figure: four CPUs (with caches) and four memories attached to a shared bus.]

Lecture 2 Slide 77 EECS 570

Implementation #2: Scalable MP
• General point-to-point network-based systems
❒ Typically processor/memory/router blocks (NUMA)
❍ Glueless MP: no need for additional "glue" chips
❒ Can be arbitrarily large: 1000's of processors
❍ Massively parallel processors (MPPs)
❒ In reality only the government (DoD) has MPPs…
❍ Companies have much smaller systems: 32-64 processors
❍ Scalable multi-processors
[Figure: CPU+memory nodes connected point-to-point through routers (R).]

Lecture 2 Slide 78 EECS 570

Cache Coherence
• Two $100 withdrawals from account #241 at two ATMs
❒ Each transaction maps to a thread on a different processor
❒ Track accts[241].bal (its address is in r3)

Processor 0 and Processor 1 each run:
    0: addi r1,accts,r3
    1: ld 0(r3),r4
    2: blt r4,r2,6
    3: sub r4,r2,r4
    4: st r4,0(r3)
    5: call spew_cash
[Figure: CPU0 and CPU1 connected to a shared memory.]

Lecture 2 Slide 79 EECS 570

No-Cache, No-Problem
• Scenario I: processors have no caches
❒ No problem
Both processors run the code from the previous slide. The memory value of accts[241].bal goes 500 → 400 → 300: P0 loads 500 and stores 400, then P1 loads 400 and stores 300, so both withdrawals are reflected correctly.

Lecture 2 Slide 80 EECS 570

Cache Incoherence
• Scenario II: processors have write-back caches
❒ Potentially 3 copies of accts[241].bal: memory, P0$, P1$
❒ They can become incoherent (inconsistent)
Both processors run the code from before. The state (P0 cache | memory | P1 cache) evolves as:
    -     | 500 | -
    V:500 | 500 | -        (P0 loads)
    D:400 | 500 | -        (P0 stores)
    D:400 | 500 | V:500    (P1 loads the stale 500 from memory)
    D:400 | 500 | D:400    (P1 stores; one withdrawal is lost)

Lecture 2 Slide 81 EECS 570

Snooping Cache-Coherence Protocols
The bus provides a serialization point.
Each cache controller "snoops" all bus transactions
❒ and takes action to ensure coherence
❍ invalidate
❍ update
❍ supply value
❒ The action depends on the state of the block and the protocol.

Lecture 2 Slide 82 EECS 570

Scalable Cache Coherence
• Scalable cache coherence: a two-part solution
• Part I: bus bandwidth
❒ Replace the non-scalable bandwidth substrate (bus)…
❒ …with a scalable-bandwidth one (point-to-point network, e.g., mesh)
• Part II: processor snooping bandwidth
❒ Interesting: most snoops result in no action
❒ Replace the non-scalable broadcast protocol (spam everyone)…
❒ …with a scalable directory protocol (only spam processors that care)
• We will cover this in Unit 3

Lecture 2 Slide 83 EECS 570

Shared Memory Summary
• Shared-memory multiprocessors
+ Simple software: easy data sharing, handles both DLP and TLP
– Complex hardware: must provide the illusion of a global address space
• Two basic implementations
❒ Symmetric (UMA) multi-processors (SMPs)
❍ Underlying communication network: bus (ordered)
+ Low latency, simple protocols that rely on global order
– Low bandwidth, poor scalability
❒ Scalable (NUMA) multi-processors (MPPs)
❍ Underlying communication network: point-to-point (unordered)
+ Scalable bandwidth
– Higher latency, complex protocols
