Lecture 02: Parallel Architecture


Lecture 02: Parallel Architecture
ILP: Instruction Level Parallelism, TLP: Thread Level Parallelism, and DLP: Data Level Parallelism

CSCE 790: Parallel Programming Models for Multicore and Manycore Processors
Department of Computer Science and Engineering
Yonghong Yan, yanyh@cse.sc.edu
http://cse.sc.edu/~yanyh

Flynn's Taxonomy of Parallel Architectures
https://en.wikipedia.org/wiki/Flynn%27s_taxonomy

SISD: Single Instruction Single Data
•  At any one time, one instruction operates on one data item
•  Based on the traditional Von Neumann uniprocessor architecture
   –  Instructions are executed sequentially or serially, one step after the next
•  Until recently, most computers were of the SISD type

SIMD: Single Instruction Multiple Data
•  Also known as array processors from early on
•  A single instruction stream is broadcast to multiple processors, each having its own data stream
   –  Still used in some graphics cards today
[Figure: a control unit broadcasts one instruction stream to multiple processors, each operating on its own data]

MIMD: Multiple Instructions Multiple Data
•  Each processor has its own instruction stream and input data
•  Very general case
   –  Every other scenario can be mapped to MIMD
•  Further breakdown of MIMD is usually based on the memory organization
   –  Shared memory systems
   –  Distributed memory systems

Parallelism in Hardware Architecture
•  SISD: inherently sequential
   –  Instruction Level Parallelism: overlapping execution of instructions through pipelining, since we can split an instruction's execution into multiple stages
   –  Out-of-order execution
   –  Speculation
   –  Superscalar
•  SIMD: inherently parallel, with constraints
   –  Data Level Parallelism: one instruction stream, multiple data
•  MIMD: inherently parallel
   –  Thread Level Parallelism: multiple instruction streams executing independently

Abstraction: Levels of Representation/Interpretation

High Level Language Program (e.g., C):
    temp = v[k]; v[k] = v[k+1]; v[k+1] = temp;
  ↓ Compiler
Assembly Language Program (e.g., MIPS):
    lw $t0, 0($2)
    lw $t1, 4($2)
    sw $t1, 0($2)
    sw $t0, 4($2)
  ↓ Assembler
Machine Language Program (MIPS):
    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111
  ↓ Machine Interpretation
Hardware Architecture Description (e.g., block diagrams)
  ↓ Architecture Implementation
Logic Circuit Description (circuit schematic diagrams)

Anything can be represented as a number, i.e., data or instructions.

Instruction Level Parallelism
•  Instruction execution can be divided into multiple stages (5 stages in RISC):
   –  Instruction fetch cycle (IF): send the PC to memory, fetch the current instruction from memory, and update the PC to the next sequential PC by adding 4 to the PC.
   –  Instruction decode/register fetch cycle (ID): decode the instruction and read the registers corresponding to the register source specifiers from the register file.
   –  Execution/effective address cycle (EX): perform the memory address calculation for loads/stores, or the ALU operation for register-register and register-immediate ALU instructions.
   –  Memory access (MEM): perform the memory access for load/store instructions.
   –  Write-back cycle (WB): write results back to the destination operands for register-register ALU or load instructions.

Pipelined Instruction Execution
[Figure: four instructions flowing through the pipeline over clock cycles 1-7; in each cycle every in-flight instruction occupies a different stage (Ifetch, Reg, ALU, DMem, Reg)]

Pipelining: It's Natural!
•  Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
   –  Washer takes 30 minutes
   –  Dryer takes 40 minutes
   –  "Folder" takes 20 minutes
•  One load: 90 minutes

Sequential Laundry
•  Sequential laundry takes 6 hours for 4 loads
•  If they learned pipelining, how long would laundry take?
[Figure: timeline from 6 PM to midnight; loads A-D run back to back, each occupying 30 + 40 + 20 minutes]

Pipelined Laundry: Start Work ASAP
•  Pipelined laundry takes 3.5 hours for 4 loads (vs. 6 hours sequentially)
[Figure: timeline from 6 PM to 9:30 PM; stages overlap, so after the first wash the 40-minute dryer stage runs continuously: 30 + 40 + 40 + 40 + 40 + 20 minutes]
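The arithmetic behind the two timings, using the stage times given above:

    Sequential: 4 loads x (30 + 40 + 20) min = 4 x 90 min = 360 min = 6 hours
    Pipelined:  30 + 4 x 40 + 20 = 210 min = 3.5 hours

The 40-minute dryer is the slowest stage, so once the pipeline is full a new load finishes every 40 minutes.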

Classic 5-Stage Pipeline for a RISC
•  Each cycle the hardware initiates a new instruction and executes some part of five different instructions
   –  One cycle per instruction vs. 5 cycles per instruction

                       Clock number
    Instruction        1    2    3    4    5    6    7    8    9
    Instruction i      IF   ID   EX   MEM  WB
    Instruction i+1         IF   ID   EX   MEM  WB
    Instruction i+2              IF   ID   EX   MEM  WB
    Instruction i+3                   IF   ID   EX   MEM  WB
    Instruction i+4                        IF   ID   EX   MEM  WB
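The general form of the gain shown in the table: for a k-stage pipeline executing n instructions with no stalls,

    Unpipelined time: k x n cycles
    Pipelined time:   k + (n - 1) cycles
    Speedup:          k x n / (k + n - 1), which approaches k as n grows

For the table above (k = 5, n = 5): 9 cycles pipelined instead of 25 cycles unpipelined.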

Pipeline and Superscalar

Advanced ILP
•  Dynamic scheduling → out-of-order execution
•  Speculation → in-order commit
•  Superscalar → multiple issue

    Technique           | Goal                   | Implementation                                    | Addresses                                           | Approach
    Dynamic scheduling  | Out-of-order execution | Reservation stations, load/store buffer, and CDB  | Data hazards (RAW, WAW, WAR)                        | Register renaming
    Speculation         | In-order commit        | Branch prediction (BHT/BTB) and reorder buffer    | Control hazards (branch, function call, exception)  | Prediction and misprediction recovery
    Superscalar/VLIW    | Multiple issue         | Software and hardware                             | Issuing more than one instruction per cycle (CPI < 1) | By compiler or hardware

Problems of Traditional ILP Scaling
•  Fundamental circuit limitations [1]
   –  Delays increase as issue queues and multi-ported register files grow
   –  Increasing delays limit the performance returns from wider issue
•  Limited amount of instruction-level parallelism [1]
   –  Inefficient for code with difficult-to-predict branches
•  Power and heat stall clock frequencies

[1] K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, "The Case for a Single-Chip Multiprocessor," ASPLOS-VII, 1996.

ILP Impacts

Simulations of an 8-issue Superscalar

Power/Heat Density Limits Frequency
•  Some fundamental physical limits are being reached

We will have this…

Revolution Is Happening Now
•  Chip density is continuing to increase ~2x every 2 years
   –  Clock speed is not
   –  The number of processor cores may double instead
•  There is little or no hidden parallelism (ILP) left to be found
•  Parallelism must be exposed to and managed by software
   –  No free lunch

Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

Current Trends in Architecture
•  Cannot continue to leverage Instruction-Level Parallelism (ILP)
   –  Single-processor performance improvement ended in 2003
•  Recent models for performance:
   –  Exploit Data-Level Parallelism (DLP) via SIMD architectures and GPUs
   –  Exploit Thread-Level Parallelism (TLP) via MIMD
   –  Others

SIMD: Single Instruction, Multiple Data (Data Level Parallelism)
•  SIMD architectures can exploit significant data-level parallelism for:
   –  Matrix-oriented scientific computing
   –  Media-oriented image and sound processing
•  SIMD is more energy efficient than MIMD
   –  Only needs to fetch one instruction per data operation that processes multiple data elements
   –  Makes SIMD attractive for personal mobile devices
•  SIMD allows the programmer to continue to think sequentially
[Figure: a control unit broadcasts one instruction stream to multiple processors, each operating on its own data]

SIMD Parallelism
•  Three variations
   –  Vector architectures (the early age)
   –  SIMD extensions
   –  Graphics Processing Units (GPUs) (dedicated weeks for GPUs later)
•  For x86 processors:
   –  Expect two additional cores per chip per year (MIMD)
   –  SIMD width to double every four years
   –  Potential speedup from SIMD to be twice that from MIMD!

Vector Architectures
•  Vector processors abstract operations on whole vectors, e.g., replace the loop

    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }

   by a single vector statement/instruction:

    a = b + c;              (high-level notation)
    ADDV.D V10, V8, V6      (vector instruction)

•  Some languages offer high-level support for these operations (e.g., Fortran 90 or newer)

Vector Programming Model
[Figure: scalar registers r0-r15 alongside vector registers v0-v15, each holding elements [0] … [VLRMAX-1], plus a Vector Length Register (VLR); a vector arithmetic instruction such as ADDV v3, v1, v2 adds elements [0] … [VLR-1] of v1 and v2; a vector load/store such as LV v1, (r1, r2) moves a vector between memory and a vector register using base address r1 and stride r2]
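To see what the vector length register buys you, here is a scalar C sketch of strip mining, i.e., how a vector machine with a maximum vector length would cover a loop of arbitrary length. MVL and the function name vadd are made up for the illustration; a real machine sets VLR in hardware rather than in a C variable.

    /* Strip-mining sketch: MVL is an assumed hardware maximum vector length. */
    #define MVL 64

    void vadd(double *a, const double *b, const double *c, int n) {
        for (int i = 0; i < n; i += MVL) {
            int vl = (n - i < MVL) ? (n - i) : MVL;  /* value loaded into VLR */
            for (int j = 0; j < vl; j++)             /* the work of one vector ADDV */
                a[i + j] = b[i + j] + c[i + j];
        }
    }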

Vector Processors Were the Supercomputers
•  Epitome: Cray-1, 1976
•  Scalar unit
   –  Load/store architecture
•  Vector extension
   –  Vector registers
   –  Vector instructions
•  Implementation
   –  Hardwired control
   –  Highly pipelined functional units
   –  Interleaved memory system
   –  No data caches
   –  No virtual memory

AXPY (64 elements) (Y = a*X + Y) in MIPS and VMIPS

    for (i = 0; i < 64; i++)
        Y[i] = a * X[i] + Y[i];

   (the starting addresses of X and Y are in Rx and Ry, respectively)

•  Number of instructions: 6 (VMIPS) vs. ~600 (MIPS)
•  Pipeline stalls: 64x higher for MIPS
•  Vector chaining (forwarding) through V1, V2, V3 and V4

SIMD Instructions
•  Originally developed for multimedia applications
•  The same operation is executed on multiple data items
•  Uses a fixed-length register and partitions the carry chain so the same functional unit can be used for multiple operations
   –  E.g., a 64-bit adder can be used for two 32-bit add operations simultaneously
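A software analogy of that carry-chain partitioning (a sketch of the idea, not how the adder hardware itself is built): the SWAR trick below performs two independent 32-bit additions inside one 64-bit integer by masking off the bit position where a carry would otherwise cross from the low lane into the high lane.

    #include <stdint.h>

    /* Two independent 32-bit adds packed in one 64-bit word (illustrative only). */
    uint64_t packed_add32(uint64_t x, uint64_t y) {
        const uint64_t H = 0x8000000080000000ULL;   /* top bit of each 32-bit lane */
        uint64_t sum = (x & ~H) + (y & ~H);         /* add without cross-lane carry */
        return sum ^ ((x ^ y) & H);                 /* restore each lane's top bit */
    }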

SIMD Instructions (cont.)
•  MMX (MultiMedia Extensions) - 1996
   –  The existing 64-bit floating-point registers could be used for eight 8-bit operations or four 16-bit operations
•  SSE (Streaming SIMD Extensions) - 1999
   –  Successor to the MMX instructions
   –  Separate 128-bit registers added for sixteen 8-bit, eight 16-bit, or four 32-bit operations
•  SSE2 - 2001, SSE3 - 2004, SSE4 - 2007
   –  Added support for double-precision operations
•  AVX (Advanced Vector Extensions) - 2010
   –  256-bit registers added

AXPY with SIMD Extensions
•  256-bit SIMD extensions: 4 double-precision FP elements per register
•  MIPS: 578 instructions
•  SIMD MIPS: 149 instructions (about a 4x reduction)
•  VMIPS: 6 instructions (about a 100x reduction)

    for (i = 0; i < 64; i++)
        Y[i] = a * X[i] + Y[i];
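For comparison with the instruction counts above, this is roughly what the same AXPY loop looks like written with 256-bit AVX intrinsics in C. A sketch only: it assumes an AVX-capable x86 compiler and a length that is a multiple of 4, and the function name daxpy_avx is made up.

    #include <immintrin.h>

    void daxpy_avx(int n, double a, const double *x, double *y) {
        __m256d va = _mm256_set1_pd(a);                    /* broadcast the scalar a */
        for (int i = 0; i < n; i += 4) {                   /* 4 doubles per 256-bit register */
            __m256d vx = _mm256_loadu_pd(&x[i]);           /* load 4 elements of X */
            __m256d vy = _mm256_loadu_pd(&y[i]);           /* load 4 elements of Y */
            vy = _mm256_add_pd(_mm256_mul_pd(va, vx), vy); /* a*X[i..i+3] + Y[i..i+3] */
            _mm256_storeu_pd(&y[i], vy);                   /* store the result */
        }
    }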

State of the Art: Intel Xeon Phi Manycore Vector Capability
•  Intel Xeon Phi Knights Corner, 2012, ~60 cores, 4-way SMT
•  Intel Xeon Phi Knights Landing, 2016, ~60 cores, 4-way SMT and HBM
   –  http://www.hotchips.org/wp-content/uploads/hc_archives/hc27/HC27.25-Tuesday-Epub/HC27.25.70-Processors-Epub/HC27.25.710-Knights-Landing-Sodani-Intel.pdf
   –  http://primeurmagazine.com/repository/PrimeurMagazine-AE-PR-12-14-32.pdf

State of the Art: ARM Scalable Vector Extension (SVE)
•  Announced in August 2016
   –  https://community.arm.com/groups/processors/blog/2016/08/22/technology-update-the-scalable-vector-extension-sve-for-the-armv8-a-architecture
   –  http://www.hotchips.org/wp-content/uploads/hc_archives/hc28/HC28.22-Monday-Epub/HC28.22.10-GPU-HPC-Epub/HC28.22.131-ARMv8-vector-Stephens-Yoshida-ARM-v8-23_51-v11.pdf
•  Goes beyond the vector architectures we covered
   –  Vector loops, predication and speculation
   –  Vector Length Agnostic (VLA) programming
   –  Check the slides

Limitations of Optimizing a Single Instruction Stream
•  Problem: within a single instruction stream we do not find enough independent instructions to execute simultaneously, due to
   –  Data dependencies
   –  Limitations of speculative execution across multiple branches
   –  Difficulties detecting memory dependencies among instructions (alias analysis)
•  Consequence: a significant number of functional units are idling at any given time
•  Question: can we instead execute instructions from another instruction stream?
   –  Another thread?
   –  Another process?
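A small C illustration of the data-dependency point (the function names are made up): in the first loop every addition depends on the previous one, so the hardware cannot overlap them; splitting the work into two independent accumulators exposes instruction-level parallelism.

    double sum_dependent(const double *x, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += x[i];                   /* each add must wait for the previous add */
        return s;
    }

    double sum_independent(const double *x, int n) {
        double s0 = 0.0, s1 = 0.0;       /* two independent dependence chains */
        for (int i = 0; i + 1 < n; i += 2) {
            s0 += x[i];
            s1 += x[i + 1];              /* can issue in parallel with the s0 add */
        }
        if (n % 2) s0 += x[n - 1];
        return s0 + s1;
    }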

Thread-Level Parallelism
•  Problems for executing instructions from multiple threads at the same time
   –  The instructions in each thread might use the same register names
   –  Each thread has its own program counter
•  Virtual memory management allows multiple threads to execute while sharing the main memory
•  When to switch between different threads:
   –  Fine-grain multithreading: switches between threads on every instruction
   –  Coarse-grain multithreading: switches only on costly stalls (e.g., level-2 cache misses)
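From the software side, thread-level parallelism simply means several independent instruction streams sharing one address space. A minimal POSIX threads sketch (the function and variable names are made up for the example):

    #include <pthread.h>
    #include <stdio.h>

    void *worker(void *arg) {
        int id = *(int *)arg;            /* private: lives on this thread's stack */
        printf("thread %d: its own program counter and registers\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        int ids[2] = {0, 1};             /* shared: both threads see the same memory */
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, &ids[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        return 0;
    }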

Converting Thread-Level Parallelism to Instruction-Level Parallelism
[Figure: issue slots over processor cycles for superscalar, fine-grained, coarse-grained, and simultaneous multithreading execution; shading distinguishes threads 1-5 and idle slots]

ILP to Do TLP: e.g., Simultaneous Multi-Threading (SMT)
•  Works well if
   –  The number of compute-intensive threads does not exceed the number of threads supported by SMT
   –  The threads have very different characteristics (e.g., one thread doing mostly integer operations, another mainly floating-point operations)
•  Does not work well if
   –  Threads try to utilize the same functional units
   –  E.g., a dual-processor system, each processor supporting 2 threads simultaneously (the OS thinks there are 4 processors)
      •  2 compute-intensive application processes might end up on the same processor instead of different processors (the OS does not see the difference between SMT and real processors!)

Power, Frequency and ILP
•  CPU frequency increase flattened around 2000-2005, for two main reasons:
   1.  Limited ILP
   2.  Power consumption and heat dissipation
•  Note: even Moore's Law is ending around 2021:
   –  http://spectrum.ieee.org/semiconductors/devices/transistors-could-stop-shrinking-in-2021
   –  https://www.technologyreview.com/s/601441/moores-law-is-dead-now-what/
   –  http://www.forbes.com/sites/timworstall/2016/07/26/economics-is-important-the-end-of-moores-law

History: Past (2000) and Today

Flynn's Taxonomy
https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
[Figure: Flynn's taxonomy quadrants, with two of the categories checked]

Examples of MIMD Machines
•  Symmetric Shared-Memory Multiprocessor (SMP)
   –  Multiple processors in a box with shared-memory communication
   –  Current multicore chips are like this
   –  Every processor runs a copy of the OS
•  Distributed/Non-uniform Shared-Memory Multiprocessor
   –  Multiple processors, each with local memory, connected by a general scalable network
   –  Extremely lightweight "OS" on each node provides simple services
      •  Scheduling/synchronization
   –  Network-accessible host for I/O
•  Cluster
   –  Many independent machines connected by a general network
   –  Communication through messages
[Figures: an SMP with processors on a shared bus to memory; a grid of processor/memory (P/M) nodes with an attached host; a cluster connected by a network]

Symmetric (Shared-Memory) Multiprocessors (SMP)
•  Small numbers of cores
   –  Typically eight or fewer, and no more than 32 in most cases
•  Share a single centralized memory that all processors have equal access to
   –  Hence the term symmetric
•  All existing multicores are SMPs
•  Also called uniform memory access (UMA) multiprocessors
   –  All processors see a uniform memory latency

Centralized Shared-Memory System (I)
•  Multi-core processors
   –  Typically connected over a shared cache
   –  Previous SMP systems were typically connected over the main memory
•  Intel X7350 quad-core (Tigerton)
   –  Private L1 caches: 32 KB instruction, 32 KB data
   –  Shared L2 cache: 4 MB unified cache
[Figure: two pairs of cores, each pair with private L1 caches sharing an L2 cache, connected to a 1066 MHz front-side bus]

Centralized Shared-Memory System (SMP) (II)
•  Intel X7350 quad-core (Tigerton) multi-processor configuration
[Figure: four sockets (0-3) holding cores C0-C15, each socket with two pairs of cores sharing an L2 cache; all sockets connect through the Memory Controller Hub (MCH) to four memory banks at 8 GB/s each]

Distributed Shared-Memory Multiprocessor
•  Large processor count
   –  64 to 1000s
•  Distributed memory
   –  Remote vs. local memory
   –  Long vs. short latency
   –  High vs. low bandwidth
•  Interconnection network
   –  Bandwidth, topology, etc.
•  Non-uniform memory access (NUMA)
•  Each processor may have local I/O

Distributed Shared-Memory Multiprocessor (NUMA)
•  Reduces the memory bottleneck compared to SMPs
•  More difficult to program efficiently
   –  E.g., first-touch policy: a data item will be located in the memory of the processor which uses that data item first
•  To reduce the effects of non-uniform memory access, caches are often used
   –  ccNUMA: cache-coherent non-uniform memory access architectures
•  Largest example as of today: SGI Origin with 512 processors
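To make the first-touch point concrete for programmers, here is a hedged OpenMP sketch (assuming the OS places a page on the node of the thread that first writes it, as Linux does by default; the function name is made up): initializing the array in parallel makes each thread touch, and therefore place, the pages it will later work on.

    #include <stdlib.h>

    double *alloc_numa_friendly(size_t n) {
        double *a = malloc(n * sizeof(double));
        /* Parallel first touch: each thread writes the portion it will use later,
           so the OS places those pages in that thread's local NUMA memory. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < (long)n; i++)
            a[i] = 0.0;
        return a;
    }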

Shared-Memory Multiprocessors
•  SMP and DSM are both shared-memory multiprocessors
   –  UMA or NUMA
•  Multicore chips are SMP shared memory
•  Most multi-CPU machines are DSM
   –  NUMA
•  Shared address space (virtual address space)
   –  Not always shared memory

Current Trends in Computer Architecture
•  Cannot continue to leverage ILP
   –  Single-processor performance improvement ended in 2003
•  Current models for performance:
   –  Exploit Data-Level Parallelism (DLP) via SIMD architectures (vector, SIMD extensions and GPUs)
   –  Exploit Thread-Level Parallelism (TLP) via MIMD
   –  Heterogeneity: integrate multiple and different architectures together at the chip/system level
•  Emerging architectures
   –  Domain-specific architectures: deep learning processing units (e.g., TPU, etc.)
   –  E.g., "Machine Learning Pulls Processor Architectures onto New Path"
      •  https://www.top500.org/news/machine-learning-pulls-processor-architectures-onto-new-path/

These require explicit restructuring of the application ← Parallel Programming

The "Future" of Moore's Law
•  "The chips are down for Moore's law"
   –  http://www.nature.com/news/the-chips-are-down-for-moore-s-law-1.19338
•  Special Report: 50 Years of Moore's Law
   –  http://spectrum.ieee.org/static/special-report-50-years-of-moores-law
•  "Moore's law really is dead this time"
   –  http://arstechnica.com/information-technology/2016/02/moores-law-really-is-dead-this-time/
•  Rebooting the IT Revolution: A Call to Action (SIA/SRC, 2015)
   –  https://www.semiconductors.org/clientuploads/Resources/RITR%20WEB%20version%20FINAL.pdf
