cs 61c: great ideas in computer architecture lecture 18: parallel processing –...

CS61C:GreatIdeasinComputerArchitecture

Lecture18:

ParallelProcessing–SIMD

JohnWawrzynek&NickWeaverhttp://inst.eecs.berkeley.edu/~cs61c

Lecture18:ParallelProcessing-SIMD

61CSurvey

Itwouldbenicetohaveareviewlectureeveryonceinawhile,actuallyshowingus

howthingsfitinthebiggerpicture

CS61c 2


Agenda• 61C–thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Loopunrolling• Memoryaccessstrategy-blocking• AndinConclusion,…

CS61c 3


61CTopicssofar…• Whatwelearned:

1. Binarynumbers2. C3. Pointers4. Assemblylanguage5. Processormicro-architecture6. Pipelining7. Caches8. Floatingpoint

• Whatdoesthisbuyus?− Promise:executionspeed− Let’scheck!

CS61c 4


ReferenceProblem•Matrixmultiplication

−Basicoperationinmanyengineering,data,andimagingprocessingtasks− Ex:,Imagefiltering,noisereduction,…

−CoreoperationinNeuralNetsandDeepLearning− Imageclassification(cats…)−RobotCars−Machinetranslation− Fingerprintverification−Automaticgameplaying

• dgemm −double-precisionfloating-pointgeneralmatrix-multiply−Standardwell-studiedandwidelyusedroutine−PartofLinpack/Lapack

CS61c 5


2D-Matrices

CS61c 6

N-1

N-1

0

0

• SquarematrixofdimensionNxN• iindexesthroughrows• jindexesthroughcolumns

MatrixMultiplication

CS61c 7

C=A*BCij=Σk(Aik*Bkj)

A

B

C

2DMatrixMemoryLayout

• a[][]inCusesrow-major• Fortranusescolumn-major• Ourexamplesuscolumn-major

8

a00 a01 a02 a03a10 a11 a12 a13a20 a21 a22 a23a30 a31 a32 a33

aij a31a21a11a01a30a20a10a00

a13a12a11a10a03a02a01a00 0x0

0x8

0x10

0x18

0x20

0x28

0x30

0x38

0x0

0x8

0x10

0x18

0x20

0x28

0x30

0x38

Row

Row Column

Column

Row-Major Column-Major

aij:a[i*N+j] aij:a[i+j*N]


dgemmReferenceCode:Python• Linearaddressing,assumes“column-major”memorylayout

CS61c 9

N Python[Mflops]

32 5.4160 5.5

480 5.4960 5.3

• 1MFLOP=1Millionfloating-pointoperationspersecond(fadd,fmul)

• dgemm(N…)takes2*N3flops


C• c=a*b• a,b,careNxNmatrices

CS61c 10


TimingProgramExecution

CS61c 11


CversusPython

CS61c 12

N C[GFLOPS] Python[GFLOPS]

32 1.30 0.0054160 1.30 0.0055

480 1.32 0.0054960 0.91 0.0053

Whichotherclassgivesyouthiskindofpower?Wecouldstophere…butwhy?Let’sdobetter!

240x!


Agenda• 61C–thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy-blocking• AndinConclusion,…

CS61c 13


WhyParallelProcessing?• CPUClockRatesarenolongerincreasing

− Technical&economicchallenges▪ Advancedcoolingtechnologytooexpensiveorimpracticalformostapplications▪ Energycostsareprohibitive

• Parallelprocessingisonlypathtohigherspeed− Compareairlines:

▪ Maximumair-speedlimitedbyeconomics▪ Usemoreandlargerairplanestoincreasethroughput▪ (Andsmallerseats…)

CS61c 14


UsingParallelismforPerformance• Twobasicapproachestoparallelism:

−Multiprogramming▪ runmultipleindependentprogramsinparallel▪ “Easy”

−Parallelcomputing▪ runoneprogramfaster▪ “Hard”

•We’llfocusonparallelcomputinginthenextfewlectures

15CS61c

New-SchoolMachineStructures(It’sabitmorecomplicated!)

•ParallelRequestsAssignedtocomputere.g.,Search“Katz”•ParallelThreadsAssignedtocoree.g.,Lookup,Ads•ParallelInstructions>[email protected].,5pipelinedinstructions•ParallelData>[email protected].,Addof4pairsofwords•HardwaredescriptionsAllgates@onetime•ProgrammingLanguages

16

SmartPhone

WarehouseScale

Computer

SoftwareHardware

Harness Parallelism&AchieveHigh Performance

LogicGates

Core Core…Memory(Cache)Input/Output

Computer

CacheMemory

CoreInstructionUnit(s) Functional

Unit(s)A3+B3A2+B2A1+B1A0+B0

Today’sLecture


Single-Instruction/Single-DataStream(SISD)

• Sequentialcomputerthatexploitsnoparallelismineithertheinstructionordatastreams.ExamplesofSISDarchitecturearetraditionaluniprocessormachines－E.g.OurRISC-Vprocessor

17

ProcessingUnit

CS61c

Thisiswhatwediduptonowin61C


Single-Instruction/Multiple-DataStream(SIMDor“sim-dee”)

• SIMDcomputerprocessesmultipledatastreamsusingasingleinstructionstream,e.g.,IntelSIMDinstructionextensionsorNVIDIAGraphicsProcessingUnit(GPU)

18CS61c

Today’stopic.


Multiple-Instruction/Multiple-DataStreams(MIMDor“mim-dee”)

• Multipleautonomousprocessorssimultaneouslyexecutingdifferentinstructionsondifferentdata.• MIMDarchitecturesincludemulticoreandWarehouse-ScaleComputers

19

InstructionPool

PU

PU

PU

PU

DataPoo

l

CS61c

TopicofLecture19andbeyond.


Multiple-Instruction/Single-DataStream(MISD)

• Multiple-Instruction,Single-Datastreamcomputerthatprocessesmultipleinstructionstreamswithasingledatastream.• Historicalsignificance

20CS61c

Thishasfewapplications.Notcoveredin61C.


Flynn*Taxonomy,1966

• SIMDandMIMDarecurrentlythemostcommonparallelisminarchitectures–usuallybothinsamesystem!

• Mostcommonparallelprocessingprogrammingstyle:SingleProgramMultipleData(“SPMD”)

− SingleprogramthatrunsonallprocessorsofaMIMD− Cross-processorexecutioncoordinationusingsynchronizationprimitives

21CS61c

*Prof.MichaelFlynn,Stanford

22

NEW YORK, NY, March 21, 2018 – ACM, the Association for Computing Machinery, today named John L. Hennessy, former President of Stanford University, and David A. Patterson, retired Professor of the University of California, Berkeley, recipients of the 2017 ACM A.M. Turing Award for pioneering a systematic, quantitative approach to the design and evaluation of computer architectures with enduring impact on the microprocessor industry. Hennessy and Patterson created a systematic and quantitative approach to designing faster, lower power, and reduced instruction set computer (RISC) microprocessors.

PattersonandHennessywinTurningAward!

http://www.acm.org/

https://awards.acm.org/award_winners/hennessy_1426931

https://awards.acm.org/award_winners/hennessy_1426931

https://awards.acm.org/award_winners/patterson_2316693

https://awards.acm.org/award_winners/patterson_2316693



CS61c 23


SIMD–“SingleInstructionMultipleData”

24CS61c


SIMD(Vector)Mode

CS61c 25


SIMDApplications&Implementations•Applications

−Scientificcomputing▪Matlab,NumPy

−Graphicsandvideoprocessing▪Photoshop,…

−BigData▪Deeplearning

−Gaming−…

•Implementations−x86−ARM−RISC-Vvectorextensions

CS61c 26

27

FirstSIMDExtensions:MITLincolnLabsTX-2,1957

CS61c

28

Intelx86MMX1997

29https://chrisadkin.io/2015/06/04/under-the-hood-of-the-batch-engine-simd-with-sql-server-2016-ctp/

AVXalsosupportedbyAMDprocessors


LaptopCPUSpecs$ sysctl -a | grep cpuhw.physicalcpu: 2hw.logicalcpu: 4

machdep.cpu.brand_string: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz

machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C

machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT FPU_CSDS

CS61c 30


AVXSIMDRegisters

CS61c 31


SIMDDataTypes

CS61c 32

(NowalsoAVX-512available)



CS61c 33


Problem• Today’scompilerscangenerateSIMDcode• Butinsomecases,betterresultsbyhand(assembly)•Wewillstudyx86(notusingRISC-Vasnovectorhardwarewidelyavailableyet)−Over1000instructionstolearn…

• Canweusethecompilertogenerateallnon-SIMDinstructions?

CS61c 34

Lecture18:ParallelProcessing-SIMDhttps://software.intel.com/sites/landingpage/IntrinsicsGuide/

CS61c 35

4parallelmultiplies

2instructionsperclockcycle(CPI=0.5)

assemblyinstruction

x86SIMD“Intrinsics”

Intrinsic


x86IntrinsicsAVXDataTypes

CS61c 36

Intrinsics:DirectaccesstoassemblyfromC


IntrinsicsAVXCodeNomenclature

CS61c 37


RawDouble-PrecisionThroughput

Characteristic Value

CPU i7-5557U

Clockrate(sustained) 3.1GHz

Instructionsperclock(mul_pd) 2

Parallelmultipliesperinstruction 4

PeakdoubleFLOPS 24.8GFLOPS

CS61c 38

Actualperformanceislowerbecauseofoverheadhttps://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

VectorizedMatrixMultiplication

CS61c 39

InnerLoop:

fori…;i+=4forj...

i+=4


“Vectorized”dgemm

CS61c 40


Performance

NGflops

scalar avx

32 1.30 4.56

160 1.30 5.47480 1.32 5.27960 0.91 3.64

CS61c 41

• 4xfaster• Butstill<<theoretical25GFLOPS!


Agenda• 61C–thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Loopunrolling• Memoryaccessstrategy-blocking• AndinConclusion,…

CS61c 42

LoopUnrolling• Onhighperformanceprocessors,optimizingcompilersperforms“loopunrolling”operationtoexposemoreparallelismandimproveperformance:

for(i=0; i<N; i++) x[i] = x[i] + s;• Couldbecome: for(i=0; i<N; i+=4) { x[i] = x[i] + s;

x[i-1] = x[i+1] + s; x[i-2] = x[i+2] + s; x[i-3] = x[i+3] + s;}

43

1. Exposedata-levelparallelismforvector(SIMD)instructionsorsuper-scalarmultipleinstructionissue

2. Mixpipelinewithunrelatedoperationstohelpwithreducehazards

3. Reduceloop“overhead”4. Makescodesizelarger


Amdahl’sLaw*appliedtodgemm• Measureddgemmperformance

− Peak5.5GFLOPS− Largematrices3.6GFLOPS− Processor 24.8GFLOPS

• Whyarewenotgetting(closeto)25GFLOPS?− Somethingelse(notfloating-pointALU)islimitingperformance!− Butwhat?Possibleculprits:

▪ Cache▪ Hazards▪ Let’slookatboth!

CS61c 44

* Amdahl’sLawstatesthattheamountaspeeduppossiblethroughparallelismislimitedbythesequential(non-parallel)work.

Lecture18:ParallelProcessing-SIMDCS61c 45

PipelineHazards–dgemm

“add_pd”dependsonresultof“mult_pd”whichdependson“load_pd”


LoopUnrolling

CS61c 46

Compilerdoestheunrolling

Howdoyouverifythatthegeneratedcodeisactuallyunrolled?

4registers


Performance

NGflops

scalar avx unroll

32 1.30 4.56 12.95

160 1.30 5.47 19.70480 1.32 5.27 14.50960 0.91 3.64 6.91

CS61c 47

?

WOW!



CS61c 48


FPUversusMemoryAccess• Howmanyfloating-pointoperationsdoesmatrixmultiplytake?

− F=2xN3(N3multiplies,N3adds)

• Howmanymemoryload/stores?− M=3xN2(forA,B,C)

• Manymorefloating-pointoperationsthanmemoryaccesses− q=F/M=2/3*N− Good,sincearithmeticisfasterthanmemoryaccess− Let’scheckthecode…

CS61c 49


Butmemoryisaccessedrepeatedly

• q=F/M=1.6!(1.25loadsand2floating-pointoperations)

CS61c 50

Innerloop:

Lecture18:ParallelProcessing-SIMDCS61c 51

Second-LevelCache(SRAM)

TypicalMemoryHierarchyControl

Datapath

SecondaryMemory(Disk

OrFlash)

On-ChipComponents

RegFile

MainMemory(DRAM)Data

CacheInstrCache

Speed(cycles):½’s1’s10’s100’s-10001,000,000’sSize(bytes):100’s10K’sM’sG’sT’s

• Wherearetheoperands(A,B,C)stored?• WhathappensasNincreases?• Idea:arrangethatmostaccessesaretofastcache!

Cost/bit:highestlowest

Third-LevelCache(SRAM)


Blocking• Idea:

− Rearrangecodetousevaluesloadedincachemanytimes− Only“few”accessestoslowmainmemory(DRAM)perfloatingpointoperation

−à throughputlimitedbyFPhardwareandcache,notslowDRAM

− P&H,RISC-Veditionp.465

CS61c 52

BlockingMatrixMultiply(divideandconquer:sub-matrixmultiplication)

53

X =

X =

X =

X =

X =

X =


MemoryAccessBlocking

CS61c 54


Performance

NGflops

scalar avx unroll blocking

32 1.30 4.56 12.95 13.80

160 1.30 5.47 19.70 21.79480 1.32 5.27 14.50 20.17960 0.91 3.64 6.91 15.82

CS61c 55



CS61c 56


AndinConclusion,…• ApproachestoParallelism

− SISD,SIMD,MIMD(nextlecture)• SIMD

−Oneinstructionoperatesonmultipleoperandssimultaneously

• Example:matrixmultiplication− Floatingpointheavyà exploitMoore’slawtomakefast

57CS61c

cs 61c: great ideas in computer architecture lecture 18: parallel processing –...

Documents