cs 61c: great ideas in computer architecture lecture 18: parallel processing –...

57
CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMD John Wawrzynek & Nick Weaver http://inst.eecs.berkeley.edu/~cs61c

Upload: others

Post on 30-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

CS61C:GreatIdeasinComputerArchitecture

Lecture18:

ParallelProcessing–SIMD

JohnWawrzynek&NickWeaverhttp://inst.eecs.berkeley.edu/~cs61c

Page 2: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

61CSurvey

Itwouldbenicetohaveareviewlectureeveryonceinawhile,actuallyshowingus

howthingsfitinthebiggerpicture

CS61c 2

Page 3: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

Agenda• 61C–thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Loopunrolling• Memoryaccessstrategy-blocking• AndinConclusion,…

CS61c 3

Page 4: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

61CTopicssofar…• Whatwelearned:

1. Binarynumbers2. C3. Pointers4. Assemblylanguage5. Processormicro-architecture6. Pipelining7. Caches8. Floatingpoint

• Whatdoesthisbuyus?− Promise:executionspeed− Let’scheck!

CS61c 4

Page 5: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

ReferenceProblem•Matrixmultiplication

−Basicoperationinmanyengineering,data,andimagingprocessingtasks− Ex:,Imagefiltering,noisereduction,…

−CoreoperationinNeuralNetsandDeepLearning− Imageclassification(cats…)−RobotCars−Machinetranslation− Fingerprintverification−Automaticgameplaying

• dgemm −double-precisionfloating-pointgeneralmatrix-multiply−Standardwell-studiedandwidelyusedroutine−PartofLinpack/Lapack

CS61c 5

Page 6: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

2D-Matrices

CS61c 6

N-1

N-1

0

0

• SquarematrixofdimensionNxN• iindexesthroughrows• jindexesthroughcolumns

Page 7: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

MatrixMultiplication

CS61c 7

C=A*BCij=Σk(Aik*Bkj)

A

B

C

Page 8: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

2DMatrixMemoryLayout

• a[][]inCusesrow-major• Fortranusescolumn-major• Ourexamplesuscolumn-major

8

a00 a01 a02 a03a10 a11 a12 a13a20 a21 a22 a23a30 a31 a32 a33

aij a31a21a11a01a30a20a10a00

a13a12a11a10a03a02a01a00 0x0

0x8

0x10

0x18

0x20

0x28

0x30

0x38

0x0

0x8

0x10

0x18

0x20

0x28

0x30

0x38

Row

Row Column

Column

Row-Major Column-Major

aij:a[i*N+j] aij:a[i+j*N]

Page 9: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

dgemmReferenceCode:Python• Linearaddressing,assumes“column-major”memorylayout

CS61c 9

N Python[Mflops]

32 5.4160 5.5

480 5.4960 5.3

• 1MFLOP=1Millionfloating-pointoperationspersecond(fadd,fmul)

• dgemm(N…)takes2*N3flops

Page 10: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

C• c=a*b• a,b,careNxNmatrices

CS61c 10

Page 11: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

TimingProgramExecution

CS61c 11

Page 12: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

CversusPython

CS61c 12

N C[GFLOPS] Python[GFLOPS]

32 1.30 0.0054160 1.30 0.0055

480 1.32 0.0054960 0.91 0.0053

Whichotherclassgivesyouthiskindofpower?Wecouldstophere…butwhy?Let’sdobetter!

240x!

Page 13: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

Agenda• 61C–thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy-blocking• AndinConclusion,…

CS61c 13

Page 14: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

WhyParallelProcessing?• CPUClockRatesarenolongerincreasing

− Technical&economicchallenges▪ Advancedcoolingtechnologytooexpensiveorimpracticalformostapplications▪ Energycostsareprohibitive

• Parallelprocessingisonlypathtohigherspeed− Compareairlines:

▪ Maximumair-speedlimitedbyeconomics▪ Usemoreandlargerairplanestoincreasethroughput▪ (Andsmallerseats…)

CS61c 14

Page 15: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

UsingParallelismforPerformance• Twobasicapproachestoparallelism:

−Multiprogramming▪ runmultipleindependentprogramsinparallel▪ “Easy”

−Parallelcomputing▪ runoneprogramfaster▪ “Hard”

•We’llfocusonparallelcomputinginthenextfewlectures

15CS61c

Page 16: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

New-SchoolMachineStructures(It’sabitmorecomplicated!)

•ParallelRequestsAssignedtocomputere.g.,Search“Katz”•ParallelThreadsAssignedtocoree.g.,Lookup,Ads•ParallelInstructions>[email protected].,5pipelinedinstructions•ParallelData>[email protected].,Addof4pairsofwords•HardwaredescriptionsAllgates@onetime•ProgrammingLanguages

16

SmartPhone

WarehouseScale

Computer

SoftwareHardware

Harness Parallelism&AchieveHigh Performance

LogicGates

Core Core…Memory(Cache)Input/Output

Computer

CacheMemory

CoreInstructionUnit(s) Functional

Unit(s)A3+B3A2+B2A1+B1A0+B0

Today’sLecture

Page 17: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

Single-Instruction/Single-DataStream(SISD)

• Sequentialcomputerthatexploitsnoparallelismineithertheinstructionordatastreams.ExamplesofSISDarchitecturearetraditionaluniprocessormachines-E.g.OurRISC-Vprocessor

17

ProcessingUnit

CS61c

Thisiswhatwediduptonowin61C

Page 18: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

Single-Instruction/Multiple-DataStream(SIMDor“sim-dee”)

• SIMDcomputerprocessesmultipledatastreamsusingasingleinstructionstream,e.g.,IntelSIMDinstructionextensionsorNVIDIAGraphicsProcessingUnit(GPU)

18CS61c

Today’stopic.

Page 19: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

Multiple-Instruction/Multiple-DataStreams(MIMDor“mim-dee”)

• Multipleautonomousprocessorssimultaneouslyexecutingdifferentinstructionsondifferentdata.• MIMDarchitecturesincludemulticoreandWarehouse-ScaleComputers

19

InstructionPool

PU

PU

PU

PU

DataPoo

l

CS61c

TopicofLecture19andbeyond.

Page 20: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

Multiple-Instruction/Single-DataStream(MISD)

• Multiple-Instruction,Single-Datastreamcomputerthatprocessesmultipleinstructionstreamswithasingledatastream.• Historicalsignificance

20CS61c

Thishasfewapplications.Notcoveredin61C.

Page 21: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

Flynn*Taxonomy,1966

• SIMDandMIMDarecurrentlythemostcommonparallelisminarchitectures–usuallybothinsamesystem!

• Mostcommonparallelprocessingprogrammingstyle:SingleProgramMultipleData(“SPMD”)

− SingleprogramthatrunsonallprocessorsofaMIMD− Cross-processorexecutioncoordinationusingsynchronizationprimitives

21CS61c

*Prof.MichaelFlynn,Stanford

Page 22: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

22

NEW YORK, NY, March 21, 2018 – ACM, the Association for Computing Machinery, today named John L. Hennessy, former President of Stanford University, and David A. Patterson, retired Professor of the University of California, Berkeley, recipients of the 2017 ACM A.M. Turing Award for pioneering a systematic, quantitative approach to the design and evaluation of computer architectures with enduring impact on the microprocessor industry. Hennessy and Patterson created a systematic and quantitative approach to designing faster, lower power, and reduced instruction set computer (RISC) microprocessors.

PattersonandHennessywinTurningAward!

Page 23: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

Agenda• 61C–thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy-blocking• AndinConclusion,…

CS61c 23

Page 24: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

SIMD–“SingleInstructionMultipleData”

24CS61c

Page 25: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

SIMD(Vector)Mode

CS61c 25

Page 26: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

SIMDApplications&Implementations•Applications

−Scientificcomputing▪Matlab,NumPy

−Graphicsandvideoprocessing▪Photoshop,…

−BigData▪Deeplearning

−Gaming−…

•Implementations−x86−ARM−RISC-Vvectorextensions

CS61c 26

Page 27: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

27

FirstSIMDExtensions:MITLincolnLabsTX-2,1957

CS61c

Page 28: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

28

Intelx86MMX1997

Page 29: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

29https://chrisadkin.io/2015/06/04/under-the-hood-of-the-batch-engine-simd-with-sql-server-2016-ctp/

AVXalsosupportedbyAMDprocessors

Page 30: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

LaptopCPUSpecs$ sysctl -a | grep cpuhw.physicalcpu: 2hw.logicalcpu: 4

machdep.cpu.brand_string: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz

machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C

machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT FPU_CSDS

CS61c 30

Page 31: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

AVXSIMDRegisters

CS61c 31

Page 32: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

SIMDDataTypes

CS61c 32

(NowalsoAVX-512available)

Page 33: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

Agenda• 61C–thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy-blocking• AndinConclusion,…

CS61c 33

Page 34: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

Problem• Today’scompilerscangenerateSIMDcode• Butinsomecases,betterresultsbyhand(assembly)•Wewillstudyx86(notusingRISC-Vasnovectorhardwarewidelyavailableyet)−Over1000instructionstolearn…

• Canweusethecompilertogenerateallnon-SIMDinstructions?

CS61c 34

Page 35: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMDhttps://software.intel.com/sites/landingpage/IntrinsicsGuide/

CS61c 35

4parallelmultiplies

2instructionsperclockcycle(CPI=0.5)

assemblyinstruction

x86SIMD“Intrinsics”

Intrinsic

Page 36: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

x86IntrinsicsAVXDataTypes

CS61c 36

Intrinsics:DirectaccesstoassemblyfromC

Page 37: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

IntrinsicsAVXCodeNomenclature

CS61c 37

Page 38: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

RawDouble-PrecisionThroughput

Characteristic Value

CPU i7-5557U

Clockrate(sustained) 3.1GHz

Instructionsperclock(mul_pd) 2

Parallelmultipliesperinstruction 4

PeakdoubleFLOPS 24.8GFLOPS

CS61c 38

Actualperformanceislowerbecauseofoverheadhttps://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

Page 39: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

VectorizedMatrixMultiplication

CS61c 39

InnerLoop:

fori…;i+=4forj...

i+=4

Page 40: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

“Vectorized”dgemm

CS61c 40

Page 41: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

Performance

NGflops

scalar avx

32 1.30 4.56

160 1.30 5.47480 1.32 5.27960 0.91 3.64

CS61c 41

• 4xfaster• Butstill<<theoretical25GFLOPS!

Page 42: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

Agenda• 61C–thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Loopunrolling• Memoryaccessstrategy-blocking• AndinConclusion,…

CS61c 42

Page 43: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

LoopUnrolling• Onhighperformanceprocessors,optimizingcompilersperforms“loopunrolling”operationtoexposemoreparallelismandimproveperformance:

for(i=0; i<N; i++) x[i] = x[i] + s;• Couldbecome: for(i=0; i<N; i+=4) { x[i] = x[i] + s;

x[i-1] = x[i+1] + s; x[i-2] = x[i+2] + s; x[i-3] = x[i+3] + s;}

43

1. Exposedata-levelparallelismforvector(SIMD)instructionsorsuper-scalarmultipleinstructionissue

2. Mixpipelinewithunrelatedoperationstohelpwithreducehazards

3. Reduceloop“overhead”4. Makescodesizelarger

Page 44: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

Amdahl’sLaw*appliedtodgemm• Measureddgemmperformance

− Peak5.5GFLOPS− Largematrices3.6GFLOPS− Processor 24.8GFLOPS

• Whyarewenotgetting(closeto)25GFLOPS?− Somethingelse(notfloating-pointALU)islimitingperformance!− Butwhat?Possibleculprits:

▪ Cache▪ Hazards▪ Let’slookatboth!

CS61c 44

* Amdahl’sLawstatesthattheamountaspeeduppossiblethroughparallelismislimitedbythesequential(non-parallel)work.

Page 45: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMDCS61c 45

PipelineHazards–dgemm

“add_pd”dependsonresultof“mult_pd”whichdependson“load_pd”

Page 46: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

LoopUnrolling

CS61c 46

Compilerdoestheunrolling

Howdoyouverifythatthegeneratedcodeisactuallyunrolled?

4registers

Page 47: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

Performance

NGflops

scalar avx unroll

32 1.30 4.56 12.95

160 1.30 5.47 19.70480 1.32 5.27 14.50960 0.91 3.64 6.91

CS61c 47

?

WOW!

Page 48: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

Agenda• 61C–thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy-blocking• AndinConclusion,…

CS61c 48

Page 49: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

FPUversusMemoryAccess• Howmanyfloating-pointoperationsdoesmatrixmultiplytake?

− F=2xN3(N3multiplies,N3adds)

• Howmanymemoryload/stores?− M=3xN2(forA,B,C)

• Manymorefloating-pointoperationsthanmemoryaccesses− q=F/M=2/3*N− Good,sincearithmeticisfasterthanmemoryaccess− Let’scheckthecode…

CS61c 49

Page 50: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

Butmemoryisaccessedrepeatedly

• q=F/M=1.6!(1.25loadsand2floating-pointoperations)

CS61c 50

Innerloop:

Page 51: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMDCS61c 51

Second-LevelCache(SRAM)

TypicalMemoryHierarchyControl

Datapath

SecondaryMemory(Disk

OrFlash)

On-ChipComponents

RegFile

MainMemory(DRAM)Data

CacheInstrCache

Speed(cycles):½’s1’s10’s100’s-10001,000,000’sSize(bytes):100’s10K’sM’sG’sT’s

• Wherearetheoperands(A,B,C)stored?• WhathappensasNincreases?• Idea:arrangethatmostaccessesaretofastcache!

Cost/bit:highestlowest

Third-LevelCache(SRAM)

Page 52: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

Blocking• Idea:

− Rearrangecodetousevaluesloadedincachemanytimes− Only“few”accessestoslowmainmemory(DRAM)perfloatingpointoperation

−à throughputlimitedbyFPhardwareandcache,notslowDRAM

− P&H,RISC-Veditionp.465

CS61c 52

Page 53: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

BlockingMatrixMultiply(divideandconquer:sub-matrixmultiplication)

53

X =

X =

X =

X =

X =

X =

Page 54: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

MemoryAccessBlocking

CS61c 54

Page 55: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

Performance

NGflops

scalar avx unroll blocking

32 1.30 4.56 12.95 13.80

160 1.30 5.47 19.70 21.79480 1.32 5.27 14.50 20.17960 0.91 3.64 6.91 15.82

CS61c 55

Page 56: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

Agenda• 61C–thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy-blocking• AndinConclusion,…

CS61c 56

Page 57: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMDcs61c/sp18/lec/18/lec18.pdf · 2018-03-22 · Lecture 18: Parallel Processing - SIMD Flynn* Taxonomy,

Lecture18:ParallelProcessing-SIMD

AndinConclusion,…• ApproachestoParallelism

− SISD,SIMD,MIMD(nextlecture)• SIMD

−Oneinstructionoperatesonmultipleoperandssimultaneously

• Example:matrixmultiplication− Floatingpointheavyà exploitMoore’slawtomakefast

57CS61c