
CS 61C: Great Ideas in Computer Architecture

Lecture 18: Parallel Processing – SIMD

John Wawrzynek & Nick Weaver
http://inst.eecs.berkeley.edu/~cs61c

61C Survey

"It would be nice to have a review lecture every once in a while, actually showing us how things fit in the bigger picture."

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Loop unrolling
• Memory access strategy: blocking
• And in conclusion, …

61C Topics So Far
• What we learned:
  1. Binary numbers
  2. C
  3. Pointers
  4. Assembly language
  5. Processor micro-architecture
  6. Pipelining
  7. Caches
  8. Floating point
• What does this buy us?
  − Promise: execution speed
  − Let's check!

Reference Problem
• Matrix multiplication
  − Basic operation in many engineering, data, and image processing tasks
    ▪ E.g., image filtering, noise reduction, …
  − Core operation in neural nets and deep learning
    ▪ Image classification (cats…)
    ▪ Robot cars
    ▪ Machine translation
    ▪ Fingerprint verification
    ▪ Automatic game playing
• dgemm
  − Double-precision floating-point general matrix multiply
  − Standard, well-studied, and widely used routine
  − Part of Linpack/Lapack

2D Matrices

[Figure: an N×N matrix, with row index i and column index j each running from 0 to N−1]

• Square matrix of dimension N×N
• i indexes through rows
• j indexes through columns

Matrix Multiplication

C = A · B,  Cij = Σk (Aik · Bkj)

[Figure: row i of A times column j of B produces element ij of C]

2D Matrix Memory Layout
• a[][] in C uses row-major order
• Fortran uses column-major order
• Our examples use column-major order

For a 4×4 matrix of doubles (8 bytes each):

  a00 a01 a02 a03
  a10 a11 a12 a13
  a20 a21 a22 a23
  a30 a31 a32 a33

  Address     0x0  0x8  0x10 0x18 0x20 0x28 0x30 0x38 …
  Row-major:  a00  a01  a02  a03  a10  a11  a12  a13  (each row contiguous)
  Col-major:  a00  a10  a20  a30  a01  a11  a21  a31  (each column contiguous)

  Row-major:     aij is a[i*N + j]
  Column-major:  aij is a[i + j*N]
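To make the two conventions concrete, here is a small C sketch (ours, not from the slides; the macro names RM and CM are made up for illustration):

#include <stdio.h>

/* Illustrative macros for the two layouts of an N x N matrix. */
#define RM(a, N, i, j) (a)[(i)*(N) + (j)]  /* row-major:    rows contiguous    */
#define CM(a, N, i, j) (a)[(i) + (j)*(N)]  /* column-major: columns contiguous */

int main(void) {
    double a[16];                           /* 4x4 matrix as a flat array */
    for (int k = 0; k < 16; k++) a[k] = k;  /* element value = its linear index */
    /* a12 sits at linear index 1*4+2 = 6 in row-major, 1+2*4 = 9 in column-major. */
    printf("row-major a12 = %g, column-major a12 = %g\n",
           RM(a, 4, 1, 2), CM(a, 4, 1, 2));
    return 0;
}

This prints "row-major a12 = 6, column-major a12 = 9": the same (i, j) names a different memory word under each layout.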

dgemm Reference Code: Python
• Linear addressing, assumes column-major memory layout

  N    Python [Mflop/s]
  32   5.4
  160  5.5
  480  5.4
  960  5.3

• 1 Mflop/s = 1 million floating-point operations (fadd, fmul) per second
• dgemm on N×N matrices takes 2·N³ flops; at N = 960 that is about 1.8·10⁹ flops, so at ~5.3 Mflop/s Python needs roughly 5.5 minutes per multiply

C
• c = a * b
• a, b, c are N×N matrices
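A minimal scalar dgemm in C, consistent with the column-major layout above (a sketch in the style of the P&H running example; the lecture's exact listing may differ):

/* C = C + A*B for n x n column-major matrices of doubles. */
void dgemm(int n, double *A, double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double cij = C[i + j*n];            /* cij = C[i][j] */
            for (int k = 0; k < n; k++)
                cij += A[i + k*n] * B[k + j*n]; /* cij += A[i][k] * B[k][j] */
            C[i + j*n] = cij;                   /* C[i][j] = cij */
        }
}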

Timing Program Execution
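One plausible way to time such a kernel in C (a sketch assuming a POSIX clock_gettime; the slide's method may differ):

#include <stdio.h>
#include <time.h>

double time_dgemm(int n, double *A, double *B, double *C) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);   /* monotonic: immune to clock resets */
    dgemm(n, A, B, C);                     /* the kernel under test */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("N=%d: %.3f s, %.2f Gflop/s\n", n, s, 2.0*n*n*n / s / 1e9);
    return s;
}

Gflop/s then follows directly as 2·N³ flops divided by the measured seconds.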

C versus Python

  N    C [Gflop/s]   Python [Gflop/s]
  32   1.30          0.0054
  160  1.30          0.0055
  480  1.32          0.0054
  960  0.91          0.0053

• C is roughly 240x faster than Python!
• Which other class gives you this kind of power?
• We could stop here… but why? Let's do better!

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy: blocking
• And in conclusion, …

Why Parallel Processing?
• CPU clock rates are no longer increasing
  − Technical & economic challenges
    ▪ Advanced cooling technology too expensive or impractical for most applications
    ▪ Energy costs are prohibitive
• Parallel processing is the only path to higher speed
  − Compare airlines:
    ▪ Maximum air speed limited by economics
    ▪ Use more and larger airplanes to increase throughput
    ▪ (And smaller seats…)

Using Parallelism for Performance
• Two basic approaches to parallelism:
  − Multiprogramming
    ▪ Run multiple independent programs in parallel
    ▪ "Easy"
  − Parallel computing
    ▪ Run one program faster
    ▪ "Hard"
• We'll focus on parallel computing in the next few lectures

New-School Machine Structures (It's a bit more complicated!)
• Parallel requests: assigned to computer, e.g., search "Katz"
• Parallel threads: assigned to core, e.g., lookup, ads
• Parallel instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel data: >1 data item @ one time, e.g., add of 4 pairs of words (today's lecture)
• Hardware descriptions: all gates @ one time
• Programming languages

[Figure: software/hardware layers that harness parallelism and achieve high performance, from warehouse-scale computer and smartphone down through computer (cores, memory/cache, input/output), core (instruction unit(s) and functional unit(s) computing A0+B0, A1+B1, A2+B2, A3+B3), to logic gates]

Single-Instruction/Single-Data Stream (SISD)
• Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines
  − E.g., our RISC-V processor
• This is what we did up to now in 61C

[Figure: one processing unit fed by a single instruction stream and a single data stream]

Single-Instruction/Multiple-Data Stream (SIMD, or "sim-dee")
• SIMD computer processes multiple data streams using a single instruction stream, e.g., Intel SIMD instruction extensions or NVIDIA Graphics Processing Unit (GPU)
• Today's topic.

Multiple-Instruction/Multiple-Data Streams (MIMD, or "mim-dee")
• Multiple autonomous processors simultaneously executing different instructions on different data.
• MIMD architectures include multicore and Warehouse-Scale Computers
• Topic of Lecture 19 and beyond.

[Figure: an instruction pool and a data pool feeding multiple processing units (PUs)]

Multiple-Instruction/Single-Data Stream (MISD)
• Multiple-instruction, single-data stream computer that processes multiple instruction streams with a single data stream.
• Historical significance; this has few applications. Not covered in 61C.

Flynn* Taxonomy, 1966
• SIMD and MIMD are currently the most common parallelism in architectures – usually both in the same system!
• Most common parallel processing programming style: Single Program Multiple Data ("SPMD")
  − Single program that runs on all processors of a MIMD
  − Cross-processor execution coordination using synchronization primitives

* Prof. Michael Flynn, Stanford

Patterson and Hennessy win Turing Award!

NEW YORK, NY, March 21, 2018 – ACM, the Association for Computing Machinery, today named John L. Hennessy, former President of Stanford University, and David A. Patterson, retired Professor of the University of California, Berkeley, recipients of the 2017 ACM A.M. Turing Award for pioneering a systematic, quantitative approach to the design and evaluation of computer architectures with enduring impact on the microprocessor industry. Hennessy and Patterson created a systematic and quantitative approach to designing faster, lower power, and reduced instruction set computer (RISC) microprocessors.

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy: blocking
• And in conclusion, …

SIMD: "Single Instruction Multiple Data"

SIMD (Vector) Mode

SIMD Applications & Implementations
• Applications
  − Scientific computing
    ▪ Matlab, NumPy
  − Graphics and video processing
    ▪ Photoshop, …
  − Big data
    ▪ Deep learning
  − Gaming
  − …
• Implementations
  − x86
  − ARM
  − RISC-V vector extensions

First SIMD Extensions: MIT Lincoln Labs TX-2, 1957

Intel x86 MMX, 1997

AVX also supported by AMD processors
https://chrisadkin.io/2015/06/04/under-the-hood-of-the-batch-engine-simd-with-sql-server-2016-ctp/

Laptop CPU Specs

$ sysctl -a | grep cpu
hw.physicalcpu: 2
hw.logicalcpu: 4
machdep.cpu.brand_string: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT FPU_CSDS

Note the SIMD extensions in the feature list: MMX, SSE…SSE4.2, AVX1.0, AVX2, FMA.

AVX SIMD Registers

SIMD Data Types
(Now also AVX-512 available)

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy: blocking
• And in conclusion, …

Problem
• Today's compilers can generate SIMD code
• But in some cases, you get better results by hand (assembly)
• We will study x86 (not RISC-V, since no RISC-V vector hardware is widely available yet)
  − Over 1000 instructions to learn…
• Can we use the compiler to generate all the non-SIMD instructions?

x86 SIMD "Intrinsics"
https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Each intrinsic maps to an assembly instruction: e.g., one 256-bit packed-double multiply performs 4 parallel multiplies, and the CPU can issue 2 such instructions per clock cycle (CPI = 0.5).

x86 Intrinsics AVX Data Types
• Intrinsics: direct access to assembly from C
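For example (an illustrative sketch; the sample values are ours), the intrinsic _mm256_mul_pd compiles to one vmulpd instruction and performs 4 double-precision multiplies at once:

#include <stdio.h>
#include <x86intrin.h>  /* AVX intrinsics; compile with -mavx */

int main(void) {
    /* __m256d = 4 doubles in one 256-bit YMM register (_mm256_set_pd lists
       the high lane first, so the lanes below read 1,2,3,4 and 5,6,7,8). */
    __m256d a = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);
    __m256d b = _mm256_set_pd(8.0, 7.0, 6.0, 5.0);
    __m256d c = _mm256_mul_pd(a, b);   /* one vmulpd: 4 multiplies in parallel */

    double out[4];
    _mm256_storeu_pd(out, c);          /* unaligned store of the 4 results */
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 5 12 21 32 */
    return 0;
}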

Intrinsics AVX Code Nomenclature

Raw Double-Precision Throughput

  Characteristic                         Value
  CPU                                    i7-5557U
  Clock rate (sustained)                 3.1 GHz
  Instructions per clock (mul_pd)        2
  Parallel multiplies per instruction    4
  Peak double FLOPS                      3.1 GHz × 2 × 4 = 24.8 GFLOPS

Actual performance is lower because of overhead.
https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

Vectorized Matrix Multiplication

Inner loop:
  for i …; i += 4   (4 consecutive elements of a column of C per AVX register)
    for j …

"Vectorized" dgemm
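A sketch of the AVX dgemm, following the P&H running example (assumes n is a multiple of 4 and 32-byte-aligned arrays; the lecture's exact listing may differ):

#include <x86intrin.h>

void dgemm_avx(int n, double *A, double *B, double *C) {
    for (int i = 0; i < n; i += 4)                    /* 4 rows of C at a time */
        for (int j = 0; j < n; j++) {
            __m256d c0 = _mm256_load_pd(C + i + j*n); /* c0 = C[i..i+3][j] */
            for (int k = 0; k < n; k++)
                c0 = _mm256_add_pd(c0,                /* c0 += A[i..i+3][k]*B[k][j] */
                       _mm256_mul_pd(_mm256_load_pd(A + i + k*n),
                                     _mm256_broadcast_sd(B + k + j*n)));
            _mm256_store_pd(C + i + j*n, c0);         /* C[i..i+3][j] = c0 */
        }
}

_mm256_broadcast_sd copies the scalar B[k][j] into all 4 lanes, so each multiply-add updates 4 elements of a column of C at once.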

Performance

  N    Gflop/s
       scalar   avx
  32   1.30     4.56
  160  1.30     5.47
  480  1.32     5.27
  960  0.91     3.64

• 4x faster
• But still << the theoretical 25 GFLOPS!

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Loop unrolling
• Memory access strategy: blocking
• And in conclusion, …

Loop Unrolling
• On high-performance processors, optimizing compilers perform "loop unrolling" to expose more parallelism and improve performance:

  for (i = 0; i < N; i++)
      x[i] = x[i] + s;

• Could become (assuming N is a multiple of 4):

  for (i = 0; i < N; i += 4) {
      x[i]   = x[i]   + s;
      x[i+1] = x[i+1] + s;
      x[i+2] = x[i+2] + s;
      x[i+3] = x[i+3] + s;
  }

Why unroll?
1. Expose data-level parallelism for vector (SIMD) instructions or superscalar multiple instruction issue
2. Mix unrelated operations in the pipeline to help reduce hazards
3. Reduce loop "overhead" (index increment, compare, branch)
4. But it makes code size larger

Amdahl's Law* Applied to dgemm
• Measured dgemm performance
  − Peak            5.5 GFLOPS
  − Large matrices  3.6 GFLOPS
  − Processor       24.8 GFLOPS
• Why are we not getting (close to) 25 GFLOPS?
  − Something else (not the floating-point ALU) is limiting performance!
  − But what? Possible culprits:
    ▪ Cache
    ▪ Hazards
    ▪ Let's look at both!

* Amdahl's Law states that the speedup achievable through parallelism is limited by the sequential (non-parallel) work: if a fraction p of the work is sped up by a factor s, the overall speedup is 1 / ((1 − p) + p/s).

Pipeline Hazards – dgemm
• "add_pd" depends on the result of "mult_pd", which depends on "load_pd"

Loop Unrolling
• The compiler does the unrolling: 4 AVX registers hold independent partial sums
• How do you verify that the generated code is actually unrolled? (Inspect the compiler's assembly output.)
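A sketch of the unrolled version, again following P&H: UNROLL = 4 independent accumulator registers per j iteration, so the add_pd of one chain no longer waits on the mul_pd of the previous one:

#include <x86intrin.h>
#define UNROLL 4

void dgemm_unroll(int n, double *A, double *B, double *C) {
    for (int i = 0; i < n; i += UNROLL * 4)
        for (int j = 0; j < n; j++) {
            __m256d c[UNROLL];                       /* 4 independent partial sums */
            for (int x = 0; x < UNROLL; x++)
                c[x] = _mm256_load_pd(C + i + x*4 + j*n);
            for (int k = 0; k < n; k++) {
                __m256d b = _mm256_broadcast_sd(B + k + j*n);
                for (int x = 0; x < UNROLL; x++)     /* compiler unrolls this loop */
                    c[x] = _mm256_add_pd(c[x],
                             _mm256_mul_pd(_mm256_load_pd(A + i + x*4 + k*n), b));
            }
            for (int x = 0; x < UNROLL; x++)
                _mm256_store_pd(C + i + x*4 + j*n, c[x]);
        }
}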

Performance

  N    Gflop/s
       scalar   avx    unroll
  32   1.30     4.56   12.95
  160  1.30     5.47   19.70
  480  1.32     5.27   14.50
  960  0.91     3.64    6.91

WOW! But why does performance fall off for the largest N?

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy: blocking
• And in conclusion, …

FPU versus Memory Access
• How many floating-point operations does matrix multiply take?
  − F = 2 × N³ (N³ multiplies, N³ adds)
• How many memory loads/stores?
  − M = 3 × N² (for A, B, C)
• Many more floating-point operations than memory accesses
  − q = F / M = (2/3) × N
  − Good, since arithmetic is faster than memory access
  − Let's check the code…

But Memory Is Accessed Repeatedly
• Inner loop: q = F / M = 1.6! (1.25 loads and 2 floating-point operations)

Typical Memory Hierarchy

[Figure: on-chip components (control, datapath, register file, instruction cache, data cache), second-level cache (SRAM), third-level cache (SRAM), main memory (DRAM), secondary memory (disk or flash)]

  Speed (cycles):  ½'s    1's    10's   100's–1000   1,000,000's
  Size (bytes):    100's  10K's  M's    G's          T's
  Cost/bit:        highest at the top, lowest at the bottom

• Where are the operands (A, B, C) stored?
• What happens as N increases?
• Idea: arrange that most accesses are to fast cache!

Blocking
• Idea:
  − Rearrange code to use values loaded in cache many times
  − Only "few" accesses to slow main memory (DRAM) per floating-point operation
  − → Throughput limited by FP hardware and cache, not slow DRAM
  − P&H, RISC-V edition, p. 465

Blocking Matrix Multiply (divide and conquer: sub-matrix multiplication)

[Figure: C = A × B computed block by block; each sub-matrix of C accumulates the products of successive sub-matrix pairs from A and B]

Memory Access Blocking
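A scalar sketch of the blocked version, in the style of P&H (BLOCKSIZE chosen so three BLOCKSIZE × BLOCKSIZE sub-matrices fit in cache; n assumed a multiple of BLOCKSIZE; the measured code would combine this with the unrolled AVX inner loop):

#define BLOCKSIZE 32

/* Multiply one pair of BLOCKSIZE x BLOCKSIZE sub-matrices (column-major). */
static void do_block(int n, int si, int sj, int sk,
                     double *A, double *B, double *C) {
    for (int i = si; i < si + BLOCKSIZE; i++)
        for (int j = sj; j < sj + BLOCKSIZE; j++) {
            double cij = C[i + j*n];
            for (int k = sk; k < sk + BLOCKSIZE; k++)
                cij += A[i + k*n] * B[k + j*n];
            C[i + j*n] = cij;
        }
}

/* Each cached block of A and B is reused BLOCKSIZE times before eviction. */
void dgemm_blocked(int n, double *A, double *B, double *C) {
    for (int sj = 0; sj < n; sj += BLOCKSIZE)
        for (int si = 0; si < n; si += BLOCKSIZE)
            for (int sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}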

Performance

  N    Gflop/s
       scalar   avx    unroll   blocking
  32   1.30     4.56   12.95    13.80
  160  1.30     5.47   19.70    21.79
  480  1.32     5.27   14.50    20.17
  960  0.91     3.64    6.91    15.82

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy: blocking
• And in conclusion, …

And in Conclusion, …
• Approaches to parallelism
  − SISD, SIMD, MIMD (next lecture)
• SIMD
  − One instruction operates on multiple operands simultaneously
• Example: matrix multiplication
  − Floating-point heavy → exploit Moore's law to make it fast
