
CS 61C: Great Ideas in Computer Architecture

Lecture 18: Parallel Processing – SIMD

John Wawrzynek & Nick Weaver
http://inst.eecs.berkeley.edu/~cs61c

61C Survey

"It would be nice to have a review lecture every once in a while, actually showing us how things fit in the bigger picture."

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Loop unrolling
• Memory access strategy: blocking
• And in conclusion, …

61C Topics So Far
• What we learned:
  1. Binary numbers
  2. C
  3. Pointers
  4. Assembly language
  5. Processor micro-architecture
  6. Pipelining
  7. Caches
  8. Floating point
• What does this buy us?
  − Promise: execution speed
  − Let's check!

Reference Problem
• Matrix multiplication
  − Basic operation in many engineering, data, and image processing tasks
    ▪ E.g., image filtering, noise reduction, …
  − Core operation in neural nets and deep learning
    ▪ Image classification (cats…)
    ▪ Robot cars
    ▪ Machine translation
    ▪ Fingerprint verification
    ▪ Automatic game playing
• dgemm
  − Double-precision floating-point general matrix multiply
  − Standard, well-studied, and widely used routine
  − Part of Linpack/Lapack

2D Matrices

[Figure: an N×N matrix, with row index i and column index j each running from 0 to N−1]

• Square matrix of dimension N×N
• i indexes through rows
• j indexes through columns

Matrix Multiplication

C = A · B,  Cij = Σk (Aik · Bkj)

[Figure: row i of A times column j of B produces element ij of C]

2D Matrix Memory Layout
• a[][] in C uses row-major order
• Fortran uses column-major order
• Our examples use column-major order

For a 4×4 matrix of doubles (8 bytes each):

  a00 a01 a02 a03
  a10 a11 a12 a13
  a20 a21 a22 a23
  a30 a31 a32 a33

  Address     0x0  0x8  0x10 0x18 0x20 0x28 0x30 0x38 …
  Row-major:  a00  a01  a02  a03  a10  a11  a12  a13  (each row contiguous)
  Col-major:  a00  a10  a20  a30  a01  a11  a21  a31  (each column contiguous)

  Row-major:     aij is a[i*N + j]
  Column-major:  aij is a[i + j*N]
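To make the two conventions concrete, here is a small C sketch (ours, not from the slides; the macro names RM and CM are made up for illustration):

#include <stdio.h>

/* Illustrative macros for the two layouts of an N x N matrix. */
#define RM(a, N, i, j) (a)[(i)*(N) + (j)]  /* row-major:    rows contiguous    */
#define CM(a, N, i, j) (a)[(i) + (j)*(N)]  /* column-major: columns contiguous */

int main(void) {
    double a[16];                           /* 4x4 matrix as a flat array */
    for (int k = 0; k < 16; k++) a[k] = k;  /* element value = its linear index */
    /* a12 sits at linear index 1*4+2 = 6 in row-major, 1+2*4 = 9 in column-major. */
    printf("row-major a12 = %g, column-major a12 = %g\n",
           RM(a, 4, 1, 2), CM(a, 4, 1, 2));
    return 0;
}

This prints "row-major a12 = 6, column-major a12 = 9": the same (i, j) names a different memory word under each layout.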

dgemm Reference Code: Python
• Linear addressing, assumes column-major memory layout

  N    Python [Mflop/s]
  32   5.4
  160  5.5
  480  5.4
  960  5.3

• 1 Mflop/s = 1 million floating-point operations (fadd, fmul) per second
• dgemm on N×N matrices takes 2·N³ flops; at N = 960 that is about 1.8·10⁹ flops, so at ~5.3 Mflop/s Python needs roughly 5.5 minutes per multiply

C
• c = a * b
• a, b, c are N×N matrices
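A minimal scalar dgemm in C, consistent with the column-major layout above (a sketch in the style of the P&H running example; the lecture's exact listing may differ):

/* C = C + A*B for n x n column-major matrices of doubles. */
void dgemm(int n, double *A, double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double cij = C[i + j*n];            /* cij = C[i][j] */
            for (int k = 0; k < n; k++)
                cij += A[i + k*n] * B[k + j*n]; /* cij += A[i][k] * B[k][j] */
            C[i + j*n] = cij;                   /* C[i][j] = cij */
        }
}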

Timing Program Execution
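One plausible way to time such a kernel in C (a sketch assuming a POSIX clock_gettime; the slide's method may differ):

#include <stdio.h>
#include <time.h>

double time_dgemm(int n, double *A, double *B, double *C) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);   /* monotonic: immune to clock resets */
    dgemm(n, A, B, C);                     /* the kernel under test */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("N=%d: %.3f s, %.2f Gflop/s\n", n, s, 2.0*n*n*n / s / 1e9);
    return s;
}

Gflop/s then follows directly as 2·N³ flops divided by the measured seconds.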

C versus Python

  N    C [Gflop/s]   Python [Gflop/s]
  32   1.30          0.0054
  160  1.30          0.0055
  480  1.32          0.0054
  960  0.91          0.0053

• C is roughly 240x faster than Python!
• Which other class gives you this kind of power?
• We could stop here… but why? Let's do better!

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy: blocking
• And in conclusion, …

Why Parallel Processing?
• CPU clock rates are no longer increasing
  − Technical & economic challenges
    ▪ Advanced cooling technology too expensive or impractical for most applications
    ▪ Energy costs are prohibitive
• Parallel processing is the only path to higher speed
  − Compare airlines:
    ▪ Maximum air speed limited by economics
    ▪ Use more and larger airplanes to increase throughput
    ▪ (And smaller seats…)

Using Parallelism for Performance
• Two basic approaches to parallelism:
  − Multiprogramming
    ▪ Run multiple independent programs in parallel
    ▪ "Easy"
  − Parallel computing
    ▪ Run one program faster
    ▪ "Hard"
• We'll focus on parallel computing in the next few lectures

New-School Machine Structures (It's a bit more complicated!)
• Parallel requests: assigned to computer, e.g., search "Katz"
• Parallel threads: assigned to core, e.g., lookup, ads
• Parallel instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel data: >1 data item @ one time, e.g., add of 4 pairs of words (today's lecture)
• Hardware descriptions: all gates @ one time
• Programming languages

[Figure: software/hardware layers that harness parallelism and achieve high performance, from warehouse-scale computer and smartphone down through computer (cores, memory/cache, input/output), core (instruction unit(s) and functional unit(s) computing A0+B0, A1+B1, A2+B2, A3+B3), to logic gates]

Single-Instruction/Single-Data Stream (SISD)
• Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines
  − E.g., our RISC-V processor
• This is what we did up to now in 61C

[Figure: one processing unit fed by a single instruction stream and a single data stream]

Single-Instruction/Multiple-Data Stream (SIMD, or "sim-dee")
• SIMD computer processes multiple data streams using a single instruction stream, e.g., Intel SIMD instruction extensions or NVIDIA Graphics Processing Unit (GPU)
• Today's topic.

Multiple-Instruction/Multiple-Data Streams (MIMD, or "mim-dee")
• Multiple autonomous processors simultaneously executing different instructions on different data.
• MIMD architectures include multicore and Warehouse-Scale Computers
• Topic of Lecture 19 and beyond.

[Figure: an instruction pool and a data pool feeding multiple processing units (PUs)]

Multiple-Instruction/Single-Data Stream (MISD)
• Multiple-instruction, single-data stream computer that processes multiple instruction streams with a single data stream.
• Historical significance; this has few applications. Not covered in 61C.

Flynn* Taxonomy, 1966
• SIMD and MIMD are currently the most common parallelism in architectures – usually both in the same system!
• Most common parallel processing programming style: Single Program Multiple Data ("SPMD")
  − Single program that runs on all processors of a MIMD
  − Cross-processor execution coordination using synchronization primitives

* Prof. Michael Flynn, Stanford

Patterson and Hennessy win Turing Award!

NEW YORK, NY, March 21, 2018 – ACM, the Association for Computing Machinery, today named John L. Hennessy, former President of Stanford University, and David A. Patterson, retired Professor of the University of California, Berkeley, recipients of the 2017 ACM A.M. Turing Award for pioneering a systematic, quantitative approach to the design and evaluation of computer architectures with enduring impact on the microprocessor industry. Hennessy and Patterson created a systematic and quantitative approach to designing faster, lower power, and reduced instruction set computer (RISC) microprocessors.

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy: blocking
• And in conclusion, …

SIMD: "Single Instruction Multiple Data"

SIMD (Vector) Mode

SIMD Applications & Implementations
• Applications
  − Scientific computing
    ▪ Matlab, NumPy
  − Graphics and video processing
    ▪ Photoshop, …
  − Big data
    ▪ Deep learning
  − Gaming
  − …
• Implementations
  − x86
  − ARM
  − RISC-V vector extensions

First SIMD Extensions: MIT Lincoln Labs TX-2, 1957

Intel x86 MMX, 1997

AVX also supported by AMD processors
https://chrisadkin.io/2015/06/04/under-the-hood-of-the-batch-engine-simd-with-sql-server-2016-ctp/

Laptop CPU Specs

$ sysctl -a | grep cpu
hw.physicalcpu: 2
hw.logicalcpu: 4
machdep.cpu.brand_string: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT FPU_CSDS

Note the SIMD extensions in the feature list: MMX, SSE…SSE4.2, AVX1.0, AVX2, FMA.

AVX SIMD Registers

SIMD Data Types
(Now also AVX-512 available)

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy: blocking
• And in conclusion, …

Problem
• Today's compilers can generate SIMD code
• But in some cases, you get better results by hand (assembly)
• We will study x86 (not RISC-V, since no RISC-V vector hardware is widely available yet)
  − Over 1000 instructions to learn…
• Can we use the compiler to generate all the non-SIMD instructions?

x86 SIMD "Intrinsics"
https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Each intrinsic maps to an assembly instruction: e.g., one 256-bit packed-double multiply performs 4 parallel multiplies, and the CPU can issue 2 such instructions per clock cycle (CPI = 0.5).

x86 Intrinsics AVX Data Types
• Intrinsics: direct access to assembly from C
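For example (an illustrative sketch; the sample values are ours), the intrinsic _mm256_mul_pd compiles to one vmulpd instruction and performs 4 double-precision multiplies at once:

#include <stdio.h>
#include <x86intrin.h>  /* AVX intrinsics; compile with -mavx */

int main(void) {
    /* __m256d = 4 doubles in one 256-bit YMM register (_mm256_set_pd lists
       the high lane first, so the lanes below read 1,2,3,4 and 5,6,7,8). */
    __m256d a = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);
    __m256d b = _mm256_set_pd(8.0, 7.0, 6.0, 5.0);
    __m256d c = _mm256_mul_pd(a, b);   /* one vmulpd: 4 multiplies in parallel */

    double out[4];
    _mm256_storeu_pd(out, c);          /* unaligned store of the 4 results */
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 5 12 21 32 */
    return 0;
}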

Intrinsics AVX Code Nomenclature

Raw Double-Precision Throughput

  Characteristic                         Value
  CPU                                    i7-5557U
  Clock rate (sustained)                 3.1 GHz
  Instructions per clock (mul_pd)        2
  Parallel multiplies per instruction    4
  Peak double FLOPS                      3.1 GHz × 2 × 4 = 24.8 GFLOPS

Actual performance is lower because of overhead.
https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

Vectorized Matrix Multiplication

Inner loop:
  for i …; i += 4   (4 consecutive elements of a column of C per AVX register)
    for j …

"Vectorized" dgemm
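A sketch of the AVX dgemm, following the P&H running example (assumes n is a multiple of 4 and 32-byte-aligned arrays; the lecture's exact listing may differ):

#include <x86intrin.h>

void dgemm_avx(int n, double *A, double *B, double *C) {
    for (int i = 0; i < n; i += 4)                    /* 4 rows of C at a time */
        for (int j = 0; j < n; j++) {
            __m256d c0 = _mm256_load_pd(C + i + j*n); /* c0 = C[i..i+3][j] */
            for (int k = 0; k < n; k++)
                c0 = _mm256_add_pd(c0,                /* c0 += A[i..i+3][k]*B[k][j] */
                       _mm256_mul_pd(_mm256_load_pd(A + i + k*n),
                                     _mm256_broadcast_sd(B + k + j*n)));
            _mm256_store_pd(C + i + j*n, c0);         /* C[i..i+3][j] = c0 */
        }
}

_mm256_broadcast_sd copies the scalar B[k][j] into all 4 lanes, so each multiply-add updates 4 elements of a column of C at once.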

Performance

  N    Gflop/s
       scalar   avx
  32   1.30     4.56
  160  1.30     5.47
  480  1.32     5.27
  960  0.91     3.64

• 4x faster
• But still << the theoretical 25 GFLOPS!

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Loop unrolling
• Memory access strategy: blocking
• And in conclusion, …

Loop Unrolling
• On high-performance processors, optimizing compilers perform "loop unrolling" to expose more parallelism and improve performance:

  for (i = 0; i < N; i++)
      x[i] = x[i] + s;

• Could become (assuming N is a multiple of 4):

  for (i = 0; i < N; i += 4) {
      x[i]   = x[i]   + s;
      x[i+1] = x[i+1] + s;
      x[i+2] = x[i+2] + s;
      x[i+3] = x[i+3] + s;
  }

Why unroll?
1. Expose data-level parallelism for vector (SIMD) instructions or superscalar multiple instruction issue
2. Mix unrelated operations in the pipeline to help reduce hazards
3. Reduce loop "overhead" (index increment, compare, branch)
4. But it makes code size larger

Amdahl's Law* Applied to dgemm
• Measured dgemm performance
  − Peak            5.5 GFLOPS
  − Large matrices  3.6 GFLOPS
  − Processor       24.8 GFLOPS
• Why are we not getting (close to) 25 GFLOPS?
  − Something else (not the floating-point ALU) is limiting performance!
  − But what? Possible culprits:
    ▪ Cache
    ▪ Hazards
    ▪ Let's look at both!

* Amdahl's Law states that the speedup achievable through parallelism is limited by the sequential (non-parallel) work: if a fraction p of the work is sped up by a factor s, the overall speedup is 1 / ((1 − p) + p/s).

Pipeline Hazards – dgemm
• "add_pd" depends on the result of "mult_pd", which depends on "load_pd"

Loop Unrolling
• The compiler does the unrolling: 4 AVX registers hold independent partial sums
• How do you verify that the generated code is actually unrolled? (Inspect the compiler's assembly output.)
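A sketch of the unrolled version, again following P&H: UNROLL = 4 independent accumulator registers per j iteration, so the add_pd of one chain no longer waits on the mul_pd of the previous one:

#include <x86intrin.h>
#define UNROLL 4

void dgemm_unroll(int n, double *A, double *B, double *C) {
    for (int i = 0; i < n; i += UNROLL * 4)
        for (int j = 0; j < n; j++) {
            __m256d c[UNROLL];                       /* 4 independent partial sums */
            for (int x = 0; x < UNROLL; x++)
                c[x] = _mm256_load_pd(C + i + x*4 + j*n);
            for (int k = 0; k < n; k++) {
                __m256d b = _mm256_broadcast_sd(B + k + j*n);
                for (int x = 0; x < UNROLL; x++)     /* compiler unrolls this loop */
                    c[x] = _mm256_add_pd(c[x],
                             _mm256_mul_pd(_mm256_load_pd(A + i + x*4 + k*n), b));
            }
            for (int x = 0; x < UNROLL; x++)
                _mm256_store_pd(C + i + x*4 + j*n, c[x]);
        }
}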

Performance

  N    Gflop/s
       scalar   avx    unroll
  32   1.30     4.56   12.95
  160  1.30     5.47   19.70
  480  1.32     5.27   14.50
  960  0.91     3.64    6.91

WOW! But why does performance fall off for the largest N?

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy: blocking
• And in conclusion, …

FPU versus Memory Access
• How many floating-point operations does matrix multiply take?
  − F = 2 × N³ (N³ multiplies, N³ adds)
• How many memory loads/stores?
  − M = 3 × N² (for A, B, C)
• Many more floating-point operations than memory accesses
  − q = F / M = (2/3) × N
  − Good, since arithmetic is faster than memory access
  − Let's check the code…

But Memory Is Accessed Repeatedly
• Inner loop: q = F / M = 1.6! (1.25 loads and 2 floating-point operations)

Typical Memory Hierarchy

[Figure: on-chip components (control, datapath, register file, instruction cache, data cache), second-level cache (SRAM), third-level cache (SRAM), main memory (DRAM), secondary memory (disk or flash)]

  Speed (cycles):  ½'s    1's    10's   100's–1000   1,000,000's
  Size (bytes):    100's  10K's  M's    G's          T's
  Cost/bit:        highest at the top, lowest at the bottom

• Where are the operands (A, B, C) stored?
• What happens as N increases?
• Idea: arrange that most accesses are to fast cache!

Blocking
• Idea:
  − Rearrange code to use values loaded in cache many times
  − Only "few" accesses to slow main memory (DRAM) per floating-point operation
  − → Throughput limited by FP hardware and cache, not slow DRAM
  − P&H, RISC-V edition, p. 465

Blocking Matrix Multiply (divide and conquer: sub-matrix multiplication)

[Figure: C = A × B computed block by block; each sub-matrix of C accumulates the products of successive sub-matrix pairs from A and B]

Memory Access Blocking
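A scalar sketch of the blocked version, in the style of P&H (BLOCKSIZE chosen so three BLOCKSIZE × BLOCKSIZE sub-matrices fit in cache; n assumed a multiple of BLOCKSIZE; the measured code would combine this with the unrolled AVX inner loop):

#define BLOCKSIZE 32

/* Multiply one pair of BLOCKSIZE x BLOCKSIZE sub-matrices (column-major). */
static void do_block(int n, int si, int sj, int sk,
                     double *A, double *B, double *C) {
    for (int i = si; i < si + BLOCKSIZE; i++)
        for (int j = sj; j < sj + BLOCKSIZE; j++) {
            double cij = C[i + j*n];
            for (int k = sk; k < sk + BLOCKSIZE; k++)
                cij += A[i + k*n] * B[k + j*n];
            C[i + j*n] = cij;
        }
}

/* Each cached block of A and B is reused BLOCKSIZE times before eviction. */
void dgemm_blocked(int n, double *A, double *B, double *C) {
    for (int sj = 0; sj < n; sj += BLOCKSIZE)
        for (int si = 0; si < n; si += BLOCKSIZE)
            for (int sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}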

Performance

  N    Gflop/s
       scalar   avx    unroll   blocking
  32   1.30     4.56   12.95    13.80
  160  1.30     5.47   19.70    21.79
  480  1.32     5.27   14.50    20.17
  960  0.91     3.64    6.91    15.82

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy: blocking
• And in conclusion, …

And in Conclusion, …
• Approaches to parallelism
  − SISD, SIMD, MIMD (next lecture)
• SIMD
  − One instruction operates on multiple operands simultaneously
• Example: matrix multiplication
  − Floating-point heavy → exploit Moore's law to make it fast
