
Page 1: Lecture: Manycore GPU Architectures and Programming, Part 4
-- Introducing OpenACC for Accelerators

CSCE 569 Parallel Computing
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
https://passlab.github.io/CSCE569/

Page 2: Manycore GPU Architectures and Programming: Outline

• Introduction
  – GPU architectures, GPGPUs, and CUDA
• GPU Execution model
• CUDA Programming model
• Working with Memory in CUDA
  – Global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsic and library
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming model
  – OpenMP and OpenACC

Page 3: OpenACC

• OpenACC's guiding principle is simplicity
  – Want to remove as much burden from the programmer as possible
  – No need to think about data movement, writing kernels, parallelism, etc.
  – OpenACC compilers automatically handle all of that
• In reality, it isn't always that simple
  – Don't expect to get massive speedups from very little work
• However, OpenACC can be an easy and straightforward programming model to start with
  – http://www.openacc-standard.org/

Page 4: OpenACC

• OpenACC shares a lot of principles with OpenMP
  – Compiler #pragma based, and requires a compiler that supports OpenACC
  – Express the type of parallelism, let the compiler and runtime handle the rest
  – OpenACC also allows you to express data movement using compiler #pragmas

#pragma acc

Page 5: OpenACC Directives

Program myscience
  ... serial code ...
!$acc kernels
  do k = 1, n1
    do i = 1, n2
      ... parallel code ...
    enddo
  enddo
!$acc end kernels
  ...
End Program myscience

Simple compiler hints: the OpenACC compiler parallelizes the hinted region for the GPU while the rest of the program runs on the CPU. Works on many-core GPUs & multicore CPUs.

Page 6: OpenACC

• Creating parallelism in OpenACC is possible with either of the following two compute directives:

#pragma acc kernels
#pragma acc parallel

• kernels and parallel each have their own strengths
  – kernels is a higher abstraction with more automation
  – parallel offers more low-level control but also requires more work from the programmer

Page 7: OpenACC Compute Directives

• The kernels directive marks a code region that the programmer wants to execute on an accelerator
  – The code region is analyzed for parallelizable loops by the compiler
  – Necessary data movement is also automatically generated

#pragma acc kernels
{
    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];

    for (i = 0; i < N; i++)
        D[i] = C[i] * A[i];
}

Page 8: OpenACC Compute Directives

• Like OpenMP, OpenACC compiler directives support clauses which can be used to modify the behavior of OpenACC #pragmas

#pragma acc kernels clause1 clause2 ...

• kernels supports a number of clauses, for example:
  – if(cond): Only run the parallel region on an accelerator if cond is true
  – async(id): Don't wait for the parallel code region to complete on the accelerator before returning to the host application. Instead, id can be used to check for completion.
  – wait(id): wait for the async work associated with id to finish first
  – ...
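A hedged sketch of how these clauses can combine (the arrays, the computation, and the use_gpu flag are illustrative, not from the slides):

void vec_ops(int N, float *A, float *B, float *C, float *D, int use_gpu)
{
    int i;
    /* Offload only when use_gpu is true; return to the host immediately */
    #pragma acc kernels if(use_gpu) async(1)
    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];

    /* Start this region only after async task 1 completes, again without blocking */
    #pragma acc kernels if(use_gpu) wait(1) async(2)
    for (i = 0; i < N; i++)
        D[i] = C[i] * A[i];

    /* Block the host until async task 2 has finished */
    #pragma acc wait(2)
}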

Page 9: OpenACC Compute Directives

• Take a look at the simple-kernels.c example
  – Compile with an OpenACC compiler, e.g. PGI:
    $ pgcc -acc simple-kernels.c -o simple-kernels
  – You may be able to add compiler-specific flags to print more diagnostic information on the accelerator code generation, e.g.:
    $ pgcc -acc simple-kernels.c -o simple-kernels -Minfo=accel

We do not have this compiler on our systems

Page 10: OpenACC Compute Directives

• On the other hand, the parallel compute directive offers much more control over exactly how a parallel code region is executed
  – With just kernels, we have little control over which loops are parallelized or how they are parallelized
  – Think of #pragma acc parallel similarly to #pragma omp parallel

#pragma acc parallel
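To make the analogy concrete, here is a minimal sketch (the bodies are placeholders); in both models the block is executed redundantly by every thread or gang created for the region:

#pragma omp parallel
{
    /* executed by every OpenMP thread in the team */
}

#pragma acc parallel
{
    /* executed (gang-redundantly) by every OpenACC gang */
}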

Page 11: OpenACC Compute Directives

• With parallel, all parallelism is created at the start of the parallel region and does not change until the end
  – The execution mode of a parallel region changes depending on programmer-inserted #pragmas

• parallel supports similar clauses to kernels, plus:
  – num_gangs(g), num_workers(w), vector_length(v): Used to configure the amount of parallelism in a parallel region
  – reduction(op:var1, var2, ...): Perform a reduction across gangs of the provided variables using the specified operation
  – ...
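A sketch of the syntax (the sizes are arbitrary, and the loop directive used inside is introduced a few slides later):

void vec_add(int N, const float *A, const float *B, float *C)
{
    /* Ask for 32 gangs, 4 workers per gang, and a vector length of 128
       (values chosen purely to illustrate the clauses) */
    #pragma acc parallel num_gangs(32) num_workers(4) vector_length(128)
    {
        #pragma acc loop gang vector
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }
}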

Page 12: OpenACC

• Mapping from the abstract GPU execution model to OpenACC concepts and terminology
  – OpenACC vector element = a thread
    • The use of "vector" in OpenACC terminology emphasizes that at the lowest level, OpenACC uses vector parallelism
  – OpenACC worker = SIMT group
    • Each worker has a vector width and can contain many vector elements
  – OpenACC gang = SIMT groups on the same SM
    • One gang per OpenACC PU
    • OpenACC supports multiple gangs executing concurrently

Page 13: OpenACC

• Mapping to the CUDA threading model:
  – Gang parallelism: work is run across multiple OpenACC PUs
    • CUDA blocks
  – Worker parallelism: work is run across multiple workers (i.e. SIMT groups)
    • Threads per block
  – Vector parallelism: work is run across vector elements (i.e. threads)
    • Within a warp

Page 14: OpenACC Compute Directives

• In addition to kernels and parallel, a third OpenACC compute directive can help control parallelism (but does not actually create threads):

#pragma acc loop

• The loop directive allows you to explicitly mark loops as parallel and control the type of parallelism used to execute them

Page 15: OpenACC Compute Directives

• Using #pragma acc loop gang/worker/vector allows you to explicitly mark loops that should use gang, worker, or vector parallelism in your OpenACC application
  – Can be used inside both parallel and kernels regions

• Using #pragma acc loop independent allows you to explicitly mark loops as parallelizable, overriding any automatic compiler analysis
  – Compilers must naturally be conservative when auto-parallelizing; the independent clause allows you to use detailed knowledge of the application to give hints to the compiler, as sketched below
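A minimal sketch of the independent clause (the permutation array idx is hypothetical; the point is that the compiler cannot prove on its own that the writes never collide):

void scatter(int N, const float *A, float *B, const int *idx)
{
    /* idx holds a permutation of 0..N-1, so every iteration writes a distinct
       B element; without "independent" the compiler would have to assume a
       possible dependence and might refuse to parallelize the loop. */
    #pragma acc kernels
    #pragma acc loop independent
    for (int i = 0; i < N; i++)
        B[idx[i]] = 2.0f * A[i];
}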

Page 16: OpenACC Compute Directives

• Consider simple-parallel.c, in which the loop and parallel directives are used to implement the same computation as simple-kernels.c

#pragma acc parallel
{
    #pragma acc loop
    for (i = 0; i < N; i++)
        ...
    #pragma acc loop
    for (i = 0; i < N; i++)
        ...
}

Page 17: OpenACC Compute Directives

• As a syntactic nicety, you can combine parallel/kernels directives with loop directives:

#pragma acc kernels loop
for (i = 0; i < N; i++) {
    ...
}

#pragma acc parallel loop
for (i = 0; i < N; i++) {
    ...
}

Page 18: OpenACC Compute Directives

• This combination has the same effect as a loop directive immediately following a parallel/kernels directive:

#pragma acc kernels
#pragma acc loop
for (i = 0; i < N; i++) { ... }

#pragma acc parallel
#pragma acc loop
for (i = 0; i < N; i++) { ... }

Page 19: OpenACC Compute Directives

• In summary, the kernels, parallel, and loop directives all offer different ways to control the OpenACC parallelism of an application
  – kernels is highly automated, but you rely heavily on the compiler to create an efficient parallelization strategy
    • A short form of parallel/loop for GPU
  – parallel is more manual, but allows programmer knowledge about the application to improve the parallelization strategy
    • Like OpenMP parallel
  – loop allows you to take more manual control over both
    • Like OpenMP worksharing

Page 20: Suggested Readings

1. The sections on Using OpenACC and Using OpenACC Compute Directives in Chapter 8 of Professional CUDA C Programming
2. OpenACC Standard. 2013. http://www.openacc.org/sites/default/files/OpenACC.2.0a_1.pdf
3. Jeff Larkin. Introduction to Accelerated Computing Using Compiler Directives. 2014. http://on-demand.gputechconf.com/gtc/2014/presentations/S4167-intro-accelerated-computing-directives.pdf
4. Michael Wolfe. Performance Analysis and Optimization with OpenACC. 2014. http://on-demand.gputechconf.com/gtc/2014/presentations/S4472-performance-analysis-optimization-openacc-apps.pdf

Page 21: OpenACC Data Directives

• #pragma acc data can be used to explicitly perform communication between a host program and accelerators

• The data directive is applied to a code region and defines the communication to be performed at the start and end of that code region

• The data directive alone does nothing, but it takes clauses which define the actual transfers to be performed

Page 22: OpenACC Data Directives

• Common clauses used with #pragma acc data:

  Clause                  Description
  copy(list)              Transfer all variables in list to the accelerator at the start of the data region and back to the host at the end.
  copyin(list)            Transfer all variables in list to the accelerator at the start of the data region.
  copyout(list)           Transfer all variables in list back to the host at the end of the data region.
  present_or_copy(list)   If the variables specified in list are not already on the accelerator, transfer them to it at the start of the data region and back at the end.
  if(cond)                Only perform the operations defined by this data directive if cond is true.
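A sketch combining several of these clauses (the offload flag and the assumption that B may already be resident are illustrative, not from the slides):

void add_region(int N, const float *A, float *B, float *C, int offload)
{
    /* Copy A in, copy C out, reuse B if an enclosing region already placed it
       on the accelerator, and skip all transfers when offload is false. */
    #pragma acc data copyin(A[0:N]) copyout(C[0:N]) present_or_copy(B[0:N]) if(offload)
    {
        /* the matching if keeps the compute on the host when offload is false */
        #pragma acc kernels if(offload)
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }
}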

Page 23: OpenACC Data Directives

• Consider the example in simple-data.c, which mirrors simple-parallel.c and simple-kernels.c:

#pragma acc data copyin(A[0:N], B[0:N]) copyout(C[0:N], D[0:N])
{
    #pragma acc parallel
    {
        #pragma acc loop
        for (i = 0; i < N; i++)
            ...
        #pragma acc loop
        for (i = 0; i < N; i++)
            ...
    }
}

Page 24: OpenACC Data Directives

• OpenACC also supports:
  #pragma acc enter data
  #pragma acc exit data

• Rather than bracketing a code region, these #pragmas allow you to copy data to and from the accelerator at arbitrary points in time
  – Data transferred to an accelerator with enter data will remain there until a matching exit data is reached or until the application terminates
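A sketch of unstructured data lifetime management with hypothetical init/fini helpers (not from the slides):

void device_init(float *A, int N)
{
    /* Allocate A on the accelerator and copy its current contents over;
       it stays resident after this function returns. */
    #pragma acc enter data copyin(A[0:N])
}

void device_fini(float *A, int N)
{
    /* Copy the (possibly updated) device copy back and release it. */
    #pragma acc exit data copyout(A[0:N])
}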

Page 25: OpenACC Data Directives

• Finally, OpenACC also allows you to specify data movement as part of the compute directives through data clauses

#pragma acc data copyin(A[0:N], B[0:N]) copyout(C[0:N], D[0:N])
{
    #pragma acc parallel
    {
    }
}

#pragma acc parallel copyin(A[0:N], B[0:N]) copyout(C[0:N], D[0:N])

Page 26: OpenACC Data Specification

• You may have noticed that OpenACC data directives use an unusual array dimension specification, for example:

#pragma acc data copy(A[start:length])

• In some cases, data specifications may not even be necessary as the OpenACC compiler can infer the size of the array:

int a[5];
#pragma acc data copy(a)
{
    ...
}

Page 27: OpenACC Data Specification

• If the compiler is unable to infer an array size, error messages like the one below will be emitted
  – Example code:

int *a = (int *)malloc(sizeof(int) * 5);
#pragma acc data copy(a)
{
    ...
}

  – Example error message:
    PGCC-S-0155-Cannot determine bounds for array a

Page 28: OpenACC Data Specification

• Instead, you must specify the full array bounds to be transferred

int *a = (int *)malloc(sizeof(int) * 5);
#pragma acc data copy(a[0:5])
{
    ...
}

  – The lower bound is inclusive and, if not explicitly set, will default to 0
  – The length must be provided if it cannot be inferred

Page 29: Asynchronous Work in OpenACC

• In OpenACC, the default behavior is always to block the host while executing an acc region
  – Host execution does not continue past a kernels/parallel region until all operations within it complete
  – Host execution does not enter or exit a data region until all prescribed data transfers have completed

Page 30: Asynchronous Work in OpenACC

• When the host blocks, host cycles are wasted:

[Timeline figure: a single-threaded host launches #pragma acc { ... } on an accelerator with many PUs and sits idle (wasted cycles) until the region completes]

Page 31: Asynchronous Work in OpenACC

• In many cases this default can be overridden to perform operations asynchronously
  – Asynchronously copy data to the accelerator
  – Asynchronously execute computation

• As a result, host cycles are not wasted idling while the accelerator is working

Page 32: Asynchronous Work in OpenACC

• Asynchronous work is created using the async clause on compute and data directives, and every asynchronous task has an id
  – Run a kernels region asynchronously:
    #pragma acc kernels async(id)
  – Run a parallel region asynchronously:
    #pragma acc parallel async(id)
  – Perform an enter data asynchronously:
    #pragma acc enter data async(id)
  – Perform an exit data asynchronously:
    #pragma acc exit data async(id)
  – async is not supported on the data directive

Page 33: Asynchronous Work in OpenACC

• Having asynchronous work means we also need a way to wait for it
  – Note that every async clause on the previous slide took an id
  – The asynchronous task created is uniquely identified by that id

• We can then wait on that id using either:
  – The wait clause on compute or data directives
  – The OpenACC Runtime API's asynchronous control functions

Page 34: Asynchronous Work in OpenACC

• Adding a wait(id) clause to a compute or data directive makes the associated data transfer or computation wait until the asynchronous task associated with that id completes

• The OpenACC Runtime API supports explicitly waiting using:
    void acc_wait(int id);
    void acc_wait_all();

• You can also check if asynchronous tasks have completed using:
    int acc_async_test(int id);
    int acc_async_test_all();
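A sketch of polling with the test functions while the host keeps busy (do_other_host_work is a hypothetical placeholder):

#include <openacc.h>

extern void do_other_host_work(void);   /* placeholder host-side work */

void overlap(int N, const float *A, float *B)
{
    #pragma acc kernels async(1)
    for (int i = 0; i < N; i++)
        B[i] = A[i] * A[i];

    while (!acc_async_test(1))           /* device task 1 not finished yet */
        do_other_host_work();

    acc_wait(1);                          /* guarantee completion before using B */
}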

Page 35: Asynchronous Work in OpenACC

• Let's take a simple code snippet as an example:

#pragma acc data copyin(A[0:N]) copyout(B[0:N])
{
    #pragma acc kernels
    {
        for (i = 0; i < N; i++)
            B[i] = foo(A[i]);
    }
}                      // Host is blocked
do_work_on_host(C);    // Host is working

Page 36: Asynchronous Work in OpenACC

[Timeline figure: the single-threaded host idles through the copyin, the acc kernels execution on the accelerator's many PUs, and the copyout, and only then runs do_work_on_host]

Page 37: Asynchronous Work in OpenACC

• Performing the transfer and compute asynchronously allows us to overlap the host and accelerator work:

#pragma acc enter data async(0) copyin(A[0:N]) create(B[0:N])
#pragma acc kernels wait(0) async(1)
{
    for (i = 0; i < N; i++)
        B[i] = foo(A[i]);
}
#pragma acc exit data wait(1) async(2) copyout(B[0:N])
do_work_on_host(C);
acc_wait(2);

Page 38: Asynchronous Work in OpenACC

[Timeline figure: with async, do_work_on_host on the single-threaded host overlaps with the acc kernels region running on the accelerator's many PUs]

Page 39: Reductions in OpenACC

• OpenACC supports the ability to perform automatic parallel reductions
  – The reduction clause can be added to the parallel and loop directives, but has a subtle difference in meaning on each

#pragma acc parallel reduction(op:var1, var2, ...)
#pragma acc loop reduction(op:var1, var2, ...)

  – op defines the reduction operation to perform
  – The variable list defines a set of private variables created and initialized in the subsequent compute region

Page 40: Reductions in OpenACC

• When applied to a parallel region, reduction creates a private copy of each variable for each gang created for that parallel region

• When applied to a loop directive, reduction creates a private copy of each variable for each vector element in the loop region

• The resulting value is transferred back to the host once the current compute region completes
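As a concrete sketch (a dot product; the names are illustrative, not from the slides):

float dot_product(int N, const float *A, const float *B)
{
    float dot = 0.0f;
    /* Each gang/vector element accumulates into a private copy of dot;
       the partial sums are combined with + and the result is written back
       to the host copy of dot when the region completes. */
    #pragma acc parallel loop reduction(+:dot) copyin(A[0:N], B[0:N])
    for (int i = 0; i < N; i++)
        dot += A[i] * B[i];
    return dot;
}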

Page 41: OpenACC Parallel Region Optimizations

• To some extent, optimizing the parallel code regions in OpenACC is contradictory to the whole OpenACC principle
  – OpenACC wants programmers to focus on writing application logic and worry less about nitty-gritty optimization tricks
  – Often, low-level code optimizations require intimate understanding of the hardware you are running on

• In OpenACC, optimizing is more about avoiding symptomatically horrible scenarios so that the compiler has the best code to work with, rather than making very low-level optimizations
  – Memory access patterns
  – Loop scheduling

Page 42: OpenACC Parallel Region Optimizations

• GPUs are optimized for aligned, coalesced memory accesses
  – Aligned: the lowest address accessed by the elements in a vector should be 32- or 128-byte aligned (depending on architecture)
  – Coalesced: neighboring vector elements access neighboring memory cells

Page 43: OpenACC Parallel Region Optimizations

• Improving alignment in OpenACC is difficult because there is less visibility into how OpenACC threads are scheduled on the GPU

• Improving coalescing is also difficult; the OpenACC compiler may choose a number of different ways to schedule a loop across threads on the GPU

• In general, try to ensure that neighboring iterations of the innermost parallel loops reference neighboring memory cells, as in the sketch below
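A sketch of the difference for row-major N x N matrices (assuming the common scheduling in which the innermost loop maps to vector lanes; the arrays are illustrative):

/* Coalesced: consecutive j iterations (vector lanes) touch consecutive cells. */
#pragma acc parallel loop gang
for (int i = 0; i < N; i++) {
    #pragma acc loop vector
    for (int j = 0; j < N; j++)
        C[i * N + j] = A[i * N + j] + B[i * N + j];
}

/* Non-coalesced: consecutive vector lanes are strided N elements apart. */
#pragma acc parallel loop gang
for (int i = 0; i < N; i++) {
    #pragma acc loop vector
    for (int j = 0; j < N; j++)
        C[j * N + i] = A[j * N + i] + B[j * N + i];
}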

Page 44: OpenACC Parallel Region Optimizations

• Vecadd example using coalescing and noncoalescing access

  CLI Flag                     Average Compute Time
  Without -b (coalescing)      122.02 us
  With -b (noncoalescing)      624.04 ms

Page 45: OpenACC Parallel Region Optimizations

• The loop directive supports three special clauses that control how loops are parallelized: gang, worker, and vector
  – The meaning of these clauses changes depending on whether they are used in a parallel or kernels region

• The gang clause:
  – In a parallel region, causes the iterations of the loop to be parallelized across gangs created by the parallel region, transitioning from gang-redundant to gang-partitioned mode
  – In a kernels region, does the same but also allows the user to specify the number of gangs to use, using gang(ngangs)

Page 46: OpenACC Parallel Region Optimizations

• The worker clause:
  – In a parallel region, causes the iterations of the loop to be parallelized across workers created by the parallel region, transitioning from worker-single to worker-partitioned mode
  – In a kernels region, does the same but also allows the user to specify the number of workers per gang, using worker(nworkers)

Page 47: OpenACC Parallel Region Optimizations

• The vector clause:
  – In a parallel region, causes the iterations of the loop to be parallelized using vector/SIMD parallelism with the vector length specified by parallel, transitioning from vector-single to vector-partitioned mode
  – In a kernels region, does the same but also allows the user to specify the vector length to use, using vector(vector_length)
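Putting the three levels together in a kernels region might look like this sketch (the sizes are arbitrary and would normally be tuned; the matrices are illustrative):

#pragma acc kernels
#pragma acc loop gang(64)                  /* spread rows across 64 gangs */
for (int i = 0; i < M; i++) {
    #pragma acc loop worker(4) vector(32)  /* 4 workers x 32 vector lanes per gang */
    for (int j = 0; j < N; j++)
        C[i * N + j] = A[i * N + j] + B[i * N + j];
}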

Page 48: OpenACC Parallel Region Optimizations

• Manipulating the gang, worker, and vector clauses results in different scheduling of loop iterations on the underlying hardware
  – Can result in significant performance improvement or loss

• Consider the example of loop schedule
  – The gang and vector clauses are used to change the parallelization of two nested loops in a parallel region
  – The # of gangs is set with the command-line flag -g, vector width is set with -v

Page 49: OpenACC Parallel Region Optimizations

• Try playing with -g and -v to see how gang and vector affect performance
  – Options for gang and vector sizes

#pragma acc parallel copyin(A[0:M * N], B[0:M * N]) copyout(C[0:M * N])
#pragma acc loop gang(gangs)
for (int i = 0; i < M; i++) {
    #pragma acc loop vector(vector_length)
    for (int j = 0; j < N; j++) {
        ...
    }
}

Page 50: OpenACC Parallel Region Optimizations

Example results:

  Varying -g (-v constant):
    -g    -v     Time
    1     128    5.7590 ms
    2     128    2.8855 ms
    4     128    1.4478 ms
    8     128    730.11 us
    16    128    373.40 us
    32    128    202.89 us
    64    128    129.85 us

  Varying -v (-g constant):
    -g    -v     Time
    32    2      9.3165 ms
    32    8      2.7953 ms
    32    32     716.45 us
    32    128    203.02 us
    32    256    129.76 us
    32    512    125.16 us
    32    1024   124.83 us

Page 51: OpenACC Parallel Region Optimizations

• Your options for optimizing OpenACC parallel regions are fairly limited
  – The whole idea of OpenACC is that the compiler can handle that for you

• There are some things you can do to avoid poor code characteristics on the GPU that the compiler can't optimize you out of (memory access patterns)

• There are also tunables you can tweak which may improve performance (e.g. gang, worker, vector)

Page 52: The Tile Clause

• Like the gang, worker, and vector clauses, the tile clause is used to control the scheduling of loop iterations
  – Used on loop directives only

• It specifies how you would like loop iterations grouped across the iteration space
  – Iteration grouping (more commonly called loop tiling) can be beneficial for locality on both CPUs and GPUs

Page 53: The Tile Clause

• Suppose you have a loop like the following:

#pragma acc loop
for (int i = 0; i < N; i++) {
    ...
}

• The tile clause can be added like this:

#pragma acc loop tile(8)
for (int i = 0; i < N; i++) {
    ...
}

Page 54: The Tile Clause

• Analogous to adding a second inner loop:

#pragma acc loop
for (int i = 0; i < N; i += 8) {
    for (int ii = 0; ii < 8; ii++) {
        ...
    }
}

  – The same iterations are performed, but the compiler may choose to schedule them differently on hardware threads
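For nested loops, tile takes one size per loop level; a sketch of a 2D tiling (the 32x32 tile size and the transpose example are illustrative, not from the slides):

/* Group the i/j iteration space into 32x32 tiles for better locality,
   here for a simple transpose of row-major N x N matrices. */
#pragma acc parallel loop tile(32, 32)
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        B[i * N + j] = A[j * N + i];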

Page 55: The Cache Directive

• The cache directive is used to optimize memory accesses on the accelerator. It marks data which will be frequently accessed, and which therefore should be kept close in the cache hierarchy

• The cache directive is applied immediately inside of a loop that is being parallelized on the accelerator:
  – Note the same data specification is used here as for data directives

#pragma acc loop
for (int i = 0; i < N; i++) {
    #pragma acc cache(A[i:1])
    ...
}

Page 56: The Cache Directive

• For example, suppose you have an application where every thread i accesses cells i-1, i, and i+1 in a vector A

[Figure: a row of threads, each reading three neighboring cells of the vector A = 3 4 -1 11 7 5 2 22 5 3 6 9]

Page 57: The Cache Directive

• This results in lots of wasted memory accesses, as neighboring vector elements reference the same cells in the array A

• Instead, we can use the cache directive to indicate to the compiler which array elements we expect to benefit from caching:

#pragma acc parallel loop
for (int i = 0; i < N; i++) {
    B[i] = A[i-1] + A[i] + A[i+1];
}

#pragma acc parallel loop
for (int i = 0; i < N; i++) {
    #pragma acc cache(A[i-1:3])
    B[i] = A[i-1] + A[i] + A[i+1];
}

Page 58: The Cache Directive

• Now, the compiler will automatically cache A[i-1], A[i], and A[i+1] and only load them from accelerator memory once

[Figure: the threads now read their three neighboring cells of A from an on-chip cached copy of the vector rather than from accelerator memory]

Page 59: The Cache Directive

• The cache directive requires a lot of complex code analysis from the compiler to ensure this is a safe optimization

• As a result, it is not always possible to use the cache optimization with arbitrary application code
  – Some restructuring may be necessary before the compiler is able to determine how to effectively use the cache optimization

Page 60: The Cache Directive

• The cache directive can result in significant performance gains thanks to much improved data locality

• However, for complex applications it generally requires significant code refactoring to expose the cache-ability of the code to the compiler
  – Just like using shared memory in CUDA

Page 61: Suggested Readings

1. OpenACC Standard. 2013. http://www.openacc.org/sites/default/files/OpenACC.2.0a_1.pdf
2. Peter Messmer. Optimizing OpenACC Codes. http://on-demand.gputechconf.com/gtc/2013/presentations/S3019-Optimizing-OpenACC-Codes.pdf