04/19/23, WSU High Performance Computing Center (HiPeCC)
Auto-Parallelizing Option
John Matrow, M.S.
System Administrator/Trainer
Outline
Compiler
* Options
* Output
Incomplete Optimization
* Does not detect a loop is safe to parallelize
* Parallelizes the wrong loop
* Unnecessarily parallelizes a loop
Strategies for Assisting APO
Auto-Parallelizing Option (APO)
The MIPSpro Auto-Parallelizing Option (APO) from SGI is used to automatically detect and exploit parallelism in Fortran 77, Fortran 90, C and C++ programs.
SGI MIPSpro compilers
* APO
* IPA (interprocedural analysis)
* LNO (loop nest optimization)
Syntax
f77/cc: -apo[{list|keep}] [-mplist] [-On]
f90/CC: -apo[{list|keep}] [-On]
Syntax
-apo list: produce a .l file, a listing of those parts of the program that can run in parallel and those that cannot
-apo keep: produce .l, .w2c.c, .m and .anl files. Do not use with -mplist
-mplist: generate the equivalent parallelized program in a .w2f.f file (f77) or a .w2c.c file (cc)
-On: optimization level; 3 = aggressive (recommended)
Link
If you link separately, you must have one of the following on the command line:
* The -apo flag
* The -mp option
Interprocedural Analysis (IPA)
* Procedure inlining
* Identification of global constants
* Dead function elimination
* Dead variable elimination
* Dead call elimination
* Interprocedural alias analysis
* Interprocedural constant propagation
Loop Nest Optimization (LNO)
* Loop interchange
* Loop fusion
* Loop fission
* Cache blocking and outer loop unrolling
LNO runs when you use the -O3 option
Sample source
      SUBROUTINE sub(arr, n)
      REAL*8 arr(n)
      DO i = 1, n
         arr(i) = arr(i) + arr(i-1)
      END DO
      DO i = 1, n
         arr(i) = arr(i) + 7.0
         CALL foo(a)
      END DO
      DO i = 1, n
         arr(i) = arr(i) + 7.0
      END DO
      END
Sample APO listing
Parallelization log for Subprogram sub_
3: Not Parallel
     Array dependence from arr on line 4 to arr on line 4.
6: Not Parallel
     Call foo on line 8
10: PARALLEL (Auto) _mpdo_sub_1
Sample source listing
C PARALLEL DO will be converted to SUBROUTINE _mpdo_sub_1
C$OMP PARALLEL DO private(i), shared(a)
      DO I = 1, 10000, 1
         a(I) = 0.0
      END DO
Running Your Program
Environment variable used to specify the number of threads: OMP_NUM_THREADS
Example: setenv OMP_NUM_THREADS 4
Running Your Program
Environment variable used to allow a dynamic number of threads to be used (as available): OMP_DYNAMIC
Example: setenv OMP_DYNAMIC FALSE
Default: TRUE
Incomplete Optimization
* Does not detect a loop is safe to parallelize
* Parallelizes the wrong loop
* Unnecessarily parallelizes a loop
Failing to Parallelize Safe Loops
Does NOT parallelize loops containing:
* Data dependencies (+)
* Function calls
* GO TO statements (+)
* Problematic array subscripts
* Conditionally assigned temporary nonlocal variables
* Unanalyzable pointer usage (C/C++)
(+) not discussed here
Function Calls
You can tell APO to ignore dependencies of function calls by using:
Fortran: C*$* ASSERT CONCURRENT CALL
C/C++: #pragma concurrent call
Problematic Array Subscripts
Too complicated:
* Indirect array references: A(IB(I)) = . . .
* Unanalyzable subscripts (allowable elements: literal constants, variables, products, sums, differences)
* Subscripts that rely on hidden knowledge: A(I) = A(I+M)
Conditionally Assigned Temporary Nonlocal Variables
      SUBROUTINE S1(A,B)
      COMMON T
      DO I = 1, N
         IF (B(I)) THEN
            T = . . .
            A(I) = A(I) + T
         END IF
      END DO
      CALL S2()
      END
Unanalyzable Pointer Usage (C/C++)
* Arbitrary pointer dereferences
* Arrays of arrays: use p[n][n] instead of **p
* Loops bounded by pointer comparisons
* Aliased parameter information: use the __restrict type qualifier to say arrays do not overlap
Parallelizing the Wrong Loop
* Inner loops
* Small trip counts
* Poor data locality
Inner Loops
* APO tries to parallelize the outermost loop, after possibly interchanging loops to make a more promising one outermost
* If the outermost loop attempt fails, APO parallelizes an inner loop if possible
* An inner loop usually gets parallelized because the outer loop hit one of the "Failing to Parallelize Safe Loops" cases discussed earlier
* It is probably advantageous to modify the code so the outermost loop is the one parallelized
Small Trip Counts
* Loops with small trip counts generally run faster when they are not parallelized
* Use an assertion: C*$* ASSERT DO PREFER (Fortran) or #pragma prefer (C/C++)
* Or use manual parallelization directives
Poor Data Locality
      DO I = 1, N
         . . . A(I) . . .
      END DO
      DO I = N, 1, -1
         . . . A(I) . . .
      END DO
Poor Data Locality
      DO I = 1, N
         DO J = 1, N
            A(I,J) = B(J,I) + . . .
         END DO
      END DO

      DO I = 1, N
         DO J = 1, N
            B(I,J) = A(J,I) + . . .
         END DO
      END DO
Incurring Unnecessary Parallelization Overhead
* Unknown trip counts
* Nested parallelism
Unknown Trip Counts
* If the trip count is not known (and sometimes even if it is), APO parallelizes the loop conditionally
* It generates code for both a parallel and a sequential version
* APO can avoid running in parallel if the loop turns out to have a small trip count
* The runtime choice also considers the number of processors available, the parallelization overhead, and the amount of work inside the loop
Nested Parallelism
      SUBROUTINE CALLER
      DO I = 1, N
         CALL SUB
      END DO
      END

      SUBROUTINE SUB
      DO I = 1, N
         . . .
      END DO
      END
Strategies for Assisting APO
* Modify code to avoid coding practices that do not analyze well
* Use manual parallelization options [OpenMP]
* Use APO directives to give APO more information about your code
Compiler Directives for Automatic Parallelization
C*$* [NO] CONCURRENTIZE
C*$* ASSERT DO (CONCURRENT|SERIAL)
C*$* ASSERT CONCURRENT CALL
C*$* ASSERT PERMUTATION (array_name)
C*$* ASSERT DO PREFER (CONCURRENT|SERIAL)
Compiler Directives
The following affect compilation even if -apo is not specified:
C*$* ASSERT DO (CONCURRENT)
C*$* ASSERT CONCURRENT CALL
C*$* ASSERT PERMUTATION
-LNO:ignore_pragmas causes APO to ignore all directives, assertions and pragmas
C*$* NO CONCURRENTIZE
* Place inside a subroutine to affect only that subroutine
* Place outside a subroutine to affect all subroutines
* C*$* CONCURRENTIZE inside a subroutine can be used to override a C*$* NO CONCURRENTIZE placed outside of it
C*$* ASSERT DO (CONCURRENT)
* Tells APO to ignore array dependencies
* Applying it to an inner loop may cause the loop to be made outermost by loop interchange
* Does not affect CALL
* Ignored if obvious real dependencies are found
* If multiple loops can be parallelized, it causes APO to prefer the loop immediately following the assertion
C*$* ASSERT DO (SERIAL)
Do not parallelize the loop following the assertion
APO may parallelize another loop in the same nest
The parallelized loop may be either inside or outside the designated sequential loop
C*$* ASSERT CONCURRENT CALL
Applies to the loop that immediately follows it and to all loops nested inside that loop
A subroutine inside the loop cannot read from a location that is written to during another iteration (shared)
A subroutine inside the loop cannot write to a location that is read from or written to during another iteration (shared)
C*$* ASSERT PERMUTATION
* C*$* ASSERT PERMUTATION (array_name) tells APO that array_name is a permutation array: every element of the array has a distinct value
* The array can thus be used for indirect addressing
* Affects every loop in the subroutine, even those appearing ahead of it
C*$* ASSERT DO PREFER
C*$* ASSERT DO PREFER (CONCURRENT) instructs APO to parallelize the following loop if it is safe to do so
With nested loops, if it is not safe, APO uses heuristics to choose among loops that are safe
If applied to inner loop, APO may make it the outer loop
If applied to multiple loops, APO uses heuristics to choose one of the specified loops
C*$* ASSERT DO PREFER
C*$* ASSERT DO PREFER (SERIAL) is essentially the same as C*$* ASSERT DO (SERIAL)
Used in cases with small trip counts
Used in cases with poor data locality
Example 1: AddOpac.f
      do nd=1,ndust
         if( lgDustOn1(nd) ) then
            do i=1,nupper
               dstab(i) = dstab(i) + dstab1(i,nd) * dstab3(nd)
               dstsc(i) = dstsc(i) + dstsc1(i,nd) * dstsc2(nd)
            end do
         endif
      end do

408: Not Parallel
     Array dependence from DSTAB on line 412 to DSTAB on line 412.
     Array dependence from DSTSC on line 413 to DSTSC on line 413.
Example 1: AddOpac.f
C*$* ASSERT DO CONCURRENT before the outer DO resulted in:
      DO ND = 1, 20, 1
         IF(LGDUSTON3(ND)) THEN
C PARALLEL DO will be converted to SUBROUTINE __mpdo_addopac_10
C$OMP PARALLEL DO if(((DBLE(__mp_sug_numthreads_func$()) *((DBLE(
C$& __mp_sug_numthreads_func$()) * 1.23D+02) + 2.6D+03)) .LT.((DBLE(
C$& NUPPER0) * DBLE((__mp_sug_numthreads_func$() + -1))) * 6.0D00))),
C$& private(I6), shared(DSTAB2, DSTABUND0, DSTAB3, DSTSC2, DSTSC3, ND,
C$& NUPPER0)
            DO I6 = 1, NUPPER0, 1
               DSTAB2(I6) = (DSTAB2(I6) +(DSTABUND0(ND) * DSTAB3(I6, ND)))
               DSTSC2(I6) = (DSTSC2(I6) +(DSTABUND0(ND) * DSTSC3(I6, ND)))
            END DO
         ENDIF
      END DO
Example 2: BiDiag.f
135: Not Parallel
     Array dependence from DESTROY on line 166 to DESTROY on line 137.
     Array dependence from DESTROY on line 166 to DESTROY on line 144.
     Array dependence from DESTROY on line 174 to DESTROY on line 166.
     Array dependence from DESTROY on line 166 to DESTROY on line 166.
     Array dependence from DESTROY on line 144 to DESTROY on line 166.
     Array dependence from DESTROY on line 137 to DESTROY on line 166.
     Array dependence from DESTROY on line 166 to DESTROY on line 174.
     <more of same>
Example 2: BiDiag.f
C$OMP PARALLEL DO PRIVATE(ns, nej, nelec, max, ratio)
      do i=IonLow(nelem),IonHigh(nelem)-1
      . . .
C$OMP CRITICAL
      destroy(nelem,max) = destroy(nelem,max) +
     1   PhotoRate(nelem,i,ns,1) * vyield(nelem,i,ns,nej) * ratio
C$OMP END CRITICAL
Example 3: ContRate.f
78: Not Parallel
     Scalar dependence on XMAXSUB.
     Scalar XMAXSUB without unique last value.
     Scalar FREQSUB without unique last value.
     Scalar OPACSUB without unique last value.
Solution: same as previous example
Exercises
Copy ~jmatrow/openmp/apo*.f
Compile and examine the .list file
Each program requires one change:
* apo1.f: Assertion needed
* apo2.f: OpenMP directive needed