04/19/23, WSU High Performance Computing Center (HiPeCC)
Auto-Parallelizing Option
John Matrow, M.S.
System Administrator/Trainer
Outline
Compiler
* Options
* Output
Incomplete Optimization
* Does not detect a loop is safe to parallelize
* Parallelizes the wrong loop
* Unnecessarily parallelizes a loop
Strategies for Assisting APO
Auto-Parallelizing Option (APO)
The MIPSpro Auto-Parallelizing Option (APO) from SGI is used to automatically detect and exploit parallelism in Fortran 77, Fortran 90, C and C++ programs.
SGI MIPSpro compilers
* APO
* IPA (interprocedural analysis)
* LNO (loop nest optimization)
Syntax
f77/cc: -apo[{list|keep}] [-mplist] [-On]
f90/CC: -apo[{list|keep}] [-On]
Syntax
-apo list: produce a .l file, a listing of those parts of the program that can run in parallel and those that cannot
-apo keep: produce .l, .w2c.c, .m and .anl files. Do not use with -mplist
-mplist: generate the equivalent parallelized program in a .w2f.f file (f77) or a .w2c.c file (cc)
-On: optimization level; 3 = aggressive (recommended)
Link
If you link separately, you must have one of the following on the command line:
* The -apo flag
* The -mp option
Interprocedural Analysis (IPA)
* Procedure inlining
* Identification of global constants
* Dead function elimination
* Dead variable elimination
* Dead call elimination
* Interprocedural alias analysis
* Interprocedural constant propagation
Loop Nest Optimization (LNO)
* Loop interchange
* Loop fusion
* Loop fission
* Cache blocking and outer loop unrolling
LNO runs when you use the -O3 option
Sample source
      SUBROUTINE sub(arr, n)
      REAL*8 arr(n)
      DO i = 1, n
         arr(i) = arr(i) + arr(i-1)
      END DO
      DO i = 1, n
         arr(i) = arr(i) + 7.0
         CALL foo(a)
      END DO
      DO i = 1, n
         arr(i) = arr(i) + 7.0
      END DO
      END
Sample APO listing
Parallelization log for Subprogram sub_
3: Not Parallel
     Array dependence from arr on line 4 to arr on line 4.
6: Not Parallel
     Call foo on line 8
10: PARALLEL (Auto) _mpdo_sub_1
Sample source listing
C PARALLEL DO will be converted to SUBROUTINE _mpdo_sub_1
C$OMP PARALLEL DO private(i), shared(a)
      DO I = 1, 10000, 1
         a(I) = 0.0
      END DO
Running Your Program
Environment variable used to specify the number of threads: OMP_NUM_THREADS
Example: setenv OMP_NUM_THREADS 4
Running Your Program
Environment variable used to allow a dynamic number of threads to be used (as available): OMP_DYNAMIC
Example: setenv OMP_DYNAMIC FALSE
Default: TRUE
Incomplete Optimization
* Does not detect a loop is safe to parallelize
* Parallelizes the wrong loop
* Unnecessarily parallelizes a loop
Failing to Parallelize Safe Loops
Does NOT parallelize loops containing:
* Data dependencies (+)
* Function calls
* GO TO statements (+)
* Problematic array subscripts
* Conditionally assigned temporary nonlocal variables
* Unanalyzable pointer usage (C/C++)
(+) not discussed here
Function Calls
You can tell APO to ignore dependencies of function calls by using:
Fortran: C*$* ASSERT CONCURRENT CALL
C/C++: #pragma concurrent call
Problematic Array Subscripts
Too complicated:
* Indirect array references: A(IB(I)) = . . .
* Unanalyzable subscripts (allowable elements: literal constants, variables, products, sums, differences)
* Subscripts that rely on hidden knowledge: A(I) = A(I+M)
Conditionally Assigned Temporary Nonlocal Variables
      SUBROUTINE S1(A,B)
      COMMON T
      DO I = 1, N
         IF (B(I)) THEN
            T = . . .
            A(I) = A(I) + T
         END IF
      END DO
      CALL S2()
      END
Unanalyzable Pointer Usage (C/C++)
* Arbitrary pointer dereferences
* Arrays of arrays: use p[n][n] instead of **p
* Loops bounded by pointer comparisons
* Aliased parameter information: use the __restrict type qualifier to say arrays do not overlap
Parallelizing the Wrong Loop
* Inner loops
* Small trip counts
* Poor data locality
Inner Loops
* APO tries to parallelize the outermost loop, after possibly interchanging loops to make a more promising one outermost
* If the outermost loop attempt fails, APO parallelizes an inner loop if possible
* An inner loop usually gets parallelized because the outer loop hit one of the "Failing to Parallelize Safe Loops" cases discussed earlier
* It is probably advantageous to modify the code so the outermost loop is the one parallelized
Small Trip Counts
* Loops with small trip counts generally run faster when they are not parallelized
* Use an assertion: C*$* ASSERT DO PREFER (Fortran) or #pragma prefer (C/C++)
* Or use manual parallelization directives
Poor Data Locality
      DO I = 1, N
         . . . A(I) . . .
      END DO
      DO I = N, 1, -1
         . . . A(I) . . .
      END DO
Poor Data Locality
      DO I = 1, N
         DO J = 1, N
            A(I,J) = B(J,I) + . . .
         END DO
      END DO

      DO I = 1, N
         DO J = 1, N
            B(I,J) = A(J,I) + . . .
         END DO
      END DO
Incurring Unnecessary Parallelization Overhead
* Unknown trip counts
* Nested parallelism
Unknown Trip Counts
* If the trip count is not known (and sometimes even if it is), APO parallelizes the loop conditionally
* It generates code for both a parallel and a sequential version
* APO can avoid running in parallel if the loop turns out to have a small trip count
* The runtime choice also considers the number of processors available, the parallelization overhead, and the amount of work inside the loop
Nested Parallelism
      SUBROUTINE CALLER
      DO I = 1, N
         CALL SUB
      END DO
      END

      SUBROUTINE SUB
      DO I = 1, N
         . . .
      END DO
      END
Strategies for Assisting APO
* Modify code to avoid coding practices that do not analyze well
* Use manual parallelization options [OpenMP]
* Use APO directives to give APO more information about your code
Compiler Directives for Automatic Parallelization
C*$* [NO] CONCURRENTIZE
C*$* ASSERT DO (CONCURRENT|SERIAL)
C*$* ASSERT CONCURRENT CALL
C*$* ASSERT PERMUTATION (array_name)
C*$* ASSERT DO PREFER (CONCURRENT|SERIAL)
Compiler Directives
The following affect compilation even if -apo is not specified:
C*$* ASSERT DO (CONCURRENT)
C*$* ASSERT CONCURRENT CALL
C*$* ASSERT PERMUTATION
-LNO:ignore_pragmas causes APO to ignore all directives, assertions and pragmas
C*$* NO CONCURRENTIZE
* Place inside a subroutine to affect only that subroutine
* Place outside a subroutine to affect all subroutines
* C*$* CONCURRENTIZE inside a subroutine can be used to override a C*$* NO CONCURRENTIZE placed outside of it
C*$* ASSERT DO (CONCURRENT)
* Tells APO to ignore array dependencies
* Applying it to an inner loop may cause the loop to be made outermost by loop interchange
* Does not affect CALL
* Ignored if obvious real dependencies are found
* If multiple loops can be parallelized, it causes APO to prefer the loop immediately following the assertion
C*$* ASSERT DO (SERIAL)
Do not parallelize the loop following the assertion
APO may parallelize another loop in the same nest
The parallelized loop may be either inside or outside the designated sequential loop
C*$* ASSERT CONCURRENT CALL
Applies to the loop that immediately follows it and to all loops nested inside that loop
A subroutine inside the loop cannot read from a location that is written to during another iteration (shared)
A subroutine inside the loop cannot write to a location that is read from or written to during another iteration (shared)
C*$* ASSERT PERMUTATION
* C*$* ASSERT PERMUTATION (array_name) tells APO that array_name is a permutation array: every element of the array has a distinct value
* The array can thus be used for indirect addressing
* Affects every loop in the subroutine, even those appearing ahead of it
C*$* ASSERT DO PREFER
C*$* ASSERT DO PREFER (CONCURRENT) instructs APO to parallelize the following loop if it is safe to do so
With nested loops, if it is not safe, APO uses heuristics to choose among loops that are safe
If applied to inner loop, APO may make it the outer loop
If applied to multiple loops, APO uses heuristics to choose one of the specified loops
C*$* ASSERT DO PREFER
C*$* ASSERT DO PREFER (SERIAL) is essentially the same as C*$* ASSERT DO (SERIAL)
Used in cases with small trip counts
Used in cases with poor data locality
Example 1: AddOpac.f
      do nd=1,ndust
         if( lgDustOn1(nd) ) then
            do i=1,nupper
               dstab(i) = dstab(i) + dstab1(i,nd) * dstab3(nd)
               dstsc(i) = dstsc(i) + dstsc1(i,nd) * dstsc2(nd)
            end do
         endif
      end do

408: Not Parallel
     Array dependence from DSTAB on line 412 to DSTAB on line 412.
     Array dependence from DSTSC on line 413 to DSTSC on line 413.
Example 1: AddOpac.f
C*$* ASSERT DO CONCURRENT before the outer DO resulted in:
      DO ND = 1, 20, 1
         IF(LGDUSTON3(ND)) THEN
C PARALLEL DO will be converted to SUBROUTINE __mpdo_addopac_10
C$OMP PARALLEL DO if(((DBLE(__mp_sug_numthreads_func$()) *((DBLE(
C$& __mp_sug_numthreads_func$()) * 1.23D+02) + 2.6D+03)) .LT.((DBLE(
C$& NUPPER0) * DBLE((__mp_sug_numthreads_func$() + -1))) * 6.0D00))),
C$& private(I6), shared(DSTAB2, DSTABUND0, DSTAB3, DSTSC2, DSTSC3, ND,
C$& NUPPER0)
            DO I6 = 1, NUPPER0, 1
               DSTAB2(I6) = (DSTAB2(I6) +(DSTABUND0(ND) * DSTAB3(I6, ND)))
               DSTSC2(I6) = (DSTSC2(I6) +(DSTABUND0(ND) * DSTSC3(I6, ND)))
            END DO
         ENDIF
      END DO
Example 2: BiDiag.f
135: Not Parallel
     Array dependence from DESTROY on line 166 to DESTROY on line 137.
     Array dependence from DESTROY on line 166 to DESTROY on line 144.
     Array dependence from DESTROY on line 174 to DESTROY on line 166.
     Array dependence from DESTROY on line 166 to DESTROY on line 166.
     Array dependence from DESTROY on line 144 to DESTROY on line 166.
     Array dependence from DESTROY on line 137 to DESTROY on line 166.
     Array dependence from DESTROY on line 166 to DESTROY on line 174.
     <more of same>
Example 2: BiDiag.f
C$OMP PARALLEL DO PRIVATE(ns, nej, nelec, max, ratio)
      do i=IonLow(nelem),IonHigh(nelem)-1
      . . .
C$OMP CRITICAL
      destroy(nelem,max) = destroy(nelem,max) +
     1   PhotoRate(nelem,i,ns,1) * vyield(nelem,i,ns,nej) * ratio
C$OMP END CRITICAL
Example 3: ContRate.f
78: Not Parallel
     Scalar dependence on XMAXSUB.
     Scalar XMAXSUB without unique last value.
     Scalar FREQSUB without unique last value.
     Scalar OPACSUB without unique last value.
Solution: same as previous example
Exercises
Copy ~jmatrow/openmp/apo*.f
Compile and examine the .list file
Each program requires one change:
* apo1.f: Assertion needed
* apo2.f: OpenMP directive needed