PGI Compilers and Tools


Page 1: PGI Compilers and Tools


PGI® Compilers and Tools for Scientists and Engineers

Brent Leback – [email protected]
Dave Norton – [email protected]

www.pgroup.com

September 2006

Page 2: PGI Compilers and Tools

Outline of Today's Topics

• Introduction to PGI Compilers and Tools

• Documentation, Getting Help

• Basic Compiler Options

• Optimization Strategies

• 6.2 Features and Roadmap

• Questions and Answers

Page 3: PGI Compilers and Tools

PGI Compilers and Tools: Features

• Optimization – State-of-the-art vector, parallel, IPA, feedback, …

• Cross-platform – AMD & Intel, 32/64-bit, Linux & Windows

• PGI Unified Binary for AMD and Intel processors

• Tools – Integrated OpenMP/MPI debug & profile, IDE integration

• Parallel – MPI, OpenMP 2.5, auto-parallel for multi-core

• Comprehensive OS support – Red Hat 7.3 – 9.0, RHEL 3.0/4.0, Fedora Core 2/3/4/5, SuSE 7.1 – 10.1, SLES 8/9/10, Windows XP, Windows x64

Page 4: PGI Compilers and Tools

PGI Tools Enable Developers to:

• View x64 as a unified CPU architecture

• Extract peak performance from x64 CPUs

• Ride innovation waves from both Intel and AMD

• Use a single source base and toolset across Linux and Windows

• Develop, debug, and tune parallel applications for multi-core, multi-core SMP, and clustered multi-core SMP

Page 5: PGI Compilers and Tools

PGI Documentation and Support

• PGI-provided documentation

• PGI User Forums, at www.pgroup.com

• PGI FAQs, Tips & Techniques pages

• Email support, via [email protected]

• Web support, a form-based system similar to email support

• Fax support

Page 6: PGI Compilers and Tools

PGI Docs & Support, cont.

• Legacy phone support, direct access, etc.

• PGI download web page

• PGI prepared/personalized training

• PGI ISV program

• PGI Premier Service program

Page 7: PGI Compilers and Tools

PGI Basic Compiler Options

• Basic usage

• Language dialects

• Target architectures

• Debugging aids

• Optimization switches

Page 8: PGI Compilers and Tools

PGI Basic Compiler Usage

• A compiler driver interprets options and invokes pre-processors, compilers, assembler, linker, etc.

• Options precedence: if options conflict, the last option on the command line takes precedence

• Use -Minfo to see a listing of optimizations and transformations performed by the compiler

• Use -help to list all options or see details on how to use a given option, e.g. pgf90 -Mvect -help

• Use man pages for more details on options, e.g. "man pgf90"

• Use -v to see under the hood

Page 9: PGI Compilers and Tools

Flags to support language dialects

• Fortran
  – pgf77, pgf90, pgf95, pghpf tools
  – Suffixes .f, .F, .for, .fpp, .f90, .F90, .f95, .F95, .hpf, .HPF
  – -Mextend, -Mfixed, -Mfreeform
  – Type size: -i2, -i4, -i8, -r4, -r8, etc.
  – -Mcray, -Mbyteswapio, -Mupcase, -Mnomain, -Mrecursive, etc.

• C/C++
  – pgcc, pgCC (aka pgcpp)
  – Suffixes .c, .C, .cc, .cpp, .i
  – -B, -c89, -c9x, -Xa, -Xc, -Xs, -Xt
  – -Msignextend, -Mfcon, -Msingle, -Muchar, -Mgccbugs

Page 10: PGI Compilers and Tools

Specifying the target architecture

• Not an issue on XT3.

• Defaults to the type of processor/OS you are running on

• Use the -tp switch:
  – -tp k8-64, -tp p7-64, or -tp core2-64 for 64-bit code
  – -tp amd64e for AMD Opteron rev E or later
  – -tp x64 for a unified binary
  – -tp k8-32, k7, p7, piv, piii, p6, p5, px for 32-bit code

Page 11: PGI Compilers and Tools

Flags for debugging aids

• -g generates symbolic debug information used by a debugger

• -gopt generates debug information in the presence of optimization

• -Mbounds adds array bounds checking

• -v gives verbose output, useful for debugging system or build problems

• -Mlist will generate a listing

• -Minfo provides feedback on optimizations made by the compiler

• -S or -Mkeepasm to see the exact assembly generated

Page 12: PGI Compilers and Tools

Basic optimization switches

• Traditional optimization is controlled through -O[<n>], where n is 0 to 4

• The -fast switch combines a common set into one simple switch; it is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre
  – For -Munroll, c specifies completely unrolling loops with this loop count or less
  – -Munroll=n:<m> says unroll other loops m times

• -Mnoframe does not set up a stack frame

• -Mlre is loop-carried redundancy elimination
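A minimal sketch of the pattern -Mlre targets (function and variable names are hypothetical, not from the deck): in b[i] = a[i] + a[i+1], the value a[i+1] loaded in one iteration is exactly the a[i] needed by the next, so the compiler can carry it in a register instead of reloading it.

```c
/* Hypothetical example of a loop-carried redundant expression.
   The load of a[i+1] in iteration i is reused as a[i] in iteration
   i+1; -Mlre performs this register-carrying rewrite automatically. */
void smooth(const float *a, float *b, int n)
{
    float next = a[0];          /* value carried across iterations */
    for (int i = 0; i < n - 1; i++) {
        float cur = next;
        next = a[i + 1];        /* loaded once, reused next iteration */
        b[i] = cur + next;
    }
}
```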

Page 13: PGI Compilers and Tools

Basic optimization switches, cont.

• The -fastsse switch is commonly used; it extends -fast to SSE hardware and vectorization

• -fastsse is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre (-fast) plus -Mvect=sse -Mscalarsse -Mcache_align -Mflushz

• -Mcache_align aligns top-level arrays and objects on cache-line boundaries

• -Mflushz flushes SSE denormal numbers to zero

Page 14: PGI Compilers and Tools

Optimization Strategies

• Establish a workload

• Optimization from the top down

• Use of proper tools and methods

• Processor-level optimizations, parallel methods

• Different flags/features for different types of code

Page 15: PGI Compilers and Tools

Node-level tuning

• Vectorization – packed SSE instructions maximize performance

• Interprocedural Analysis (IPA) – use it! motivating examples

• Function Inlining – especially important for C and C++

• Parallelization – for Cray XD1 and multi-core processors

• Miscellaneous Optimizations – hit or miss, but worth a try

Page 16: PGI Compilers and Tools

Vectorizable F90 Array Syntax (data is REAL*4):

350 !
351 ! Initialize vertex, similarity and coordinate arrays
352 !
353       Do Index = 1, NodeCount
354         IX = MOD (Index - 1, NodesX) + 1
355         IY = ((Index - 1) / NodesX) + 1
356         CoordX (IX, IY) = Position (1) + (IX - 1) * StepX
357         CoordY (IX, IY) = Position (2) + (IY - 1) * StepY
358         JetSim (Index) = SUM (Graph (:, :, Index) * &
359      &    GaborTrafo (:, :, CoordX(IX,IY), CoordY(IX,IY)))
360         VertexX (Index) = MOD (Params%Graph%RandomIndex (Index) - 1, NodesX) + 1
361         VertexY (Index) = ((Params%Graph%RandomIndex (Index) - 1) / NodesX) + 1
362       End Do

The inner "loop" at line 358 is vectorizable and can use packed SSE instructions.
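The SUM of an element-wise product at line 358 is, in effect, a sum-reduction loop. A minimal C sketch of the same pattern (names hypothetical), which a vectorizer can map onto packed mulps/addps instructions:

```c
/* Hypothetical sketch: the SUM(...) intrinsic reduces an element-wise
   product.  As an explicit loop it is a classic sum reduction, the
   shape of loop the compiler converts to packed SSE code. */
float dot(const float *graph, const float *gabor, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += graph[i] * gabor[i];
    return sum;
}
```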

Page 17: PGI Compilers and Tools

-fastsse to enable SSE vectorization; -Minfo to list optimizations to stderr:

% pgf95 -fastsse -Mipa=fast -Minfo -S graphRoutines.f90
…
localmove:
    334, Loop unrolled 1 times (completely unrolled)
    343, Loop unrolled 2 times (completely unrolled)
    358, Generated an alternate loop for the inner loop
         Generated vector sse code for inner loop
         Generated 2 prefetch instructions for this loop
         Generated vector sse code for inner loop
         Generated 2 prefetch instructions for this loop
…

Page 18: PGI Compilers and Tools

Facerec Scalar: 104.2 sec
Facerec Vector: 84.3 sec

Scalar SSE:

.LB6_668:                       # lineno: 358
        movss   -12(%rax),%xmm2
        movss   -4(%rax),%xmm3
        subl    $1,%edx
        mulss   -12(%rcx),%xmm2
        addss   %xmm0,%xmm2
        mulss   -4(%rcx),%xmm3
        movss   -8(%rax),%xmm0
        mulss   -8(%rcx),%xmm0
        addss   %xmm0,%xmm2
        movss   (%rax),%xmm0
        addq    $16,%rax
        addss   %xmm3,%xmm2
        mulss   (%rcx),%xmm0
        addq    $16,%rcx
        testl   %edx,%edx
        addss   %xmm0,%xmm2
        movaps  %xmm2,%xmm0
        jg      .LB6_625

Vector SSE:

.LB6_1245:                      # lineno: 358
        movlps  (%rdx,%rcx),%xmm2
        subl    $8,%eax
        movlps  16(%rcx,%rdx),%xmm3
        prefetcht0 64(%rcx,%rsi)
        prefetcht0 64(%rcx,%rdx)
        movhps  8(%rcx,%rdx),%xmm2
        mulps   (%rsi,%rcx),%xmm2
        movhps  24(%rcx,%rdx),%xmm3
        addps   %xmm2,%xmm0
        mulps   16(%rcx,%rsi),%xmm3
        addq    $32,%rcx
        testl   %eax,%eax
        addps   %xmm3,%xmm0
        jg      .LB6_1245

Page 19: PGI Compilers and Tools

Vectorizable C Code Fragment?

217 void func4(float *u1, float *u2, float *u3, …
…
221   for (i = -NE+1, p1 = u2-ny, p2 = u2+ny; i < nx+NE-1; i++)
222     u3[i] += clz * (p1[i] + p2[i]);
223   for (i = -NI+1; i < nx+NE-1; i++) {
224     float vdt = v[i] * dt;
225     u3[i] = 2.*u2[i]-u1[i]+vdt*vdt*u3[i];
226   }

% pgcc -fastsse -Minfo functions.c
func4:
    221, Loop unrolled 4 times
    221, Loop not vectorized due to data dependency
    223, Loop not vectorized due to data dependency

Page 20: PGI Compilers and Tools

Pointer Arguments Inhibit Vectorization

% pgcc -fastsse -Msafeptr -Minfo functions.c
func4:
    221, Generated vector SSE code for inner loop
         Generated 3 prefetch instructions for this loop
    223, Unrolled inner loop 4 times

217 void func4(float *u1, float *u2, float *u3, …
…
221   for (i = -NE+1, p1 = u2-ny, p2 = u2+ny; i < nx+NE-1; i++)
222     u3[i] += clz * (p1[i] + p2[i]);
223   for (i = -NI+1; i < nx+NE-1; i++) {
224     float vdt = v[i] * dt;
225     u3[i] = 2.*u2[i]-u1[i]+vdt*vdt*u3[i];
226   }

Page 21: PGI Compilers and Tools

C Constant Inhibits Vectorization

% pgcc -fastsse -Msafeptr -Mfcon -Minfo functions.c
func4:
    221, Generated vector SSE code for inner loop
         Generated 3 prefetch instructions for this loop
    223, Generated vector SSE code for inner loop
         Generated 4 prefetch instructions for this loop

217 void func4(float *u1, float *u2, float *u3, …
…
221   for (i = -NE+1, p1 = u2-ny, p2 = u2+ny; i < nx+NE-1; i++)
222     u3[i] += clz * (p1[i] + p2[i]);
223   for (i = -NI+1; i < nx+NE-1; i++) {
224     float vdt = v[i] * dt;
225     u3[i] = 2.*u2[i]-u1[i]+vdt*vdt*u3[i];
226   }

Page 22: PGI Compilers and Tools

-Msafeptr Option and Pragma

-M[no]safeptr[=all | arg | auto | dummy | local | static | global]

  all      All pointers are safe
  arg      Argument pointers are safe
  local    Local pointers are safe
  static   Static local pointers are safe
  global   Global pointers are safe

#pragma [scope] [no]safeptr={arg | local | global | static | all},…

where scope is global, routine or loop

Page 23: PGI Compilers and Tools

Common Barriers to SSE Vectorization

• Potential dependencies & C pointers – give the compiler more info with -Msafeptr, pragmas, or the restrict type qualifier

• Function calls – try inlining with -Minline or -Mipa=inline

• Type conversions – manually convert constants or use flags

• Large number of statements – try -Mvect=nosizelimit

• Too few iterations – usually better to unroll the loop

• Real dependencies – must restructure the loop, if possible
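Where source changes are possible, the C99 restrict qualifier gives the compiler a per-pointer no-aliasing guarantee, rather than the whole-file guarantee of -Msafeptr. A minimal sketch (function name hypothetical), modeled on the func4 loop at line 221:

```c
/* Hypothetical sketch: restrict promises the compiler that u3 does
   not alias p1 or p2, so the loop can be vectorized without needing
   the global -Msafeptr flag. */
void add_scaled(float *restrict u3, const float *restrict p1,
                const float *restrict p2, float clz, int n)
{
    for (int i = 0; i < n; i++)
        u3[i] += clz * (p1[i] + p2[i]);
}
```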

Page 24: PGI Compilers and Tools

Barriers to Efficient Execution of Vector SSE Loops

• Not enough work – vectors are too short

• Vectors not aligned to a cache-line boundary

• Non-unity strides

• Code bloat if altcode is generated

Page 25: PGI Compilers and Tools

• Vectorization – packed SSE instructions maximize performance

• Interprocedural Analysis (IPA) – use it! motivating example

• Function Inlining – especially important for C and C++

• Parallelization – for Cray XD1 and multi-core processors

• Miscellaneous Optimizations – hit or miss, but worth a try

Page 26: PGI Compilers and Tools

What can Interprocedural Analysis and Optimization with -Mipa do for You?

• Interprocedural constant propagation

• Pointer disambiguation

• Alignment detection, alignment propagation

• Global variable mod/ref detection

• F90 shape propagation

• Function inlining

• IPA optimization of libraries, including inlining
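A hedged sketch of what interprocedural constant propagation enables (function names hypothetical): when every call site passes the same size, IPA lets the compiler treat the parameter as a compile-time constant inside the callee, fix the trip count, and completely unroll the loop:

```c
/* Hypothetical sketch: dot_n is only ever called with n == 4.  With
   -Mipa=fast the compiler can propagate that constant across the
   call boundary and fully unroll the loop in dot_n. */
static float dot_n(const float *a, const float *b, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

float pair_score(const float *a, const float *b)
{
    return dot_n(a, b, 4);   /* the only call site: n is effectively 4 */
}
```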

Page 27: PGI Compilers and Tools

Effect of IPA on the WUPWISE Benchmark

PGF95 Compiler Options          Execution Time in Seconds
-fastsse                        156.49
-fastsse -Mipa=fast             121.65
-fastsse -Mipa=fast,inline       91.72

-Mipa=fast => constant propagation => the compiler sees the complex matrices are all 4x3 => completely unrolls loops

-Mipa=fast,inline => the small matrix multiplies are all inlined

Page 28: PGI Compilers and Tools

Using Interprocedural Analysis

• Must be used at both compile time and link time

• Non-disruptive to the development process – edit/build/run

• Speed-ups of 5% – 10% are common

• -Mipa=safe:<name> – safe to optimize functions which call or are called from unknown function/library name

• -Mipa=libopt – perform IPA optimizations on libraries

• -Mipa=libinline – perform IPA inlining from libraries

Page 29: PGI Compilers and Tools

• Vectorization – packed SSE instructions maximize performance

• Interprocedural Analysis (IPA) – use it! motivating examples

• Function Inlining – especially important for C and C++

• SMP Parallelization – for Cray XD1 and multi-core processors

• Miscellaneous Optimizations – hit or miss, but worth a try

Page 30: PGI Compilers and Tools

Explicit Function Inlining

-Minline[=[lib:]<inlib> | [name:]<func> | except:<func> | size:<n> | levels:<n>]

  [lib:]<inlib>    Inline extracted functions from inlib
  [name:]<func>    Inline function func
  except:<func>    Do not inline function func
  size:<n>         Inline only functions smaller than n statements (approximate)
  levels:<n>       Inline n levels of functions

For C++ codes, PGI recommends IPA-based inlining or -Minline=levels:10!

Page 31: PGI Compilers and Tools

Other C++ recommendations

• Encapsulation, data hiding – small functions, inline!

• Exception handling – use --no_exceptions until 7.0

• Overloaded operators, overloaded functions – okay

• Pointer chasing – -Msafeptr, restrict qualifier, 32 bits?

• Templates, generic programming – now okay

• Inheritance, polymorphism, virtual functions – runtime lookup or check, no inlining, potential performance penalties

Page 32: PGI Compilers and Tools

• Vectorization – packed SSE instructions maximize performance

• Interprocedural Analysis (IPA) – use it! motivating examples

• Function Inlining – especially important for C and C++

• SMP Parallelization – for Cray XT (CNL) on multi-core processors

• Miscellaneous Optimizations – hit or miss, but worth a try

Page 33: PGI Compilers and Tools

SMP Parallelization

• -Mconcur for auto-parallelization on multi-core
  – The compiler strives for parallel outer loops, vector SSE inner loops
  – -Mconcur=innermost forces a vector/parallel innermost loop
  – -Mconcur=cncall enables parallelization of loops with calls

• -mp to enable the OpenMP 2.5 parallel programming model
  – See the PGI User's Guide or the OpenMP 2.5 standard
  – OpenMP programs compiled w/out -mp "just work"
  – Not supported on Cray XT (Catamount) – would require some custom work

• -Mconcur and -mp can be used together!
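A minimal OpenMP 2.5 sketch of the kind of loop -mp enables (function name hypothetical). Compiled without -mp the directive is simply ignored and the loop runs serially with identical results, which is why such programs "just work":

```c
/* Minimal OpenMP sketch: under -mp the parallel-for directive splits
   the iterations across threads; without -mp the pragma is ignored
   and the loop runs serially, producing the same answer. */
void saxpy(float a, const float *x, float *y, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```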

Page 34: PGI Compilers and Tools

MGRID Benchmark Main Loop

      DO 10 I3 = 2, N-1
      DO 10 I2 = 2, N-1
      DO 10 I1 = 2, N-1
 10   R(I1,I2,I3) = V(I1,I2,I3) &
          -A(0)*(U(I1,I2,I3)) &
          -A(1)*(U(I1-1,I2,I3)+U(I1+1,I2,I3) &
          +U(I1,I2-1,I3)+U(I1,I2+1,I3) &
          +U(I1,I2,I3-1)+U(I1,I2,I3+1)) &
          -A(2)*(U(I1-1,I2-1,I3)+U(I1+1,I2-1,I3) &
          +U(I1-1,I2+1,I3)+U(I1+1,I2+1,I3) &
          +U(I1,I2-1,I3-1)+U(I1,I2+1,I3-1) &
          +U(I1,I2-1,I3+1)+U(I1,I2+1,I3+1) &
          +U(I1-1,I2,I3-1)+U(I1-1,I2,I3+1) &
          +U(I1+1,I2,I3-1)+U(I1+1,I2,I3+1) ) &
          -A(3)*(U(I1-1,I2-1,I3-1)+U(I1+1,I2-1,I3-1) &
          +U(I1-1,I2+1,I3-1)+U(I1+1,I2+1,I3-1) &
          +U(I1-1,I2-1,I3+1)+U(I1+1,I2-1,I3+1) &
          +U(I1-1,I2+1,I3+1)+U(I1+1,I2+1,I3+1))
Page 35: PGI Compilers and Tools

Auto-parallel MGRID overall speed-up is 40% on dual-core AMD Opteron

% pgf95 -fastsse -Mipa=fast,inline -Minfo -Mconcur mgrid.f
resid:
  . . .
  189, Parallel code for non-innermost loop activated if loop count >= 33; block distribution
  291, 4 loop-carried redundant expressions removed with 12 operations and 16 arrays
       Generated vector SSE code for inner loop
       Generated 8 prefetch instructions for this loop
       Generated vector SSE code for inner loop
       Generated 8 prefetch instructions for this loop

Page 36: PGI Compilers and Tools

• Vectorization – packed SSE instructions maximize performance

• Interprocedural Analysis (IPA) – use it! motivating examples

• Function Inlining – especially important for C and C++

• SMP Parallelization – for Cray XD1 and multi-core processors

• Miscellaneous Optimizations – hit or miss, but worth a try

Page 37: PGI Compilers and Tools

Miscellaneous Optimizations (1)

• -Mfprelaxed – single-precision sqrt, rsqrt, div performed using a reduced-precision reciprocal approximation

• -lacml and -lacml_mp – link in the AMD Core Math Library

• -Mprefetch=d:<p>,n:<q> – control prefetching distance and the max number of prefetch instructions per loop

• -tp k8-32 – can result in a big performance win on some C/C++ codes that don't require > 2GB addressing; pointer and long data become 32 bits

Page 38: PGI Compilers and Tools

Miscellaneous Optimizations (2)

• -O3 – more aggressive hoisting and scalar replacement; not part of -fastsse; always time your code to make sure it's faster

• For C++ codes: --no_exceptions -Minline=levels:10

• -M[no]movnt – disable / force non-temporal moves

• -V[version] to switch between PGI releases at the file level

• -Mvect=noaltcode – disable multiple versions of loops

Page 39: PGI Compilers and Tools

What's New in PGI 6.2

• Industry-leading SPECFP06 and SPECINT06 performance

• PGI Visual Fortran for Windows x64 & Windows XP

• Full-featured PGI Workstation/Server for 32-bit Windows XP

• PGI Unified Binary performance enhancements

• More gcc extensions / compatibility

• New SSE intrinsics

• PGI CDK Roll for ROCKS clusters

• MPICH1 and MPICH2 support in the PGI CDK

• Incremental debugger/profiler enhancements

• Limited tuning for Intel Core2 (Woodcrest et al)

Page 40: PGI Compilers and Tools

PGI Visual Fortran 6.2

• PGI Unified Binary executables
• Auto-parallel for multi-core CPUs
• Native OpenMP 2.5 parallelization
• World-class performance
• 64-bit Windows x64 support
• 32-bit Windows 2000/XP support
• Optimization/support for AMD64
• Optimization/support for Intel EM64T
• DEC/IBM/Cray compatibility features
• cpp-compatible pre-processing
• Visual Studio 2005 bundled*
• MSDN Library bundled*
• GUI parallel debugging/profiling*
• Assembly-optimized BLAS/LAPACK/FFTs*
• Boxed CD-ROM/Manuals media kit*

• Deep integration with Visual Studio 2005
• PGI-custom Fortran-aware text editor
• Syntax coloring, keyword completion
• Fortran 95 intrinsics tips
• PGI-custom project system and icons
• PGI-custom property pages
• One-touch project build / execute
• MS Visual C++ interoperability
• Mixed VC++ / PGI Fortran applications
• PGI-custom parallel F95 debug engine
• OpenMP 2.5 / threads debugging
• Just-in-time debugging features
• DVF/CVF compatibility features
• Win32 API support
• Complete (Visual Studio bundled) and Standard (no Visual Studio) versions

*PVF Workstation Complete Only

Page 41: PGI Compilers and Tools

On the PGI Roadmap

• PGI Unified Binary directives and enhancements

• Aggressive Intel Core2 and next-gen AMD64 tuning

• Industry-leading SPECFP06 and SPECINT06 performance on Linux/Windows/AMD/Intel/32/64

• Incremental PGDBG enhancements, improved C++ support

• MPI debugging / profiling for Windows x64 CCS clusters

• All-new cross-platform PGPROF performance profiler

• Fortran 2003/C99 language features

• GCC front-end compatibility, g++ interoperability

• PGC++ tuning, PGC++/VC++ interoperability

• Windows SUA and Apple/MacOS platform support

• De facto standard scalable C/Fortran language/tools extensions

Page 42: PGI Compilers and Tools

Reach me at [email protected]
Thanks for your time!

Questions?