pSeries Program Optimization
Charles Grassl, IBM
Louisiana State University, Baton Rouge, Louisiana
June 2005
© 2005 IBM
Slide 2: Agenda
• Programming Concerns
• Tactics
Slide 3: Understanding Performance
• Fused floating multiply-add functional units
  • Latency
  • Balance
• Memory access
  • Stride
  • Latency
  • Bandwidth
Slide 4: FMA Functional Unit
• Multiply and add
• Fused
• 6 clock period latency

  D = A + B*C
Slide 5: Performance Expectations
• 100s of Mflop/s to 2-4 Gflop/s
• Limitations:
  • System bandwidth
  • Program model
• Floating point operations
  • Adds and multiplies
  • Divides
• Memory access
  • Copying
• 100 Mbyte/s - 10 Gbyte/s
  • Strided access is slow
  • Multiple streams
Slide 6: Performance Expectations Example: Program POP
• Structured Fortran 90
• Data moves
• Low computational intensity
• Low FMA percentage

  Comp. intensity:                  0.869
  FMA percentage:                   53%
  Float. point instr. + FMA rate:   372 Mflip/s
  Float. point inst. + FMAs:        10141 M
  HW float. point inst. per cycle:  0.3
  Instructions per cycle:           0.9
  Instructions per load/store:      2.9
Slide 7: Program POP: Computation Time Distribution
[Bar chart: % time per routine (0-12%): __state_mod_MOD_st…, __vmix_rich_MOD_vm…, tracer_update, clinic, __hmix_del2_MOD_hd…, __vertical_mix_MOD_i… (two entries), __diagnostics_MOD_d…]
Slide 8: Performance Expectations Example: Program SPPM
• Loose Fortran 77
• Optimized
• High computational intensity
• Vector intrinsics
• Low FMA percentage

  Computation intensity:            1.8
  FMA percentage:                   55%
  Float. point inst. + FMA rate:    979 Mflip/s
  Float. point inst. + FMAs:        191752
  HW float inst. per cycle:         0.7
  Instr. per cycle:                 1.2
  Instr. per load/store:            2.9
Slide 9: Program SPPM: Computation Time Distribution
[Bar chart: % time per routine (0-60%): sppm, difuze, __vsrec_GP, interf, dintrf, __vsrsqrt_630, hydyz, hydzy]
Slide 10: Programming Concerns
• Superscalar design
  • Concurrency:
    • Branches
    • Loop control
    • Procedure calls
    • Nested “if” statements
  • Program “control” is very efficient
• Cache based microprocessor
  • Critical resource: memory bandwidth
  • Tactics:
    • Increase computational intensity
    • Exploit “prefetch”
Slide 11: Strategies for Optimization
• Exploit multiple functional units:
  • Expose load/store “streams”
  • Optimized libraries
  • Pipelined operations
Slide 12: Tactics for Optimization
• Cache reuse
  • Blocking
• Unit stride
  • Use entire loaded cache lines
• Limit range of indirect addressing
  • Sort indirect addresses
• Increase computational intensity of the loops
  • Ratio of flops to bytes
• Load streams
Slide 13: Computational Intensity
• Ratio of floating point operations to memory references (loads and stores)
• Higher is better
• Example:

  for (i = 0; i < n; i++)
      A[i] = A[i] + B[i]*s + C[i];

  • Loads and stores: 3 + 1
  • Floating point operations: 3
  • Computational intensity: 3/4
Slide 14: Loop Unrolling Strategy: Increase Computational Intensity
• Find a variable which is constant with respect to the outer loop
  • Unroll such that this variable is loaded once but used multiple times
• Unroll the outer loop
  • Minimizes loads/stores
Slide 15: Outer Loop Unroll
• Original: 2 flops / 2 loads; comp. int.: 1
• Unrolled: 8 flops / 5 loads; comp. int.: 1.6

  DO I = 1, N
    DO J = 1, N
      s = s + X(J)*A(J,I)
    END DO
  END DO

Unroll:

  DO I = 1, N, 4
    DO J = 1, N
      s = s + X(J)*A(J,I+0) + X(J)*A(J,I+1) &
            + X(J)*A(J,I+2) + X(J)*A(J,I+3)
    END DO
  END DO
Slides 16-17: Outer Loop Unroll Test
[Bar charts: Mflop/s (0-500) vs unroll depth 1, 2, 4, 8, 12, for -O3 and -O3 -qhot; 1.3 GHz POWER4]
Slide 18: Loop Unroll Analysis
• Strategy:
  • Unroll up to 8 times
  • Exploit 8 prefetch streams
• Compiler:
  • Unrolls up to 4 times
  • “Near” optimal performance
  • Combines inner and outer loop unrolling
Slide 19: Loop Unrolling Strategies
• Inner loop strategy:
  • Reduces data dependency
  • Eliminates intermediate loads and stores
  • Exposes functional units
  • Exposes registers
  • Examples:
    • Linear recurrences
    • Simple loops
    • Single operation
Slide 20: Inner Loop Unroll: Dependencies
• Eliminate data dependence (half)
• Eliminate intermediate loads and stores
  • Compiler will do some of this at -O3 and higher

  do i=2,n-1
    a(i+1) = a(i)*s1 + a(i-1)*s2
  end do

  do i=2,n-2,2
    a(i+1) = a(i  )*s1 + a(i-1)*s2
    a(i+2) = a(i+1)*s1 + a(i  )*s2
  end do
Slides 21-22: Inner Loop Unroll: Dependencies
[Bar charts: Mflop/s (0-350) for unroll 1, 2, 4 with -qnounroll vs -qunroll]
Slide 23: Inner Loop Unroll: Dependencies
• Compiler unrolling helps
  • Does not help manually unrolled loops
  • Conflicts
• Turn off compiler unrolling if manually unrolled
Slide 24: Inner Loop Unroll: Registers and Functional Units
• Expose functional units
• Expose registers
  • Compiler will do some of this at -O3 and higher

  do j=1,n
    do i=1,m
      sum = sum + X(i)*A(i,j)
    end do
  end do

  do j=1,n
    do i=1,m,2
      sum = sum + X(i)  *A(i,j) &
                + X(i+1)*A(i+1,j)
    end do
  end do
Slides 25-26: Inner Loop Unroll: Registers and Functional Units
[Bar charts: Mflop/s (0-500) for unroll 1, 2, 4, 8 with -qnounroll vs -qunroll; 1.45 GHz POWER4]
Slide 27: Inner Loop Unroll: Registers and Functional Units
• Compiler does an adequate job of unrolling
• Do not manually unroll the inner loop
Slide 28: Outer Loop Unrolling Strategies
• Expose prefetch streams
  • Up to 8 streams

  do j=1,n
    do i=1,m
      sum = sum + X(i)*A(i,j)
    end do
  end do

  do j=1,n,2
    do i=1,m
      sum = sum + X(i)*A(i,j) &
                + X(i)*A(i,j+1)
    end do
  end do
Slides 29-30: Outer Loop Unroll: Streams
[Bar charts: Mflop/s (0-800) for unroll 1, 2, 4, 8 with -qnounroll vs -qunroll]
Slide 31: Outer Loop Unroll: Streams
• Compiler does a “fair” job of outer loop unrolling
• Manual unrolling can outperform compiler unrolling
Slide 32: Strides: Cache Lines
• Strided memory accesses use partial cache lines
  • Reduced efficiency
Slide 33: Strided Memory Access

  for (i = 0; i < n; i += 2)
      sum += a[i];

[Diagram: memory, with only every other element of each loaded cache line used]
Slide 34: Strides
• Cache line size is 128 bytes
• Double precision: 16 words
• Single precision: 32 words

Bandwidth reduction:

  Stride   Double   Single
  1        1x       1x
  2        1/2      1/2
  4        1/4      1/4
  8        1/8      1/8
  16       1/16     1/16
  32       1/16     1/32
  64       1/16     1/32
Slide 35: Stride Test
[Bar chart: Mbyte/s (0-3000) vs stride (-16 to 12); POWER4 1.3 GHz]
Slide 36: Correcting Strides
• Interleave code
• Example:
  • Real and imaginary part arithmetic

  do i=1,n
    … AIMAG(Z(i))
  end do
  do i=1,n
    … REAL(Z(i))
  end do

Interleaved:

  do i=1,n
    … AIMAG(Z(i))
    … REAL(Z(i))
  end do
Slide 37: Correcting Strides
• Interchange loops

  do i=1,n
    do j=1,m
      sum = sum + A(i,j)
    end do
  end do

Interchange:

  do j=1,m
    do i=1,n
      sum = sum + A(i,j)
    end do
  end do
Slide 38: Correcting Strides: Loop Interchange
[Bar chart: Mflop/s (0-160) for -O3, -O3 -qhot, and interchanged; 1.45 GHz POWER4]
Slide 39: Effect of Large Strides (TLB Misses)
[Bar chart: Mbyte/s (0-450) vs large stride: 8, 16, 32, 68, 132, 260, 516, 1028, 2000]
Slide 40: Working Set Size
• Reduce the spanning size of random memory accesses
• More MPI tasks usually helps
Slide 41: Random Memory Access
[Chart: updates/s vs working set size (1-512 Mbyte)]
Slide 42: Blocking
• Common technique in linear algebra (LA)
  • Similar to unrolling
  • Utilizes cache lines
  • Linear algebra NB:
    • Typically 96-256
Slide 43: Blocking
[Diagram]
Slide 44: Blocking Example: Transpose
• Especially useful for bad strides

  do i = 1,n
    do j = 1,m
      B(j,i) = A(i,j)
    end do
  end do

Blocked:

  do i1 = 1, n, nb
    i2 = min(i1+nb-1, n)
    do j1 = 1, m, nb
      j2 = min(j1+nb-1, m)
      do i = i1, i2
        do j = j1, j2
          B(j,i) = A(i,j)
        end do
      end do
    end do
  end do
Slide 45: Blocking Example: Transpose
[Bar chart: Mbyte/s (0-1800) vs blocking size 1, 32, 64, 128, 192, plus dgetmi]
Slide 46: Hardware Prefetch
• Detects adjacent cache line references
• Forward and backward
• Up to eight concurrent streams
• Prefetches up to two lines ahead per stream
• Twelve prefetch filter queues prevent rolling
• No prefetch on store misses (when a store instruction causes a cache line miss)
Slide 47: Prefetch: Stride Pattern Recognition
• Upon a cache miss:
  • A biased guess is made as to the direction of that stream
  • The guess is based upon where in the cache line the address associated with the miss occurred
    • If it is in the first 3/4, the direction is guessed as ascending
    • If it is in the last 1/4, the direction is guessed as descending
Slides 48-50: Memory Bandwidth
[Bar charts: Mbyte/s vs number of right-hand sides (1-12); small pages reach ~3000 Mbyte/s, large pages ~5000 Mbyte/s; 1.3 GHz POWER4]
Slide 51: Memory Bandwidth
• Bandwidth is proportional to the number of streams
  • Streams are roughly the number of right-hand-side arrays
  • Up to eight streams
Slide 52: Exploiting Prefetch
• Merge loops
• Strategy:
  • Combine loops to get up to 8 right-hand sides

  for (j=1; j<=n; j++)
      A[j] = A[j-1] + B[j];

  for (j=1; j<=n; j++)
      D[j] = D[j+1] + C[j]*s;

Merged:

  for (j=1; j<=n; j++) {
      A[j] = A[j-1] + B[j];
      D[j] = D[j+1] + C[j]*s;
  }
Slide 53: Loop Merge Example
[Bar chart: Mflop/s (0-400) for original, -O3 -qhot, and merged; 1.3 GHz POWER4]
Slide 54: Folding
• Fold a loop to increase the number of streams
• Strategy:
  • Fold loops up to 4 times

  do i = 1,n
    sum = sum + A(i)
  end do

Folded:

  do i = 1,n/4
    sum = sum + A(i)       &
              + A(i+1*n/4) &
              + A(i+2*n/4) &
              + A(i+3*n/4)
  end do
Slide 55: Folding Example
[Bar chart: Mflop/s (0-4000) vs number of folds (0-12); 1.3 GHz POWER4]
Slide 56: Effect of Precision
• Available floating point formats:
  • real (kind=4)
  • real (kind=8)
  • real (kind=16)
• Advantages of smaller data types:
  • Require less bandwidth
  • More effective cache use

  REAL*8 A,pi,e
  …
  do i=1,n
    A(i) = pi*A(i) + e
  end do

  REAL*4 A,pi,e
  …
  do i=1,n
    A(i) = pi*A(i) + e
  end do
Slide 57: Effect of Precision
[Bar chart: Mflop/s (0-1600) for REAL*4, REAL*8, REAL*16; small (10000) and large (90M) working sets; 1.3 GHz POWER4]
Slide 58: Divide and Sqrt
• POWER4 special functions:
  • Divide
  • Sqrt
• Use the FMA functional unit
• 2 simultaneous divide or sqrt (or rsqrt)
• NOT pipelined

  Instruction   Single (cycles)   Double (cycles)
  fma           6                 6
  fdiv          32                32
  fsqrt         38                38
Slides 59-61: Hardware DIV, SQRT, RSQRT
[Bar charts: Mflop/s for divide, sqrt, and rsqrt instructions; hardware (0-80) vs software pipelined (0-120); 1.3 GHz POWER4]
Slide 62: Intrinsic Function Vectorization

  do i = -nbdy+3,n+nbdy-1
    prl = qrprl(i)
    pll = qrmrl(i)
    pavg = vtmp1(i)
    wllfac(i) = 5*gammp1*pavg + gamma*pll
    wrlfac(i) = 5*gammp1*pavg + gamma*prl
    hrholl = rho(1,i-1)
    hrhorl = rho(1,i)
    wll(i) = 1/sqrt(hrholl * wllfac(i))
    wrl(i) = 1/sqrt(hrhorl * wrlfac(i))
  end do
Slide 63: Intrinsic Function Vectorization

  allocate(t1(n+2*nbdy-3))
  allocate(t2(n+2*nbdy-3))
  do i = -nbdy+3,n+nbdy-1
    prl = qrprl(i)
    ...
    t1(i) = hrholl * wllfac(i)
    t2(i) = hrhorl * wrlfac(i)
  end do
  call __vrsqrt(t1,wrl,n+2*nbdy-3)
  call __vrsqrt(t2,wll,n+2*nbdy-3)
Slide 64: Vectorization Analysis
• Dependencies
• Compiler overhead:
  • Generate (malloc) local temporary arrays
  • Extra memory traffic
• Moderate vector lengths required

                      REC   SQRT   RSQRT
  Crossover length    20    80     25
  N½                  45    25     30