TRANSCRIPT
SAN DIEGO SUPERCOMPUTER CENTER
Single Processor Code Optimization for the
Datastar (Power 4) Architecture
by Ross Walker
with some modifications by Grimshaw
Why optimize your code for single cpu performance?
Advantages
• More calculation in a given time.
• Fewer SUs per simulation.
• Faster time to solution.
• Better chance of successful MRAC/LRAC proposals.
REMEMBER: WALLCLOCK TIME IS EVERYTHING!
• The only metric that ultimately matters is wallclock time to solution.
• Wallclock is how long it takes to run your simulation.
• Wallclock is how much you get charged for.
• Wallclock is how long your code is blocking other users from using the machine.

• Lies, Damn Lies and Statistics...
• TFlops/PFlops numbers are ‘worse’ than statistics.
• You can easily write a code that gets very high Flops numbers but has a longer time to solution.
Why optimize your code for single cpu performance?
Disadvantages
• Time consuming – do not waste weeks optimizing a one-off code that will run for 1 hour.
• Can make the code harder to read and debug.
• Can adversely affect parallel scaling.
• Different architectures can respond in different ways.
When to optimize?
• Code optimization is an iterative process requiring time, energy and thought.
• Performance tuning is recommended for:
1) production codes that are widely distributed and often used in the research community
2) projects with limited allocation (to maximize available compute hours).
Optimization Strategy
This talk will concentrate on how to carry out this section.
Optimization Strategy
• Aim: To try to minimize the amount of work the compiler is required to do.
DO NOT rely on the compiler to optimize bad code.
Use a two-pronged approach:
1) Write easily optimizable code to begin with.
2) Once the code is debugged and working, try more aggressive optimizations.
Example Code
• This talk will go through a number of simple examples of bad coding and show what compilers can and can’t do and how you can work ‘with’ the compiler to produce faster code.
• Example code is in:
/gpfs/projects/summer_inst/code_opt/
After this talk you will have an opportunity to try to optimize this code. There will be prizes for the most efficient codes.
The Power 4 Architecture (Datastar)
The Power 4 Architecture
• Remember: Clock speed is NOT everything.
CPU                    MHz   CINT2000 base/peak  CFP2000 base/peak
AMD Athlon XP 1900+    1600  677/701             588/634
HP PA-8700             750   568/604             526/576
IBM POWER4 (1 CPU)     1300  790/814             1098/1169
Intel Itanium          800   314/314             645
Intel Pentium 4        2200  771/784             766/777
Sun UltraSPARC III-Cu  900   470/533             629/731
Why is the Power 4 chip faster for a lower clock speed?
• The speed comes from the inherent parallelism designed into the chip.
• 2 floating point units per core.
• 2 fixed point units per core.

Each FP unit can perform 1 fused multiply-add (FMA) per clock tick. Thus the two floating point units can perform a theoretical maximum of 4 floating point operations per clock cycle.
• You should design your code with this in mind.
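The 4-flops-per-cycle figure is simple arithmetic; here it is as a tiny C sketch (the talk's own examples are Fortran, and this helper is purely illustrative, not Datastar code):

```c
#include <assert.h>

/* Theoretical peak flop rate = clock * FP units * flops per FMA.
   For one Power 4 core: 1.3e9 * 2 units * 2 flops = 5.2 Gflop/s. */
static double peak_flops(double clock_hz, int fp_units, int flops_per_fma) {
    return clock_hz * (double)fp_units * (double)flops_per_fma;
}
```

Real codes rarely approach this peak; the rest of the talk is about closing some of that gap.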
What compilers will not do.
• They will not reorder large memory arrays.
• They will not factor/simplify large equations.
• They will not replace inefficient logic.
• They will not vectorize large complex loops.
• They will not fix inefficient memory access.
• They will not detect and remove excess work.
• They will not automatically invert divisions.
• ...
How can we help the compiler?
• We should aim to help the compiler as much as possible by designing our code to match the machine architecture.
• Try to always move linearly in memory (cache hits).
• Factor / simplify equations as much as possible.
• Think carefully about how we do things. Is there a more efficient approach?
• Remove excess work.
• Avoid branching inside loops.

I will take you through a series of simple examples that you can then use to try and optimize the example code.
I will take you through a series of simple examples that you can then use to try and optimize the example code.
Example ‘bad’ code
• In /gpfs/projects/summer_inst/code_opt/ is an example piece of code that we will attempt to optimize.
• This code solves the following ‘fictional’ equation:

  E = Σ_i Σ_{j<i}  [ e^(r_ij·q_i) · e^(r_ij·q_j) / r_ij  +  1/a ]   for r_ij <= cut
Example ‘bad’ Code

The main loop of the code is as follows:

  total_e = 0.0d0
  cut_count = 0
  do i = 1, natom
    do j = 1, natom
      if ( j < i ) then !Avoid double counting.
        vec2 = (coords(1,i)-coords(1,j))**2 + (coords(2,i)-coords(2,j))**2 &
             + (coords(3,i)-coords(3,j))**2 !X^2 + Y^2 + Z^2
        rij = sqrt(vec2)
        !Check if this is below the cut off
        if ( rij <= cut ) then
          cut_count = cut_count + 1 !Increment the counter of pairs below cutoff
          current_e = (exp(rij*q(i))*exp(rij*q(j)))/rij
          total_e = total_e + current_e + 1.0d0/a
        end if
      end if
    end do
  end do

While only short, there is a lot of potential optimization available in this code.
Step 1 – Select best compiler / library options
• We have sqrt’s and exponentials.
• We also have multiple loops.

Hence it is essential that we compile with optimization and link against IBM’s accelerated math library (mass):

  xlf90 –o original.x –O3 –qarch=pwr4 –qtune=pwr4 –L/usr/lib –lmass original.f
Compiler Options Execution time (s)
-O0 552.0
-O3 183.3
-O5 179.3
-O3 -lmass 176.8
-O5 -lmass 177.3
Step 2 – Code Optimization
• We can modify the code to make the compiler’s job easier.
• I will take you through a series of examples and then you can use these to optimize the example code.
Optimization Tip 1

• Avoid expensive mathematical operations as much as possible.
• Addition, subtraction, multiplication = very fast.
• Division and exponentiation = slow.
• Square root = very slow. (Inverse square root can sometimes be faster.)
Operation Min Cycles per iteration (L1 Cache)
x(i) = y(i) 1.7
x(i) = x(i) + y(i) 1.7
x(i) = x(i) + s*y(i) 1.7
x(i) = 1/y(i) 15.1
x(i) = sqrt(y(i)) 18.1
Optimization Tip 1 Contd...
• E.g. in our example code we need to test if rij is <= a cutoff:

  rij = sqrt(x^2+y^2+z^2)
  if (rij <= cut) then
    ...
  end if

• We can avoid doing excessive square roots by taking into account that cut is a constant and so just comparing to cut^2:

  vec2 = x^2+y^2+z^2
  if (vec2 <= cut*cut) then
    rij = sqrt(vec2)
    ...
  end if
Optimization Tip 2
• Do not rely on the compiler to simplify / factorize equations for you. You should manually do this yourself. E.g.:

  e^(R_ij·q_i) · e^(R_ij·q_j) = e^(R_ij·(q_i + q_j))

• This replaces 2 expensive exponentials with just 1.
• The same is true for excessive divisions, square roots etc.
• Try expanding the equation that your code represents and then refactorizing it with the intention of avoiding as many square roots, exponentials and divisions as possible.
Optimization Tip 3

• Invert divisions wherever possible.

  do i = 1, N
    do j = 1, N
      y(j,i) = x(j,i) / r(i)
    end do
  end do

Becomes:

  do i = 1, N
    oner = 1.0d0/r(i)
    do j = 1, N
      y(j,i) = x(j,i) * oner
    end do
  end do

This replaces N^2 divisions with N divisions and N^2 (much cheaper) multiplications.
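The same transformation in C, using a flattened 2-D array (illustrative names and sizes; note x*(1/r) can differ from x/r in the last bit, which is one reason the competition only asks for agreement to 12 s.f.):

```c
#include <assert.h>

/* N*N divisions... */
static void scale_rows_div(double *y, const double *x, const double *r, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            y[i*n + j] = x[i*n + j] / r[i];
}

/* ...become N divisions plus N*N multiplications. */
static void scale_rows_mul(double *y, const double *x, const double *r, int n) {
    for (int i = 0; i < n; i++) {
        double oner = 1.0 / r[i];       /* one division per row */
        for (int j = 0; j < n; j++)
            y[i*n + j] = x[i*n + j] * oner;
    }
}
```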
Optimization Tip 3 (Extended Example)

• One can extend this idea to situations where something requires both a multiplication by a distance and a division by a distance. E.g.:

  do j = 1, iminus
    vec(1:3) = qmx(1:3) - qmcoords(1:3,j)
    r2 = vec(1)**2 + vec(2)**2 + vec(3)**2
    onerij = 1.0d0/sqrt(r2) !1.0d0/rij
    rij = r2*onerij
    ...
    ... = ... + param*rij
    sum = sum + q(j)*onerij
  end do

This takes advantage of the fact that on some systems an inverse square root followed by a multiplication can be faster than a square root followed by a division.
Optimization Tip 4

• Avoid branching in inner loops – this leads to cache misses and also prevents the compiler from vectorizing loops and making use of the multiple floating point units.

  do i=1,n
    if (r < 1.0e-16) then
      a(i)=0.0; b(i)=0.0; c(i)=0.0
    else
      a(i)=x(i)/r
      b(i)=y(i)/r
      c(i)=z(i)/r
    end if
  end do

Becomes:

  if (r < 1.0e-16) then
    do i=1,n
      a(i)=0.0; b(i)=0.0; c(i)=0.0
    end do
  else
    do i=1,n
      a(i)=x(i)/r
      b(i)=y(i)/r
      c(i)=z(i)/r
    end do
  end if
Optimization Tip 4 Contd...
• How can we improve this code even more?
Invert the divisions:

  if (r < 1.0e-16) then
    do i=1,n
      a(i)=0.0; b(i)=0.0; c(i)=0.0
    end do
  else
    do i=1,n
      a(i)=x(i)/r
      b(i)=y(i)/r
      c(i)=z(i)/r
    end do
  end if

Becomes:

  if (r < 1.0e-16) then
    do i=1,n
      a(i)=0.0; b(i)=0.0; c(i)=0.0
    end do
  else
    oner=1.0d0/r
    do i=1,n
      a(i)=x(i)*oner
      b(i)=y(i)*oner
      c(i)=z(i)*oner
    end do
  end if
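Both improvements together, sketched in C (names and sizes are illustrative):

```c
#include <assert.h>

/* The loop-invariant test on r is hoisted out of the loop, and the
   division by r is replaced by one reciprocal plus multiplies. */
static void scale_or_zero(double *a, const double *x, int n, double r) {
    if (r < 1.0e-16) {                  /* tested once, not n times */
        for (int i = 0; i < n; i++)
            a[i] = 0.0;
    } else {
        double oner = 1.0 / r;          /* one division in total */
        for (int i = 0; i < n; i++)
            a[i] = x[i] * oner;
    }
}
```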
Optimization Tip 4 Contd.

• How can we remove the following if statement from the example code?
• Hint: consider changing the loop indexes.

  do i = 1, natom
    do j = 1, natom
      if ( j < i ) then !Avoid double counting.
        vec2 = (coords(1,i)-coords(1,j))**2 + (coords(2,i)-coords(2,j))**2 &
             + (coords(3,i)-coords(3,j))**2 !X^2 + Y^2 + Z^2
        rij = sqrt(vec2)
        !Check if this is below the cut off
        if ( rij <= cut ) then
          cut_count = cut_count + 1 !Increment the counter of pairs below cutoff
          current_e = (exp(rij*q(i))*exp(rij*q(j)))/rij
          total_e = total_e + current_e + 1.0d0/a
        end if
      end if
    end do
  end do
Optimization Tip 5

• Ensure that loop indexes are constant while a loop is executing. E.g.:

  do i = 2, N
    ...
    do j = 1, i-1
      ...
    end do
  end do

Some compilers will not recognize that i-1 is constant during the j loop and so will re-evaluate i-1 on each loop iteration. We can help the compiler out here as follows:

  do i = 2, N
    ...
    iminus = i-1
    do j = 1, iminus
      ...
    end do
  end do
Optimization Tip 6
• Avoid excessive array lookups – factor array lookups that are not dependent on the inner loop out of the loop.
• Some compilers will do this for you. Others will not.

  do i = 1, N
    do j = 1, N
      ...
      sum = sum + x(j)*x(i)
    end do
  end do

Becomes:

  do i = 1, N
    xi = x(i)
    do j = 1, N
      ...
      sum = sum + x(j)*xi
    end do
  end do

x(i) becomes a scalar in the inner loop.
Optimization Tip 7

• Always try to use stride 1 memory access.
• Traverse linearly in memory.
• Maximizes cache hits.
• In Fortran this means you should always loop over the first array index in the inner loop. E.g.:

  do i=1, Ni
    do j=1,Nj
      do k=1,Nk
        v=x(i,j,k)
        ...
      end do
    end do
  end do

Wrong! This way round each iteration of the k loop jumps a total of Ni*Nj*8 bytes, meaning it almost always misses the cache.
Optimization Tip 7 Contd.
• By changing the order of the loops we can ensure that we always stride linearly through the x array:

  do k=1, Nk
    do j=1,Nj
      do i=1,Ni
        v=x(i,j,k)
        ...
      end do
    end do
  end do

• Note: in C memory is laid out in the reverse fashion. Ideally you should use a linear array and do the necessary offset arithmetic yourself.
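For C the equivalent rule (last index fastest, since C is row-major) looks like this; the array dimensions are illustrative:

```c
#include <assert.h>

#define NI 4
#define NJ 5
#define NK 6

/* C is row-major, so for x[i][j][k] the LAST index must vary fastest
   in the inner loop to get stride-1 access. */
static double sum_all(double x[NI][NJ][NK]) {
    double s = 0.0;
    for (int i = 0; i < NI; i++)            /* slowest-varying outermost */
        for (int j = 0; j < NJ; j++)
            for (int k = 0; k < NK; k++)    /* fastest-varying innermost */
                s += x[i][j][k];
    return s;
}
```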
Optimization Tip 7 Contd.
• If you have multiple nested loops that loop entirely through an array it can often be more beneficial to replace a multi-dimensional array with a 1-D array and do the arithmetic yourself. E.g.:

  Nitr = Nk*Nj*Ni
  do k=1, Nitr
    v=x(k)
    ...
  end do

• This avoids unnecessary pointer lookups and can make it easier to vectorize the code.
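The offset arithmetic itself, sketched in C (row-major convention; names are illustrative):

```c
#include <assert.h>

/* In a C array of shape [ni][nj][nk], element [i][j][k] lives at flat
   index (i*nj + j)*nk + k.  Looping 0 .. ni*nj*nk-1 over the flat
   array therefore visits every element exactly once, in memory order. */
static int flat_index(int nj, int nk, int i, int j, int k) {
    return (i*nj + j)*nk + k;
}
```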
Optimization Tip 7 Contd.

• Sometimes (you need to test), especially on the Power 4 architecture, it can help to split arrays of the form (1:x,N), where x is a small (ca. 3) constant, into a set of x 1-D arrays. E.g.:

  do i = 1, N
    ...
    do j = 1, N
      vec2=(xi-crd(1,j))**2+(yi-crd(2,j))**2+(zi-crd(3,j))**2
      ...
    end do
  end do

BECOMES

  do i = 1, N
    ...
    do j = 1, N
      vec2=(xi-x(j))**2+(yi-y(j))**2+(zi-z(j))**2
      ...
    end do
  end do

The reasoning behind this is that the Power 4 chip has multiple cache lines, and by splitting the array into 3 separate arrays the compiler can utilize 3 pipelines simultaneously. (You need to test whether this works for your code!)
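This "split arrays" layout is what is often called structure-of-arrays; a small C sketch with illustrative names:

```c
#include <assert.h>

/* Squared distance from point (xi,yi,zi) to point j, with coordinates
   stored as three separate 1-D arrays rather than one (3,N) array --
   three independent unit-stride streams for the hardware to fetch. */
static double dist2_split(const double *x, const double *y, const double *z,
                          int j, double xi, double yi, double zi) {
    double dx = xi - x[j];
    double dy = yi - y[j];
    double dz = zi - z[j];
    return dx*dx + dy*dy + dz*dz;
}
```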
Vectorization
• Sometimes if you have a large amount of computation occurring inside a loop it can be beneficial to explicitly vectorize the code and make use of specific vector math libraries.
• Often the compiler can do this vectorization for you (xlf90 will explicitly try if –qhot is specified).
• Compilers can typically only vectorize small inner loops. Complex loop structures will not get vectorized; here your code can benefit from explicit vectorization.
Why Does Vectorization Help on a Scalar Machine?
• The machine is actually superscalar...
• It has a pipeline for doing computation.
• It has multiple floating point units.
• Vectorization makes it easy for the compiler to use multiple floating point units as we have a large number of operations that are all independent of each other.
• Vectorization makes it easy to fill the pipeline.
• On Power 4 a single FMA has to go through 6 stages in the pipeline.
• After 1 clock cycle the first FMA operation has completed stage 1 and moves on to stage 2. Processor can now start processing stage 1 of the second FMA operation in parallel with stage 2 of first operation etc etc...
Why Does Vectorization Help on a Scalar Machine?
• After 6 operations, pipelining of subsequent FMA operations gives one result every clock cycle.
• Pipeline latency is thus hidden beyond 6 operations.
• The Power 4 chip has two floating point units, so it needs a minimum of 12 independent FMA operations to be fully pipelined.
• Thus if we can split our calculation up into a series of long ‘independent’ vector calculations we can maximize the floating point performance of the chip.
Why Does Vectorization Help on a Scalar Machine?
Vectorization Contd.
Scalar Code:

  do i = 2, N
    do j = 1, i-1
      vec2 = ...
      onerij = 1.0d0/sqrt(vec2)
      rij = vec2*onerij
      exp1 = exp(rij)
      esum = esum + exp1*onerij
    end do
  end do
Vectorization Contd.
Scalar Code:

  do i = 2, N
    do j = 1, i-1
      vec2 = ...
      onerij = 1.0d0/sqrt(vec2)
      rij = vec2*onerij
      exp1 = exp(rij)
      esum = esum + exp1*onerij
    end do
  end do

Vector Code:

  loopcount = 0
  do i = 2, N
    do j = 1, i-1
      loopcount = loopcount+1
      vec2 = ...
      vectmp1(loopcount) = vec2
    end do
  end do
  !Begin vector operations
  call vrsqrt(vectmp2,vectmp1,loopcount)
  !vectmp2 now contains onerij
  vectmp1(1:loopcount) = vectmp1(1:loopcount) * vectmp2(1:loopcount)
  !vectmp1 now contains rij
  call vexp(vectmp1,vectmp1,loopcount)
  !vectmp1 now contains exp(rij)
  do i = 1, loopcount
    esum = esum + vectmp1(i)*vectmp2(i)
  end do
IBM’s vector math library
• IBM has a specially tuned vector math library for use on Datastar. This library is called libmassvp4 and can be compiled into your code with:

  xlf90 -o foo.x -O3 -L/usr/lib -lmassvp4 foo.f

• Support is provided for a number of vector functions including:
  vrec (vectored inverse), vexp (vectored exp),
  vsqrt (vectored sqrt), vrsqrt (vectored inverse sqrt),
  vlog (vectored ln), vcos (vectored cosine),
  vtanh (vectored tanh)
The Example Code (A Competition)
• As mentioned previously, an example piece of code is available in /gpfs/projects/summer_inst/code_opt/ on Datastar. (COPY THIS TO YOUR OWN DIRECTORY ON GPFS BEFORE EDITING.)
• Provided in this directory is the unoptimized code (original.f), a script to compile it (compile.x) and a LoadLeveler script to submit it on 1 cpu (llsubmit submit_orig.ll).
• There is also a set of optimized codes that I have created (both scalar and vector). These are read-protected at present and will be made readable to all at the end of today.
The Example Code (A Competition)
• Your mission (should you choose to accept it) is to optimize this example code so that it gives the same answer (to 12 s.f.) on a single processor in as little wallclock time as possible.
The Example Code (A Competition)
• My timings are as follows (for 1 cpu of a Datastar 1.5 GHz node):

Code Version                                    Execution Time (s)
Original Code (original.f)                      176.80
Standard Optimized Code (optimized_standard.f)  41.46
Vectorized Code (optimized_vector.f)            30.64

PRIZES!!! – There will be a prize for each person who gets better than 31.0 seconds on 1 cpu x 1.5 GHz and the answer to within 12 s.f.
Directory Contents
• Directory = /gpfs/projects/summer_inst/code_opt/
-rwxr-xr-x ... compile.x*                   ./compile.x to compile codes
-rw-r--r-- ... coords.dat                   coordinate file read by code
drwxr-xr-x ... opt/                         directory containing optimized solution
-rw-r--r-- ... original.f                   original single cpu code (start here)
-rw-r--r-- ... original.out                 output from run
-rwxr-xr-x ... original.x*                  executable built by compile.x
-rw-r--r-- ... original_16cpu.out           output from 16 cpu run
-rw-r--r-- ... original_mpi.f               original multi cpu code (start here 2)
-rwxr-xr-x ... original_mpi.x*              executable built by compile.x
-rw-r--r-- ... run_original.llscript        LoadLeveler script to run 1 cpu job
-rw-r--r-- ... run_original_16cpu.llscript  LoadLeveler script to run 16 cpu job

Everything in the opt directory is readable except the source code ;-) I will make this available at the end of today.
Running Code
• We will use the express queue, so you must login to “we need to pick a hostname”.
• Copy the original code and scripts to a directory that is owned by you.
• I recommend you do:
  » cd /gpfs
  » mkdir username (substitute your guest username here)
  » cd username
  » cp /gpfs/projects/summer_inst/code_opt/* ./
• Edit the original code with the editor of your choice and apply any optimizations you think it needs.
• Edit the original code with the editor of your choice and apply any optimizations you think it needs.
Running Code 2
• Once you have optimized some of the code you can compile it with the ./compile.x script (will compile both serial and parallel) or, for single cpu, with:

  xlf90 -o original.x -L/usr/lib -lmass -O3 -qtune=pwr4 -qarch=pwr4 -bmaxdata:0x80000000 original.f
• You then need to submit your job to the express queue
Code Results
• The original code should produce the following answers:
• Num Pairs = 18652660 (your code should get this exactly)
• Total E = 54364.4699664615910 (your code should get this to within 12 s.f.)