SAN DIEGO SUPERCOMPUTER CENTER: Single Processor Code Optimization for the Datastar (Power 4) Architecture, by Ross Walker, with some modifications by Grimshaw


Page 1

Single Processor Code Optimization for the
Datastar (Power 4) Architecture

by Ross Walker

with some modifications by Grimshaw

Page 2

Why optimize your code for single cpu performance?

Advantages

• More calculation in a given time.
• Fewer SUs per simulation.
• Faster time to solution.
• Better chance of successful MRAC/LRAC proposals.

Page 3

REMEMBER: WALLCLOCK TIME IS EVERYTHING!

• The only metric that ultimately matters is wallclock time to solution.
• Wallclock is how long it takes to run your simulation.
• Wallclock is how much you get charged for.
• Wallclock is how long your code is blocking other users from using the machine.

• Lies, damn lies, and statistics...
• TFlops/PFlops numbers are ‘worse’ than statistics.
• You can easily write a code that gets very high Flops numbers but has a longer time to solution.

Page 4

Why optimize your code for single cpu performance?

Disadvantages

• Time consuming – do not waste weeks optimizing a one-off code that will run for 1 hour.
• Can make the code harder to read and debug.
• Can adversely affect parallel scaling.
• Different architectures can respond in different ways.

Page 5

When to optimize?

• Code optimization is an iterative process requiring time, energy and thought.

• Performance tuning is recommended for:

1) production codes that are widely distributed and often used in the research community

2) projects with limited allocation (to maximize available compute hours).

Page 6

Optimization Strategy

This talk will concentrate on how to carry out the code-tuning stage of the overall optimization strategy.

Page 7

Optimization Strategy

• Aim: To try to minimize the amount of work the compiler is required to do.

DO NOT rely on the compiler to optimize bad code.

Use a two-pronged approach:
1) Write easily optimizable code to begin with.
2) Once the code is debugged and working, try more aggressive optimizations.

Page 8

Example Code

• This talk will go through a number of simple examples of bad coding, show what compilers can and cannot do, and show how you can work ‘with’ the compiler to produce faster code.

• Example code is in:

/gpfs/projects/summer_inst/code_opt/

After this talk you will have an opportunity to try to optimize this code. There will be prizes for the most efficient codes.

Page 9

The Power 4 Architecture (Datastar)

Page 10

The Power 4 Architecture

• Remember: Clock speed is NOT everything.

CPU                     MHz    CINT2000 base/peak   CFP2000 base/peak
AMD Athlon XP 1900+     1600   677/701              588/634
HP PA-8700              750    568/604              526/576
IBM POWER4 (1 CPU)      1300   790/814              1098/1169
Intel Itanium           800    314/314              645
Intel Pentium 4         2200   771/784              766/777
Sun UltraSPARC III-Cu   900    470/533              629/731

Page 11

Why is the Power 4 chip faster for a lower clock speed?

• The speed comes from the inherent parallelism designed into the chip.
• 2 floating point units per core.
• 2 fixed point units per core.

Each FP unit can perform 1 fused multiply add (FMA) per clock tick. Thus the two floating point units can perform a theoretical maximum of 4 floating point operations per clock cycle.

• You should design your code with this in mind.

Page 12

What compilers will not do.

• They will not reorder large memory arrays.
• They will not factor/simplify large equations.
• They will not replace inefficient logic.
• They will not vectorize large complex loops.
• They will not fix inefficient memory access.
• They will not detect and remove excess work.
• They will not automatically invert divisions.
• ...

Page 13

How can we help the compiler?

• We should aim to help the compiler as much as possible by designing our code to match the machine architecture.
• Try to always move linearly in memory (cache hits).
• Factor / simplify equations as much as possible.
• Think carefully about how we do things. Is there a more efficient approach?
• Remove excess work.
• Avoid branching inside loops.

I will take you through a series of simple examples that you can then use to try and optimize the example code.

Page 14

Example ‘bad’ code

• In /gpfs/projects/summer_inst/code_opt/ is an example piece of code that we will attempt to optimize.

• This code solves the following ‘fictional’ equation:

    E = Σ_i Σ_{j<i, r_ij <= cut} [ e^(r_ij*q_i) * e^(r_ij*q_j) / r_ij + 1/a ]

where r_ij is the distance between atoms i and j, and q_i, q_j are per-atom parameters.

Page 15

Example ‘bad’ Code

The main loop of the code is as follows:

    total_e = 0.0d0
    cut_count = 0
    do i = 1, natom
      do j = 1, natom
        if ( j < i ) then !Avoid double counting.
          vec2 = (coords(1,i)-coords(1,j))**2 + (coords(2,i)-coords(2,j))**2 + &
                 (coords(3,i)-coords(3,j))**2 !X^2 + Y^2 + Z^2
          rij = sqrt(vec2)
          !Check if this is below the cut off
          if ( rij <= cut ) then
            cut_count = cut_count + 1 !Increment the counter of pairs below cutoff
            current_e = (exp(rij*q(i))*exp(rij*q(j)))/rij
            total_e = total_e + current_e + 1.0d0/a
          end if
        end if
      end do
    end do

While only short, there is a lot of potential optimization available in this code.

Page 16

Step 1 – Select best compiler / library options

• We have sqrt’s and exponentials.
• We also have multiple loops.

Hence it is essential that we compile with optimization and link against IBM’s accelerated math library (mass):

    xlf90 -o original.x -O3 -qarch=pwr4 -qtune=pwr4 -L/usr/lib -lmass original.f

Compiler Options   Execution time (s)
-O0                552.0
-O3                183.3
-O5                179.3
-O3 -lmass         176.8
-O5 -lmass         177.3

Page 17

Step 2 – Code Optimization

• We can modify the code to make the compiler’s job easier.

• I will take you through a series of examples and then you can use these to optimize the example code.

Page 18

Optimization Tip 1

• Avoid expensive mathematical operations as much as possible.
• Addition, subtraction, multiplication = very fast.
• Division and exponentiation = slow.
• Square root = very slow. (Inverse square root can sometimes be faster.)

Operation              Min cycles per iteration (L1 cache)
x(i) = y(i)            1.7
x(i) = x(i) + y(i)     1.7
x(i) = x(i) + s*y(i)   1.7
x(i) = 1/y(i)          15.1
x(i) = sqrt(y(i))      18.1

Page 19

Optimization Tip 1 Contd...

• E.g. in our example code we need to test whether r_ij is below a cutoff:

    rij = sqrt(x^2+y^2+z^2)
    if (rij <= cut) then
      ...
    end if

• We can avoid doing excessive square roots by taking into account that cut is a constant, and so just comparing against cut^2:

    vec2 = x^2+y^2+z^2
    if (vec2 <= cut*cut) then
      rij = sqrt(vec2)
      ...
    end if

Page 20

Optimization Tip 2

• Do not rely on the compiler to simplify / factorize equations for you. You should do this manually yourself. E.g.:

    e^(R_ij*q_i) * e^(R_ij*q_j) = e^(R_ij*(q_i + q_j))

• This replaces 2 expensive exponentials with just 1.
• The same is true for excessive divisions, square roots etc.
• Try expanding the equation that your code represents and then refactorizing it, with the intention of avoiding as many square roots, exponentials and divisions as possible.

Page 21

Optimization Tip 3

• Invert divisions wherever possible.

    do i = 1, N
      do j = 1, N
        y(j,i) = x(j,i) / r(i)
      end do
    end do

Becomes:

    do i = 1, N
      oner = 1.0d0/r(i)
      do j = 1, N
        y(j,i) = x(j,i) * oner
      end do
    end do

This replaces N^2 divisions with N divisions and N^2 (much cheaper) multiplications.

Page 22

Optimization Tip 3 (Extended Example)

• One can extend this idea to situations where something requires both a multiplication by a distance and a division by a distance. E.g.

    do j = 1, iminus
      vec(1:3) = qmx(1:3) - qmcoords(1:3,j)
      r2 = vec(1)**2 + vec(2)**2 + vec(3)**2
      onerij = 1.0d0/sqrt(r2)
      rij = r2*onerij   !r2/sqrt(r2) = sqrt(r2) = rij
      ...
      ... = ... + param*rij
      sum = sum + q(j)*onerij
    end do

This takes advantage of the fact that on some systems an inverse square root followed by a multiplication can be faster than a square root followed by a division.

Page 23

Optimization Tip 4

• Avoid branching in inner loops – this leads to cache misses and also prevents the compiler from vectorizing loops and making use of the multiple floating point units.

Branch inside the loop (bad):

    do i=1,n
      if (r < 1.0e-16) then
        a(i)=0.0; b(i)=0.0; c(i)=0.0
      else
        a(i)=x(i)/r
        b(i)=y(i)/r
        c(i)=z(i)/r
      end if
    end do

Branch hoisted out of the loop (good):

    if (r < 1.0e-16) then
      do i=1,n
        a(i)=0.0; b(i)=0.0; c(i)=0.0
      end do
    else
      do i=1,n
        a(i)=x(i)/r
        b(i)=y(i)/r
        c(i)=z(i)/r
      end do
    end if

Page 24

Optimization Tip 4 Contd...

• How can we improve this code even more?

    if (r < 1.0e-16) then
      do i=1,n
        a(i)=0.0; b(i)=0.0; c(i)=0.0
      end do
    else
      do i=1,n
        a(i)=x(i)/r
        b(i)=y(i)/r
        c(i)=z(i)/r
      end do
    end if

Invert the divisions as well:

    if (r < 1.0e-16) then
      do i=1,n
        a(i)=0.0; b(i)=0.0; c(i)=0.0
      end do
    else
      oner=1.0d0/r
      do i=1,n
        a(i)=x(i)*oner
        b(i)=y(i)*oner
        c(i)=z(i)*oner
      end do
    end if

Page 25

Optimization Tip 4 Contd.

• How can we remove the following if statement from the example code?
• Hint: consider changing the loop indexes.

    do i = 1, natom
      do j = 1, natom
        if ( j < i ) then !Avoid double counting.
          vec2 = (coords(1,i)-coords(1,j))**2 + (coords(2,i)-coords(2,j))**2 + &
                 (coords(3,i)-coords(3,j))**2 !X^2 + Y^2 + Z^2
          rij = sqrt(vec2)
          !Check if this is below the cut off
          if ( rij <= cut ) then
            cut_count = cut_count+1 !Increment the counter of pairs below cutoff
            current_e = (exp(rij*q(i))*exp(rij*q(j)))/rij
            total_e = total_e + current_e + 1.0d0/a
          end if
        end if
      end do
    end do

Page 26

Optimization Tip 5

• Ensure that loop bounds are constant while a loop is executing. E.g.:

    do i = 2, N
      ...
      do j = 1, i-1
        ...
      end do
    end do

Some compilers will not recognize that i-1 is constant during the j loop and so will re-evaluate i-1 on each loop iteration. We can help the compiler out here as follows:

    do i = 2, N
      ...
      iminus = i-1
      do j = 1, iminus
        ...
      end do
    end do

Page 27

Optimization Tip 6

• Avoid excessive array lookups – factor array lookups that are not dependent on the inner loop out of the loop.
• Some compilers will do this for you. Others will not.

    do i = 1, N
      do j = 1, N
        ...
        sum = sum + x(j)*x(i)
      end do
    end do

Becomes:

    do i = 1, N
      xi = x(i)
      do j = 1, N
        ...
        sum = sum + x(j)*xi
      end do
    end do

x(i) becomes a scalar (xi) in the inner loop.

Page 28

Optimization Tip 7

• Always try to use stride-1 memory access.
• Traverse linearly in memory.
• Maximizes cache hits.
• In Fortran this means you should always loop over the first array index in the inner loop. E.g.

    do i=1, Ni
      do j=1,Nj
        do k=1,Nk
          v=x(i,j,k)
          ...
        end do
      end do
    end do

Wrong! This way round, each iteration of loop k jumps Ni*Nj*8 bytes in memory, meaning it almost always misses the cache.

Page 29

Optimization Tip 7 Contd.

• By changing the order of the loop we can ensure that we always stride linearly through the x array.

• Note: in C memory is laid out in the reverse fashion. Ideally you should use a linear array and do the necessary offset arithmetic yourself.

    do k=1, Nk
      do j=1,Nj
        do i=1,Ni
          v=x(i,j,k)
          ...
        end do
      end do
    end do

Page 30

Optimization Tip 7 Contd.

• If you have multiple nested loops that loop entirely through an array it can often be more beneficial to replace a multi-dimensional array with a 1-D array and do the arithmetic yourself. E.g.

• This avoids unnecessary pointer lookups and can make it easier to vectorize the code.

    Nitr = Nk*Nj*Ni
    do k=1, Nitr
      v=x(k)
      ...
    end do

Page 31

Optimization Tip 7 Contd.

• Sometimes (you need to test), especially on the Power 4 architecture, it can help to split arrays of the form (1:x,N), where x is a small (ca. 3) constant, into a set of x 1-D arrays. E.g.

    do i = 1, N
      ...
      do j = 1, N
        vec2=(xi-crd(1,j))**2+(yi-crd(2,j))**2+(zi-crd(3,j))**2
        ...
      end do
    end do

BECOMES

    do i = 1, N
      ...
      do j = 1, N
        vec2=(xi-x(j))**2+(yi-y(j))**2+(zi-z(j))**2
        ...
      end do
    end do

The reasoning behind this is that the Power 4 chip has multiple cache lines, so by splitting the array into 3 separate arrays the compiler can utilize 3 pipelines simultaneously. (You need to test whether this works for your code!)

Page 32

Vectorization

• Sometimes if you have a large amount of computation occurring inside a loop it can be beneficial to explicitly vectorize the code and make use of specific vector math libraries.

• Often the compiler can do this vectorization for you (xlf90 will explicitly try if –qhot is specified).

• The compiler can typically only vectorize small inner loops; complex loop structures will not get vectorized. Here your code can benefit from explicit vectorization.

Page 33

Why Does Vectorization Help on a Scalar Machine?

• The machine is actually superscalar...
• It has a pipeline for doing computation.
• It has multiple floating point units.

• Vectorization makes it easy for the compiler to use multiple floating point units as we have a large number of operations that are all independent of each other.

• Vectorization makes it easy to fill the pipeline.

Page 34

Why Does Vectorization Help on a Scalar Machine?

• On Power 4 a single FMA has to go through 6 stages in the pipeline.
• After 1 clock cycle the first FMA operation has completed stage 1 and moves on to stage 2. The processor can now start stage 1 of the second FMA operation in parallel with stage 2 of the first operation, and so on.

Page 35

Why Does Vectorization Help on a Scalar Machine?

• After 6 operations, pipelining of subsequent FMA operations gives one result every clock cycle.
• Pipeline latency is thus hidden beyond 6 operations.
• The Power 4 chip has two floating point units, so it needs a minimum of 12 independent FMA operations to be fully pipelined.
• Thus if we can split our calculation up into a series of long ‘independent’ vector calculations, we can maximize the floating point performance of the chip.

Page 36

Vectorization Contd.

Scalar Code:

    do i = 2, N
      do j = 1, i-1
        vec2 = ...
        onerij = 1.0d0/sqrt(vec2)
        rij = vec2*onerij
        exp1 = exp(rij)
        esum = esum + exp1*onerij
      end do
    end do

Page 37

Vectorization Contd.

Vector Code:

    loopcount = 0
    do i = 2, N
      do j = 1, i-1
        loopcount = loopcount+1
        vec2 = ...
        vectmp1(loopcount) = vec2
      end do
    end do
    !Begin vector operations
    call vrsqrt(vectmp2,vectmp1,loopcount)
    !vectmp2 now contains onerij
    vectmp1(1:loopcount) = vectmp1(1:loopcount) * vectmp2(1:loopcount)
    !vectmp1 now contains rij
    call vexp(vectmp1,vectmp1,loopcount)
    !vectmp1 now contains exp(rij)
    do i = 1, loopcount
      esum = esum + vectmp1(i)*vectmp2(i)
    end do

For comparison, the scalar code:

    do i = 2, N
      do j = 1, i-1
        vec2 = ...
        onerij = 1.0d0/sqrt(vec2)
        rij = vec2*onerij
        exp1 = exp(rij)
        esum = esum + exp1*onerij
      end do
    end do

Page 38

IBM’s vector math library

• IBM has a specially tuned vector math library for use on Datastar. This library is called libmassvp4 and can be linked into your code with:

    xlf90 -o foo.x -O3 -L/usr/lib -lmassvp4 foo.f

• Support is provided for a number of vector functions including:
  vrec (vectored inverse), vexp (vectored exp),
  vsqrt (vectored sqrt), vrsqrt (vectored inverse sqrt),
  vlog (vectored ln), vcos (vectored cosine),
  vtanh (vectored tanh)

Page 39

The Example Code (A Competition)

• As mentioned previously, an example piece of code is available in /gpfs/projects/summer_inst/code_opt/ on Datastar. (COPY THIS TO YOUR OWN DIRECTORY ON GPFS BEFORE EDITING.)

• Provided in this directory is the unoptimized code (original.f), a script to compile it (compile.x), and a LoadLeveler script to submit it on 1 cpu (llsubmit submit_orig.ll).

• There is also a set of optimized codes that I have created (both scalar and vector). These are read-protected at present and will be made readable to all at the end of today.

Page 40

The Example Code (A Competition)

• Your mission (should you choose to accept it) is to optimize this example code so that it gives the same answer (to 12 s.f.) on a single processor.

Page 41

The Example Code (A Competition)

• My timings are as follows (for 1 cpu of a 1.5 GHz Datastar node):

Code Version                                     Execution Time (s)
Original code (original.f)                       176.80
Standard optimized code (optimized_standard.f)    41.46
Vectorized code (optimized_vector.f)              30.64

PRIZES!!! – There will be a prize for each person who gets better than 31.0 seconds on 1 cpu at 1.5 GHz and the answer to within 12 s.f.

Page 42

Directory Contents

• Directory = /gpfs/projects/summer_inst/code_opt/

-rwxr-xr-x ... compile.x*                  ./compile.x to compile codes
-rw-r--r-- ... coords.dat                  coordinate file read by code
drwxr-xr-x ... opt/                        directory containing optimized solution
-rw-r--r-- ... original.f                  original single cpu code (start here)
-rw-r--r-- ... original.out                output from run
-rwxr-xr-x ... original.x*                 executable built by compile.x
-rw-r--r-- ... original_16cpu.out          output from 16 cpu run
-rw-r--r-- ... original_mpi.f              original multi cpu code (start here 2)
-rwxr-xr-x ... original_mpi.x*             executable built by compile.x
-rw-r--r-- ... run_original.llscript       LoadLeveler script to run 1 cpu job
-rw-r--r-- ... run_original_16cpu.llscript LoadLeveler script to run 16 cpu job

Everything in the opt directory is readable except the source code ;-) I will make this available at the end of today.

Page 43

Running Code

• We will use the express queue so you must login to “we need to pick a hostname”

• Copy the original code and scripts to a directory that is owned by you.
• I recommend you do:

» cd /gpfs
» mkdir username (substitute your guest username here)
» cd username
» cp /gpfs/projects/summer_inst/code_opt/* ./

• Edit the original code with the editor of your choice and apply any optimizations you think it needs.

Page 44

Running Code 2

• Once you have optimized some of the code you can compile it with the ./compile.x script (will compile both serial and parallel) or with (for single cpu):

    xlf90 -o original.x -L/usr/lib -lmass -O3 -qtune=pwr4 -qarch=pwr4 -bmaxdata:0x80000000 original.f

• You then need to submit your job to the express queue.

Page 45

Code Results

• The original code should produce the following answers:

• Num Pairs = 18652660 (your code should get this exactly)

• Total E = 54364.4699664615910 (your code should get this to within 12 s.f.)