TRANSCRIPT
SAN DIEGO SUPERCOMPUTER CENTER
Single Processor Code Optimization for the
Datastar (Power 4) Architecture
by Ross Walker
with some modifications by Grimshaw
Why optimize your code for single cpu performance?
Advantages
• More calculation in a given time.
• Fewer SUs per simulation.
• Faster time to solution.
• Better chance of successful MRAC/LRAC proposals.
REMEMBER: WALLCLOCK TIME IS EVERYTHING!
• The only metric that ultimately matters is wallclock time to solution.
• Wallclock is how long it takes to run your simulation.
• Wallclock is how much you get charged for.
• Wallclock is how long your code is blocking other users from using the machine.

• Lies, Damn Lies and Statistics...
• TFlops/PFlops numbers are ‘worse’ than statistics.
• You can easily write a code that gets very high Flops numbers but has a longer time to solution.
Why optimize your code for single cpu performance?
Disadvantages
• Time consuming – do not waste weeks optimizing a one-off code that will run for 1 hour.
• Can make the code harder to read and debug.
• Can adversely affect parallel scaling.
• Different architectures can respond in different ways.
When to optimize?
• Code optimization is an iterative process requiring time, energy and thought.
• Performance tuning is recommended for:
1) production codes that are widely distributed and often used in the research community
2) projects with limited allocation (to maximize available compute hours).
Optimization Strategy
This talk will concentrate on how to carry out this section.
Optimization Strategy
• Aim: To try to minimize the amount of work the compiler is required to do.
DO NOT rely on the compiler to optimize bad code.
Use a two-pronged approach:
1) Write easily optimizable code to begin with.
2) Once the code is debugged and working, try more aggressive optimizations.
Example Code
• This talk will go through a number of simple examples of bad coding and show what compilers can and can’t do and how you can work ‘with’ the compiler to produce faster code.
• Example code is in:
/gpfs/projects/summer_inst/code_opt/
After this talk you will have an opportunity to try to optimize this code. There will be prizes for the most efficient codes.
The Power 4 Architecture (Datastar)
The Power 4 Architecture
• Remember: Clock speed is NOT everything.
CPU                    MHz   CINT2000 base/peak  CFP2000 base/peak
AMD Athlon XP 1900+    1600  677/701             588/634
HP PA-8700             750   568/604             526/576
IBM POWER4 (1 CPU)     1300  790/814             1098/1169
Intel Itanium          800   314/314             645
Intel Pentium 4        2200  771/784             766/777
Sun UltraSPARC III-Cu  900   470/533             629/731
Why is the Power 4 chip faster for a lower clock speed?
• The speed comes from the inherent parallelism designed into the chip.
• 2 floating point units per core.
• 2 fixed point units per core.

Each FP unit can perform 1 fused multiply-add (FMA) per clock tick. Thus the two floating point units can perform a theoretical maximum of 4 floating point operations per clock cycle.
• You should design your code with this in mind.
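The 4-flops-per-cycle figure is simple arithmetic; here it is as a tiny C sketch (the talk's own examples are Fortran, and this helper is purely illustrative, not Datastar code):

```c
#include <assert.h>

/* Theoretical peak flop rate = clock * FP units * flops per FMA.
   For one Power 4 core: 1.3e9 * 2 units * 2 flops = 5.2 Gflop/s. */
static double peak_flops(double clock_hz, int fp_units, int flops_per_fma) {
    return clock_hz * (double)fp_units * (double)flops_per_fma;
}
```

Real codes rarely approach this peak; the rest of the talk is about closing some of that gap.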
What compilers will not do.
• They will not reorder large memory arrays.
• They will not factor/simplify large equations.
• They will not replace inefficient logic.
• They will not vectorize large complex loops.
• They will not fix inefficient memory access.
• They will not detect and remove excess work.
• They will not automatically invert divisions.
• ...
How can we help the compiler?
• We should aim to help the compiler as much as possible by designing our code to match the machine architecture.
• Try to always move linearly in memory (cache hits).
• Factor / simplify equations as much as possible.
• Think carefully about how we do things. Is there a more efficient approach?
• Remove excess work.
• Avoid branching inside loops.

I will take you through a series of simple examples that you can then use to try and optimize the example code.
I will take you through a series of simple examples that you can then use to try and optimize the example code.
Example ‘bad’ code
• In /gpfs/projects/summer_inst/code_opt/ is an example piece of code that we will attempt to optimize.
• This code solves the following ‘fictional’ equation:

  E = Σ_i Σ_{j<i}  [ e^(r_ij·q_i) · e^(r_ij·q_j) / r_ij  +  1/a ]   for r_ij <= cut
Example ‘bad’ Code

The main loop of the code is as follows:

  total_e = 0.0d0
  cut_count = 0
  do i = 1, natom
    do j = 1, natom
      if ( j < i ) then !Avoid double counting.
        vec2 = (coords(1,i)-coords(1,j))**2 + (coords(2,i)-coords(2,j))**2 &
             + (coords(3,i)-coords(3,j))**2 !X^2 + Y^2 + Z^2
        rij = sqrt(vec2)
        !Check if this is below the cut off
        if ( rij <= cut ) then
          cut_count = cut_count + 1 !Increment the counter of pairs below cutoff
          current_e = (exp(rij*q(i))*exp(rij*q(j)))/rij
          total_e = total_e + current_e + 1.0d0/a
        end if
      end if
    end do
  end do

While only short, there is a lot of potential optimization available in this code.
Step 1 – Select best compiler / library options
• We have sqrt’s and exponentials.
• We also have multiple loops.

Hence it is essential that we compile with optimization and link against IBM’s accelerated math library (mass):

  xlf90 –o original.x –O3 –qarch=pwr4 –qtune=pwr4 –L/usr/lib –lmass original.f
Compiler Options Execution time (s)
-O0 552.0
-O3 183.3
-O5 179.3
-O3 -lmass 176.8
-O5 -lmass 177.3
Step 2 – Code Optimization
• We can modify the code to make the compiler’s job easier.
• I will take you through a series of examples and then you can use these to optimize the example code.
Optimization Tip 1

• Avoid expensive mathematical operations as much as possible.
• Addition, subtraction, multiplication = very fast.
• Division and exponentiation = slow.
• Square root = very slow. (Inverse square root can sometimes be faster.)
Operation Min Cycles per iteration (L1 Cache)
x(i) = y(i) 1.7
x(i) = x(i) + y(i) 1.7
x(i) = x(i) + s*y(i) 1.7
x(i) = 1/y(i) 15.1
x(i) = sqrt(y(i)) 18.1
Optimization Tip 1 Contd...
• E.g. in our example code we need to test if rij is <= a cutoff:

  rij = sqrt(x^2+y^2+z^2)
  if (rij <= cut) then
    ...
  end if

• We can avoid doing excessive square roots by taking into account that cut is a constant and so just comparing to cut^2:

  vec2 = x^2+y^2+z^2
  if (vec2 <= cut*cut) then
    rij = sqrt(vec2)
    ...
  end if
Optimization Tip 2
• Do not rely on the compiler to simplify / factorize equations for you. You should manually do this yourself. E.g.:

  e^(R_ij·q_i) · e^(R_ij·q_j) = e^(R_ij·(q_i + q_j))

• This replaces 2 expensive exponentials with just 1.
• The same is true for excessive divisions, square roots etc.
• Try expanding the equation that your code represents and then refactorizing it with the intention of avoiding as many square roots, exponentials and divisions as possible.
Optimization Tip 3

• Invert divisions wherever possible.

  do i = 1, N
    do j = 1, N
      y(j,i) = x(j,i) / r(i)
    end do
  end do

Becomes:

  do i = 1, N
    oner = 1.0d0/r(i)
    do j = 1, N
      y(j,i) = x(j,i) * oner
    end do
  end do

This replaces N^2 divisions with N divisions and N^2 (much cheaper) multiplications.
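The same transformation in C, using a flattened 2-D array (illustrative names and sizes; note x*(1/r) can differ from x/r in the last bit, which is one reason the competition only asks for agreement to 12 s.f.):

```c
#include <assert.h>

/* N*N divisions... */
static void scale_rows_div(double *y, const double *x, const double *r, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            y[i*n + j] = x[i*n + j] / r[i];
}

/* ...become N divisions plus N*N multiplications. */
static void scale_rows_mul(double *y, const double *x, const double *r, int n) {
    for (int i = 0; i < n; i++) {
        double oner = 1.0 / r[i];       /* one division per row */
        for (int j = 0; j < n; j++)
            y[i*n + j] = x[i*n + j] * oner;
    }
}
```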
Optimization Tip 3 (Extended Example)

• One can extend this idea to situations where something requires both a multiplication by a distance and a division by a distance. E.g.:

  do j = 1, iminus
    vec(1:3) = qmx(1:3) - qmcoords(1:3,j)
    r2 = vec(1)**2 + vec(2)**2 + vec(3)**2
    onerij = 1.0d0/sqrt(r2) !1.0d0/rij
    rij = r2*onerij
    ...
    ... = ... + param*rij
    sum = sum + q(j)*onerij
  end do

This takes advantage of the fact that on some systems an inverse square root followed by a multiplication can be faster than a square root followed by a division.
Optimization Tip 4

• Avoid branching in inner loops – this leads to cache misses and also prevents the compiler from vectorizing loops and making use of the multiple floating point units.

  do i=1,n
    if (r < 1.0e-16) then
      a(i)=0.0; b(i)=0.0; c(i)=0.0
    else
      a(i)=x(i)/r
      b(i)=y(i)/r
      c(i)=z(i)/r
    end if
  end do

Becomes:

  if (r < 1.0e-16) then
    do i=1,n
      a(i)=0.0; b(i)=0.0; c(i)=0.0
    end do
  else
    do i=1,n
      a(i)=x(i)/r
      b(i)=y(i)/r
      c(i)=z(i)/r
    end do
  end if
Optimization Tip 4 Contd...
• How can we improve this code even more?
Invert the divisions:

  if (r < 1.0e-16) then
    do i=1,n
      a(i)=0.0; b(i)=0.0; c(i)=0.0
    end do
  else
    do i=1,n
      a(i)=x(i)/r
      b(i)=y(i)/r
      c(i)=z(i)/r
    end do
  end if

Becomes:

  if (r < 1.0e-16) then
    do i=1,n
      a(i)=0.0; b(i)=0.0; c(i)=0.0
    end do
  else
    oner=1.0d0/r
    do i=1,n
      a(i)=x(i)*oner
      b(i)=y(i)*oner
      c(i)=z(i)*oner
    end do
  end if
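Both improvements together, sketched in C (names and sizes are illustrative):

```c
#include <assert.h>

/* The loop-invariant test on r is hoisted out of the loop, and the
   division by r is replaced by one reciprocal plus multiplies. */
static void scale_or_zero(double *a, const double *x, int n, double r) {
    if (r < 1.0e-16) {                  /* tested once, not n times */
        for (int i = 0; i < n; i++)
            a[i] = 0.0;
    } else {
        double oner = 1.0 / r;          /* one division in total */
        for (int i = 0; i < n; i++)
            a[i] = x[i] * oner;
    }
}
```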
Optimization Tip 4 Contd.

• How can we remove the following if statement from the example code?
• Hint: consider changing the loop indexes.

  do i = 1, natom
    do j = 1, natom
      if ( j < i ) then !Avoid double counting.
        vec2 = (coords(1,i)-coords(1,j))**2 + (coords(2,i)-coords(2,j))**2 &
             + (coords(3,i)-coords(3,j))**2 !X^2 + Y^2 + Z^2
        rij = sqrt(vec2)
        !Check if this is below the cut off
        if ( rij <= cut ) then
          cut_count = cut_count + 1 !Increment the counter of pairs below cutoff
          current_e = (exp(rij*q(i))*exp(rij*q(j)))/rij
          total_e = total_e + current_e + 1.0d0/a
        end if
      end if
    end do
  end do
Optimization Tip 5

• Ensure that loop indexes are constant while a loop is executing. E.g.:

  do i = 2, N
    ...
    do j = 1, i-1
      ...
    end do
  end do

Some compilers will not recognize that i-1 is constant during the j loop and so will re-evaluate i-1 on each loop iteration. We can help the compiler out here as follows:

  do i = 2, N
    ...
    iminus = i-1
    do j = 1, iminus
      ...
    end do
  end do
Optimization Tip 6
• Avoid excessive array lookups – factor array lookups that are not dependent on the inner loop out of the loop.
• Some compilers will do this for you. Others will not.

  do i = 1, N
    do j = 1, N
      ...
      sum = sum + x(j)*x(i)
    end do
  end do

Becomes:

  do i = 1, N
    xi = x(i)
    do j = 1, N
      ...
      sum = sum + x(j)*xi
    end do
  end do

x(i) becomes a scalar in the inner loop.
Optimization Tip 7

• Always try to use stride 1 memory access.
• Traverse linearly in memory.
• Maximizes cache hits.
• In Fortran this means you should always loop over the first array index in the inner loop. E.g.:

  do i=1, Ni
    do j=1,Nj
      do k=1,Nk
        v=x(i,j,k)
        ...
      end do
    end do
  end do

Wrong! This way round each iteration of the k loop jumps a total of Ni*Nj*8 bytes, meaning it almost always misses the cache.
Optimization Tip 7 Contd.
• By changing the order of the loops we can ensure that we always stride linearly through the x array:

  do k=1, Nk
    do j=1,Nj
      do i=1,Ni
        v=x(i,j,k)
        ...
      end do
    end do
  end do

• Note: in C memory is laid out in the reverse fashion. Ideally you should use a linear array and do the necessary offset arithmetic yourself.
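For C the equivalent rule (last index fastest, since C is row-major) looks like this; the array dimensions are illustrative:

```c
#include <assert.h>

#define NI 4
#define NJ 5
#define NK 6

/* C is row-major, so for x[i][j][k] the LAST index must vary fastest
   in the inner loop to get stride-1 access. */
static double sum_all(double x[NI][NJ][NK]) {
    double s = 0.0;
    for (int i = 0; i < NI; i++)            /* slowest-varying outermost */
        for (int j = 0; j < NJ; j++)
            for (int k = 0; k < NK; k++)    /* fastest-varying innermost */
                s += x[i][j][k];
    return s;
}
```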
Optimization Tip 7 Contd.
• If you have multiple nested loops that loop entirely through an array it can often be more beneficial to replace a multi-dimensional array with a 1-D array and do the arithmetic yourself. E.g.:

  Nitr = Nk*Nj*Ni
  do k=1, Nitr
    v=x(k)
    ...
  end do

• This avoids unnecessary pointer lookups and can make it easier to vectorize the code.
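The offset arithmetic itself, sketched in C (row-major convention; names are illustrative):

```c
#include <assert.h>

/* In a C array of shape [ni][nj][nk], element [i][j][k] lives at flat
   index (i*nj + j)*nk + k.  Looping 0 .. ni*nj*nk-1 over the flat
   array therefore visits every element exactly once, in memory order. */
static int flat_index(int nj, int nk, int i, int j, int k) {
    return (i*nj + j)*nk + k;
}
```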
Optimization Tip 7 Contd.

• Sometimes (you need to test), especially on the Power 4 architecture, it can help to split arrays of the form (1:x,N), where x is a small (ca. 3) constant, into a set of x 1-D arrays. E.g.:

  do i = 1, N
    ...
    do j = 1, N
      vec2=(xi-crd(1,j))**2+(yi-crd(2,j))**2+(zi-crd(3,j))**2
      ...
    end do
  end do

BECOMES

  do i = 1, N
    ...
    do j = 1, N
      vec2=(xi-x(j))**2+(yi-y(j))**2+(zi-z(j))**2
      ...
    end do
  end do

The reasoning behind this is that the Power 4 chip has multiple cache lines, and by splitting the array into 3 separate arrays the compiler can utilize 3 pipelines simultaneously. (You need to test whether this works for your code!)
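This "split arrays" layout is what is often called structure-of-arrays; a small C sketch with illustrative names:

```c
#include <assert.h>

/* Squared distance from point (xi,yi,zi) to point j, with coordinates
   stored as three separate 1-D arrays rather than one (3,N) array --
   three independent unit-stride streams for the hardware to fetch. */
static double dist2_split(const double *x, const double *y, const double *z,
                          int j, double xi, double yi, double zi) {
    double dx = xi - x[j];
    double dy = yi - y[j];
    double dz = zi - z[j];
    return dx*dx + dy*dy + dz*dz;
}
```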
Vectorization
• Sometimes if you have a large amount of computation occurring inside a loop it can be beneficial to explicitly vectorize the code and make use of specific vector math libraries.
• Often the compiler can do this vectorization for you (xlf90 will explicitly try if –qhot is specified).
• Compilers can typically only vectorize small inner loops. Complex loop structures will not get vectorized; here your code can benefit from explicit vectorization.
Why Does Vectorization Help on a Scalar Machine?
• The machine is actually superscalar...
• It has a pipeline for doing computation.
• It has multiple floating point units.
• Vectorization makes it easy for the compiler to use multiple floating point units as we have a large number of operations that are all independent of each other.
• Vectorization makes it easy to fill the pipeline.
• On Power 4 a single FMA has to go through 6 stages in the pipeline.
• After 1 clock cycle the first FMA operation has completed stage 1 and moves on to stage 2. Processor can now start processing stage 1 of the second FMA operation in parallel with stage 2 of first operation etc etc...
Why Does Vectorization Help on a Scalar Machine?
• After 6 operations, pipelining of subsequent FMA operations gives one result every clock cycle.
• Pipeline latency is thus hidden beyond 6 operations.
• The Power 4 chip has two floating point units, so it needs a minimum of 12 independent FMA operations to be fully pipelined.
• Thus if we can split our calculation up into a series of long ‘independent’ vector calculations we can maximize the floating point performance of the chip.
Why Does Vectorization Help on a Scalar Machine?
Vectorization Contd.
Scalar Code:

  do i = 2, N
    do j = 1, i-1
      vec2 = ...
      onerij = 1.0d0/sqrt(vec2)
      rij = vec2*onerij
      exp1 = exp(rij)
      esum = esum + exp1*onerij
    end do
  end do
Vectorization Contd.
Scalar Code:

  do i = 2, N
    do j = 1, i-1
      vec2 = ...
      onerij = 1.0d0/sqrt(vec2)
      rij = vec2*onerij
      exp1 = exp(rij)
      esum = esum + exp1*onerij
    end do
  end do

Vector Code:

  loopcount = 0
  do i = 2, N
    do j = 1, i-1
      loopcount = loopcount+1
      vec2 = ...
      vectmp1(loopcount) = vec2
    end do
  end do
  !Begin vector operations
  call vrsqrt(vectmp2,vectmp1,loopcount)
  !vectmp2 now contains onerij
  vectmp1(1:loopcount) = vectmp1(1:loopcount) * vectmp2(1:loopcount)
  !vectmp1 now contains rij
  call vexp(vectmp1,vectmp1,loopcount)
  !vectmp1 now contains exp(rij)
  do i = 1, loopcount
    esum = esum + vectmp1(i)*vectmp2(i)
  end do
IBM’s vector math library
• IBM has a specially tuned vector math library for use on Datastar. This library is called libmassvp4 and can be compiled into your code with:

  xlf90 -o foo.x -O3 -L/usr/lib -lmassvp4 foo.f

• Support is provided for a number of vector functions including:
  vrec (vectored inverse), vexp (vectored exp),
  vsqrt (vectored sqrt), vrsqrt (vectored inverse sqrt),
  vlog (vectored ln), vcos (vectored cosine),
  vtanh (vectored tanh)
The Example Code (A Competition)
• As mentioned previously, an example piece of code is available in /gpfs/projects/summer_inst/code_opt/ on Datastar. (COPY THIS TO YOUR OWN DIRECTORY ON GPFS BEFORE EDITING.)
• Provided in this directory is the unoptimized code (original.f), a script to compile it (compile.x) and a LoadLeveler script to submit it on 1 cpu (llsubmit submit_orig.ll).
• There is also a set of optimized codes that I have created (both scalar and vector). These are read-protected at present and will be made readable to all at the end of today.
The Example Code (A Competition)
• Your mission (should you choose to accept it) is to optimize this example code so that it gives the same answer (to 12 s.f.) on a single processor in as little wallclock time as possible.
The Example Code (A Competition)
• My timings are as follows (for 1 cpu of a Datastar 1.5 GHz node):

Code Version                                    Execution Time (s)
Original Code (original.f)                      176.80
Standard Optimized Code (optimized_standard.f)  41.46
Vectorized Code (optimized_vector.f)            30.64

PRIZES!!! – There will be a prize for each person who gets better than 31.0 seconds on 1 cpu x 1.5 GHz and the answer to within 12 s.f.
Directory Contents
• Directory = /gpfs/projects/summer_inst/code_opt/
-rwxr-xr-x ... compile.x*                   ./compile.x to compile codes
-rw-r--r-- ... coords.dat                   coordinate file read by code
drwxr-xr-x ... opt/                         directory containing optimized solution
-rw-r--r-- ... original.f                   original single cpu code (start here)
-rw-r--r-- ... original.out                 output from run
-rwxr-xr-x ... original.x*                  executable built by compile.x
-rw-r--r-- ... original_16cpu.out           output from 16 cpu run
-rw-r--r-- ... original_mpi.f               original multi cpu code (start here 2)
-rwxr-xr-x ... original_mpi.x*              executable built by compile.x
-rw-r--r-- ... run_original.llscript        LoadLeveler script to run 1 cpu job
-rw-r--r-- ... run_original_16cpu.llscript  LoadLeveler script to run 16 cpu job

Everything in the opt directory is readable except the source code ;-) I will make this available at the end of today.
Running Code
• We will use the express queue, so you must login to “we need to pick a hostname”.
• Copy the original code and scripts to a directory that is owned by you.
• I recommend you do:
  » cd /gpfs
  » mkdir username (substitute your guest username here)
  » cd username
  » cp /gpfs/projects/summer_inst/code_opt/* ./
• Edit the original code with the editor of your choice and apply any optimizations you think it needs.
• Edit the original code with the editor of your choice and apply any optimizations you think it needs.
Running Code 2
• Once you have optimized some of the code you can compile it with the ./compile.x script (will compile both serial and parallel) or, for single cpu, with:

  xlf90 -o original.x -L/usr/lib -lmass -O3 -qtune=pwr4 -qarch=pwr4 -bmaxdata:0x80000000 original.f
• You then need to submit your job to the express queue
Code Results
• The original code should produce the following answers:
• Num Pairs = 18652660 (your code should get this exactly)
• Total E = 54364.4699664615910 (your code should get this to within 12 s.f.)