the synergy between computer science and cfd modeling dov kruger anne pence alan blumberg

The Synergy between Computer Science and CFD Modeling

Dov Kruger

Anne Pence

Alan Blumberg

Background of the Davidson Laboratory Team

• Professor Alan Blumberg, co-author of POM and author of ECOMSED

• Dov Kruger, computer science, transitioning into ocean research

• Anne Pence, PhD candidate with a background in naval architecture and Oceanography

CFD is a Computing War Against Big Problems

• Overview of 5 key areas, and focus on our research in single and multiple CPU performance and accuracy– Understanding the battleground: Architectural Features of

Computers– Having a tactical plan: coding efficiency– Effective Use of resources:

• Parallelism on shared memory architectures– Auto-parallelizing compilers– OpenMP

• Parallelism using distributed computers (MPI)

– Hitting the correct target: Accuracy

Ignorance is useful, knowledge is crucial

• We questioned everything, including equation of state

• Surprising results: a new equation of state, and a new style of coding CFD models that coalesces loops and is significantly faster on modern computers

• Nothing worked until Professor Blumberg– Conveyed the analytics of the core routines– Designed suitable test cases– Debugged problems

Doubling Performance of Barotropic Mode: 10 months +

2.5 hours• Blindly optimizing the code is very difficult

– It took time to understand the model

• Problems– Validating results– Finding the right test case– Incremental test procedure

• Once the correct strategy was in place, it took only 2.5 hours to double the performance of barotropic mode– More optimizations are possible, but they get exponentially

more difficult– We propose not to do them by hand

A Sample Problem

Computer Architecture

• Locality = Speed!– Registers– Different Levels of cache– Main Memory– Swapping– Disk– Sequential access is much faster than non-sequential

• Locality is also accuracy• Multiple execution units

– The limiting factor today is memory bandwidth– ECOMSED and POM are both memory limited

Efficiency

• There is a cost structure to operations– Changes over time

• Today’s Highest Costs– Writing to memory

• Cache must flush to memory

– Reading from memory• Statistically, cached

explogsqrtDivisionMultiplication

Heuristics for Writing Fast Code

• Efficiency doesn't hurt– It may not help

• Minimize cost of operations– Different on each machine– But similar enough that heuristics will produce

good results on most architectures– On average, performance will improve if

expressions are sufficiently large

Example: Computing Bottom Stress: original

do j=2,jmm1 do i=2,imm1 ubar(i,j,kbm1)=0.5*(ua(i,j)+ua(i+1,j)) vbar(i,j,kbm1)=0.5*(va(i,j)+va(i,j+1)) enddo enddo do j=2,jmm1 do i=2,imm1 qbar(i,j)=sqrt(ubar(i,j,kbm1)*ubar(i,j,kbm1)+ $ vbar(i,j,kbm1)*vbar(i,j,kbm1)) enddo enddo

do j=2,jmm1 do i=2,imm1 if (fsm(i,j).gt.0.0) then tau(i,j,kb)=10000.*cbc(i,j)/ $ varbf(i,j)*qbar(i,j)*qbar(i,j) endif enddo enddo

Bottom Stress: 21 Times Faster

do j=2,jmm1 do i=2,imm1 tempuBAR = (UA(I,J)+UA(I+1,J)) tempVBAR = (VA(I,J)+VA(I,J+1)) TAU(I,J,KB)=CBC2(I,J)*(tempUBAR**2 + tempVBAR**2) enddoenddo

The Current State of Compiler Optimization

• Compilers typically do not rearrange floating point expressions– For fear of subtle roundoff effects– Instruction scheduling can still be done as long as the two

data streams do not interact– Example: (x * y) - (x / y) + (x + y) – (x – y)

• Common subexpressions are well-handled provided they are in the same form– Algebraic equivalences are not considered

• Constant subexpressions are pre-evaluated at compile time in C and FORTRAN

Examples: Writing Fast Code

• Only scalars are considered constant• Writing to an array is not optimized awayvar1(i,j,k) = c1 * c2 * c3var1(i,j,k) = var1(i,j,k) * c4var1(i,j,k) = c1 * c2 * c3 * c4c1 * var2(i,j,k) * c3 * var3(i,j,k) * c4c1*c3*c4 * var2(i,j,k) * var3(i,j,k)var2(i,j,k) * var3(i,j,k) * c1*c3*c4var2(i,j,k) * var3(i,j,k) * (c1*c3*c4)var1(i,j,k) / c1(i,j,k)

Skipping Land

• Goals– Save CPU by only processing water boxes– Cost very little CPU for all-water problems

• IF-tests slow down the CPU, so preferably avoid them• Stripwise “unstructured view” of the grid avoids

additional tests– inefficient on a vector machine

• k-order would be better to test an entire water column at once, but this would require re-engineering the entire model

Skipping land, continuedistart,jrow, iend

istart,jrow, iend

2,2,2 5,2,6

2,3,2 5,3,6

2,4,6

2,5,6

2,6,6

High Performance Advection

• Scalar Fluxes in one preferred direction– No memory access for flux– Full machine precision

xfluxi Xfluxi+1Var(i,j,k)=xFluxiP1 – xFluxixFluxi = xFluxiP1

Advection, contyfluxJ(1) = 0yFluxJ(im) = 0diffusiveyfluxJ(1) = 0diffusiveyFluxJ(im) = 0do k=2,kbm1 do i=2,imm1 yfluxJ(i) = 0 ! because of land boundary condition diffusiveYFluxJ(i) = 0 enddo do j=2,jmm1 xfluxI = 0 ! because of land boundary condition do i=2,imm1 xfluxIP1 = (f(i,j,k)+f(i+1,j,k)) * xmfl3d(i+1,j,k) yFluxJP1 = (f(i,j,k)+f(i,j+1,k)) * ymfl3d(i,j+1,k) diffusiveXFluxIP1 = -aamx(i,j,k) * addDTi(i,j,BACK) *$ (fb(i,j,k)-fb(i-1,j,k))*uMaskedH2_2H1(i,j) diffusiveYFluxJP1 = -aamy(i,j,k) * addDTj(i,j, BACK) *$ (fb(i,j,k)-fb(i,j-1,k))*vMaskedH1_2H2(i,j)

Advection, take 3 ff(i,j,k) =$ ($ fB(i,j,k) * volT(i,j,BACK) - !advective part$ dti2 * 0.5 * invDZ(k) *$ ($ (f(i,j,k-1)+f(i,j,k))*w(i,j,k) -$ (f(i,j,k)+f(i,j,k+1))*w(i,j,k+1)*ART(i,j) +$ xFluxIP1 - xFluxI + yFluxJP1 - yFluxJ(i)$ ) + ! diffusive part$ diffusiveXfluxIP1 - diffusiveXFluxI +$ diffusiveYFluxJP1 - diffusiveYFluxJ(i)$ ) *$ invVolT(i,j,FUTURE) xFluxI = xFluxIp1 diffusiveXFluxI = diffusiveXFluxIp1 yFluxJ(i) = yFluxJP1 diffusiveYFluxJ(i)=diffusiveYFluxJP1 enddo enddoenddo

K-order is Faster

• POM and ECOMSED are currently stored with I varying most frequently– var(i,j,k)

• For most algorithms, traversal order is irrelevant, except:– Vertical models are naturally ordered top to bottom

– A model stored and accessed in k-order can implement the vertical profile algorithms much more efficiently

Performance of PROFT/PROFS

• On a sample problem– 3 times faster for profiling temperature– 8 times faster for profiling salinity– This includes the penalty of traversing in the

wrong memory order for T,S• For a k-ordered model, code would be even faster

• Techniques as discussed before, and also– Removal of exponentiation where unnecessary

Shared Memory Parallelism

• Compiler automatically recognizes opportunities for multiple CPUs to split up loops

• Can be– Automatic (under compiler control)– Manual (OpenMP)

• Problem is memory bandwidth

Memory Bandwidth and Shared Memory Computers

• Different levels– Commodity PC dual processors– Mid-range shared-memory computers

• Larger cache

• Floating point performance of CPUs are critical

– Special-purpose machines with exotic memory and caches

• Write-back cache synchronization

MPI

• Message Passing Interface requires manual coding intervention

• The programmer can split the domain across multiple computers

• Messages are passed exchanging information between models

• This will not speed up small models due to the overhead of message passing– The volume of the model must be large with respect to

the data exchange interface

Accuracy

• Values in registers retain extra bits of precision• This can substantially improve computational

results• Every time values are stored to single precision

arrays, this extra information is lost• Example: Vectorization on the Pentium 4

– Very fast (4 simultaneous 32-bit operations)– Values change substantially in the model

• Loss of significant figures

Simulation is not Prediction

• Roundoff error slowly burns away digits of precision– No escape from information entropy

• Roundoff is indistinguishable from small differences in the initial conditions

• Single precision is not enough• After 12 hours, salinity results can differ by

3ppt on different computers

Calibration of Model Run on Different Machines?

• Calibration represents balancing all the constituents of the equations as they are computed by the model– Some parts of the model are more sensitive to

roundoff error than others– Some parts are more driven by forcing

functions and are therefore more stable– Turbulence closure is particularly twitchy

Density Computation

• The Equation of State has been defined by Millero and Poisson in UNESCO

• Accuracy is high– On the order of 6 digits– Original data (>500 samples) 5 digits

• Computational cost is also high– Many terms including s1.5

• How much accuracy is needed?

Comparison: Fofonoff 1952 vs. UNESCO

• ECOMSED used Fofonoff 1952 until recently– Approximately 0.6% difference at worst case at

approximate 10°C, 10ppt

• In computer model, the absolute density is irrelevant– The density difference between adjacent cells drives the

forcing

• In a 20ppt difference– No significant difference in X1– Small differences in distribution of salt gradient between

0 and 20 ppt.

Difference between UNESCO and Fofonoff 1950

A Faster Algorithm: densityDKAP

c1 = 6.38582008014815d-12 c2 = -1.07670862741285d-09 c3 = 9.73156784811086d-08 c4 = -9.0042779076007d-06 c5 = 6.60339902342078d-05 c6 = -0.000132403263360079d0 c7 = 4.91156586724696d-12 c8 = -7.81209323957205d-10 c9 = 6.72419650942208d-08 c10 = -3.5945086374423d-06 c11 = 0.000809532551425d0 c12 = -1.59139276805119d-07 c13 = -2.00957653399204d-12 c14 = 1.28333259714417d-10 c15 = 2.34587013019438d-09

do k=1,kb-1

do j=2,jm-1

do i=2,im-1

rho(i,j,k) =

$ ((((c13 * t(i,j,k) + c14) * t(i,j,k) + c15) *

$ s(i,j,k) + c12) * s(i,j,k) +

$ ((((c7 * t(i,j,k) + c8) * t(i,j,k)) + c9) *

$ t(i,j,k) + c10) * t(i,j,k) + c11) * s(i,j,k)+

$ ((((c1 * t(i,j,k) + c2) * t(i,j,k) + c3) *

$ t(i,j,k)+c4) * t(i,j,k) + c5) * t(i,j,k) + c6

enddo

enddo

enddo

UNESCO, DKAP and MS•Dr. Martin Senator of Davidson Laboratory contributed a different fit, MSSuperior to UNESCO in fit to the original data

Consistent Attention to Accuracy• In POM and ECOMSED, density computation is

performed– Rho is computed for each cell– In POM, density is stored as a ratio from rhoref to save digits– Differences are subtracted between adjacent cells– Salinity: difference of 0.001 ppt yields a result in the 7 th decimal

place– Temperature: difference of 0.01 °C yields a result in the 7th

decimal place– Result: 0 digits accuracy

• Conclusion: Compute density in double precision. Anything else may be inaccurate

Double precision Density vs. Single

Precision• Double precision is

more stable

• Differential Calculation of density will be presumably much more accurate, and therefore even more stable

Calculate Density Difference Analytically

• For greatest accuracy, calculate differential density directly– Slightly slower, but full accuracy

• Density(s,t) = Pn(s,t)

• DiffDensity(s,t,ds,dt)

Future Directions in Hardware

• New processors with more registers– Bigger expressions become even more advantageous

• Improved shared memory performance– Improved writeback cache design, larger caches, faster

interconnects

• Intelligent memory subsystems preloading values– Sequential access becomes even more important

Improved Software• We propose to write a Model Construction Toolkit (MCT)

that will do all the optimizations mentioned here, and more– Reimplement POM/ECOMSED using the new toolkit– Allow calling FORTRAN routines– Enter algorithms either in a FORTRAN-like syntax or in a

higher-level syntax based on common notation for difference equations

– Automatically optimize expressions• Compute array subexpressions only when changed

– Compute results using an optimal function given the situation

• The result: More robust, reliable modeling, at much higher speed, with less effort

Acknowledgements

• Professor Roger Pinkham, for superb numerical and statistical knowledge and insight

• Dr. Martin Senator for his densityMS fit and help getting started with R

• The makers of R, an incredible statistics package• John Gilson, who reviewed the presentation and

helped us get the right computing focus, and whose discussions first got us started thinking about a model construction toolkit

the synergy between computer science and cfd modeling dov kruger anne pence alan blumberg

Documents