Post on 22-Dec-2015
The Synergy between Computer Science and CFD Modeling
Dov Kruger
Anne Pence
Alan Blumberg
Background of the Davidson Laboratory Team
• Professor Alan Blumberg, co-author of POM and author of ECOMSED
• Dov Kruger, computer science, transitioning into ocean research
• Anne Pence, PhD candidate with a background in naval architecture and Oceanography
CFD is a Computing War Against Big Problems
• Overview of 5 key areas, with a focus on our research in single- and multiple-CPU performance and accuracy
– Understanding the battleground: architectural features of computers
– Having a tactical plan: coding efficiency
– Effective use of resources:
  • Parallelism on shared memory architectures
    – Auto-parallelizing compilers
    – OpenMP
  • Parallelism using distributed computers (MPI)
– Hitting the correct target: accuracy
Ignorance is useful, knowledge is crucial
• We questioned everything, including equation of state
• Surprising results: a new equation of state, and a new style of coding CFD models that coalesces loops and is significantly faster on modern computers
• Nothing worked until Professor Blumberg
– Conveyed the analytics of the core routines
– Designed suitable test cases
– Debugged problems
Doubling Performance of Barotropic Mode: 10 months + 2.5 hours
• Blindly optimizing the code is very difficult
– It took time to understand the model
• Problems
– Validating results
– Finding the right test case
– Incremental test procedure
• Once the correct strategy was in place, it took only 2.5 hours to double the performance of barotropic mode
– More optimizations are possible, but they get exponentially more difficult
– We propose not to do them by hand
A Sample Problem
Computer Architecture
• Locality = Speed!
– Registers
– Different levels of cache
– Main memory
– Swapping
– Disk
– Sequential access is much faster than non-sequential
• Locality is also accuracy
• Multiple execution units
– The limiting factor today is memory bandwidth
– ECOMSED and POM are both memory limited
Efficiency
• There is a cost structure to operations
– Changes over time
• Today's highest costs
– Writing to memory
  • Cache must flush to memory
– Reading from memory
  • Statistically, cached
– exp
– log
– sqrt
– Division
– Multiplication
Heuristics for Writing Fast Code
• Efficiency doesn't hurt
– It may not help
• Minimize cost of operations
– Different on each machine
– But similar enough that heuristics will produce good results on most architectures
– On average, performance will improve if expressions are sufficiently large
Example: Computing Bottom Stress: original
      do j=2,jmm1
        do i=2,imm1
          ubar(i,j,kbm1)=0.5*(ua(i,j)+ua(i+1,j))
          vbar(i,j,kbm1)=0.5*(va(i,j)+va(i,j+1))
        enddo
      enddo
      do j=2,jmm1
        do i=2,imm1
          qbar(i,j)=sqrt(ubar(i,j,kbm1)*ubar(i,j,kbm1)+
     $                   vbar(i,j,kbm1)*vbar(i,j,kbm1))
        enddo
      enddo
      do j=2,jmm1
        do i=2,imm1
          if (fsm(i,j).gt.0.0) then
            tau(i,j,kb)=10000.*cbc(i,j)/
     $                  varbf(i,j)*qbar(i,j)*qbar(i,j)
          endif
        enddo
      enddo
Bottom Stress: 21 Times Faster
      do j=2,jmm1
        do i=2,imm1
          tempUBAR = (UA(I,J)+UA(I+1,J))
          tempVBAR = (VA(I,J)+VA(I,J+1))
          TAU(I,J,KB)=CBC2(I,J)*(tempUBAR**2 + tempVBAR**2)
        enddo
      enddo
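The two versions above can be compared side by side in a small sketch. The slide does not define CBC2, so the module below assumes it holds 0.25*10000*cbc/varbf on water cells and zero on land (the 0.25 absorbs the two factors of 0.5 from the averaging). That assumption, the 2-D temporaries, and all routine names are illustrative and not taken from POM or ECOMSED. The code is free-form Fortran rather than the fixed-form style of the slides.

```fortran
module bottom_stress_demo
  implicit none
contains
  ! Three-pass version, following the structure of the original slide code.
  subroutine stress_original(ua, va, cbc, varbf, fsm, tau)
    real, intent(in)  :: ua(:,:), va(:,:), cbc(:,:), varbf(:,:), fsm(:,:)
    real, intent(out) :: tau(:,:)
    real, allocatable :: ubar(:,:), vbar(:,:), qbar(:,:)
    integer :: i, j, im, jm
    im = size(ua, 1)
    jm = size(ua, 2)
    allocate(ubar(im,jm), vbar(im,jm), qbar(im,jm))
    tau = 0.0
    do j = 2, jm-1
      do i = 2, im-1
        ubar(i,j) = 0.5*(ua(i,j) + ua(i+1,j))
        vbar(i,j) = 0.5*(va(i,j) + va(i,j+1))
      end do
    end do
    do j = 2, jm-1
      do i = 2, im-1
        qbar(i,j) = sqrt(ubar(i,j)*ubar(i,j) + vbar(i,j)*vbar(i,j))
      end do
    end do
    do j = 2, jm-1
      do i = 2, im-1
        if (fsm(i,j) > 0.0) then
          tau(i,j) = 10000.*cbc(i,j)/varbf(i,j)*qbar(i,j)*qbar(i,j)
        end if
      end do
    end do
  end subroutine stress_original

  ! ASSUMPTION: CBC2 folds the constant, the division, the 0.25 factor
  ! and the land mask into one array that is computed once.
  subroutine make_cbc2(cbc, varbf, fsm, cbc2)
    real, intent(in)  :: cbc(:,:), varbf(:,:), fsm(:,:)
    real, intent(out) :: cbc2(:,:)
    cbc2 = 0.0
    where (fsm > 0.0) cbc2 = 2500.0*cbc/varbf
  end subroutine make_cbc2

  ! Fused version: one pass, scalar temporaries, no sqrt, no IF test.
  subroutine stress_fused(ua, va, cbc2, tau)
    real, intent(in)  :: ua(:,:), va(:,:), cbc2(:,:)
    real, intent(out) :: tau(:,:)
    real :: tu, tv
    integer :: i, j
    tau = 0.0
    do j = 2, size(ua, 2) - 1
      do i = 2, size(ua, 1) - 1
        tu = ua(i,j) + ua(i+1,j)
        tv = va(i,j) + va(i,j+1)
        tau(i,j) = cbc2(i,j)*(tu**2 + tv**2)
      end do
    end do
  end subroutine stress_fused
end module bottom_stress_demo
```

Under these assumptions both routines should agree to within single-precision roundoff, while the fused one writes one array instead of four.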
The Current State of Compiler Optimization
• Compilers typically do not rearrange floating point expressions
– For fear of subtle roundoff effects
– Instruction scheduling can still be done as long as the two data streams do not interact
– Example: (x * y) - (x / y) + (x + y) - (x - y)
• Common subexpressions are well-handled provided they are in the same form
– Algebraic equivalences are not considered
• Constant subexpressions are pre-evaluated at compile time in C and FORTRAN
Examples: Writing Fast Code
• Only scalars are considered constant
• Writing to an array is not optimized away
var1(i,j,k) = c1 * c2 * c3
var1(i,j,k) = var1(i,j,k) * c4
var1(i,j,k) = c1 * c2 * c3 * c4
c1 * var2(i,j,k) * c3 * var3(i,j,k) * c4
c1*c3*c4 * var2(i,j,k) * var3(i,j,k)
var2(i,j,k) * var3(i,j,k) * c1*c3*c4
var2(i,j,k) * var3(i,j,k) * (c1*c3*c4)
var1(i,j,k) / c1(i,j,k)
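The slide lists these expression forms without commentary. One reading, consistent with the earlier point that compilers do not rearrange floating point expressions, is that scalar factors interleaved with array references are multiplied again for every element, whereas a parenthesized scalar product can be evaluated once. The sketch below illustrates that reading only; the routine names are hypothetical, and the scalar product is hoisted by hand because the scalars here are arguments rather than compile-time constants.

```fortran
module const_grouping_demo
  implicit none
contains
  ! Interleaved form: evaluated left to right, four multiplies per element.
  subroutine scale_interleaved(c1, c3, c4, v2, v3, r)
    real, intent(in)  :: c1, c3, c4, v2(:), v3(:)
    real, intent(out) :: r(:)
    integer :: i
    do i = 1, size(r)
      r(i) = c1 * v2(i) * c3 * v3(i) * c4
    end do
  end subroutine scale_interleaved

  ! Grouped form: the scalar product is computed once outside the loop,
  ! leaving two multiplies per element.
  subroutine scale_grouped(c1, c3, c4, v2, v3, r)
    real, intent(in)  :: c1, c3, c4, v2(:), v3(:)
    real, intent(out) :: r(:)
    real :: c134
    integer :: i
    c134 = c1*c3*c4
    do i = 1, size(r)
      r(i) = v2(i) * v3(i) * c134
    end do
  end subroutine scale_grouped
end module const_grouping_demo
```

The two results can differ in the last bit because the multiplications are grouped differently, which is exactly why a compiler will not make this change on its own.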
Skipping Land
• Goals
– Save CPU by only processing water boxes
– Cost very little CPU for all-water problems
• IF-tests slow down the CPU, so preferably avoid them
• Stripwise “unstructured view” of the grid avoids additional tests
– inefficient on a vector machine
• k-order would be better to test an entire water column at once, but this would require re-engineering the entire model
Skipping land, continued
istart,jrow,iend    istart,jrow,iend
2,2,2               5,2,6
2,3,2               5,3,6
2,4,6
2,5,6
2,6,6
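The strip table above can be produced by scanning the land mask once and then reused by every loop that should touch only water. The following is a minimal sketch of that idea, assuming a mask array fsm that is positive on water and that the outermost ring of cells is never processed. The type and routine names are hypothetical and not part of POM or ECOMSED.

```fortran
module land_strips_demo
  implicit none
  type :: strip
    integer :: istart, jrow, iend
  end type strip
contains
  ! Scan the mask once and record every run of consecutive water cells
  ! in each interior row. The caller must size strips generously.
  subroutine build_strips(fsm, strips, n)
    real, intent(in)         :: fsm(:,:)
    type(strip), intent(out) :: strips(:)
    integer, intent(out)     :: n
    integer :: i, j, im, jm, i0
    im = size(fsm, 1)
    jm = size(fsm, 2)
    n = 0
    do j = 2, jm-1
      i0 = 0                        ! 0 means "not inside a water run"
      do i = 2, im-1
        if (fsm(i,j) > 0.0) then
          if (i0 == 0) i0 = i       ! a new run starts here
        else if (i0 /= 0) then
          n = n + 1                 ! the run ended at i-1
          strips(n) = strip(i0, j, i-1)
          i0 = 0
        end if
      end do
      if (i0 /= 0) then             ! run reaches the end of the row
        n = n + 1
        strips(n) = strip(i0, j, im-1)
      end if
    end do
  end subroutine build_strips

  ! Example consumer: the inner loop contains no IF test at all.
  subroutine add_on_water(strips, n, field, delta)
    type(strip), intent(in) :: strips(:)
    integer, intent(in)     :: n
    real, intent(inout)     :: field(:,:)
    real, intent(in)        :: delta
    integer :: s, i
    do s = 1, n
      do i = strips(s)%istart, strips(s)%iend
        field(i, strips(s)%jrow) = field(i, strips(s)%jrow) + delta
      end do
    end do
  end subroutine add_on_water
end module land_strips_demo
```

For a 7 by 7 grid with land at i = 3..4, j = 2..3, build_strips should return the seven strips listed in the table, row by row. An all-water row costs a single strip, which matches the goal of adding very little overhead to all-water problems.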
High Performance Advection
• Scalar fluxes in one preferred direction
– No memory access for flux
– Full machine precision
xFlux(i), xFlux(i+1) (figure labels)
Var(i,j,k) = xFluxIP1 - xFluxI
xFluxI = xFluxIP1
Advection, cont
      yfluxJ(1) = 0
      yFluxJ(im) = 0
      diffusiveyfluxJ(1) = 0
      diffusiveyFluxJ(im) = 0
      do k=2,kbm1
        do i=2,imm1
          yfluxJ(i) = 0 ! because of land boundary condition
          diffusiveYFluxJ(i) = 0
        enddo
        do j=2,jmm1
          xfluxI = 0 ! because of land boundary condition
          do i=2,imm1
            xfluxIP1 = (f(i,j,k)+f(i+1,j,k)) * xmfl3d(i+1,j,k)
            yFluxJP1 = (f(i,j,k)+f(i,j+1,k)) * ymfl3d(i,j+1,k)
            diffusiveXFluxIP1 = -aamx(i,j,k) * addDTi(i,j,BACK) *
     $        (fb(i,j,k)-fb(i-1,j,k))*uMaskedH2_2H1(i,j)
            diffusiveYFluxJP1 = -aamy(i,j,k) * addDTj(i,j, BACK) *
     $        (fb(i,j,k)-fb(i,j-1,k))*vMaskedH1_2H2(i,j)
Advection, take 3
            ff(i,j,k) =
     $        (
     $          fB(i,j,k) * volT(i,j,BACK) - !advective part
     $          dti2 * 0.5 * invDZ(k) *
     $          (
     $            (f(i,j,k-1)+f(i,j,k))*w(i,j,k) -
     $            (f(i,j,k)+f(i,j,k+1))*w(i,j,k+1)*ART(i,j) +
     $            xFluxIP1 - xFluxI + yFluxJP1 - yFluxJ(i)
     $          ) + ! diffusive part
     $          diffusiveXfluxIP1 - diffusiveXFluxI +
     $          diffusiveYFluxJP1 - diffusiveYFluxJ(i)
     $        ) *
     $        invVolT(i,j,FUTURE)
            xFluxI = xFluxIp1
            diffusiveXFluxI = diffusiveXFluxIp1
            yFluxJ(i) = yFluxJP1
            diffusiveYFluxJ(i)=diffusiveYFluxJP1
          enddo
        enddo
      enddo
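The core trick in the advection code, carrying the flux through a face in a scalar and handing it to the next cell, is easier to see in one dimension. The sketch below is a simplified illustration and not the model's advection routine: it keeps only the advective x-flux, uses hypothetical names, and treats u(i) as the transport through the face between cells i-1 and i.

```fortran
module scalar_flux_demo
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
contains
  ! Reference version: every face flux is first stored in an array.
  subroutine update_with_flux_array(f, u, df)
    real(real64), intent(in)  :: f(:), u(:)
    real(real64), intent(out) :: df(:)
    real(real64), allocatable :: flux(:)
    integer :: i, n
    n = size(f)
    allocate(flux(n))
    flux = 0.0_real64               ! flux(2) stays 0: closed (land) boundary
    do i = 2, n-1
      flux(i+1) = (f(i) + f(i+1)) * u(i+1)
    end do
    df = 0.0_real64
    do i = 2, n-1
      df(i) = flux(i+1) - flux(i)
    end do
  end subroutine update_with_flux_array

  ! Scalar version: the flux leaving cell i is the flux entering cell
  ! i+1, so it is carried in a scalar and never written to memory.
  subroutine update_with_scalar_flux(f, u, df)
    real(real64), intent(in)  :: f(:), u(:)
    real(real64), intent(out) :: df(:)
    real(real64) :: flux_i, flux_ip1
    integer :: i
    df = 0.0_real64
    flux_i = 0.0_real64             ! closed (land) boundary
    do i = 2, size(f) - 1
      flux_ip1 = (f(i) + f(i+1)) * u(i+1)
      df(i) = flux_ip1 - flux_i
      flux_i = flux_ip1
    end do
  end subroutine update_with_scalar_flux
end module scalar_flux_demo
```

Both routines compute the same differences; the scalar version needs one pass and no flux array. In the slide code the y-direction still needs a one-row array (yFluxJ) because the neighbouring cell in j is visited a full row later.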
K-order is Faster
• POM and ECOMSED are currently stored with I varying most frequently
– var(i,j,k)
• For most algorithms, traversal order is irrelevant, except:
– Vertical models are naturally ordered top to bottom
– A model stored and accessed in k-order can implement the vertical profile algorithms much more efficiently
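The difference between the two storage orders can be sketched with a vertical column sum. Fortran stores the first index contiguously, so with var(i,j,k) successive levels of one water column lie im*jm elements apart, while with var(k,i,j) a whole column is contiguous. The routines below are an illustrative sketch with hypothetical names; they show only the access pattern and make no claim about the speedup on any particular machine.

```fortran
module k_order_demo
  implicit none
contains
  ! i-fastest layout, var(i,j,k): walking down a column strides through
  ! memory in steps of im*jm elements.
  subroutine column_sum_ijk(var, total)
    real, intent(in)  :: var(:,:,:)     ! dimensions (im, jm, kb)
    real, intent(out) :: total(:,:)     ! dimensions (im, jm)
    integer :: i, j, k
    do j = 1, size(var, 2)
      do i = 1, size(var, 1)
        total(i,j) = 0.0
        do k = 1, size(var, 3)
          total(i,j) = total(i,j) + var(i,j,k)
        end do
      end do
    end do
  end subroutine column_sum_ijk

  ! k-fastest layout, var(k,i,j): each column is contiguous, and the
  ! running sum can stay in a scalar.
  subroutine column_sum_kij(var, total)
    real, intent(in)  :: var(:,:,:)     ! dimensions (kb, im, jm)
    real, intent(out) :: total(:,:)     ! dimensions (im, jm)
    real :: acc
    integer :: i, j, k
    do j = 1, size(var, 3)
      do i = 1, size(var, 2)
        acc = 0.0
        do k = 1, size(var, 1)
          acc = acc + var(k,i,j)
        end do
        total(i,j) = acc
      end do
    end do
  end subroutine column_sum_kij
end module k_order_demo
```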
Performance of PROFT/PROFS
• On a sample problem
– 3 times faster for profiling temperature
– 8 times faster for profiling salinity
– This includes the penalty of traversing in the wrong memory order for T,S
• For a k-ordered model, code would be even faster
• Techniques as discussed before, and also
– Removal of exponentiation where unnecessary
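The slide does not say which exponentiations were removed from PROFT/PROFS. As a generic illustration of the technique, a power with a fixed exponent such as s**1.5 can be replaced by a multiply and a square root, which avoids the general power routine. The function names below are hypothetical.

```fortran
module pow_removal_demo
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
contains
  ! General exponentiation with a real exponent.
  elemental function s15_pow(s) result(r)
    real(real64), intent(in) :: s
    real(real64) :: r
    r = s**1.5_real64
  end function s15_pow

  ! Same quantity for s >= 0, using one sqrt and one multiply.
  elemental function s15_sqrt(s) result(r)
    real(real64), intent(in) :: s
    real(real64) :: r
    r = s*sqrt(s)
  end function s15_sqrt
end module pow_removal_demo
```

The two forms agree to roundoff for non-negative arguments; whether the second is faster depends on the machine and compiler, so it is worth timing before committing to it.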
Shared Memory Parallelism
• Compiler automatically recognizes opportunities for multiple CPUs to split up loops
• Can be
– Automatic (under compiler control)
– Manual (OpenMP)
• Problem is memory bandwidth
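For the manual route, a minimal OpenMP sketch looks like the following. The loop body and names are hypothetical; the directive is standard OpenMP and is treated as an ordinary comment when the code is compiled without OpenMP support, so the routine gives the same answer either way.

```fortran
module omp_loop_demo
  implicit none
contains
  ! c = a + alpha*b, with the outer (j) loop split across threads.
  subroutine scale_add(a, b, c, alpha)
    real, intent(in)  :: a(:,:), b(:,:), alpha
    real, intent(out) :: c(:,:)
    integer :: i, j
    !$omp parallel do private(i)
    do j = 1, size(a, 2)
      do i = 1, size(a, 1)
        c(i,j) = a(i,j) + alpha*b(i,j)
      end do
    end do
    !$omp end parallel do
  end subroutine scale_add
end module omp_loop_demo
```

A loop like this does one multiply and one add for every three memory accesses, so it is a typical case where adding processors helps only until the shared memory bus saturates.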
Memory Bandwidth and Shared Memory Computers
• Different levels
– Commodity PC dual processors
– Mid-range shared-memory computers
  • Larger cache
  • Floating point performance of the CPUs is critical
– Special-purpose machines with exotic memory and caches
  • Write-back cache synchronization
MPI
• Message Passing Interface requires manual coding intervention
• The programmer can split the domain across multiple computers
• Messages are passed exchanging information between models
• This will not speed up small models due to the overhead of message passing
– The volume of the model must be large with respect to the data exchange interface
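The volume-to-interface argument can be made concrete with a little arithmetic. The sketch below assumes a rectangular subdomain of nx by ny by nz cells that exchanges a one-cell-wide halo on its four lateral faces; both the halo width and the function name are assumptions made for illustration, not details from the presentation.

```fortran
module halo_ratio_demo
  implicit none
contains
  ! Ratio of exchanged halo cells to computed cells for one subdomain.
  ! ASSUMPTION: one-cell-wide halo on the four lateral faces only.
  pure function halo_to_volume(nx, ny, nz) result(ratio)
    integer, intent(in) :: nx, ny, nz
    real :: ratio
    ratio = real(2*(nx + ny)*nz) / real(nx*ny*nz)
  end function halo_to_volume
end module halo_ratio_demo
```

Under these assumptions a 10 by 10 by 10 block exchanges 0.4 cells per computed cell, while a 100 by 100 by 10 block exchanges 0.04, which is why only large subdomains amortize the message-passing overhead.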
Accuracy
• Values in registers retain extra bits of precision
• This can substantially improve computational results
• Every time values are stored to single precision arrays, this extra information is lost
• Example: Vectorization on the Pentium 4
– Very fast (4 simultaneous 32-bit operations)
– Values change substantially in the model
  • Loss of significant figures
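The effect of rounding every intermediate result to single precision can be reproduced with a simple running sum. In the sketch below (hypothetical names), the first routine rounds the accumulator to 32 bits after every addition, as happens when values are stored back to a single-precision array; the second keeps a wider accumulator and rounds once at the end, standing in for a value that stays in a higher-precision register.

```fortran
module accumulate_demo
  use, intrinsic :: iso_fortran_env, only: real32, real64
  implicit none
contains
  ! Accumulator is rounded to single precision at every step.
  function sum_single(x) result(s)
    real(real32), intent(in) :: x(:)
    real(real32) :: s
    integer :: i
    s = 0.0_real32
    do i = 1, size(x)
      s = s + x(i)
    end do
  end function sum_single

  ! Same single-precision data, wider accumulator, one final rounding.
  function sum_wide(x) result(s)
    real(real32), intent(in) :: x(:)
    real(real32) :: s
    real(real64) :: acc
    integer :: i
    acc = 0.0_real64
    do i = 1, size(x)
      acc = acc + real(x(i), real64)
    end do
    s = real(acc, real32)
  end function sum_wide
end module accumulate_demo
```

Summing one million copies of 0.1 shows the gap clearly: the wide accumulator lands within a rounding step of 100000, while the single-precision accumulator drifts away by far more than one unit.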
Simulation is not Prediction
• Roundoff error slowly burns away digits of precision
– No escape from information entropy
• Roundoff is indistinguishable from small differences in the initial conditions
• Single precision is not enough
• After 12 hours, salinity results can differ by 3 ppt on different computers
Calibration of Model Run on Different Machines?
• Calibration represents balancing all the constituents of the equations as they are computed by the model
– Some parts of the model are more sensitive to roundoff error than others
– Some parts are more driven by forcing functions and are therefore more stable
– Turbulence closure is particularly twitchy
Density Computation
• The Equation of State has been defined by Millero and Poisson in UNESCO
• Accuracy is high
– On the order of 6 digits
– Original data (>500 samples) 5 digits
• Computational cost is also high
– Many terms including s^1.5
• How much accuracy is needed?
Comparison: Fofonoff 1952 vs. UNESCO
• ECOMSED used Fofonoff 1952 until recently
– Approximately 0.6% difference in the worst case, at approximately 10°C, 10 ppt
• In a computer model, the absolute density is irrelevant
– The density difference between adjacent cells drives the forcing
• In a 20 ppt difference
– No significant difference in X1
– Small differences in distribution of salt gradient between 0 and 20 ppt
Difference between UNESCO and Fofonoff 1952
A Faster Algorithm: densityDKAP
      c1 = 6.38582008014815d-12
      c2 = -1.07670862741285d-09
      c3 = 9.73156784811086d-08
      c4 = -9.0042779076007d-06
      c5 = 6.60339902342078d-05
      c6 = -0.000132403263360079d0
      c7 = 4.91156586724696d-12
      c8 = -7.81209323957205d-10
      c9 = 6.72419650942208d-08
      c10 = -3.5945086374423d-06
      c11 = 0.000809532551425d0
      c12 = -1.59139276805119d-07
      c13 = -2.00957653399204d-12
      c14 = 1.28333259714417d-10
      c15 = 2.34587013019438d-09
      do k=1,kb-1
        do j=2,jm-1
          do i=2,im-1
            rho(i,j,k) =
     $      ((((c13 * t(i,j,k) + c14) * t(i,j,k) + c15) *
     $      s(i,j,k) + c12) * s(i,j,k) +
     $      ((((c7 * t(i,j,k) + c8) * t(i,j,k)) + c9) *
     $      t(i,j,k) + c10) * t(i,j,k) + c11) * s(i,j,k)+
     $      ((((c1 * t(i,j,k) + c2) * t(i,j,k) + c3) *
     $      t(i,j,k)+c4) * t(i,j,k) + c5) * t(i,j,k) + c6
          enddo
        enddo
      enddo
UNESCO, DKAP and MS
• Dr. Martin Senator of Davidson Laboratory contributed a different fit, MS
– Superior to UNESCO in fit to the original data
Consistent Attention to Accuracy
• In POM and ECOMSED, density computation is performed
– Rho is computed for each cell
– In POM, density is stored as a ratio from rhoref to save digits
– Differences are subtracted between adjacent cells
– Salinity: difference of 0.001 ppt yields a result in the 7th decimal place
– Temperature: difference of 0.01 °C yields a result in the 7th decimal place
– Result: 0 digits accuracy
• Conclusion: Compute density in double precision. Anything else may be inaccurate
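The loss of digits can be reproduced with the densityDKAP polynomial itself. The sketch below makes two assumptions that are not stated on the slides: that the polynomial returns (density - 1000)/1000, which is inferred from the value of roughly 0.0281 it gives at t = 0, s = 35, and that the density is then held as an absolute value in kg/m^3. Function names are hypothetical. It evaluates the density of two neighbouring cells in single and in double precision so their differences can be compared.

```fortran
module density_precision_demo
  use, intrinsic :: iso_fortran_env, only: sp => real32, dp => real64
  implicit none
  ! densityDKAP coefficients c1..c15, copied from the slide.
  real(dp), parameter :: c(15) = [ &
     6.38582008014815e-12_dp, -1.07670862741285e-09_dp,  9.73156784811086e-08_dp, &
    -9.0042779076007e-06_dp,   6.60339902342078e-05_dp, -0.000132403263360079_dp, &
     4.91156586724696e-12_dp, -7.81209323957205e-10_dp,  6.72419650942208e-08_dp, &
    -3.5945086374423e-06_dp,   0.000809532551425_dp,    -1.59139276805119e-07_dp, &
    -2.00957653399204e-12_dp,  1.28333259714417e-10_dp,  2.34587013019438e-09_dp ]
  real(sp), parameter :: cs(15) = real(c, sp)
contains
  ! Absolute density in kg/m^3, all arithmetic in double precision.
  ! ASSUMPTION: the polynomial value is (density - 1000)/1000.
  pure function abs_rho_dp(t, s) result(r)
    real(dp), intent(in) :: t, s
    real(dp) :: r
    r = ((((c(13)*t + c(14))*t + c(15))*s + c(12))*s + &
         (((c(7)*t + c(8))*t + c(9))*t + c(10))*t + c(11))*s + &
        ((((c(1)*t + c(2))*t + c(3))*t + c(4))*t + c(5))*t + c(6)
    r = 1000.0_dp*(1.0_dp + r)
  end function abs_rho_dp

  ! Same computation with single-precision coefficients and arithmetic.
  pure function abs_rho_sp(t, s) result(r)
    real(sp), intent(in) :: t, s
    real(sp) :: r
    r = ((((cs(13)*t + cs(14))*t + cs(15))*s + cs(12))*s + &
         (((cs(7)*t + cs(8))*t + cs(9))*t + cs(10))*t + cs(11))*s + &
        ((((cs(1)*t + cs(2))*t + cs(3))*t + cs(4))*t + cs(5))*t + cs(6)
    r = 1000.0_sp*(1.0_sp + r)
  end function abs_rho_sp
end module density_precision_demo
```

At t = 10 and a salinity step from 35 to 35.001, the double-precision difference is about 7.8e-4 kg/m^3. Single-precision numbers near 1027 are spaced about 1.2e-4 apart, so the single-precision difference can only be a whole number of those steps, which leaves essentially no reliable digits. Storing the density as a small ratio, as the slide says POM does, makes the spacing much finer but does not remove the cancellation.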
Double Precision Density vs. Single Precision
• Double precision is more stable
• Differential calculation of density will presumably be much more accurate, and therefore even more stable
Calculate Density Difference Analytically
• For greatest accuracy, calculate differential density directly
– Slightly slower, but full accuracy
• Density(s,t) = Pn(s,t)
• DiffDensity(s,t,ds,dt)
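The slides do not give the form of DiffDensity. One way to build it, sketched below for the special case dt = 0, is to expand P(s+ds,t) - P(s,t) algebraically so that the large, nearly equal terms cancel on paper instead of in floating point. The densityDKAP polynomial is cubic in s, so the expansion is short. The function names, and the restriction to a salinity-only difference, are assumptions made to keep the example small.

```fortran
module diff_density_demo
  use, intrinsic :: iso_fortran_env, only: sp => real32, dp => real64
  implicit none
  ! densityDKAP coefficients c1..c15, copied from the slide.
  real(dp), parameter :: c(15) = [ &
     6.38582008014815e-12_dp, -1.07670862741285e-09_dp,  9.73156784811086e-08_dp, &
    -9.0042779076007e-06_dp,   6.60339902342078e-05_dp, -0.000132403263360079_dp, &
     4.91156586724696e-12_dp, -7.81209323957205e-10_dp,  6.72419650942208e-08_dp, &
    -3.5945086374423e-06_dp,   0.000809532551425_dp,    -1.59139276805119e-07_dp, &
    -2.00957653399204e-12_dp,  1.28333259714417e-10_dp,  2.34587013019438e-09_dp ]
  real(sp), parameter :: cs(15) = real(c, sp)
contains
  ! Reference: the slide polynomial evaluated in double precision.
  pure function rho_dp(t, s) result(r)
    real(dp), intent(in) :: t, s
    real(dp) :: r
    r = ((((c(13)*t + c(14))*t + c(15))*s + c(12))*s + &
         (((c(7)*t + c(8))*t + c(9))*t + c(10))*t + c(11))*s + &
        ((((c(1)*t + c(2))*t + c(3))*t + c(4))*t + c(5))*t + c(6)
  end function rho_dp

  ! rho(s+ds,t) - rho(s,t) in single precision, with the subtraction
  ! done algebraically. Writing rho = a3*s**3 + c12*s**2 + a1*s + p(t):
  !   (s+ds)**3 - s**3 = ds*(3*s**2 + 3*s*ds + ds**2)
  !   (s+ds)**2 - s**2 = ds*(2*s + ds)
  ! and the s-independent part p(t) cancels exactly.
  pure function diff_rho_ds(t, s, ds) result(d)
    real(sp), intent(in) :: t, s, ds
    real(sp) :: d, a3, a1
    a3 = (cs(13)*t + cs(14))*t + cs(15)
    a1 = (((cs(7)*t + cs(8))*t + cs(9))*t + cs(10))*t + cs(11)
    d = ds*(a3*(3.0_sp*s*s + 3.0_sp*s*ds + ds*ds) + cs(12)*(2.0_sp*s + ds) + a1)
  end function diff_rho_ds
end module diff_density_demo
```

Even though diff_rho_ds works entirely in single precision, it should track the double-precision difference to several significant digits, because no two large numbers are ever subtracted. A full DiffDensity(s,t,ds,dt) would add the corresponding expansion in t.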
Future Directions in Hardware
• New processors with more registers
– Bigger expressions become even more advantageous
• Improved shared memory performance
– Improved writeback cache design, larger caches, faster interconnects
• Intelligent memory subsystems preloading values
– Sequential access becomes even more important
Improved Software
• We propose to write a Model Construction Toolkit (MCT) that will do all the optimizations mentioned here, and more
– Reimplement POM/ECOMSED using the new toolkit
– Allow calling FORTRAN routines
– Enter algorithms either in a FORTRAN-like syntax or in a higher-level syntax based on common notation for difference equations
– Automatically optimize expressions
  • Compute array subexpressions only when changed
– Compute results using an optimal function given the situation
• The result: More robust, reliable modeling, at much higher speed, with less effort
Acknowledgements
• Professor Roger Pinkham, for superb numerical and statistical knowledge and insight
• Dr. Martin Senator for his densityMS fit and help getting started with R
• The makers of R, an incredible statistics package
• John Gilson, who reviewed the presentation and helped us get the right computing focus, and whose discussions first got us started thinking about a model construction toolkit