TRANSCRIPT
Numerical Weather Prediction Model Optimization Update
Daniel B. Weber and Henry J. Neeman
Center for Analysis and Prediction of Storms
University of Oklahoma
Motivation: May 3, 1999 Tornado
Copyright 1999 The Daily Oklahoman
Improve warning times
OSCER 2006
Computer resource estimate (1-km mesh)
5500 x 3600 x 100 grid points x 3500 calc/point = 6.9 TFLOPS
IA-32 based 3 GHz Pentium 4 provides a peak of 6 GFLOPS/processor
Requires 1155 processors assuming perfect CPU utilization and network
Continental US Thunderstorm Prediction (ARPS)
OSCER 2005 Symposium
Several approaches for optimization:
– Single processor
– Parallel processor
– Optimization basics
This year, more details…
Why Optimize?
Top 500 List

[Figure: peak system performance on the Top 500 list (TFLOPS, log scale 0.1–1000) vs. year, 1993–2005]
Technology has changed!
ARPS Single Processor Performance

[Figure: actual vs. peak performance (MFLOPS, 0–8000) for Intel P3 1 GHz, Intel P4 2 GHz, Intel Itanium 0.8 GHz, EV-67 1 GHz, SGI O3000 0.4 GHz, IBM Power4 1.3 GHz, and NEC SX-5]
Two Options for Improving Code Performance
– Build faster, more efficient computers (expensive)
– Optimize the software to run efficiently on all computing platforms
Optimization Goals
Focus efforts on commodity-based computers to achieve our science goal; we have no choice…
Keep the code easy to read, important for code maintenance and further development
Software Optimization
Existing codes are not designed to run efficiently on scalar technology
Is it worth the effort to convert an existing computational code to a new way of code structure/computing?
Software Application
ARPS (Advanced Regional Prediction System) thunderstorm prediction model
Research version of ARPS (ARPI)
CFD code – Navier-Stokes equations solved on a finite grid/mesh
Results can be applied to other models, etc.
Code Analysis
Profile the code (PAPI, Speedshop, Perfex, Apprentice)
Find the computationally intensive parts
Obtain platform information
Key difference for scientists:
– We are doing the work this time!
– Optimization not required on vector hardware!
Instrument the Code

Process           Seconds   Percent of Total
--------------------------------------------
Initialization      1.25        13.2
Turbulence          1.97        20.9
Advect u,v,w        0.28         3.0
Advect scalars      0.42         4.4
UV solver           0.81         8.6
WP solver           1.61        17.1
PT solver           0.03         0.3
Qv,c,r solver       1.07        11.3
Buoyancy            0.06         0.6
Coriolis            0.00         0.0
Comp. mixing        1.03        10.9
Message passing     0.00         0.0
Miscellaneous       0.13         1.3
--------------------------------------------
Total Time          9.44       100.0
Generic Optimization Strategy
Four issues:
– Memory bound (large number of ref/calc)
– Compute bound (large number of calc/ref)
– Message bound (waiting for messages)
– I/O bound (waiting for return from file I/O)
Memory references are more expensive than calculations
Optimization Review: Single Processor
* Identify the computationally intensive components
* Reduce memory references and improve cache reuse (more on this later)
* Reduce calculations and instructions (merge loops)
Compiler optimizations
Single Processor Optimization Techniques
Traditional compiler option selections
Removing divides (strength reduction)
Removing unnecessary memory references and calculations
Loop merging
Hardware-specific optimization
Loop collapsing (vector architecture)
Cache optimization (tiling)
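Removing divides is the classic strength-reduction step: the reciprocal is computed once and reused as a multiply inside the hot loop, since a divide costs many more cycles than a multiply on most scalar hardware. A small C sketch with illustrative array names:

```c
#define N 1000

/* Naive: one divide per iteration. */
void scale_div(double *a, const double *b, double dx) {
    for (int i = 0; i < N; i++)
        a[i] = b[i] / dx;
}

/* Strength-reduced: hoist the divide out as a reciprocal. */
void scale_mul(double *a, const double *b, double dx) {
    double dxinv = 1.0 / dx;   /* one divide, then N multiplies */
    for (int i = 0; i < N; i++)
        a[i] = b[i] * dxinv;
}
```

ARPI's own loops use the same idea, e.g. the precomputed `dxinv` factor in the final u-velocity update shown later in this talk.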
Tiling
Tiling is the process by which the original domain of computation is split into smaller sections that can fit into the top-level cache (usually L2)
The goal is to tune the application to fit the tile region within the cache of the selected hardware and achieve enhanced data reuse and application performance – accessing L2 is much faster than main memory
Tiling requires changing loop limits over a series of loops to perform calculations on the sub-domain (maximize data reuse = minimize memory fetches)
Used PAPI to access the performance counters on my Dell Pentium 3 laptop (2 hardware counters)
ARPI Memory Requirements
ARPI contains 75 3-D arrays (per processor); other forecast models use much more (> 2x)
A typical forecast sub-domain (per processor) has on the order of 103x53x53 grid/mesh points (~86+ MB)
Result: ARPI arrays will not fit into any current or near-future cache system…
Tiling Example: J-Stencil (adjust loop limit size)

DO n = 1,loopnum              ! loopnum = 80
  DO k = 1,nz
    DO j = 3,ny-2             ! j-stencil calculation
      DO i = 1,nx
        a(i,j,k) = (u(i,j+2,k)+u(i,j+1,k)-u(i,j,k)
     :             +u(i,j-1,k)-u(i,j-2,k))*1.3*n
      END DO
    END DO
  END DO
END DO                        ! sample computation
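The same stencil can be tiled in j so that each strip of `u` stays cache-resident across all repeats of the outer loop. A C sketch of the transformation; the tile width `TJ` is an assumed tuning parameter, not a value from the slides:

```c
#define NX 32
#define NY 64
#define NZ 8
#define TJ 8    /* assumed j-tile width, tuned to fit cache */

/* Untiled sweep, mirroring the Fortran example: repeat the whole
 * j-stencil 'loopnum' times over the full domain. */
void sweep(double a[NZ][NY][NX], const double u[NZ][NY][NX], int loopnum) {
    for (int n = 1; n <= loopnum; n++)
        for (int k = 0; k < NZ; k++)
            for (int j = 2; j < NY - 2; j++)
                for (int i = 0; i < NX; i++)
                    a[k][j][i] = (u[k][j+2][i] + u[k][j+1][i] - u[k][j][i]
                                + u[k][j-1][i] - u[k][j-2][i]) * 1.3 * n;
}

/* Tiled sweep: a j-tile loop is hoisted outermost, so each strip of u
 * stays in cache across all 'loopnum' repeats before moving on. */
void sweep_tiled(double a[NZ][NY][NX], const double u[NZ][NY][NX], int loopnum) {
    for (int jt = 2; jt < NY - 2; jt += TJ) {
        int jend = jt + TJ < NY - 2 ? jt + TJ : NY - 2;
        for (int n = 1; n <= loopnum; n++)
            for (int k = 0; k < NZ; k++)
                for (int j = jt; j < jend; j++)
                    for (int i = 0; i < NX; i++)
                        a[k][j][i] = (u[k][j+2][i] + u[k][j+1][i] - u[k][j][i]
                                    + u[k][j-1][i] - u[k][j-2][i]) * 1.3 * n;
    }
}
```

Because `u` is never written, reordering the tile and repeat loops changes only the memory access pattern, not the answer: both versions produce identical results.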
Tiling J-Stencil Cache Misses

[Figure: J-loop L1 and L2 cache misses (occurrences x 1000, 0–800) vs. data size (0–300 KB)]
Tiling J-Stencil FLOP Results

[Figure: Pentium III J-loop MFLOPS (0–350) vs. problem data size (0–50 KB)]
ARPI Solution Order

DO bigstep = 1, total_number_big_steps
  Update turbulent kinetic energy
  Update potential temperature
  Update moisture variables and conversion
  Compute static small time step forcing for u-v-w-p (advection, mixing, buoyancy)
  DO smallstep = 1, small_steps_per_big_step
    DO ktile = ktile_start, ktile_end, ktile_incr
      Update horizontal velocities (u-v)
      Update vertical velocity (w) and pressure (p)
    END DO ! k tile loop
  END DO   ! iterate small time step
END DO     ! iterate large time step

Challenge: devise a method to implement loop tiling limits
Example: Final U Velocity Calculation

DO k = kbgn,kend
  DO j = jbgn,jend
    DO i = ubgn,uend          ! note: ptforce is cp*avgx(ptrho)
      u(i,j,k,3) = u(i,j,k,3) + dtsml1*( uforce(i,j,k)
     :    - ptforce(i,j,k)*dxinv*(pprt(i,j,k,3)-pprt(i-1,j,k,3)) )
    END DO
  END DO
END DO

Note: 3-D tiled loop limits
Need to string several loops together to achieve data reuse
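"Stringing several loops together" is the loop-merging idea from the single-processor techniques list: fuse consecutive sweeps so each array element is loaded once per pass through memory instead of once per loop. A C sketch with illustrative arrays:

```c
#define N 1000

/* Separate passes: a[] is streamed through memory twice. */
void two_pass(double *a, const double *f, const double *g) {
    for (int i = 0; i < N; i++) a[i] += f[i];
    for (int i = 0; i < N; i++) a[i] *= g[i];
}

/* Merged: one sweep; a[i] stays in a register between the two updates,
 * halving the memory traffic on a[]. */
void one_pass(double *a, const double *f, const double *g) {
    for (int i = 0; i < N; i++) a[i] = (a[i] + f[i]) * g[i];
}
```

The two versions perform the same arithmetic in the same order per element, so results are identical; only the number of memory sweeps changes.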
ARPI Solvers Tile Results

SOLVER                # of     # of mesh points/   Memory Req. (KB)*       MFLOPS
                      Arrays   256KB Cache         No Tiling   Tiled   No Tiling   Tiled
U-V Only                 9         7111               4410      180      115.7     116.3
W-P Only                15         4266               7350      150       79.4      92.7
U-V-W-P                 19         3368               9310      190       91.3     105
Prep Small Time Step    28         2285              13720      280       42.2      51.1

* Total data/memory size; peak computational rate is 700 MFLOPS. The Portland compiler was used with the -fast option.
ARPI Solver – Loop Analysis

SOLVER                 Array reuse/total #   #3-D arrays/# different   #3-D   FPI/mesh   MFLOPS
                       of arrays in R.H.S.   arrays reused in R.H.S.   loops  point      (P3/700)
Turbulence                  437/486                31/29                 67     365         73
Solve Temp./Moisture        610/707                43/38                116     810         93
Prep U-V-W-P                343/391                29/26                 66     313         92
Prep small time step*        23/44                 28/16                 10      35         40
Solve U-V*                   21/30                  9/3                   2      36        115
Solve W-P*                   28/42                 15/8                   6      46         79
Total                         -/-                  80/-                 267    1605         75

* = tiled in the present ARPI code
Tiling Impact Summary
Performance (FLOP rating) of scalar architecture:
– linked to the length of the innermost loop
– larger inner loop ranges utilize data in the L1/L2 cache more efficiently – similar to VECTOR architecture behavior!
Simple J and K loop performance:
– >40% of peak for problem data sizes < L2 cache
Forecast model improvements:
– 10-25%, so far…
– Tiling the most promising components (most array reuse: advection, turbulence and smoothing) is under development
– Difficult to implement (more on this later…)
Multi-Processor Optimizations
* Fake zone expansion to reduce the number of intermediate messages (latency and bandwidth)
* Reduce the number of final variable update messages (latency and bandwidth)
Reduce the size of the messages (bandwidth)
* Hide message latency via calculations (latency and bandwidth)
Fake Zone Expansion
* Design/redesign code to reduce the number of intermediate messages (latency and bandwidth)

[Diagram: 1-D domain decomposition in the x-direction; Processor 0 and Processor 1 hold overlapping boundary mesh points, with the overlap width set by the 2nd- or 4th-order stencil]

Expanding the internal boundary zones from 1 to 2 mesh points removes the need to send messages for advection, turbulence, and numerical diffusion (2nd and 4th order cases only)
Calculations are faster than message passing
Message Grouping
Combine sends/receives into one message to reduce latency/overhead

DO bigstep = 1, total_number_big_steps
  Update turbulent kinetic energy
  Update potential temperature
  Update moisture conversion
  SEND/RECEIVE (TKE, PT, MOISTURE)   (5)
  DO smallstep = 1, small_steps_per_big_step
    Update horizontal velocities (u-v)
    SEND/RECEIVE (U,V)               (2)
    Update vertical velocity (w) and pressure (p)
    SEND/RECEIVE (W,P)               (2)
  END DO ! iterate small time step
END DO   ! iterate large time step

RESULT: ONLY 3 SEND/RECEIVE INSTANCES
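The grouping idea – fewer, larger messages – reduces the number of per-message latency charges by packing several variables' boundary data into one contiguous buffer before a single send. A C sketch of the pack/unpack step (buffer layout and sizes are illustrative assumptions; with MPI, the packed buffer would then go out in one send instead of several):

```c
#include <string.h>

#define NHALO 100   /* halo points per variable (illustrative)  */
#define NVARS 3     /* e.g. TKE, PT, moisture                   */

/* Pack each variable's halo into one contiguous buffer: one message
 * instead of NVARS messages, paying the per-message latency once. */
void pack_halos(double *buf, double **vars) {
    for (int v = 0; v < NVARS; v++)
        memcpy(buf + v * NHALO, vars[v], NHALO * sizeof(double));
}

/* Reverse on the receiving side. */
void unpack_halos(double **vars, const double *buf) {
    for (int v = 0; v < NVARS; v++)
        memcpy(vars[v], buf + v * NHALO, NHALO * sizeof(double));
}
```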
Message Grouping Results

[Figure: normalized time relative to the 1-processor case (0.5–2) vs. number of processors (1–256), DSM single-variable pass vs. DSM multi-variable pass, on the NCSA Balder Origin 2000]
Message Hiding
Initiate non-blocking sends/receives and compute during the MPI operations
Masks communication time with computation time
Gain is limited to the amount of calculations during the MPI operations

DO bigstep = 1, total_number_big_steps
  Update turbulent kinetic energy
  Update potential temperature
  Update moisture/conversion
  INITIATE NON-BLOCKING SEND (TKE, PT, MOISTURE)  (5)
  DO smallstep = 1, small_steps_per_big_step
    Update horizontal velocities (u-v)
    SEND/RECEIVE (U,V)  (2)
    Update vertical velocity (w) and pressure (p)
    SEND/RECEIVE (W,P)  (2)
  END DO ! iterate small time step
  Final TKE, PT, MOISTURE RECEIVE…
END DO   ! iterate large time step

RESULT: hide the TKE, PT, MOISTURE send/receives behind the U,V,W,P computations; this method can also be applied to the small time step… no results yet, bugs…
ARPI Message Passing Analysis

Number of message passing events per processor:

Solver                             Unoptimized   Method #1: Fake Mesh    Method #2:
                                                 Point Expansion         Message Grouping
Advection (4th order)                   36              0                      0
Computational mixing (4th order)        16              0                      0
Turbulent mixing                        28              0                      0
Update variables                         9              9                      3
Total                                   89              9                      3
TopDawg Benchmarks

[Figure: ARPI benchmark weak scaling test – normalized time (0–8) vs. number of processors (0–900), normalized by the 2-processor case with and without I/O]

Zero slope = perfect scaling
Work in Progress
Debug the tiling of the big time step solvers
Debug the message hiding code
Approximately 2 person-years spent on optimization efforts
Acknowledgements
Computer support for PAPI (Scott Hill)
PAPI Developers/Software
A BIG thanks to OSCER!!!

A copy of this presentation can be found at: http://www.oscer.ou.edu
or email: [email protected]
Thank you for your attention!
Weak vs Strong Scaling
Weak scaling: vary the problem size by adding processors that each perform the same amount of work – goal: keep the wall clock time constant
– e.g. NWP applications, since we always need to increase the resolution by adding processors compared to a coarser resolution forecast, while keeping the wall clock time constant
Strong scaling: add processors within a fixed problem size, so each processor performs less work as processors are added
– e.g. Monte Carlo simulations, where additional processors yield more samples and therefore more accurate results
Two Approaches to Obtaining Optimized Software
Existing code:
– Retrofit to include tiling – loop modification with potentially "hard wired" code (current work)
– Rewrite existing code from scratch to include general tiling capabilities
Build code with tile functionality from the top down, with n tiles per processor:
– Build in n fake zones to remove the need for updating at the end of each time step, updating instead at the end of n time steps… (remember: calculations are cheaper than communications)