TRANSCRIPT
Numerical Weather Prediction Model Optimization Update
Daniel B. Weber and Henry J. Neeman
Center for Analysis and Prediction of Storms
University of Oklahoma
Motivation: May 3, 1999 Tornado
Copyright 1999 The Daily Oklahoman
Improve warning times
OSCER 2006
Computer resource estimate (1-km mesh)
5500 x 3600 x 100 grid points x 3500 calc/point = 6.9 TFLOPS
IA-32 based 3 GHz Pentium 4 provides a peak of 6 GFLOPS/processor
Requires 1155 processors assuming perfect CPU utilization and network
Continental US Thunderstorm Prediction (ARPS)
OSCER 2005 Symposium
Several approaches for optimization:
– Single processor
– Parallel processor
– Optimization basics
This year, more details…
Why Optimize?
Top 500 List

[Figure: peak system performance on the Top 500 list (TFLOPS, log scale 0.1–1000) vs. year, 1993–2005]
Technology has changed!
ARPS Single Processor Performance

[Figure: actual vs. peak performance (MFLOPS, 0–8000) for Intel P3 1 GHz, Intel P4 2 GHz, Intel Itanium 0.8 GHz, EV-67 1 GHz, SGI O3000 0.4 GHz, IBM Power4 1.3 GHz, and NEC SX-5]
Two Options for Improving Code Performance
– Build faster, more efficient computers (expensive)
– Optimize the software to run efficiently on all computing platforms
Optimization Goals
Focus efforts on commodity-based computers to achieve our science goal; we have no choice…
Keep the code easy to read, important for code maintenance and further development
Software Optimization
Existing codes are not designed to run efficiently on scalar technology
Is it worth the effort to convert an existing computational code to a new way of code structure/computing?
Software Application
ARPS (Advanced Regional Prediction System) thunderstorm prediction model
Research version of ARPS (ARPI)
CFD code – Navier-Stokes equations solved on a finite grid/mesh
Results can be applied to other models, etc.
Code Analysis
Profile the code (PAPI, Speedshop, Perfex, Apprentice)
Find the computationally intensive parts
Obtain platform information
Key difference for scientists:
– We are doing the work this time!
– Optimization not required on vector hardware!
Instrument the Code

Process           Seconds   Percent of Total
--------------------------------------------
Initialization      1.25        13.2
Turbulence          1.97        20.9
Advect u,v,w        0.28         3.0
Advect scalars      0.42         4.4
UV solver           0.81         8.6
WP solver           1.61        17.1
PT solver           0.03         0.3
Qv,c,r solver       1.07        11.3
Buoyancy            0.06         0.6
Coriolis            0.00         0.0
Comp. mixing        1.03        10.9
Message passing     0.00         0.0
Miscellaneous       0.13         1.3
--------------------------------------------
Total Time          9.44       100.0
Generic Optimization Strategy
Four issues:
– Memory bound (large number of ref/calc)
– Compute bound (large number of calc/ref)
– Message bound (waiting for messages)
– I/O bound (waiting for return from file I/O)
Memory references are more expensive than calculations
Optimization Review: Single Processor
* Identify the computationally intensive components
* Reduce memory references and improve cache reuse (more on this later)
* Reduce calculations and instructions (merge loops)
Compiler optimizations
Single Processor Optimization Techniques
Traditional compiler option selections
Removing divides (strength reduction)
Removing unnecessary memory references and calculations
Loop merging
Hardware-specific optimization
Loop collapsing (vector architecture)
Cache optimization (tiling)
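Removing divides is the classic strength-reduction step: the reciprocal is computed once and reused as a multiply inside the hot loop, since a divide costs many more cycles than a multiply on most scalar hardware. A small C sketch with illustrative array names:

```c
#define N 1000

/* Naive: one divide per iteration. */
void scale_div(double *a, const double *b, double dx) {
    for (int i = 0; i < N; i++)
        a[i] = b[i] / dx;
}

/* Strength-reduced: hoist the divide out as a reciprocal. */
void scale_mul(double *a, const double *b, double dx) {
    double dxinv = 1.0 / dx;   /* one divide, then N multiplies */
    for (int i = 0; i < N; i++)
        a[i] = b[i] * dxinv;
}
```

ARPI's own loops use the same idea, e.g. the precomputed `dxinv` factor in the final u-velocity update shown later in this talk.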
Tiling
Tiling is the process by which the original domain of computation is split into smaller sections that can fit into the top-level cache (usually L2)
The goal is to tune the application to fit the tile region within the cache of the selected hardware and achieve enhanced data reuse and application performance – accessing L2 is much faster than main memory
Tiling requires changing loop limits over a series of loops to perform calculations on the sub-domain (maximize data reuse = minimize memory fetches)
Used PAPI to access the performance counters on my Dell Pentium 3 laptop (2 hardware counters)
ARPI Memory Requirements
ARPI contains 75 3-D arrays (per processor); other forecast models use much more (> 2x)
A typical forecast sub-domain (per processor) has on the order of 103x53x53 grid/mesh points (~86+ MB)
Result: ARPI arrays will not fit into any current or near-future cache system…
Tiling Example: J-Stencil (adjust loop limit size)

DO n = 1,loopnum              ! loopnum = 80
  DO k = 1,nz
    DO j = 3,ny-2             ! j-stencil calculation
      DO i = 1,nx
        a(i,j,k) = (u(i,j+2,k)+u(i,j+1,k)-u(i,j,k)
     :             +u(i,j-1,k)-u(i,j-2,k))*1.3*n
      END DO
    END DO
  END DO
END DO                        ! sample computation
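The same stencil can be tiled in j so that each strip of `u` stays cache-resident across all repeats of the outer loop. A C sketch of the transformation; the tile width `TJ` is an assumed tuning parameter, not a value from the slides:

```c
#define NX 32
#define NY 64
#define NZ 8
#define TJ 8    /* assumed j-tile width, tuned to fit cache */

/* Untiled sweep, mirroring the Fortran example: repeat the whole
 * j-stencil 'loopnum' times over the full domain. */
void sweep(double a[NZ][NY][NX], const double u[NZ][NY][NX], int loopnum) {
    for (int n = 1; n <= loopnum; n++)
        for (int k = 0; k < NZ; k++)
            for (int j = 2; j < NY - 2; j++)
                for (int i = 0; i < NX; i++)
                    a[k][j][i] = (u[k][j+2][i] + u[k][j+1][i] - u[k][j][i]
                                + u[k][j-1][i] - u[k][j-2][i]) * 1.3 * n;
}

/* Tiled sweep: a j-tile loop is hoisted outermost, so each strip of u
 * stays in cache across all 'loopnum' repeats before moving on. */
void sweep_tiled(double a[NZ][NY][NX], const double u[NZ][NY][NX], int loopnum) {
    for (int jt = 2; jt < NY - 2; jt += TJ) {
        int jend = jt + TJ < NY - 2 ? jt + TJ : NY - 2;
        for (int n = 1; n <= loopnum; n++)
            for (int k = 0; k < NZ; k++)
                for (int j = jt; j < jend; j++)
                    for (int i = 0; i < NX; i++)
                        a[k][j][i] = (u[k][j+2][i] + u[k][j+1][i] - u[k][j][i]
                                    + u[k][j-1][i] - u[k][j-2][i]) * 1.3 * n;
    }
}
```

Because `u` is never written, reordering the tile and repeat loops changes only the memory access pattern, not the answer: both versions produce identical results.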
Tiling J-Stencil Cache Misses

[Figure: J-loop L1 and L2 cache misses (occurrences x 1000, 0–800) vs. data size (0–300 KB)]
Tiling J-Stencil FLOP Results

[Figure: Pentium III J-loop MFLOPS (0–350) vs. problem data size (0–50 KB)]
ARPI Solution Order

DO bigstep = 1, total_number_big_steps
  Update turbulent kinetic energy
  Update potential temperature
  Update moisture variables and conversion
  Compute static small time step forcing for u-v-w-p (advection, mixing, buoyancy)
  DO smallstep = 1, small_steps_per_big_step
    DO ktile = ktile_start, ktile_end, ktile_incr
      Update horizontal velocities (u-v)
      Update vertical velocity (w) and pressure (p)
    END DO ! k tile loop
  END DO   ! iterate small time step
END DO     ! iterate large time step

Challenge: devise a method to implement loop tiling limits
Example: Final U Velocity Calculation

DO k = kbgn,kend
  DO j = jbgn,jend
    DO i = ubgn,uend          ! note: ptforce is cp*avgx(ptrho)
      u(i,j,k,3) = u(i,j,k,3) + dtsml1*( uforce(i,j,k)
     :    - ptforce(i,j,k)*dxinv*(pprt(i,j,k,3)-pprt(i-1,j,k,3)) )
    END DO
  END DO
END DO

Note: 3-D tiled loop limits
Need to string several loops together to achieve data reuse
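"Stringing several loops together" is the loop-merging idea from the single-processor techniques list: fuse consecutive sweeps so each array element is loaded once per pass through memory instead of once per loop. A C sketch with illustrative arrays:

```c
#define N 1000

/* Separate passes: a[] is streamed through memory twice. */
void two_pass(double *a, const double *f, const double *g) {
    for (int i = 0; i < N; i++) a[i] += f[i];
    for (int i = 0; i < N; i++) a[i] *= g[i];
}

/* Merged: one sweep; a[i] stays in a register between the two updates,
 * halving the memory traffic on a[]. */
void one_pass(double *a, const double *f, const double *g) {
    for (int i = 0; i < N; i++) a[i] = (a[i] + f[i]) * g[i];
}
```

The two versions perform the same arithmetic in the same order per element, so results are identical; only the number of memory sweeps changes.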
ARPI Solvers Tile Results

SOLVER                # of     # of mesh points/   Memory Req. (KB)*       MFLOPS
                      Arrays   256KB Cache         No Tiling   Tiled   No Tiling   Tiled
U-V Only                 9         7111               4410      180      115.7     116.3
W-P Only                15         4266               7350      150       79.4      92.7
U-V-W-P                 19         3368               9310      190       91.3     105
Prep Small Time Step    28         2285              13720      280       42.2      51.1

* Total data/memory size; peak computational rate is 700 MFLOPS. The Portland compiler was used with the -fast option.
ARPI Solver – Loop Analysis

SOLVER                 Array reuse/total #   #3-D arrays/# different   #3-D   FPI/mesh   MFLOPS
                       of arrays in R.H.S.   arrays reused in R.H.S.   loops  point      (P3/700)
Turbulence                  437/486                31/29                 67     365         73
Solve Temp./Moisture        610/707                43/38                116     810         93
Prep U-V-W-P                343/391                29/26                 66     313         92
Prep small time step*        23/44                 28/16                 10      35         40
Solve U-V*                   21/30                  9/3                   2      36        115
Solve W-P*                   28/42                 15/8                   6      46         79
Total                         -/-                  80/-                 267    1605         75

* = tiled in the present ARPI code
Tiling Impact Summary
Performance (FLOP rating) of scalar architecture:
– linked to the length of the innermost loop
– larger inner loop ranges utilize data in the L1/L2 cache more efficiently – similar to VECTOR architecture behavior!
Simple J and K loop performance:
– >40% of peak for problem data sizes < L2 cache
Forecast model improvements:
– 10-25%, so far…
– Tiling the most promising components (most array reuse: advection, turbulence and smoothing) is under development
– Difficult to implement (more on this later…)
Multi-Processor Optimizations
* Fake zone expansion to reduce the number of intermediate messages (latency and bandwidth)
* Reduce the number of final variable update messages (latency and bandwidth)
Reduce the size of the messages (bandwidth)
* Hide message latency via calculations (latency and bandwidth)
Fake Zone Expansion
* Design/redesign code to reduce the number of intermediate messages (latency and bandwidth)

[Diagram: 1-D domain decomposition in the x-direction; Processor 0 and Processor 1 hold overlapping boundary mesh points, with the overlap width set by the 2nd- or 4th-order stencil]

Expanding the internal boundary zones from 1 to 2 mesh points removes the need to send messages for advection, turbulence, and numerical diffusion (2nd and 4th order cases only)
Calculations are faster than message passing
Message Grouping
Combine sends/receives into one message to reduce latency/overhead

DO bigstep = 1, total_number_big_steps
  Update turbulent kinetic energy
  Update potential temperature
  Update moisture conversion
  SEND/RECEIVE (TKE, PT, MOISTURE)   (5)
  DO smallstep = 1, small_steps_per_big_step
    Update horizontal velocities (u-v)
    SEND/RECEIVE (U,V)               (2)
    Update vertical velocity (w) and pressure (p)
    SEND/RECEIVE (W,P)               (2)
  END DO ! iterate small time step
END DO   ! iterate large time step

RESULT: ONLY 3 SEND/RECEIVE INSTANCES
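The grouping idea – fewer, larger messages – reduces the number of per-message latency charges by packing several variables' boundary data into one contiguous buffer before a single send. A C sketch of the pack/unpack step (buffer layout and sizes are illustrative assumptions; with MPI, the packed buffer would then go out in one send instead of several):

```c
#include <string.h>

#define NHALO 100   /* halo points per variable (illustrative)  */
#define NVARS 3     /* e.g. TKE, PT, moisture                   */

/* Pack each variable's halo into one contiguous buffer: one message
 * instead of NVARS messages, paying the per-message latency once. */
void pack_halos(double *buf, double **vars) {
    for (int v = 0; v < NVARS; v++)
        memcpy(buf + v * NHALO, vars[v], NHALO * sizeof(double));
}

/* Reverse on the receiving side. */
void unpack_halos(double **vars, const double *buf) {
    for (int v = 0; v < NVARS; v++)
        memcpy(vars[v], buf + v * NHALO, NHALO * sizeof(double));
}
```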
Message Grouping Results

[Figure: normalized time relative to the 1-processor case (0.5–2) vs. number of processors (1–256), DSM single-variable pass vs. DSM multi-variable pass, on the NCSA Balder Origin 2000]
Message Hiding
Initiate non-blocking sends/receives and compute during the MPI operations
Masks communication time with computation time
Gain is limited to the amount of calculations during the MPI operations

DO bigstep = 1, total_number_big_steps
  Update turbulent kinetic energy
  Update potential temperature
  Update moisture/conversion
  INITIATE NON-BLOCKING SEND (TKE, PT, MOISTURE)  (5)
  DO smallstep = 1, small_steps_per_big_step
    Update horizontal velocities (u-v)
    SEND/RECEIVE (U,V)  (2)
    Update vertical velocity (w) and pressure (p)
    SEND/RECEIVE (W,P)  (2)
  END DO ! iterate small time step
  Final TKE, PT, MOISTURE RECEIVE…
END DO   ! iterate large time step

RESULT: hide the TKE, PT, MOISTURE send/receives behind the U,V,W,P computations; this method can also be applied to the small time step… no results yet, bugs…
ARPI Message Passing Analysis

Number of message passing events per processor:

Solver                             Unoptimized   Method #1: Fake Mesh    Method #2:
                                                 Point Expansion         Message Grouping
Advection (4th order)                   36              0                      0
Computational mixing (4th order)        16              0                      0
Turbulent mixing                        28              0                      0
Update variables                         9              9                      3
Total                                   89              9                      3
TopDawg Benchmarks

[Figure: ARPI benchmark weak scaling test – normalized time (0–8) vs. number of processors (0–900), normalized by the 2-processor case with and without I/O]

Zero slope = perfect scaling
Work in Progress
Debug the tiling of the big time step solvers
Debug the message hiding code
Approximately 2 person-years spent on optimization efforts
Acknowledgements
Computer support for PAPI (Scott Hill)
PAPI Developers/Software
A BIG thanks to OSCER!!!

A copy of this presentation can be found at: http://www.oscer.ou.edu
or email: [email protected]
Thank you for your attention!
Weak vs Strong Scaling
Weak scaling: vary the problem size by adding processors that each perform the same amount of work – goal: keep the wall clock time constant
– e.g. NWP applications, since we always need to increase the resolution by adding processors compared to a coarser resolution forecast, while keeping the wall clock time constant
Strong scaling: add processors within a fixed problem size, so each processor performs less work as processors are added
– e.g. Monte Carlo simulations, where additional processors yield more samples and therefore more accurate results
Two Approaches to Obtaining Optimized Software
Existing code:
– Retrofit to include tiling – loop modification with potentially "hard wired" code (current work)
– Rewrite existing code from scratch to include general tiling capabilities
Build code with tile functionality from the top down, with n tiles per processor:
– Build in n fake zones to remove the need for updating at the end of each time step, updating instead at the end of n time steps… (remember: calculations are cheaper than communications)