
Scalability Workshop ECMWF, Reading, UK April 14-15, 2014

Running operational Canadian NWP models and assimilation systems on next-generation supercomputers

Mark Buehner, Michel Desgagné, Abdessamad Qaddouri, Janusz Pudykiewicz, Michel Valin, Martin Charron
Meteorological Research Division

The GEM model

1. Grid point lat/lon model
2. Finite differences on an Arakawa-C grid
3. Semi-Lagrangian (poles are an issue)
4. Implicit time discretization
   1. Direct solver (Nk 2D horizontal elliptic problems)
   2. Full 3D iterative solver based on FGMRES
5. Global uniform, Yin-Yang and LAM configurations
6. Hybrid MPI/OpenMP
   1. Halo exchanges
   2. Array transposes for elliptic problems
7. PE block partitioning for I/O

Global uniform grid:

1) a challenge for DM implementation
2) many more elliptic problems to solve due to implicit horizontal diffusion (transposes)
3) semi-Lagrangian near the poles
4) current DM implementation will not scale

Yin-Yang grid configuration

• Implemented as 2 LAMs communicating at boundaries

• Optimized Schwarz iterative method for solving the elliptic problem.

• Scales much better than the global uniform grid

• Operational implementation due in spring 2015

• Communications are an issue

• Trades one scalability problem (the poles of the global uniform grid) for another

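Yin-Yang 10 km scalability

Ni=3160, Nj=1078, Nk=158

H = CMC/hadar: IBM Power7
Y = NCAR/yellowstone: IBM iDataPlex

[Figure: scalability plot. Run labels: H960, H1920, H3200, H5056; Y3200, Y5056, Y8192, Y16384, Y30968. Annotations on the plot, given as paired values: 16x30x1 and 198x36; 32x30x1 and 99x36; 40x40x1 and 79x27; 79x32x1 and 40x34; 32x32x4 and 99x34; 32x32x8 and 99x34; 79x49x4 and 40x22.]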
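The paired annotations on this slide are consistent with a simple reading, offered here as an interpretation rather than something the slide states: the first triple is the MPI topology in x and y times the number of OpenMP threads for each of the two panels, and the second pair is the resulting local sub-domain size. The sketch below checks that arithmetic, assuming Ni and Nj are the dimensions of one panel and that each tile holds the ceiling of N divided by the number of PEs. The function name `layout` is hypothetical.

```python
import math

NI, NJ = 3160, 1078  # horizontal grid dimensions quoted on the slide


def layout(npex, npey, threads):
    """Return (total cores, local ni, local nj) for one run.

    Assumptions (not stated on the slide): total cores are
    2 panels x npex x npey x threads, and each MPI tile holds
    ceil(N / npe) points in each direction.
    """
    cores = 2 * npex * npey * threads
    return cores, math.ceil(NI / npex), math.ceil(NJ / npey)


for topo in [(16, 30, 1), (32, 30, 1), (40, 40, 1), (79, 32, 1),
             (32, 32, 4), (32, 32, 8), (79, 49, 4)]:
    print(topo, layout(*topo))
```

Under these assumptions every triple reproduces one of the core counts in the run labels and the sub-domain size printed next to it, which suggests the largest run works on tiles of only 40x22 points per MPI process.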

Yin-Yang 10 km scalability: dynamics components

The future of GEM

• Yin-Yang 2km on order 100K cores is already feasible on P7 processors or similar

• Yin-Yang exchanges will need work

• Using GPU capabilities is on the table

• Improve OpenMP scalability

• Re-partition MPI sub-domain (bni,bnj,ni,nj,nk)

• Export SL interpolations to reduce halo size

• Processor mapping to reduce the need to communicate through the switch

• Partition NK

• MIMD approach for I/O

Investigating scalability and accuracy on an icosahedral geodesic grid

• Spatial discretization: finite volume method on icosahedral geodesic grid

• Time discretization: exponential integration methods, which resolve high frequencies to the required level of tolerance without severe time step restriction

• Shallow water implementation already shows great scalability

• Vertical coordinates: generalized quasi-Lagrangian with conservative monotonic remapping

Unstable jet, icosahedral grid number 6, dx= 112 km, dt=7200 sec (typically 30 sec)
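To illustrate why an exponential integrator can take a 7200 s step where an explicit scheme cannot, here is a toy sketch on a single linear oscillator. It is not the scheme of the cited papers: the 600 s wave period is a hypothetical value, and the matrix exponential is written in closed form as a rotation, which is only possible because the toy system is a 2x2 skew-symmetric matrix.

```python
import numpy as np

omega = 2 * np.pi / 600.0          # hypothetical fast wave, 600 s period
A = np.array([[0.0, -omega],
              [omega, 0.0]])       # du/dt = A u, a pure oscillation


def exp_step(u, dt):
    """Advance exactly by exp(dt*A), which is a rotation by omega*dt."""
    c, s = np.cos(omega * dt), np.sin(omega * dt)
    return np.array([[c, -s], [s, c]]) @ u


def euler_step(u, dt):
    """Explicit forward Euler step, for comparison."""
    return u + dt * (A @ u)


u0 = np.array([1.0, 0.0])
dt = 7200.0                        # time step quoted on the slide
u_exp, u_eul = u0, u0
for _ in range(10):
    u_exp = exp_step(u_exp, dt)
    u_eul = euler_step(u_eul, dt)

print("exponential step amplitude:", np.linalg.norm(u_exp))
print("forward Euler amplitude:   ", np.linalg.norm(u_eul))
```

The exponential step keeps the amplitude at 1 for any step size, while forward Euler multiplies it by sqrt(1 + (omega*dt)^2) on every step and blows up. In a full model the exponential is not available in closed form and has to be approximated to a chosen tolerance.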

Pudykiewicz J. (2006), J. Comp. Phys., 213, pp 358-390
Pudykiewicz J. (2011), J. Comp. Phys., 230, pp 1956-1991
Qaddouri A., J. et al. (2012), Q. J. Roy. Met. Soc., 138, pp 989-1003
Clancy C., Pudykiewicz J. (2013), Tellus A, vol. 65

– April-16-14

Ensemble-Variational Assimilation: EnVar

• EnVar will replace 4D-Var at Environment Canada in 2014 for both global and regional deterministic prediction systems

• EnVar uses a variational assimilation approach in combination with the already available 4D ensemble covariances from the EnKF

• By using 4D ensembles, EnVar performs a 4D analysis without need of tangent-linear/adjoint of forecast model

• Consequently, it is more computationally efficient and easier to maintain/adapt than 4D-Var:

– EnVar: ~10 min, 320 cores, 50 km grid spacing
– 4D-Var: ~1 hr, 640 cores, 100 km grid spacing

• Future improvements to EnKF will benefit both ensemble and deterministic forecasts, which gives an incentive to increase overall effort on EnKF development

• EnKF scalability examined by Houtekamer et al. (2014, MWR)
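The efficiency claim can be made concrete with a back-of-the-envelope calculation from the figures quoted above. This is a rough sketch only: it treats the "~" values as exact, measures cost in core-minutes, and assumes that halving the grid spacing quadruples the number of horizontal grid points.

```python
def core_minutes(minutes, cores):
    """Wall-clock minutes times cores used."""
    return minutes * cores


envar = core_minutes(10, 320)      # ~10 min on 320 cores
fourdvar = core_minutes(60, 640)   # ~1 hr on 640 cores

cost_ratio = fourdvar / envar      # how many times cheaper EnVar is
point_ratio = (100 / 50) ** 2      # horizontal points, 50 km vs 100 km spacing

print(envar, fourdvar, cost_ratio, point_ratio)
```

On these numbers EnVar uses about a twelfth of the core-minutes while working on roughly four times as many horizontal grid points.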


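EnVar formulation

• In 4D-Var the 3D analysis increment is evolved in time using the TL/AD forecast model (here included in $\mathbf{H}_{4D}$):

$$J(\Delta\mathbf{x}) = \frac{1}{2}\left(H_{4D}[\mathbf{x}_b] + \mathbf{H}_{4D}\Delta\mathbf{x} - \mathbf{y}\right)^T \mathbf{R}^{-1} \left(H_{4D}[\mathbf{x}_b] + \mathbf{H}_{4D}\Delta\mathbf{x} - \mathbf{y}\right) + \frac{1}{2}\Delta\mathbf{x}^T \mathbf{B}^{-1} \Delta\mathbf{x}$$

• In EnVar the background-error covariances and increment are explicitly 4-dimensional, resulting in cost function:

$$J(\Delta\mathbf{x}_{4D}) = \frac{1}{2}\left(H_{4D}[\mathbf{x}_b] + \mathbf{H}\Delta\mathbf{x}_{4D} - \mathbf{y}\right)^T \mathbf{R}^{-1} \left(H_{4D}[\mathbf{x}_b] + \mathbf{H}\Delta\mathbf{x}_{4D} - \mathbf{y}\right) + \frac{1}{2}\Delta\mathbf{x}_{4D}^T \mathbf{B}_{4D}^{-1} \Delta\mathbf{x}_{4D}$$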
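Both cost functions on this slide share the same quadratic form once the innovation d = y - H4D[xb] is introduced; they differ in which linear operator and which covariance are used. The following is a minimal sketch of that form on a tiny dense problem, with random matrices standing in for the real operators. All names and sizes are hypothetical, and the operational systems do not form or invert B explicitly.

```python
import numpy as np


def cost(dx, d, H, R_inv, B_inv):
    """J(dx) = 1/2 (H dx - d)^T R^-1 (H dx - d) + 1/2 dx^T B^-1 dx."""
    r = H @ dx - d
    return 0.5 * r @ R_inv @ r + 0.5 * dx @ B_inv @ dx


def grad(dx, d, H, R_inv, B_inv):
    """Gradient of J, valid for symmetric R_inv and B_inv."""
    return H.T @ R_inv @ (H @ dx - d) + B_inv @ dx


rng = np.random.default_rng(0)
n, p = 6, 3                                  # toy state and observation sizes
A = rng.standard_normal((n, n))
B = A @ A.T + n * np.eye(n)                  # symmetric positive definite stand-in
H = rng.standard_normal((p, n))              # linear observation operator
R_inv = np.eye(p) / 0.5                      # observation-error variance 0.5
B_inv = np.linalg.inv(B)
d = rng.standard_normal(p)                   # innovation y - H[x_b]

# The minimizer solves (H^T R^-1 H + B^-1) dx = H^T R^-1 d.
dx_star = np.linalg.solve(H.T @ R_inv @ H + B_inv, H.T @ R_inv @ d)

print(cost(np.zeros(n), d, H, R_inv, B_inv), cost(dx_star, d, H, R_inv, B_inv))
```

For EnVar, dx would be the stacked 4D increment and B the 4D ensemble covariance, so no tangent-linear or adjoint model appears in H.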


2013-2017: Toward a Reorganization of the NWP Suites at Environment Canada

Current operational systems

[Diagram of the suites. Global system: Global EnKF, Global ensemble forecasts (GEPS), Global 4D-Var, Global deterministic forecast (GDPS). Regional system: Regional ensemble forecasts (REPS), Regional 4D-Var, Regional deterministic forecast (RDPS).]


2014 implementation: 4D-Var replaced by EnVar

[Diagram of the suites. Global system: Global EnKF, Global ensemble forecasts (GEPS), Global EnVar, Global deterministic forecast (GDPS). Regional system: Regional ensemble forecasts (REPS), Regional EnVar, Regional deterministic forecast (RDPS). One link in the diagram is labelled "Background error covariances".]


4D error covariances: temporal covariance evolution

• EnVar (4D ensemble covariance): 256 pre-computed forecast model integrations provide 4D background-error covariances

• 4D-Var: ~70 TL and AD model integrations performed as part of assimilation

[Figure: time axis marked -3h, 0h, +3h, with the label “analysis time”.]


Spatial covariance localization in EnVar (Buehner 2005, Lorenc 2003)

• Complexity of 4D-Var is replaced by relatively simple use of 4D ensemble covariances to compute analysis increment and gradient during each iteration of the cost function minimization

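• If the spatial localization function is horizontally homogeneous and isotropic, then a spectral approach is efficient (for a global lat-lon system):

$$\Delta\mathbf{x}_{4D} = \sum_k \mathbf{e}_k^{4D} \circ \left(\mathbf{L}^{1/2}\boldsymbol{\xi}_k\right) = \sum_k \mathbf{e}_k^{4D} \circ \left(\mathbf{S}^{-1}\mathbf{L}_v^{1/2}\mathbf{L}_h^{1/2}\boldsymbol{\xi}_k\right)$$

where:

• $\mathbf{e}_k^{4D}$ is the kth 4D ensemble deviation
• $\boldsymbol{\xi}_k$ is the corresponding control vector
• $\mathbf{S}^{-1}$ is the inverse spectral transform
• $\mathbf{L}_h = \mathbf{S}\mathcal{L}_h\mathbf{S}^T$ is the diagonal spectral horizontal localization matrix, where $\mathcal{L}_h$ denotes the horizontal localization matrix in grid-point space

• Equivalent to using the square root of the localized sample covariance matrix: $\mathbf{L} \circ \left[\sum_k \mathbf{e}_k (\mathbf{e}_k)^T\right]$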
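The equivalence stated in the last bullet can be checked numerically on a small problem. The sketch below is a toy under several assumptions: a 1D grid of 40 points, 8 members, a Gaussian localization function with a hypothetical length scale of 4 grid points, and a dense eigendecomposition for the square root of L in place of the spectral transform used operationally.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 40, 8                       # toy sizes: grid points, ensemble members

# Ensemble deviations e_k, scaled so that sum_k e_k e_k^T is the sample covariance.
ens = rng.standard_normal((K, n))
E = (ens - ens.mean(axis=0)) / np.sqrt(K - 1)

# Homogeneous Gaussian localization matrix (depends only on separation).
idx = np.arange(n)
sep = np.abs(idx[:, None] - idx[None, :])
L = np.exp(-0.5 * (sep / 4.0) ** 2)

# Symmetric square root of L from its eigendecomposition.
w, V = np.linalg.eigh(L)
L_half = (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T


def increment(xi):
    """dx = sum_k e_k o (L^{1/2} xi_k), with xi of shape (K, n)."""
    alpha = xi @ L_half            # row k is L^{1/2} xi_k since L_half is symmetric
    return np.sum(E * alpha, axis=0)


# Writing dx = M xi, the implied covariance M M^T should equal L o (sum_k e_k e_k^T).
M = np.hstack([np.diag(E[k]) @ L_half for k in range(K)])
B_loc = L * (E.T @ E)
print(np.allclose(M @ M.T, B_loc))
```

The control vector here has K times n elements, one field per member, which is why the real system parallelizes naturally over ensemble members.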


Scalability of EnVar

• Computation and I/O must be highly parallel, $\mathbf{e}_k^{4D}$ is large:
– 256 members, 7 times, 4 vars, 800x400x75L ~640GB

• Calculation of analysis increment (and adjoint) currently about 1/2 of overall cost of minimization: $\Delta\mathbf{x}_{4D} = \sum_k \mathbf{e}_k^{4D} \circ \left(\mathbf{S}^{-1}\mathbf{L}_v^{1/2}\mathbf{L}_h^{1/2}\boldsymbol{\xi}_k\right)$

• Currently, Schur product is the most expensive (and simplest) step, but scales perfectly: $\Delta\mathbf{x}_{4D} = \sum_k \mathbf{e}_k^{4D} \circ \boldsymbol{\alpha}_k$

• Spectral transform is next most expensive, involves a global transpose: $\boldsymbol{\alpha}_k = \mathbf{S}^{-1}\mathbf{L}_v^{1/2}\mathbf{L}_h^{1/2}\boldsymbol{\xi}_k$

• Currently only 1D decomposition:
– $\boldsymbol{\xi}_k$: split by ens member (256)
– $\Delta\mathbf{x}$: split by latitude (400)

• Improved scalability requires 2D decomposition (3D transposition strategy) under development
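The ~640GB figure for the 4D ensemble can be reproduced from the dimensions listed above. The storage format is not stated on the slide, so this sketch assumes 4-byte (single-precision) values and reports the size in binary gigabytes.

```python
members, times, variables = 256, 7, 4
ni, nj, nk = 800, 400, 75
bytes_per_value = 4                # assumption: single-precision reals

n_values = members * times * variables * ni * nj * nk
size_gib = n_values * bytes_per_value / 2**30

print(f"{n_values:,} values, about {size_gib:.0f} GiB")
```

With the current 1D decomposition the usable parallelism is capped by the split dimension, 256 members for the control vector and 400 latitudes for the increment, which is the motivation for the 2D decomposition mentioned in the last bullet.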

[Figure: timings in seconds; one labelled component is "Reading ensemble".]
