International Journal of Mathematics and Computer Sciences (IJMCS) ISSN: 2305-7661 Vol.32 Aug 2014 www.scholarism.net, p. 893

Parallel Domain Decomposition Methods with Mixed Order Discretization for Fully Implicit Solution of Tracer Transport Problems on the Cubed-Sphere

Haijian Yang (1), Chao Yang (2) and Xiao-Chuan Cai (3)

(1) College of Mathematics and Econometrics, Hunan University, Hunan, 410082, Changsha, People's Republic of China
(2) Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing, 100190, People's Republic of China
(3) Department of Computer Science, University of Colorado Boulder, Boulder, CO 80309, USA

Haijian Yang, Email: [email protected]
Chao Yang, Email: [email protected]
Xiao-Chuan Cai (Corresponding author), Email: [email protected]

Abstract

In this paper, a fully implicit finite volume Eulerian scheme and a corresponding scalable parallel solver are developed for tracer transport problems on the cubed-sphere. To efficiently solve the large sparse linear system at each time step on parallel computers, we introduce a Schwarz preconditioned Krylov subspace method that uses two discretizations: the higher order method is used for the residual calculation, and the lower order method is used for the construction of the preconditioner. The matrices from the two discretizations have similar sparsity patterns and eigenvalue distributions, but the matrix from the lower order method is much sparser; as a result, excellent scalability results (in total computing time and the number of iterations) are obtained. Even though the Schwarz preconditioner was originally designed for elliptic problems, our experiments indicate clearly that the method scales well for this class of purely hyperbolic problems. In addition, we show numerically that the proposed method exhibits both strong and weak scalability on a supercomputer with thousands of processors.

Keywords

Transport equation · Cubed-sphere · Fully implicit method · Domain decomposition · Parallel scalability

1 Introduction

The tracer transport equation plays a critical role in global atmospheric models [9, 13]. The problem at high resolution is very demanding in terms of computational resources. In order to develop a new generation of climate modeling software and make effective use of supercomputers with a large number of processors, robust and scalable algorithms are necessary that allow the simultaneous use of fine spatial meshes and large time steps, and that maintain fast convergence for a wide range of physical parameters.

There are several schemes designed for the tracer transport problem on the cubed-sphere using explicit methods, such as finite volume methods [5, 19], discontinuous Galerkin (DG) methods [17], and spectral-element methods [37]. These schemes have been shown to be stable and reliable for solving the tracer transport problem. However, because of the explicit nature of the algorithms, there are strict restrictions on the time step size imposed by the Courant-Friedrichs-Lewy (CFL) condition. When very fine meshes are used in the spatial discretization, the time step size has to be very small in order to satisfy this stability condition. To reduce the stability restriction on the time step size, the semi-Lagrangian (SL) method has become increasingly popular for solving the transport equation [7, 8, 11, 14, 29]. In this paper, we introduce and study fully implicit domain decomposition algorithms that are not only robust with respect to the physical parameters but also scalable to a large number of processors. A potential drawback of the fully implicit method is that a large linear or nonlinear system needs to be solved at each time step. To improve the efficiency of a fully implicit solver, domain decomposition based preconditioning algorithms have been successfully applied in several applications [2, 6, 10, 25, 36]. In particular, we have employed domain decomposition based implicit algorithms in


atmospheric modeling for solving both the global shallow water equations [32, 33] and the regional compressible Euler equations [34]. In this work, we extend the methods to the solution of the tracer transport problem for atmospheric flows. The transport equation differs from the shallow-water or Euler equations in that the tracer distribution can be quite non-smooth, which requires a more deliberate design of the numerical discretization that may in turn challenge the solver for the discretized system. We also mention that this class of Schwarz methods is not well understood for purely hyperbolic problems such as the tracer transport problem, because of the lack of the ellipticity that is required by the existing theory [26, 27, 31].

For the discretization, we use a finite volume Eulerian scheme based on the Lax-Friedrichs flux solver together with a second-order spatial reconstruction. One layer of ghost cells is used in the scheme to couple the six patches together and to pass information between the patches. To solve the large linear algebraic system at each time step, a Krylov subspace method preconditioned by restricted additive Schwarz is applied. In order to have a highly scalable (in total computing time) iterative method, two issues have to be addressed: (1) the number of iterations has to be relatively stable when the mesh is refined and/or when the number of processors is increased; (2) the subdomain solver has to be cheap enough. The first issue is addressed by using the Schwarz preconditioner with a sufficiently large overlap. To deal with the second issue, traditional approaches replace the subdomain solve with some kind of incomplete factorization; in our case, we instead replace the second-order discretization with a first-order discretization, which corresponds to a sparser matrix with a similar distribution of eigenvalues. The accuracy of the overall method is not impacted since the change happens only at the preconditioning level. Even though the transport problem is linear, when a limiter is used in the discretization the resulting algebraic system may become nonlinear. We consider such a case in the paper, where the Schwarz preconditioned Krylov solver is replaced by a Newton-Krylov-Schwarz method.

The rest of the paper is organized as follows. In Sect. 2, we present the transport equation and a fully implicit discretization scheme. Section 3 focuses on the details of the domain decomposition algorithm, with special emphasis on tuning the Schwarz preconditioners. Numerical experiments to understand the accuracy and the parallel performance of the proposed methods are provided in Sect. 4. We end the paper with some concluding remarks in Sect. 5.

2 Transport Equation

Consider the following tracer transport equation defined on the sphere [15, 18]:

$$\begin{cases} \dfrac{\partial\phi}{\partial t} + \nabla\cdot(\mathbf{V}\phi) = \phi\,\nabla\cdot\mathbf{V}, & \text{on } S\times(0,T],\\[4pt] \phi|_{t=0} = \phi_0, \end{cases} \qquad (2.1)$$

where ϕ is the diagnostic variable representing the tracer mixing ratio per unit mass, V = (u, v) is the velocity of the flow in the local latitude-longitude coordinates (λ, θ), S is the surface of the sphere, and ϕ0 is a given initial condition.

To discretize (2.1), we employ the cubed-sphere mesh [20–23, 32], which is based on a gnomonic mapping from the six faces of a cube to the surface of the sphere. Figure 1 is a schematic illustration of the relative positions of the six patches and their local connectivity. In Fig. 1, patches one to four are placed along the equator, and patches five and six are centered at the north and south poles, respectively. The mesh on each patch is nonorthogonal and curvilinear due to the gnomonic mapping. The coordinate system for each patch is free of singularities and has the same metric terms. Let (λ, θ) ∈ [−π, π] × [−π/2, π/2] be the longitude-latitude coordinates, and (x, y) ∈ [−π/4, π/4] × [−π/4, π/4] be the curvilinear coordinates on the cubed-sphere. Written in the local curvilinear coordinates, Equation (2.1) has the form:

$$\frac{\partial(\Lambda\phi)}{\partial t} + \left(\frac{\partial}{\partial x}(\Lambda v^1\phi) + \frac{\partial}{\partial y}(\Lambda v^2\phi)\right) = \phi\left(\frac{\partial}{\partial x}(\Lambda v^1) + \frac{\partial}{\partial y}(\Lambda v^2)\right), \qquad (2.2)$$

where $\Lambda = (\sec x\,\sec y)^2 / \sqrt{(1+\tan^2 x+\tan^2 y)^3}$ and $(v^1, v^2)$ are the contravariant components of V. More details about the transformation between the surface of the cube and the sphere can be found in Nair et al. [17].


Fig. 1 Relative positions of the six patches and their local connectivity

S is divided into six identical patches Ω^k (k = 1, 2, …, 6), each covered with a logically square N×N mesh with cell centers (x_i, y_j) and cells

$$\Omega^k_{i,j} = [x_{i-1/2}, x_{i+1/2}] \times [y_{j-1/2}, y_{j+1/2}], \quad i, j = 1, \ldots, N,$$

where the mesh size is h = π/(2N). When using a finite volume method, the cell average of ϕ is denoted as

$$\Phi^k_{i,j} = \frac{1}{h^2\,\Lambda_{i,j}} \int_{y_{j-1/2}}^{y_{j+1/2}} \int_{x_{i-1/2}}^{x_{i+1/2}} \Lambda\phi \,\mathrm{d}x\,\mathrm{d}y. \qquad (2.3)$$

In the following, the superscript k is omitted for simplicity. Denote f(ϕ) = v^1 ϕ and g(ϕ) = v^2 ϕ. After integrating (2.2) over the cell Ω_{i,j} and applying a cell-centered finite volume method, we obtain

$$\frac{\partial\Phi_{i,j}}{\partial t} + \frac{1}{\Lambda_{i,j}h}\big((\Lambda f)_{i+1/2,j} - (\Lambda f)_{i-1/2,j}\big) + \frac{1}{\Lambda_{i,j}h}\big((\Lambda g)_{i,j+1/2} - (\Lambda g)_{i,j-1/2}\big) \approx \frac{1}{h^2\Lambda_{i,j}} \int_{y_{j-1/2}}^{y_{j+1/2}} \int_{x_{i-1/2}}^{x_{i+1/2}} \phi\left(\frac{\partial}{\partial x}(\Lambda v^1) + \frac{\partial}{\partial y}(\Lambda v^2)\right)\mathrm{d}x\,\mathrm{d}y, \qquad (2.4)$$

where (Λf)_{i+1/2,j} is approximated by $\Lambda_{i+1/2,j}\,\tilde f_{i+1/2,j}$ with

$$\tilde f_{i+1/2,j} \approx \frac{1}{h} \int_{y_{j-1/2}}^{y_{j+1/2}} f\big(\phi(x_{i+1/2}, y)\big)\,\mathrm{d}y, \qquad (2.5)$$

and (Λg)_{i,j+1/2}, (Λf)_{i-1/2,j} and (Λg)_{i,j-1/2} are approximated similarly. The right-hand side of (2.4) is evaluated as

$$\left(\frac{1}{\Lambda}\frac{\partial}{\partial x}(\Lambda v^1) + \frac{1}{\Lambda}\frac{\partial}{\partial y}(\Lambda v^2)\right)_{i,j}\Phi_{i,j} \approx \frac{1}{\Lambda_{i,j}}\left(\frac{(\Lambda v^1)_{i+1/2,j} - (\Lambda v^1)_{i-1/2,j}}{h} + \frac{(\Lambda v^2)_{i,j+1/2} - (\Lambda v^2)_{i,j-1/2}}{h}\right)\Phi_{i,j}.$$

A Riemann solver is required to obtain the approximate fluxes in (2.5). In this study, we employ the local Lax-Friedrichs flux formula:

$$\tilde F(\Phi^-, \Phi^+) = \tfrac{1}{2}\big[\big(F(\Phi^-) + F(\Phi^+)\big) - \alpha\,(\Phi^+ - \Phi^-)\big], \qquad (2.6)$$

where α is the maximum absolute value of the normal velocity along each cell boundary, and Φ⁻ and Φ⁺ are the reconstructed states of Φ on the cell boundary. By using (2.6), we obtain

$$\tilde f_{i+1/2,j} = \tfrac{1}{2}\big[\big(f(\Phi^-_{i+1/2,j}) + f(\Phi^+_{i+1/2,j})\big) - \alpha_{i+1/2,j}\,(\Phi^+_{i+1/2,j} - \Phi^-_{i+1/2,j})\big] = \tfrac{1}{2}\big[v^1(x_{i+1/2}, y_j)\,(\Phi^-_{i+1/2,j} + \Phi^+_{i+1/2,j}) - \alpha_{i+1/2,j}\,(\Phi^+_{i+1/2,j} - \Phi^-_{i+1/2,j})\big];$$

the other fluxes are defined in a similar way.
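The flux formula (2.6) is compact enough to sketch directly. The snippet below is illustrative only, with a made-up edge velocity; it also shows that for the linear flux f(ϕ) = v¹ϕ, choosing α = |v¹| reduces (2.6) to plain upwinding.

```python
import numpy as np

def llf_flux(f, phi_minus, phi_plus, alpha):
    """Local Lax-Friedrichs flux (2.6):
    F~(phi-, phi+) = [ (F(phi-) + F(phi+)) - alpha * (phi+ - phi-) ] / 2,
    where alpha is the max |normal velocity| along the cell boundary."""
    return 0.5 * ((f(phi_minus) + f(phi_plus)) - alpha * (phi_plus - phi_minus))

# For the linear tracer flux f(phi) = v1 * phi, alpha = |v1| gives
# pure upwinding; v1 = 0.7 is a hypothetical edge velocity.
v1 = 0.7
flux = llf_flux(lambda p: v1 * p, phi_minus=1.0, phi_plus=0.8, alpha=abs(v1))
# flux == v1 * phi_minus == 0.7 (the upwind side, since v1 > 0)
```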

The purpose of the reconstruction is to estimate Φ on the cell boundary, based on the cell-averaged values of Φ on the neighboring cells, as shown in Fig. 2. For now, we use a piecewise linear reconstruction that does not destroy the linearity of the problem; later, in one of the numerical experiments, we will consider a case in which the linearity is not preserved because a limiter is used in the scheme. In the x-direction, we calculate the reconstructed states by

$$\begin{aligned} \Phi^-_{i-1/2,j} &= \Phi_{i-1,j} + \tfrac{1}{4}(\Phi_{i,j} - \Phi_{i-2,j}), & \Phi^+_{i-1/2,j} &= \Phi_{i,j} - \tfrac{1}{4}(\Phi_{i+1,j} - \Phi_{i-1,j}),\\ \Phi^-_{i+1/2,j} &= \Phi_{i,j} + \tfrac{1}{4}(\Phi_{i+1,j} - \Phi_{i-1,j}), & \Phi^+_{i+1/2,j} &= \Phi_{i+1,j} - \tfrac{1}{4}(\Phi_{i+2,j} - \Phi_{i,j}); \end{aligned}$$

and we can similarly obtain Φ⁻_{i,j-1/2}, Φ⁺_{i,j-1/2}, Φ⁻_{i,j+1/2} and Φ⁺_{i,j+1/2} in the y-direction. The reconstruction scheme is second-order accurate in space.
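The reconstruction stencils above can be sketched in a few vectorized lines; this is a minimal 1-D illustration, not the paper's implementation, restricted to edges that have a full stencil.

```python
import numpy as np

def reconstruct_x(Phi):
    """Second-order reconstructed states at the x-edges i+1/2 that have
    a full stencil (i = 1, ..., N-3), for one row of cell averages Phi:
      Phi-_{i+1/2} = Phi_i     + (Phi_{i+1} - Phi_{i-1}) / 4,
      Phi+_{i+1/2} = Phi_{i+1} - (Phi_{i+2} - Phi_i) / 4."""
    minus = Phi[1:-2] + 0.25 * (Phi[2:-1] - Phi[:-3])
    plus = Phi[2:-1] - 0.25 * (Phi[3:] - Phi[1:-2])
    return minus, plus

# On linear data the reconstruction recovers the edge values exactly:
Phi = np.arange(6, dtype=float)       # cell averages of phi(x) = x
minus, plus = reconstruct_x(Phi)      # both are [1.5, 2.5, 3.5]
```

On smooth data the left and right states nearly coincide, so the dissipative term of (2.6) stays small; at a discontinuity they differ, which is where the limiter mentioned above would act.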


Fig. 2 Reconstructions of Φ on the cell Ω_{i,j} = [x_{i-1/2}, x_{i+1/2}] × [y_{j-1/2}, y_{j+1/2}]

The cubed-sphere mesh provides a nearly uniform mesh and eliminates the mesh singularities that often appear in the latitude-longitude mesh. However, it also gives rise to a new difficulty: artificial boundaries are created between the six patches, and values near patch boundaries must be correctly passed to couple the patches together. In order to solve the transport equation on the six patches as one system, we use one layer of ghost cells along each patch boundary. Information between patches is then passed by setting appropriate boundary conditions on the ghost cells. For example, suppose Γ12 is the boundary between Patches 1 and 2; the reconstruction on the cell boundary {x_{1/2}} × [y_{j-1/2}, y_{j+1/2}] for Patch 2 is given by

$$\Phi^-_{1/2,j} = (\Phi_{N,j})^I + \tfrac{1}{4}\big(\Phi^*_{1,j} - (\Phi_{N-1,j})^I\big),$$

where (Φ_{N,j})^I and (Φ_{N-1,j})^I are the cell-averaged values of Φ in Patch 1, and Φ*_{1,j} is an interpolated value on Patch 2. Analogously,

$$\Phi^+_{1/2,j} = \Phi_{1,j} - \tfrac{1}{4}\big(\Phi_{2,j} - (\Phi_{N,j})^{*,I}\big),$$

where Φ_{1,j} and Φ_{2,j} are the cell-averaged values of Φ in Patch 2, and (Φ_{N,j})^{*,I} is an interpolated value on Patch 1. The interpolations used to calculate Φ*_{1,j} on Patch 2 and (Φ_{N,j})^{*,I} on Patch 1 depend only on the geometric position of the mesh cell. For example, let $\bar\Phi_{1,j}$ be the ghost point belonging to Patch 1 that lies outward of Patch 2; then the values Φ*_{1,j} on Patch 2 are calculated by the linear interpolation

$$\Phi^*_{1,j} = \begin{cases} \eta_j\,\Phi_{1,j} + (1-\eta_j)\,\Phi_{1,j+1}, & j < [N/2]+1,\\ \eta_j\,\Phi_{1,j} + (1-\eta_j)\,\Phi_{1,j-1}, & \text{otherwise}, \end{cases}$$

where η_j is the linear interpolation coefficient defined by

$$\eta_j = \begin{cases} r(\Phi_{1,j+1}, \bar\Phi_{1,j}) \,/\, r(\Phi_{1,j-1}, \Phi_{1,j+1}), & j < [N/2]+1,\\ r(\Phi_{1,j-1}, \bar\Phi_{1,j}) \,/\, r(\Phi_{1,j-1}, \Phi_{1,j+1}), & \text{otherwise}, \end{cases}$$

with r(⋅,⋅) being the great-circle distance between the two points.
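The distance r(⋅,⋅) used in the interpolation coefficients is the standard great-circle distance; a minimal sketch (on the unit sphere, using the spherical law of cosines, which is one of several equivalent formulas) is:

```python
import numpy as np

def great_circle(p, q):
    """Great-circle distance r(., .) between two points p = (lam1, th1)
    and q = (lam2, th2) on the unit sphere, angles in radians."""
    lam1, th1 = p
    lam2, th2 = q
    # spherical law of cosines, clipped to guard against roundoff
    c = np.sin(th1) * np.sin(th2) + np.cos(th1) * np.cos(th2) * np.cos(lam1 - lam2)
    return np.arccos(np.clip(c, -1.0, 1.0))

# a quarter of a great circle: equator to the north pole
r = great_circle((0.0, 0.0), (0.0, np.pi / 2))   # pi / 2
```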

After spatially discretizing (2.2), we have the semi-discrete system

$$\frac{\partial\Phi_{i,j}}{\partial t} + L(\Phi_{i,j}) = 0, \qquad (2.7)$$

where L(Φ_{i,j}) is the linear operator

$$L(\Phi_{i,j}) = \frac{1}{h}\left(\frac{\Lambda_{i+1/2,j}}{\Lambda_{i,j}}\,\tilde f_{i+1/2,j} - \frac{\Lambda_{i-1/2,j}}{\Lambda_{i,j}}\,\tilde f_{i-1/2,j}\right) + \frac{1}{h}\left(\frac{\Lambda_{i,j+1/2}}{\Lambda_{i,j}}\,\tilde g_{i,j+1/2} - \frac{\Lambda_{i,j-1/2}}{\Lambda_{i,j}}\,\tilde g_{i,j-1/2}\right) - \frac{1}{\Lambda_{i,j}}\left(\frac{(\Lambda v^1)_{i+1/2,j} - (\Lambda v^1)_{i-1/2,j}}{h} + \frac{(\Lambda v^2)_{i,j+1/2} - (\Lambda v^2)_{i,j-1/2}}{h}\right)\Phi_{i,j}.$$

For comparison purposes, we implement both implicit and explicit methods for the temporal integration of (2.7). For the implicit method, we use the second-order backward differentiation formula (BDF-2), which reads

$$\frac{1}{2\Delta t}\big(3\Phi^{(m)}_{i,j} - 4\Phi^{(m-1)}_{i,j} + \Phi^{(m-2)}_{i,j}\big) + L(\Phi^{(m)}_{i,j}) = 0, \qquad (2.8)$$

where Φ^{(m)}_{i,j} is the evaluation of Φ_{i,j} at the mth time step with a uniform time step size Δt. Only at the first time step, a first-order backward Euler (BDF-1) method is used. For the explicit method, we use the second-order Strong Stability Preserving Runge-Kutta (SSP RK-2) method

$$\begin{cases} \bar\Phi^{(m)}_{i,j} = \Phi^{(m-1)}_{i,j} - \Delta t\, L(\Phi^{(m-1)}_{i,j}),\\[4pt] \Phi^{(m)}_{i,j} = \tfrac{1}{2}\big(\Phi^{(m-1)}_{i,j} + \bar\Phi^{(m)}_{i,j}\big) - \tfrac{\Delta t}{2}\, L(\bar\Phi^{(m)}_{i,j}). \end{cases} \qquad (2.9)$$
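The two time integrators can be sketched side by side for a generic linear operator L given as a matrix. This is a minimal illustration only: a dense direct solve stands in for the preconditioned GMRES solver that the paper actually uses for the implicit step.

```python
import numpy as np

def bdf2_step(L, phi_prev, phi_prev2, dt):
    """One BDF-2 step (2.8) for d(phi)/dt + L phi = 0:
    rearranged, (3/(2 dt) I + L) phi^m = (4 phi^{m-1} - phi^{m-2}) / (2 dt)."""
    n = L.shape[0]
    A = 1.5 / dt * np.eye(n) + L
    rhs = (4.0 * phi_prev - phi_prev2) / (2.0 * dt)
    return np.linalg.solve(A, rhs)

def ssp_rk2_step(L, phi_prev, dt):
    """One explicit SSP RK-2 step (2.9) for the same system."""
    phi_bar = phi_prev - dt * (L @ phi_prev)
    return 0.5 * (phi_prev + phi_bar) - 0.5 * dt * (L @ phi_bar)
```

The implicit step requires a linear solve but places no CFL restriction on dt, while the explicit step costs only two operator applications but must keep dt small, which is exactly the trade-off discussed in the introduction.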

3 Fully Implicit Domain Decomposition Methods

After the discretization of (2.2) in space and time, we obtain a linear system for each time step. In this study we propose an additive Schwarz right-preconditioned Generalized Minimal RESidual (GMRES) method for solving the system,

$$A M^{-1} X = b, \qquad X = M\Phi, \qquad (3.1)$$

where M⁻¹ is the preconditioner [24]. To solve (3.1) at time step m, we first set the initial guess Φ₀ equal to the solution of the previous time step, Φ^{(m-1)}; only at the first time step do we choose the initial condition as the initial guess. The next approximate solution is then obtained by using the right-preconditioned GMRES method with a restart value of 30 until the residual satisfies

$$\|A M^{-1} X_n - b\| \le \eta_r, \qquad n = 0, 1, \ldots,$$

where η_r = 10⁻⁵ is the tolerance, and then Φ^{(m)} = M⁻¹ X_n.
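The solver loop of (3.1) can be illustrated with SciPy. Everything below is a stand-in sketch, not the paper's solver: a hypothetical 1-D first-order upwind matrix replaces the cubed-sphere operator, and an ILU factorization replaces the Schwarz preconditioner; only the preconditioned restarted GMRES structure is the point.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# hypothetical 1-D upwind transport matrix: x_i - x_{i-1} = b_i
n = 200
A = sp.diags([np.ones(n), -np.ones(n - 1)], [0, -1], format="csc")
b = np.ones(n)

ilu = spla.spilu(A, drop_tol=1e-4)                  # cheap approximate solve
M = spla.LinearOperator((n, n), matvec=ilu.solve)   # M ~ A^{-1}

x, info = spla.gmres(A, b, M=M, restart=30)
# info == 0 signals convergence; the exact solution here is x_i = i + 1
```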

To define the one-level restricted additive Schwarz preconditioner, each patch Ω^k is decomposed into p non-overlapping subdomains Ω^k_i, where p is the number of processors per patch and each subdomain corresponds to one processor. Hence, the number of processors for the whole domain Ω is 6p. In order to obtain overlapping subdomains, we extend each subdomain Ω^k_i with δ layers of mesh cells to a larger subdomain Ω^k_{i,δ} that overlaps with its neighbors. Any subdomain boundary that coincides with a patch boundary is extended into the neighboring patch(es).

Note that the total number of unknowns is 6N². Let N_i be the number of unknowns in Ω^k_{i,δ}, and let the restriction operator R^k_{i,δ} be an N_i × (6N²) matrix that maps a vector defined on the entire domain to a shorter vector defined on the subdomain Ω^k_{i,δ} by discarding all components corresponding to mesh cells outside Ω^k_{i,δ}. Similarly, R^k_{i,0} is also an N_i × (6N²) matrix, with the difference that its application to a (6N²) × 1 vector zeroes all components corresponding to mesh cells outside Ω^k_i. The subdomain N_i × N_i matrix is defined as

$$A^k_i = R^k_{i,\delta}\, A\, (R^k_{i,\delta})^T. \qquad (3.2)$$

The one-level restricted additive Schwarz (RAS) preconditioner for A is defined as [4, 26, 27]

$$M^{-1}_{RAS} = \sum_{k=1}^{6} \sum_{i=1}^{p} (R^k_{i,0})^T (A^k_i)^{-1} R^k_{i,\delta}. \qquad (3.3)$$

The matrix-vector multiplication with (A^k_i)⁻¹ is either calculated exactly by an LU factorization or obtained approximately by an ILU factorization. More details are discussed in the numerical experiments section.

The effectiveness of the Schwarz preconditioner relies on its ability to mimic the spectrum of the linear operator while at the same time being relatively cheap to apply. In the RAS preconditioner (3.3), we construct the subdomain matrices directly from the matrix A. This is effective in terms of the number of GMRES iterations, as shown in Sect. 4.3, but the computing time is not as good as we would like because the subdomain matrices are denser when a higher order discretization is used. In order to lower the cost of the subdomain solves without losing the effectiveness of the preconditioner, we build the subdomain matrices with a first-order spatial discretization, while the second-order scheme is still used to build the matrix A in (3.1). This idea is based on the fact that the matrices arising from the first-order scheme and the second-order scheme originate from the same transport equation and thus have similar eigenvalue distributions. Let $\tilde A$ denote the matrix from the first-order discretization and $\tilde A^k_i$ its restriction to the overlapping subdomain Ω^k_{i,δ}. Then the new RAS preconditioner is defined by

$$\tilde M^{-1}_{RAS} = \sum_{k=1}^{6} \sum_{i=1}^{p} (R^k_{i,0})^T (\tilde A^k_i)^{-1} R^k_{i,\delta}, \qquad (3.4)$$

where the matrix-vector multiplication with $(\tilde A^k_i)^{-1}$ is obtained by an LU or an ILU factorization. In the first-order discretization, the reconstructed states are given by

$$\Phi^-_{i-1/2,j} = \Phi_{i-1,j}, \quad \Phi^+_{i-1/2,j} = \Phi_{i,j}, \quad \Phi^-_{i+1/2,j} = \Phi_{i,j}, \quad \Phi^+_{i+1/2,j} = \Phi_{i+1,j}$$

in the x-direction, and similarly in the y-direction. As an example, in Figs. 3 and 4 we show the sparsity patterns and eigenvalue distributions of the matrices based on the first-order and second-order discretizations of a typical transport problem on a relatively coarse mesh. The sparsity patterns are similar and the eigenvalue distributions are also similar, but the number of nonzeros of the first-order matrix is much smaller than that of the second-order matrix.
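The action of the mixed-order RAS preconditioner can be sketched in serial with index sets standing in for the restriction operators. This is a toy illustration under simplifying assumptions (dense algebra, exact subdomain solves, a made-up tridiagonal matrix in place of the first-order transport matrix), not the parallel implementation of the paper.

```python
import numpy as np

def ras_apply(A_lo, x, sub_ovl, sub_int):
    """Apply a one-level RAS preconditioner in the spirit of (3.4):
    subdomain matrices are extracted from the lower-order matrix A_lo,
    solved exactly, and glued back restrictedly, i.e. each subdomain
    contributes only on its non-overlapping part."""
    y = np.zeros_like(x)
    for ovl, interior in zip(sub_ovl, sub_int):
        A_i = A_lo[np.ix_(ovl, ovl)]        # subdomain matrix as in (3.2)
        z = np.linalg.solve(A_i, x[ovl])    # exact subdomain solve (LU)
        keep = np.isin(ovl, interior)       # restricted prolongation
        y[ovl[keep]] = z[keep]
    return y

# illustrative 1-D usage: two subdomains with one layer of overlap
n = 6
A_lo = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
sub_ovl = [np.arange(0, 4), np.arange(2, 6)]    # extended by delta = 1
sub_int = [np.arange(0, 3), np.arange(3, 6)]    # non-overlapping partition
y = ras_apply(A_lo, np.ones(n), sub_ovl, sub_int)
```

The mixed-order trick of the paper enters only through the choice of `A_lo`: the Krylov iteration multiplies with the second-order matrix, while this routine factors the sparser first-order one.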


Fig. 3 The sparsity patterns of the matrices with the second (left panel) and the first (right panel) order discretizations. Here the mesh is 6×6×6. "nnz" is the number of nonzero elements

Fig. 4 The eigenvalue distributions of the matrices with the second (left panel) and the first (right panel) order discretizations. Here the mesh is 6×6×6. "Asterisk" represents the eigenvalues of the matrix

4 Numerical Experiments

In this section, we test the proposed implicit method with a variety of test cases. The purposes of the tests include: (1) the verification of the numerical order of convergence and the effective resolution; (2) the preservation of the shape of "rough" distributions; (3) mixing diagnostics for two nonlinearly related tracers; and (4) the parallel performance of the method.

4.1 The Test Cases

We consider four different initial scalar fields for ϕ₀: a smooth scalar field, two quasi-smooth scalar fields, and a non-smooth scalar field [12, 16, 18]. The velocity fields are given as either non-divergent or divergent flows. Let (λ_i, θ_i), i = 1, 2, be the centers of the initial distributions, let (X, Y, Z) = (cos θ cos λ, cos θ sin λ, sin θ), and let (X_i, Y_i, Z_i) = (cos θ_i cos λ_i, cos θ_i sin λ_i, sin θ_i). The smooth initial scalar field is the Gaussian hills defined by

$$\phi_0(\lambda, \theta) = h_1(\lambda, \theta) + h_2(\lambda, \theta), \qquad (4.1)$$

where h_i = exp{−5((X − X_i)² + (Y − Y_i)² + (Z − Z_i)²)}. Two scalar fields are used for the quasi-smooth case: the cosine bells and the "correlated" cosine bells. The cosine bells (a C¹ function) are defined as


$$\phi_0(\lambda, \theta) = \begin{cases} 0.1 + 0.9\,h_i(\lambda, \theta), & \text{if } r_i < r \text{ for } i = 1, 2,\\ 0.1, & \text{otherwise}, \end{cases} \qquad (4.2)$$

where r = 1/2 is the base radius of the bells and $h_i(\lambda, \theta) = \tfrac{1}{2}\big(1 + \cos(2\pi r_i)\big)$ if r_i < r, with r_i being the great-circle distance between (λ, θ) and (λ_i, θ_i). The correlated cosine bells are given by

$$\phi^* = \psi(\phi_0), \qquad (4.3)$$

where ϕ₀ is the cosine-bells field defined in (4.2) and the nonlinear functional relation ψ is given by

$$\psi(\chi) = a_\psi\,\chi^2 + b_\psi \qquad (4.4)$$

with a_ψ = −0.8 and b_ψ = 0.9.

For the non-smooth case, the initial condition is the slotted cylinders defined by

$$\phi_0(\lambda, \theta) = \begin{cases} 1, & \text{if } r_i \le r \text{ and } |\lambda - \lambda_i| \ge r/6 \text{ for } i = 1, 2,\\ 1, & \text{if } r_1 \le r \text{ and } |\lambda - \lambda_1| < r/6 \text{ and } \theta - \theta_1 < -5r/12,\\ 1, & \text{if } r_2 \le r \text{ and } |\lambda - \lambda_2| < r/6 \text{ and } \theta - \theta_2 > 5r/12,\\ 0.1, & \text{otherwise}, \end{cases} \qquad (4.5)$$

where r = 1/2. In the numerical tests, we employ two different types of deformational wind fields [18]. The first velocity field is a non-divergent flow

$$\begin{cases} u(\lambda, \theta, t) = k \sin^2(\lambda)\,\sin(2\theta)\,\cos(\pi t/T),\\[2pt] v(\lambda, \theta, t) = k \sin(2\lambda)\,\cos(\theta)\,\cos(\pi t/T). \end{cases} \qquad (4.6)$$

The second velocity field is a divergent flow defined by

$$\begin{cases} u(\lambda, \theta, t) = -k \sin^2(\lambda/2)\,\sin(2\theta)\,\cos^2(\theta)\,\cos(\pi t/T),\\[2pt] v(\lambda, \theta, t) = \tfrac{k}{2} \sin(\lambda)\,\cos^3(\theta)\,\cos(\pi t/T). \end{cases} \qquad (4.7)$$

The components of the velocity vector for the zonal background flow are given by

$$\begin{cases} u(\lambda, \theta, t) = k \sin^2(\bar\lambda)\,\sin(2\theta)\,\cos(\pi t/T) + \dfrac{2\pi\cos(\theta)}{T},\\[4pt] v(\lambda, \theta, t) = k \sin(2\bar\lambda)\,\cos(\theta)\,\cos(\pi t/T), \end{cases} \qquad (4.8)$$

where $\bar\lambda = \lambda - 2\pi t/T$. This wind field is non-divergent but highly deformational [18]. In the experiments, the following combinations of the initial conditions and velocity fields are used:

Case-1: Gaussian hills (4.1) and non-divergent flow (4.6);
Case-2: Cosine bells (4.2) and divergent flow (4.7);
Case-3: Slotted cylinders (4.5) and zonal background flow (4.8);
Case-4: Cosine bells (4.2) and zonal background flow (4.8);
Case-5: Correlated cosine bells (4.3) and zonal background flow (4.8).

We set the duration of integration to T = 5 time units. The parameter k and the centers of the initial distributions (λ_i, θ_i), i = 1, 2, are chosen to make the test cases challenging:

for the test cases with the non-divergent flow: k = 2, (λ₁, θ₁) = (5π/6, 0) and (λ₂, θ₂) = (−5π/6, 0);
for the test cases with the divergent flow: k = 1, (λ₁, θ₁) = (3π/4, 0) and (λ₂, θ₂) = (−3π/4, 0).
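The smooth initial field (4.1) with the parameters above is simple to evaluate; a minimal sketch, with the non-divergent-flow centers plugged in, is:

```python
import numpy as np

def gaussian_hills(lam, th, centers):
    """Gaussian-hills initial condition (4.1):
    phi0 = sum_i exp(-5 ((X-Xi)^2 + (Y-Yi)^2 + (Z-Zi)^2)),
    with (X, Y, Z) the Cartesian image of (lam, th) on the unit sphere."""
    X, Y, Z = np.cos(th) * np.cos(lam), np.cos(th) * np.sin(lam), np.sin(th)
    phi0 = np.zeros_like(np.asarray(lam, dtype=float))
    for lam_i, th_i in centers:
        Xi = np.cos(th_i) * np.cos(lam_i)
        Yi = np.cos(th_i) * np.sin(lam_i)
        Zi = np.sin(th_i)
        phi0 = phi0 + np.exp(-5.0 * ((X - Xi) ** 2 + (Y - Yi) ** 2 + (Z - Zi) ** 2))
    return phi0

# centers for the non-divergent tests: (5 pi/6, 0) and (-5 pi/6, 0)
centers = [(5 * np.pi / 6, 0.0), (-5 * np.pi / 6, 0.0)]
v = gaussian_hills(5 * np.pi / 6, 0.0, centers)
# at a hill center: 1 from that hill plus exp(-5) from the other
```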

Case-1, Case-2, and Case-3 were first proposed in Nair et al. [18] to validate global transport schemes. Case-4 and Case-5 are utilized to evaluate schemes using interrelated tracers, scatter plots, and numerical mixing diagnostics in Lauritzen et al. [15]. These test cases are designed so that the flow reverses its course at half-time t = T/2 and the scalar field returns to its initial position and shape at the end of the simulation; that is, the final solution ϕ_T is identical to the initial condition ϕ₀.

4.2 Correctness and Accuracy

In the tests, we compute certain errors to assess the order of convergence of the proposed scheme. We define the following measurements [18, 29, 30]:

$$l_1 = \frac{I(|\phi_0 - \phi_T|)}{I(|\phi_0|)}, \quad l_2 = \left(\frac{I\big((\phi_0 - \phi_T)^2\big)}{I\big((\phi_0)^2\big)}\right)^{1/2}, \quad l_\infty = \frac{\max|\phi_0 - \phi_T|}{\max|\phi_0|};$$

$$\phi_{\max} = \frac{\max\phi_T - \max\phi_0}{\max\phi_0 - \min\phi_0}, \quad \phi_{\min} = \frac{\min\phi_T - \min\phi_0}{\max\phi_0 - \min\phi_0}.$$

All functions are evaluated at the mesh points, and the integral I(ϕ) is calculated as the discrete summation over all cell centers, $\sum_{k=1}^{6}\sum_{i=1}^{N}\sum_{j=1}^{N} \Lambda^k_{i,j}\,\Phi^k_{i,j}$.

Following [12, 15, 29], we estimate the numerical convergence rates K₁, K₂ and K_∞ for l₁, l₂ and l_∞, respectively, by using a least-squares linear regression

$$\log(l_i) = A_i + K_i\,\log(\Delta\lambda), \quad i = 1, 2, \infty,$$


where Δλ is the average mesh-spacing in degrees and A_i (i = 1, 2, ∞) are constants defined in Lauritzen et al. [15] and Harris et al. [8]. For the transport equation, the maximum CFL number can be defined by

$$\mathrm{CFL} = \frac{\Delta t\, U_{\max}}{\Delta\lambda\,(\pi/180^\circ)}, \qquad (4.9)$$

where U_max is the maximum wind speed [12, 15].
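The error measures and the rate regression above translate directly into a few lines of numpy; this is a generic sketch on flattened arrays with synthetic second-order data, not tied to the paper's meshes.

```python
import numpy as np

def error_measures(phi0, phiT, w):
    """The error measures of Sect. 4.2; w holds the cell weights
    Lambda^k_{i,j}, so I(phi) = sum(w * phi) is the discrete integral."""
    I = lambda f: np.sum(w * f)
    l1 = I(np.abs(phi0 - phiT)) / I(np.abs(phi0))
    l2 = np.sqrt(I((phi0 - phiT) ** 2) / I(phi0 ** 2))
    linf = np.max(np.abs(phi0 - phiT)) / np.max(np.abs(phi0))
    rng = phi0.max() - phi0.min()
    phi_max = (phiT.max() - phi0.max()) / rng
    phi_min = (phiT.min() - phi0.min()) / rng
    return l1, l2, linf, phi_max, phi_min

# least-squares convergence rate: slope of log(error) vs log(spacing);
# synthetic errors proportional to dlam^2 give a rate K = 2
dlam = np.array([1.0, 0.5, 0.25])
errs = 1e-2 * dlam ** 2
K = np.polyfit(np.log(dlam), np.log(errs), 1)[0]
```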

Since the Gaussian-hills initial condition in Case-1 is infinitely smooth, we use it to assess the order of accuracy of the implicit scheme for the non-divergent flow. In the test, we use the meshes N = 16, 32, 64, 128, 256, 512 (correspondingly, Δλ = 90°/16, 90°/32, and so on). Table 1 and Figure 5 show the errors in space. We see that the order of accuracy approaches the third-order line as the mesh becomes finer, because the solution of this test case is smooth and compact. Figure 6 shows the initial condition and the numerical solutions at t = T/2 and t = T obtained with the implicit method. The final solution is in good agreement with the initial condition.

Table 1 Results for Case-1 by using the implicit method with different meshes

Mesh         l1        l2        l∞        ϕmax       ϕmin
16×16×6      3.90E−1   3.28E−1   3.73E−1   −3.66E−1   −5.20E−2
32×32×6      1.40E−1   1.37E−1   1.81E−1   −1.67E−1   −2.35E−2
64×64×6      2.94E−2   3.31E−2   5.27E−2   −4.41E−2   −1.88E−3
128×128×6    4.29E−3   5.18E−3   9.07E−3   −7.39E−3   0
256×256×6    5.60E−4   6.89E−4   1.23E−3   −1.01E−3   0
512×512×6    8.10E−5   9.92E−5   1.72E−4   −1.49E−4   0

The time step size is fixed to Δt = T/5,000

Fig. 5 Convergence plots for the l1, l2 and l∞ errors as the mesh is refined for Case-1. The problem is solved with Δt = T/5,000 by using the implicit method. Legends "2nd-order" and "3rd-order" represent the second and third order convergence rates in space


Fig. 6 Contour plots of Case-1. The problem is solved on a 512×512×6 mesh by the implicit method with Δt = T/5,000. The upper left panel is the initial scalar field ϕ₀, the upper right panel is the numerical solution at t = T/2, and the lower panel is the numerical solution at t = T

We next compare the fully implicit method with the explicit method (2.9). Table 2 and Fig. 7 show the

computed errors in time. The CFL condition causes the time step size of the explicit method to be small when

solving the problem on fine meshes, as shown in Table 2. On the other hand, the implicit method allows the

time step size to be independent of the mesh resolution. From Table 2 we see that explicit and implicit methods

have similar numerical accuracy as △t=T/104 for the explicit method and △t=T/500for the implicit method. It is

observed from Fig. 8 that the performance of the implicit method is better than that of the explicit method in

terms of the total computing time with up to 3072 processors. We remark that the comparison of the explicit and

the implicit methods is based on the same spatial discretization. The performance of the explicit scheme may be

improved by using a semi-Lagrangian finite volume method. But such a comparison is beyond the scope of this

study.
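The CFL numbers quoted in Table 2 are consistent with the usual definition CFL = Umax·△t/h. A minimal check (the panel grid spacing h = (π/2)/N on the unit sphere and the test-suite period T = 5 are assumptions of this sketch, not stated in this section):

```python
import math

def cfl_number(u_max, dt, n):
    # CFL = u_max * dt / h for an N x N cubed-sphere panel, assuming
    # angular grid spacing h = (pi/2)/N on the unit sphere.
    h = (math.pi / 2.0) / n
    return u_max * dt / h

T = 5.0  # assumed test-suite period
cfl = cfl_number(2.32, T / 10.0, 1024)  # roughly 7.56e+2, cf. the first row of Table 2
```

Halving △t halves the CFL number, which matches the linear progression of the CFL column in Table 2.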

Fig. 7

Convergence plots for the l1, l2 and l∞ errors of Case-1. The problem is solved on a 1,024×1,024×6 mesh by

using the implicit method. Legends “1st-order” and “2nd-order” represent the first and second order

convergence rates in time


Fig. 8

Comparison between the implicit method and the explicit method for Case-1 in terms of the computing time

with different numbers of processors. The tests are performed on a 1,024×1,024×6 mesh with Δt=T/500 for the

implicit method and with △t=T/10⁴ for the explicit method, respectively

Table 2

Results for Case-1 by using explicit and implicit methods with different time step sizes

△t CFL l1 l2 l∞ ϕmax ϕmin

Implicit method

T/10 7.56E+2 8.29E−1 6.13E−1 6.82E−1 −4.79E−1 −4.79E−4

T/50 1.51E+2 2.54E−1 2.35E−1 2.96E−1 −2.36E−1 −1.13E−2

T/100 7.56E+1 9.37E−2 9.65E−2 1.31E−2 −9.67E−2 −5.70E−3

T/500 1.51E+1 2.82E−3 3.23E−3 5.00E−3 −3.38E−3 0

T/1,000 7.56E+0 5.94E−4 6.61E−4 1.00E−3 −7.24E−4 0

T/2,000 3.78E+0 1.38E−4 1.49E−4 2.22E−4 −1.73E−4 0

T/5,000 1.51E+0 2.58E−5 2.86E−5 4.65E−5 −3.91E−5 0

Explicit method

T/5,000 1.51E+0 – – – – –

T/10⁴ 7.56E−1 1.86E−3 1.79E−3 1.76E−3 −1.36E−5 0

T/(2×104) 3.78E−1 9.35E−4 8.98E−4 8.84E−4 −1.38E−5 0

T/(5×104) 1.51E−1 3.74E−4 3.60E−4 3.59E−4 −1.47E−5 0


T/10⁵ 7.56E−2 1.88E−4 1.81E−4 1.84E−4 −1.52E−5 0

The mesh is 1,024×1,024×6. “–” denotes no convergence. In (4.9), Umax=2.32 for this particular test case

We then study Case-2, where the cosine bells initial condition and the divergent wind are used. The flow is more

complex compared to the non-divergent case. In Lauritzen and Skamarock [12], Lauritzen et al. [15], and White and

Dongarra [29], “effective resolution” is defined to assess the absolute error and the rate of convergence. In our

implicit simulation we define the effective resolution to be the resolution at which the l2 error is approximately 0.033.

Table 3 and Fig. 9 show the computed errors. The middle curve of Fig. 9 shows the

effective resolution as the intersection between the convergence curve of l2 and the line l2=0.033. As

shown in Fig. 9, the effective resolution for the implicit method is about 90∘/64≈1.4062∘ when

using △t=T/5,000. Figure 10 shows the initial condition and solutions at t=T/2 and t=T by using the implicit

method.
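The intersection defining the effective resolution can be computed numerically from the tabulated l2 errors; a sketch using linear interpolation in log-log space (the interpolation choice is an assumption of this illustration):

```python
import math

def effective_resolution(h_list, err_list, target=0.033):
    # Walk the (mesh size, l2 error) pairs and intersect the log-log
    # convergence curve with err = target between the bracketing meshes.
    for (h0, e0), (h1, e1) in zip(zip(h_list, err_list),
                                  zip(h_list[1:], err_list[1:])):
        if (e0 - target) * (e1 - target) <= 0.0:  # bracket found
            t = (math.log(target) - math.log(e0)) / (math.log(e1) - math.log(e0))
            return math.exp(math.log(h0) + t * (math.log(h1) - math.log(h0)))
    return None  # target error not reached on these meshes
```

Feeding in the l2 column of Table 3 with h = 90°/N places the intersection between the 64 and 128 meshes, consistent with the quoted value of about 90∘/64.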

Fig. 9

Convergence plots for the l1, l2 and l∞ errors as the mesh is refined for Case-2. The problem is solved

with △t=T/5,000 by using the implicit method. Legends “2nd-order” and “3rd-order” represent the second and

third order convergence rates in space. The blue line is l2=0.033, which is used to define “effective resolution”

Fig. 10

Contour plots of Case-2. The problem is solved on a 512×512×6 mesh by the implicit method with △t=T/5,000.

The upper left panel is the initial scalar field ϕ0, the upper right panel is the numerical solution at t=T/2, and

the lower panel is the numerical solution at t=T

Table 3

Results for Case-2 by using the implicit method with different meshes


Mesh l1 l2 l∞ ϕmax ϕmin

16×16×6 1.66E−1 3.25E−1 3.97E−1 −3.68E−1 0

32×32×6 7.40E−2 1.52E−1 1.68E−1 −1.63E−1 0

64×64×6 1.90E−2 4.02E−2 6.00E−2 −3.56E−2 0

128×128×6 3.56E−3 8.32E−3 1.98E−2 −5.19E−3 0

256×256×6 6.33E−4 1.84E−3 5.89E−3 −6.73E−4 0

512×512×6 1.15E−4 4.59E−4 1.82E−3 −8.60E−5 0

The time step size is fixed to △t=T/5,000

In Case-3, the slotted-cylinders initial condition and the zonal background flow are used. The non-smooth initial

condition is used to challenge the proposed scheme. We use the implicit method to obtain the numerical solution

with △t=T/5,000 on a 512×512×6 mesh. Figure 11 shows contour plots of the initial condition and the numerical solutions

at t=T/2 and t=T, using a contour interval of 0.05. The errors

are l1=8.67E−2, l2=1.93E−1, l∞=8.48E−1, ϕmax=1.43E−1, and ϕmin=−4.25E−2.

Fig. 11

Contour plots of Case-3. The problem is solved on a 512×512×6 mesh by the implicit method with △t=T/5,000.

The upper left panel is the initial scalar field ϕ0, the upper right panel is the numerical solution at t=T/2, and

the lower panel is the numerical solution at t=T

In Lauritzen and Thuburn [16], in order to explore the mixing characteristics of a transport scheme, mixing

diagnostics are defined to quantify the numerical mixing in terms of the normalized distance between the

pre-existing functional curve and the scatter points. The diagnostics are based on the highly deformational

analytical flow field (4.8) and two nonlinearly related tracers: the cosine bells and the correlated cosine bells

(i.e., Case-4 and Case-5). The mixing ratios of the two tracers are referred to as the cosine bells

condition ϕ and the correlated cosine bells condition ϕ∗.

In general, a plot of the scatter points (ϕ,ϕ∗) follows the pre-existing functional relation

curve ψ defined in (4.4). Based on the distance between this curve and the scatter points,

three categories of deviation are defined [16]: “real” mixing, “range-preserving” unmixing and

overshooting. The three diagnostics that quantitatively account for the numerical mixing resembling these

deviations are referred to as lr, lu and lo, respectively. More details about the definitions of

lr, lu and lo can be found in Lauritzen et al. [15], Lauritzen and Thuburn [16] and the references therein. In


the tests, we compute the mixing diagnostics (lr,lu,lo) at t=T/2 at resolutions N=64,128,256,512, as shown in

Table 4 and Fig. 12. Figure 13 shows contour plots of the initial condition and the numerical solution at t=T/2. In

Fig. 12, “real mixing” denotes the area where points are below the curve but within the triangle, “range-

preserving unmixing” denotes the area where points are above the curve but within the triangle, and

“overshooting” denotes the area where points are outside the triangle [29].
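The point classification described above can be sketched as follows; the triangle vertices and the curve ψ come from the construction in [16] and are passed in as data here, so this is an illustrative stand-in rather than the paper's implementation:

```python
import numpy as np

def in_triangle(p, a, b, c):
    # Barycentric point-in-triangle test for the region described in the text.
    a, b, c = (np.asarray(v, dtype=float) for v in (a, b, c))
    m = np.column_stack((b - a, c - a))
    s, t = np.linalg.solve(m, np.asarray(p, dtype=float) - a)
    return s >= 0.0 and t >= 0.0 and s + t <= 1.0

def classify(phi, phi_star, psi, tri):
    # "real mixing": below the curve psi but inside the triangle;
    # "range-preserving unmixing": above the curve, inside the triangle;
    # "overshooting": outside the triangle.
    if not in_triangle((phi, phi_star), *tri):
        return "overshooting"
    if phi_star < psi(phi):
        return "real mixing"
    return "range-preserving unmixing"
```

The diagnostics lr, lu and lo then accumulate normalized distances from the curve over the points falling into each category; their precise weighting is given in [16] and is not repeated here.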

Table 4

Diagnostics for the real mixing (lr), the range-preserving unmixing (lu), and the overshooting (lo) at t=T/2 by

using the implicit method with different meshes

Mesh lr lu lo

64×64×6 4.30E−3 9.94E−4 2.80E−3

128×128×6 1.80E−3 2.27E−4 1.40E−3

256×256×6 5.67E−4 2.49E−4 4.84E−4

512×512×6 9.61E−5 6.28E−5 6.60E−5

The time step size is fixed to △t=T/5,000


Fig. 12

Scatter plots at t=T/2 with the implicit method for two nonlinearly correlated tracers based on the cosine-bells

initial conditions. The horizontal axis denotes the value of the numerical solution ϕ for the cosine-bells initial

condition and the vertical axis denotes the value of the numerical solution ϕ∗ for the correlated cosine bells

condition at t=T/2


Fig. 13

Contour plots of Case-4 and Case-5. The problem is solved on a 512×512×6 mesh by the implicit method

with △t=T/5,000. The upper left panel is the initial scalar field ϕ0 for Case-5, the upper right panel is the

numerical solution at t=T/2 for Case-4, and the lower panel is the numerical solution at t=T/2 for Case-5. The

initial field for Case-4 is the same as that of Case-2; see Fig. 10

It is observed from Figs. 10 and 11 that spurious oscillations occur in the quasi-smooth or non-

smooth areas. To reduce the spurious oscillations, we modify the scheme by adding a slope limiter in the state

reconstruction. For example, we calculate the reconstructed states Φ⁻_{i−1/2,j} and Φ⁺_{i−1/2,j} by

Φ⁻_{i−1/2,j} = Φ_{i−1,j} + (1/2) limiter(Φ_{i−1,j} − Φ_{i−2,j}, Φ_{i,j} − Φ_{i−1,j})

and

Φ⁺_{i−1/2,j} = Φ_{i,j} − (1/2) limiter(Φ_{i,j} − Φ_{i−1,j}, Φ_{i+1,j} − Φ_{i,j}),

respectively, and the others are defined in a similar way. Here we use the corrected van Albada limiter [28]:

limiter(d₁, d₂) = d₁d₂(d₁ + d₂)/(d₁² + d₂²) if d₁d₂ > 0, and limiter(d₁, d₂) = 0 otherwise.
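A direct transcription of the limited reconstruction, as a sketch (the array indexing convention is an assumption of this illustration):

```python
import numpy as np

def limiter(d1, d2):
    # Corrected van Albada limiter [28]: zero at extrema (d1*d2 <= 0),
    # otherwise a smooth average of the two one-sided slopes.
    if d1 * d2 <= 0.0:
        return 0.0
    return d1 * d2 * (d1 + d2) / (d1 * d1 + d2 * d2)

def left_state(phi, i, j):
    # Limited left state at face (i-1/2, j), matching the first
    # reconstruction formula in the text.
    return phi[i - 1, j] + 0.5 * limiter(phi[i - 1, j] - phi[i - 2, j],
                                         phi[i, j] - phi[i - 1, j])
```

On a locally linear profile both slope arguments agree and the limiter returns the common slope, so the reconstruction stays second order there; at a local extremum it falls back to the first-order cell average, which is what suppresses the oscillations.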

In this case, (2.8) becomes a system of nonlinear algebraic equations. We use a Newton-Krylov-Schwarz type

algorithm to solve it. The algorithm includes the following steps: an inexact Newton method for the nonlinear

system and the linear algorithms described in Sect. 3 for the Jacobian system [3, 35]. We solve Case-2 on

a 512×512×6 mesh by the implicit method with △t=T/5,000. As shown in Fig. 14, undershoots appear when the

limiter is not applied, while the undershoots disappear and the monotonicity is obtained with the use of the

limiter.
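The outer loop of the nonlinear solve can be sketched in a matrix-free form; this minimal version omits the Schwarz preconditioner, the line search and the semismooth extensions of [3, 35], and the finite-difference Jacobian-vector product and tolerances are illustrative assumptions:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

def inexact_newton(F, u0, tol=1e-8, max_it=20, fd_eps=1e-7):
    # Solve F(u) = 0: each Newton correction solves J(u) du = -F(u)
    # inexactly with GMRES, where the Jacobian-vector product J v is
    # approximated by a first-order finite difference of F.
    u = np.asarray(u0, dtype=float).copy()
    for _ in range(max_it):
        r = F(u)
        if np.linalg.norm(r) < tol:
            break
        J = LinearOperator((u.size, u.size),
                           matvec=lambda v: (F(u + fd_eps * v) - r) / fd_eps)
        du, info = gmres(J, -r)  # unpreconditioned here; RAS in the paper
        u = u + du
    return u
```

In the full algorithm the GMRES solve is preconditioned by the restricted additive Schwarz method built from the lower order discretization, exactly as in the linear case.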

Fig. 14

Contour plots of Case-2 at t=T/2 without (left panel) or with (right panel) the limiter. The problem is solved on

a 512×512×6 mesh by the implicit method with △t=T/5,000


4.3 Parallel Performance

In this subsection, we present some numerical results using Case-1 and mainly focus on the parallel performance

of the Schwarz preconditioner which is the most critical component of the implicit method. Our algorithms are

implemented based on the Portable, Extensible Toolkit for Scientific Computation (PETSc) [1]. All computations

are performed on a Dell PowerEdge C6100 supercomputer located at the University of Colorado Boulder. Each

node contains 24 GB of local memory and two hex-core 2.8 GHz Intel Westmere processors. The nodes are

interconnected via a non-blocking QDR Infiniband high performance network.

Scalability is an important issue in parallel computing, especially for solving large-scale problems with many

processors. In our tests, the strong scalability is defined by Speedup=T1/T2 where T1 and T2 are the execution

times obtained by running the parallel code with Np,1 and Np,2 processors (Np,1≤Np,2), respectively. The weak

scalability is used to examine how the execution time varies with the number of processors when the problem

size per processor is fixed.
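Both measures can be computed directly from the timing tables; a small helper, with sample numbers taken from the Δt=T/500 column of Table 5 for “1st-order-pre”:

```python
def speedup(t1, t2):
    # Strong scaling: ratio of execution times T1/T2.
    return t1 / t2

def efficiency(np1, t1, np2, t2):
    # Fraction of the ideal speedup np2/np1 actually achieved.
    return speedup(t1, t2) / (np2 / np1)

# 192 -> 3,072 processors, 1st-order-pre, dt = T/500 (Table 5):
s = speedup(4869.9, 285.5)
e = efficiency(192, 4869.9, 3072, 285.5)
```

For this column the measured speedup slightly exceeds the ideal factor of 16, i.e. mildly superlinear scaling, which typically reflects improved cache residency as the per-processor problem shrinks.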

For the implicit solver, we check the robustness of the algorithms with respect to the time step size △t.

Table 5 shows the influence of Δt for a fixed mesh 2,048×2,048×6. As Δt increases, the number of iterations

increases, while the computing time decreases. Also, it is clear that the performance of the proposed method is

better when Δt becomes small, in terms of the strong scalability. In this test case, a sparse LU factorization is

used to solve the subdomain problems in the RAS preconditioner. It is important to note that the timing results

obtained by using the first-order discretization based preconditioner are always better than those obtained with

the second-order discretization based preconditioner.

Table 5

Effect of time step sizes for Case-1

T/10 T/50 T/200 T/500

Np Iter Time Iter Time Iter Time Iter Time

1st-order-pre

192 132.6 176.5 67.3 630.7 40.3 2,205.7 27.3 4,869.9

384 139.5 83.6 68.7 293.0 41.2 977.0 27.9 2,226.5

768 168.5 63.2 73.3 126.3 42.4 399.3 28.6 893.3

1,536 182.0 28.9 76.2 67.4 43.3 201.7 29.1 503.2

3,072 257.4 14.6 84.6 37.7 45.5 110.2 29.8 285.5

2nd-order-pre

192 62.8 449.5 15.3 1,830.7 9.6 7,313.7 8.6 20,062.9

384 67.0 216.3 16.3 816.6 10.0 3,501.6 8.9 9,274.0

768 119.2 85.3 25.2 220.9 10.6 767.0 9.0 2,039.3

1,536 134.4 49.3 27.8 107.7 10.9 352.5 9.1 934.8

3,072 250.5 40.2 46.5 48.8 14.4 124.6 9.5 388.1

2,048×2,048×6 mesh, LU subdomain solver, and δ=1. “Iter” denotes the average number of iterations per time

step, and “Time” denotes the total computing time in seconds. “1st-order-pre” denotes the preconditioner with

the first-order discretization; “2nd-order-pre” denotes the preconditioner with the second-order discretization


In the Schwarz preconditioner, the overlapping parameter δ plays an important role in controlling the number of

iterations and the total computing time. For this experiment, we consider two meshes, 1,024×1,024×6 and

2,048×2,048×6, and use two fixed time step sizes, Δt=T/100 and Δt=T/1,000. We run Case-1 using different overlapping

sizes and different numbers of processors. The subdomain solver is set to be the LU factorization. In

Tables 6 and 7, we show the performance of the additive Schwarz preconditioner. The results suggest that an

optimal overlapping size exists if the goal is to minimize the total computing time for a given number of

processors on a particular machine. Also, the performance of the implicit method with the first-order

preconditioner is more attractive as measured by the computing time, as shown in Fig. 15.

Fig. 15

Strong scalability results for Case-1 with different numbers of processors Np. The mesh is 2,048×2,048×6

and △t=T/100. We use LU and δ=1

Table 6

Effect of overlapping size δ for Case-1

δ=0 δ=1 δ=2 δ=3

Np Iter Time Iter Time Iter Time Iter Time

1st-order-pre

192 54.6 1,112.4 53.0 1,135.9 51.7 1,202.1 51.6 1,233.7

384 54.0 516.5 54.1 531.8 53.0 562.3 52.7 604.9

768 58.2 221.2 56.5 223.5 55.0 247.9 54.6 275.8

1,536 58.7 115.7 58.1 117.7 56.7 119.5 56.1 141.5

3,072 67.1 64.3 62.7 63.8 60.5 80.5 59.7 86.9

2nd-order-pre

192 22.8 3,672.8 10.7 3,609.0 8.7 3,883.5 8.4 4,421.4


384 24.1 1,551.6 11.1 1,630.7 9.1 1,736.6 8.7 1,716.3

768 26.8 368.8 14.6 405.5 13.6 434.8 13.3 456.9

1,536 28.3 232.2 15.7 196.3 14.6 199.8 14.3 219.0

3,072 37.4 81.0 24.7 88.1 23.8 93.8 23.4 103.0

LU subdomain solver, 2,048×2,048×6 mesh, and Δt=T/100. “Iter” denotes the average number of iterations per

time step, and “Time” denotes the total computing time in seconds. “1st-order-pre” denotes the preconditioner

with the first-order discretization; “2nd-order-pre” denotes the preconditioner with the second-order

discretization

Table 7

Effect of the overlapping size δ for Case-1

δ=0 δ=1 δ=2 δ=3

Np Iter Time Iter Time Iter Time Iter Time

1st-order-pre

192 18.4 1,277.7 15.5 1,262.7 14.7 1,371.0 14.5 1,452.7

384 18.5 626.8 15.7 620.7 14.9 639.2 14.5 760.9

768 18.9 255.6 16.1 264.0 15.4 270.5 15.0 304.7

1,536 19.2 173.3 16.3 152.5 15.4 159.6 15.1 181.8

3,072 19.6 101.3 16.7 103.0 15.9 110.6 15.5 121.6

2nd-order-pre

192 14.5 3,083.0 6.6 3,443.8 4.1 3,507.8 3.3 4,084.6

384 14.8 1,372.0 6.9 1,496.7 4.1 1,624.4 3.3 1,764.1

768 15.1 351.7 6.9 343.3 4.2 385.0 3.3 427.3

1,536 15.4 227.3 7.2 215.0 4.2 200.8 3.3 248.4

3,072 15.6 110.1 7.4 110.5 4.4 128.6 3.8 150.0

LU subdomain solver, 1,024×1,024×6 mesh, and Δt=T/1,000. “Iter” denotes the average number of iterations

per time step, and “Time” denotes the total computing time in seconds. “1st-order-pre” denotes the

preconditioner with the first-order discretization; “2nd-order-pre” denotes the preconditioner with the second-

order discretization

The performance of the implicit method depends heavily on how the subdomain problems are solved. In the

following set of tests, we compare several different subdomain solvers, including sparse ILU factorizations

with different levels of fill-in. We run the test on a fixed mesh 2,048×2,048×6 with Δt=T/100. We summarize

the results with different numbers of processors and levels of fill-in in Table 8. Compared with Table 6, we see

that the implicit method with ILU is more attractive in terms of the computing time. The number of

iterations is relatively small and increases only slightly as Np tends to 3,072. Moreover, in terms of the total

computing time, the performance of the “2nd-order-pre” approach is better than that of the “1st-order-pre”

approach when the number of processors is small, such as 192 and 384, but the “1st-order-pre” approach

becomes more competitive as Np becomes larger. As a result, as shown in

Fig. 16, the “1st-order-pre” approach shows better strong scalability than the “2nd-order-pre” approach.

In this sense, the “1st-order-pre” approach is more suitable for solving large-scale problems with many

processors.

Fig. 16

Strong scalability results for Case-1 with different numbers of processors Np. The mesh is 2,048×2,048×6

and △t=T/100. We use ILU(3) and δ=1

Table 8

Test results using different fill-in levels k and different number of processors for Case-1

k=2 k=3 k=4 k=5

Np Iter Time Iter Time Iter Time Iter Time

1st-order-pre

192 106.1 588.5 77.9 518.8 64.0 492.0 57.4 494.3

384 107.5 305.2 79.6 269.3 65.3 255.0 58.2 251.5

768 109.6 160.9 82.4 145.0 67.2 140.0 60.6 137.5

1,536 111.8 90.9 84.5 84.7 68.9 80.7 62.1 80.9


3,072 114.8 58.5 87.6 56.5 72.9 54.2 66.6 53.5

2nd-order-pre

192 64.5 599.4 32.3 434.2 22.8 402.8 18.6 401.6

384 67.6 352.4 33.3 273.5 23.8 244.0 19.7 230.1

768 68.6 202.7 34.1 162.7 24.2 146.7 20.1 138.0

1,536 71.8 108.3 38.1 89.0 26.9 84.1 22.6 96.8

3,072 73.6 87.6 38.9 59.1 29.6 58.9 26.3 58.7

δ=1 and Δt=T/100. 2,048×2,048×6 mesh. “Iter” denotes the average number of iterations per time step, and

“Time” denotes the total computing time in seconds. “1st-order-pre” denotes the preconditioner with the first-

order discretization; “2nd-order-pre” denotes the preconditioner with the second-order discretization

Finally, to further examine the parallel performance of the proposed methods, we show the weak scalability in

Table 9 and Fig. 17. The first-order approach is clearly better than the second-order approach in terms of the

computing time, although the first-order approach needs more iterations. We also observe that for the implicit

solver the number of iterations grows when the number of processors increases and the mesh is refined; as a

result, the computing time cannot stay unchanged. This suggests the need for a two-level or multilevel Schwarz

algorithm.

Fig. 17

Weak scalability results for Case-1 with different numbers of processors Np. A fixed 96×96 mesh is used per

processor. “1st-order-pre-LU” denotes the 1st-order scheme with an LU subdomain solver, and the others are

defined similarly

Table 9

Weak scalability results for Case-1 with a fixed 96×96 mesh per processor

1st-order-pre 2nd-order-pre

Np Mesh Iter Time Iter Time

LU

24 192×192×6 17.0 19.5 6.3 38.3

96 384×384×6 24.0 24.8 8.1 41.7

384 768×768×6 34.8 34.7 10.1 46.7

864 1,152×1,152×6 43.3 43.4 12.4 60.1

1,536 1,536×1,536×6 52.0 55.5 15.2 68.8

1,944 1,728×1,728×6 54.7 61.1 16.7 75.1

2,904 2,112×2,112×6 61.7 68.3 19.8 89.8

ILU(4)

24 192×192×6 17.1 13.9 6.5 14.1

96 384×384×6 24.1 16.9 9.0 15.9

384 768×768×6 35.6 22.9 13.6 22.2

864 1,152×1,152×6 45.9 30.0 17.8 30.5

1,536 1,536×1,536×6 56.5 40.0 22.5 42.9

1,944 1,728×1,728×6 62.0 49.1 24.9 55.9

2,904 2,112×2,112×6 73.3 56.3 30.5 70.1

δ=1 and △t=T/100. “Iter” denotes the average number of iterations per time step, and “Time” denotes the total

computing time in seconds. “1st-order-pre” denotes the preconditioner with the first-order discretization; “2nd-

order-pre” denotes the preconditioner with the second-order discretization. The total number of degrees of

freedom for the largest case is 2,112×2,112×6=26,763,264

5 Concluding Remarks

A parallel, fully implicit method was developed for solving the tracer transport problem on the cubed-sphere.

Domain decomposition methods with both first-order and second-order discretizations are proposed to solve the

linear system at each time step. The implicit method with the second-order temporal discretization allows much

larger time steps than the explicit method, while preserving the accuracy of the solution, and also demonstrates

superior performance in terms of the total computing time compared to the explicit method. The effectiveness

and scalability of the implicit method depend heavily on the design of the preconditioner. After many

experiments, we found that the class of restricted additive Schwarz methods based on a first-order discretization

works well, and is more attractive than the second-order discretization. Excellent results were obtained for

solving several test problems with tens of millions of unknowns and on a parallel machine with up to 3,072

processors. We believe that the family of Schwarz methods with low-order discretizations is suitable for larger

problems and for machines with larger numbers of processors. Future research may include solving other flow


problems on the cubed-sphere on much finer meshes with a larger number of processors and with multilevel

Schwarz preconditioners.

Acknowledgments

The authors would like to express their appreciation to the anonymous reviewers for the invaluable comments

that greatly improved the quality of the manuscript. This work was supported in part by NSF grant

CCF-1216314 and DOE grant DE-SC0001774. H. Yang was also supported in part by NSFC grants 91330111,

11201137 and 11272352. C. Yang was also supported in part by NSFC grants 61170075 and 91130023.

References

1.Balay, S., Buschelman, K., Gropp, W.D., Kaushik, D., Knepley, M., McInnes, L.C., Smith, B.F., Zhang, H.: PETSc Users Manual. Argonne National Laboratory (2012)
2.Brown, P.N., Shumaker, D.E., Woodward, C.S.: Fully implicit solution of large-scale non-equilibrium radiation diffusion with high order time integration. J. Comput. Phys. 204, 760–783 (2005)
3.Cai, X.-C., Gropp, W.D., Keyes, D.E., Melvin, R.G., Young, D.P.: Parallel Newton-Krylov-Schwarz algorithms for the transonic full potential equation. SIAM J. Sci. Comput. 19, 246–265 (1998)
4.Cai, X.-C., Sarkis, M.: A restricted additive Schwarz preconditioner for general sparse linear systems. SIAM J. Sci. Comput. 21, 792–797 (1999)
5.Chen, C., Xiao, F.: Shallow water model on cubed-sphere by multi-moment finite volume method. J. Comput. Phys. 227, 5019–5044 (2008)
6.Evans, K.J., Knoll, D.A.: Temporal accuracy of phase change convection simulations using the JFNK-SIMPLE algorithm. Int. J. Numer. Meth. Fluids 55, 637–655 (2007)
7.Erath, C., Lauritzen, P.H., Garcia, J.H., Tufo, H.M.: Integrating a scalable and efficient semi-Lagrangian multi-tracer transport scheme in HOMME. Proc. Comput. Sci. 9, 994–1003 (2012)
8.Harris, L.M., Lauritzen, P.H., Mittal, R.: A flux-form version of the conservative semi-Lagrangian multi-tracer transport scheme (CSLAM) on the cubed-sphere grid. J. Comput. Phys. 230, 1215–1237 (2011)
9.Jacobson, M.Z.: Fundamentals of Atmospheric Modeling. Cambridge University Press, New York (1999)
10.Knoll, D.A., Chacon, L., Margolin, L.G., Mousseau, V.A.: On balanced approximations for time integration of multiple time scale systems. J. Comput. Phys. 185, 583–611 (2003)
11.Lauritzen, P.H., Nair, R.D., Ullrich, P.A.: A conservative semi-Lagrangian multi-tracer transport scheme (CSLAM) on the cubed-sphere grid. J. Comput. Phys. 229, 1401–1424 (2010)
12.Lauritzen, P.H., Skamarock, W.C.: Test-case suite for 2D passive tracer transport: a proposal for the NCAR transport workshop. March (2011)
13.Lauritzen, P.H., Jablonowski, C., Taylor, M., Nair, R.: Numerical Techniques for Global Atmospheric Models. Lecture Notes in Computational Science and Engineering. Springer, Berlin (2011)
14.Lauritzen, P.H., Ullrich, P.A., Nair, R.D.: Atmospheric transport schemes: desirable properties and a semi-Lagrangian view on finite-volume discretizations. In: Lecture Notes in Computational Science and Engineering (Tutorials), vol. 80. Springer (2011)
15.Lauritzen, P.H., Skamarock, W.C., Prather, M.J., Taylor, M.A.: A standard test case suite for two-dimensional linear transport on the sphere. Geosci. Model Dev. Discuss. 5, 189–228 (2012)
16.Lauritzen, P.H., Thuburn, J.: Evaluating advection/transport schemes using interrelated tracers, scatter plots and numerical mixing diagnostics. Q. J. Roy. Meteor. Soc. 138, 906–918 (2012)
17.Nair, R.D., Thomas, S.J., Loft, R.D.: A discontinuous Galerkin global shallow water model. Mon. Weather Rev. 133, 876–888 (2005)
18.Nair, R.D., Lauritzen, P.H.: A class of deformational-flow test cases for linear transport problems on the sphere. J. Comput. Phys. 229, 8868–8887 (2010)
19.Putman, W.M., Lin, S.-J.: Finite-volume transport on various cubed-sphere grids. J. Comput. Phys. 227, 55–78 (2007)
20.Rancic, M., Purser, R.J., Mesinger, F.: A global shallow-water model using an expanded spherical cube: gnomonic versus conformal coordinates. Q. J. Roy. Meteor. Soc. 122, 959–982 (1996)
21.Ronchi, C., Iacono, R., Paolucci, P.: The cubed sphere: a new method for the solution of partial differential equations in spherical geometry. J. Comput. Phys. 124, 93–114 (1996)
22.Sadourny, R., Arakawa, A., Mintz, Y.: Integration of the nondivergent barotropic vorticity equation with an icosahedral-hexagonal grid for the sphere. Mon. Weather Rev. 96, 351–356 (1968)
23.Sadourny, R.: Conservative finite-difference approximations of the primitive equations on quasi-uniform spherical grids. Mon. Weather Rev. 100, 211–224 (1972)
24.Saad, Y.: Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia (2003)
25.Shadid, J.N., Tuminaro, R.S., Devine, K.D., Hennigan, G.L., Lin, P.T.: Performance of fully coupled domain decomposition preconditioners for finite element transport/reaction simulations. J. Comput. Phys. 205, 24–47 (2005)
26.Smith, B., Bjørstad, P., Gropp, W.: Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, Cambridge (1996)
27.Toselli, A., Widlund, O.: Domain Decomposition Methods: Algorithms and Theory. Springer, Berlin (2005)
28.Van Albada, G.D., van Leer, B., Roberts, W.W.: A comparative study of computational methods in cosmic gas dynamics. Astron. Astrophys. 108, 95–103 (1982)
29.White III, J.B., Dongarra, J.J.: High-performance high-resolution semi-Lagrangian tracer transport on a sphere. J. Comput. Phys. 230, 6778–6799 (2011)
30.Williamson, D.L., Drake, J.B., Hack, J.J., Jakob, R., Swarztrauber, P.N.: A standard test set for numerical approximations to the shallow water equations in spherical geometry. J. Comput. Phys. 102, 211–224 (1992)
31.Wu, Y., Cai, X.-C., Keyes, D.E.: Additive Schwarz methods for hyperbolic equations. In: Mandel, J., Farhat, C., Cai, X.-C. (eds.) Proceedings of the 10th International Conference on Domain Decomposition Methods, AMS, pp. 513–521 (1998)
32.Yang, C., Cao, J., Cai, X.-C.: A fully implicit domain decomposition algorithm for shallow water equations on the cubed-sphere. SIAM J. Sci. Comput. 32, 418–438 (2010)
33.Yang, C., Cai, X.-C.: Parallel multilevel methods for implicit solution of shallow water equations with nonsmooth topography on cubed-sphere. J. Comput. Phys. 230, 2523–2539 (2011)
34.Yang, C., Cai, X.-C.: A scalable fully implicit compressible Euler solver for mesoscale nonhydrostatic simulation of atmospheric flows. SIAM J. Sci. Comput. To appear
35.Yang, H., Cai, X.-C.: Parallel two-grid semismooth Newton-Krylov-Schwarz method for nonlinear complementarity problems. J. Sci. Comput. 47, 258–280 (2011)
36.Yang, H., Prudencio, E., Cai, X.-C.: Fully implicit Lagrange-Newton-Krylov-Schwarz algorithms for boundary control of unsteady incompressible flows. Int. J. Numer. Meth. Eng. 91, 644–665 (2012)
37.Zhang, J., Wang, L.L., Rong, Z.: A prolate-element method for nonlinear PDEs on the sphere. J. Sci. Comput. 47, 73–92 (2011)