
Petascale Application of a Coupled CPU-GPU Algorithm for Simulation and Analysis of Multiphase Flow Solutions in Porous Medium Systems

IEEE International Parallel and Distributed Processing Symposium, May 2014, Phoenix, Arizona

James E. McClure*, Hao Wang†, Jan F. Prins‡, Cass T. Miller§, and Wu-chun Feng†

*Advanced Research Computing, Virginia Tech, Blacksburg, Virginia. Email: [email protected]

†Dept. of Computer Science, Virginia Tech, Blacksburg, Virginia. Email: {hwang121, wfeng}@vt.edu

‡Dept. of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina. Email: [email protected]

§Dept. of Environmental Science & Engineering, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina. Email: [email protected]

Abstract—Large-scale simulation can provide a wide range of information needed to develop and validate theoretical models for multiphase flow in porous medium systems. In this paper, we consider a coupled solution in which a multiphase flow simulator is coupled to an analysis approach used to extract the interfacial geometries as the flow evolves. This has been implemented using MPI to target heterogeneous nodes equipped with GPUs. The GPUs evolve the multiphase flow solution using the lattice Boltzmann method while the CPUs compute upscaled measures of the morphology and topology of the phase distributions and their rate of evolution. Our approach is demonstrated to scale to 4,096 GPUs and 65,536 CPU cores to achieve a maximum performance of 244,754 million-lattice-node updates per second (MLUPS) in double-precision execution on Titan. In turn, this approach increases the size of systems that can be considered by an order of magnitude compared with previous work and enables detailed in situ tracking of averaged flow quantities at temporal resolutions that were previously impossible. Furthermore, it virtually eliminates the need for post-processing and intensive I/O and mitigates the potential loss of data associated with node failures.

Keywords-CUDA, Heterogeneous Computing, Parallel Computing, GPGPU, Lattice Boltzmann Method

I. Introduction

Multiphase flow processes in porous media are operative in many applications, including carbon sequestration, remediation of environmental contaminants, and enhanced oil recovery. The ability to describe these processes in an averaged sense is essential due to the size and complexity of these systems, and computational studies hold great promise to advance fundamental understanding of the transport behavior. Multi-scale descriptions have gained traction as a concrete way to resolve deficiencies in the existing mathematical formulations that are typically applied to describe multiphase transport phenomena in porous media [1, 2]. These developments advance a theoretical basis for upscaled models and provide both general and specific forms of closure relations that can be evaluated, advanced, and validated using high-resolution simulations that have until now been computationally out of reach. Theoretical approaches have already established a rigorous connection between microscopic flow processes and the larger-scale macroscopic behavior, and accurately tracking changes in interfacial areas has been identified as critically important to the closure of thermodynamically-consistent models [3]. Directly simulating the microscopic details of flow processes that occur within the complex micrometer-sized interstices of the solid geometry provides a way to advance these models.

The lattice Boltzmann method (LBM) is well-established as a useful approach for numerical investigation of multiphase flow in porous media [4, 5, 6, 7]. The computational properties of the LBM make it well-suited for parallel implementation, and the performance and scalability of GPU-based implementations have been demonstrated for multiple computational fluid dynamics applications [8, 9, 10]. For a multiphase implementation of the LBM targeting multiple GPUs with MPI, Wang et al. reported a performance of 2,850 MLUPS on 64 GPUs, demonstrating that excellent weak scaling can be obtained provided that sufficiently large sub-domains


are retained by each MPI process [11]. In order to extend this performance to flow in porous media, GPU implementations must not only solve for the multiphase flow, but also accommodate complex solid boundary conditions within GPU kernels. An additional challenge is presented by the need to track the dynamics of multiphase flow in large, complex systems in situ so that the need to save the simulation state to disk at regular intervals can be avoided.

In this paper we construct a scalable approach to the simulation and analysis of multiphase flows that takes advantage of heterogeneous parallelism by coupling a GPU-based solution for the multiphase flow problem to an analysis algorithm used to extract morphological information on the CPU. This approach enables the practical application of the LBM to study multiphase flow in porous media on the world's largest GPU-equipped supercomputer. The implementation achieves near-ideal weak scaling and a performance of 244,754 MLUPS for double-precision execution on 4,096 GPUs. The specific contributions of this work are as follows:

1) development of a GPU-based simulator for multiphase flow in porous media using CUDA and MPI;

2) significant reduction in the need for I/O, data storage, data transfer, and post-processing analysis by performing all analysis on the CPU during the course of a simulation;

3) evaluation of the scaling and load balance of the developed algorithm and demonstration of its performance; and

4) demonstration that the method can be used to extract previously inaccessible scientific results from large-scale simulations.

II. Background

A. Scientific Merit

Porous media include a wide range of natural and synthetic materials in which flow processes can occur in the interior spaces, also known as the pore-space. Direct simulation of the microscopic details of multiphase flow processes requires knowledge of the pore-space geometry in which this flow occurs. In order to obtain sufficiently complex geometries, dense sphere packs are often used. The spheres represent the solid phase s, which is immobile and non-deformable for our simulations. An example sphere pack is shown in Fig. 1, in which non-overlapping spheres with log-normally distributed radii have been inserted into a cubic domain to serve as a surrogate porous medium [12]. The region of the domain occupied by the solid phase is denoted by Ω_s. In many cases it is important to predict the behavior of two fluid phases, which are referred to as the wetting (w) and non-wetting (n) fluids.

Figure 1. An example of a porous medium system constructed from a random close packing of 85,519 spheres with log-normally distributed radii.

This classification is determined based on the contact angle formed where the two fluids meet the solid phase.

For two-phase flow in porous media, capillary forces tend to dominate other effects, and the evolution of the interfaces is a main factor that determines the equilibrium and non-equilibrium behavior. Extracting information about the dynamic behavior of interfaces is not currently possible in an experimental setting, and computational approaches currently provide the most practical way to generate detailed insight into these processes. For the systems considered in this work, three interfaces can exist: the interface between the wetting and non-wetting fluids Ω_wn, the interface between the wetting fluid and the solid Ω_ws, and the interface between the non-wetting fluid and the solid Ω_ns. Existing approaches have been used to evaluate interfacial area in porous medium systems that are approximately 400^3 in size [13]. In this work we develop an approach that can be applied to systems that are an order of magnitude larger, with analysis applied thousands of times over the course of a single simulation.

In order to develop a coherent macroscopic multiphase flow model for porous media, it is necessary to comprehend the behavior of a wide range of averaged quantities that are obtained by integrating quantities over the phase volumes, interfaces between phases, and the common curve where all three phases coexist [1]. In this work we consider a relatively simple computational experiment that is of direct importance to the advancement of multi-scale models for multiphase flow in porous media. A non-equilibrium configuration of multiple phases is initialized within the pore-space of


a sphere packing, and then the interfacial areas a_wn, a_ws, and a_ns are evaluated numerically as the system relaxes toward equilibrium. The interfacial areas are computed directly at intervals of T time steps throughout the course of the simulation. As a measure of the approach to equilibrium, we evaluate the change in surface energy:

\Delta E_s = \gamma_{wn}\,\Delta a_{wn} + (\gamma_{ns} - \gamma_{ws})\,\Delta a_{ns}.    (1)

The surface energy for the wn interface, γ_wn, can be computed based on the parameters for the simulation. The difference between the ns and ws interfacial energies, given by γ_ns and γ_ws respectively, is related to the equilibrium contact angle by Young's equation [14]. The contact angle can be determined from the simulation parameters, permitting the computation of ΔE_s based on the time history of interfacial areas [15]. It is important to note that while the evaluation of the interfacial areas is a simple calculation, because a numerical approximation for the interfaces is constructed explicitly it is also possible to track the transient behavior of virtually any scalar, vector, or tensor quantity averaged over the phase volumes, interfaces, or the common curve.
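For reference, Young's equation (a standard relation not written out explicitly in the text) gives the solid-fluid interfacial energy difference in terms of the equilibrium contact angle θ_eq, so Eq. 1 can be expressed using γ_wn alone:

```latex
% Young's equation at the three-phase contact line (standard result):
\gamma_{ns} - \gamma_{ws} = \gamma_{wn} \cos\theta_{eq}
% Substituting into Eq. (1):
\Delta E_s = \gamma_{wn} \left( \Delta a_{wn} + \cos\theta_{eq}\,\Delta a_{ns} \right)
```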

B. GPU-Accelerated Supercomputer Architecture

Due to their superior performance and energy efficiency, GPUs are widely used to speed up scientific computing applications. In the Top500 list [16] published in November 2013, 41 systems, including two of the top ten supercomputers, were equipped with GPUs. Since we carry out our experimental evaluations on the Titan supercomputer, currently ranked No. 2 on the Top500 list, we describe Titan as a representative GPU-accelerated supercomputer architecture.

Titan uses an NVIDIA Kepler K20x GPU on each computational node. Each Kepler K20x GPU has 6 GB of GDDR5 SDRAM device memory and 2,688 CUDA cores in total. The peak performance of a GPU is 3.95 TFlops in single precision and 1.31 TFlops in double precision. A GPU contains 14 streaming multiprocessors (SMs), each of which consists of 192 CUDA cores operating as SIMD (single instruction, multiple data stream) units. The device memory, also known as the global memory, is shared by all the SMs in a GPU with 250 GB/s peak bandwidth.

We use the Compute Unified Device Architecture (CUDA) programming model provided by NVIDIA to implement our program. A CUDA program consists of CPU code and GPU code. The functions that run on the GPU are called GPU kernels. A kernel function is executed in parallel by a large number of threads on the GPU. The threads are grouped into blocks of threads and grids of blocks. When a kernel is launched (or offloaded) by the CPU, the launch parameters,

Figure 2. An example workflow depicting the simulation and analysis of multiphase flow. Simulation data is written to disk and transferred from the supercomputer to a local workstation where analysis is performed. The time interval for analysis T is thereby constrained by data movement and analysis bottlenecks.

such as the number of threads per block and the number of blocks per grid, must be specified. Other major components of Titan include an AMD Opteron 6274 16-core processor and 32 GB of DDR3 main memory on each node, and the Cray Gemini interconnect, which connects nodes in a 3D torus topology. We use CUDA for the parallel computation on the GPU and MPI for communication among nodes.
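As a concrete illustration of these launch parameters, the minimal CUDA sketch below launches a hypothetical kernel over an N × N × N sub-domain; the kernel name, block size, and initialization are illustrative assumptions rather than the paper's actual code.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative kernel: one thread per lattice site of an N*N*N sub-domain.
__global__ void InitPhaseField(double *phi, int nSites)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (idx < nSites)
        phi[idx] = 1.0;                                // placeholder initialization
}

int main()
{
    const int N = 128;                        // sub-domain edge length (example value)
    const int nSites = N * N * N;
    double *phi;
    cudaMalloc(&phi, nSites * sizeof(double));

    // Launch parameters: threads per block and blocks per grid.
    const int threadsPerBlock = 256;
    const int blocksPerGrid = (nSites + threadsPerBlock - 1) / threadsPerBlock;
    InitPhaseField<<<blocksPerGrid, threadsPerBlock>>>(phi, nSites);
    cudaDeviceSynchronize();

    cudaFree(phi);
    return 0;
}
```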

III. Computational Approach

A. Overall Simulator Design

This work considers a multiphase implementation of the LBM that is written in CUDA and MPI and coupled to an image processing code (written in C++) that extracts the interfacial averages during the course of the simulation. The multiphase flow simulator is a GPU-based implementation of the color-model lattice Boltzmann scheme developed by McClure et al., a scheme that was constructed to use both the GPU and the multi-core CPU in tandem [15]. In this work, all LBM computations are performed on the GPU, and the CPU cores are dedicated to carrying out the image-processing computations needed to extract information from the simulation. This analysis is performed using a porous medium marching cubes (PMMC) algorithm developed to extract the interfaces and calculate a variety of other averages [17]. The details for each aspect of this approach are presented in the following sections.

In order to enable large-scale simulations that can meet the scientific objectives outlined for this application, it is essential to define a scalable workflow. Very


Figure 3. The overall design of the simulation tool considered in this work relies on the GPU to execute the multiphase flow simulation while the CPUs perform analysis at regular intervals throughout the simulation. As a consequence, the interval for analysis T can be chosen to be small, yielding detailed information about the system dynamics.

large systems must be considered in order to produce generalizable results from pore-scale studies. However, it is cumbersome to retain and analyze the resulting data. This bottleneck is evident from the workflow shown in Fig. 2. In this example, a CPU-based implementation of the LBM is used to provide the solution for the phase indicator field φ(x_i, t), the pressure field p(x_i, t), and the velocity field u(x_i, t). The simulation is executed by P MPI processes, each of which owns a sub-domain of size N × N × N. In order to extract averaged information from the simulation, the solution is saved to disk every T time steps and transferred to a local workstation where the PMMC approach is applied to determine the interfacial areas.

In order to track the dynamics of the system, the simulation state is analyzed at intervals of T time steps. For the present study tracking the interfacial areas, it is sufficient to save only the phase indicator field to disk, although this is not true in general. If the analysis is performed as shown in Fig. 2, 8PN^3 bytes must be written to disk and transferred off of the cluster for post-processing. The interval T is constrained by the extreme costs of data transfer, in particular. Furthermore, the total system size PN^3 is constrained by the amount of memory available on the desktop machine, and the time required to execute the PMMC analysis does not scale with the number of processors P. As a consequence, the interval T becomes unacceptably large as P increases. Note that in the present study we are interested in computing only four quantities: a_wn, a_ws, a_ns, and ΔE_s.

In a GPU-based simulation, it is attractive to use the idle CPU cores to perform analysis so that I/O and data transfer costs can be reduced. The strategy outlined in Fig. 3 provides a way to avoid data transfers and I/O while taking advantage of the available CPUs. At intervals of T time steps, the requisite data is transferred from the GPU to the CPU using cudaMemcpy. Each MPI process then applies the PMMC algorithm to extract the local interfacial areas on the CPU, which are reduced to obtain a_wn, a_ws, and a_ns using MPI_Reduce. Only these values are written to output, effectively eliminating the cost of data transfer since the size of the output data is no longer proportional to PN^3. The time interval T can be chosen such that the costs of carrying out the averaging have minimal impact on the performance of the LBM simulator itself. A typical LBM calculation for multiphase flow in porous media requires 10^5 to 10^7 time steps. In this work we demonstrate that excellent performance can be sustained while performing averaging every T = 10^3 time steps, such that dynamic information about multiphase flow processes can be collected hundreds to thousands of times per simulation.
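The coupled loop just described might be organized as in the following MPI+CUDA sketch; the kernel and the PMMC routine are reduced to stand-ins (LBMTimeStep and ComputeLocalAreas are hypothetical names), so only the data movement and reduction pattern is shown.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

// Stand-in for the actual collision/streaming update performed on the GPU.
__global__ void LBMTimeStep(double *phi, int nSites)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nSites) phi[i] = phi[i];   // placeholder body
}

// Stand-in for the CPU-side PMMC analysis of the phase indicator field.
static void ComputeLocalAreas(const double *phiHost, int nSites, double localArea[3])
{
    (void)phiHost; (void)nSites;
    localArea[0] = localArea[1] = localArea[2] = 0.0;   // a_wn, a_ws, a_ns
}

void RunCoupledSimulation(double *phiDev, double *phiHost, int nSites,
                          int nSteps, int T, int threads, int blocks)
{
    for (int step = 1; step <= nSteps; ++step) {
        LBMTimeStep<<<blocks, threads>>>(phiDev, nSites);        // GPU: flow solution

        if (step % T == 0) {
            // Copy only the phase indicator field needed for the analysis.
            cudaMemcpy(phiHost, phiDev, nSites * sizeof(double),
                       cudaMemcpyDeviceToHost);

            double localArea[3], globalArea[3];
            ComputeLocalAreas(phiHost, nSites, localArea);        // CPU: PMMC analysis

            // Reduce per-process areas; only rank 0 records the result.
            MPI_Reduce(localArea, globalArea, 3, MPI_DOUBLE,
                       MPI_SUM, 0, MPI_COMM_WORLD);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            if (rank == 0)
                printf("%d %g %g %g\n", step,
                       globalArea[0], globalArea[1], globalArea[2]);
        }
    }
}
```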

B. Multiphase Lattice Boltzmann Scheme

A variety of schemes have been constructed to model multiphase flows using the lattice Boltzmann method. The 'color model', originally proposed by Gunstensen, has since been extended to describe flows in porous media [7]. In the LBM, a set of discrete velocities provides the mechanism to track the transport behavior. The momentum transport solution relies on the D3Q19 discrete velocity set:

\xi_q = \begin{cases}
(0, 0, 0)^T & \text{for } q = 0 \\
(\pm 1, 0, 0)^T & \text{for } q = 1, 2 \\
(0, \pm 1, 0)^T & \text{for } q = 3, 4 \\
(0, 0, \pm 1)^T & \text{for } q = 5, 6 \\
(\pm 1, \pm 1, 0)^T & \text{for } q = 7, 8, 9, 10 \\
(\pm 1, 0, \pm 1)^T & \text{for } q = 11, 12, 13, 14 \\
(0, \pm 1, \pm 1)^T & \text{for } q = 15, 16, 17, 18.
\end{cases}    (2)
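For reference, the velocity set of Eq. 2 can be stored as a small constant table on the GPU; the sketch below orders the directions so that each q and its opposite are adjacent, which is consistent with the odd/even bounce-back pairing of Eqs. 22 and 23 (the actual code's ordering is not specified in the text).

```cuda
// D3Q19 discrete velocities xi_q, ordered as in Eq. 2.
// Stored in constant memory so every thread can read them cheaply.
__constant__ int D3Q19[19][3] = {
    { 0, 0, 0},                                          // q = 0
    { 1, 0, 0}, {-1, 0, 0},                              // q = 1, 2
    { 0, 1, 0}, { 0,-1, 0},                              // q = 3, 4
    { 0, 0, 1}, { 0, 0,-1},                              // q = 5, 6
    { 1, 1, 0}, {-1,-1, 0}, { 1,-1, 0}, {-1, 1, 0},      // q = 7..10
    { 1, 0, 1}, {-1, 0,-1}, { 1, 0,-1}, {-1, 0, 1},      // q = 11..14
    { 0, 1, 1}, { 0,-1,-1}, { 0, 1,-1}, { 0,-1, 1}       // q = 15..18
};
// The D3Q7 subset used for mass transport corresponds to q = 0..6.
```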

A mass transport solution can be performed using the simpler D3Q7 velocity set, which corresponds to q = 0, 1, 2, . . . , 6 in Eq. 2. Two sets of distributions are instantiated locally from the density values ρ_w and ρ_n and the flow velocity u. The mass transport distributions are constructed according to:

g_q^w = w_q \left[ \rho_w \left(1 + \xi_q \cdot u\right) + \beta\,\frac{\rho_w \rho_n}{\rho_w + \rho_n} \left(n \cdot \xi_q\right) \right]    (3)

g_q^n = w_q \left[ \rho_n \left(1 + \xi_q \cdot u\right) - \beta\,\frac{\rho_w \rho_n}{\rho_w + \rho_n} \left(n \cdot \xi_q\right) \right],    (4)


where q = 0, 1, . . . , 6. The vector n is the unit normal to the interface, determined from the phase indicator field:

\phi = \frac{\rho_w - \rho_n}{\rho_w + \rho_n}.    (5)

The gradient of the phase indicator field is computed according to

C(x_i, t) = \nabla\phi = \sum_{q=1}^{Q-1} \phi(x_i + \xi_q, t)\,\xi_q,    (6)

and the unit normal vector is:

n = \frac{\nabla\phi}{|\nabla\phi|}.    (7)

For the D3Q7 model, the weights are w_0 = 1/3 and w_q = 1/9 for q = 1, 2, . . . , 6. Eqs. 3 and 4 enforce immiscibility of the two phases while satisfying local mass and momentum conservation. Once the distributions have been computed locally, they are "pushed" to neighboring lattice sites to update the densities used in the next timestep:

\rho_k(x_i, t + \delta t) = \sum_{q=0}^{6} g_q^k(x_i - \xi_q \delta t, t),    (8)

where k = w, n. This approach is distinct from other LBMs because the distributions are not stored in memory. This has implications for parallel implementation because the ρ_k must be updated atomically for multi-threaded computation of Eq. 8.

To solve for the momentum transport, a set of distributions f_q are constructed to track the fluid properties associated with each discrete velocity ξ_q appearing in Eq. 2. The distributions evolve according to the lattice Boltzmann equation (LBE):

f_q(x_i + \xi_q \delta t, t + \delta t) - f_q(x_i, t) = J_q(x_i, t),    (9)

for q = 0, 1, . . . , 18. The distributions can be used to compute the fluid pressure:

p = \frac{1}{3} \sum_{q=0}^{18} f_q,    (10)

and the fluid momentum:

j = \rho u = \sum_{q=0}^{18} f_q \xi_q.    (11)

The term J_q accounts for changes to f_q that result from inter-molecular collisions and interaction forces. The form of J_q for our multiphase implementation matches what is presented by McClure et al., where additional details are available [15]. Most typically, the solution of Eq. 9 is performed in two steps, known as the collision step:

f_q^{*}(x_i, t) = f_q(x_i, t) + J_q(x_i, t),    (12)

and the streaming step:

f_q(x_i + \xi_q \delta t, t + \delta t) = f_q^{*}(x_i, t).    (13)

The collision step is the more computationally intensive aspect of the LBM, whereas the streaming step relies entirely on memory bandwidth. For flow in porous media it is important to rely on multi-relaxation-time (MRT) approaches, which are both more accurate and more computationally intensive than the single-relaxation-time approaches that are frequently encountered in performance studies. In the MRT approach, the collision operator is constructed from linear combinations of the distributions [18, 19]:

f_m = \sum_{q=0}^{Q-1} M_{m,q} f_q,    (14)

and the collision term is defined as:

J_q = \sum_{m=0}^{Q-1} M_{q,m}^{*} \lambda_m \left( f_m^{eq} - f_m \right).    (15)

The transformation matrix M_{m,q} and its inverse M*_{q,m} are defined for the D3Q19 velocity structure by d'Humières and Ginzburg [20]. Each moment relaxes toward its equilibrium value f_m^{eq} at a unique rate specified by λ_m. In the formulation proposed by Ahrenholz et al., the non-zero equilibrium moments incorporate additional terms to accommodate an anisotropic contribution to the stress tensor due to the surface tension [7]:

f_1^{eq} = \left( j_x^2 + j_y^2 + j_z^2 \right) + \sigma |C|    (16)

f_9^{eq} = \left( 2 j_x^2 - j_y^2 - j_z^2 \right) + \frac{\sigma |C|}{2} \left( 2 n_x^2 - n_y^2 - n_z^2 \right)    (17)

f_{11}^{eq} = \left( j_y^2 - j_z^2 \right) + \frac{\sigma |C|}{2} \left( n_y^2 - n_z^2 \right)    (18)

f_{13}^{eq} = j_x j_y + \frac{\sigma |C|}{2}\, n_x n_y    (19)

f_{14}^{eq} = j_y j_z + \frac{\sigma |C|}{2}\, n_y n_z    (20)

f_{15}^{eq} = j_x j_z + \frac{\sigma |C|}{2}\, n_x n_z    (21)

where the parameter σ is linearly related to the interfacial tension and the relaxation parameters λ_m relate to the fluid kinematic viscosity.

For flow in porous media, the distributions can be ignored at all lattice sites that satisfy x_i ∈ Ω_s, where Ω_s is the part of the domain occupied by the solid. A "bounce-back" rule, the most commonly used rule for setting solid boundary conditions, is applied to impose a no-slip boundary condition for the fluid phases at the solid wall:

f_q(x_i, t + \delta t) = f_{q+1}(x_i, t) \quad \text{if } q \text{ is odd},    (22)


f_q(x_i, t + \delta t) = f_{q-1}(x_i, t) \quad \text{if } q \text{ is even}.    (23)

An analogous expression is applied for the mass transport distributions g_q^w and g_q^n.
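A minimal sketch of how the bounce-back pairing of Eqs. 22-23 might look in a CUDA kernel is shown below; the flat array layout (one contiguous block of lattice values per direction q) and the solid-mask convention are assumptions for illustration, not the paper's actual data layout.

```cuda
// Bounce-back at solid sites: swap each distribution with its opposite
// direction. Directions are ordered so that q (odd) and q+1 are opposites,
// matching Eqs. 22-23. 'solid' marks lattice sites belonging to Omega_s.
__global__ void BounceBack(double *f, const char *solid, int nSites)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nSites || !solid[i]) return;              // only act at marked sites

    for (int q = 1; q < 19; q += 2) {                  // q odd, q+1 is its opposite
        double tmp              = f[q * nSites + i];   // f_q
        f[q * nSites + i]       = f[(q + 1) * nSites + i];
        f[(q + 1) * nSites + i] = tmp;
    }
}
```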

C. GPU Implementation of the LBM

Implementation of the LBM on a GPU is a considerably different task from a CPU implementation. Each GPU contains a large number of relatively simple cores that are capable of carrying out tasks in parallel. The CPU and GPU have their own dedicated memory spaces, and initialized variables are copied to the GPU where the main computations are performed. At each interval T, data is copied back to the CPU so that analysis can be performed. CUDA kernels are based on a SIMD model in which many threads execute identical instructions simultaneously. Domain decomposition on a GPU is structured so as to take advantage of the multi-threaded framework. The number of threads and the number of threadblocks can be varied to maximize performance. Kernels provide instructions to the GPU that account for all computations required to complete each time step of the LBM. A set of registers is created to store the nineteen distributions required to perform computations at each lattice site. These values reside in fast register memory once they have been read from the main arrays. Computations are expressed in terms of these registers in order to maximize performance.

The memory system of the GPU functions most efficiently when the values read by each threadblock are aligned within contiguous blocks of memory. When data is properly aligned, the GPU is able to coalesce memory accesses within each threadblock into a smaller number of memory transactions. In order to use memory bandwidth most efficiently, each half-warp of sixteen threads should access contiguous memory blocks of size 32, 64, or 128 bytes. As a consequence, the layout of the distributions f_0, f_1, . . . , f_18 must accommodate these memory accesses. A full description of how the GPU solution of Eqs. 9-21 is performed is available from McClure et al. [15]. In this work we also implement the solution of Eqs. 3-8 on the GPU. This calculation maps to the GPU in a straightforward way, although it differs from the solution of Eq. 9 in that atomic operations are required to evaluate Eq. 8. A double-precision atomicAdd was therefore implemented as suggested in the CUDA programming guide [21].
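For reference, the double-precision atomic add suggested in the CUDA programming guide (for devices without native double-precision atomicAdd) is built on 64-bit atomicCAS:

```cuda
// Double-precision atomic add built on 64-bit compare-and-swap, following
// the pattern given in the CUDA C Programming Guide.
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Reinterpret the bits, add, and attempt to swap the result in place.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);   // retry if another thread updated the value
    return __longlong_as_double(old);
}
```

In the mass transport update of Eq. 8, each thread would use such an atomic add to accumulate its contribution g_q^k into the density ρ_k of the destination lattice site.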

D. Distributed Memory Implementation of the LBM Using GPUs

In order to obtain a distributed memory solution for the multiphase flow problem, cubic sub-domains Ω^(p) are associated with each MPI process p = 0, 1, . . . , P − 1. The sub-domains are of equal size with N × N × N lattice sites. The position of the solid phase is initialized

[Figure 4 diagram: pack send buffer at process p → send data from process p to process r → unpack receive buffer at process r.]

Figure 4. Graphical depiction of the MPI communication pattern along a boundary for the LBM. Communicated distributions are packed into contiguous send buffers on the GPU, ignoring lattice sites at any x_i ∈ Ω_s. The data is then sent to the neighboring process and unpacked to the appropriate location on the GPU.

using information provided from a sphere packing. The list of spheres is read by the MPI process with p = 0, then broadcast to all other processes. The sub-region of the sphere packing assigned to process p is determined from a Cartesian process grid, p = k P_x P_y + j P_x + i, where (i, j, k) gives the location of the sub-domain associated with process p. The full domain therefore corresponds to a Cartesian grid with P = P_x × P_y × P_z sub-domains, where P_x, P_y, and P_z are the number of sub-domains in each direction. A solution for the multiphase flow problem is obtained by concurrently solving the LBM within each sub-domain and using MPI to exchange information at the sub-domain boundaries. Each MPI process requires enough memory to store the sub-domain and a halo of width one that is used to facilitate communication.
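A small sketch of this rank arithmetic, including the periodic wrap-around implied by the fully periodic boundary conditions described below, is given here; the function names are illustrative rather than taken from the code.

```cuda
// Map sub-domain coordinates (i, j, k) on a Px x Py x Pz Cartesian process
// grid to an MPI rank, wrapping periodically in each direction.
static int RankFromCoords(int i, int j, int k, int Px, int Py, int Pz)
{
    i = (i + Px) % Px;          // periodic wrap in x
    j = (j + Py) % Py;          // periodic wrap in y
    k = (k + Pz) % Pz;          // periodic wrap in z
    return k * Px * Py + j * Px + i;
}

// Example: the six face neighbors of the process at (i, j, k).
static void FaceNeighbors(int i, int j, int k, int Px, int Py, int Pz, int nbr[6])
{
    nbr[0] = RankFromCoords(i + 1, j, k, Px, Py, Pz);
    nbr[1] = RankFromCoords(i - 1, j, k, Px, Py, Pz);
    nbr[2] = RankFromCoords(i, j + 1, k, Px, Py, Pz);
    nbr[3] = RankFromCoords(i, j - 1, k, Px, Py, Pz);
    nbr[4] = RankFromCoords(i, j, k + 1, Px, Py, Pz);
    nbr[5] = RankFromCoords(i, j, k - 1, Px, Py, Pz);
}
```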

Based on the straightforward domain decomposition strategy employed in our approach, each sub-domain communicates only with the immediately adjacent sub-domains. Communication is required to execute the streaming step for the D3Q19 momentum transport solution given in Eq. 13, for the D3Q7 mass transport solution given in Eq. 8, and to fill in the values of φ within the halo so that the color gradient can be computed in Eq. 6. To complete the streaming step at lattice site x_i, processor r must provide to processor p all distributions f_q^α(x_i) that satisfy x_i − ξ_q ∈ Ω^(r), x_i − ξ_q ∉ Ω_s. The symmetry of the discrete velocity set ensures that each value received by processor p from processor r will mirror a value sent by p to r. Each process p must therefore communicate with 18 other processes to solve its part of the problem. These include the processes that share one of the six faces, (i±1, j, k), (i, j±1, k), (i, j, k±1), and the processes that share one of the twelve edges, (i±1, j±1, k), (i±1, j, k±1), (i, j±1, k±1). Full periodic boundary conditions are imposed such that


Figure 5. Closeup view of a multiphase interfacial configuration in a sub-region of a sphere packing. The wn interface Ω_wn (red) and the ns interface Ω_ns (pink) are also shown, constructed using the PMMC algorithm. The solid spheres are also depicted.

all neighboring processes are determined.

Wang et al. described in detail the steps required to implement GPU-based communications for a similar D3Q7/D3Q19 multiphase flow formulation [11]. A significant difference in our implementation is that the presence of an immobile solid phase substantially reduces the number of communications that must be performed for flows in porous media. The region of space occupied by the solid phase Ω_s typically represents 60-70% of the total volume. Since communications never need to be performed for any lattice sites x_i ∈ Ω_s, it is possible to significantly reduce the size of the messages passed. This communication procedure is summarized in Fig. 4. The requisite communications are identified by first constructing a list of the lattice sites x_i on each boundary for which x_i ∉ Ω_s. This list is then communicated to the adjacent receiving process, and appropriately sized send and receive buffers are allocated by each process within GPU memory. Send and receive buffers are allocated for each pair of communicating processes. Pack and unpack kernels are then constructed so that communicated values are placed contiguously within the send and receive buffers.
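A sketch of a pack kernel in this spirit is shown below; it gathers distributions for a precomputed list of non-solid boundary sites into a contiguous send buffer. The flat data layout and variable names are illustrative assumptions.

```cuda
// Gather distributions for the boundary sites in 'siteList' (precomputed to
// exclude sites in Omega_s) into a contiguous send buffer on the GPU.
// 'qList' holds, for each entry, which discrete velocity must be sent.
__global__ void PackSendBuffer(double *sendBuf, const double *f,
                               const int *siteList, const int *qList,
                               int nSend, int nSites)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m < nSend)
        sendBuf[m] = f[qList[m] * nSites + siteList[m]];  // contiguous writes to sendBuf
}
```

The filled buffer would then be exchanged with the neighboring process (for example with MPI_Isend/MPI_Irecv) and scattered back into place by a mirror unpack kernel.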

E. Porous Media Marching Cubes Algorithm (PMMC)

In the LBM, the interfacial regions tend to be diffuse due to the way interactions are defined between the fluid components. In order to evaluate the various interfacial areas, the interfaces must first be extracted based on the values of the phase indicator field. The marching cubes algorithm provides a straightforward way to construct interfaces as iso-contours of a three-dimensional function

defined on a lattice. In this algorithm, lists of triangles are constructed to represent the corresponding surfaces using linear interpolation [22]. The surfaces obtained from the marching cubes algorithm are widely used in visualization; in our case the triangles are used as a framework for numerical methods.

We use a previously developed variant of the marching cubes algorithm that is designed specifically for the case of two-phase flow in porous media. The objective is to explicitly construct a numerical representation for all three interfaces in a two-fluid-phase system: Ω_wn, Ω_ws, and Ω_ns. A typical multiphase interfacial configuration, including the Ω_wn and Ω_ns interfaces, is shown in Fig. 5 along with the solid phase spheres. To construct these surfaces, the PMMC algorithm requires: (1) the distance to the solid surface evaluated at each lattice site and (2) the values of the phase indicator field. The method is described in detail by McClure et al. [17]. The basic steps are outlined as follows:

1) construct the solid surface using the marching cubes algorithm;

2) using linear interpolation, evaluate the phase indicator field at the vertices of each triangle on the solid surface;

3) again using linear interpolation, identify points on the contact curve where all three phases coexist;

4) trim the solid surface at the contact line to form the list of triangles that approximate the interfaces Ω_ws and Ω_ns;

5) using vertices obtained from the marching cubes algorithm and vertices on the common curve, construct a list of triangles that approximate the Ω_wn interface within each cube.

In order to visualize the surface, all of the triangles must be retained after they have been computed. However, visualization is not an objective of the present study; we simply wish to evaluate the component interfacial areas. As a result, triangle lists can be constructed locally within each cube and then discarded once the local contribution to a_wn, a_ws, and a_ns has been determined. Since the full list of triangles is not retained, the memory required by this approach is minimal. More importantly, it avoids the possibility that dynamic memory allocations will be necessary. The total number of triangles can change significantly as flow processes evolve, both globally and within an individual process. However, since the maximum number of possible triangles within a cube can be determined easily, this approach minimizes the memory footprint for the CPU while guaranteeing safe execution.
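The per-cube area accumulation described above can be as simple as the following CPU-side sketch; the Triangle type and function names are illustrative, not the PMMC code's actual data structures.

```cuda
// CPU-side helper: accumulate the area of a locally constructed triangle list
// and then discard it. (Triangle is an illustrative type.)
#include <cmath>

struct Vec3 { double x, y, z; };
struct Triangle { Vec3 a, b, c; };

static double TriangleArea(const Triangle &t)
{
    // Area = 0.5 * |(b - a) x (c - a)|
    Vec3 u = { t.b.x - t.a.x, t.b.y - t.a.y, t.b.z - t.a.z };
    Vec3 v = { t.c.x - t.a.x, t.c.y - t.a.y, t.c.z - t.a.z };
    Vec3 w = { u.y * v.z - u.z * v.y,
               u.z * v.x - u.x * v.z,
               u.x * v.y - u.y * v.x };
    return 0.5 * std::sqrt(w.x * w.x + w.y * w.y + w.z * w.z);
}

static double AccumulateArea(const Triangle *tris, int nTris)
{
    double area = 0.0;
    for (int i = 0; i < nTris; ++i)
        area += TriangleArea(tris[i]);   // local contribution to a_wn, a_ws, or a_ns
    return area;                          // triangle list can be discarded afterwards
}
```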

As the PMMC algorithm is embarrassingly parallel, the steps required to integrate this analysis into a parallel LB simulator are straightforward. The distance to


the solid surface is evaluated at the beginning of the simulation and stored in CPU memory. Since the solid is immobile, these values do not change and do not need to be transferred to the GPU at any point during the simulation. Information about the phase position is provided by the phase indicator field φ, which evolves on the GPU based on the multiphase flow dynamics. As a result, this data needs to be transferred from the GPU to the CPU each time the PMMC algorithm is applied.

For more general interfacial averages, such as surface-averaged curvatures, velocities, and orientations, local construction of the triangle lists remains the most judicious approach. In the general case, a larger amount of data must be transferred from the GPU to the CPU, including the pressure and velocity fields. Both the number of averaged quantities and the complexity of the averages computed impact the computational cost associated with the CPU-based analysis. Since the amount of computation depends on the amount of interfacial area present, it is difficult to establish a concrete estimate for the associated computational cost. However, our objective is to enable a simulation for which T is sufficiently small to resolve the changes in the macroscopic behavior.

IV. Results and Discussion

A. Parallel Performance of the LBM

The parallel performance of the multiphase LBM simulator was evaluated using Titan. Two sphere packings were generated to serve as the domain for these calculations: a packing of 11,402 spheres with lognormal mean μ = −2.23 and variance σ = 0.1, and a packing of 85,539 spheres with lognormal mean μ = −2.24 and variance σ = 0.1. The volume fraction occupied by Ω_s was 0.631 for each packing. Strong scaling results were obtained using these random close packings discretized to obtain lattice sizes of 960^3 and 1920^3, respectively. Details for these simulations, including the parallel efficiency, are provided in Table I, and a scaling plot is shown in Fig. 6. The ideal scaling line is based on a simulation performed in a smaller packing consisting of 1,896 spheres discretized to 540^3 and simulated using a 3 × 3 × 3 grid of processes. A performance of 76 MLUPS was obtained for each node in this simulation. Based on the initial points in our strong scaling plots, the weak scaling for this code is very nearly ideal.
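For clarity, since the figure of merit is not spelled out in the text, MLUPS here is simply the total number of lattice-site updates completed per second divided by 10^6; for P sub-domains of N^3 sites advanced n_t time steps in wall time t_wall (in seconds):

```latex
\mathrm{MLUPS} = \frac{P \, N^3 \, n_t}{10^6 \, t_{wall}}
```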

B. Evaluation of Non-Equilibrium Interfacial Areas

In this section, we demonstrate the practical results that are obtained by coupling the GPU-based implementation of the multiphase LBM with the CPU-based multi-scale analysis tools described in the previous sections. We consider the simulation of a multiphase flow process in a packing of 11,402 spheres,

Table I
Strong Scaling for the Multiphase LBM

# Spheres    Process Grid     N^3      MLUPS      Efficiency
11,402       5 × 5 × 5        192^3    9,407      0.98
11,402       8 × 8 × 8        160^3    15,130     0.91
11,402       8 × 8 × 8        128^3    31,275     0.80
11,402       10 × 10 × 10     120^3    50,907     0.67
85,539       10 × 10 × 10     192^3    74,338     0.97
85,539       15 × 15 × 15     128^3    199,692    0.77
85,539       16 × 16 × 16     120^3    244,754    0.78

[Figure 6 plot: MLUPS (×10^5) versus number of GPUs (0-4000) for lattice sizes 540^3, 960^3, 1920^3, and 3040^3, with an ideal-scaling line.]

Figure 6. Scaling results obtained for several domain sizes on Titan, including strong scaling for both a 960^3 lattice and a 1920^3 lattice. Ideal scaling is shown based on a 540^3 simulation performed on 27 GPUs. Results are reported in million-lattice-node updates per second (MLUPS).

discretized to obtain a lattice of 1200^3 and distributed across P = 1000 MPI processes on Titan. The PMMC approach was applied to extract multi-scale averages (in this case the interfacial areas) at an interval of T = 1000 time steps. The simulation was initialized to a non-equilibrium multiphase configuration corresponding to a non-wetting phase saturation of 0.4. The system was then allowed to relax toward equilibrium for a wall time of two hours. The resulting plots for the interfacial areas are shown in Fig. 7 (a)-(c). The change in surface energy is shown for each interval T in Fig. 7 (d), which provides a measure of the proximity to the ultimate equilibrium state.

The non-equilibrium interfacial area results obtained using this approach are quite well-resolved. The relaxation process is plotted for a_wn, a_ws, and a_ns in Fig. 7 (a), (b), and (c). Each value was computed 163 times over the course of 163,000 time steps. The total size of the output from this simulation was 20 kB. In order to produce the same results using the workflow shown in Fig. 2, the required size of the output files can be computed in a


[Figure 7 plots: panels (a)-(c) show the interfacial areas a_wn, a_ws, and a_ns versus time step (0 to 16 × 10^4); panel (d) shows −ΔE_s versus time step on a logarithmic scale.]

Figure 7. Interfacial area and surface energy measurements conducted during a multiphase flow simulation within a packing of 11,402 spheres on a lattice of size 1200^3. Area measurements were calculated every 10^3 time steps and the duration of the simulation was 163,000 time steps.

straightforward manner:

163 \times P N^3 \times 8\,\mathrm{B} = 2.25\,\mathrm{TB}.    (24)

The cost of transferring this data alone represents a significant bottleneck that constrains the temporal resolution T at which multi-scale information can be accessed. Not only does this represent the first opportunity to study these dynamics in detail, but the result is obtained for a system that is an order of magnitude larger than what has been considered in previous studies. Also note that this calculation is for a two-hour simulation in which a single multiphase flow configuration was considered. In order to perform a complete study it is desirable to consider hundreds of configurations corresponding to a wide range of fluid saturations. Without an integrated strategy to extract multi-scale information from the simulation as it proceeds, many of the opportunities to advance the state of scientific understanding for this problem would be inaccessible.

The capability to extract multi-scale information from simulations of this nature has immediate implications and will enable a diverse set of computational studies that were previously impossible. Averaged interfacial properties and their time derivatives can be computed accurately and efficiently. As a consequence, a wide range of constitutive relationships that have been proposed based on theoretical work can now be studied and evaluated within a computational setting. Ongoing work will expand the range of multi-scale information that can be explored within this approach and pursue the closure of new models for multiphase flow in porous media.

V. Conclusions

A scalable approach was constructed to simulate multiphase flow in porous medium systems. The approach provides a viable means to extract interfacial areas during the execution of multiphase flow simulations, and it scales across thousands of GPU-accelerated nodes. The CPU cores are used to extract interfacial areas in parallel during the course of the simulation, an approach which enables high-resolution studies of the non-equilibrium dynamics of multiphase flow in porous media. The practical reduction in I/O, data transfer, and storage requirements is approximately nine orders of magnitude. The peak performance for the LBM reported in this paper was 244,754 MLUPS on 4,096 GPU-equipped nodes of the Titan supercomputer.

Acknowledgment

This work was supported in part by National Science Foundation grant 0941235, Department of Energy grant DE-SC0002163x, and the GPU-accelerated HokieSpeed supercomputer acquired via NSF CNS-0960081 from the NSF MRI-R2 program, and it used resources of the Oak Ridge


Leadership Computing Facility at Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725.

References

[1] W. G. Gray and C. T. Miller, "Thermodynamically constrained averaging theory approach for modeling flow and transport phenomena in porous medium systems: 8. Interface and common curve dynamics," Advances in Water Resources, vol. 33, no. 12, pp. 1427-1443, Dec. 2010.

[2] ——, "TCAT analysis of capillary pressure in non-equilibrium, two-fluid-phase, porous medium systems," Advances in Water Resources, vol. 34, no. 6, pp. 770-778, 2011.

[3] V. Joekar-Niasar and S. M. Hassanizadeh, "Specific interfacial area: The missing state variable in two-phase flow equations?" Water Resources Research, vol. 47, 2011.

[4] C. Pan, J. F. Prins, and C. T. Miller, "A high-performance lattice Boltzmann implementation to model flow in porous media," Computer Physics Communications, vol. 158, no. 2, pp. 89-105, 2004.

[5] J. Wang, X. Zhang, A. Bengough, and J. Crawford, "Domain-decomposition method for parallel lattice Boltzmann simulation of incompressible flow in porous media," Physical Review E, vol. 72, no. 1, Part 2, Jul. 2005.

[6] M. G. Schaap, M. L. Porter, B. S. B. Christensen, and D. Wildenschild, "Comparison of pressure-saturation characteristics derived from computed tomography and lattice Boltzmann simulations," Water Resources Research, vol. 43, 2007.

[7] B. Ahrenholz, J. Toelke, P. Lehmann, A. Peters, A. Kaestner, M. Krafczyk, and W. Durner, "Prediction of capillary hysteresis in a porous material using lattice-Boltzmann methods and comparison to experimental data and a morphological pore network model," Advances in Water Resources, vol. 31, no. 9, pp. 1151-1173, Sep. 2008.

[8] M. Bernaschi, M. Bisson, T. Endo, S. Matsuoka, M. Fatica, and S. Melchionna, "Petaflop biofluidics simulations on a two million-core system," in Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '11. New York, NY, USA: ACM, 2011, pp. 4:1-4:12. [Online]. Available: http://doi.acm.org/10.1145/2063384.2063389

[9] J. Myre, S. D. C. Walsh, D. Lilja, and M. O. Saar, "Performance analysis of single-phase, multiphase, and multicomponent lattice-Boltzmann fluid flow simulations on GPU clusters," Concurrency and Computation: Practice and Experience, vol. 23, no. 4, pp. 332-350, Mar. 2011.

[10] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux, "Scalable lattice Boltzmann solvers for CUDA GPU clusters," Parallel Computing, vol. 39, no. 6-7, pp. 259-270, Jun.-Jul. 2013.

[11] H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda, "GPU-aware MPI on RDMA-enabled clusters: Design, implementation and evaluation," IEEE Transactions on Parallel and Distributed Systems, vol. 99, no. PrePrints, p. 1, 2013.

[12] A. L. Dye, J. E. McClure, C. T. Miller, and W. G. Gray, "Description of non-Darcy flows in porous medium systems," Physical Review E, vol. 87, no. 3, Mar. 2013.

[13] M. L. Porter, M. G. Schaap, and D. Wildenschild, "Comparison of interfacial area estimates for multiphase flow through porous media using computed microtomography and lattice-Boltzmann simulations," Water Resources Research, vol. to be submitted, 2009.

[14] A. W. Adamson, Physical Chemistry of Surfaces. New York: John Wiley & Sons, 1990.

[15] J. E. McClure, J. F. Prins, and C. T. Miller, "A novel heterogeneous algorithm to simulate multiphase flow in porous media on multicore CPU-GPU systems," Computer Physics Communications, under review, 2013.

[16] TOP500 Supercomputing Sites, "TOP500 Supercomputing List," http://www.top500.org/.

[17] J. E. McClure, D. Adalsteinsson, C. Pan, W. G. Gray, and C. T. Miller, "Approximation of interfacial properties in multiphase porous medium systems," Advances in Water Resources, vol. 30, no. 3, pp. 354-365, 2007.

[18] C. Pan, L.-S. Luo, and C. T. Miller, "An evaluation of lattice Boltzmann schemes for porous medium flow simulation," Computers & Fluids, vol. 35, no. 8-9, pp. 898-909, 2006.

[19] P. Lallemand and L. S. Luo, "Theory of the lattice Boltzmann method: Dispersion, dissipation, isotropy, Galilean invariance, and stability," Physical Review E, vol. 61, no. 6, pp. 6546-6562, 2000.

[20] D. d'Humières, I. Ginzburg, M. Krafczyk, P. Lallemand, and L. S. Luo, "Multiple-relaxation-time lattice Boltzmann models in three dimensions," Philosophical Transactions of the Royal Society of London Series A: Mathematical, Physical and Engineering Sciences, vol. 360, pp. 437-451, 2002.

[21] NVIDIA CUDA Programming Guide, Version 5.0, NVIDIA, 2012.

[22] W. E. Lorensen and H. E. Cline, "Marching cubes: A high resolution 3D surface construction algorithm," Computer Graphics, vol. 21, no. 4, pp. 163-169, 1987.
