TRANSCRIPT
Parallelizing finite element PDE solvers in an object-oriented framework
Xing Cai
Department of Informatics
University of Oslo
Outline of the Talk
• Introduction & background
• 3 parallelization approaches
• Implementational aspects
• Numerical experiments
The Scientific Software Group
(Faculty, post docs, Ph.D. students and part-time members)
Knut Andreas Lie (SINTEF), Kent Andre Mardal, Åsmund Ødegård, Bjørn Fredrik Nielsen (NR), Joakim Sundnes, Wen Chen, Xing Cai, Øyvind Hjelle (SINTEF), Ola Skavhaug, Aicha Bounaim, Hans Petter Langtangen, Are Magnus Bruaset (NO), Linda Ingebrigtsen, Glenn Terje Lines, Aslak Tveito, Tom Thorvaldsen
Department of Informatics, University of Oslo, http://www.ifi.uio.no/~tpv
Projects
• Simulation of electrical activity in the human heart
• Simulation of the diastolic left ventricle
• Numerical methods for option pricing
• Software for numerical solution of PDEs
• Scientific computing using a Linux cluster
• Finite element modelling of ultrasound wave propagation
• Multi-physics models by domain decomposition methods
• Scripting techniques for scientific computing
• Numerical modelling of reactive fluid flow in porous media
http://www.ifi.uio.no/~tpv
Diffpack
• O-O software environment for scientific computation (C++)
• Rich collection of PDE solution components: portable, flexible, extensible
• http://www.nobjects.com
• H. P. Langtangen, Computational Partial Differential Equations, Springer 1999
The Diffpack Philosophy
[Diagram: application areas (structural mechanics, porous media flow, aerodynamics, incompressible flow, water waves, stochastic PDEs, heat transfer, other PDE applications) built on top of the core Diffpack abstractions: Field, Grid, Matrix, Vector, I/O, Ax=b, FEM, FDM]
The Question
Starting point: a sequential PDE solver. How do we parallelize it?
The resulting parallel solver should have good parallel efficiency and good overall numerical performance.
We need a good parallelization strategy, and a good and simple implementation of that strategy.
A generic finite element PDE solver (sketched in code below)
• Time stepping t0, t1, t2, ...
• Spatial discretization (computational grid)
• Solution of nonlinear problems
• Solution of linearized problems
• Iterative solution of Ax=b
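As a concrete picture of these steps, here is a minimal control-flow sketch; all class and method names are illustrative, not the Diffpack API:

class FEMSolver {
public:
  void timeLoop (double t0, double dt, int nsteps) {
    double t = t0;
    for (int step = 0; step < nsteps; ++step) {  // time stepping t0, t1, t2, ...
      t += dt;
      solveNonlinearProblem ();
    }
  }
private:
  void solveNonlinearProblem () {
    // Newton-type iteration; each pass solves one linearized problem
    double residual = 1.0;
    while (residual > 1.0e-8) {
      assembleLinearSystem ();  // FE discretization on the grid -> A, b
      solveLinearSystem ();     // iterative solution of Ax=b (the expensive part)
      residual = applyCorrection ();
    }
  }
  void assembleLinearSystem () {}           // stubs, for illustration only
  void solveLinearSystem () {}
  double applyCorrection () { return 0.0; }
};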
An observation
• The computation-intensive part is the iterative solution of Ax=b
• A parallel finite element PDE solver needs to run the linear algebra operations in parallel (see the inner-product sketch below):
– vector addition
– inner product of two vectors
– matrix-vector product
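For example, the global inner product of two distributed vectors is just a local inner product followed by one collective sum. A minimal MPI sketch (the function is illustrative, not Diffpack code):

#include <mpi.h>
#include <vector>

// Global inner product of two distributed vectors; each process owns
// one chunk of x and y.  With overlapping subgrids the shared entries
// must additionally be weighted so they are not counted twice.
double parallelInnerProduct (const std::vector<double>& x,
                             const std::vector<double>& y)
{
  double local = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i)
    local += x[i]*y[i];
  double global = 0.0;
  MPI_Allreduce (&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  return global;
}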
Several parallelization options
• Automatic compiler parallelization
• Loop-level parallelization (special compilation directives)
• Domain decomposition:
– divide-and-conquer
– fully distributed computing
– flexible
– high parallel efficiency
A natural parallelization of PDE solvers
• The global solution domain is partitioned into many smaller subdomains
• Each subdomain works as a "unit", with its own sub-matrices and sub-vectors
• No need to create global matrices and vectors physically
• The global linear algebra operations can be realized by local operations + inter-processor communication
Grid partition
Linear-algebra level parallelization
• An SPMD model
• Reuse of existing code for local linear algebra operations
• New code is needed only for the parallelization-specific tasks (see the partitioning sketch below):
– grid partition (non-overlapping or overlapping)
– inter-processor communication routines
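A partitioner such as METIS can handle the first task (the example input file later in the talk also selects METIS). A minimal sketch against the METIS 5 C API, assuming the grid's connectivity graph is already stored in CSR arrays xadj/adjncy:

#include <metis.h>
#include <vector>

// Split a connectivity graph into nparts subdomains; on return,
// part[i] holds the subdomain number of node i.
std::vector<idx_t> partitionGrid (std::vector<idx_t>& xadj,
                                  std::vector<idx_t>& adjncy,
                                  idx_t nparts)
{
  idx_t nvtxs = static_cast<idx_t>(xadj.size()) - 1;
  idx_t ncon = 1;                  // one balancing constraint
  idx_t objval = 0;                // resulting edge-cut (unused here)
  std::vector<idx_t> part (nvtxs);
  METIS_PartGraphKway (&nvtxs, &ncon, xadj.data(), adjncy.data(),
                       nullptr, nullptr, nullptr,  // no vertex/edge weights
                       &nparts, nullptr, nullptr, nullptr,
                       &objval, part.data());
  return part;
}

An overlapping partition can then be obtained by growing each subdomain with layers of neighbouring elements.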
Object orientation
• An add-on "toolbox" containing all the parallelization-specific code
• The "toolbox" has many high-level routines
• The existing sequential libraries are slightly modified to include a "dummy" interface, thus incorporating "fake" inter-processor communications (sketched below)
• A seamless coupling between the huge sequential libraries and the add-on toolbox
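One way to realize the "dummy" interface (an illustrative sketch, not the actual Diffpack design): the sequential library always calls a virtual communication hook that defaults to a no-op, and the toolbox subclass supplies the real communication.

#include <vector>

class CommAdm {                      // part of the sequential library
public:
  virtual ~CommAdm () {}
  // "fake" communication: the sequential version does nothing
  virtual void updateOverlap (std::vector<double>&) {}
};

class MPICommAdm : public CommAdm {  // part of the add-on toolbox
public:
  void updateOverlap (std::vector<double>&) {
    // here the toolbox would exchange the overlapping entries of the
    // vector with the neighbouring processes (e.g. MPI_Sendrecv calls
    // following the prepared communication pattern)
  }
};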
Straightforward Parallelization
• Develop a sequential simulator, without paying attention to parallelism
• Follow the Diffpack coding standards
• Use the add-on toolbox for parallel computing
• Add a few new statements to transform it into a parallel simulator
A Simple Coding Example
GridPartAdm* adm;   // access to parallelization functionality
LinEqAdm* lineq;    // administrator for linear system & solver
// ...
#ifdef PARALLEL_CODE
  adm->scan (menu);
  adm->prepareSubgrids ();
  adm->prepareCommunication ();
  lineq->attachCommAdm (*adm);
#endif
// ...
lineq->solve ();
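When PARALLEL_CODE is not defined, the same source builds the original sequential simulator; the four statements inside the #ifdef block are the only parallelization-specific additions, with attachCommAdm handing the prepared communication pattern to the linear solver.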
set subdomain list = DEFAULT
set global grid = grid1.file
set partition-algorithm = METIS
set number of overlaps = 0
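These run-time menu settings are what adm->scan(menu) reads: the global grid file, METIS as the partitioning algorithm, and the width of the overlap between subgrids (0 requests a non-overlapping partition).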
Solving an elliptic PDE
$\nabla\cdot(K(x)\nabla u) = f(x)$
• Highly unstructured grid
• Discontinuity in the coefficient K
Measurements
P # iter Time Speedup
1 480 420.09 N/A
3 660 200.17 2.10
4 691 156.36 2.69
6 522 83.87 5.01
8 541 60.30 6.97
12 586 38.23 10.99
16 564 28.32 14.83
• 130,561 degrees of freedom
• Overlapping subgrids
• Global BiCGStab using (block) ILU preconditioning
Parallel Vortex-Shedding Simulation
Incompressible Navier-Stokes, solved by a pressure correction method
Simulation Snapshots
[Snapshots of the pressure field]
Some CPU Measurements
P CPU Speedup Efficiency
1 1418.67 N/A N/A
2 709.79 2.00 1.00
3 503.50 2.82 0.94
4 373.54 3.80 0.95
6 268.38 5.29 0.88
8 216.73 6.55 0.82
The pressure equation is solved by the CG method with "subdomain-wise" MILU preconditioning.
Animated Pressure Field
Domain Decomposition
• Solution of the original large problem through iteratively solving many smaller subproblems
• Can be used as a solution method or as a preconditioner
• Flexibility: localized treatment of irregular geometries, singularities, etc.
• Very efficient numerical methods, even on sequential computers
• Suitable for coarse-grained parallelization
Overlapping DD
Example: solving the Poisson problem on the unit square (one Schwarz sweep is written out below)
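The slide shows the idea as a figure; in standard notation, one sweep of overlapping additive Schwarz for the global system $Au = b$, with subdomains $\Omega_1,\dots,\Omega_M$, reads

$$u^{k+1} = u^k + \sum_{i=1}^{M} R_i^T A_i^{-1} R_i \,(b - A u^k),$$

where $R_i$ restricts a global vector to $\Omega_i$ and $A_i = R_i A R_i^T$ is the subdomain matrix, so each application of $A_i^{-1}$ is precisely a subdomain instance of the original problem.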
Observations
• DD is a good parallelization strategy
• The approach is not PDE-specific
• A program for the original global problem can be reused (modulo boundary conditions) for each subdomain
• Overlapping point values must be communicated
• No need for global data
• Data distribution is implied
• Explicit temporal schemes are a special case where no iteration is needed ("exact DD")
Goals for the Implementation
• Reuse the sequential solver as the subdomain solver
• Add DD management and communication as separate modules
• Collect common operations in generic library modules
• Flexibility and portability
• A simplified parallelization process for the end-user
Generic Programming Framework
Making the Simulator Parallel
class SimulatorP : public SubdomainFEMSolver,
                   public Simulator
{
  // ... just a small amount of code
  virtual void createLocalMatrix ()
    { Simulator::makeSystem (); }
};
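The design point is multiple inheritance: SimulatorP inherits all the numerics from the existing Simulator and the subdomain-solver interface from SubdomainFEMSolver, so the only new code is the small hook that routes the framework's createLocalMatrix call to the simulator's own makeSystem.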
[Class diagram: an Administrator drives a generic SubdomainSimulator; SubdomainFEMSolver specializes SubdomainSimulator; SimulatorP derives from both SubdomainFEMSolver and the existing sequential Simulator]
Application
• Poisson equation on the unit square
• DD as the global solution method
• Subdomain solvers use CG + FFT
• Fixed number of subdomains M = 32 (independent of P)
• Straightforward parallelization of an existing simulator
P Sim. Time Speedup Efficiency
1 53.08 N/A N/A
2 27.23 1.95 0.97
4 14.12 3.76 0.94
8 7.01 7.57 0.95
16 3.26 16.28 1.02
32 1.63 32.56 1.02
P: number of processors
A large scale problem
P CPU Speedup MaxNi MinNi
1 283.93 N/A 2,082,625 2,082,625
2 150.56 1.89 1,053,505 1,047,265
4 69.23 4.10 532,417 529,705
8 31.37 9.05 273,722 269,865
16 16.88 16.82 141,418 139,793
32 10.19 27.86 75,617 70,834
Solving an elliptic boundary value problem on an unstructured grid (Ni: number of unknowns in subdomain i; the overlap makes the subdomain sums exceed the global total)
Combined Approach
• Use a CG-like method as the basic solver (i.e. use a parallelized Diffpack linear solver)
• Use DD as the preconditioner (i.e. SimulatorP is invoked as a preconditioning solve)
• Combine with coarse grid correction
• A CG-like method with DD preconditioning is normally faster than DD as a basic solver
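Written out (the slides do not give the formula, so this is the standard form), the DD preconditioner with coarse grid correction is the additive Schwarz operator

$$B^{-1} = R_0^T A_0^{-1} R_0 + \sum_{i=1}^{M} R_i^T A_i^{-1} R_i,$$

where $A_0$ is the coarse-grid operator; the CG-like method then runs on $B^{-1}Ax = B^{-1}b$, and each application of $A_i^{-1}$ is one subdomain solve, i.e. one invocation of SimulatorP.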
Elasticity
Test case: 2D linear elasticity, 241 x 241 global grid.
Vector equation for $u = (u_1, u_2)$ (Navier's equation of linear elasticity):
$\mu\nabla^2 u + (\mu+\lambda)\nabla(\nabla\cdot u) = f$
Straightforward parallelization based on an existing Diffpack simulator
2D Linear Elasticity
• BiCGStab + DD preconditioning as the global solver
• Multigrid V-cycle in subdomain solves
• I: number of global BiCGStab iterations needed
• P: number of processors (P = #subdomains)
P CPU Speedup I Subgrid
1 66.01 N/A 19 241 x 241
2 24.64 2.68 12 129 x 241
4 14.97 4.41 14 129 x 129
8 5.96 11.08 11 69 x 129
16 3.58 18.44 13 69 x 69
2D Linear Elasticity
Two-Phase Porous Media Flow
P Total CPU Subgrid CPU PEQ I CPU SEQ
1 4053.33 241x241 3586.98 3.10 440.58
2 2497.43 129 x 241 2241.78 3.48 241.08
4 1244.29 129 x 129 1101.58 2.97 134.28
8 804.47 129 x 69 725.58 3.93 72.76
16 490.47 69 x 69 447.27 4.13 39.64
PEQ (pressure equation): $-\nabla\cdot(\lambda(s)K\nabla p) = q$ in $\Omega$, $t\in(0,T]$
SEQ (saturation equation): $\partial s/\partial t + \mathbf{v}\cdot\nabla f(s) = 0$ in $\Omega$, $t\in(0,T]$, with velocity $\mathbf{v} = -\lambda(s)K\nabla p$
BiCGStab + DD preconditioning for the global pressure equation (PEQ); multigrid V-cycle in subdomain solves. I: global BiCGStab iterations (average per time step).
Two-Phase Porous Media Flow
History of water saturation propagation
Nonlinear Water Waves
• Fully nonlinear 3D water waves
• Primary unknowns: the velocity potential $\phi$ and the free-surface elevation $\eta$
$\nabla^2\phi = 0$ in the water volume
$\eta_t + \phi_x\eta_x + \phi_y\eta_y - \phi_z = 0$ on the water surface
$\phi_t + \tfrac{1}{2}(\phi_x^2 + \phi_y^2 + \phi_z^2) + gz = 0$ on the water surface
$\partial\phi/\partial n = 0$ on solid walls
Nonlinear Water Waves
• CG + DD preconditioning for the global solver
• Multigrid V-cycle as subdomain solver
• Fixed number of subdomains M = 16 (independent of P)
• Subgrids from partition of a global 41x41x41 grid
P Execution time Speedup Efficiency
1 1404.44 N/A N/A
2 715.32 1.96 0.98
4 372.79 3.77 0.94
8 183.99 7.63 0.95
16 90.89 15.45 0.97
Parallel Simulation of 3D Acoustic Field
• A Linux cluster: 48 Pentium-III 500 MHz processors, 100 Mbit interconnect
• An SGI Cray Origin 2000: MIPS R10000 processors
• Linear-algebra level (LAL) parallelization; 2 cases:
– Linear model (linear wave equation), solved with an explicit method
– Nonlinear model, solved with an implicit method
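The two cases stress the parallelization differently: the explicit scheme needs only parallel matrix-vector work per time step (local operations plus exchange of boundary values), while the implicit scheme requires a full parallel linear solve at every time step.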
Mathematical Nonlinear Model
[Equations not fully recoverable from the transcript: a second-order nonlinear wave equation for the acoustic pressure p(x,t), with sound speed c0, nonlinearity parameter B/A and a damping term, driven by a transducer source p(r,t) and homogeneous initial conditions p(x,0) = p_t(x,0) = 0.]
Results - Linear Model

CPUs   Origin 2000            Linux Cluster
       CPU-time   Speedup     CPU-time   Speedup
1      944.83     N/A         640.7      N/A
2      549.21     1.72        327.8      1.95
4      282.75     3.34        174.0      3.68
8      155.01     6.10        90.98      7.04
16     80.41      11.8        46.35      13.8
24     65.63      14.4        34.05      18.8
32     49.97      18.9        26.27      24.4
48     35.23      26.8        17.74      36.1
Results - Nonlinear Model

CPUs   Origin 2000            Linux Cluster
       CPU-time   Speedup     CPU-time   Speedup
2      8670.8     N/A         6681.5     N/A
4      4726.5     3.75        3545.9     3.77
8      2404.2     7.21        1881.1     7.10
16     1325.6     13.0        953.89     14.0
24     1043.7     16.6        681.77     19.6
32     725.23     23.9        563.54     23.7
Summary
• Goal: provide software and programming rules for easy parallelization of sequential simulators
• Applicable to a wide range of PDE problems
• Three parallelization approaches:
– parallelization at the linear algebra level: "automatic" parallelization
– domain decomposition: very flexible, compact visible code/algorithm
– combined approach
• Performance: satisfactory speed-up
http://www.ifi.uio.no/~tpv