A Software Framework for Easy Parallelization of PDE Solvers

Hans Petter Langtangen
Xing Cai
Dept. of Informatics, University of Oslo
Outline of the Talk
• Background
• Parallelization techniques
  – based on domain decomposition
  – at the linear algebra level
• Implementational aspects
• Numerical experiments
The Question
Starting point: sequential code. How to do the parallelization?

The resulting parallel solvers should have
• good parallel efficiency
• good overall numerical performance

We need
• a good parallelization strategy
• a good and simple implementation of the strategy
Problem Domain
• Partial differential equations
• Finite elements/differences
• Communication through message passing
Domain Decomposition
• Solution of the original large problem through iteratively solving many smaller subproblems
• Can be used as a solution method or as a preconditioner
• Flexibility -- localized treatment of irregular geometries, singularities, etc.
• Very efficient numerical methods -- even on sequential computers
• Suitable for coarse-grained parallelization
Overlapping DD
Alternating Schwarz method for two subdomains.

Example: solving the elliptic boundary value problem
$$A u = f \ \text{in } \Omega = \Omega_1 \cup \Omega_2, \qquad u = g \ \text{on } \partial\Omega$$

A sequence of approximations $u^0, u^1, \ldots, u^n, \ldots$, where

$$A u_1^n = f \ \text{in } \Omega_1, \qquad u_1^n = g \ \text{on } \partial\Omega_1 \backslash \Gamma_1, \qquad u_1^n = u_2^{n-1}\big|_{\Gamma_1} \ \text{on } \Gamma_1,$$

$$A u_2^n = f \ \text{in } \Omega_2, \qquad u_2^n = g \ \text{on } \partial\Omega_2 \backslash \Gamma_2, \qquad u_2^n = u_1^{n}\big|_{\Gamma_2} \ \text{on } \Gamma_2.$$
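To make the iteration concrete, here is a minimal self-contained sketch for a 1D model problem: -u'' = f on (0,1) with u(0) = u(1) = 0, two overlapping subdomains, each solved exactly by a tridiagonal (Thomas) solve. This is illustrative code only, not the framework's; the grid size, the overlap and all names are assumptions.

    // Alternating Schwarz sketch for -u'' = f on (0,1), u(0) = u(1) = 0.
    // Two overlapping subdomains, each solved exactly per sweep.
    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Solve -u'' = f at the interior points lo+1..hi-1 of the global grid,
    // using u[lo] and u[hi] from the current iterate as Dirichlet data.
    static void subdomainSolve(std::vector<double>& u, const std::vector<double>& f,
                               int lo, int hi, double h) {
        const int m = hi - lo - 1;                 // number of interior unknowns
        std::vector<double> d(m, 2.0), r(m);
        for (int i = 0; i < m; ++i) r[i] = h * h * f[lo + 1 + i];
        r[0]     += u[lo];                         // boundary data enters the rhs
        r[m - 1] += u[hi];
        for (int i = 1; i < m; ++i) {              // Thomas forward elimination
            d[i] -= 1.0 / d[i - 1];
            r[i] += r[i - 1] / d[i - 1];
        }
        u[hi - 1] = r[m - 1] / d[m - 1];           // back substitution
        for (int i = m - 2; i >= 0; --i)
            u[lo + 1 + i] = (r[i] + u[lo + 2 + i]) / d[i];
    }

    int main() {
        const int n = 41;                          // global grid x_0..x_40
        const double h = 1.0 / (n - 1), pi = 3.14159265358979323846;
        std::vector<double> u(n, 0.0), f(n), exact(n);
        for (int i = 0; i < n; ++i) {              // f chosen so u = sin(pi x)
            f[i] = pi * pi * std::sin(pi * i * h);
            exact[i] = std::sin(pi * i * h);
        }
        // Overlapping subdomains: points 0..25 and 15..40 (overlap 15..25).
        for (int it = 1; it <= 10; ++it) {
            subdomainSolve(u, f, 0, 25, h);        // uses u[25] from last sweep
            subdomainSolve(u, f, 15, n - 1, h);    // uses the freshly updated u[15]
            double err = 0.0;
            for (int i = 0; i < n; ++i)
                err = std::max(err, std::fabs(u[i] - exact[i]));
            std::printf("iteration %2d: max error %.3e\n", it, err);
        }
        return 0;
    }

With a healthy overlap (here 10 of 40 intervals) the maximum error drops to the discretization-error level within a few sweeps.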
Convergence of the Solution
Single-phase groundwater flow
Mesh Partition Example
Coarse Grid Correction
• This DD algorithm is a kind of block Jacobi iteration (CBJ)
• Problem: often (very) slow convergence
• Remedy: coarse grid correction (sketched below)
• A kind of two-grid multigrid algorithm
• Coarse grid solve on each processor
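As a sketch of how the coarse grid enters, in standard domain decomposition notation (assumed here, not taken from the slides): with restriction operators $R_i$, subdomain matrices $A_i = R_i A R_i^T$ and a coarse-grid operator $A_0$, the two-level additive preconditioner reads

$$B = R_0^T A_0^{-1} R_0 + \sum_{i=1}^{M} R_i^T A_i^{-1} R_i ,$$

where the $M$ subdomain solves run in parallel and the cheap coarse solve $A_0^{-1}$ propagates information globally, which is what typically restores mesh-independent convergence rates.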
Observations
• DD is a good parallelization strategy
• The approach is not PDE-specific
• A program for the original global problem can be reused (modulo B.C.) for each subdomain
• Must communicate overlapping point values
• No need for global data
• Data distribution implied
• Explicit temporal schemes are a special case where no iteration is needed ("exact DD")
A Known Problem
"The hope among early domain decomposition workers was that one could write a simple controlling program which would call the old PDE software directly to perform the subdomain solves. This turned out to be unrealistic because most PDE packages are too rigid and inflexible."
-- Smith, Bjørstad and Gropp

One remedy: use of object-oriented programming techniques.
Goals for the Implementation
• Reuse the sequential solver as the subdomain solver
• Add DD management and communication as separate modules
• Collect common operations in generic library modules
• Flexibility and portability
• Simplified parallelization process for the end-user
Generic Programming Framework
The Subdomain Simulator
Subdomain simulator = sequential solver + communication add-on
The Communicator
• Need functionality for exchanging point values inside the overlapping regions
• The communicator works with a hidden communication model (see the sketch below)
• MPI in use, but easy to change
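The slides do not show the communicator's internals; a minimal MPI sketch of the kind of overlap exchange it hides could look as follows (the 1D neighbour layout, the ghost-point pattern and all names are illustrative assumptions, not the framework's code):

    // Exchange one layer of overlapping point values between neighbour ranks.
    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int nLocal = 10;                    // interior points per subdomain
        std::vector<double> u(nLocal + 2, rank);  // u[0], u[nLocal+1] are ghosts

        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        // Send my leftmost interior value left while receiving my right ghost,
        // then the mirror image; MPI_PROC_NULL makes the physical-boundary
        // cases no-ops, so no special-casing is needed.
        MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  0,
                     &u[nLocal + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[nLocal],     1, MPI_DOUBLE, right, 1,
                     &u[0],          1, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        std::printf("rank %d: ghosts %.0f %.0f\n", rank, u[0], u[nLocal + 1]);
        MPI_Finalize();
        return 0;
    }

Keeping such calls behind one interface is what makes the message-passing layer "easy to change".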
Realization
• Object-oriented programming (C++, Java, Python)
• Use inheritance, polymorphism, dynamic binding
  – Simplifies modularization
  – Supports reuse of the sequential solver (without touching its source code!)
Making the Simulator Parallel
class SimulatorP : public SubdomainFEMSolver,
                   public Simulator
{
  // ... just a small amount of code
  virtual void createLocalMatrix ()
    { Simulator::makeSystem (); }
};
(Class diagram: SimulatorP derives from both SubdomainFEMSolver, itself a SubdomainSimulator driven by an Administrator, and from the sequential Simulator.)
Performance
• Algorithmic efficiency
  – efficiency of the original sequential simulator(s)
  – efficiency of the domain decomposition method
• Parallel efficiency
  – communication overhead (low)
  – coarse grid correction overhead (normally low)
  – load balancing
    · subproblem size
    · work on subdomain solves
Application
• Single-phase groundwater flow
• DD as the global solution method
• Subdomain solvers use CG + FFT
• Fixed number of subdomains M = 32 (independent of P)
• Straightforward parallelization of an existing simulator

P   Sim. Time   Speedup   Efficiency
1 53.08 N/A N/A
2 27.23 1.95 0.97
4 14.12 3.76 0.94
8 7.01 7.57 0.95
16 3.26 16.28 1.02
32 1.63 32.56 1.02
P: number of processors
Diffpack
• O-O software environment for scientific computation
• Rich collection of PDE solution components -- portable, flexible, extensible
• www.diffpack.com
• H. P. Langtangen: Computational Partial Differential Equations, Springer, 1999
Straightforward Parallelization
• Develop a sequential simulator, without paying attention to parallelism
• Follow the Diffpack coding standards
• Need the Diffpack add-on libraries for parallel computing
• Add a few new statements for the transformation to a parallel simulator
Linear-Algebra-Level Approach
• Parallelize matrix/vector operations
  – inner product of two vectors (see the sketch after this list)
  – matrix-vector product
  – preconditioning -- block contributions from subgrids
• Easy to use
  – access to all Diffpack v3.0 CG-like methods, preconditioners and convergence monitors
  – "hidden" parallelization
  – need only to add a few lines of new code
  – arbitrary choice of the number of processors at run-time
  – less flexibility than DD
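As an illustration of the linear-algebra-level idea, here is a hedged sketch of a distributed inner product: each process dots its local piece and an MPI_Allreduce sums the results. It assumes each vector entry is owned by exactly one process; with overlapping subgrids the shared points would need weighting.

    // Sketch: global inner product of a distributed vector (illustrative only).
    #include <mpi.h>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Each process holds its own local piece of x and y.
    static double parallelInner(const std::vector<double>& x,
                                const std::vector<double>& y) {
        double local = 0.0, global = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i) local += x[i] * y[i];
        // Sum the local contributions; every rank gets the global value.
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return global;
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        std::vector<double> x(100, 1.0), y(100, 2.0);   // local pieces
        double s = parallelInner(x, y);                 // same value on all ranks
        if (rank == 0) std::printf("inner product = %g\n", s);
        MPI_Finalize();
        return 0;
    }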
A Simple Coding Example
GridPartAdm* adm;   // access to parallelization functionality
LinEqAdm*    lineq; // administrator for linear system & solver
// ...
#ifdef PARALLEL_CODE
  adm->scan (menu);
  adm->prepareSubgrids ();
  adm->prepareCommunication ();
  lineq->attachCommAdm (*adm);
#endif
// ...
lineq->solve ();
set subdomain list = DEFAULT
set global grid = grid1.file
set partition-algorithm = METIS
set number of overlaps = 0
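The four `set` lines are presumably entries from a Diffpack run-time input (menu) file, read by `adm->scan (menu)` above: they select the subdomain list, the global grid file, the METIS partitioning algorithm, and the amount of overlap.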
Single-Phase Groundwater Flow
$$\nabla\cdot\big(K(x)\nabla u\big) = f(x)$$
• Highly unstructured grid
• Discontinuity in the coefficient K (0.1 & 1)
Measurements
P # iter Time Speedup
1 480 420.09 N/A
3 660 200.17 2.10
4 691 156.36 2.69
6 522 83.87 5.01
8 541 60.30 6.97
12 586 38.23 10.99
16 564 28.32 14.83
• 130,561 degrees of freedom
• Overlapping subgrids
• Global BiCGStab using (block) ILU prec.
A Finite Element Navier-Stokes Solver
• Operator splitting in the tradition of pressure correction, velocity correction, Helmholtz decomposition
• This version is due to Ren & Utnes, 1993
The Algorithm
• Calculation of an intermediate velocity in a predictor-corrector way:
$$k_i^{(1)} = \Delta t\,\big(\nu\, u_{i,jj}^n - u_j^n u_{i,j}^n\big), \qquad \hat{u}_i = u_i^n + k_i^{(1)},$$

$$k_i^{(2)} = \Delta t\,\big(\nu\, \hat{u}_{i,jj} - \hat{u}_j \hat{u}_{i,j}\big), \qquad u_i^* = u_i^n + \tfrac12\big(k_i^{(1)} + k_i^{(2)}\big).$$
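The $k^{(1)}$/$k^{(2)}$ structure above is a Heun-type (explicit predictor-corrector) step; in generic form, for $\dot{u} = F(u)$ it reads (my summary, not from the slides):

$$k^{(1)} = \Delta t\, F(u^n), \qquad \hat{u} = u^n + k^{(1)}, \qquad k^{(2)} = \Delta t\, F(\hat{u}), \qquad u^* = u^n + \tfrac12\big(k^{(1)} + k^{(2)}\big).$$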
The Algorithm (cont.)
• Solution of a Poisson equation:

$$p_{,jj}^{n+1} = \frac{1}{\Delta t}\, u_{j,j}^*$$

• Correction of the intermediate velocity:

$$u_i^{n+1} = u_i^* - \Delta t\,\big(p_{,i}^{n+1} + b_i\big)$$
Test Case: Vortex-Shedding
Simulation Snapshots
Pressure
Animated Pressure Field
Simulation Snapshots
Velocity
Animated Velocity Field
Some CPU Measurements
P CPU Speedup Efficiency
1 1418.67 N/A N/A
2 709.79 2.00 1.00
3 503.50 2.82 0.94
4 373.54 3.80 0.95
6 268.38 5.29 0.88
8 216.73 6.55 0.82
The pressure equation is solved by the CG method with "subdomain-wise" MILU prec.
Combined Approach
• Use a CG-like method as the basic solver (i.e. use a parallelized Diffpack linear solver)
• Use DD as preconditioner (i.e. SimulatorP is invoked as a preconditioning solve)
• Combine with coarse grid correction
• A CG-like method + DD prec. is normally faster than DD as a basic solver
Two-Phase Porous Media Flow
Simulation result obtained on 16 processors
Two-Phase Porous Media Flow
History of saturation for water and oil
Two-Phase Porous Media Flow
P    Total CPU   Subgrid     CPU PEQ   I      CPU SEQ
1    4053.33     241 x 241   3586.98   3.10   440.58
2    2497.43     129 x 241   2241.78   3.48   241.08
4    1244.29     129 x 129   1101.58   2.97   134.28
8    804.47      129 x 69    725.58    3.93   72.76
16   490.47      69 x 69     447.27    4.13   39.64
PEQ (pressure equation):

$$-\nabla\cdot\big(\lambda(s)\nabla p\big) = q \quad \text{in } \Omega \times (0,T]$$

SEQ (saturation equation):

$$s_t + v \cdot \nabla f(s) = 0 \quad \text{in } \Omega \times (0,T], \qquad v = -\lambda(s)\nabla p$$
BiCGStab + DD prec. for the global pressure equation; multigrid V-cycle in the subdomain solves.
Nonlinear Water Waves
Nonlinear Water Waves
• Fully nonlinear 3D water waves
• Primary unknowns: the velocity potential $\phi$ and the surface elevation $\eta$
• Parallelization based on an existing sequential Diffpack simulator

$$\nabla^2 \phi = 0 \quad \text{in the water volume}$$

$$\eta_t + \phi_x \eta_x + \phi_y \eta_y - \phi_z = 0 \quad \text{on the water surface}$$

$$\phi_t + \tfrac12\big(\phi_x^2 + \phi_y^2 + \phi_z^2\big) + g\eta = 0 \quad \text{on the water surface}$$

$$\frac{\partial \phi}{\partial n} = 0 \quad \text{on solid walls}$$
Nonlinear Water Waves
• CG + DD prec. as the global solver
• Multigrid V-cycle as the subdomain solver
• Fixed number of subdomains M = 16 (independent of P)
• Subgrids from a partition of a global 41 x 41 x 41 grid
P Execution time Speedup Efficiency
1 1404.44 N/A N/A
2 715.32 1.96 0.98
4 372.79 3.77 0.94
8 183.99 7.63 0.95
16 90.89 15.45 0.97
Elasticity
Test case: 2D linear elasticity, 241 x 241 global grid.
Vector equation for $u = (u_1, u_2)$:

$$\mu \nabla^2 u + (\lambda + \mu)\, \nabla(\nabla\cdot u) = f$$
Straightforward parallelization based on an existing Diffpack simulator.
2D Linear Elasticity
• BiCGStab + DD prec. as the global solver
• Multigrid V-cycle in the subdomain solves
• I: number of global BiCGStab iterations needed
• P: number of processors (P = #subdomains)
P CPU Speedup I Subgrid
1 66.01 N/A 19 241 x 241
2 24.64 2.68 12 129 x 241
4 14.97 4.41 14 129 x 129
8 5.96 11.08 11 69 x 129
16 3.58 18.44 13 69 x 69
2D Linear Elasticity
(figure)
Summary
• Goal: provide software and programming rules for easy parallelization of sequential simulators
• Applicable to a wide range of PDE problems
• Two parallelization strategies:
  – domain decomposition: very flexible, compact visible code/algorithm
  – parallelization at the linear algebra level: "automatic", hidden parallelization
• Performance: satisfactory speed-up
Future Application
DD with different PDEs and local solvers:
– Out in the deep sea: Eulerian formulation, finite differences, Boussinesq PDEs, F77 code
– Near shore: Lagrangian formulation, finite elements, shallow water PDEs, C++ code