
Introduction to Parallel Finite Element Method using GeoFEM/HPC-MW

Kengo Nakajima, Dept. Earth & Planetary Science, The University of Tokyo

VECPAR'06 Tutorial: An Introduction to Robust and High Performance Software Libraries for Solving Common Problems in Computational Sciences

July 13th, 2006, Rio de Janeiro, Brazil.

Overview

– Introduction: Finite Element Method, Iterative Solvers
– Parallel FEM Procedures in GeoFEM/HPC-MW: Local Data Structure, Partitioning
– Parallel Iterative Solvers in GeoFEM/HPC-MW: Performance of Iterative Solvers
– Parallel Visualization in GeoFEM/HPC-MW
– Example of Parallel Code using HPC-MW


Finite-Element Method (FEM)

One of the most popular numerical methods for solving PDEs, based on elements (meshes) and nodes (vertices).

Consider the following 2D heat transfer problem:
– 16 nodes, 9 bi-linear elements
– uniform thermal conductivity (λ=1)
– uniform volume heat flux (Q=1)
– T=0 at node 1
– insulated boundaries

[Figure: 4×4 grid of 16 nodes forming 9 bi-linear elements; node 1 carries the condition T=0.]

Governing equation (with λ=1):

\[
\frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} + Q = 0
\]

Galerkin FEM procedures

Apply Galerkin procedures to each element:

In each element the temperature is interpolated as

\[
T = [N]\{\phi\}
\]

where \(\{\phi\}\) is the vector of nodal temperatures (T at each vertex) and \([N]\) is the matrix of shape (interpolation) functions. The Galerkin procedure weights the residual of the PDE with the shape functions:

\[
\int_V [N]^T \left( \frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} + Q \right) dV = 0
\]

Introduce the following "weak form" of the original PDE using Green's theorem:

\[
-\int_V \left( \frac{\partial [N]^T}{\partial x}\frac{\partial [N]}{\partial x}
             + \frac{\partial [N]^T}{\partial y}\frac{\partial [N]}{\partial y} \right)\{\phi\}\, dV
+ \int_V [N]^T Q \, dV = 0
\]

in each element.
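As a brief reminder of where the weak form comes from (standard Galerkin reasoning, added here for completeness): integrating the second-derivative terms by parts via Green's theorem gives

\[
\int_V [N]^T \frac{\partial^2 T}{\partial x^2}\, dV
= \oint_S [N]^T \frac{\partial T}{\partial x}\, n_x \, dS
- \int_V \frac{\partial [N]^T}{\partial x}\frac{\partial T}{\partial x}\, dV ,
\]

and similarly for the y-term. The boundary integral vanishes on the insulated boundaries (zero normal flux), and substituting \(T=[N]\{\phi\}\) yields the weak form above.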


Element Matrix

Apply the integration to each element and form “element” matrix.

[Figure: a single element e with its four local nodes A, B, C, D.]

For each element, the weak form

\[
-\int_V \left( \frac{\partial [N]^T}{\partial x}\frac{\partial [N]}{\partial x}
             + \frac{\partial [N]^T}{\partial y}\frac{\partial [N]}{\partial y} \right)\{\phi\}\, dV
+ \int_V [N]^T Q \, dV = 0
\]

yields the element equation \([k^{(e)}]\{\phi^{(e)}\} = \{f^{(e)}\}\):

\[
\begin{bmatrix}
k^{(e)}_{AA} & k^{(e)}_{AB} & k^{(e)}_{AC} & k^{(e)}_{AD} \\
k^{(e)}_{BA} & k^{(e)}_{BB} & k^{(e)}_{BC} & k^{(e)}_{BD} \\
k^{(e)}_{CA} & k^{(e)}_{CB} & k^{(e)}_{CC} & k^{(e)}_{CD} \\
k^{(e)}_{DA} & k^{(e)}_{DB} & k^{(e)}_{DC} & k^{(e)}_{DD}
\end{bmatrix}
\begin{Bmatrix}
\phi^{(e)}_{A} \\ \phi^{(e)}_{B} \\ \phi^{(e)}_{C} \\ \phi^{(e)}_{D}
\end{Bmatrix}
=
\begin{Bmatrix}
f^{(e)}_{A} \\ f^{(e)}_{B} \\ f^{(e)}_{C} \\ f^{(e)}_{D}
\end{Bmatrix}
\]
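Written out, the entries follow directly from the weak form (a standard Galerkin result, stated here for reference):

\[
k^{(e)}_{ij} = \int_{V^{(e)}} \left( \frac{\partial N_i}{\partial x}\frac{\partial N_j}{\partial x}
             + \frac{\partial N_i}{\partial y}\frac{\partial N_j}{\partial y} \right) dV ,
\qquad
f^{(e)}_{i} = \int_{V^{(e)}} N_i\, Q \, dV ,
\qquad i, j \in \{A, B, C, D\}.
\]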

Global (Overall) Matrix

Accumulate each element matrix into the "global" matrix.

[Figure: the 16-node mesh and the resulting 16×16 global matrix equation \([K]\{\Phi\} = \{F\}\); D marks diagonal entries, X marks the non-zero off-diagonal entries contributed by the elements surrounding each node.]
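To make the accumulation concrete, here is a minimal sketch (illustrative only, not GeoFEM/HPC-MW code) that scatter-adds precomputed 4×4 element matrices into a dense 16×16 global matrix for the toy mesh above. The element arrays EK and EF are left as placeholders for the element integration, and row-by-row node numbering is assumed.

program assemble_global
  implicit none
  integer, parameter :: NE= 9, NNODE= 16
  integer :: NODE(4,NE)                 ! global node numbers of each element
  real(8) :: EK(4,4,NE), EF(4,NE)       ! element matrices/vectors
  real(8) :: K(NNODE,NNODE), F(NNODE)   ! global matrix/vector (dense, toy size)
  integer :: ie, ex, ey, i, j, ig, jg

  ! connectivity of the 3x3-element / 4x4-node mesh (row-by-row numbering assumed)
  do ie= 1, NE
    ex= mod(ie-1,3) + 1
    ey= (ie-1)/3 + 1
    NODE(1,ie)= (ey-1)*4 + ex
    NODE(2,ie)= NODE(1,ie) + 1
    NODE(3,ie)= NODE(1,ie) + 5
    NODE(4,ie)= NODE(1,ie) + 4
  enddo

  EK= 0.d0 ;  EF= 0.d0                  ! placeholders: filled by the element integration

  ! accumulate ("scatter-add") every element contribution into the global system
  K= 0.d0 ;  F= 0.d0
  do ie= 1, NE
    do i= 1, 4
      ig= NODE(i,ie)
      F(ig)= F(ig) + EF(i,ie)
      do j= 1, 4
        jg= NODE(j,ie)
        K(ig,jg)= K(ig,jg) + EK(i,j,ie)
      enddo
    enddo
  enddo
end program assemble_global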

To each node, the effects of the surrounding elements/nodes are accumulated.

[Figure: the same 16×16 matrix equation \([K]\{\Phi\} = \{F\}\); the row of each node collects contributions from all elements that share that node.]

Solve the obtained global/overall equations under the prescribed boundary conditions (T=0 at node 1 in this case).

[Figure: the same 16×16 system \([K]\{\Phi\} = \{F\}\), now to be solved with the boundary condition imposed.]
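The deck does not show how the condition is imposed on the assembled system; one common way, given here only as an illustrative sketch, is to overwrite the row and column of the constrained node:

subroutine impose_dirichlet_node1 (K, F, N)
  implicit none
  integer, intent(in)    :: N
  real(8), intent(inout) :: K(N,N), F(N)
  integer :: j
  do j= 1, N
    K(1,j)= 0.d0        ! clear row 1
    K(j,1)= 0.d0        ! clear column 1 (prescribed value is zero: no RHS correction)
  enddo
  K(1,1)= 1.d0
  F(1)  = 0.d0          ! prescribed temperature at node 1
end subroutine impose_dirichlet_node1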


Result …


Features of FEM Applications

Typical procedures for FEM computations:
– Input/Output
– Matrix assembling
– Linear solvers for large-scale sparse matrices

Most of the computation time is spent on matrix assembling/formation and on solving the linear equations.

– HUGE "indirect" accesses: memory intensive
– Local "element-by-element" operations: sparse coefficient matrices, suitable for parallel computing
– Excellent modularity of each procedure

– Introduction: Finite Element Method, Iterative Solvers
– Parallel FEM Procedures in GeoFEM/HPC-MW: Local Data Structure, Partitioning
– Parallel Iterative Solvers in GeoFEM/HPC-MW: Performance of Iterative Solvers
– Parallel Visualization in GeoFEM/HPC-MW
– Example of Parallel Code using HPC-MW

Goal of GeoFEM/HPC-MW as an Environment for Development of Parallel FEM Applications

NO MPI calls in user's code! As serial as possible!

An original FEM code developed for a single-CPU machine can work on parallel computers with minimal modification.

Careful design of the local data structure for distributed parallel computing is very important.

What is Parallel Computing?

To solve larger problems faster:
– finer meshes provide more accurate solutions
– very fine meshes are required for simulations of heterogeneous fields

[Figure: homogeneous vs. heterogeneous porous media; Lawrence Livermore National Laboratory.]

What is Parallel Computing? (cont.)

– A PC with 1 GB memory: 1M (10^6) meshes are the limit for FEM.
– Southwest Japan with (1000 km)^3 in 1 km mesh -> 10^9 meshes.

Large data -> domain decomposition -> local operations, with inter-domain communication for global operations.

[Figure: a large-scale data set is partitioned into local data sets; the local data sets are coupled through communication.]

What is Communication?

Parallel computing -> local operations.

Communication is required in global operations for consistency.

Parallel Computing in GeoFEM/HPC-MW
Algorithms: Parallel Iterative Solvers & Local Data Structure

Parallel iterative solvers (Fortran90 + MPI)
– Iterative methods are the only realistic choice for large-scale problems with parallel processing.
– Portability is important -> from PC clusters to the Earth Simulator.

Appropriate local data structure for (FEM + parallel iterative method)
– FEM is based on local operations.

Parallel Computing in FEM
SPMD: Single-Program Multiple-Data

[Figure: each MPI process holds one local data set and runs the same FEM code and linear solver; the processes are connected through MPI.]

Large-scale data -> partitioned into distributed local data sets.

The FEM code on each PE assembles the coefficient matrix for its own local data set: this part is completely local, the same as serial operations.

Global operations and communication happen only in the linear solvers: dot products, matrix-vector multiplication, preconditioning.

Parallel Computing in GeoFEM/HPC-MW

Finally, users can develop parallel FEM codes easily using GeoFEM/HPC-MW without worrying about parallel operations: the local data structure and the linear solvers take care of them. The procedures are basically the same as in serial operation. This is possible because FEM is based on local operations; FEM is really well suited to parallel computing.

NO MPI in user's code. Plug-in.

Plug-in in GeoFEM

[Figure: GeoFEM platform architecture. Utilities (partitioner: one-domain mesh -> partitioned mesh; GPPView for visualization data) surround the platform, which provides parallel I/O, equation solvers, a visualizer, and interfaces (Solver I/F, Comm. I/F, Vis. I/F) running on the PEs. Pluggable analysis modules include structural analysis (static linear, dynamic linear, contact), fluid, and wave.]

Plug-in in HPC-MW

[Figure: an FEM code developed on a PC calls HPC-MW through interfaces for visualization, solvers, matrix assembly, and I/O; the same code can then be linked against optimized HPC-MW libraries (visualization, linear solvers, matrix assembly, I/O) for the Earth Simulator, the Hitachi SR11000, or an Opteron cluster.]

Plug-in in HPC-MW (cont.)

[Figure: the same FEM code, developed on a PC against the HPC-MW interfaces (Vis., Solvers, Mat. Ass., I/O), shown plugged into the HPC-MW library tuned for the Earth Simulator.]

– Introduction: Finite Element Method, Iterative Solvers
– Parallel FEM Procedures in GeoFEM/HPC-MW: Local Data Structure, Partitioning
– Parallel Iterative Solvers in GeoFEM/HPC-MW: Performance of Iterative Solvers
– Parallel Visualization in GeoFEM/HPC-MW
– Example of Parallel Code using HPC-MW

Bi-Linear Square Elements
Values are defined on each node.

[Figure: a small mesh of bi-linear square elements, divided into two domains.]

Local information is not enough for matrix assembling. The mesh is divided into two domains in a "node-based" manner, in which the number of nodes (vertices) is balanced between domains.

Information on the overlapped elements and their connected nodes is required for matrix assembling on the boundary nodes.

Local Data of GeoFEM/HPC-MW

Node-based partitioning for IC/ILU-type preconditioning methods. Local data includes information for:
– nodes originally assigned to the partition/PE,
– elements which include those nodes: element-based operations (matrix assembly) are possible, e.g. for fluid/structure subsystems,
– all nodes which form those elements but lie outside the partition.

Nodes are classified into the following three categories from the viewpoint of message passing:
– internal nodes: nodes originally assigned to the partition,
– external nodes: nodes in the overlapped elements but outside the partition,
– boundary nodes: internal nodes which are external nodes of other partitions.

Communication tables between partitions: NO global information is required except partition-to-partition connectivity.
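As a concrete picture of what such a distributed local data set might contain, the sketch below declares the kinds of arrays the following slides refer to. The communication-table names (NEIBPETOT, NEIBPE, IMPORT_INDEX/ITEM, EXPORT_INDEX/ITEM) are those used later in this tutorial; the module name and the remaining variables are illustrative placeholders, not necessarily the actual GeoFEM/HPC-MW definitions.

! Illustrative sketch of a distributed local data set (only the communication-table
! names below appear in the slides; the rest are placeholders).
module local_mesh
  implicit none
  integer :: N                                   ! number of internal nodes
  integer :: NP                                  ! total nodes = internal + external
  integer :: ICELTOT                             ! number of local elements (incl. overlapped)
  integer, allocatable :: ICELNOD(:,:)           ! element connectivity (local node IDs)

  ! communication tables
  integer :: NEIBPETOT                           ! number of neighboring domains
  integer, allocatable :: NEIBPE(:)              ! ranks of neighboring domains
  integer, allocatable :: IMPORT_INDEX(:)        ! 1D compressed index for external nodes
  integer, allocatable :: IMPORT_ITEM(:)         ! local IDs of external nodes to receive
  integer, allocatable :: EXPORT_INDEX(:)        ! 1D compressed index for boundary nodes
  integer, allocatable :: EXPORT_ITEM(:)         ! local IDs of boundary nodes to send
end module local_mesh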

Node-based Partitioning
internal nodes – elements – external nodes

[Figure: a 25-node mesh partitioned into four domains (PE#0–PE#3); each PE stores its internal nodes, the elements that include them, and the external nodes of the overlapped elements, all with local numbering.]

Node-based Partitioning
internal nodes – elements – external nodes

[Figure: one partition of the previous example, showing the partitioned nodes themselves (internal nodes), the elements which include the internal nodes, and the external nodes included in those elements, i.e. nodes in the region overlapped with neighboring partitions.]

Information on the external nodes is required for completely local element-based operations on each processor.

[Figure: same as the previous slide.]

Information on the external nodes is required for completely local element-based operations on each processor. We do not need communication during matrix assembly!

Parallel Computing in FEM
SPMD: Single-Program Multiple-Data

[Figure: each MPI process runs the same FEM code and linear solver on its own local data set (the four partitioned domains from the previous example); the solvers are connected through MPI.]


What is Communication?

• Getting information for EXTERNAL NODES from EXTERNAL PARTITIONS.
• "Communication tables" in the local data structure include the procedures for communication.

How to "Parallelize" Iterative Solvers?
e.g. CG method (with no preconditioning)

Parallel procedures are required in:
– dot products
– matrix-vector multiplication

Compute r(0)= b-[A]x(0)
for i= 1, 2, …
   z(i-1)= r(i-1)
   ρ(i-1)= r(i-1) z(i-1)
   if i=1
     p(1)= z(0)
   else
     β(i-1)= ρ(i-1)/ρ(i-2)
     p(i)= z(i-1) + β(i-1) p(i-1)
   endif
   q(i)= [A]p(i)
   α(i)= ρ(i-1)/p(i)q(i)
   x(i)= x(i-1) + α(i)p(i)
   r(i)= r(i-1) - α(i)q(i)
   check convergence |r|
end

How to "Parallelize" Dot Products

Use MPI_ALLREDUCE after the local operations:

RHO= 0.d0
do i= 1, N
  RHO= RHO + W(i,R)*W(i,Z)
enddo

! global sum over all processes ("Allreduce RHO"); RHO0 is a work variable
call MPI_ALLREDUCE (RHO, RHO0, 1, MPI_DOUBLE_PRECISION, MPI_SUM,  &
&                   MPI_COMM_WORLD, ierr)
RHO= RHO0

How to "Parallelize" Matrix-Vector Multiplication

We need the values of the {p} vector at EXTERNAL nodes BEFORE the computation!

! get {p} at EXTERNAL nodes (communication, see the following slides)

do i= 1, N
  q(i)= D(i)*p(i)
  do k= INDEX_L(i-1)+1, INDEX_L(i)
    q(i)= q(i) + AMAT_L(k)*p(ITEM_L(k))
  enddo
  do k= INDEX_U(i-1)+1, INDEX_U(i)
    q(i)= q(i) + AMAT_U(k)*p(ITEM_U(k))
  enddo
enddo

What is Communication? (recap)

• Getting information for EXTERNAL NODES from EXTERNAL PARTITIONS.
• "Communication tables" in the local data structure include the procedures for communication.

Communication Table: SEND

– Number of neighbors: NEIBPETOT
– Neighboring domains: NEIBPE(ip), ip= 1, NEIBPETOT
– 1D compressed index for "boundary" nodes: EXPORT_INDEX(ip), ip= 0, NEIBPETOT
– Array of "boundary" nodes: EXPORT_ITEM(k), k= 1, EXPORT_INDEX(NEIBPETOT)

PE-to-PE Communication: SEND
PE#2 sends information on its "boundary nodes".

[Figure: PE#2 together with its neighbors PE#0 and PE#3, with local node numbering in each partition.]

NEIBPETOT= 2
NEIBPE(1)= 3, NEIBPE(2)= 0

EXPORT_INDEX(0)= 0
EXPORT_INDEX(1)= 2
EXPORT_INDEX(2)= 2+3 = 5

EXPORT_ITEM(1:5)= 1, 4, 4, 5, 6
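Reading this table: the entries of EXPORT_ITEM between EXPORT_INDEX(ip-1)+1 and EXPORT_INDEX(ip) are the boundary nodes sent to neighbor NEIBPE(ip). Here, local nodes 1 and 4 (entries 1–2) are sent to PE#3, and local nodes 4, 5 and 6 (entries 3–5) are sent to PE#0; node 4 is shared with both neighbors, so it appears in both lists.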

Communication Table: SEND
Send information on "boundary nodes".

[Figure: the send buffer SENDbuf is laid out neighbor by neighbor; the section for neighbor neib starts at export_index(neib-1)+1 and has length BUFlength_e.]

do neib= 1, NEIBPETOT
  do k= export_index(neib-1)+1, export_index(neib)
    kk= export_item(k)
    SENDbuf(k)= VAL(kk)            ! pack boundary-node values
  enddo
enddo

do neib= 1, NEIBPETOT
  iS_e= export_index(neib-1) + 1
  iE_e= export_index(neib  )
  BUFlength_e= iE_e + 1 - iS_e

  call MPI_ISEND                                                    &
&      (SENDbuf(iS_e), BUFlength_e, MPI_INTEGER, NEIBPE(neib), 0,   &
&       MPI_COMM_WORLD, request_send(neib), ierr)
enddo

call MPI_WAITALL (NEIBPETOT, request_send, stat_recv, ierr)

Communication Table: RECEIVE

– Number of neighbors: NEIBPETOT
– Neighboring domains: NEIBPE(ip), ip= 1, NEIBPETOT
– 1D compressed index for "external" nodes: IMPORT_INDEX(ip), ip= 0, NEIBPETOT
– Array of "external" nodes: IMPORT_ITEM(k), k= 1, IMPORT_INDEX(NEIBPETOT)

PE-to-PE Communication: RECEIVE
PE#2 receives information for its "external nodes".

[Figure: PE#2 together with its neighbors PE#0 and PE#3, with local node numbering in each partition.]

NEIBPETOT= 2
NEIBPE(1)= 3, NEIBPE(2)= 0

IMPORT_INDEX(0)= 0
IMPORT_INDEX(1)= 3
IMPORT_INDEX(2)= 3+3 = 6

IMPORT_ITEM(1:6)= 7, 8, 10, 9, 11, 12
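Reading this table: the entries of IMPORT_ITEM between IMPORT_INDEX(ip-1)+1 and IMPORT_INDEX(ip) are the external nodes whose values are received from neighbor NEIBPE(ip). Here, the values at local nodes 7, 8 and 10 come from PE#3, and those at local nodes 9, 11 and 12 come from PE#0.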

Communication Table: RECEIVE
Receive information for "external nodes".

[Figure: the receive buffer RECVbuf is laid out neighbor by neighbor; the section for neighbor neib starts at import_index(neib-1)+1 and has length BUFlength_i.]

do neib= 1, NEIBPETOT
  iS_i= import_index(neib-1) + 1
  iE_i= import_index(neib  )
  BUFlength_i= iE_i + 1 - iS_i

  call MPI_IRECV                                                    &
&      (RECVbuf(iS_i), BUFlength_i, MPI_INTEGER, NEIBPE(neib), 0,   &
&       MPI_COMM_WORLD, request_recv(neib), ierr)
enddo

call MPI_WAITALL (NEIBPETOT, request_recv, stat_recv, ierr)

do neib= 1, NEIBPETOT
  do k= import_index(neib-1)+1, import_index(neib)
    kk= import_item(k)
    VAL(kk)= RECVbuf(k)            ! unpack values for external nodes
  enddo
enddo
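Putting the two halves together, a send/receive update of the external-node values might look like the self-contained sketch below. It follows the code on the two previous slides, but the subroutine name is illustrative, not the actual GeoFEM/HPC-MW routine; as the closing remark below emphasizes, users normally just call library subroutines rather than writing this themselves. Double-precision values are assumed here instead of the MPI_INTEGER shown in the generic example.

! Illustrative sketch (not the actual GeoFEM/HPC-MW routine): update the values of
! external nodes by exchanging boundary-node values with all neighboring domains,
! using the communication tables described on the previous slides.
subroutine update_external_values (VAL, NP, NEIBPETOT, NEIBPE,        &
&                                  IMPORT_INDEX, IMPORT_ITEM,         &
&                                  EXPORT_INDEX, EXPORT_ITEM)
  use mpi
  implicit none
  integer, intent(in)    :: NP, NEIBPETOT
  integer, intent(in)    :: NEIBPE(NEIBPETOT)
  integer, intent(in)    :: IMPORT_INDEX(0:NEIBPETOT), IMPORT_ITEM(:)
  integer, intent(in)    :: EXPORT_INDEX(0:NEIBPETOT), EXPORT_ITEM(:)
  real(8), intent(inout) :: VAL(NP)

  real(8), allocatable :: SENDbuf(:), RECVbuf(:)
  integer, allocatable :: request_send(:), request_recv(:)
  integer, allocatable :: stat_send(:,:), stat_recv(:,:)
  integer :: neib, k, iS, length, ierr

  allocate (SENDbuf(EXPORT_INDEX(NEIBPETOT)), RECVbuf(IMPORT_INDEX(NEIBPETOT)))
  allocate (request_send(NEIBPETOT), request_recv(NEIBPETOT))
  allocate (stat_send(MPI_STATUS_SIZE,NEIBPETOT), stat_recv(MPI_STATUS_SIZE,NEIBPETOT))

  ! pack boundary-node values and post non-blocking sends
  do neib= 1, NEIBPETOT
    do k= EXPORT_INDEX(neib-1)+1, EXPORT_INDEX(neib)
      SENDbuf(k)= VAL(EXPORT_ITEM(k))
    enddo
    iS= EXPORT_INDEX(neib-1) + 1
    length= EXPORT_INDEX(neib) - EXPORT_INDEX(neib-1)
    call MPI_ISEND (SENDbuf(iS), length, MPI_DOUBLE_PRECISION,        &
&                   NEIBPE(neib), 0, MPI_COMM_WORLD, request_send(neib), ierr)
  enddo

  ! post non-blocking receives for external-node values
  do neib= 1, NEIBPETOT
    iS= IMPORT_INDEX(neib-1) + 1
    length= IMPORT_INDEX(neib) - IMPORT_INDEX(neib-1)
    call MPI_IRECV (RECVbuf(iS), length, MPI_DOUBLE_PRECISION,        &
&                   NEIBPE(neib), 0, MPI_COMM_WORLD, request_recv(neib), ierr)
  enddo

  call MPI_WAITALL (NEIBPETOT, request_recv, stat_recv, ierr)

  ! unpack received values into the external-node entries of VAL
  do neib= 1, NEIBPETOT
    do k= IMPORT_INDEX(neib-1)+1, IMPORT_INDEX(neib)
      VAL(IMPORT_ITEM(k))= RECVbuf(k)
    enddo
  enddo

  call MPI_WAITALL (NEIBPETOT, request_send, stat_send, ierr)
end subroutine update_external_values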

Local Data Structure …

So far, we have spent several slides describing the concept of the local data structure of GeoFEM/HPC-MW, which includes the information needed for inter-domain communication. Actually, users do not need to know the details of this local data structure. Most of the procedures that use the communication tables, such as parallel I/O, linear solvers and parallel visualization, are executed in subroutines provided by GeoFEM/HPC-MW. All you have to do is call these subroutines.