ppOpen-HPC: Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT)

Kengo Nakajima, Information Technology Center, The University of Tokyo


Page 1: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

ppOpen-HPC
Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT)

Kengo Nakajima
Information Technology Center, The University of Tokyo

Page 2: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Lessons learned in the 20th Century

• Methods for scientific computing (e.g. FEM, FDM, BEM) consist of typical data structures and typical procedures.
• Optimization of each procedure is possible and effective.
• A well-defined data structure can "hide" MPI communication processes from the code developer.
• Code developers do not have to care about communications.
• Halo (overlapped) regions for parallel FEM; a minimal sketch of the hidden halo exchange follows below.
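As an illustration of what such a library hides, here is a minimal sketch, in Fortran with MPI, of a halo update for a single neighboring process. The subroutine name, the single-neighbor setup, and the use of the first n_ext entries of x as the send buffer are simplifications introduced here for illustration; they are not the actual ppOpen-HPC interface.

subroutine hidden_halo_update(x, n_int, n_ext, neib_pe, comm)
  ! Exchange boundary values with one neighboring process so that the
  ! halo (external) part of x is refreshed before the next computation.
  use mpi
  implicit none
  integer, intent(in)    :: n_int, n_ext, neib_pe, comm
  real(8), intent(inout) :: x(n_int + n_ext)   ! internal values + halo
  integer :: req(2), stat(MPI_STATUS_SIZE, 2), ierr
  ! send my boundary values, receive the neighbor's values into the halo
  call MPI_Isend(x(1),       n_ext, MPI_REAL8, neib_pe, 0, comm, req(1), ierr)
  call MPI_Irecv(x(n_int+1), n_ext, MPI_REAL8, neib_pe, 0, comm, req(2), ierr)
  call MPI_Waitall(2, req, stat, ierr)
end subroutine hidden_halo_update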

2

Page 3: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

3

[Figure: a 5×5-node example mesh (global node numbers 1-25) partitioned into four subdomains PE#0-PE#3; each PE holds its internal nodes plus halo (external) nodes copied from neighboring PEs, with its own local numbering.]

Page 4: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

ppOpen-HPC: Overview

• Open source infrastructure for development and execution of large-scale scientific applications on post-peta-scale supercomputers with automatic tuning (AT)
• "pp": post-peta-scale
• Five-year project (FY.2011-2015, since April 2011)
• P.I.: Kengo Nakajima (ITC, The University of Tokyo)
• Part of "Development of System Software Technologies for Post-Peta Scale High Performance Computing" funded by JST/CREST (Japan Science and Technology Agency, Core Research for Evolutional Science and Technology)
• Team of 7 institutes, >30 people (5 PDs) from various fields: co-design
  – ITC/U.Tokyo, AORI/U.Tokyo, ERI/U.Tokyo, FS/U.Tokyo
  – Hokkaido U., Kyoto U., JAMSTEC

4

Page 5: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

[Diagram: structure of ppOpen-HPC. A user's program is built on the following layers and becomes an optimized application with optimized ppOpen-APPL and ppOpen-MATH components.]

• ppOpen-APPL (framework for application development): FDM, FEM, FVM, BEM, DEM
• ppOpen-MATH (math libraries): MG, GRAPH, VIS, MP
• ppOpen-AT (automatic tuning): STATIC, DYNAMIC
• ppOpen-SYS (system software): COMM, FT

5

Page 6: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

• Group Leaders
  – Masaki Satoh (AORI/U.Tokyo)
  – Takashi Furumura (ERI/U.Tokyo)
  – Hiroshi Okuda (GSFS/U.Tokyo)
  – Takeshi Iwashita (Kyoto U., ITC/Hokkaido U.)
  – Hide Sakaguchi (IFREE/JAMSTEC)

• Main Members
  – Takahiro Katagiri (ITC/U.Tokyo)
  – Masaharu Matsumoto (ITC/U.Tokyo)
  – Hideyuki Jitsumoto (ITC/U.Tokyo)
  – Satoshi Ohshima (ITC/U.Tokyo)
  – Hiroyasu Hasumi (AORI/U.Tokyo)
  – Takashi Arakawa (RIST)
  – Futoshi Mori (ERI/U.Tokyo)
  – Takeshi Kitayama (GSFS/U.Tokyo)
  – Akihiro Ida (ACCMS/Kyoto U.)
  – Miki Yamamoto (IFREE/JAMSTEC)
  – Daisuke Nishiura (IFREE/JAMSTEC)

6

Page 7: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

ppOpen-HPC: ppOpen-APPL

• ppOpen-HPC consists of various optimized libraries that cover the typical procedures of scientific computing.
  – ppOpen-APPL/FEM, FDM, FVM, BEM, DEM
  – Linear solvers, matrix assembly, AMR, visualization, etc.
  – Written in Fortran 2003 (a C interface will be available soon)
• Source code developed on a PC with a single processor is linked with these libraries, and the generated parallel code is optimized for post-peta-scale systems.
• Users do not have to worry about optimization, tuning, parallelization, etc.; MPI, OpenMP, and partly OpenACC are handled by the libraries.

7

Page 8: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

ppOpen-HPC covers ...

8

Page 9: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

FEM Code on ppOpen-HPC
Optimization/parallelization can be hidden from application developers:

program My_pFEM
  use ppOpenFEM_util
  use ppOpenFEM_solver

  call ppOpenFEM_init
  call ppOpenFEM_cntl
  call ppOpenFEM_mesh
  call ppOpenFEM_mat_init

  do
    call Users_FEM_mat_ass
    call Users_FEM_mat_bc
    call ppOpenFEM_solve
    call ppOpenFEM_vis
    Time = Time + DT
  enddo

  call ppOpenFEM_finalize
  stop
end

9

Page 10: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

ppOpen-HPC: AT & Post T2K

• Automatic tuning (AT) enables development of optimized codes and libraries on emerging architectures
  – Directive-based special language for AT
  – Optimization of memory access
• The target system is the Post T2K system
  – 20-30 PFLOPS, FY.2015-2016
  – JCAHPC: U. Tsukuba & U. Tokyo
  – Many-core based (e.g. Intel MIC/Xeon Phi)
  – ppOpen-HPC helps the smooth transition of users (> 2,000) to the new system

10

Page 11: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Supercomputers in U.Tokyo: 2 big systems, 6-year cycle

11

[Timeline figure, FY2005-FY2019:]
• Hitachi SR11000/J2: 18.8 TFLOPS, 16.4 TB — fat nodes with large memory; (flat) MPI, good communication performance
• Hitachi HA8000 (T2K): 140 TFLOPS, 31.3 TB
• Hitachi SR16000/M1 (based on IBM Power-7): 54.9 TFLOPS, 11.2 TB — our last SMP, to be switched to MPP
• Fujitsu PRIMEHPC FX10 (based on SPARC64 IXfx): 1.13 PFLOPS, 150 TB
• Post T2K: 20-30 PFLOPS
• The 京 (K) computer (peta-scale) marks the turning point to the hybrid parallel programming model.

Page 12: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

[Diagram: the ppOpen-HPC component structure, identical to the diagram on Page 5 (ppOpen-APPL, ppOpen-MATH, ppOpen-AT, ppOpen-SYS beneath the user's program, yielding an optimized application with optimized ppOpen-APPL and ppOpen-MATH).]

12

Page 13: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Weakly Coupled Simulation with the ppOpen-HPC Libraries

Two kinds of applications, Seism3D+ (based on FDM, built on ppOpen-APPL/FDM) and FrontISTR++ (based on FEM, built on ppOpen-APPL/FEM), are connected by the ppOpen-MATH/MP coupler, which passes velocity/displacement data between them.

Principal functions of ppOpen-MATH/MP:
• Make a mapping table
• Convert physical variables
• Choose the timing of data transmission

13

Page 14: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

[Bar chart: speedup [%] obtained by ppOpen-AT for the kernels update_stress, update_vel, update_stress_sponge, diff_*, passing_*, and the whole code.]

Example of directives for ppOpen-AT (loop splitting/fusion): ppOpen-AT on Seism3D (FDM), 8 Xeon Phi nodes.

!oat$ install LoopFusionSplit region start
!$omp parallel do private(k,j,i,STMP1,STMP2,STMP3,STMP4,RL,RM,RM2,RMAXY,RMAXZ,RMAYZ,RLTHETA,QG)
DO K = 1, NZ
DO J = 1, NY
DO I = 1, NX
  RL = LAM(I,J,K); RM = RIG(I,J,K); RM2 = RM + RM
  RLTHETA = (DXVX(I,J,K)+DYVY(I,J,K)+DZVZ(I,J,K))*RL
!oat$ SplitPointCopyDef region start
  QG = ABSX(I)*ABSY(J)*ABSZ(K)*Q(I,J,K)
!oat$ SplitPointCopyDef region end
  SXX(I,J,K) = ( SXX(I,J,K) + (RLTHETA + RM2*DXVX(I,J,K))*DT )*QG
  SYY(I,J,K) = ( SYY(I,J,K) + (RLTHETA + RM2*DYVY(I,J,K))*DT )*QG
  SZZ(I,J,K) = ( SZZ(I,J,K) + (RLTHETA + RM2*DZVZ(I,J,K))*DT )*QG
!oat$ SplitPoint (K, J, I)
  STMP1 = 1.0/RIG(I,J,K); STMP2 = 1.0/RIG(I+1,J,K); STMP4 = 1.0/RIG(I,J,K+1)
  STMP3 = STMP1 + STMP2
  RMAXY = 4.0/(STMP3 + 1.0/RIG(I,J+1,K) + 1.0/RIG(I+1,J+1,K))
  RMAXZ = 4.0/(STMP3 + STMP4 + 1.0/RIG(I+1,J,K+1))
  RMAYZ = 4.0/(STMP3 + STMP4 + 1.0/RIG(I,J+1,K+1))
!oat$ SplitPointCopyInsert
  SXY(I,J,K) = ( SXY(I,J,K) + (RMAXY*(DXVY(I,J,K)+DYVX(I,J,K)))*DT )*QG
  SXZ(I,J,K) = ( SXZ(I,J,K) + (RMAXZ*(DXVZ(I,J,K)+DZVX(I,J,K)))*DT )*QG
  SYZ(I,J,K) = ( SYZ(I,J,K) + (RMAYZ*(DYVZ(I,J,K)+DZVY(I,J,K)))*DT )*QG
END DO; END DO; END DO
!$omp end parallel do
!oat$ install LoopFusionSplit region end
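For readers unfamiliar with the transformation, the following is a generic, hand-written sketch of loop splitting; the subroutine and variable names are invented for illustration and are not ppOpen-AT output. The directives above let ppOpen-AT generate split and fused variants of the marked region automatically so that the fastest one can be selected on the target machine.

subroutine fused_form(n, w, mask, a, b, s, t)
  ! One loop updates both s and t: less loop overhead, but a larger body
  ! that may exhaust registers/cache on many-core processors.
  implicit none
  integer, intent(in)    :: n
  real(8), intent(in)    :: w(n), mask(n), a(n), b(n)
  real(8), intent(inout) :: s(n), t(n)
  integer :: i
  real(8) :: q
  do i = 1, n
    q    = w(i) * mask(i)
    s(i) = (s(i) + a(i)) * q
    t(i) = (t(i) + b(i)) * q
  end do
end subroutine fused_form

subroutine split_form(n, w, mask, a, b, s, t)
  ! The same update split into two loops with smaller bodies; the common
  ! factor w(i)*mask(i) is recomputed, mirroring the SplitPointCopyDef /
  ! SplitPointCopyInsert idea in the directives above.
  implicit none
  integer, intent(in)    :: n
  real(8), intent(in)    :: w(n), mask(n), a(n), b(n)
  real(8), intent(inout) :: s(n), t(n)
  integer :: i
  do i = 1, n
    s(i) = (s(i) + a(i)) * (w(i) * mask(i))
  end do
  do i = 1, n
    t(i) = (t(i) + b(i)) * (w(i) * mask(i))
  end do
end subroutine split_form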

Page 15: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Schedule of Public Release (with English Documents)

http://ppopenhpc.cc.u-tokyo.ac.jp/

• Released at SC-XY each year (and can be downloaded from the web site)
• Multicore/manycore cluster version (flat MPI, OpenMP/MPI hybrid) with documents in English
• We are now focusing on MIC/Xeon Phi
• Collaborations with scientists are welcome

History:
• SC12, Nov 2012 (Ver.0.1.0)
• SC13, Nov 2013 (Ver.0.2.0)
• SC14, Nov 2014 (Ver.0.3.0)

15

Page 16: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

New Features in Ver.0.3.0
http://ppopenhpc.cc.u-tokyo.ac.jp/

• ppOpen-APPL/AMR-FDM: AMR framework with a dynamic load-balancing method for various FDM applications

• HACApK library for H-matrix comp. in ppOpen-APPL/BEM

16

[Figure: distribution of the small submatrices of an H-matrix among processors P0-P7 in HACApK; each submatrix a_c is assigned to the processor P_k whose row range contains it.]

• Utilities for pre-processing in ppOpen-APPL/DEM

• Booth #713

Page 17: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

17

Collaborations, Outreach

• Collaborations
  – International collaborations
    • Lawrence Berkeley National Lab.
    • National Taiwan University
    • IPCC (Intel Parallel Computing Center)
• Outreach, applications
  – Large-scale simulations
    • Geologic CO2 storage
    • Astrophysics
    • Earthquake simulations, etc.
    • ppOpen-AT, ppOpen-MATH/VIS, ppOpen-MATH/MP, linear solvers
  – International workshops (2012, 2013)
  – Tutorials, classes

Page 18: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

18

from Post-Peta to Exascale

• Currently, we are focusing on the Post T2K system with many-core architectures (Intel Xeon Phi)
• The outline of exascale systems is much clearer than it was in 2011, when this project started.
  – Frameworks like ppOpen-HPC are really needed
    • More complex and huge systems
    • More difficult to extract application performance
  – A smooth transition from post-peta to exa will be possible through continuous development and improvement of ppOpen-HPC (we need funding for that!)
• Research topics in the exascale era
  – Power-aware algorithms/AT
  – Communication/synchronization-reducing algorithms

Page 19: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

• pK-Open-HPC
• Ill-Conditioned Problems
• SPPEXA Proposal

19

Page 20: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

SIAM PP14
16th SIAM Conference on Parallel Processing for Scientific Computing, Feb. 18-21, 2014
http://www.siam.org/meetings/pp14/

• Communication/synchronization avoiding/reducing algorithms
  – Direct/iterative solvers, preconditioning, s-step methods
  – Coarse grid solvers on parallel multigrid
• Preconditioning methods on many-core architectures
  – SPAI, polynomial, ILU with multicoloring/RCM (no CM-RCM), GMG
  – Asynchronous ILU on GPU (E. Chow, Georgia Tech)?
• Preconditioning methods for ill-conditioned problems
  – Low-rank approximation
• Block-structured AMR
  – SFC: space-filling curve

20

Page 21: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Communications are expensive ...
(cf. Ph.D. thesis by Mark Hoemmen (SNL) at UC Berkeley on CA-KSP solvers, 2010)

• Serial communications
  – Data transfer through the memory hierarchy
• Parallel communications
  – Message passing through the network
• Efficiency of memory access -> exa-feasible applications

21

Page 22: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Future of CAE Applications

• FEM with fully unstructured meshes is not feasible for exascale supercomputer systems
• Block-structured AMR, voxel-type FEM
• Serial comm.
  – Fixed inner loops (e.g. sliced ELL)

22

[Figure: block-structured AMR with refinement levels 0, 1, 2.]

• Parallel comm.
  – Krylov iterative solvers
  – Preconditioning, dot products, halo communications

Page 23: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

23

ELL: Fixed Loop-length, Nice for Pre-fetching

[Figure: a small sparse matrix stored in (a) CRS, where each row has a variable number of entries, and (b) ELL, where every row is padded to the same fixed length.]
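To make the fixed-loop-length point concrete, here is a minimal sketch of the two SpMV kernels in Fortran; the array names (index, item, AMAT, AELL, itemELL) and the zero-padding convention are assumptions made here for illustration, not the ppOpen-HPC data structures.

subroutine spmv_crs(n, nnz, index, item, AMAT, x, y)
  ! CRS: row i owns entries index(i-1)+1 .. index(i); the inner loop
  ! length changes from row to row.
  implicit none
  integer, intent(in)  :: n, nnz, index(0:n), item(nnz)
  real(8), intent(in)  :: AMAT(nnz), x(n)
  real(8), intent(out) :: y(n)
  integer :: i, k
  do i = 1, n
    y(i) = 0.0d0
    do k = index(i-1)+1, index(i)
      y(i) = y(i) + AMAT(k) * x(item(k))
    end do
  end do
end subroutine spmv_crs

subroutine spmv_ell(n, ncol, itemELL, AELL, x, y)
  ! ELL: every row has exactly ncol entries; short rows are padded with
  ! zero coefficients (pointing at column 1), so the inner loop length is
  ! fixed, which is friendly to prefetching and SIMD.
  implicit none
  integer, intent(in)  :: n, ncol, itemELL(ncol, n)
  real(8), intent(in)  :: AELL(ncol, n), x(n)
  real(8), intent(out) :: y(n)
  integer :: i, k
  do i = 1, n
    y(i) = 0.0d0
    do k = 1, ncol
      y(i) = y(i) + AELL(k, i) * x(itemELL(k, i))
    end do
  end do
end subroutine spmv_ell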

Page 24: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Special Treatment for “Boundary” Meshes connected to “Halo”

• Distribution of lower/upper non-zero off-diagonal components
• If we adopt RCM (or CM) reordering ...
  – Pure internal meshes: L ≈ 3, U ≈ 3
  – Boundary meshes: L ≈ 3, U ≈ 6

24

[Figure: pure internal meshes, internal meshes on the boundary, and external (halo) meshes, with their internal (lower), internal (upper), and external (upper) off-diagonal connections marked.]

Page 25: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Original ELL: Backward Substitution
Cache is not well utilized: IAUnew(6,N), AUnew(6,N)
[Figure: pure internal cells and boundary cells are both stored in the same padded arrays AUnew(6,N).]

25

Page 26: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Improved ELL: Backward Substitution
Cache is well utilized; the arrays are separated into AUnew3/AUnew6 (sliced ELL [Monakov et al. 2010], originally proposed for SpMV on GPU).
[Figure: pure internal cells stored as AUnew3(3,N), boundary cells as AUnew6(6,N).]

26
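To make the separation concrete, here is a minimal sketch of the backward substitution with the split arrays. It assumes, purely for illustration, that the cells have been renumbered so that pure internal cells come first (1..Nin, at most 3 upper off-diagonals) and boundary cells last (Nin+1..N, at most 6), and that padded entries carry zero coefficients and column indices whose values are already computed; the names and interface are not the actual ppOpen-HPC code.

subroutine back_subst_sliced_ell(N, Nin, Dinv, AUnew3, IAUnew3, AUnew6, IAUnew6, z, y)
  implicit none
  integer, intent(in)  :: N, Nin
  integer, intent(in)  :: IAUnew3(3, Nin), IAUnew6(6, N-Nin)
  real(8), intent(in)  :: Dinv(N), AUnew3(3, Nin), AUnew6(6, N-Nin), z(N)
  real(8), intent(out) :: y(N)
  integer :: i, k
  real(8) :: t
  do i = N, Nin+1, -1                  ! boundary cells: fixed inner length 6
    t = z(i)
    do k = 1, 6
      t = t - AUnew6(k, i-Nin) * y(IAUnew6(k, i-Nin))
    end do
    y(i) = t * Dinv(i)
  end do
  do i = Nin, 1, -1                    ! pure internal cells: fixed inner length 3
    t = z(i)
    do k = 1, 3
      t = t - AUnew3(k, i) * y(IAUnew3(k, i))
    end do
    y(i) = t * Dinv(i)
  end do
end subroutine back_subst_sliced_ell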

Page 27: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Analysis with the detailed profiler of the Fujitsu FX10: single node, flat MPI, RCM (multigrid part), 64^3 cells/core

27

              Instructions  L1D miss    L2 miss     SIMD op. ratio  GFLOPS
CRS           1.53×10^9     2.32×10^7   1.67×10^7   30.14%          6.05
Original ELL  4.91×10^8     1.67×10^7   1.27×10^7   93.88%          6.99
Improved ELL  4.91×10^8     1.67×10^7   9.14×10^6   93.88%          8.56

Page 28: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Hierarchical CGA (hCGA): Communication-Reducing Multigrid
Reduced number of MPI processes [KN 2013]

28

[Figure: multigrid hierarchy from fine levels (1, 2, ...) to coarse levels (m-3, m-2); at the coarsest levels the problem is gathered so that the coarse grid solver runs on a single MPI process (multi-threaded, further multigrid).]

Page 29: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

29

Weak Scaling: up to ~4,096 nodes, up to 17,179,869,184 meshes (64^3 meshes/core)

Case  Matrix        Coarse grid solver
C0    CRS           single core
C1    ELL (org.)    single core
C2    ELL (org.)    CGA
C3    ELL (sliced)  CGA
C4    ELL (sliced)  hCGA

[Plots: elapsed time (sec., down is good) vs. core count (100-100,000) on FX10. Left: HB 8x2 for cases C0-C3, showing the effect of sliced ELL. Right: Flat MPI C3/C4, HB 4x4 C4, HB 8x2 C3, and HB 16x1 C3, showing the effect of hCGA.]

Page 30: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Framework for Exa-Feasible Applications: pK-Open-HPC

• Should be based on voxel-type meshes
• Robust/scalable geometric/algebraic multigrid preconditioning
  – For various types of applications and PDEs
  – Robust smoothers
  – Low-rank approximations
  – Overhead of coarse grid solvers (hCGA)
  – Dot products
  – Limited or fixed number of non-zero off-diagonals of the coefficient matrices at every level
    • Generally speaking, AMG has many non-zero off-diagonals on coarse levels of meshes
• Preprocessors

30 (SDHPC-11)

Page 31: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

31

AMR
Octree (8/27 children), self-similar blocks
• In a self-similar block the cell size becomes half while the number of cells per block stays the same.

Framework (block loop; a uniform-cell program is inserted for the per-block computation):

program amr_framework
  type(oct_Block) :: block(:)
  real :: A(nx,ny,nz), B(nx,ny,nz), C(nx,ny,nz)

  ! initial parameter setup
  do istep = 1, Nstep                 ! time loop
    ! treatment of cell refinement
    do index = 1, Nall                ! block loop
      ! main calculation: copy block data to uniform arrays
      A(:,:,:) = block(index)%F(1,:,:,:)
      B(:,:,:) = block(index)%F(2,:,:,:)
      C(:,:,:) = block(index)%F(3,:,:,:)
      call advance_field              ! insert uniform-cell program here
      ! copy the results back to the block arrays
      block(index)%F(1,:,:,:) = A(:,:,:)
      block(index)%F(2,:,:,:) = B(:,:,:)
      block(index)%F(3,:,:,:) = C(:,:,:)
    enddo
    ! outer boundary treatment
  enddo
  stop
end program

Uniform-cell program:

subroutine advance_field
  real :: A(nx,ny,nz), B(nx,ny,nz), C(nx,ny,nz)
  do iz = 1, nz
  do iy = 1, ny
  do ix = 1, nx
    A(ix,iy,iz) = ...
    B(ix,iy,iz) = ...
    C(ix,iy,iz) = ...
  enddo
  enddo
  enddo
  return
end subroutine advance_field

Page 32: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

• pK-Open-HPC
• Ill-Conditioned Problems
• SPPEXA Proposal

32

Page 33: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Extension of Depth of Overlapping

Legend: ● internal nodes, ● external nodes, ■ overlapped elements

[Figure: the four-subdomain example mesh (PE#0-PE#3) with overlapped elements; extending the depth of overlapping enlarges the overlapped region held by each PE.]

Cost for computation and communication may increase

33

Page 34: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

HID: Hierarchical Interface Decomposition [Henon & Saad 2007]

• Multilevel domain decomposition
  – Extension of nested dissection
• Non-overlapping at each level: connectors, separators
• Suitable for parallel preconditioning methods

[Figure: a 2D domain partitioned among 4 processes; level-1 interior nodes belong to a single process (0, 1, 2, 3), level-2 connector nodes to pairs of processes (0,1 / 0,2 / 1,3 / 2,3), and level-4 nodes to all four processes (0,1,2,3).]

34

Page 35: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Parallel ILU in HID for each Connector at each LEVEL

• The unknowns are reordered according to their level numbers, from the lowest to highest.

• The block structure of the reordered matrix leads to natural parallelism if ILU/IC decompositions or forward/backward substitution processes are applied.

[Figure: block structure of the reordered matrix; level-1 blocks (connectors 0, 1, 2, 3), level-2 blocks (0,1 / 0,2 / 2,3 / 1,3), and the level-4 block (0,1,2,3).]

35

Page 36: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Results: 64 cores, contact problems, BILU(p)-(depth of overlapping), 3,090,903 DOF, GPBiCG

[Bar charts: elapsed time (sec., 0-350) and number of iterations (0-1500) for BILU(1), BILU(1+), and BILU(2); bars compare BILU(p)-(0) (block Jacobi), BILU(p)-(1), BILU(p)-(1+), and BILU(p)-HID.]

36

Page 37: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Hetero 3D (1/2)

• Parallel FEM code (flat MPI)
  – 3D linear elasticity problems in cube geometries with heterogeneity
  – SPD matrices
  – Young's modulus: 10^-6 to 10^+6; (Emin, Emax) controls the condition number
• Preconditioned iterative solvers
  – GPBi-CG [Zhang 1997]
  – BILUT(p,d,t)
• Domain decomposition
  – Localized block Jacobi with extended overlapping (LBJ)
  – HID / extended HID

[Figure: cube geometry with (Nx-1)×(Ny-1)×(Nz-1) elements (Nx×Ny×Nz nodes); boundary conditions Ux=0 at x=Xmin, Uy=0 at y=Ymin, Uz=0 at z=Zmin, and a uniform distributed force in the z-direction at z=Zmax.]

37

Page 38: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Hetero 3D (2/2)

• Based on the parallel FEM procedures of GeoFEM
  – Benchmark developed in the FP3C project under Japan-France collaboration
• Parallel mesh generation
  – Fully parallel: each process generates its local mesh and assembles its local matrices.
  – Total number of vertices in each direction: (Nx, Ny, Nz)
  – Number of partitions in each direction: (Px, Py, Pz)
  – Total number of MPI processes is Px×Py×Pz
  – Each MPI process has (Nx/Px)×(Ny/Py)×(Nz/Pz) vertices.
  – The spatial distribution of Young's modulus is given by an external file that describes the heterogeneity of a 128^3 cube. If Nx (or Ny or Nz) is larger than 128, the 128^3 pattern is repeated periodically in each direction (see the sketch below).
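A minimal sketch of how such a periodic lookup could be implemented; the function name and the assumption that the file has already been read into a 128^3 array E128 are illustrative, not the actual benchmark code.

function youngs_modulus(ix, iy, iz, E128) result(E)
  ! Map a global vertex index (ix,iy,iz), 1-based, onto the 128^3
  ! heterogeneity pattern by repeating it periodically in each direction.
  implicit none
  integer, intent(in) :: ix, iy, iz
  real(8), intent(in) :: E128(128,128,128)
  real(8) :: E
  E = E128(mod(ix-1,128)+1, mod(iy-1,128)+1, mod(iz-1,128)+1)
end function youngs_modulus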

FP3C

38

Page 39: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

BILUT(p,d,t)

• Incomplete LU factorization with thresholding (ILUT)
• ILUT(p,d,t) [KN 2010]
  – p: maximum fill level, specified before factorization
  – d, t: dropping tolerances applied before/after factorization
• Process (b) can be replaced by other factorization methods or by more powerful direct linear solvers such as MUMPS or SuperLU.

[Diagram: (a) components of the initial matrix A with |a_ij| < d are dropped (also by location), giving the dropped matrix A'; (b) ILU(p) factorization of A' gives (ILU)'; (c) components of (ILU)' with magnitude < t are dropped (also by location), giving the final preconditioner (ILUT)' = ILUT(p,d,t).]
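A minimal sketch of the pre-factorization dropping step (a) on a CRS matrix; the array names follow the CRS sketch given earlier and the helper routine is hypothetical, not the actual solver code.

subroutine drop_small_entries(n, nnz, index, item, AMAT, d, nnz2, index2, item2, AMAT2)
  ! Stage (a) of ILUT(p,d,t): copy A to A', keeping the diagonal and any
  ! off-diagonal entry with |a_ij| >= d; smaller entries are dropped.
  implicit none
  integer, intent(in)  :: n, nnz, index(0:n), item(nnz)
  real(8), intent(in)  :: AMAT(nnz), d
  integer, intent(out) :: nnz2, index2(0:n), item2(nnz)
  real(8), intent(out) :: AMAT2(nnz)
  integer :: i, k
  nnz2 = 0
  index2(0) = 0
  do i = 1, n
    do k = index(i-1)+1, index(i)
      if (item(k) == i .or. abs(AMAT(k)) >= d) then
        nnz2 = nnz2 + 1
        item2(nnz2) = item(k)
        AMAT2(nnz2) = AMAT(k)
      end if
    end do
    index2(i) = nnz2
  end do
end subroutine drop_small_entries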

39

Page 40: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Preliminary Results

• Hardware
  – 16-240 nodes (256-3,840 cores) of the Fujitsu PRIMEHPC FX10 (Oakleaf-FX), University of Tokyo
• Problem setting
  – 420×320×240 vertices (3.194×10^7 elements, 9.677×10^7 DOF)
  – Strong scaling
  – Effect of the thickness of the overlapped zones
    • BILUT(p,d,t)-LBJ-X (X = 1, 2, 3)
    • RCM-entire renumbering for LBJ

40

Page 41: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Effect of t on Performance
BILUT(2,0,t)-GPBi-CG with 240 nodes (3,840 cores), Emin = 10^-6, Emax = 10^+6
Normalized by the results of BILUT(2,0,0)-LBJ-2; ●: [NNZ], ▲: iterations, ◆: solver time

41

[Plots: normalized ratio (0.00-1.50) vs. t (0 to 3.00×10^-2) for BILUT(2,0,t)-HID and BILUT(2,0,t)-LBJ-2.]

Page 42: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

BILUT(p,0,0) at 3,840 cores
No dropping: effect of fill-in

Preconditioner        NNZ of [M]   Set-up (sec.)  Solver (sec.)  Total (sec.)  Iterations
BILUT(1,0,0)-LBJ-1    1.920×10^10  1.35           65.2           66.5          1916
BILUT(1,0,0)-LBJ-2    2.519×10^10  2.03           61.8           63.9          1288
BILUT(1,0,0)-LBJ-3    3.197×10^10  2.79           74.0           76.8          1367
BILUT(2,0,0)-LBJ-1    3.351×10^10  3.09           71.8           74.9          1339
BILUT(2,0,0)-LBJ-2    4.394×10^10  4.39           65.2           69.6          939
BILUT(2,0,0)-LBJ-3    5.631×10^10  5.95           83.6           89.6          1006
BILUT(3,0,0)-LBJ-1    6.468×10^10  9.34           105.2          114.6         1192
BILUT(3,0,0)-LBJ-2    8.523×10^10  12.7           98.4           111.1         823
BILUT(3,0,0)-LBJ-3    1.101×10^11  17.3           101.6          118.9         722
BILUT(1,0,0)-HID      1.636×10^10  2.24           60.7           62.9          1472
BILUT(2,0,0)-HID      2.980×10^10  5.04           66.2           71.7          1096

[NNZ] of [A]: 7.174×10^9

42

Page 43: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

BILUT(p,0,0) at 3,840 cores
No dropping: effect of overlapping

(The data are the same as in the table on Page 42; here the comparison is across the overlap depths LBJ-1/2/3 and HID rather than across fill levels.)

[NNZ] of [A]: 7.174×10^9

43

Page 44: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

BILUT(p,0,t) at 3,840 cores
Optimum value of t

Preconditioner                 NNZ of [M]   Set-up (sec.)  Solver (sec.)  Total (sec.)  Iterations
BILUT(1,0,2.75×10^-2)-LBJ-1    7.755×10^9   1.36           45.0           46.3          1916
BILUT(1,0,2.75×10^-2)-LBJ-2    1.019×10^10  2.05           42.0           44.1          1383
BILUT(1,0,2.75×10^-2)-LBJ-3    1.285×10^10  2.81           54.2           57.0          1492
BILUT(2,0,1.00×10^-2)-LBJ-1    1.118×10^10  3.11           39.1           42.2          1422
BILUT(2,0,1.00×10^-2)-LBJ-2    1.487×10^10  4.41           37.1           41.5          1029
BILUT(2,0,1.00×10^-2)-LBJ-3    1.893×10^10  5.99           37.1           43.1          915
BILUT(3,0,2.50×10^-2)-LBJ-1    8.072×10^9   9.35           38.4           47.7          1526
BILUT(3,0,2.50×10^-2)-LBJ-2    1.063×10^10  12.7           35.5           48.3          1149
BILUT(3,0,2.50×10^-2)-LBJ-3    1.342×10^10  17.3           40.9           58.2          1180
BILUT(1,0,2.50×10^-2)-HID      6.850×10^9   2.25           38.5           40.7          1313
BILUT(2,0,1.00×10^-2)-HID      1.030×10^10  5.04           36.1           41.1          1064

44

[NNZ] of [A]: 7.174×10^9

Page 45: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Strong Scaling up to 3,840 cores
Speed-up based on the elapsed computation time (set-up + solver) of BILUT(1,0,2.50×10^-2)-HID with 256 cores

[Plots: speed-up (up to 4,000) and parallel performance (70-130%) vs. core count for BILUT(1,0,2.50e-2)-HID, BILUT(2,0,1.00e-2)-HID, BILUT(1,0,2.75e-2)-LBJ-2, BILUT(2,0,1.00e-2)-LBJ-2, and BILUT(3,0,2.50e-2)-LBJ-2, compared against the ideal speed-up.]

45

MUMPS: Low-Rank-Approx.

Page 46: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

• pK-Open-HPC
• Ill-Conditioned Problems
• SPPEXA Proposal

46

Page 47: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Developments for SPPEXA (1/2)

• Geometric/algebraic multigrid solvers
• Robust preconditioning method with a special partitioning method
• pK-Open-HPC
  – Robust and scalable
  – Communication reducing/avoiding algorithms
  – Extensions of sliced ELL for future architectures
    • Post-K (SPARC), Xeon Phi, NVIDIA Volta
    • U.Tokyo's systems
      – FY.2016: KNL-based system (20-30 PF)
      – FY.2018-2019: post-FX10 system (SPARC?, NVIDIA?) (50 PF)
  – OpenMP/MPI hybrid
  – Automatic tuning (AT)
    • Parameter selection, stencil construction

47

Page 48: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Developments for SPPEXA (2/2)

• Deployment of pK-Open-HPC will be supported by JST/CREST (until 2016.3) and by the RIKEN-U.Tokyo collaboration (2014.12-2020)
  – Not much funding
  – FY.2016, FY.2017: supported by JST/CREST if the proposal is accepted
  – FY.2018 (until 2019.3): RIKEN-U.Tokyo collaboration; other funding needed
• Members
  – University of Tokyo: K. Nakajima, T. Katagiri
  – Hokkaido University: T. Iwashita (AMG)
  – Kyoto University: A. Ida (AMG, low-rank)
  – French partners?: Michel Dayde (Toulouse)
  – Users, application groups

48

Page 49: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Multigrid Solvers (1/2)

• Geometric, algebraic
  – Adaptive mesh refinement (AMR)
• Robust for various types of problems
  – Target applications: Poisson, solid mechanics, ME
  – Smoothers: ILU/IC, etc.
• Carefully designed strategy of stencil construction for feasibility on post-peta/exascale systems
  – Modified/extended sliced ELL
  – Generally speaking, AMG matrices become dense at coarse levels.
    • Critical for load balancing on massively parallel systems
    • ELL is difficult there; some strategy for stencil construction is needed

49

Page 50: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Multigrid Solvers (2/2)

• MPI/OpenMP hybrid
  – Reordering for extraction of parallelism
  – hCGA-type strategy for communication reduction
• Automatic tuning
  – Stencil construction
  – Various parameters

50

Page 51: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

Robust Preconditioner

• ILU/BILU(p,d,t)
  – Strategy for domain decomposition
  – Flexible overlapping, local reordering
  – Extended HID
  – AT for selection of optimum parameters
• Low-rank approximation
  – H-matrix solver
  – MUMPS

51

Page 52: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

(Rough) Schedule (1/2)

• 1st year
  – Multigrid: development & test
  – ILU/BILU: development & test
  – Low-rank approximation: development & test
• 2nd year
  – Multigrid: development & test (cont.), optimization, AT
  – ILU/BILU: development & test (cont.), optimization, AT
  – Low-rank approximation: development & test (cont.), optimization

52

Page 53: PpOpen-HPC Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT) Kengo Nakajima Information

(Rough) Schedule (2/2)

• 3rd year
  – Evaluations through real applications
  – Multigrid: development & test, optimization, AT (cont.)
  – ILU/BILU: development & test, optimization, AT (cont.)
  – Low-rank approximation: development & test, optimization, AT (cont.)
• Proposal
  – Deadline: Jan. 31, 2015
  – KN attends a conference in Barcelona Jan. 29-30, and can visit Germany on Jan. 27 (Tue.).

53