Page 1: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Recipe for High Performance Computing Software and Application at Exascale

Hatem Ltaief, KAUST Supercomputing Lab

HPC Advisory Council Switzerland Conference

March 13-15, 2012, Lugano, Switzerland

H. Ltaief HPC Swiss 1 / 77

Page 2: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Credits

H. Ltaief HPC Swiss 2 / 77

Page 4: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Outline

1 Motivations

2 Matrices Over Runtime Systems at Exascale

3 Cholesky-based Matrix Inversion and Generalized Symmetric Eigenvalue Problem

4 N-Body Simulations

5 Seismic Applications

6 Conclusion

H. Ltaief HPC Swiss 3 / 77

Page 6: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

The "K" Computer

H. Ltaief HPC Swiss 5 / 77

Page 7: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

HPC Recipe For Exascale Computing

H. Ltaief HPC Swiss 6 / 77

Page 9: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

HPC Recipe For Exascale Computing

[Figure: ingredients of the recipe for exascale (10^18) computing]

• Fine-grain parallelism
• Reducing data motion
• Reducing synchronization
• Auto-tuning
• Power efficiency
• Dynamic runtime systems
• Hardware and software co-design

H. Ltaief HPC Swiss 8 / 77

Page 10: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

A Look Back...

Software infrastructure and algorithmic design follow hardware evolution over time:

• 70s - LINPACK, vector operations: Level-1 BLAS operations

• 80s - LAPACK, blocked, cache-friendly: Level-3 BLAS operations

• 90s - ScaLAPACK, distributed memory: PBLAS, message passing

• 00s - PLASMA, MAGMA, many x86/CUDA cores friendly: DAG scheduler, tile data layout, some extra kernels

H. Ltaief HPC Swiss 9 / 77

Page 14: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Block Algorithms

• Panel-Update Sequence

• Transformations are blocked/accumulated within the panel (Level-2 BLAS)

• Transformations applied at once on the trailing submatrix (Level-3 BLAS)

• Parallelism hidden inside the BLAS

• Fork-join Model

H. Ltaief HPC Swiss 10 / 77

Page 15: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

One-Sided Block Algorithms: LU

H. Ltaief HPC Swiss 11 / 77

Page 16: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Block Algorithms: Fork-Join Paradigm

H. Ltaief HPC Swiss 12 / 77

Page 17: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

PLASMA: Tile Algorithms

PLASMA: Parallel Linear Algebra for Scalable Multi-core Architectures
=⇒ http://icl.cs.utk.edu/plasma

• Parallelism is brought to the fore

• May require the redesign of linear algebra algorithms

• Remove unnecessary synchronization points between Panel-Update sequences

• DAG execution where nodes represent tasks and edges define dependencies between them

• Dynamic runtime system environment QUARK

• Tile data layout translation

H. Ltaief HPC Swiss 13 / 77

Page 18: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Tile Data Layout Format

LAPACK: column-major format PLASMA: tile format

H. Ltaief HPC Swiss 14 / 77

Page 19: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

MAGMA: x86 + MultiGPUs

MAGMA: Matrix Algebra on GPU and Multicore Architectures
=⇒ http://icl.cs.utk.edu/magma

• Lessons Learned from PLASMA!

• CUDA-based hybrid systems

• New high performance numerical kernels

• StarPU Runtime System (INRIA, Bordeaux)

• Both: x86 and GPUs =⇒ Hybrid Computations

• Similar to LAPACK in functionality

H. Ltaief HPC Swiss 15 / 77

Page 20: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Dynamic Runtime System

• Conceptually similar to out-of-order processor scheduling, because it has:

• Dynamic runtime DAG scheduler

• Out-of-order execution flow of fine-grained tasks

• Task scheduling as soon as dependencies are satisfied

• Producer-Consumer

H. Ltaief HPC Swiss 16 / 77

Page 21: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

DataFlow Programming

• Five-decade-old concept
• Programming paradigm that models a program as a directed graph of the data flowing between operations (cf. Wikipedia)
• Think "how things connect" rather than "how things happen"
• Assembly-line analogy
• Inherently parallel

H. Ltaief HPC Swiss 17 / 77

Page 22: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Outline

1 Motivations

2 Matrices Over Runtime Systems at Exascale

3 Cholesky-based Matrix Inversion and Generalized SymmetricEigenvalue Problem

4 N-Body Simulations

5 Seismic Applications

6 Conclusion

H. Ltaief HPC Swiss 18 / 77

Page 23: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Matrices Over Runtime Systems at Exascale

• Mission statement: "Design dense and sparse linear algebra methods that achieve the fastest possible time to an accurate solution on large-scale hybrid systems."

• Runtime challenges due to the ever-growing hardware complexity.

• Algorithmic challenges to exploit the hardware capabilities to the fullest.

H. Ltaief HPC Swiss 19 / 77

Page 24: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

QUARK

• From Sequential Nested-Loop Code to Parallel Execution

• Task-based parallelism

• Out-of-order dynamic scheduling

• Scheduling a Window of Tasks

• Data Locality and Cache Reuse

• High user-productivity

• Shipped within PLASMA but standalone project

H. Ltaief HPC Swiss 20 / 77

Page 25: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

QUARK

H. Ltaief HPC Swiss 21 / 77

Page 26: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

QUARK

void QUARK_core_dpotrf( Quark *quark, char uplo, int n,

double *A, int lda, int *info )

{

QUARK_Insert_Task( quark, TASK_core_dpotrf, 0x00,

sizeof(char), &uplo, VALUE,

sizeof(int), &n, VALUE,

sizeof(double)*n*n, A, INOUT | LOCALITY,

sizeof(int), &lda, VALUE,

sizeof(int), info, OUTPUT,

0);

}

void TASK_core_dpotrf(Quark *quark)

{

char uplo; int n; double *A; int lda; int *info;

quark_unpack_args_5( quark, uplo, n, A, lda, info );

dpotrf_( &uplo, &n, A, &lda, info );

}

H. Ltaief HPC Swiss 22 / 77

Page 27: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

StarPU

• Runtime which provides:
=⇒ Task scheduling
=⇒ Memory management

• Supports:
=⇒ SMP/Multicore Processors (x86, PPC, . . . )
=⇒ NVIDIA GPUs (e.g. heterogeneous multi-GPU)
=⇒ OpenCL devices
=⇒ Cell Processors (experimental)

H. Ltaief HPC Swiss 23 / 77

Page 28: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

StarPU

starpu_Insert_Task(&cl_dpotrf,

VALUE, &uplo, sizeof(char),

VALUE, &n, sizeof(int),

INOUT, Ahandle(k, k),

VALUE, &lda, sizeof(int),

OUTPUT, &info, sizeof(int),

CALLBACK, profiling?cl_dpotrf_callback:NULL, NULL,

0);

H. Ltaief HPC Swiss 24 / 77

Page 29: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

SMPSs

• Compiler technology.

• Task parameters and directionality defined by the user through pragmas

• Translates C code with pragma annotations to standard C99 code

• Embedded locality optimizations

• Data renaming feature to reduce dependencies, leaving only the true dependencies.

H. Ltaief HPC Swiss 25 / 77

Page 30: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

SMPSs

#pragma css task input(A[NB][NB]) inout(T[NB][NB])
void dsyrk(double *A, double *T);

#pragma css task inout(T[NB][NB])
void dpotrf(double *T);

#pragma css task input(A[NB][NB], B[NB][NB]) inout(C[NB][NB])
void dgemm(double *A, double *B, double *C);

#pragma css task input(T[NB][NB]) inout(B[NB][NB])
void dtrsm(double *T, double *C);

#pragma css start
for (k = 0; k < TILES; k++) {
    for (n = 0; n < k; n++)
        dsyrk(A[k][n], A[k][k]);
    dpotrf(A[k][k]);
    for (m = k+1; m < TILES; m++) {
        for (n = 0; n < k; n++)
            dgemm(A[k][n], A[m][n], A[m][k]);
        dtrsm(A[k][k], A[m][k]);
    }
}
#pragma css finish

H. Ltaief HPC Swiss 26 / 77

Page 31: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Standardization???

• Efforts to define an API standard for these runtime systems.

• Difficult task...

• But worth the time and sacrifice when it comes to making end users' lives easier.

H. Ltaief HPC Swiss 27 / 77

Page 32: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

DAGuE

DAGuE: Directed Acyclic Graph Unified Environment
=⇒ http://icl.cs.utk.edu/dague

• Compiler technology.

• Converting Sequential Code to a DAG representation.

• Parametrized DAG scheduler for distributed memory systems.

• Engine of DPLASMA library

H. Ltaief HPC Swiss 28 / 77

Page 33: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

DAGuE

H. Ltaief HPC Swiss 29 / 77

Page 34: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Outline

1 Motivations

2 Matrices Over Runtime Systems at Exascale

3 Cholesky-based Matrix Inversion and Generalized Symmetric Eigenvalue Problem

4 N-Body Simulations

5 Seismic Applications

6 Conclusion

H. Ltaief HPC Swiss 30 / 77

Page 35: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

A^{-1}, Seriously???

• YES!

• Critical component of the variance-covariance matrix computation in statistics (cf. Higham, Accuracy and Stability of Numerical Algorithms, Second Edition, SIAM, 2002)

• A is a dense symmetric positive-definite matrix

• Three steps:
1 Cholesky factorization (DPOTRF)
2 Inverting the Cholesky factor (DTRTRI)
3 Calculating the product of the inverted Cholesky factor with its transpose (DLAUUM)

• StarPU runtime used here

H. Ltaief HPC Swiss 31 / 77

Page 36: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

A^{-1}, Hybrid Architecture Targeted

• =⇒ PCIe interconnect 16X, 64 Gb/s: a very thin pipe!

• =⇒ Fermi C2050: 448 CUDA cores, 515 Gflop/s

H. Ltaief HPC Swiss 32 / 77

Page 37: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

How to choose NB (Tile size)? Auto-Tuning??? :)

H. Ltaief HPC Swiss 33 / 77

Page 38: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Auto-Tuning

H. Ltaief HPC Swiss 34 / 77

Page 39: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Auto-Tuning

* Slide courtesy of J. Kurzak, UTK

H. Ltaief HPC Swiss 35 / 77

Page 40: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

A^{-1}, Preliminary Results

[Performance plot: Gflop/s (50-500) vs. matrix size (up to 2.5 × 10^4) for Tile Hybrid CPU-GPU, MAGMA, PLASMA, and LAPACK]

H. Ibeid, D. Kaushik, D. Keyes and H. Ltaief, HIPC'11, India
H. Ltaief HPC Swiss 36 / 77

Page 41: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

GSEVP: What we solve?

Ax = λBx

A, B ∈ R^{n×n}, x ∈ R^n, λ ∈ R   or   A, B ∈ C^{n×n}, x ∈ C^n, λ ∈ R

A = A^T or A = A^H: A is symmetric or Hermitian
x^H B x > 0 for all x ≠ 0: B is symmetric (Hermitian) positive definite

H. Ltaief HPC Swiss 37 / 77

Page 42: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

GSEVP: Why we solve it?

To obtain energy eigenstates in:

• Chemical cluster theory

• Electronic structure of semiconductors

• Ab-initio energy calculations of solids

H. Ltaief HPC Swiss 38 / 77

Page 43: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

GSEVP: How to solve it?

Ax = λBx

Operation | Explanation | LAPACK routine name
1. B = L × L^T | Cholesky factorization | POTRF
2. C = L^{-1} × A × L^{-T} | application of triangular factors | SYGST or HEGST
3. T = Q^T × C × Q | tridiagonal reduction | SYTRD or HETRD
4. Tx = λx | QR iteration | STERF

H. Ltaief HPC Swiss 39 / 77

Page 44: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

All computational stages: separately

[Three separate DAG traces, one per computational stage: 10, 11, and 11 levels respectively (each entry is level : task count per level)]

H. Ltaief HPC Swiss 40 / 77

Page 45: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

All computational stages: combined

• Dependencies are tracked inside PLASMA by QUARK.

[DAG trace of the three stages combined into a single DAG: 24 levels, with up to 11 concurrent tasks per level]

H. Ltaief HPC Swiss 41 / 77

Page 46: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Combining stages: matrix view

H. Ltaief HPC Swiss 42 / 77

Page 47: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Results on 4-socket AMD Magny Cours (48 cores)

[Time-to-solution plot: time in seconds (up to 2500) vs. matrix size (2000-16000) for PLASMA xSYGST + two-stage TRD, MKL xSYGST + MKL SBR, MKL xSYGST + MKL TRD, MKL xSYGST + Netlib SBR, and LAPACK xSYGST + LAPACK TRD]

H. Ltaief, P. Luszczek, A. Haidar, J. Dongarra – ParCo'11, Belgium
H. Ltaief HPC Swiss 43 / 77

Page 48: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Power Monitoring with PowerPack

H. Ltaief HPC Swiss 44 / 77

Page 49: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Power Rate of LAPACK TRD

[Power profile: power (0-300 Watts) vs. time (0-350 seconds) for System, CPU, Memory, Motherboard, and Fan]

H. Ltaief, P. Luszczek, J. Dongarra – EnaHPC'11, Germany
H. Ltaief HPC Swiss 45 / 77

Page 50: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Power Rate of PLASMA TRD

[Power profile: power (0-300 Watts) vs. time (0-350 seconds) for System, CPU, Memory, Motherboard, and Fan]

H. Ltaief, P. Luszczek, J. Dongarra – EnaHPC'11, Germany
H. Ltaief HPC Swiss 46 / 77

Page 51: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

QUARK and DVFS

H. Ltaief HPC Swiss 47 / 77

Page 52: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Outline

1 Motivations

2 Matrices Over Runtime Systems at Exascale

3 Cholesky-based Matrix Inversion and Generalized Symmetric Eigenvalue Problem

4 N-Body Simulations

5 Seismic Applications

6 Conclusion

H. Ltaief HPC Swiss 48 / 77

Page 53: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Astrophysics and Turbulence Applications w/ R. Yokota

H. Ltaief HPC Swiss 49 / 77

Page 54: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Data Driven Fast Multipole Method

H. Ltaief HPC Swiss 50 / 77

Page 55: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Dual Tree Traversal

H. Ltaief HPC Swiss 51 / 77

Page 56: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Adjustable Granularity

H. Ltaief HPC Swiss 52 / 77

Page 57: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Strong Scaling w/ QUARK

[Strong scaling plot: speedup (1-16) vs. threads (1-16) for Q=100, Q=1000, Q=10000]

H. Ltaief and R. Yokota – submitted to EuroPar’12

H. Ltaief HPC Swiss 53 / 77

Page 58: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Strong Scaling w/ QUARK

H. Ltaief and R. Yokota – submitted to EuroPar’12

H. Ltaief HPC Swiss 54 / 77

Page 59: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

What is next?

• Heterogeneous architectures w/ hardware accelerators (StarPU)

• Distributed memory systems (DAGuE)

• Implementation of the reduction operation

H. Ltaief HPC Swiss 55 / 77

Page 60: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Outline

1 Motivations

2 Matrices Over Runtime Systems at Exascale

3 Cholesky-based Matrix Inversion and Generalized Symmetric Eigenvalue Problem

4 N-Body Simulations

5 Seismic Applications

6 Conclusion

H. Ltaief HPC Swiss 56 / 77

Page 61: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Seismic Acquisition

H. Ltaief HPC Swiss 57 / 77

Page 62: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Standard Stencil Time Integration

• Reverse Time Migration

• Acoustic Wave Equation with constant density

• Numerical solution: Finite Difference

• Isotropic Case: Axis directions (X, Y, Z)

• Tilted Transverse Isotropic (TTI) Case: Axis directions (X, Y, Z) & Cross Derivatives (XY, YZ, XZ)

H. Ltaief HPC Swiss 58 / 77

Page 63: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Standard Stencil Time Integration

• Isotropic stencil

• TTI stencil

* Slide courtesy of S. Feki, TOTAL Houston
H. Ltaief HPC Swiss 59 / 77

Page 64: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Standard Stencil Time Integration

• Explicit time integration scheme with domain decomposition
• Order one in space and time

H. Ltaief HPC Swiss 60 / 77

Page 70: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

New Stencil Time Integration

• Communication-avoiding by reducing halo exchange frequency

• Synchronization-reducing by instruction reordering

H. Ltaief HPC Swiss 66 / 77

Page 77: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Stencil kernels

• Seismic kernels are typically order 2 in time and order 8 in space

• Two phases: calculate solutions inside a cone and then outside the cone

• The cone kernel becomes the fine-grain task to dynamically schedule

• No numerical instabilities added to the original scheme

• May have load imbalance due to the high compute intensity of the PML

• Similar (to some extent) to Demmel's approach on the matrix powers kernel

• Work in progress with A. Abdelfattah, PhD student, and Saudi Aramco

H. Ltaief HPC Swiss 73 / 77

Page 78: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Outline

1 Motivations

2 Matrices Over Runtime Systems at Exascale

3 Cholesky-based Matrix Inversion and Generalized Symmetric Eigenvalue Problem

4 N-Body Simulations

5 Seismic Applications

6 Conclusion

H. Ltaief HPC Swiss 74 / 77

Page 79: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Conclusion

• Dataflow programming models could play a major role in meeting exascale challenges

• Need efficient runtimes designed with high productivity in mind

• Need new flexible/asynchronous algorithms

• Applies to dense as well as sparse computations

• Need to determine an appropriate granularity to hide scheduling overhead

H. Ltaief HPC Swiss 75 / 77

Page 80: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Just in case you are wondering what’s beyond ExaFlops...

H. Ltaief HPC Swiss 76 / 77

Page 81: Recipe for High Performance Computing Software and

Credits Motivations MORSE T1 T2 T3 Conclusion

Thank you!

H. Ltaief HPC Swiss 77 / 77