recipe for high performance computing software and
TRANSCRIPT
![Page 1: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/1.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Recipe for High Performance ComputingSoftware and Application at Exascale
Hatem LtaiefKAUST Supercomputing Lab
HPC Advisory Council Switzerland Conference
March 13-15, 2012Lugano, Switzerland
H. Ltaief HPC Swiss 1 / 77
![Page 2: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/2.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Credits
H. Ltaief HPC Swiss 2 / 77
![Page 3: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/3.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Credits
H. Ltaief HPC Swiss 2 / 77
![Page 4: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/4.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Outline
1 Motivations
2 Matrices Over Runtime Systems at Exascale
3 Cholesky-based Matrix Inversion and Generalized SymmetricEigenvalue Problem
4 N-Body Simulations
5 Seismic Applications
6 Conclusion
H. Ltaief HPC Swiss 3 / 77
![Page 5: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/5.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Outline
1 Motivations
2 Matrices Over Runtime Systems at Exascale
3 Cholesky-based Matrix Inversion and Generalized SymmetricEigenvalue Problem
4 N-Body Simulations
5 Seismic Applications
6 Conclusion
H. Ltaief HPC Swiss 4 / 77
![Page 6: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/6.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
The ”K” Computer
H. Ltaief HPC Swiss 5 / 77
![Page 7: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/7.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
HPC Recipe For Exascale Computing
H. Ltaief HPC Swiss 6 / 77
![Page 8: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/8.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
HPC Recipe For Exascale Computing
H. Ltaief HPC Swiss 7 / 77
![Page 9: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/9.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
HPC Recipe For Exascale Computing
Fine-grain
parallelism Data
Motio
n
Reduci
ng
Synch
roniza
tion
Reduci
ng
Auto-Tuning
Power
efficiency
Dynamic runtime systems
Hardware and SoftwareCo-design
1018
H. Ltaief HPC Swiss 8 / 77
![Page 10: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/10.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
A Look Back...
Software infrastructure and algorithmic design follow hardwareevolution in time:
• 70’s - LINPACK, vector operations:Level-1 BLAS operation
• 80’s - LAPACK, block,cache-friendly:Level-3 BLAS operation
• 90’s - ScaLAPACK, distributedmemory:PBLAS Message passing
• 00’s:• PLASMA, MAGMA, many
x86/cuda cores friendly:DAG scheduler, tile data layout, someextra kernels
H. Ltaief HPC Swiss 9 / 77
![Page 11: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/11.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
A Look Back...
Software infrastructure and algorithmic design follow hardwareevolution in time:
• 70’s - LINPACK, vector operations:Level-1 BLAS operation
• 80’s - LAPACK, block,cache-friendly:Level-3 BLAS operation
• 90’s - ScaLAPACK, distributedmemory:PBLAS Message passing
• 00’s:• PLASMA, MAGMA, many
x86/cuda cores friendly:DAG scheduler, tile data layout, someextra kernels
H. Ltaief HPC Swiss 9 / 77
![Page 12: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/12.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
A Look Back...
Software infrastructure and algorithmic design follow hardwareevolution in time:
• 70’s - LINPACK, vector operations:Level-1 BLAS operation
• 80’s - LAPACK, block,cache-friendly:Level-3 BLAS operation
• 90’s - ScaLAPACK, distributedmemory:PBLAS Message passing
• 00’s:• PLASMA, MAGMA, many
x86/cuda cores friendly:DAG scheduler, tile data layout, someextra kernels
H. Ltaief HPC Swiss 9 / 77
![Page 13: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/13.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
A Look Back...
Software infrastructure and algorithmic design follow hardwareevolution in time:
• 70’s - LINPACK, vector operations:Level-1 BLAS operation
• 80’s - LAPACK, block,cache-friendly:Level-3 BLAS operation
• 90’s - ScaLAPACK, distributedmemory:PBLAS Message passing
• 00’s:• PLASMA, MAGMA, many
x86/cuda cores friendly:DAG scheduler, tile data layout, someextra kernels
H. Ltaief HPC Swiss 9 / 77
![Page 14: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/14.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Block Algorithms
• Panel-Update Sequence
• Transformations are blocked/accumulated within the Panel(Level 2 BLAS)
• Transformations applied at once on the trailing submatrix(Level 3 BLAS)
• Parallelism hidden inside the BLAS
• Fork-join Model
H. Ltaief HPC Swiss 10 / 77
![Page 15: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/15.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
One-Sided Block Algorithms: LU
H. Ltaief HPC Swiss 11 / 77
![Page 16: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/16.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Block Algorithms: Fork-Join Paradigm
H. Ltaief HPC Swiss 12 / 77
![Page 17: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/17.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
PLASMA: Tile Algorithms
PLASMA: Parallel Linear Algebra for Scalable Multi-coreArchitectures=⇒ http://icl.cs.utk.edu/plasma
• Parallelism is brought to the fore
• May require the redesign of linear algebra algorithms
• Remove unnecessary synchronization points betweenPanel-Update sequences
• DAG execution where nodes represent tasks and edges definedependencies between them
• Dynamic runtime system environment QUARK
• Tile data layout translation
H. Ltaief HPC Swiss 13 / 77
![Page 18: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/18.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Tile Data Layout Format
LAPACK: column-major format PLASMA: tile format
H. Ltaief HPC Swiss 14 / 77
![Page 19: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/19.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
MAGMA: x86 + MultiGPUs
MAGMA: Matrix Algebra on GPU and Multicore Architectures=⇒ http://icl.cs.utk.edu/magma
• Lessons Learned from PLASMA!
• CUDA-based hybrid systems
• New high performance numerical kernels
• StarPU Runtime System (INRIA, Bordeaux)
• Both: x86 and GPUs =⇒ Hybrid Computations
• Similar to LAPACK in functionality
H. Ltaief HPC Swiss 15 / 77
![Page 20: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/20.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Dynamic Runtime System
• Conceptually similar to out-of-order processor scheduling
because it has:
• Dynamic runtime DAG scheduler
• Out-of-order execution flow of fine-grained tasks
• Task scheduling as soon as dependencies are satisfied
• Producer-Consumer
H. Ltaief HPC Swiss 16 / 77
![Page 21: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/21.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
DataFlow Programming
• Five decades OLD concept• Programming paradigm that models a program as a directed
graph of the data flowing between operations (cf. Wikipedia)• Think ”how things connect” rather than ”how things happen”• Assembly line• Inherently parallel
H. Ltaief HPC Swiss 17 / 77
![Page 22: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/22.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Outline
1 Motivations
2 Matrices Over Runtime Systems at Exascale
3 Cholesky-based Matrix Inversion and Generalized SymmetricEigenvalue Problem
4 N-Body Simulations
5 Seismic Applications
6 Conclusion
H. Ltaief HPC Swiss 18 / 77
![Page 23: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/23.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Matrices Over Runtime Systems at Exascale
• Mission statement: ”Design dense and sparse linear algebramethods that achieve the fastest possible time to an accuratesolution on large-scale Hybrid systems”.
• Runtime challenges due to the ever growing hardwarecomplexity.
• Algorithmic challenges to exploit the hardware capabilities atmost.
H. Ltaief HPC Swiss 19 / 77
![Page 24: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/24.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
QUARK
• From Sequential Nested-Loop Code to Parallel Execution
• Task-based parallelism
• Out-of-order dynamic scheduling
• Scheduling a Window of Tasks
• Data Locality and Cache Reuse
• High user-productivity
• Shipped within PLASMA but standalone project
H. Ltaief HPC Swiss 20 / 77
![Page 25: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/25.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
QUARK
H. Ltaief HPC Swiss 21 / 77
![Page 26: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/26.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
QUARK
int QUARK_core_dpotrf( Quark *quark, char uplo, int n,
double *A, int lda, int *info )
{
QUARK_Insert_Task( quark, TASK_core_dpotrf, 0x00,
sizeof(char), &uplo, VALUE,
sizeof(int), &n, VALUE,
sizeof(double)*n*n, A, INOUT | LOCALITY,
sizeof(int), &lda, VALUE,
sizeof(int), info, OUTPUT,
0);
}
void TASK_core_dpotrf(Quark *quark)
{
char uplo; int n; double *A; int lda; int *info;
quark_unpack_args_5( quark, uplo, n, A, lda, info );
dpotrf_( &uplo, &n, A, &lda, info );
}
H. Ltaief HPC Swiss 22 / 77
![Page 27: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/27.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
StarPU
• RunTime which provides:=⇒ Task scheduling=⇒ Memory management
• Supports:=⇒ SMP/MulticoreProcessors (x86, PPC, . . . )=⇒ NVIDIA GPUs (e.g.heterogeneous multi-GPU)=⇒ OpenCL devices=⇒ Cell Processors(experimental)
H. Ltaief HPC Swiss 23 / 77
![Page 28: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/28.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
StarPU
starpu_Insert_Task(&cl_dpotrf,
VALUE, &uplo, sizeof(char),
VALUE, &n, sizeof(int),
INOUT, Ahandle(k, k),
VALUE, &lda, sizeof(int),
OUTPUT, &info, sizeof(int)
CALLBACK, profiling?cl_dpotrf_callback:NULL, NULL,
0);
H. Ltaief HPC Swiss 24 / 77
![Page 29: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/29.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
SMPSs
• Compiler technology.
• Task parameters and directionality defined by the userthrough pragmas
• Translates C codes with pragma annotations to standard C99code
• Embedded Locality optimizations
• Data renaming feature to reduce dependencies, leaving onlythe true dependencies.
H. Ltaief HPC Swiss 25 / 77
![Page 30: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/30.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
SMPSs
#pragma css task input(A[NB][NB]) inout(T[NB][NB])void dsyrk(double *A, double *T);
#pragma css task inout(T[NB][NB])void dpotrf(double *T);
#pragma css task input(A[NB][NB], B[NB][NB]) inout(C[NB][NB])void dgemm(double *A, double *B, double *C);
#pragma css task input(T[NB][NB]) inout(B[NB][NB])void dtrsm(double *T, double *C);
#pragma css start for (k = 0; k < TILES; k++) {
for (n = 0; n < k; n++) dsyrk(A[k][n], A[k][k]); dpotrf(A[k][k]);
for (m = k+1; m < TILES; m++) { for (n = 0; n < k; n++) dgemm(A[k][n], A[m][n], A[m][k]); dtrsm(A[k][k], A[m][k]); }}#pragma css finish
H. Ltaief HPC Swiss 26 / 77
![Page 31: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/31.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Standardization???
• Efforts to define an API standard for these runtime systems.
• Difficult task...
• But worth the time and sacrifice when it comes to making endusers life easier.
H. Ltaief HPC Swiss 27 / 77
![Page 32: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/32.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
DAGuE
DAGuE: Directed Acyclic Graph Unified Environment=⇒ http://icl.cs.utk.edu/dague
• Compiler technology.
• Converting Sequential Code to a DAG representation.
• Parametrized DAG scheduler for distributed memory systems.
• Engine of DPLASMA library
H. Ltaief HPC Swiss 28 / 77
![Page 33: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/33.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
DAGuE
H. Ltaief HPC Swiss 29 / 77
![Page 34: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/34.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Outline
1 Motivations
2 Matrices Over Runtime Systems at Exascale
3 Cholesky-based Matrix Inversion and Generalized SymmetricEigenvalue Problem
4 N-Body Simulations
5 Seismic Applications
6 Conclusion
H. Ltaief HPC Swiss 30 / 77
![Page 35: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/35.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
A−1, Seriously???
• YES!
• Critical component of the variance-covariance matrixcomputation in statistics (cf. Higham, Accuracy and Stabilityof Numerical Algorithms, Second Edition, SIAM, 2002)
• A is a dense symmetric matrix
• Three steps:1 Cholesky factorization (DPOTRF)2 Inverting the Cholesky factor (DTRTRI)3 Calculating the product of the inverted Cholesky factor with its
transpose (DLAUUM)
• StarPU runtime used here
H. Ltaief HPC Swiss 31 / 77
![Page 36: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/36.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
A−1, Hybrid Architecture Targeted
• =⇒ PCI Interconnect 16X 64Gb/s, very thin pipe!
• =⇒ Fermi C2050 448 cuda cores 515 Gflop/s
H. Ltaief HPC Swiss 32 / 77
![Page 37: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/37.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
How to choose NB (Tile size)? Auto-Tuning??? :)
H. Ltaief HPC Swiss 33 / 77
![Page 38: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/38.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Auto-Tuning
H. Ltaief HPC Swiss 34 / 77
![Page 39: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/39.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Auto-Tuning
* Slide courtesy of J. Kurzak, UTK
H. Ltaief HPC Swiss 35 / 77
![Page 40: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/40.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
A−1, Preliminary Results
0.5 1 1.5 2 2.5
x 104
50
100
150
200
250
300
350
400
450
500
Matrix size
Gflo
p/s
Tile Hybrid CPU−GPUMAGMAPLASMALAPACK
H. Ibeid, D. Kaushik, D. Keyes and H. Ltaief, HIPC’11, IndiaH. Ltaief HPC Swiss 36 / 77
![Page 41: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/41.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
GSEVP: What we solve?
Ax = λBx
A,B ∈ Rn×n, x ∈ Rn, λ ∈ R or A,B ∈ Cn×n, x ∈ Cn, λ ∈ R
A = AT or A = AH A is symmetric or HermitianxBxH > 0 B is symmetric positive definite
H. Ltaief HPC Swiss 37 / 77
![Page 42: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/42.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
GSEVP: Why we solve it?
To obtain energy eigenstates in:
• Chemical cluster theory
• Electronic structure of semiconductors
• Ab-initio energy calculations of solids
H. Ltaief HPC Swiss 38 / 77
![Page 43: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/43.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
GSEVP: How to solve it?
Ax = λBxOperation Explanation LAPACK routine name
1 B = L× LT Cholesky factorization POTRF
2 C = L−1 × A× L−T application of triangular factors SYGSTor HEGST
3 T = QT × C × Q tridiagonal reduction SYEVD or HEEVD
4 Tx = λx QR iteration STERF
H. Ltaief HPC Swiss 39 / 77
![Page 44: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/44.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
All computational stages: separately
1:1
2:3
3:6
4:1
5:2
6:3
7:1
8:1
9:1
10:1
1:1
2:2
3:3
4:2
5:2
6:1
7:2
8:1
9:1
10:1
11:1
1:1
2:2
3:3
4:2
5:2
6:1
7:2
8:1
9:1
10:1
11:1
H. Ltaief HPC Swiss 40 / 77
![Page 45: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/45.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
All computational stages: combined
• Dependencies are tracked inside PLASMA by QUARK.
1:1
2:4
3:9
4:4
5:11
6:8
7:6
8:5
9:7
10:4
11:4
12:2
13:2
14:3
15:3
16:1
17:2
18:1
19:1
20:1
21:1
22:1
23:1
24:1
H. Ltaief HPC Swiss 41 / 77
![Page 46: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/46.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Combining stages: matrix view
H. Ltaief HPC Swiss 42 / 77
![Page 47: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/47.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Results on 4-socket AMD Magny Cours (48 cores)
2000 4000 6000 8000 10000 12000 14000 16000
500
1000
1500
2000
2500
Matrix size
Tim
e in
sec
onds
PLASMA xSYGST + two−stage TRDMKL xSYGST + MKL SBRMKL xSYGST + MKL TRDMKL xSYGST + Netlib SBRLAPACK xSYGST + LAPACK TRD
H. Ltaief, P. Luszczek, A. Haidar, J. Dongarra – ParCo’11, BelgiumH. Ltaief HPC Swiss 43 / 77
![Page 48: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/48.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Power Monitoring with PowerPack
H. Ltaief HPC Swiss 44 / 77
![Page 49: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/49.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Power Rate of LAPACK TRD
0
50
100
150
200
250
300
0 50 100 150 200 250 300 350
Pow
er
(Watts)
Time (seconds)
SystemCPU
MemoryMotherboad
Fan
H. Ltaief, P. Luszczek, J. Dongarra – EnaHPC’11, GermanyH. Ltaief HPC Swiss 45 / 77
![Page 50: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/50.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Power Rate of PLASMA TRD
0
50
100
150
200
250
300
0 50 100 150 200 250 300 350
Pow
er
(Watts)
Time (seconds)
SystemCPU
MemoryMotherboad
Fan
H. Ltaief, P. Luszczek, J. Dongarra – EnaHPC’11, GermanyH. Ltaief HPC Swiss 46 / 77
![Page 51: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/51.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
QUARK and DVFS
H. Ltaief HPC Swiss 47 / 77
![Page 52: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/52.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Outline
1 Motivations
2 Matrices Over Runtime Systems at Exascale
3 Cholesky-based Matrix Inversion and Generalized SymmetricEigenvalue Problem
4 N-Body Simulations
5 Seismic Applications
6 Conclusion
H. Ltaief HPC Swiss 48 / 77
![Page 53: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/53.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Astrophysics and Turbulence Applications w/ R. Yokota
H. Ltaief HPC Swiss 49 / 77
![Page 54: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/54.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Data Driven Fast Multipole Method
H. Ltaief HPC Swiss 50 / 77
![Page 55: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/55.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Dual Tree Traversal
H. Ltaief HPC Swiss 51 / 77
![Page 56: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/56.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Adjustable Granularity
H. Ltaief HPC Swiss 52 / 77
![Page 57: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/57.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Strong Scaling w/ QUARK
1 2 4 8 161
2
4
8
16
Threads
Spee
dup
Q=100Q=1000Q=10000
H. Ltaief and R. Yokota – submitted to EuroPar’12
H. Ltaief HPC Swiss 53 / 77
![Page 58: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/58.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Strong Scaling w/ QUARK
H. Ltaief and R. Yokota – submitted to EuroPar’12
H. Ltaief HPC Swiss 54 / 77
![Page 59: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/59.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
What is next?
• Heterogeneous architecture w/ hardware accelerators(StarPU)
• Distributed memory systems (DAGuE)
• Implementation of the reduction operation
H. Ltaief HPC Swiss 55 / 77
![Page 60: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/60.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Outline
1 Motivations
2 Matrices Over Runtime Systems at Exascale
3 Cholesky-based Matrix Inversion and Generalized SymmetricEigenvalue Problem
4 N-Body Simulations
5 Seismic Applications
6 Conclusion
H. Ltaief HPC Swiss 56 / 77
![Page 61: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/61.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Seismic Acquisition
H. Ltaief HPC Swiss 57 / 77
![Page 62: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/62.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Standard Stencil Time Integration
• Reverse Time Migration
• Acoustic Wave Equation with constant density
• Numerical solution: Finite Difference
• Isotropic Case: Axis directions (X,Y,Z)
• Tilted Transverse Isotropic (TTI) Case: Axis directions(X,Y,Z) & Cross Derivatives (XY, YZ, XZ)
H. Ltaief HPC Swiss 58 / 77
![Page 63: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/63.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Standard Stencil Time Integration
• Isotropic stencil
• TTI stencil
* Slide courtesy of S. Feki, TOTAL HoustonH. Ltaief HPC Swiss 59 / 77
![Page 64: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/64.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Standard Stencil Time Integration
• Explicit time integration scheme with domain decomposition• Order one in space and time
H. Ltaief HPC Swiss 60 / 77
![Page 65: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/65.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Standard Stencil Time Integration
• Explicit time integration scheme with domain decomposition• Order one in space and time
H. Ltaief HPC Swiss 61 / 77
![Page 66: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/66.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Standard Stencil Time Integration
• Explicit time integration scheme with domain decomposition• Order one in space and time
H. Ltaief HPC Swiss 62 / 77
![Page 67: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/67.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Standard Stencil Time Integration
• Explicit time integration scheme with domain decomposition• Order one in space and time
H. Ltaief HPC Swiss 63 / 77
![Page 68: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/68.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Standard Stencil Time Integration
• Explicit time integration scheme with domain decomposition• Order one in space and time
H. Ltaief HPC Swiss 64 / 77
![Page 69: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/69.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Standard Stencil Time Integration
• Explicit time integration scheme with domain decomposition• Order one in space and time
H. Ltaief HPC Swiss 65 / 77
![Page 70: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/70.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
New Stencil Time Integration
• Communication-avoiding by reducing halo exchangesfrequency
• Synchronization-reducing by instruction reordering
H. Ltaief HPC Swiss 66 / 77
![Page 71: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/71.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
New Stencil Time Integration
• Communication-avoiding by reducing halo exchangesfrequency
• Synchronization-reducing by instruction reordering
H. Ltaief HPC Swiss 67 / 77
![Page 72: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/72.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
New Stencil Time Integration
• Communication-avoiding by reducing halo exchangesfrequency
• Synchronization-reducing by instruction reordering
H. Ltaief HPC Swiss 68 / 77
![Page 73: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/73.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
New Stencil Time Integration
• Communication-avoiding by reducing halo exchangesfrequency
• Synchronization-reducing by instruction reordering
H. Ltaief HPC Swiss 69 / 77
![Page 74: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/74.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
New Stencil Time Integration
• Communication-avoiding by reducing halo exchangesfrequency
• Synchronization-reducing by instruction reordering
H. Ltaief HPC Swiss 70 / 77
![Page 75: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/75.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
New Stencil Time Integration
• Communication-avoiding by reducing halo exchangesfrequency
• Synchronization-reducing by instruction reordering
H. Ltaief HPC Swiss 71 / 77
![Page 76: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/76.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
New Stencil Time Integration
• Communication-avoiding by reducing halo exchangesfrequency
• Synchronization-reducing by instruction reordering
H. Ltaief HPC Swiss 72 / 77
![Page 77: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/77.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Stencil kernels
• Seismic kernels are typically order 2 in time and order 8 inspace
• Two phases: calculate solutions inside a cone and thenoutside the cone
• The cone kernel becomes the fine-grain task to dynamicallyschedule
• No numerical instabilities added to the original scheme
• May have load imbalance due to PML high compute intensity
• Similar (to some extend) to Demmel’s approach on the Matrixpower kernel.
• Work in progress with A. Abdelfattah, PhD Student andSaudi Aramco
H. Ltaief HPC Swiss 73 / 77
![Page 78: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/78.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Outline
1 Motivations
2 Matrices Over Runtime Systems at Exascale
3 Cholesky-based Matrix Inversion and Generalized SymmetricEigenvalue Problem
4 N-Body Simulations
5 Seismic Applications
6 Conclusion
H. Ltaief HPC Swiss 74 / 77
![Page 79: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/79.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Conclusion
• Dataflow programming models could play a major role forexascale challenges
• Need efficient runtime with high productivity in mind
• Need new flexible/asynchronous algorithms
• Work potentially for dense as well as sparse computations
• Need to determine an appropriate granularity to hidescheduling overhead
H. Ltaief HPC Swiss 75 / 77
![Page 80: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/80.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Just in case you are wondering what’s beyond ExaFlops...
H. Ltaief HPC Swiss 76 / 77
![Page 81: Recipe for High Performance Computing Software and](https://reader030.vdocuments.us/reader030/viewer/2022020916/61a2fc123404c75c2c2d9581/html5/thumbnails/81.jpg)
Credits Motivations MORSE T1 T2 T3 Conclusion
Thank you!
H. Ltaief HPC Swiss 77 / 77