Task-based sparse linear solvers for modern architectures
TRANSCRIPT
September 17, 2014
P. Ramet (and the HiePACS team)
4th HOSCAR workshop, Gramado, Brazil
P. Ramet, HiePACS team, Inria Bordeaux Sud-Ouest, LaBRI, Bordeaux University
Guideline
Introduction
Sparse direct factorization - PaStiX
Sparse solver on heterogeneous architectures
Hybrid methods - MaPHYS and HIPS
Low-rank compression - H-PaStiX
Conclusion
1. Introduction
Mixed/Hybrid direct-iterative methods
The "spectrum" of linear algebra solvers

Direct methods:
- Robust/accurate for general problems
- BLAS-3 based implementation
- Memory/CPU prohibitive for large 3D problems
- Limited parallel scalability

Iterative methods:
- Problem-dependent efficiency / controlled accuracy
- Only mat-vec products required, fine-grain computation
- Less memory consumption, possible trade-off with CPU
- Attractive "built-in" parallel features
2. Sparse direct factorization - PaStiX
Major steps for solving sparse linear systems
1. Analysis: the matrix is preprocessed to improve its structural properties (A'x' = b' with A' = P_n P D_r A D_c Q P^T)
2. Factorization: the matrix is factorized as A = LU, LL^T or LDL^T
3. Solve: the solution x is computed by means of forward and backward substitutions
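As a minimal illustration of these three phases (a sketch using SciPy's SuperLU wrapper, not the PaStiX API; the test matrix and the ordering choice are arbitrary):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

n = 1000
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csc")  # toy 1D Laplacian
b = np.ones(n)

# Analysis + factorization: the fill-reducing ordering (COLAMD here) and the
# numerical LU decomposition are both performed inside splu().
lu = splu(A, permc_spec="COLAMD")

# Solve: forward and backward substitutions with the computed factors.
x = lu.solve(b)
print("residual:", np.linalg.norm(A @ x - b))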
Direct Method and Nested Dissection
Supernodal methods
Definition: a supernode (or supervariable) is a set of contiguous columns in the factors L that share essentially the same sparsity structure.

- All algorithms (ordering, symbolic factorization, factorization, solve) are generalized to block versions.
- Use of efficient matrix-matrix kernels (improves cache usage).
- Same concept as supervariables for elimination tree / minimum degree ordering.
- Supernodes and pivoting: pivoting inside a supernode does not increase fill-in.
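To make the definition concrete, a small sketch (assuming SciPy, a diagonally dominant test matrix, and no row pivoting; real solvers use relaxed/amalgamated criteria rather than this exact-match test) that groups contiguous columns of a computed factor with identical below-diagonal patterns:

import scipy.sparse as sp
from scipy.sparse.linalg import splu

n = 300
A = sp.random(n, n, density=0.01, format="csc", random_state=0)
A = (A + A.T + n * sp.identity(n)).tocsc()                 # diagonally dominant
lu = splu(A, permc_spec="NATURAL", diag_pivot_thresh=0.0)  # keep diagonal pivots
L = lu.L.tocsc()

def supernodes(L):
    n = L.shape[1]
    below = [set(i for i in L.indices[L.indptr[j]:L.indptr[j + 1]] if i > j)
             for j in range(n)]
    groups, start = [], 0
    for j in range(1, n):
        # Column j extends the current supernode only if its below-diagonal
        # pattern equals that of column j-1 once row j joins the dense block.
        if below[j] != below[j - 1] - {j}:
            groups.append((start, j - 1))
            start = j
    groups.append((start, n - 1))
    return groups

print(supernodes(L)[:10])   # first supernodes as (first column, last column) pairs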
PaStiX main Features
- LL^T, LDL^T, LU: supernodal implementation (BLAS3)
- Static pivoting + refinement: CG/GMRES/BiCGstab
- Column-block or block mapping
- Simple/double precision + real/complex arithmetic
- MPI/threads (cluster/multicore/SMP/NUMA)
- Multiple GPUs using DAG runtimes
- Support for external ordering libraries (PT-Scotch or METIS ...)
- Multiple RHS (direct factorization)
- Incomplete factorization with ILU(k) preconditioner
- Schur complement computation
- C++/Fortran/Python/PETSc/Trilinos/FreeFem...
Current work
- Astrid Casadei (PhD student): memory optimization to build a Schur complement in PaStiX, tight coupling between sparse direct and iterative solvers (HIPS) + graph partitioning with balanced halo
- Xavier Lacoste (PhD student) and the MUMPS team: GPU optimizations for sparse factorizations with StarPU
- Stojce Nakov (PhD student): tight coupling between sparse direct and iterative solvers (MaPHYS) + GPU optimizations for GMRES with StarPU
- Mathieu Faverge (Assistant Professor): redesign of the PaStiX static/dynamic scheduling with PaRSEC in order to get a generic framework for multicore/MPI/GPU/out-of-core + low-rank compression
Direct Solver Highlights
Fusion - ITER
PaStiX is used in the JOREK code developed by G. Huysmans at CEA/Cadarache in a fully implicit time-evolution scheme for the numerical simulation of the ELM (Edge Localized Mode) instabilities commonly observed in the standard tokamak operating scenario.
MHD pellet injection simulated with the JOREK code
Tasks algorithms
POTRFTRSMSYRKGEMM
(a) Task for a dense tile (b) Task for a sparse supernode
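For reference, a dense right-looking blocked Cholesky written in terms of these four kernels (an illustrative numpy/scipy sketch; the sparse supernodal factorization applies the same kernels to the blocks of each supernode):

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def blocked_cholesky(A, nb=64):
    """Right-looking blocked Cholesky: POTRF on the diagonal tile, TRSM on the
    tiles below it, then a SYRK/GEMM update of the trailing submatrix."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        ke = min(k + nb, n)
        A[k:ke, k:ke] = cholesky(A[k:ke, k:ke], lower=True)          # POTRF
        if ke < n:
            A[ke:, k:ke] = solve_triangular(A[k:ke, k:ke],           # TRSM
                                            A[ke:, k:ke].T, lower=True).T
            A[ke:, ke:] -= A[ke:, k:ke] @ A[ke:, k:ke].T             # SYRK + GEMM
    return np.tril(A)

rng = np.random.default_rng(0)
M = rng.standard_normal((256, 256))
A = M @ M.T + 256 * np.eye(256)          # symmetric positive definite test matrix
L = blocked_cholesky(A)
print(np.linalg.norm(L @ L.T - A) / np.linalg.norm(A))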
DAG representation
[Figure (c): dense DAG - POTRF, TRSM, SYRK and GEMM tasks linked by data-dependency edges (T=>T, C=>A, C=>B, C=>C)]
[Figure (d): sparse DAG - panel(j) and gemm(k) tasks linked by data-dependency edges (A=>A, C=>A, C=>C)]
Direct Solver Highlights (cluster of multicore)
RC3 matrix - complex double precision - N = 730700 - NNZA = 41600758 - Fill-in = 50

Facto (s)    1 MPI   2 MPI   4 MPI   8 MPI
1 thread      6820    3520    1900    1890
6 threads     1020     639     337     287
12 threads     525     360     155     121

Mem (GB)     1 MPI   2 MPI   4 MPI   8 MPI
1 thread      34      19.2    12.5    9.22
6 threads     34.3    19.5    12.8    9.66
12 threads    34.6    19.7    13      9.14

Solve (s)    1 MPI   2 MPI   4 MPI   8 MPI
1 thread      6.97    3.75    1.93    1.03
6 threads     2.5     1.43    0.78    0.54
12 threads    1.33    0.93    0.66    0.59
NUMA-Aware Allocation (up to 20% efficiency)
Direct Solver Highlights (multicore)
SGI NUMA, 160 cores

Name   N          NNZA       Fill ratio   Fact
Audi   9.44x10^5  3.93x10^7  31.28        float LL^T
10M    1.04x10^7  8.91x10^7  75.66        complex LDL^T

10M          10     20     40     80     160
Facto (s)    3020   1750   654    356    260
Mem (GB)     122    124    127    133    146
Solve (s)    24.6   13.5   3.87   2.90   2.89

Audi         128    2x64   4x32   8x16
Facto (s)    17.8   18.6   13.8   13.4
Mem (GB)     13.4   2x7.68 4x4.54 8x2.69
Solve (s)    0.40   0.32   0.21   0.14
Communication schemes (up to 10% efficiency)
Dynamic scheduling (up to 15% efficiency)
Properties:
N      10 423 737
NNZA   89 072 871
NNZL   6 724 303 039
OPC    4.41834e+13

                     4x32   8x32
Static Scheduler     289    195
Dynamic Scheduler    240    184

Table: Factorization time in seconds

- Electromagnetism problem in double complex from CEA
- Cluster Vargas from IDRIS with 32 Power6 cores per node
Static Scheduling Gantt Diagram
- 10 Million test case on 4 MPI processes of 32 threads (color is the level in the tree)
Dynamic Scheduling Gantt Diagram
- Reduces time by 10-15%
Automatic Performance Analysis?
Block ILU(k): supernode amalgamation algorithm
Derive a block incomplete LU factorization from the supernodal parallel direct solver
- Based on the existing PaStiX package
- Level-3 BLAS incomplete factorization implementation
- Fill-in strategy based on level-of-fill among the block structures identified thanks to the quotient graph
- Amalgamation strategy to enlarge block sizes

Highlights
- Handles high levels of fill efficiently
- Solving time faster than with scalar ILU(k)
- Scalable parallel implementation
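For the general idea of an incomplete factorization used as a preconditioner (a SciPy sketch with scalar ILU and GMRES, not the block supernodal ILU(k) described above; the matrix and drop parameters are arbitrary):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spilu, gmres, LinearOperator

n = 5000
A = sp.diags([-1, 4, -1], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

ilu = spilu(A, drop_tol=1e-4, fill_factor=10)     # incomplete factors of A
M = LinearOperator(A.shape, matvec=ilu.solve)     # preconditioner M ~= A^-1

x, info = gmres(A, b, M=M)
print("info:", info, "residual:", np.linalg.norm(A @ x - b))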
3. Sparse solver on heterogeneous architectures
Multiple layer approach
[Figure: software stack - ALGORITHM on top of RUNTIME on top of KERNELS, targeting both GPU and CPU]
Governing ideas: enable advanced numerical algorithms to be executed on a scalable unified runtime system to exploit the full potential of future exascale machines.
Basics:
- Graph of tasks
- Out-of-order scheduling
- Fine granularity
DAG schedulers considered

StarPU
- RunTime team, Inria Bordeaux Sud-Ouest
- C. Augonnet, R. Namyst, S. Thibault
- Dynamic Task Discovery
- Computes cost models on the fly
- Multiple kernels on the accelerators
- Heterogeneous Earliest Finish Time (HEFT) scheduling strategy

PaRSEC (formerly DAGuE)
- ICL, University of Tennessee, Knoxville
- G. Bosilca, A. Bouteiller, A. Danalis, T. Herault
- Parameterized Task Graph
- Only the most compute-intensive kernel on the accelerators
- Simple scheduling strategy based on computing capabilities
- GPU multi-stream enabled
Supernodal sequential algorithm
forall the Supernode S1 do
    panel(S1);                    /* update of the panel */
    forall the extra-diagonal blocks Bi of S1 do
        S2 <- supernode in front of(Bi);
        gemm(S1, S2);             /* sparse GEMM B_{k,k>=i} x B_i^T subtracted from S2 */
    end
end
StarPU task submission

forall the Supernode S1 do
    submit panel(S1);             /* update of the panel */
    forall the extra-diagonal blocks Bi of S1 do
        S2 <- supernode in front of(Bi);
        submit gemm(S1, S2);      /* sparse GEMM B_{k,k>=i} x B_i^T subtracted from S2 */
    end
end
wait for all tasks();
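As a rough analogy for this submit-and-wait pattern (not the StarPU API: StarPU infers dependencies from data accesses, whereas this Python sketch wires only the panel-to-gemm dependency explicitly and ignores the gemm-to-next-panel ones):

from concurrent.futures import ThreadPoolExecutor, wait

def panel(s):                # placeholder for the panel factorization kernel
    return f"panel({s})"

def gemm(s1, s2, dep):       # placeholder update kernel, waits on its panel task
    dep.result()
    return f"gemm({s1},{s2})"

# Toy structure: supernode -> list of supernodes it updates.
supernodes = {0: [1, 2], 1: [2], 2: []}

with ThreadPoolExecutor() as pool:
    tasks = []
    for s1, targets in supernodes.items():
        p = pool.submit(panel, s1)                 # submit panel(S1)
        tasks.append(p)
        for s2 in targets:                         # submit gemm(S1, S2)
            tasks.append(pool.submit(gemm, s1, s2, p))
    wait(tasks)                                    # wait for all tasks
print([t.result() for t in tasks])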
PaRSEC’s parameterized task graph
[Figure: task graph - panel(j) and gemm(k) tasks linked by data-dependency edges (A=>A, C=>A, C=>C)]
panel(j)

/* Execution Space */
j = 0 .. cblknbr-1

/* Task Locality (Owner Compute) */
: A(j)

/* Data dependencies */
RW A <- (leaf) ? A(j) : C gemm(lastbrow)
     -> A gemm(firstblock+1 .. lastblock)
     -> A(j)

Panel Factorization in JDF Format
Matrices and Machines
Matrix     Prec   Method   Size     nnzA    nnzL      TFlop
FilterV2   Z      LU       0.6e+6   12e+6   536e+6    3.6
Flan       D      LL^T     1.6e+6   59e+6   1712e+6   5.3
Audi       D      LL^T     0.9e+6   39e+6   1325e+6   6.5
MHD        D      LU       0.5e+6   24e+6   1133e+6   6.6
Geo1438    D      LL^T     1.4e+6   32e+6   2768e+6   23
Pmldf      Z      LDL^T    1.0e+6   8e+6    1105e+6   28
Hook       D      LU       1.5e+6   31e+6   4168e+6   35
Serena     D      LDL^T    1.4e+6   32e+6   3365e+6   47

Table: Matrix description (Z: double complex, D: double real)

Processors                          Frequency   GPUs               RAM
Westmere Intel Xeon X5650 (2 x 6)   2.67 GHz    Tesla M2070 (x3)   36 GB
CPU scaling study: GFlop/s for numerical factorization
[Figure: factorization performance in GFlop/s on afshell10 (D, LU), FilterV2 (Z, LU), Flan (D, LL^T), audi (D, LL^T), MHD (D, LU), Geo1438 (D, LL^T), pmlDF (Z, LDL^T) and HOOK (D, LU), comparing the native PaStiX scheduler, StarPU and PaRSEC with 1, 3, 6, 9 and 12 cores]
GPU scaling study: GFlop/s for numerical factorization
[Figure: factorization performance in GFlop/s, first on HOOK (D, LU) and then on the full matrix set, comparing the native scheduler (CPU only), StarPU (CPU only, 1, 2 or 3 GPUs), PaRSEC with 1 stream (CPU only, 1, 2 or 3 GPUs) and PaRSEC with 3 streams (1, 2 or 3 GPUs)]
Xeon Phi (from the Max-Planck-Institut for Plasma Physics)
[Phi = 4 Xeon cores, GPU = 6 Xeon cores]
Distributed implementation
[Figure: column blocks C0, C1, C2 and C3 distributed across processes P1 and P2]
Distributed (preliminary) results (cluster of multicore)
[Figure: factorization performance in GFlop/s on afshell10 (D, LU), FilterV2 (Z, LU), Flan (D, LL^T), audi (D, LL^T), MHD (D, LU), Geo1438 (D, LL^T), pmlDF (Z, LDL^T), HOOK (D, LU) and Serena (D, LDL^T), comparing the native scheduler, StarPU fan-in and StarPU fan-out with 4, 8, 16 and 32 MPI processes of 8 threads each]
4. Hybrid methods - MaPHYS and HIPS
Hybrid Linear Solvers
Develop robust, scalable, parallel hybrid direct/iterative linear solvers
- Exploit the efficiency and robustness of sparse direct solvers
- Develop robust parallel preconditioners for iterative solvers
- Take advantage of the scalable implementation of iterative solvers

Domain Decomposition (DD)
- Natural approach for PDEs
- Extends to general sparse matrices
- Partition the problem into subdomains
- Use a direct solver on the subdomains
- Robust preconditioned iterative solver
Method used in MaPHYS
- Partitioning of the global matrix into several local matrices
  - MeTiS [G. Karypis and V. Kumar]
  - Scotch [Pellegrini et al.]
- Local factorization
  - MUMPS [P. Amestoy et al.] (with Schur option)
  - PaStiX [P. Ramet et al.] (with Schur option and multi-threaded version)
- Construction of the preconditioner
  - MKL library
- Solution of the reduced system
  - CG/GMRES/FGMRES on the reduced system

[Figure: a 400x400 example matrix (nz = 1920) and its graph, partitioned into subdomains Ω1, Ω2, Ω3, Ω4 with interface Γ; 0, 48 and 88 cut edges as the decomposition is refined]
HIPS: hybrid direct-iterative solver

Based on a domain decomposition: the interface is one node wide (no overlap in DD lingo).

    A = ( A_B   F   )
        ( E     A_C )

B: interior nodes of the subdomains (direct factorization).
C: interface nodes.

Special decomposition and ordering of the subset C.
Goal: building a global Schur complement preconditioner (ILU) from the local domain matrices only.
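A toy sketch of the underlying interface reduction (a single "subdomain" with SciPy, purely illustrative; MaPHYS and HIPS work from local subdomain matrices rather than the global matrix as done here):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu, gmres

n, ni = 400, 40                                   # total unknowns, interface unknowns
A = sp.diags([-1, 4, -1], [-1, 0, 1], shape=(n, n), format="csc")
B = np.arange(n - ni)                             # interior indices
C = np.arange(n - ni, n)                          # interface indices
A_B, F = A[B][:, B].tocsc(), A[B][:, C].toarray()
E, A_C = A[C][:, B].toarray(), A[C][:, C].toarray()
b = np.ones(n); b_B, b_C = b[B], b[C]

lu = splu(A_B)                                    # direct factorization of the interior block
S = A_C - E @ lu.solve(F)                         # Schur complement S = A_C - E A_B^-1 F
rhs = b_C - E @ lu.solve(b_B)
x_C, info = gmres(S, rhs)                         # iterate on the reduced interface system
x_B = lu.solve(b_B - F @ x_C)                     # back-substitute for the interior unknowns
x = np.empty(n); x[B], x[C] = x_B, x_C
print("residual:", np.linalg.norm(A @ x - b))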
HIPS: preconditioners
Main features
- Iterative or "hybrid" direct/iterative methods are implemented.
- Mix direct supernodal (BLAS-3) and sparse ILUT factorizations in a seamless manner.
- Memory/load balancing: distribute the domains over the processors (domains > processors).
5. Low-rank compression - H-PaStiX
Toward low-rank compression in a supernodal solver
Many works on hierarchical matrices and direct solvers:
- Eric Darve: hierarchical matrix classifications (Building O(N) Linear Solvers Using Nested Dissection)
- Sherry Li: multifrontal solver + HSS (Towards an Optimal-Order Approximate Sparse Factorization Exploiting Data-Sparseness in Separators)
- David Bindel: CHOLMOD + low rank (An Efficient Solver for Sparse Linear Systems Based on Rank-Structured Cholesky Factorization)
- Jean-Yves L'Excellent: MUMPS + Block Low-Rank
Symbolic factorization
Nested dissection, 2D mesh/matrix
Computational cost (O(N^2) for 3D PDEs)
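The classical argument behind this bound (the standard nested-dissection estimate, recalled here rather than taken from the slide): for a 3D mesh with N = n^3 unknowns, the top-level separator contains about n^2 = N^(2/3) vertices, and its dense factorization costs on the order of (N^(2/3))^3 = N^2 operations; this term dominates, giving O(N^2) flops and O(N^(4/3)) memory for the factors in 3D (versus O(N^(3/2)) flops and O(N log N) memory in 2D).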
Low-Rank Compression of Supernodes
FastLA associate team between INRIA/Berkeley/Stanford
Supernodal Solver - Hierarchical Matrices O(N log^a(N))

1. Check the potential compression ratio on top-level blocks
2. Develop a prototype with:
   - low-rank compression on the larger supernodes
   - a compression tree built at each update
   - a complexity analysis of the approach
3. Study the coupling between nested dissection and compression tree ordering

Which algorithm to find the low-rank approximation? SVD, RR-LU, RR-QR, ACA, CUR, Random ...
Which family of hierarchical matrices? H, H2, HODLR ...
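As a minimal illustration of the compression question (a truncated SVD of a smooth off-diagonal interaction block, assuming numpy; the production choice among SVD, RR-QR, ACA, etc. is exactly what is left open above):

import numpy as np

# Toy "admissible" block: interactions between two well-separated 1D clusters,
# so the block has low numerical rank.
x = np.linspace(0.0, 1.0, 200)
y = np.linspace(5.0, 6.0, 150)
B = 1.0 / np.abs(x[:, None] - y[None, :])

U, s, Vt = np.linalg.svd(B, full_matrices=False)
tol = 1e-8 * s[0]
r = int(np.sum(s > tol))                      # numerical rank at tolerance tol
Br = (U[:, :r] * s[:r]) @ Vt[:r]              # rank-r approximation U_r S_r V_r^T

print("rank:", r, "of", min(B.shape))
print("relative error:", np.linalg.norm(B - Br) / np.linalg.norm(B))
print("storage ratio:", r * (B.shape[0] + B.shape[1]) / B.size)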
6. Conclusion
Software

Graph/mesh partitioning and ordering:
http://scotch.gforge.inria.fr

Sparse linear system solvers:
http://pastix.gforge.inria.fr
http://hips.gforge.inria.fr
https://wiki.bordeaux.inria.fr/maphys/doku.php
Software

Fast Multipole Method:
http://scalfmm-public.gforge.inria.fr/

Matrices Over Runtime Systems (with the University of Tennessee):
http://icl.cs.utk.edu/projectsdev/morse
Thank you - Questions?
Pierre Ramet, HiePACS
http://www.labri.fr/~ramet