SuperMatrix on Heterogeneous Platforms
Jianyu Huang, SHPC, UT Austin
TRANSCRIPT
FLAME Answer: SuperMatrix
• Programmability: use the tools provided by FLAME.
• Parallelism: directed acyclic graph (DAG) scheduling.
[Figure: the software stack. SuperMatrix sits on top of libflame. For CPU/MIC platforms, libflame dispatches to BLIS (C/assembly), MKL (C/Fortran/assembly), or ACML (C/assembly/Fortran), each parallelized with OpenMP/pthreads; for GPUs, to cuBLAS (CUDA) or clBLAS (OpenCL); BLIS (C/assembly) also targets accelerators and other platforms. "One ring to rule them all."]
FLAME Answer: SuperMatrix
• Chan, E., Quintana-Ortí, E. S., Quintana-Ortí, G., and van de Geijn, R. SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In SPAA '07: Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116-125, San Diego, CA, USA, June 9-11, 2007.
• Chan, E., Van Zee, F. G., Quintana-Ortí, E. S., Quintana-Ortí, G., and van de Geijn, R. Satisfying your dependencies with SuperMatrix. In Cluster '07: Proceedings of the 2007 IEEE International Conference on Cluster Computing, pages 91-99, Austin, TX, USA, September 17-20, 2007.
• Chan, E., Van Zee, F. G., Bientinesi, P., Quintana-Ortí, E. S., Quintana-Ortí, G., and van de Geijn, R. SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks. In PPoPP '08: Proceedings of the Thirteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 123-132, Salt Lake City, UT, USA, February 20-23, 2008.
• Quintana-Ortí, G., Igual, F. D., Quintana-Ortí, E. S., and van de Geijn, R. Solving dense linear systems on platforms with multiple hardware accelerators. In PPoPP '09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009.
• Quintana-Ortí, G., Quintana-Ortí, E. S., van de Geijn, R., Van Zee, F. G., and Chan, E. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software, 36(3):14:1-14:26, July 2009.
• Chan, E. "Application of Dependence Analysis and Runtime Data Flow Graph Scheduling to Matrix Computations." Ph.D. dissertation, Department of Computer Science, The University of Texas at Austin.
• Quintana-Ortí, G., Igual, F. D., Marqués, M., Quintana-Ortí, E. S., and van de Geijn, R. A runtime system for programming out-of-core matrix algorithms-by-tiles on multithreaded architectures. ACM Transactions on Mathematical Software, 38(4):25, 2012.
Parallel?
• S0: D ← A * B
• S1: A → L * L^T
• S2: B ← B * L^-T
• S3: C ← C - B * B^T
• S4: X ← L^-1 * X
Can the code be parallelized? Consider the dependences:
• Write After Read: (S0, S1) (S1 overwrites the A that S0 reads)
• Read After Write: (S1, S2)
• Read After Write: (S2, S3)
• Read After Write: (S1, S4)
Are you sure S1 and S2 cannot be parallelized?
Parallel?
• S0: D ← A * B
• S1: A → L * L^T
• S2: B ← B * L^-T
• S3: C ← C - B * B^T
• S4: X ← L^-1 * X
How to parallelize?
[Figure: the statements drawn as a dependency graph: D computed from A and B; L from A; the updated B from B and L; C from C and B; X from L and X.]
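The classification above can be computed mechanically from each statement's read and write sets. A minimal runnable sketch (my illustration, not SuperMatrix code; the operand sets for S0-S4 are hard-coded, with L stored over A):

#include <stdio.h>
#include <string.h>

/* Each statement's read set and write set over the operands A..X. */
typedef struct { const char *name, *reads, *writes; } Stmt;

static int overlap(const char *s, const char *t) {
    for (; *s; s++) if (strchr(t, *s)) return 1;
    return 0;
}

int main(void) {
    /* S1 overwrites A with its Cholesky factor L, so "A" appears in
       both its read and write sets; later uses of L read "A". */
    Stmt s[] = {
        { "S0", "AB", "D" },   /* D <- A*B           */
        { "S1", "A",  "A" },   /* A <- Chol(A) (= L) */
        { "S2", "AB", "B" },   /* B <- B * L^-T      */
        { "S3", "BC", "C" },   /* C <- C - B*B^T     */
        { "S4", "AX", "X" },   /* X <- L^-1 * X      */
    };
    int n = sizeof s / sizeof s[0];
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            if (overlap(s[i].writes, s[j].reads))
                printf("RAW (%s, %s)\n", s[i].name, s[j].name);
            if (overlap(s[i].reads, s[j].writes))
                printf("WAR (%s, %s)\n", s[i].name, s[j].name);
            if (overlap(s[i].writes, s[j].writes))
                printf("WAW (%s, %s)\n", s[i].name, s[j].name);
        }
    return 0;
}

It prints the dependences listed above, plus WAR (S0, S2): S2 also overwrites the B that S0 reads.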
Traditional Library Approach
• S0: D ← A * B
• S1: A → L * L^T
• S2: B ← B * L^-T
• S3: C ← C - B * B^T
• S4: X ← L^-1 * X
How to parallelize? Call a parallel library routine per statement:
• S0: ParGemm(A, B, D)
• S1: L = ParPotrf(A)
• S2: ParTrsm(L, B)
• S3: ParSyrk(B, C)
• S4: ParTrsm(L, X)
Traditional Library Approach: Implemented with libflame and BLIS
• S0: D ← A * B
• S1: A → L * L^T
• S2: B ← B * L^-T
• S3: C ← C - B * B^T
• S4: X ← L^-1 * X
Supported by parallel BLAS and LAPACK (multithreaded BLIS):

/*-----------------------------------------------*/
FLA_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
          FLA_ONE, A, B, FLA_ZERO, D );
FLA_Chol( FLA_LOWER_TRIANGULAR, A );
FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
          FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
          FLA_ONE, A, B );
FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
          FLA_MINUS_ONE, B, FLA_ONE, C );
FLA_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
          FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
          FLA_ONE, L, X );
/*-----------------------------------------------*/
Problem for Fine-grained Parallelism
[Figure: two stacks side by side. Fine-grained parallelism: libflame → BLIS → pthreads/OpenMP. Coarse-grained parallelism: libflame → SuperMatrix → BLIS.]
• Fine-grained parallelism (multithreading inside each BLAS call) suffers synchronization-point overhead at every call boundary and is not a fit for multiple-device scenarios.
• Coarse-grained parallelism introduces parallelism across instructions and fits platforms with multiple computation units.
Coarse-grained Parallelism
[Figure: the same two stacks; the coarse-grained one inserts SuperMatrix between libflame and BLIS.]
• Introduces parallelism across instructions.
• Fits platforms with multiple computation units, as the sketch below illustrates.
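To make the contrast concrete, a hedged OpenMP sketch (my illustration, not the SuperMatrix implementation): the fine-grained style pays an implicit barrier after every parallel operation, while a coarse-grained, task-based style declares data dependences and lets independent operations overlap.

#include <omp.h>

/* Fine-grained: each operation is internally parallel, but an
   implicit barrier separates consecutive operations. */
void fine_grained(double *A, double *B, double *C, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) A[i] *= 2.0;
    /* implicit barrier: no overlap with the next operation */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) B[i] *= 2.0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) C[i] += A[i] + B[i];
}

/* Coarse-grained: whole operations become tasks; the runtime
   overlaps the two independent updates and orders the third. */
void coarse_grained(double *A, double *B, double *C, int n) {
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(inout: A[0:n])
        for (int i = 0; i < n; i++) A[i] *= 2.0;

        #pragma omp task depend(inout: B[0:n])
        for (int i = 0; i < n; i++) B[i] *= 2.0;

        /* runs only after both A and B are ready */
        #pragma omp task depend(in: A[0:n], B[0:n]) depend(inout: C[0:n])
        for (int i = 0; i < n; i++) C[i] += A[i] + B[i];
    }
}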
SuperMatrix Approach
• S0: D ← A * B
• S1: A → L * L^T
• S2: B ← B * L^-T
• S3: C ← C - B * B^T
• S4: X ← L^-1 * X
How to parallelize? Partitioning: algorithm-by-blocks!
[Figure, built up across several slides: each matrix is partitioned into blocks, each statement expands into many block-sized tasks, and the tasks are linked through their shared blocks into a DAG.]
SuperMatrix Approach
• Constructs the DAG across the instructions automatically.
• No need to annotate the task dependencies manually!
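A hedged sketch of how such automatic DAG construction can work (my simplification, not the libflame implementation): every submitted block task declares which blocks it reads and writes, and the runtime adds RAW, WAR, and WAW edges by tracking the last writer of, and the readers since the last write to, each block.

#include <stdio.h>

#define NBLOCKS   8
#define MAX_TASKS 64

typedef struct {
    const char *name;
    int deps[MAX_TASKS], ndeps;          /* edges from earlier tasks */
} Task;

static Task tasks[MAX_TASKS];
static int  ntasks;
static int  writer[NBLOCKS];             /* last task to write block, or -1 */
static int  readers[NBLOCKS][MAX_TASKS]; /* tasks reading since last write  */
static int  nreaders[NBLOCKS];

static void add_dep(Task *t, int d) {
    for (int i = 0; i < t->ndeps; i++) if (t->deps[i] == d) return;
    t->deps[t->ndeps++] = d;
}

/* Enqueue a task that reads blocks r[0..nr) and writes block w. */
static int submit(const char *name, const int *r, int nr, int w) {
    int id = ntasks;
    Task *t = &tasks[id];
    t->name = name; t->ndeps = 0;
    for (int i = 0; i < nr; i++)              /* RAW edges */
        if (writer[r[i]] >= 0) add_dep(t, writer[r[i]]);
    if (writer[w] >= 0) add_dep(t, writer[w]);/* WAW edge  */
    for (int i = 0; i < nreaders[w]; i++)     /* WAR edges */
        add_dep(t, readers[w][i]);
    for (int i = 0; i < nr; i++)              /* record this task as a reader */
        readers[r[i]][nreaders[r[i]]++] = id;
    writer[w] = id;                           /* and as the new writer of w   */
    nreaders[w] = 0;
    return ntasks++;
}

int main(void) {
    enum { A, B, C, D, X };                   /* block (operand) ids */
    for (int i = 0; i < NBLOCKS; i++) writer[i] = -1;

    submit("S0: D = A*B",     (int[]){A, B}, 2, D);
    submit("S1: A = Chol(A)", (int[]){A},    1, A);
    submit("S2: B = B*L^-T",  (int[]){A, B}, 2, B);
    submit("S3: C = C-B*B^T", (int[]){B, C}, 2, C);
    submit("S4: X = L^-1*X",  (int[]){A, X}, 2, X);

    for (int i = 0; i < ntasks; i++) {
        printf("%-16s depends on:", tasks[i].name);
        for (int j = 0; j < tasks[i].ndeps; j++)
            printf(" %d", tasks[i].deps[j]);
        printf("\n");
    }
    return 0;
}

Running this reproduces the edges from the earlier slide: S1 after S0 (WAR on A), S2 after S1 and S0, S3 after S2, S4 after S1.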
Traditional Library Approach (the FLA_ code above, repeated for comparison with the SuperMatrix version that follows)
SuperMatrix Approach: Implemented with libflame and BLIS
• S0: D ← A * B
• S1: A → L * L^T
• S2: B ← B * L^-T
• S3: C ← C - B * B^T
• S4: X ← L^-1 * X

/*-----------------------------------------------*/
FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
            FLA_ONE, A, B, FLA_ZERO, D );
FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
            FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
            FLA_ONE, A, B );
FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
            FLA_MINUS_ONE, B, FLA_ONE, C );
FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
            FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
            FLA_ONE, L, X );
/*-----------------------------------------------*/

Tiny Code Change! (FLA_ becomes FLASH_)
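For context, here is a hedged sketch of the setup around those five calls. The FLASH object and queue helpers used below (FLASH_Obj_create, FLASH_Queue_enable, FLASH_Queue_set_num_threads, FLASH_Queue_disable, FLASH_Obj_free) are quoted from memory and should be checked against the libflame documentation:

#include "FLAME.h"

int main( void )
{
    dim_t n = 4096, b = 256;   /* matrix size and storage blocksize */
    FLA_Obj A, B;              /* ... and C, D, L, X likewise ...   */

    FLA_Init();

    /* Hierarchical ("FLASH") matrices with one level of b-by-b blocks. */
    FLASH_Obj_create( FLA_DOUBLE, n, n, 1, &b, &A );
    FLASH_Obj_create( FLA_DOUBLE, n, n, 1, &b, &B );

    /* With the SuperMatrix queue enabled, FLASH_ calls only enqueue
       block tasks; the DAG is built and executed when the queue drains. */
    FLASH_Queue_enable();
    FLASH_Queue_set_num_threads( 8 );

    /* ... the five FLASH_ calls shown above ... */

    FLASH_Queue_disable();

    FLASH_Obj_free( &A );
    FLASH_Obj_free( &B );
    FLA_Finalize();
    return 0;
}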
Challenges in Heterogeneous Platforms!
• S0: D ← A * A^T
• S1: A → L * L^T
• S2: B ← B * L^-T
• S3: C ← C - B * B^T
• S4: X ← L^-1 * X
What if there is one accelerator in your system?
• S0: ParGemm(A, A^T, D)
• S1: L = ParPotrf(A)
• S2: ParTrsm(L, B)
• S3: ParSyrk(B, C)
• S4: ParTrsm(L, X)
Challenges in Heterogeneous Platforms!
What if there is one accelerator in your system? All operands must be copied to the device up front, and results copied back:

/*-----------------------------*/
Memcpy(A, hA);
Memcpy(D, hD);
Memcpy(B, hB);
Memcpy(C, hC);
Memcpy(X, hX);
/*-----------------------------*/
• S0: ParGemm(A, A^T, D)
• S1: L = ParPotrf(A)
• S2: ParTrsm(L, B)
• S3: ParSyrk(B, C)
• S4: ParTrsm(L, X)
/*-----------------------------*/
Memcpy(hX, X);
/*-----------------------------*/

What if there are 4 GPUs and 8 CPU cores in your system?
Adapting the Original SuperMatrix to Heterogeneous Platforms
• Software Cache
• Heterogeneous Scheduler
• Asynchronous Memory Copy
• Worker Task Performance Model
Naïve Approach
[Figure: host and device connected by PCIe; blocks C, A, B shuttle across for every task.]
• Transfer data from host to device before execution.
• Execute the task on the device.
• Transfer data from device to host after execution.
• No data reuse on the devices!
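To make the naive pattern concrete, a hedged sketch in CUDA-runtime C (my illustration; cuBLAS handle setup and error checking are elided, and gemm_task_naive, hA/hB/hD are made-up names):

#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Naive offload of one GEMM task, D = A*B: every task pays PCIe
   transfers in and out, even if the blocks were just on the device. */
void gemm_task_naive(cublasHandle_t h, double *hD,
                     const double *hA, const double *hB, int n) {
    size_t bytes = (size_t)n * n * sizeof(double);
    double *dA, *dB, *dD;
    const double one = 1.0, zero = 0.0;

    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dD, bytes);

    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);  /* host -> device */
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, dA, n, dB, n, &zero, dD, n);      /* run on device  */

    cudaMemcpy(hD, dD, bytes, cudaMemcpyDeviceToHost);  /* device -> host */

    cudaFree(dA); cudaFree(dB); cudaFree(dD);
}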
Software Cache
[Figure: host and device connected by PCIe; blocks C, A, B stay cached on the device.]
Quintana-Ortí, G., et al. Solving dense linear systems on platforms with multiple hardware accelerators. In PPoPP '09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009.
• No need to transfer data from host to device before execution if the data is already on the device.
• No need to transfer data from device to host after execution if the data is not required by the host immediately.
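A hedged sketch of such a cache (my simplification, not the paper's data structure): a per-block residency table with a dirty bit, consulted before every transfer.

#include <stdbool.h>
#include <stdio.h>

#define NBLOCKS  64
#define NDEVICES 4

typedef struct {
    bool resident[NDEVICES];   /* is a copy of the block on device d? */
    bool dirty[NDEVICES];      /* does device d hold the only copy?   */
} CacheLine;

static CacheLine cache[NBLOCKS];

/* Stand-ins for the actual PCIe copies. */
static void transfer_h2d(int blk, int dev) { printf("H2D block %d -> dev %d\n", blk, dev); }
static void transfer_d2h(int blk, int dev) { printf("D2H block %d <- dev %d\n", blk, dev); }

/* Called before a task on device `dev` reads block `blk`. */
void acquire_for_read(int blk, int dev) {
    CacheLine *c = &cache[blk];
    if (c->resident[dev]) return;           /* hit: skip the transfer   */
    for (int d = 0; d < NDEVICES; d++)      /* flush a dirty copy home  */
        if (c->dirty[d]) { transfer_d2h(blk, d); c->dirty[d] = false; }
    transfer_h2d(blk, dev);
    c->resident[dev] = true;
}

/* Called when a task on device `dev` writes block `blk`:
   that copy becomes the only valid one; all others are stale. */
void mark_written(int blk, int dev) {
    CacheLine *c = &cache[blk];
    for (int d = 0; d < NDEVICES; d++) c->resident[d] = (d == dev);
    c->dirty[dev] = true;                   /* written back to the host
                                               only when actually needed */
}

int main(void) {
    acquire_for_read(0, 0);   /* miss: one H2D transfer          */
    mark_written(0, 0);
    acquire_for_read(0, 0);   /* hit: no transfer at all         */
    acquire_for_read(0, 1);   /* dirty elsewhere: D2H, then H2D  */
    return 0;
}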
HEFT (Heterogeneous Earliest Finish Time)
Topcuoglu, H., Hariri, S., and Wu, M. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, 13(3):260-274, 2002.
[Figure: a timeline from 0 to 15 with Tasks 1-5 already placed on the workers. Where should we place Task 6?]
• Available Time: when a worker finishes the work already assigned to it.
• EST (Earliest Start Time): the earliest moment the task's inputs can be ready on that worker.
• EFT (Earliest Finish Time): max(Available Time, EST) plus the task's predicted execution time. HEFT places the task on the worker with the smallest EFT.
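A hedged sketch of the placement rule (my illustration; the three-worker setup and the times are hypothetical, chosen to mirror the first step of the walkthrough below):

#include <stdio.h>

#define NWORKERS 3

typedef struct {
    double avail;               /* when the worker is next free        */
    double est;                 /* earliest start (inputs transferred) */
    double exec;                /* predicted task time on this worker  */
} WorkerState;

/* Pick the worker with the smallest EFT = max(avail, est) + exec. */
int heft_place(WorkerState w[NWORKERS], double *eft_out) {
    int best = 0;
    double best_eft = 1e300;
    for (int i = 0; i < NWORKERS; i++) {
        double start = w[i].avail > w[i].est ? w[i].avail : w[i].est;
        double eft = start + w[i].exec;
        if (eft < best_eft) { best_eft = eft; best = i; }
    }
    w[best].avail = best_eft;   /* commit the task to that worker */
    *eft_out = best_eft;
    return best;
}

int main(void) {
    /* First step of the walkthrough below: the CPU can start CHOL at
       once but runs it more slowly; each GPU must first receive A0,0. */
    WorkerState w[NWORKERS] = {
        { 0.0, 0.0, 1.5 },      /* CPU  : EFT = 1.5 */
        { 0.0, 1.0, 1.0 },      /* GPU0 : EFT = 2.0 */
        { 0.0, 1.0, 1.0 },      /* GPU1 : EFT = 2.0 */
    };
    double eft;
    int dev = heft_place(w, &eft);
    printf("place on worker %d, EFT = %.1f\n", dev, eft);  /* worker 0 */
    return 0;
}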
3x3 Blocked Cholesky Decomposition
The partitioned operation expands into ten block tasks:
• CHOL0: A0,0 ← Chol(A0,0)
• TRSM1: A1,0 ← A1,0 * A0,0^-T
• TRSM2: A2,0 ← A2,0 * A0,0^-T
• SYRK3: A1,1 ← A1,1 - A1,0 * A1,0^T
• GEMM4: A2,1 ← A2,1 - A2,0 * A1,0^T
• SYRK5: A2,2 ← A2,2 - A2,0 * A2,0^T
• CHOL6: A1,1 ← Chol(A1,1)
• TRSM7: A2,1 ← A2,1 * A1,1^-T
• SYRK8: A2,2 ← A2,2 - A2,1 * A2,1^T
• CHOL9: A2,2 ← Chol(A2,2)
[Figure: these tasks drawn as a DAG; the same DAG is repeated on the slides that follow.]
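These ten tasks are exactly what a right-looking Cholesky, applied block by block, emits on a 3x3 grid. A hedged sketch of the generating loop (the submit_* stubs are hypothetical stand-ins for task-enqueue calls):

#include <stdio.h>

static int next_id = 0;
static void submit_chol(int k)               { printf("CHOL%d: A%d,%d <- Chol(A%d,%d)\n", next_id++, k, k, k, k); }
static void submit_trsm(int i, int k)        { printf("TRSM%d: A%d,%d <- A%d,%d * A%d,%d^-T\n", next_id++, i, k, i, k, k, k); }
static void submit_syrk(int i, int k)        { printf("SYRK%d: A%d,%d -= A%d,%d * A%d,%d^T\n", next_id++, i, i, i, k, i, k); }
static void submit_gemm(int i, int j, int k) { printf("GEMM%d: A%d,%d -= A%d,%d * A%d,%d^T\n", next_id++, i, j, i, k, j, k); }

/* One step per block column k; p is the number of block rows/columns.
   For p = 3 this emits exactly the tasks CHOL0 ... CHOL9 listed above. */
void chol_by_blocks(int p) {
    for (int k = 0; k < p; k++) {
        submit_chol(k);                  /* factor the diagonal block   */
        for (int i = k + 1; i < p; i++)
            submit_trsm(i, k);           /* update the panel below it   */
        for (int i = k + 1; i < p; i++) {
            for (int j = k + 1; j < i; j++)
                submit_gemm(i, j, k);    /* update the trailing matrix  */
            submit_syrk(i, k);
        }
    }
}

int main(void) { chol_by_blocks(3); return 0; }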
HEFT Walkthrough: Initial State
[The DAG of CHOL0 ... CHOL9 from the previous slide is repeated on each of the following slides.]
Every block is resident only on the CPU, and no task has been assigned yet.

Data distribution (1 = block is resident):
          CPU  GPU0  GPU1
A0,0       1    0     0
A1,0       1    0     0
A1,1       1    0     0
A2,0       1    0     0
A2,1       1    0     0
A2,2       1    0     0

Scheduler (HEFT assignment table):
          CPU  GPU0  GPU1
Avail      0    0     0
EST        -    -     -
EFT        -    -     -
Priority   -    -     -
Step 1: place CHOL0 (A0,0 ← Chol(A0,0)).
          CPU  GPU0  GPU1
Avail      0    0     0
EST        0    1     1
EFT       1.5   2     2
Priority   1    2     3
The GPUs' EST of 1 reflects the time to transfer A0,0 to the device. CHOL0 goes to the CPU (lowest EFT); the data distribution is unchanged.
Step 2: place TRSM1 (A1,0 ← A1,0 * A0,0^-T).
          CPU  GPU0  GPU1
Avail     1.5   0     0
EST       1.5  3.5   3.5
EFT       5.5   5     5
Priority   3    1     2
TRSM1 goes to GPU0 (tied EFT with GPU1; GPU0 wins on priority). A0,0 is copied to GPU0, and the updated A1,0 now lives only there.

Data distribution:
          CPU  GPU0  GPU1
A0,0       1    1     0
A1,0       0    1     0
A1,1       1    0     0
A2,0       1    0     0
A2,1       1    0     0
A2,2       1    0     0
Step 3: place TRSM2 (A2,0 ← A2,0 * A0,0^-T).
          CPU  GPU0  GPU1
Avail     1.5   5     0
EST       1.5   5    3.5
EFT       5.5  6.5    5
Priority   2    3     1
TRSM2 goes to GPU1. A0,0 is copied to GPU1, and the updated A2,0 now lives only there.

Data distribution:
          CPU  GPU0  GPU1
A0,0       1    1     1
A1,0       0    1     0
A1,1       1    0     0
A2,0       0    0     1
A2,1       1    0     0
A2,2       1    0     0
Step 4: place SYRK3 (A1,1 ← A1,1 - A1,0 * A1,0^T).
          CPU  GPU0  GPU1
Avail     1.5   5     5
EST        6    5     7
EFT       10   6.5   8.5
Priority   3    1     2
SYRK3 goes to GPU0, where A1,0 already resides; A1,1 moves there as well.

Data distribution:
          CPU  GPU0  GPU1
A0,0       1    1     1
A1,0       0    1     0
A1,1       0    1     0
A2,0       0    0     1
A2,1       1    0     0
A2,2       1    0     0
Step 5: place GEMM4 (A2,1 ← A2,1 - A2,0 * A1,0^T).
          CPU  GPU0  GPU1
Avail     1.5  6.5    5
EST        6    7     7
EFT       14   10    10
Priority   3    1     2
GEMM4 goes to GPU0 (tie with GPU1 broken by priority). A2,0 is flushed back to the host and replicated to both GPUs; the updated A2,1 lives on GPU0.

Data distribution:
          CPU  GPU0  GPU1
A0,0       1    1     1
A1,0       0    1     0
A1,1       0    1     0
A2,0       1    1     1
A2,1       0    1     0
A2,2       1    0     0
Step 6: place SYRK5 (A2,2 ← A2,2 - A2,0 * A2,0^T).
          CPU  GPU0  GPU1
Avail     1.5   10    5
EST        6    10    5
EFT       10  11.5   6.5
Priority   2    3     1
SYRK5 goes to GPU1; the updated A2,2 lives there.

Data distribution:
          CPU  GPU0  GPU1
A0,0       1    1     1
A1,0       0    1     0
A1,1       0    1     0
A2,0       1    1     1
A2,1       0    1     0
A2,2       0    0     1
Step 7: place CHOL6 (A1,1 ← Chol(A1,1)).
          CPU  GPU0  GPU1
Avail     1.5   10   6.5
EST       7.5   10   8.5
EFT        9    11   9.5
Priority   1    3     2
CHOL6 goes to the CPU; A1,1 is flushed back from GPU0.

Data distribution:
          CPU  GPU0  GPU1
A0,0       1    1     1
A1,0       0    1     0
A1,1       1    0     0
A2,0       1    1     1
A2,1       0    1     0
A2,2       0    0     1
Step 8: place TRSM7 (A2,1 ← A2,1 * A1,1^-T).
          CPU  GPU0  GPU1
Avail      9    10   6.5
EST       11    10    12
EFT       15  11.5  13.5
Priority   3    1     2
TRSM7 goes to GPU0; A1,1 is copied there.

Data distribution:
          CPU  GPU0  GPU1
A0,0       1    1     1
A1,0       0    1     0
A1,1       1    1     0
A2,0       1    1     1
A2,1       0    1     0
A2,2       0    0     1
Step 9: place SYRK8 (A2,2 ← A2,2 - A2,1 * A2,1^T).
          CPU  GPU0  GPU1
Avail      9   11.5  6.5
EST      12.5  11.5  13.5
EFT      16.5   13    15
Priority   3    1     2
SYRK8 goes to GPU0; A2,2 moves there from GPU1.

Data distribution:
          CPU  GPU0  GPU1
A0,0       1    1     1
A1,0       0    1     0
A1,1       1    1     0
A2,0       1    1     1
A2,1       0    1     0
A2,2       0    1     0
Step 10: place CHOL9 (A2,2 ← Chol(A2,2)).
          CPU  GPU0  GPU1
Avail      9    13   6.5
EST       14    13    15
EFT      15.5   14    16
Priority   2    1     3
CHOL9 goes to GPU0, where A2,2 already resides; the data distribution is unchanged.
All ten tasks are now placed: CPU gets CHOL0 and CHOL6; GPU0 gets TRSM1, SYRK3, GEMM4, TRSM7, SYRK8, and CHOL9; GPU1 gets TRSM2 and SYRK5.
SuperMatrix Approach on Heterogeneous Platforms
• S0: D ← A * B
• S1: A → L * L^T
• S2: B ← B * L^-T
• S3: C ← C - B * B^T
• S4: X ← L^-1 * X

/*-----------------------------------------------*/
FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
            FLA_ONE, A, B, FLA_ZERO, D );
FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
            FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
            FLA_ONE, A, B );
FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
            FLA_MINUS_ONE, B, FLA_ONE, C );
FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
            FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
            FLA_ONE, L, X );
/*-----------------------------------------------*/

No Code Change!
Performance
[Performance plots omitted from the transcript.]
• 6-core single-socket Xeon E5649 CPU + 1 GTX 480 GPU card, block size 1024.
• 6-core single-socket Xeon E5649 CPU + 2 Tesla C2070 GPU cards, block size 2048.
Conclusion
[Figure: the software stack from the opening slide, repeated. SuperMatrix on top of libflame, dispatching to BLIS/MKL/ACML (OpenMP/pthreads) on CPU/MIC and to cuBLAS (CUDA)/clBLAS (OpenCL) on GPUs, accelerators, and other platforms. "One ring to rule them all."]
Related Work
Target platform                       LAPACK project        FLAME project
Sequential                            LAPACK                libflame
Sequential + multithreaded BLAS       LAPACK                libflame
Multicore/multithreaded               PLASMA                libflame + SuperMatrix
Multicore + out-of-order scheduling   PLASMA + QUARK        libflame + SuperMatrix
CPU + single GPU                      MAGMA                 libflame + SuperMatrix
Multicore + multi-GPU                 DAGuE/StarPU/XKaapi   libflame + SuperMatrix