Efficient Parallel Implementation of Classical Gram-Schmidt Orthogonalization Using Matrix Multiplication
Takuya Yokozawa, Daisuke Takahashi, Taisuke Boku, Mitsuhisa Sato
Graduate School of Systems and Information Engineering, University of Tsukuba, Japan
PMAA06, September 7, 2006
Overview
Background and objective
Classical Gram-Schmidt orthogonalization
  Using matrix multiplication
Proposed algorithm: Recursive CGS
Performance results
Conclusion and future works
Background
The Gram-Schmidt orthogonalization process is one of the fundamental algorithms in linear algebra.
Two basic computational variants of the Gram-Schmidt process exist:
  Classical Gram-Schmidt algorithm (CGS)
  Modified Gram-Schmidt algorithm (MGS)
    Often selected for practical applications
    More stable than the CGS algorithm
Background (cont’d)
MGS is stable, but:
  it cannot be expressed with Level-2 BLAS;
  its parallel implementation requires additional communication.
CGS is unstable, but:
  it can be expressed with Level-2 BLAS;
  CGS with the DGKS correction [Daniel et al. '76] is one of the most efficient ways to perform the orthogonalization process.
In this work, we use CGS. The sketch below illustrates why MGS resists a Level-2 BLAS formulation.
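A minimal NumPy sketch (our illustration, not the authors' code) of MGS: each projection uses the partially updated column, so the inner loop is a chain of dependent Level-1 operations (DOT and AXPY) rather than a single GEMV.

```python
import numpy as np

def mgs(A):
    """Modified Gram-Schmidt: each projection is applied against the
    already-updated column, so the inner loop is a dependent chain of
    Level-1 operations (DOT + AXPY) and cannot be fused into one GEMV."""
    Q = A.astype(float)  # astype copies, so A is left untouched
    m, n = Q.shape
    for j in range(n):
        for i in range(j):
            Q[:, j] -= (Q[:, i] @ Q[:, j]) * Q[:, i]  # DOT, then AXPY
        Q[:, j] /= np.linalg.norm(Q[:, j])            # NRM2 + SCAL
    return Q
```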
In this work
We study an efficient implementation of CGS orthogonalization using matrix multiplication.
We propose a parallel recursive CGS algorithm to perform QR decomposition.
We report performance results on a PC cluster.
Related works
Recursion leads to automatic variable blocking for dense linear-algebra algorithms [Gustavson 1997]
Parallel QR factorization [Elmroth and Gustavson 2000]
  Recursive QR factorization of an m × n matrix
  Evaluated on a 4-way SMP computer
Classical Gram-Schmidt algorithm for an m × n matrix
The CGS orthogonalization can be performed using Level-2 BLAS.
It is suitable for parallelization.
The CGS algorithm is not stable.
do j = 1, n
  q_j = a_j
  do i = 1, j-1
    q_j = q_j - (a_j, q_i) q_i
  end do
  q_j = q_j / ||q_j||
end do

a_j: raw data vector; q_j: orthogonalized vector;
(a_j, q_i): inner product; ||q_j||: Euclidean norm.
Naïve implementation
CGS for an m × n matrix A = (a_1 a_2 ... a_n), where each a_i is an m-vector, is computed by
  matrix-vector multiplications (GEMV; Level-2 BLAS) and
  normalizations (NRM2, SCAL; Level-1 BLAS):

q_1 = a_1;                                          q_1 = q_1 / ||q_1||
q_2 = a_2 - (a_2, q_1) q_1;                         q_2 = q_2 / ||q_2||
q_3 = a_3 - (a_3, q_1) q_1 - (a_3, q_2) q_2;        q_3 = q_3 / ||q_3||
q_4 = a_4 - (a_4, q_1) q_1 - ... - (a_4, q_3) q_3;  q_4 = q_4 / ||q_4||
q_5 = a_5 - (a_5, q_1) q_1 - ... - (a_5, q_4) q_4;  q_5 = q_5 / ||q_5||
q_6 = a_6 - (a_6, q_1) q_1 - ... - (a_6, q_5) q_5;  q_6 = q_6 / ||q_6||

The projection sums are GEMV operations; the normalizations use NRM2 and SCAL.
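A minimal NumPy sketch (our illustration, not the authors' code) of this naïve formulation; the two matrix-vector products map to GEMV, and the normalization maps to NRM2 plus SCAL.

```python
import numpy as np

def cgs_naive(A):
    """Naïve CGS: one pass over the columns, each step doing two
    matrix-vector products (GEMV), a norm (NRM2), and a scaling (SCAL)."""
    m, n = A.shape
    Q = np.empty((m, n))
    for j in range(n):
        q = A[:, j].astype(float)
        if j > 0:
            w = Q[:, :j].T @ A[:, j]     # GEMV: inner products (a_j, q_i)
            q -= Q[:, :j] @ w            # GEMV: subtract the projections
        Q[:, j] = q / np.linalg.norm(q)  # NRM2 + SCAL
    return Q
```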
CGS using matrix multiplication
The CGS orthogonalization of a matrix can be rewritten as matrix multiplication [Samukawa '95].
The orthogonalization of a block of columns (here q_4, q_5, q_6) separates into:
  terms that depend on the previously orthogonalized columns q_1, q_2, q_3 (computable for the whole block at once);
  terms that depend on q_4 and q_5 (within the block);
  normalization.
q_4 = a_4 - (a_4, q_1) q_1 - (a_4, q_2) q_2 - (a_4, q_3) q_3;                               q_4 = q_4 / ||q_4||
q_5 = a_5 - (a_5, q_1) q_1 - (a_5, q_2) q_2 - (a_5, q_3) q_3 - (a_5, q_4) q_4;              q_5 = q_5 / ||q_5||
q_6 = a_6 - (a_6, q_1) q_1 - (a_6, q_2) q_2 - (a_6, q_3) q_3 - (a_6, q_4) q_4 - (a_6, q_5) q_5;  q_6 = q_6 / ||q_6||

The terms involving q_1, q_2, q_3 can be computed for all three columns at once:

(q_4 \; q_5 \; q_6) = (a_4 \; a_5 \; a_6) - (q_1 \; q_2 \; q_3) \begin{pmatrix} (q_1, a_4) & (q_1, a_5) & (q_1, a_6) \\ (q_2, a_4) & (q_2, a_5) & (q_2, a_6) \\ (q_3, a_4) & (q_3, a_5) & (q_3, a_6) \end{pmatrix}

                    = (a_4 \; a_5 \; a_6) - (q_1 \; q_2 \; q_3) \left( \begin{pmatrix} q_1^t \\ q_2^t \\ q_3^t \end{pmatrix} (a_4 \; a_5 \; a_6) \right),

i.e., two matrix-matrix multiplications.
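In code, the block update above is just two GEMMs. A minimal NumPy sketch (our illustration), assuming Q_done holds the already orthogonalized columns (q_1 q_2 q_3) and A_block holds (a_4 a_5 a_6):

```python
import numpy as np

def block_update(Q_done, A_block):
    """Remove from every column of A_block its components along the
    columns of Q_done, all at once, with two matrix multiplications."""
    S = Q_done.T @ A_block       # GEMM: all inner products (q_i, a_j)
    return A_block - Q_done @ S  # GEMM: subtract all projections
```

The remaining dependencies within the block (on q_4 and q_5) and the normalizations are then handled column by column as before.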
Concept of proposed algorithm
The CGS orthogonalization with matrix multiplication can be extended into a recursive formulation.
The parts A and C have the same structure as the original CGS calculation space and are handled recursively.
The parts Ab and Cb are computed by matrix multiplication.
[Figure: recursive splitting of the calculation space into blocks A, B, C with sub-blocks Ab and Cb; GEMV = matrix-vector multiplication, GEMM = matrix-matrix multiplication.]
Parallelization
We use row-wise distribution.
The proposed recursive algorithm uses matrix multiplication and matrix-vector multiplication to compute inner products.
Column-wise distribution: all elements of a vector reside on one processor; additional communication is required to compute the inner products.
Row-wise distribution: each processor holds part of the elements of every vector; collective communication is required to sum up and distribute the inner products (see the sketch below).
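A minimal mpi4py sketch (our illustration, not the authors' code) of the row-wise scheme: each rank holds a horizontal slice of the vectors, so inner products are partial sums that must be combined with MPI_ALLREDUCE.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def distributed_inner_products(Q_local, a_local):
    """Q_local: this rank's rows of the orthogonalized columns, shape
    (m_local, k); a_local: this rank's rows of a_j, shape (m_local,)."""
    partial = Q_local.T @ a_local           # local part of the GEMV
    w = np.empty_like(partial)
    comm.Allreduce(partial, w, op=MPI.SUM)  # sum and distribute (MPI_ALLREDUCE)
    return w
```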
Recursive CGS
recursiveCGS(Q, A, k, m)
begin
  if (m <= NB) then
    do j = k+1, k+m
      GEMV(Q_{k, j-k}^T, a_j, w)        ! w = inner products of a_j with the block's previous columns
      GEMV(-Q_{k, j-k}, w, q_j)         ! q_j = a_j - Q w
      q_j = q_j / ||q_j||
    end do
  else
    recursiveCGS(Q, A, k, m/2)
    GEMM(Q_{k, m/2}^T, A_{k+m/2, m/2}, S)
    GEMM(-Q_{k, m/2}, S, Q_{k+m/2, m/2})
    recursiveCGS(Q, A, k+m/2, m/2)
  end if
end

NB: blocking size
GEMV: matrix-vector multiplication; GEMM: matrix-matrix multiplication
[Figure: the m columns are halved recursively (m, m/2, m/4, ...) down to blocks of size NB; blocks of size NB are processed with GEMV, the off-diagonal updates with GEMM.]
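A runnable NumPy version of the recursion (a sketch under our naming, not the authors' code), operating in place on a matrix that initially holds A:

```python
import numpy as np

NB = 64  # blocking size; the optimal value is machine-dependent

def recursive_cgs(Q, k, m):
    """Orthogonalize columns k .. k+m-1 of Q in place, assuming their
    components along columns 0 .. k-1 have already been removed."""
    if m <= NB:
        for j in range(k, k + m):
            if j > k:
                w = Q[:, k:j].T @ Q[:, j]  # GEMV: inner products
                Q[:, j] -= Q[:, k:j] @ w   # GEMV: subtract projections
            Q[:, j] /= np.linalg.norm(Q[:, j])
    else:
        h = m // 2
        recursive_cgs(Q, k, h)             # first half of the block
        S = Q[:, k:k+h].T @ Q[:, k+h:k+m]  # GEMM
        Q[:, k+h:k+m] -= Q[:, k:k+h] @ S   # GEMM: update second half
        recursive_cgs(Q, k + h, m - h)     # second half of the block

# Usage: Q = A.astype(float); recursive_cgs(Q, 0, Q.shape[1])
```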
Performance Results
To evaluate the proposed recursive CGS algorithm, we compared:
  the proposed recursive CGS algorithm;
  a naïve implementation of the CGS algorithm using Level-2 BLAS.
The CGS orthogonalization was performed on double-precision real data.
Evaluation Environment
A 32-node Xeon PC cluster:
  Xeon 3 GHz, 1 GB DDR2 memory
  1000Base-T Gigabit Ethernet
  LAM/MPI 7.1.1
  Intel MKL 8.1
  gcc 4.0.2
  Linux 2.6.17
All programs were run in 64-bit mode.
Performance Results on 32-node Xeon PC cluster
[Figure: GFLOPS vs. matrix size (up to 8000) for the recursive and naïve implementations; the recursive version is faster, by up to about 1.72 times.]
Discussion (1)
The recursive CGS algorithm improved performance by up to about 1.72 times at matrix size 8000.
Cache reusability is improved:
  the naïve implementation uses Level-2 BLAS;
  the recursive CGS uses Level-3 BLAS.
Performance Results (Matrix size N=8000)
[Figure: GFLOPS vs. number of processors (up to 32) at matrix size N = 8000 for the recursive and naïve implementations; the recursive version is faster, by up to about 3.6 times.]
Discussion (2)
We improved performance by up to about 3.6 times at matrix size 8000.
Scalability is limited:
  MPI_ALLREDUCE communication time dominates the total execution time;
  for larger problem sizes, the matrix multiplications should also be distributed.
Breakdown of performance results (matrix size N = 8000)
[Figure: execution time in seconds vs. number of processors (1, 2, 4, 8, 16, 32), broken down into communication and computation for the naïve and recursive implementations.]
Conclusion
We proposed a recursive CGS algorithm using matrix multiplication.
The proposed algorithm improved performance over the naïve implementation:
  cache reusability was improved by the use of matrix multiplication.
The performance results demonstrate that the recursive CGS algorithm is effective in reducing the execution time of QR decomposition.
Future works
Improve scalability:
  use a different data distribution scheme.
Develop an algorithm to compute the optimal blocking size NB.