RWTH Aachen :: High Performance Matrix Computation
High Performance Linear Algebra on Data Parallel Co-Processors II
[email protected] University of Toronto / Technische Universität Berlin
Data parallel co-processors
[Figure: block diagram. Components shown: Host, Input Assembler, Thread Scheduler, eight multiprocessors each with shared memory (SMEM) and an L1 cache, a common L2 cache, and Global Memory.]
Memory latency
[Figure and table: content not recoverable from the extraction.]
Source: Volkov, V., and J. W. Demmel. Benchmarking GPUs to Tune Dense Linear Algebra. 2008.
Linear algebra libraries
• CUBLAS: BLAS implementation for CUDA.
• CUSparse: Sparse matrix operations.
• CUFFT: FFT implementation for CUDA.
• CULA: LAPACK implementation for CUDA.
• CURand: (Quasi) random number generation.
• NAG GPU library: NAG for GPUs.
• Thrust: STL style library for common functionality (sort, reduce, ...).
BLAS performance
[Figure: BLAS performance plot, not recoverable from the extraction.]
Source: http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_CUDA_Tools_Cohen.pdf
LU factorization using BLAS
[Figure: performance plot, not recoverable from the extraction.]
Source: Volkov, V., and J. W. Demmel. Benchmarking GPUs to Tune Dense Linear Algebra. 2008.
CUFFT performance
[Figure: CUFFT performance plot, not recoverable from the extraction.]
Source: http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_CUDA_Tools_Cohen.pdf
CUSparse performance
[Figure: CUSparse performance plots, not recoverable from the extraction.]
Source: http://www.nvidia.com/content/GTC-2010/pdfs/2070_GTC2010.pdf
GEMM
• General weighted matrix-matrix multiplication: C = αAB + βC.
• The algorithm has to be blocked to avoid being memory bandwidth limited.
• Many implementation details have to be aligned with the details of the hardware architecture ...
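To make the blocking idea concrete, here is a minimal CPU reference sketch of a tiled GEMM. It is not the tuned GPU kernel discussed in the talk: the function name gemm_blocked, the row-major square layout, and the tile size TILE are assumptions made for illustration. On a GPU, one tile would typically be staged in shared memory or registers instead of relying on the cache.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile size; a real kernel would tune this per device.
constexpr std::size_t TILE = 16;

// Computes C = alpha * A * B + beta * C for square n x n row-major matrices.
// The loops are tiled so that a TILE x TILE block of A, B and C is reused
// while it still sits in fast memory (cache here, shared memory on a GPU).
void gemm_blocked(std::size_t n, float alpha, const std::vector<float>& A,
                  const std::vector<float>& B, float beta,
                  std::vector<float>& C) {
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
    // Scale C once up front so the tiled loops only accumulate.
    for (float& c : C) c *= beta;
    for (std::size_t i0 = 0; i0 < n; i0 += TILE)
        for (std::size_t k0 = 0; k0 < n; k0 += TILE)
            for (std::size_t j0 = 0; j0 < n; j0 += TILE) {
                const std::size_t i1 = std::min(i0 + TILE, n);
                const std::size_t k1 = std::min(k0 + TILE, n);
                const std::size_t j1 = std::min(j0 + TILE, n);
                for (std::size_t i = i0; i < i1; ++i)
                    for (std::size_t k = k0; k < k1; ++k) {
                        const float a = alpha * A[i * n + k];
                        for (std::size_t j = j0; j < j1; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

Each element of A and B loaded into a tile is used TILE times before it is evicted, which is what reduces the pressure on memory bandwidth compared with the naive triple loop.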
GEMM
[Figure: GEMM performance as fraction of peak (0% to 70%) versus matrix size N (64 to 4096). Labeled levels: multiply and add with an operand in shared memory (66%), our implementation (60%), CUBLAS 1.1 (37%).]
Source: http://www.cs.berkeley.edu/~volkov/volkov08-sc08talk.pdf
GEMM
[Figure: performance plot, not recoverable from the extraction.]
Source: Volkov, V., and J. W. Demmel. Benchmarking GPUs to Tune Dense Linear Algebra. 2008.
Bisection for eigenvalue spectrum
• Sought: all eigenvalues of a tridiagonal, symmetric input matrix T.
• Ingredients: the Gerschgorin interval G(T) = [l, u], which contains all eigenvalues of T, and Sturm counts C(y) = k, the number of eigenvalues of T smaller than y.
• Recipe: bracket eigenvalues starting from the Gerschgorin interval using Sturm counts.
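The Gerschgorin interval can be computed on the host in a single pass over the matrix. The sketch below assumes the diagonal is stored in a[0..n-1] and the off-diagonal in b[0..n-2], with b[i] coupling rows i and i+1; this storage convention and the function name gerschgorin are illustrative assumptions, not taken from the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Gerschgorin interval [l, u] of a symmetric tridiagonal matrix with
// diagonal a[0..n-1] and off-diagonal b[0..n-2] (b[i] couples rows i, i+1).
// Every eigenvalue lies in at least one disc a[i] +/- r_i, so the union of
// all discs on the real line bounds the whole spectrum.
std::pair<float, float> gerschgorin(const std::vector<float>& a,
                                    const std::vector<float>& b) {
    const std::size_t n = a.size();
    assert(n > 0 && b.size() + 1 == n);
    float l = a[0], u = a[0];
    for (std::size_t i = 0; i < n; ++i) {
        // Row radius: sum of absolute off-diagonal entries in row i.
        float r = 0.0f;
        if (i > 0) r += std::fabs(b[i - 1]);
        if (i + 1 < n) r += std::fabs(b[i]);
        l = std::min(l, a[i] - r);
        u = std::max(u, a[i] + r);
    }
    return {l, u};
}
```

For the 3 x 3 matrix with diagonal (2, 2, 2) and off-diagonal (-1, -1) this yields [0, 4], which encloses the eigenvalues 2 - √2, 2 and 2 + √2.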
[Figure: bisection tree, built up over several slides. The root is the Gerschgorin interval [l, u] with Sturm counts 0 and n at its ends. Each level splits the remaining intervals at their midpoints and labels them with the Sturm counts there: m1 with C(m1) on the first level; m21, m22, m23 on the second; m31, m32 on the third; down to leaf intervals [m_lk, m_lk+1] with counts C(m_lk) and C(m_lk+1).]
Bisection for eigenvalue spectrum
Count = 0;
d = 1;
for i = 1 to n
    d = a_i - x - (b_{i-1} * b_{i-1}) / d;   // with b_0 = 0
    if d < 0
        Count += 1;
    endif
endfor
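The pseudocode above translates almost line by line into C++. The sketch assumes zero-based arrays with the diagonal in a[0..n-1] and the off-diagonal in b[0..n-2]; the name sturm_count matches the kernel snippets later in the deck, but the signature here is an assumption. It relies on IEEE floating-point semantics if d becomes zero and adds no explicit guard or scaling.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the number of eigenvalues smaller than x of the symmetric
// tridiagonal matrix with diagonal a[0..n-1] and off-diagonal b[0..n-2].
int sturm_count(const std::vector<float>& a, const std::vector<float>& b,
                float x) {
    const std::size_t n = a.size();
    assert(n > 0 && b.size() + 1 == n);
    int count = 0;
    float d = 1.0f;
    for (std::size_t i = 0; i < n; ++i) {
        // The first row has no off-diagonal element to its left (b_0 = 0).
        const float off = (i == 0) ? 0.0f : b[i - 1];
        d = a[i] - x - (off * off) / d;
        if (d < 0.0f) ++count;  // one more eigenvalue lies below x
    }
    return count;
}
```

The loop is inherently sequential in i, which is why the parallelism in the GPU version comes from evaluating many values of x (one per interval) at the same time.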
[Figure: the same bisection tree, built up again level by level, with the levels annotated 1X, 2X, 3X, ..., ≈ nX from the root down to the leaves.]
37
[email protected] University of Toronto / Technische Universität Berlin
Bisection for eigenvalue spectrum
• Strategy: one thread per tree node (or interval).
– Tree is processed level by level.
![Page 38: RWTH Aachen :: High Performance Matrix Computation](https://reader033.vdocuments.us/reader033/viewer/2022061614/62a073f14e8dbc3e74157e4e/html5/thumbnails/38.jpg)
38
[email protected] University of Toronto / Technische Universität Berlin
Bisection for eigenvalue spectrum
• Strategy: one thread per tree node (or interval).
– Tree is processed level by level.
• Implementation “details”?
– Memory management? Can and should we use shared memory?
– How can we handle large matrices?
Bisection for eigenvalue spectrum
1. Compute Gerschgorin interval.
2. Move data to co-processor.
– Matrix data (a_i, b_i).
– Gerschgorin interval.
3. Launch kernel.
Bisection for eigenvalue spectrum
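// move matrix data to shared mem
__shared__ float a[MAX_MAT_SIZE];
__shared__ float b[MAX_MAT_SIZE];
// strided copy: thread tid loads elements tid, tid+num_threads, ...
for( int i=tid; i<mat_size; i+=num_threads) {
    a[i] = a_g[i];
    b[i] = b_g[i];
}
// wait until all threads have finished loading
__syncthreads();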
Bisection for eigenvalue spectrum
struct SharedMem {
    float left[MAX_THREADS_BLOCK];
    float right[MAX_THREADS_BLOCK];
    short left_count[MAX_THREADS_BLOCK];
    short right_count[MAX_THREADS_BLOCK];
    short compact[MAX_THREADS_BLOCK+1];
};
Bisection for eigenvalue spectrum
// move Gerschgorin data to shared mem
__shared__ SharedMem smem;
if( 0 == tid) {
smem.left[0] = lg;
smem.right[0] = ug;
smem.left_count[0] = 0;
smem.right_count[0] = mat_size+1;
}
Bisection for eigenvalue spectrum
// compute midpoint of interval
// compute Sturm count for midpoint
// write intervals to shared memory
smem.left[tid] = left;
smem.right[tid] = mid;
smem.left[tid+1] = mid;
smem.right[tid+1] = right;
...
}
Bisection for eigenvalue spectrum
if( tid < num_active_threads) {
float left = smem.left[tid];
float right = smem.right[tid];
...
// compute midpoint of interval
float mid = midpoint( left, right);
// compute Sturm count for midpoint
mid_count = sturm_count( left, right, mid);
Bisection for eigenvalue spectrum
// compute Sturm count for midpoint
mid_count = sturm_count( left, right, mid);
// write intervals to shared memory
__syncthreads();
smem.left[2*tid] = left;
smem.right[2*tid] = mid;
smem.left[2*tid+1] = mid;
smem.right[2*tid+1] = right;
num_intervals *= 2;
}
Interlude: prefix sum (scan)
• Swiss army knife of data-parallel programming.
• Sort, max, min, reduce, ...
• Divide-and-Conquer strategy.
– Traverse tree first bottom up and then top down.
– Bottom up: gather information.
– Top down: distribute information.
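The bottom-up and top-down traversal can be sketched sequentially as an in-place exclusive prefix sum. This is the widely used work-efficient tree scan, shown here as an assumption about what the slide refers to; the function name exclusive_scan_tree and the power-of-two size restriction are choices made for this sketch. On a GPU, the iterations of each inner loop would run in parallel with a barrier between tree levels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place exclusive prefix sum; data.size() must be a power of two.
void exclusive_scan_tree(std::vector<int>& data) {
    const std::size_t n = data.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    // Bottom up: each inner node gathers the sum of its subtree.
    for (std::size_t stride = 1; stride < n; stride *= 2)
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride)
            data[i] += data[i - stride];
    // Top down: each node distributes the prefix of its subtree to its
    // children; the root starts with the identity element.
    data[n - 1] = 0;
    for (std::size_t stride = n / 2; stride >= 1; stride /= 2)
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            const int left = data[i - stride];
            data[i - stride] = data[i];
            data[i] += left;
        }
}
```

For the input (3, 1, 7, 0, 4, 1, 6, 3) the result is (0, 3, 4, 11, 11, 15, 16, 22): each output element is the sum of all inputs to its left.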
Bisection for eigenvalue spectrum
// write intervals to shared memory
__syncthreads();
smem.left[2*tid] = left;
smem.right[2*tid] = mid;
smem.left[2*tid+1] = mid;
smem.right[2*tid+1] = right;
// compact intervals in shared memory
...
num_intervals = smem.compact[num_intervals];
}
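The elided compaction step can be illustrated with a sequential sketch: an exclusive scan over keep flags yields the output slot of each surviving interval, and the last scan entry is the new interval count, analogous to reading smem.compact[num_intervals] above. The Interval struct, the function name compact, and the keep criterion (an interval survives if the Sturm counts at its ends differ) are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Interval {
    float left, right;
    int left_count, right_count;  // Sturm counts at the two ends
};

// Keeps only the intervals that still contain at least one eigenvalue.
std::vector<Interval> compact(const std::vector<Interval>& in) {
    const std::size_t n = in.size();
    // compact_idx[i] = number of kept intervals before i (exclusive scan of
    // the keep flags); compact_idx[n] is the new interval count.
    std::vector<std::size_t> compact_idx(n + 1, 0);
    for (std::size_t i = 0; i < n; ++i) {
        assert(in[i].right_count >= in[i].left_count);
        const bool keep = in[i].right_count > in[i].left_count;
        compact_idx[i + 1] = compact_idx[i] + (keep ? 1 : 0);
    }
    std::vector<Interval> out(compact_idx[n]);
    // Scatter: on a GPU each thread would do this for its own interval.
    for (std::size_t i = 0; i < n; ++i)
        if (in[i].right_count > in[i].left_count) out[compact_idx[i]] = in[i];
    return out;
}
```

Because every surviving interval gets a unique slot from the scan, the scatter needs no atomics and preserves the left-to-right order of the intervals.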
Bisection for eigenvalue spectrum
1. Read data from global to shared memory.
2. Until all intervals are converged:
– Read interval into registers.
– Bisect and compute Sturm count.
– Write to shared memory.
– Compact intervals.
3. Write eigenvalues to global memory.
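The loop in step 2 can be sketched as a sequential, level-by-level reference. The sketch is generic over the count function so that it stays self-contained; bisect_all, the Interval struct, the tolerance parameter tol, and the convergence test are assumptions, not the kernel from the talk. Each iteration of the inner loop corresponds to the work of one thread, and dropping empty child intervals plays the role of the compaction step.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

struct Interval {
    float left, right;
    int left_count, right_count;
};

// Level-by-level bisection. `count(x)` must return the number of eigenvalues
// smaller than x (the Sturm count); [l, u] must contain all n eigenvalues.
std::vector<float> bisect_all(float l, float u, int n,
                              const std::function<int(float)>& count,
                              float tol) {
    assert(l < u && n > 0 && tol > 0.0f);
    std::vector<Interval> level = {{l, u, 0, n}};
    std::vector<float> eigenvalues;
    while (!level.empty()) {
        std::vector<Interval> next;
        // On a GPU, each interval of the level is handled by one thread.
        for (const Interval& iv : level) {
            const float mid = 0.5f * (iv.left + iv.right);
            // Converged if narrow enough or if float precision is exhausted.
            if (iv.right - iv.left < tol || mid <= iv.left || mid >= iv.right) {
                // An interval may hold several (near-)equal eigenvalues.
                for (int k = iv.left_count; k < iv.right_count; ++k)
                    eigenvalues.push_back(mid);
                continue;
            }
            const int mid_count = count(mid);
            // Keep only non-empty children (the compaction step).
            if (mid_count > iv.left_count)
                next.push_back({iv.left, mid, iv.left_count, mid_count});
            if (iv.right_count > mid_count)
                next.push_back({mid, iv.right, mid_count, iv.right_count});
        }
        level = next;
    }
    std::sort(eigenvalues.begin(), eigenvalues.end());
    return eigenvalues;
}
```

In practice the count argument would be a Sturm count bound to the matrix data, and l and u would come from the Gerschgorin interval computed on the host.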
Bisection for eigenvalue spectrum
for( j=0; j<(mat_size/blockDim.x); ++j){
// read data with all threads in the block
// make sure the shared memory is not used
__syncthreads();
// read data for current block
if( addr < mat_size) {
d_data[tid] = g_d[addr];
lld_data[tid] = g_lld[addr];
}
__syncthreads();
Bisection for eigenvalue spectrum
// read data for current block
if( addr < mat_size) {
a[tid] = a_g[addr];
b[tid] = b_g[addr];
}
__syncthreads();
// for all active intervals
if(tid < num_intervals) {
// Sturm count computation for current block
...
Bisection for eigenvalue spectrum
[Figure: execution time in ms (0 to 18000) versus matrix size (0 to 4000). Series: sstemr mean, min, max and mrrr_dp mean, min, max.]
How to get started?
• http://developer.nvidia.com/cuda/
– CUDA 3.0 has device emulation mode.
– SDK has many, many code examples.
– NVIDIA linear algebra libraries.
• http://www.cs.berkeley.edu/~volkov/
– Performance analysis of linear algebra on data parallel co-processors.
Conclusion
• Linear algebra on data parallel co-processors can be significantly more efficient than on CPUs.
• But this requires architecture-specific design and optimizations.
• For many algorithms it is still unclear how to implement them efficiently on a data parallel co-processor.