parallel iterative solution of the hermite collocation equations on gpus
TRANSCRIPT
Parallel Iterative solution of the
Hermite Collocation Equations
on GPUs
Emmanuel N. Mathioudakis
Co-Authors : Elena Papadopoulou – Yiannis Saridakis – Nikolaos Vilanakis
TECHNICAL UNIVERSITY OF CRETE
DEPARTMENT OF SCIENCES
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
73100 CHANIA - CRETE - GREECE
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Talk Overview
Hermite Collocation for elliptic BVPs &
Derivation of the Collocation linear system
Development of a parallel algorithm for the
Schur Complement method on Shared
Memory Parallel Architectures
Parallel implementation on multicore computers
with GPUs
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
( , ) ( , ) , ( , )
( , ) ( , ) , ( , )
u x y f x y x y
u x y g x y x y
L
BBV
P
( , ) ( , ) 0 , ( , ):
( , ) ( , ) 0 , (,
, )
y y yx x xi j i j i j
y y yx x xi j i j i j
ai j
u f
u g
L
B
Hermite Collocation Method
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Hermite Collocation Method…
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Hermite Collocation Method…
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Red – Black Collocation Linear system
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Red – Black Collocation Linear system
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Red – Black Collocation Linear system
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Red – Black Collocation Linear system
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Red – Black Collocation Linear system
with 0
( , ) ( , ) ( , ) , ( , )
( , ) ( , ) , ( , )
u x y u x y f x y x y
u x y g x y x y
2M
od
el
Pro
ble
m
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Helmholtz Collocation
Matrix
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Red – Black Collocation Linear system
The collocation matrix is large, sparse and enjoys no pleasant
properties (e.g. symmetric, definite)
Iterative
+
Parallel
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Iterative Solution
with
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Iterative Solution
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Eigenvalues of Collocation matrix
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Schur Complement Iterative Solution
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Parallel Iterative Solution of Collocation Linear system
on Shared Memory Architectures
Uniform Load Balancing between core threads
Minimal Idle Cycles of core threads
Minimal Communication
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
case of ns = 2p
( )1Rx ( )
2Rx ( )
3Rx ( )
4Rx
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
case of ns = 2p
( )1
Bp
x
( )2
Bp
x
( )3
Bp
x
( )4
Bp
x
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
case of ns = 2p
V1
( )1Rx ( )
2Rx ( )
3Rx ( )
4Rx( )
1B
px
( )
2B
px
( )
3B
px
( )4
Bp
x
V2 V3 V4 V5 V6 V7 V8
V1 V2 V3 V4 V5
Mapping into a fixed size Architecture of N Cores
case of k = 2p/N even 2k virtual threads
l=(j-1)k+1,…,jk
l=2p+(j-1)k+1,…,2p+jk
j=1,…,N
( )Bl
x
( )Rl
x
V2 . . .
Pj
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Parallel Schur Complement Iterative Solution
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Parallel BiCGSTAB
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Parallel BiCGSTAB
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
The Dirichlet Helmholtz Problem
2100( 0.1)2with
( , ) 10 ( ) ( ) , ( , ) [0,1] [0,1]
( ) ( ) x
u x y x y x y
x x x e
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
HP SL390s
6 core [email protected]
24GB memory
Oracle Linux 6.3 x64
PGI 13.5 Fortran
PCI-e gen2 x16
HP SL390s – Tesla M2070 GPUs
+ 2 x
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Realization on HP SL390s Tesla GPU machine Iterations / Error measurements
ns BiCGSTAB
Iterations || b – Axn||2
256 294 6.06e-11
512 589 2.85e-11
1024 1161 1.39e-11
2048 3726 9.59e-12
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Realization on HP SL390s Tesla GPU machine Time measurements
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Realization on HP SL390s Tesla GPU machine Time measurements
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Realization on HP SL390s Tesla GPU machine Time measurements
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Realization on HP SL390s Tesla GPU machine Time measurements
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Realization on HP SL390s Tesla GPU machine Time measurements
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Realization on HP SL390s Tesla GPU machine Time measurements
Conclusions
• A new parallel algorithm implementing the Schur complement with BiCGSTAB iterative method for Hermite Collocation equations has been developed.
• The algorithm is realized on Shared Memory multi-core machines with GPUs .
• A performance acceleration of up to 30% is observed.
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY
Future work
• Design an efficient parallel Schur complement algorithm of the Hermite Collocation equations for Multiprocessor /Grid machines with GPUs.
TECHNICAL UNIVERSITY OF CRETE
APPLIED MATHEMATICS AND COMPUTERS LABORATORY