parareal acceleration of
TRANSCRIPT
Parareal Acceleration ofMatrix Multiplication
Toshiya Takami and Akira Nishida
Kyushu University, Japan
Aug. 30 - Sep. 2, 2011 ParCo2011, Ghent, Belgium
1
Contents
Introduction: Time-domain Decomposition
What is Parareal?
Parareal-in-Time Algorithm as a Perturbation
Application: Series by Matrix-Vector Multiplications
Convergence Property
Speed-up Ratio and Efficiency
Discussion: Applicability to Other Linear Calculations
Conclusion
2
Time Evolution
Time-evolution problems are widely solved in scientific
simulations described by discretized differential equations.
Parallel technique is usually applied through domain
decomposition in the space direction, where quantity
on the surface of each domain must be shared with its
neighbors.
On the other hand, efficient parallelism by the time-
domain decomposition seems difficult because of its
severe dependency on the previous state.
3
Time-domain Parallelism
Time-evolution is usually defined by strictly dependent
relations, which is difficult to be parallelized
‘’Parareal-in-Time’’ is one of the time-domain methods that
can be used in spite of such strict dependency.
x0
t = 0
・・・1 2
x1 x2 x3
3
F1 F2
・・・
・・・ xk−1 xk
k − 1 k
Fk−2F0 Fk−1F3 Fk−3Fk−4F4 F5
4 5 k − 2k − 3
xk−3 xk−2x4 x5
Domain Decomposition in Time Direction
xk+1 = Fk(xk)
J-L. Lions, Y. Maday, and G. Turinici, C. R. Acad. Sci., Ser. I: Math. 232, 661–668 (2001).
4
What is “Parareal”?
“Its main goal concerns real time problems, hence the
proposed terminology of ‘parareal’ algorithm.
“Parareal is not the first algorithm to propose the solution of
evolution problems in a time-parallel fashion”
Multiple Shooting Methods with Newton’s Iterations
Space-Time Multigrid Method
M. J. Gander and S. Vandewalle, SIAM J. Sci. Comput. 29, 556-578 (2007).
J-L. Lions, Y. Maday, and G. Turinici, C. R. Acad. Sci., Ser. I: Math. 232, 661–668 (2001).
5
Parareal-in-Time for Scientific Applications
Since the first proposal by J-L. Lion, et al., various time-
evolution problems has been analyzed.
PDE with a fluid-structure coupling:
Molecular Dynamics with Quantum Iteraction:
Quantum Control Problem:
Symplectic Integrator:
L. Baffico, S. Bernard, Y. Maday, G. Turinici, and G. Zérah,
Phys. Rev. E66, 057701 (2002).
G. Bal and Q. Wu, LNCSE 60, 401–408 (Springer, 2008).
C. Farhat and M. Chandesris, Int. J. Numer. Meth. Engng. 58, 1397-1434 (2003).
Y. Maday, G. Turinici, Int. J. Quant. Chem. 93, 223-228 (2003).
6
Sequence Defined byParareal-in-Time
Instead of the sequence defined by ,
consider an approximated one by an iterative relation,
where is a coarse evolution that approximates the
original one , and r is the order of the approximation.
The exact operators can be calculated in parallel.
In the limit , it has been shown that the approximate
sequence converges to the original one:
which is called the Parareal-in-Time method.
{xk}{x(r)
k }
x(r+1)k+1 = Gk(x(r+1)
k ) + Fk(x(r)k )− Gk(x(r)
k )
Gk(·)Fk(·)
r →∞{x(r)
k }→ {xk}
xk+1 = Fk(xk)
J-L. Lions, Y. Maday, and G. Turinici, C. R. Acad. Sci., Ser. I: Math. 232, 661–668 (2001).
Fk(·)
7
Parareal-in-Timeas a Perturbation (1)
While convergence of the Parareal-in-Time method is
already shown as the multiple-shooting method, a simple
explanation is given here using a linear problem defined by
bounded operators:
consider a sequence of vectors
defined by
where we assume and are bounded.
Convergence of the sequence above can be explained as a
perturbation expansion and the spectral radius of operators.
{xk}xk+1 = Fkxk
= [Gk + (Fk − Gk)]xk
Gk Fk − Gk
8
Parareal-in-Timeas a Perturbation (2)
For linear problems, the operator is separated into a
coarse operator and its correction ,
If we introduce as r-th order approximation of
Thus, the Parareal-in-Time calculations for linear problems
can be analyzed as a higher order perturbation.
xk = [Gk−1 + (Fk−1 − Gk−1)]xk−1 =k−1∏
j=0
[Gj + (Fj − Gj)]x0
Fj
Gj Fj − Gj
x(r+1)k+1 = [Gk + (Fk − Gk)]x(r+1)
k
= Gkx(r+1)k + (Fk − Gk)x(r+1)
k
≈ Gkx(r+1)k + (Fk − Gk)x(r)
k
{x(r)k } {xk}
9
How to Calculatethe Parareal Sequence
The parallel procedure of this calculation is:
are parallel in k, but are sequantial.
k-direction is first, r-direction is sequential
x0 x1 x2 x3 xk−2 xk−1 xk
r = 1
r = 2
r = 3
G
F − G
x(3)k
G
F − G
G
F − GG
F − G
G
F − GG
F − G
xk−3
F − G
G
F − G
G
F − G
G
F − G
G
F − G
F − GGGG
G
G
r = 0
x(2)k−1
x(1)k−2
...
...
...
...
...G
F − GG
F − GG
F − GG
x4
Order of Perturbation
Time
GkFk − Gk
10
Series defined by Matrix- Vector Multiplication
A linear time-evolution defined by matrix multiplications,
can be represented as an expanded sum in the order of .
The third order approximation:
ε
x(3)k =
[Gk + ε
∑(Terms with a F )
+ε2∑
(Terms with two F ’s)
+ε3∑
(Terms with three F ’s)]x0
xk = (G + εF )xk−1 = (G + εF )kx0
11
Residual Errorsfrom Spectral Radius
Errors can be estimated from the rest of expanded terms:
Error from the exact sequence is bounded by
By the Stirling’s formula, the magnitude of r-th term is
k∑
j=r+1
εj∑
(Terms with F j)x0
∣∣∣x(r)k − x0
∣∣∣ρ(Gk) |x0| ≤
k∑
j=r+1
(kj
) [ερ(F )ρ(G)
]j
(kr
) [ερ(F )ρ(G)
]r
=
√k
2π(k − r)r
(ek
r
)r [ερ(F )ρ(G)
]r
12
Example 1:Real-Symmetric Matrix
Consider a sequence of the real vector defined by a real-
symmetrix matrix , where G is diagonal and F is a
random matrix with elements satisfying
N=1024, !=0.01 and in this case,
we can set
dashed: error of (r+1)-th term
(a)
Norm
ali
zed
Err
or
Length of Sequence
-19
-17
-15
-13
-11
-9
-7
-5
10
10
10
10
10
10
10
10
r=5
r=10
r=15
0 50 100 150 200
{xk}G + εF
〈|Fii|2〉 =1
2N, 〈|Fij |2〉 =
14N
ρ(F ) ≈ ρ(G) ≈ 1
-1 1
eigenvalues of F
13
Example 2:Unitary Matrix
Unitary time-evolution is often used in quantum mechanics:
where H is a Hermitian matrix satisfying
N=1024, !=0.01 and in this case,
we can set
dashed: error of (r+1)-th term
(b)
No
rma
lize
d E
rro
r
Length of Sequence
-13
-12
-11
-10
-9
-8
-7
-6
-5
r=10
r=20
r=30
r=40
r=50
10
10
10
10
10
10
10
10
10
0 200 400 600 800 1000
xk = exp[∆t
i! H
]xk−1 ≈ (I − iεH)k x0
-1 1
eigenvalues of H
〈|Hii|2〉 = 〈|Hij |2〉 =1
4N
ρ(H) ≈ 1
14
Implementation by MPI
We can implement the Parareal-in-Time iteration on parallel computers by the use of MPI_Send/Recv.
The original number of calculations, , is compared with the parareal one, .
G G G
G G G
G G G
G G
G
G
G
G
F
G
Process 1:
Process 2:
Process 3:
Process 4:
Iterate r times
P resources
KTg + Tc(P − 1) rK(Tf + Tg)/P
F F
F F F
F F F
F F FComputational Time
K(Tg + Tf )KTg + Tc(P − 1) + rK(Tg + Tf )/P
E. Aubanel, Parallel Computing 37, 172-182 (2011).
15
Estimation ofSpeed-up Ratio
Speed-up Ratio will be represented in the function,
where and .
Property of the Speed-up:
Max. Efficiency: small P
Max. Speed-up: Sp
eed
up
Rati
o
Total Number of Ranks (P)
Limit of Speedup
K>>P>>1 K~P
Max
. E
ffic
iency
(1/r
)
0
1/T
0 K
S(r, K, P ) =K(Tg + Tf )
KTg + Tc(P − 1) +rK(Tf + Tg)
P
=P
r + TP +P (P − 1)
Kt
T ≡ Tg/(Tg + Tf ) t ≡ Tc/(Tg + Tf )
K ! P ! 1
16
Parallel Performanceon a Cluster Machine
The performance of the Parareal method is studied on a cluster with 768 cores (Westmere 12 cores x 64 nodes).
When we increase the order r of the parareal iterations.
Large matrix: almost linear
Small matrix: not linear
(512 parallel execution)
(a)
Tim
e (
sec)
Parareal Iterations
Normal Method
Normal Method
Parareal (N=1024)
Parareal (N=8192)
1
2
5
10
20
50
100
200
500
1000
0 10 20 30 40 50
17
Speed-up Ratio
The dashed line represents
Actual speed-up is almost half of the value by S(r,K,P).
Efficiency is very low, which is bounded by 1/r.
N=8192: linear speed-up
N=1024: saturated
(b)
Sp
eed
up
Rati
o
Total Number of Rank
Linea
r Spee
dup
r=10
r=20
r=40
N=8192
N=1024
0.2
0.5
1
2
5
10
20
50
100
1 10 100 1000
S(r, K, P ) =P
r + TP +P (P − 1)
Kt
情報処理学会研究報告IPSJ SIG Technical Report
(a)
Tim
e (
sec)
Parareal Iterations
Normal Method
Normal Method
Parareal (N=1024)
Parareal (N=8192)
1
2
5
10
20
50
100
200
500
1000
0 10 20 30 40 50
(b)
Sp
eed
up
Rati
o
Total Number of Rank
Linea
r Spee
dup
r=10
r=20
r=40
N=8192
N=1024
0.2
0.5
1
2
5
10
20
50
100
1 10 100 1000
図 6 Parareal-in-Time 法と通常計算の速度比較: (a) r 次計算を 512 並列で実行した場合の経過時間,(b) 並列計算でのスピードアップ比.黒のマークは小規模行列の問題 (1024 × 1024),白のマークは比較的大規模な行列の問題 (8192 × 8192) を表す.四角,丸,三角は,それぞれ,r = 10, 20, 40 次の計算に対応する.
表 1 並列計算によるスピードアップ比
P 16 32 64 128 256 512 768
N = 1024 r = 10 0.71 0.88 1.27 1.61 1.62 1.86 1.73
r = 20 0.36 0.61 0.84 1.16 1.38 1.55 1.49
r = 40 0.19 0.35 0.56 0.80 0.89 1.25 1.31
N = 8192 r = 10 0.69 1.35 2.70 5.41 10.68 21.38 30.97
r = 20 0.36 0.71 1.42 2.83 5.56 11.20 16.63
r = 40 0.19 0.37 0.73 1.44 2.80 5.82 8.52
計算時間の比 T は,G と F の計算量を比べることで見積もることは可能だが,実際の実行時間は他の様々な要因から予測が難しい場合もある.この手法が有効か,そうでないかを決定するには,データ量などに応じた現実の計算性能を正しく考慮する必要がある.
4.2 並列計算でのスピードアップ比行列演算を Parareal-in-Time法で並列化した場合の経過時間を,768 CPU (12 cores ×
72 nodes) を持つ並列クラスタで測定した.結果を図 6 に示すが,ここでは,量子力学の時間発展をモデル化したユニタリ行列 (第 3.2 節) を対象とし,長さが 1536 の複素ベクトル列 (K = 1536) の計算に対して測定した.図 6(a) では,512 並列実行時の計算時間を,Parareal-in-Time法の次数に対してプロットした.水平の線は,並列化していない従来法での時間を表す.N = 8192 の場合,次数 r = 5 から 50 に対して,14.7 秒から 130.1 秒の計算時間になっており,図ではわかりにくいが r にほぼ線形の関係にある.また,1 CPUで
Large MatrixSmall Matrix
Long Time Series
Short Time Series
Distributed-parallelLinear Library
Multi-threadLinear Library
EigenvalueProblem Parareal-in-Time
Acceleration
図 7 Target problem sizes of the parareal-in-time and usual schemes.
の従来法での計算が 602.5 秒であるので,スピードアップは 41.0 倍から 4.6 倍である.一方,小さい行列 (N = 1024)に対して,スピードアップ比は 2.0 程度と小さくなっている.図 6(b)には,次数 r = 10, 20, 40 に対するスピードアップ比を示す.フラットMPIによる分散並列化で,16プロセスから 768プロセスまでの結果である.N = 8192 の場合に,t = 0 と通信の時間を無視した時の,式 (21)の値を図中に点線で示すが,実測値との乖離が大きい.また,表 1に,これらのデータに対応するスピードアップ比の数値を示しておく.スピードアップ比は並列ランク数に比べて低いが,N = 8192 の場合はリニアに伸びていることがわかる.つまり,特定の次数のParareal-in-Time法は,一定以上大きい行列の計算に対してはかなり高効率で並列実行できることがわかる.しかし,N = 1024 の場合にはリニアな伸びは見られない.これは,式 (21)中の t に相当する通信時間の割合が,N = 1024
の時は無視できない事によると思われる.4.3 Parareal-in-Time法の適用範囲上記の測定と解析の結果,Parareal-in-Time法が効率的に適用できる行列のサイズと時系列の長さについて,大まかに図 7に示すような分類が出来る.通常の線形ライブラリでは,行列の各要素に対する演算の独立性を利用して,複数のスレッド,あるいは,複数のプロセスで並列化されている.行列の対角化が O(N3) 程度の演算であることより,小さな行列演算で定義された非常に長い時系列を対象とする場合は,固有状態表現を通して解かれるべきである.大きな行列に対しては O(N3) の計算量は大きすぎるため,Parareal-in-Time
法による時間方向並列化は,通常の要素方向の並列化法と合わせて解かれるべきである.
6 c© 2011 Information Processing Society of Japan
18
Applicability to General Iterative Methods
What types of problems can be accelerated by the Parareal?
Since the usual row/column parallelism is effectively used in linear libraries, this algorithm should be applied to
Long Time Series
Large Matrices
How about other algorithm?
K ! P ! 1
cost(G) cost(F )
19
Parareal SOR (preliminary result 1)
In order to solve a linear equation ,
the SOR (successive over-relaxation) method is used:
The parareal accelerated versions are
Model 1: is defined as an identity operator:
Model 2: is defined by ! = 0 :
[D + ε(L + U)]x = b
xk+1 = (1− w)xk + w(D + εU)−1[b− εLxk]
x(r+1)k+1 = x(r+1)
k + w{
(D + εU)−1[b− εLx(r)k ]− x(r)
k
}G
Gx(r+1)
k+1 = (1− w)x(r+1)k + wD−1b
+w{
(D + εU)−1[b− εLx(r)k ]−D−1b
}
= (1− w)x(r+1)k + w(D + εU)−1[b− εLx(r)
k ]
20
Parareal SOR (preliminary result 2)
The parareal converges to the exact sequence only for w<1, while w is usually chosen 1<w<2 (graphs below: w=0.2).
Applicable only for the case of slow convergence?
Difference
Iterations
-8
-7
-6
-5
-4
-3
-2
-1
0
10
10
10
10
10
10
10
10
10
r=10 r=20 r=30 r=40 r=50
0 20 40 60 80 100
Model 1 Model 2
Difference
Iterations
-8
-7
-6
-5
-4
-3
-2
-1
0
10
10
10
10
10
10
10
10
10
r=10
r=20
r=30
r=40
0 50 100 150 200
21
Conclusion
The Parareal-in-Time method for linear evolution problems
was analyzed as a perturbation expansion.
Remaining errors were obtained through spectral radius
of the coarse and fine transformation.
A simple linear transformation (matrix multiplication) was
accelerated by the Parareal-in-Time algorithm.
Convergence properties were analyzed.
The speed-ups were compared to the theoretical estimate.
22
Further Studies
How to include this algorithm into parallel linear libraries.
interfaces, implementations, tuning, ...
hybrid parallel implementations with the parareal-in-time
and the usual row/column parallelisms
How wide is this algorithm applied to linear calculations
such as iterative solvers of linear equations: SOR, CG, etc?
Application of this method in massively parallel computers
such as K-computer, Kobe, Japan.
23
Thank you for your attention.
24