
Accelerated kinetic Monte Carlo methods: Hierarchical Parallel Algorithms & Coarse-Graining

Markos Katsoulakis

University of Massachusetts & University of Crete

Funding: NSF-DMS, NSF-CMMI, U.S. DOE and E.C. FP7


Overview

1 Stochastic Lattice Systems & Applications

2 Kinetic Monte Carlo (KMC) methods

3 Coarse Graining (CG)

4 Hierarchical Parallel Algorithms

5 Benchmarks, examples and simulations


Stochastic Lattice Systems

Surface processes

Provide information for pattern formation, chemical reactions, phase transitions

$\Lambda_N = \frac{1}{N}\,\mathbb{Z}^d \cap [0, 1)^d$

Lattice size $N \gg 1$

Configurations $\sigma \in \Sigma_N := I^{\Lambda_N}$

$I = \{0, 1\}$ or $I = \{-1, 1\}$


Equilibrium Theory

Hamiltonian: $H_N(\sigma) = -\frac{1}{2}\sum_{x \neq y} J(x, y)\,\sigma(x)\sigma(y) + h \sum_x \sigma(x)$

- h: external field

- J: potential with interaction range L; $V : \mathbb{R} \to \mathbb{R}$ has compact support,
  $J(x - y) = \frac{1}{L}\, V\!\left(\frac{x - y}{L}\right)$.

Nearest-neighbor models (as truncations) and possibly combinations of short-/long-range interactions.

Potentials fitted to Molecular Dynamics simulations or data, e.g. Morse potentials.


Gibbs States

At the inverse temperature $\beta = \frac{1}{kT}$:

$\mu_{\Lambda,\beta}(\sigma = \sigma_0) = \frac{1}{Z_{\Lambda,\beta}}\, \exp\{-\beta H_N(\sigma_0)\}\, P_N(\sigma = \sigma_0)$

[Probability of the configuration $\sigma_0$]

Partition function: $Z_{\Lambda,\beta} = \sum_{\sigma_0} \exp\{-\beta H_N(\sigma_0)\}\, P_N(\sigma = \sigma_0)$

Prior distribution (no interactions, high temperature):

$P_N(\sigma = \sigma_0) = \prod_{x \in \Lambda} P(\sigma(x) = \sigma_0(x))$
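For intuition, the Gibbs measure above can be tabulated by brute force on a very small lattice, which is also a handy check for any sampler. The Python sketch below is an illustration (not code from the talk), assuming a flat single-site prior P_N so that it cancels in the normalization; the potential J is any callable.

import math
from itertools import product

def gibbs_weights(N, J, h, beta):
    """Normalized Gibbs weights mu(sigma) ~ exp(-beta*H_N(sigma)) for all 2^N
    configurations of a small 1-D lattice (brute force: use only for small N).

    H_N(sigma) = -1/2 * sum_{x != y} J(x - y) sigma(x) sigma(y) + h * sum_x sigma(x).
    A flat prior P_N is assumed, so it cancels in the normalization.
    """
    weights = {}
    for sigma in product((0, 1), repeat=N):
        H = h * sum(sigma)
        H -= 0.5 * sum(J(x - y) * sigma[x] * sigma[y]
                       for x in range(N) for y in range(N) if x != y)
        weights[sigma] = math.exp(-beta * H)
    Z = sum(weights.values())                  # partition function Z_{Lambda, beta}
    return {sigma: w / Z for sigma, w in weights.items()}

For example, gibbs_weights(4, lambda r: 1.0 if abs(r) == 1 else 0.0, h=0.0, beta=1.0) enumerates the 16 configurations of a 4-site nearest-neighbor chain.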


Kinetic Monte Carlo (KMC)

Dynamics

Adsorption/Desorption/Reactions/Surface diffusion

Markov Chain modeling with state space Σ = all configurations σ

Generator: $\partial_t\, \mathbb{E}[f(\sigma)] = \mathbb{E}\Big[\underbrace{\textstyle\sum_{x \in \Lambda} c(x, \sigma)\,[f(\sigma^x) - f(\sigma)]}_{L_N f(\sigma)}\Big]$.

Multi-site updates σx for most systems, e.g.

Suchorski et al ChemPhysChem (2010)
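To make the generator concrete, here is a minimal continuous-time KMC (SSA) step in Python, an illustration rather than code from the talk: draw an exponential residence time from the total rate λ(σ) and pick the site to update with probability proportional to its rate. The rate list and the single-site flip are placeholders for whichever dynamics c(x, σ) is being simulated.

import math
import random

def kmc_step(sigma, rates, rng=random):
    """One continuous-time KMC step.

    sigma : list of spins (0/1), one per lattice site
    rates : list of per-site rates c(x, sigma), precomputed by the caller
            (assumes at least one positive rate)
    Returns the updated configuration and the elapsed (exponential) time.
    """
    total = sum(rates)                       # lambda(sigma): total jump rate
    dt = -math.log(rng.random()) / total     # exponential residence time
    # pick a site x with probability c(x, sigma) / lambda(sigma)
    u, acc, x = rng.random() * total, 0.0, 0
    for x, r in enumerate(rates):
        acc += r
        if u <= acc:
            break
    sigma = list(sigma)
    sigma[x] = 1 - sigma[x]                  # single-site update sigma -> sigma^x
    return sigma, dt

For example, kmc_step([0, 1, 0, 1], [1.0, 0.5, 1.0, 0.5]) flips one site and returns the exponential waiting time.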



Kinetic Monte Carlo (KMC)

[Schematic: continuous-time Markov chain. From the present state x, the transition probability p(x, y) to each possible future state y, z, w does not depend on the past states x_1, ..., x_{k-1}; the residence time τ_x is exponentially distributed with rate λ(x).]


Kinetic Monte Carlo: Arrhenius dynamics

Transition rate to the gas phase: $c(x, \sigma) \sim d_0 \exp\big[-\beta U(x, \sigma)\big]$

Energy barrier: $U(x, \sigma) = \sum_{z \neq x} J(x - z)\,\sigma(z) - h$.

- Exponential clock: for each configuration $\sigma$,
  $\lambda(\sigma) = d_1 \big(N - \sum_x \sigma(x)\big) + \sum_x d_0\, \sigma(x)\, e^{-\beta U(x, \sigma)}$.

- Transition rates $\sigma \mapsto \sigma' = \sigma^x$:
  $c(x, \sigma) = \lambda(\sigma)\, p(\sigma, \sigma^x) = d_1 (1 - \sigma(x)) + d_0\, \sigma(x)\, e^{-\beta U(x, \sigma)}$
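A small sketch of these Arrhenius rates for a 1-D periodic lattice follows (again an illustration, not from the talk); the returned rates can be fed directly to a KMC step such as the one sketched earlier. The potential is passed as a list J[r] of couplings at distance r, assumed to have finite range shorter than half the lattice.

import math

def arrhenius_rates(sigma, J, h, beta, d0, d1):
    """Per-site adsorption/desorption rates on a 1-D periodic lattice.

    sigma : occupation numbers (0/1)
    J     : list of couplings, J[r-1] = interaction at distance r = 1..len(J),
            with len(J) < len(sigma) / 2 assumed
    Returns c(x, sigma) = d1*(1 - sigma(x)) + d0*sigma(x)*exp(-beta*U(x, sigma)).
    """
    N = len(sigma)
    rates = []
    for x in range(N):
        # energy barrier U(x, sigma) = sum_{z != x} J(x - z) sigma(z) - h
        U = -h
        for r, Jr in enumerate(J, start=1):
            U += Jr * (sigma[(x + r) % N] + sigma[(x - r) % N])
        rates.append(d1 * (1 - sigma[x]) + d0 * sigma[x] * math.exp(-beta * U))
    return rates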



Kinetic Monte Carlo: Arrhenius dynamics

References: Gillespie (chemical reactions); Bortz, Kalos, Lebowitz (Ising-type systems). The pseudo-algorithm suggests:

divide lattice sites x into classes of equal rates

pick a class using the relative weights

pick from each class a site x uniformly and update the configuration

However: for complex interactions (e.g. long-range),

$U(x, \sigma) = \sum_{z \neq x} J(x - z)\,\sigma(z) - h$

yields a very large number of classes, making the algorithm impractical.
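The class bookkeeping behind the BKL (n-fold way) idea can be sketched in a few lines: group sites by (discretized) rate, pick a class with weight proportional to rate times class size, then pick a member uniformly. For nearest-neighbor models only a handful of local environments occur, so the dictionary below stays small; with long-range J the rates vary almost continuously and the number of classes approaches the number of sites, which is the difficulty noted above. The rounding used to define a class is an illustrative choice, not part of the talk.

import random
from collections import defaultdict

def nfold_way_pick(rates, ndigits=12, rng=random):
    """Pick a site the n-fold-way: classes of (approximately) equal rate."""
    classes = defaultdict(list)            # rate value -> sites with that rate
    for x, r in enumerate(rates):
        if r > 0.0:
            classes[round(r, ndigits)].append(x)
    # choose a class with weight (rate * number of members)
    weights = {r: r * len(sites) for r, sites in classes.items()}
    u, acc = rng.random() * sum(weights.values()), 0.0
    for r, w in weights.items():
        acc += w
        if u <= acc:
            return rng.choice(classes[r])  # uniform pick inside the class
    raise ValueError("all rates are zero")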


Towards accelerating molecular simulations

Bottlenecks in molecular simulation of extended systems.

Cannot simulate realistic spatio-temporal scales:

$1\,\mu\mathrm{m}^2 \approx 10{,}000^2$ lattice

Difficult to carry out ”systems tasks” for engineering applications:

sensitivity analysis, optimization, control



Coarse-Graining: from microscopics to mesoscopics

Spatial acceleration methods:

[Figure: microscopic lattice coarsened to a coarse lattice on a non-uniform mesh; probability distribution of the average lattice coverage at t = 10 s, comparing KMC, multiscale CGMC, CGMC-MF and CGMC-QC. Refs: (1) Chatterjee et al., JCP 121, 11420 (2004); PRE 71, 026702 (2005); (2) Chatterjee and Vlachos, JCP 124, 064110 (2006).]

• Spatial adaptivity (1): error estimates guide mesh refinement
• Multiscale MC methods for high accuracy (2): higher-order closures, multigrid
• Multicomponent interacting systems

$c(x, t) \approx$ local average $v_N(x, t) = \frac{1}{|B_x|} \sum_{y \in B_x} \sigma_t(y)$, as $N \to \infty$.

"Closure": when does c = c(x, t) solve a PDE / stochastic PDE?

E.g. local mean-field limits; connections to Cahn-Hilliard (S)PDE for attractive interactions (J > 0).
Lebowitz, Orlandi, Presutti, JSP '91; Giacomin, Lebowitz, Phys. Rev. Lett. '96; K., Vlachos, Phys. Rev. Lett. '00; J. Chem. Phys. '03.
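The local average entering the closure question is just a windowed mean of occupation numbers; a one-function sketch for a 1-D periodic lattice follows (the neighborhood B_x and its half-width are illustrative choices, not from the talk).

def local_average(sigma, x, half_width):
    """Local empirical average v_N(x) = (1/|B_x|) * sum_{y in B_x} sigma(y)
    over the periodic neighborhood B_x = {x - half_width, ..., x + half_width}."""
    N = len(sigma)
    window = [sigma[(x + r) % N] for r in range(-half_width, half_width + 1)]
    return sum(window) / len(window)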



Hierarchical Coarse-Graining

1. Coarse-graining of polymers; DPD methods

Briels et al., J. Chem. Phys. '01; Doi et al., J. Chem. Phys. '02; Kremer et al., Macromolecules '06; Muller-Plathe, ChemPhysChem '00; Laaksonen et al., Soft Matter '03, etc.

Recent related work on simulating bio-membranes: Deserno et al., Nature '07.

2. Stochastic lattice dynamics / KMC

K., Majda, Vlachos, PNAS '03; K., Plechac, Sopasakis, SIAM Num. Anal. '06; Are, K., Plechac, Rey-Bellet, SIAM J. Sci. Comp. '08; Sinno et al., J. Chem. Phys. '08.



Coarse Graining in Lattice Systems

Divide the lattice of size N into M cells with q particles in each cell.

[Schematic: microscopic lattice partitioned into coarse cells of q sites each; adsorption, desorption and diffusion events; block spin $\eta(k) = \sum_{x \in C_k} \sigma(x)$.]

Coarse map:

$T : \Sigma_N \to \Sigma_M, \qquad \sigma \mapsto \eta := \Big\{\eta(k) = \sum_{x \in C_k} \sigma(x)\Big\}$

Renormalization Group map:

$H(\eta) = -\frac{1}{\beta} \log \int \exp\{-\beta H_N(\sigma)\}\, P(d\sigma|\eta)$
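The coarse map T is simply a block sum over the cells C_k; a minimal sketch for a 1-D lattice of N = Mq sites (illustrative, not code from the talk):

def coarse_map(sigma, q):
    """Block-spin map T: Sigma_N -> Sigma_M with cells C_k of q sites each.

    sigma : microscopic occupation numbers (0/1); len(sigma) must be M*q
    Returns eta with eta[k] = sum of sigma over the k-th cell (0 <= eta[k] <= q).
    """
    assert len(sigma) % q == 0, "lattice size must be a multiple of the cell size q"
    return [sum(sigma[k * q:(k + 1) * q]) for k in range(len(sigma) // q)]

# example: N = 8, q = 4 -> M = 2 coarse cells
# coarse_map([1, 0, 1, 1, 0, 0, 1, 0], 4) == [3, 1]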



1-D example: n.n. Ising model

Approximation of the RG map: H(η) is approximated by a computable $H^{(0)}(\eta)$:

$H_N(\sigma) = \sum_k H_k(\sigma) + \sum_k W_{k,k+1}(\sigma)$

$H_k$: energy of cell k with free boundary conditions; $W_{k,k+1}$: short-range interactions between cell k and cell k+1.

$e^{-\beta H_N}\, P_N(d\sigma|\eta) = \prod_{k\ \mathrm{odd}} \Big[ e^{-\beta(W_{k-1,k} + W_{k,k+1})}\, e^{-\beta H_k}\, P_k(d\sigma_k|\eta(k)) \Big] \times \prod_{k\ \mathrm{even}} e^{-\beta H_k}\, P_k(d\sigma_k|\eta(k))$

1D Operator Splitting Algorithm (OpSpl):

1. Apply SSA on white cells in parallel until the synchronization time
2. Apply SSA on black cells in parallel until the synchronization time
3. Go to 1
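The OpSpl schedule reduces to a checkerboard loop: freeze one color of cells, run SSA on the other color up to the synchronization time, swap, repeat. The sketch below shows only that scheduling skeleton; ssa_run stands in for any serial KMC/SSA kernel restricted to the active sites and is assumed, not provided here.

def operator_splitting(sigma, cells, ssa_run, t_sync, n_cycles):
    """1-D OpSpl schedule: alternate SSA sweeps over 'white' and 'black' cells.

    cells   : list of cells, each a list of site indices (cell k is 'white' if k is even)
    ssa_run : callable (sigma, active_sites, t_sync) -> sigma, advancing the
              configuration on active_sites over a time window t_sync
    """
    white = [x for k, cell in enumerate(cells) if k % 2 == 0 for x in cell]
    black = [x for k, cell in enumerate(cells) if k % 2 == 1 for x in cell]
    for _ in range(n_cycles):
        sigma = ssa_run(sigma, white, t_sync)   # step 1: white cells in parallel
        sigma = ssa_run(sigma, black, t_sync)   # step 2: black cells in parallel
    return sigma                                # step 3: go to 1 (next cycle)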


A simple example

- When the $W_{k,k+1}$ are disregarded (e.g. at high temperatures), there are intra-cell interactions but no CG cell correlations:

$H^{(0)}_m(\eta) = \sum_k U^{(0)}_k(\eta_k) = -\sum_k \frac{1}{\beta} \log \int e^{-\beta H_k(\sigma)}\, P_k(d\sigma_k|\eta(k))$

Sampling over a single coarse cell with free boundary conditions. Inverse Monte Carlo method: Laaksonen et al., Soft Matter '03.


Multi-body terms in Coarse Graining: K., Plechac, Rey-Bellet, Tsagkarogiannis, ESAIM M2AN '07; preprint '10.



Coarse Graining - Approximation heuristics

• CG Hamiltonian - Renormalization Group map ($N = mq$):

$H(\eta) = -\frac{1}{\beta} \log \int \exp\{-\beta H_N(\sigma)\}\, P(d\sigma|\eta)$

• Correction terms around a first "good guess" $H^{(0)}_m$:

$H_m(\eta) = H^{(0)}_m(\eta) - \frac{1}{\beta} \log \mathbb{E}\big[\exp\big(-\beta (H_N - H^{(0)}_m)\big)\,\big|\,\eta\big], \quad m = N, N-1, \dots$

• Expansion of $\exp(\beta \Delta H)$ and of the log:

$= \mathbb{E}[\Delta H|\eta] + \mathbb{E}[(\Delta H)^2|\eta] - \mathbb{E}[\Delta H|\eta]^2 + O\big((\Delta H)^3\big)$

Formal calculations are inadequate, since

$\Delta H \equiv H_N - H^{(0)}_m = N \cdot O(\epsilon)$

• the role of fluctuations and extensivity.

• Rigorous analysis - cluster expansion around $H^{(0)}_m$:
K., Plechac, Rey-Bellet, Tsagkarogiannis, ESAIM M2AN '07



Coarse Graining, Long-range Interactions

- Very costly with KMC, but easy to CG:

$H_N(\sigma) = -\frac{1}{2} \sum_{x \in \Lambda_N} \sum_{y \neq x} J(x - y)\,\sigma(x)\sigma(y) + h \sum_x \sigma(x)$

$J(x - y) = \frac{1}{L^d}\, V\!\left(\frac{|x - y|}{L}\right), \qquad x, y \in \Lambda_N$

$J_{k,l} = \frac{1}{q^2} \sum_{x \in C_k} \sum_{y \in C_l} J(x - y)$


$H^{(0)}(\eta) = -\frac{1}{2} \sum_k \sum_{l \neq k} J_{k,l}\,\eta(k)\eta(l) - \frac{1}{2} \sum_k J_{k,k}\,\big(\eta(k) - q\big) + h \sum_k \eta(k)$
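Computing the compressed couplings is a one-time preprocessing step; a small 1-D sketch of J_{k,l} = (1/q²) Σ_{x∈C_k} Σ_{y∈C_l} J(x − y) follows (illustrative only; the Kac-type potential in the example is an assumption, not taken from the talk).

def coarse_couplings(M, q, J):
    """Coarse couplings Jbar[k][l] = (1/q^2) * sum_{x in C_k, y in C_l, x != y} J(x - y).

    M : number of coarse cells, q : sites per cell, J : callable J(r)
    Cell C_k contains the microscopic sites k*q, ..., (k+1)*q - 1.
    """
    Jbar = [[0.0] * M for _ in range(M)]
    for k in range(M):
        for l in range(M):
            s = 0.0
            for x in range(k * q, (k + 1) * q):
                for y in range(l * q, (l + 1) * q):
                    if x != y:
                        s += J(x - y)
            Jbar[k][l] = s / q ** 2
    return Jbar

# example: Kac-type potential J(r) = (1/L) * V(|r|/L) with V = indicator of [0, 1]
# L = 10
# Jbar = coarse_couplings(M=4, q=5, J=lambda r: (1.0 / L) if abs(r) <= L else 0.0)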



Multi-body terms in Coarse Graining

Corrections to $H^{(0)}$: $H_m(\eta) = H^{(0)}_m(\eta) + H^{(1)}_m(\eta) + \dots$

$H^{(1)}(\eta) = \beta \sum_{k_1} \sum_{k_2 > k_1} \sum_{k_3 > k_2} \Big[ j^2_{k_1 k_2 k_3}\big(-E_1(k_1) E_2(k_2) E_1(k_3)\big) + \dots$

• "Moments" of the interaction potential J:

$j^2_{k_1 k_2 k_3} = \sum_{x \in C_{k_1}} \sum_{y \in C_{k_2}} \sum_{z \in C_{k_3}} \big(J(x - y) - J(k_1, k_2)\big)\big(J(y - z) - J(k_2, k_3)\big)$

Typically omitted, but essential to capture phase transitions and hysteresis.

[Figure: coverage vs. external field h for N = 1024, q = 8, βJ_0 = 5.0, comparing microscopic MC with q = 8 coarse-graining with corrections & potential splitting, with corrections & no potential splitting, and without corrections; inset: the interaction potential J(r).]



Loss of Information & Coarse Graining

Relative entropy: $R(\mu|\nu) = \int \log\left(\frac{d\mu}{d\nu}\right) d\mu$

Theorem [Error estimates]:

1. For $\epsilon = \beta\, \frac{q}{L}\, \|\nabla J\|_1$,

$\frac{1}{N} R(\mu^{(p)}|\mu) = O(\epsilon^{p+2})$

2. Cluster expansions → an a posteriori expansion for the relative entropy:

$\frac{1}{N} R(\mu^{(p)}|\mu) = \mathbb{E}_{\mu^{(0)}}[I(\eta)] + \log\Big(\mathbb{E}_{\mu^{(0)}}\big[e^{-I(\eta)}\big]\Big) + O(\epsilon^3)$

The error indicator $I(\cdot)$ is given by the terms $H^{(1)}$, $H^{(2)}$ and depends only on the coarse variable $\eta \sim \mu^{(0)}$.



Parallel KMC Simulation in Lattice Systems

Markovian Dynamics: Adsorption/Desorption/Reaction/Diffusion

Generator: $\partial_t\, \mathbb{E}[f(\sigma)] = \mathbb{E} \sum_{x \in \Lambda} c(x, \sigma)\,[f(\sigma^x) - f(\sigma)]$.



Parallel KMC Simulation in Lattice Systems

Lubachevsky, JCP ’88, Korniss, Novotny, Rikvold JCP ’01,...

Main idea in geometric parallelization: break up the lattice into smaller sub-lattices.

Run KMC on each sub-lattice on separate processors and communicate across boundaries.

However: asynchronous updates at neighboring sites across processes in standard CTMC implementations.

[Embedded tables and figure caption from the parallel n-fold way reference above:]

Table I. Mean time increments Δt (in MCSP) for the parallel n-fold way algorithm with block size l, and for the serial algorithm, at T = 0.7 Tc, |H|/J = 0.2857. They are approximately independent of the full system size L and of NPE (the serial value is approximately independent of L).

  l:   16    32    64    128   256   512   1024   serial
  Δt:  3.7   6.1   9.2   12.6  15.4  17.4  18.5   19.9

Table II. Mean time increments (in MCSP) for the serial and parallel n-fold way algorithms for different temperatures and magnetic fields (NPE = 64, l = 128).

  |H|/J:                0.1587  0.2222  0.2857  0.3492  0.4127
  T/Tc = 0.6  serial      -      81.5    61.4    46.4    36.3
              parallel    -      23.4    21.4    19.3    17.4
  T/Tc = 0.7  serial     33.8    25.4    19.9    16.5    14.3
              parallel   16.8    14.5    12.6    11.1    10.1
  T/Tc = 0.8  serial     12.5    10.4     9.2     8.5     7.9
              parallel    9.2     8.0     7.4     6.9     6.5

Fig. 1. Schematic of the spatial decomposition of the system and its mapping onto a parallel machine: L = 12, l = 4, NPE = (L/l)² = 9 processing elements, each carrying l² = 16 spins; boundary spins are separated from the kernel spins.


Uniformization and Parallel Simulation in Lattice Systems

One solution is "uniformization": the same process in distribution, but we pick a clock with uniform rate λ* such that

$\max_x \lambda(x) \le \lambda^*$,

and a new skeleton process

$p^*(x, y) = \begin{cases} 1 - \dfrac{\lambda(x)}{\lambda^*} & \text{if } x = y \\[4pt] \dfrac{\lambda(x)}{\lambda^*}\, p(x, y) & \text{if } x \neq y \end{cases}$

$p^*(x, y)$ introduces rejections in the algorithm.

asynchronous algorithms, unless a boundary event occurs
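One step of the uniformized chain can be sketched as Poisson thinning: jump times come from a single clock of rate λ*, and a proposed move from state x is accepted with probability λ(x)/λ*, otherwise the step is a rejection (self-loop). The rate and proposal callables below are placeholders introduced for illustration, not part of the talk.

import math
import random

def uniformized_step(x, lam, propose, lam_star, rng=random):
    """One step of the uniformized (constant-rate) skeleton chain.

    lam      : callable, state-dependent total rate lambda(x), with lam(x) <= lam_star
    propose  : callable, samples y from p(x, .) of the original skeleton chain
    lam_star : uniform rate bounding max_x lambda(x)
    Returns (new_state, dt, accepted); rejected steps keep the state (self-loop).
    """
    dt = -math.log(rng.random()) / lam_star      # exponential clock of uniform rate
    if rng.random() < lam(x) / lam_star:         # accept with prob lambda(x)/lam_star
        return propose(x), dt, True
    return x, dt, False                          # rejection: p*(x, x) = 1 - lambda(x)/lam_star

A loose bound λ* makes λ(x)/λ* small, so most steps are rejections, which is the efficiency issue raised on the next slide.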


Parallel Simulation in Lattice Systems

However,

we still have excessive communication between processors in the case of complex interactions: communication (boundary) regions can be "wide", in contrast to the n.n. case:


many rejections for poor upper bounds $\lambda^* \ge \max_x \lambda(x)$.

Variants: partially rejection-free methods



Synchronous algorithms - Exact Simulation

[Excerpt from G. Nandipati et al., J. Phys.: Condens. Matter 21 (2009) 084214, describing the optimistic synchronous relaxation (OSR) algorithm. Figure 6: square decomposition for Np = 9; solid lines are processor domains, each with boundary and "ghost" regions at least as wide as the interaction range. Figure 7: time evolution of events for OSR and OSRPR with G = 4; in each cycle, processors carry out KMC events independently until a fixed number G of events or a boundary event, a global communication determines t_min, and events after t_min are rolled back (discarded in OSR, carried to the next cycle in OSRPR).]

Shim, Amar, PRB ’05, Merrick, Fichthorn, PRE ’07, etc

Synchronous algorithm: uniform time window for each processor unless a boundary event occurs.

Resolve conflicts at boundary regions by communicating with neighboring processors and restarting the cycle.

Global communication overhead in each cycle.

Previous methods rely on exact simulation of the stochastic process.



Synchronous algorithms: Sub-Lattice Method

[Excerpt from G. Nandipati et al., J. Phys.: Condens. Matter 21 (2009) 084214, describing the OSRPR and synchronous sublattice (SL) algorithms. Figure 8: comparison between parallel (OSRPR, square decomposition, Np = 4) and serial results for a fractal growth model with D/F = 10^5 and G = 7. Figure 9: strip decomposition for Np = 2, each processor domain subdivided into A and B sublattices, with boundary and ghost regions. Figure 10: time evolution in the SL algorithm; in each cycle all processors operate on the same randomly selected sublattice until the next event would exceed the cycle time τ, which must typically be at most the inverse of the fastest single-event rate; only local communication is needed.]

Shim, Amar, PRB ’07

Synchronous algorithm: fixed time window (cycle).

Random choice of sublattice and restart of cycle.

No global communication overhead in each cycle.

Relies on approximation of the stochastic process.
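To make the control flow of an SL cycle explicit, here is a scheduling sketch (illustrative only): every processor uses the same randomly chosen sublattice, runs KMC on it until the next event would exceed the cycle time τ, then exchanges boundary information locally. run_kmc_until and exchange_boundaries are assumed helpers standing in for the per-processor KMC kernel and the local, MPI-style communication.

import random

def sl_cycle(domains, tau, run_kmc_until, exchange_boundaries, rng=random):
    """One synchronous-sublattice (SL) cycle over all processor domains.

    domains             : per-processor state, each holding an 'A' and a 'B' sublattice
    tau                 : cycle time (<= inverse of the fastest single-event rate)
    run_kmc_until       : callable (domain, sublattice, tau) -> domain, running KMC
                          on the chosen sublattice until the next event exceeds tau
    exchange_boundaries : callable (domains) -> domains, local neighbor communication
    """
    sublattice = rng.choice(["A", "B"])         # same choice on every processor
    # in a real run this loop is one MPI rank per domain, executed concurrently
    domains = [run_kmc_until(d, sublattice, tau) for d in domains]
    return exchange_boundaries(domains)         # communicate boundary events, next cycle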

Page 50: Accelerated kinetic Monte Carlo methods: Hierarchical ... · Accelerated kinetic Monte Carlo methods: Hierarchical Parallel Algorithms & Coarse-Graining Markos Katsoulakis University

Lattice Systems KMC Coarse-Graining Parallel KMC Conclusions

Synchronous algorithms: Sub-Lattice MethodJ. Phys.: Condens. Matter 21 (2009) 084214 G Nandipati et al

Figure 8. Comparison between parallel results using the OSRPRalgorithm with square decomposition (Np = 4) and serial results fora fractal model with D/F = 105 and G = 7.

We note that, in the OSR algorithm for a givenconfiguration, there is an optimal value of G which takes intoaccount the tradeoffs between communication time (which iswasted if there are no boundary events) and rollbacks. While,in general, an adaptive method could be used to attempt tooptimize the value of G from cycle to cycle, in practice we havefound it more efficient to simply use trial and error to find theoptimal fixed value of G for our simulation (see section 4.4).

4.2. Optimistic synchronous relaxation with pseudo-rollback(OSRPR) algorithm

In the OSR algorithm each processor discards all KMC eventswhich occur after tmin. However, this is unnecessary if thereare no boundary events in any of the processors. Therefore, wehave considered a variation of the OSR algorithm (optimisticsynchronous relaxation with pseudo-rollback) in which, whenthere are no boundary events in the system, those events thatwould have been discarded are added to the next cycle. Thiscan reduce the loss of computational time due to undoing andthen ‘redoing’ events and thus enhances the computationalefficiency. As a test of the OSRPR algorithm, we have carriedout parallel simulations using this algorithm for a ‘fractal’model of irreversible submonolayer growth in which onlymonomer deposition and diffusion processes are included [11],with Np = 4. As expected, there is excellent agreementbetween serial and parallel results for the island and monomerdensities (see figure 8).

4.3. Synchronous sublattice (SL) algorithm

In order to maximize the parallel efficiency we have also carried out simulations using the semi-rigorous synchronous sublattice (SL) algorithm recently developed by Shim and Amar [13]. To avoid conflicts between processors, in the SL algorithm each processor domain is divided into subregions or sublattices (see figure 9). A complete synchronous cycle corresponding to a cycle time τ then proceeds as described above (see figure 10).

Figure 9. Schematic diagram of strip decomposition for Np = 2. Each processor domain is subdivided into A and B sublattices. Boundary and ghost regions for B sublattice of processor 1 are also shown.

Figure 10. Time evolution in the SL algorithm. Dashed lines correspond to selected events, while the dashed line with an X corresponds to an event which is rejected since it exceeds the cycle time τ.


Page 51: Accelerated kinetic Monte Carlo methods: Hierarchical ... · Accelerated kinetic Monte Carlo methods: Hierarchical Parallel Algorithms & Coarse-Graining Markos Katsoulakis University

Lattice Systems KMC Coarse-Graining Parallel KMC Conclusions

Hierarchical Parallel KMC algorithms

M.K., P. Plechac (U of TN and ORNL), G. Arabatzis (U of Crete) and L. Xu (CS, UDel), preprint (2010)

Markovian Dynamics: Adsorption/Desorption/Reaction/Diffusion

Adsorption/Desorption/Reaction Generator: LN f (σ) = ∑x∈Λ c(x, σ)[f (σx) − f (σ)].

Multi-site updates σx for most systems, e.g.

Suchorski et al, ChemPhysChem (2010). Complex behavior: bistability, oscillations, chaos, patterning, etc.

Page 52: Accelerated kinetic Monte Carlo methods: Hierarchical ... · Accelerated kinetic Monte Carlo methods: Hierarchical Parallel Algorithms & Coarse-Graining Markos Katsoulakis University

Lattice Systems KMC Coarse-Graining Parallel KMC Conclusions

Analogy to coarse-graining

Decompose the particle system into parts that communicate minimally; local information is then either represented by suitable coarse variables or computed on separate processors within a parallel architecture.

Example: A 1-D equilibrium calculation

HsN(σ) = ∑k Hsk(σ) + ∑k Wk,k+1(σ)

Hsk : short-range Hamiltonian for the k-CG cell with free boundary conditions

Wk,k+1: short-range interactions between k- and k + 1-CG cells

e−βHsN PN(dσ|η) = ∏k: odd [e−β(Wk−1,k+Wk,k+1) e−βHsk Pk(dσk|η(k))] × ∏k: even e−βHsk Pk(dσk|η(k))

1D Operator Splitting Algorithm (OpSpl)

1. Apply SSA on white cells in parallel until synchronization time
2. Apply SSA on black cells in parallel until synchronization time
3. Goto 1

Page 53: Accelerated kinetic Monte Carlo methods: Hierarchical ... · Accelerated kinetic Monte Carlo methods: Hierarchical Parallel Algorithms & Coarse-Graining Markos Katsoulakis University

Lattice Systems KMC Coarse-Graining Parallel KMC Conclusions

Analogy to coarse-graining

Simplified CG: when the Wk,k+1 are disregarded, there are no correlations between CG cells (only intra-cell correlations remain) and the CG Hamiltonian is

H(s,0)m(η) = ∑k U(s,0)k(ηk) = −∑k (1/β) log ∫ e−βHsk(σ) Pk(dσk|η(k))

i.e. independent sampling over each coarse cell with free boundary conditions
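A minimal sketch of this independent, per-cell sampling (Python), assuming the coarse variable η(k) is the number of occupied sites in cell k and that the prior conditioned on η(k) is uniform over such configurations; H_s stands for a user-supplied short-range Hamiltonian of the isolated cell.

import math
import random

def cell_potential(eta_k, cell_size, beta, H_s, n_samples=10000):
    """Monte Carlo estimate of U_k^(s,0)(eta_k) = -(1/beta) log E[exp(-beta H_s(sigma))],
    with sigma sampled from the prior on cell k given eta_k occupied sites
    (free boundary conditions: H_s sees only the sites inside the cell)."""
    sites = range(cell_size)
    acc = 0.0
    for _ in range(n_samples):
        occupied = set(random.sample(sites, eta_k))          # uniform prior given eta_k
        sigma = [1 if s in occupied else 0 for s in sites]
        acc += math.exp(-beta * H_s(sigma))
    return -math.log(acc / n_samples) / beta

# Example with a nearest-neighbor attraction inside a cell of 8 sites:
# U = cell_potential(3, 8, 2.0, lambda s: -sum(s[i] * s[i + 1] for i in range(7)))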

1D Operator Splitting Algorithm (OpSpl)

1. Apply SSA on white cells in parallel until synchronization time
2. Apply SSA on black cells in parallel until synchronization time
3. Goto 1

Parallelization: trivial, no communication between CG cells.

Page 54: Accelerated kinetic Monte Carlo methods: Hierarchical ... · Accelerated kinetic Monte Carlo methods: Hierarchical Parallel Algorithms & Coarse-Graining Markos Katsoulakis University

Lattice Systems KMC Coarse-Graining Parallel KMC Conclusions

Step 1: Generator decomposition

LN f (σ) = ∑x∈Λ c(x, σ)[f (σx) − f (σ)]
= ∑k ∑x∈Ck c(x, σ)[f (σx) − f (σ)]
= ∑k: odd Lk f (σ) + ∑k: even Lk f (σ) := LO f (σ) + LE f (σ),
where Lk f (σ) = ∑x∈Ck c(x, σ)[f (σx) − f (σ)].

1D Operator Splitting Algorithm (OpSpl)

1. Apply SSA on white cells in parallel until synchronization time
2. Apply SSA on black cells in parallel until synchronization time
3. Goto 1

2D OpSpl

1. Apply SSA on white cells in parallel until synchronization time
2. Apply SSA on black cells in parallel until synchronization time
3. Goto 1
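A minimal 1-D Python sketch of steps 1-3 above, where ssa_on_cell is an illustrative per-cell SSA routine that advances a single coarse cell to the synchronization time while its neighbors are held frozen; in 2-D the same sweep runs over a checkerboard of cells.

def opspl_step(cells, t_sync, ssa_on_cell):
    """One OpSpl step: evolve the odd ("white") cells, then the even ("black") cells.

    cells       -- list of coarse-cell states; cell k interacts only with cells k-1, k+1
    ssa_on_cell -- ssa_on_cell(cells, k, t_sync) returns the new state of cell k,
                   treating the neighboring cells as read-only during the half-step
    """
    # Odd cells: their even neighbors are frozen, so the processes generated by the Lk
    # act on disjoint cells, are mutually independent, and can run on separate processors.
    for k in range(1, len(cells), 2):
        cells[k] = ssa_on_cell(cells, k, t_sync)
    # Even cells: same argument, now with the freshly updated odd cells frozen.
    for k in range(0, len(cells), 2):
        cells[k] = ssa_on_cell(cells, k, t_sync)
    return cells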


Page 56: Accelerated kinetic Monte Carlo methods: Hierarchical ... · Accelerated kinetic Monte Carlo methods: Hierarchical Parallel Algorithms & Coarse-Graining Markos Katsoulakis University

Lattice Systems KMC Coarse-Graining Parallel KMC Conclusions

Step 2: Trotter product & Fractional Step Approximation

Trotter product for semigroups (Proc. AMS (1958)):

limh→0 (eAh eBh)[t/h] f = e(A+B)t f

Random Trotter product formula for jump processes: Kurtz (Proc. AMS (1972))

Approximation of the Markov semigroup based on the (random) Trotter theorem (Lie or Strang splitting):

eLN∆t ≈ eLO∆t eLE∆t (Lie), or eLN∆t ≈ eLO∆t/2 eLE∆t eLO∆t/2 (Strang)

For short-range interactions, the processes ∼ Lk are independent and can be simulated on separate processors:

eLN∆t ≈ eLO∆t eLE∆t ≈ ∏k: odd eLk∆t × ∏k: even eLk∆t, with each factor eLk∆t simulated on the k-th processor
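As a quick sanity check of the splitting error (not taken from the slides), the sketch below compares exp((A+B)t) with the Lie and Strang products for two small, arbitrary generator matrices using scipy; the Lie error decays at first order in the step size and the Strang error at second order.

import numpy as np
from scipy.linalg import expm

# Two small generator (Q-)matrices: rows sum to zero, off-diagonal entries are nonnegative.
A = np.array([[-1.0, 1.0], [2.0, -2.0]])
B = np.array([[-0.5, 0.5], [0.3, -0.3]])

t = 1.0
exact = expm((A + B) * t)

for n in (10, 100, 1000):
    h = t / n
    lie = np.linalg.matrix_power(expm(A * h) @ expm(B * h), n)
    strang = np.linalg.matrix_power(expm(A * h / 2) @ expm(B * h) @ expm(A * h / 2), n)
    print(n,
          abs(lie - exact).max(),      # O(h)   global error (Lie)
          abs(strang - exact).max())   # O(h^2) global error (Strang)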


Page 60: Accelerated kinetic Monte Carlo methods: Hierarchical ... · Accelerated kinetic Monte Carlo methods: Hierarchical Parallel Algorithms & Coarse-Graining Markos Katsoulakis University

Lattice Systems KMC Coarse-Graining Parallel KMC Conclusions

Benchmarks and Error Analysis

Mathematical tools and algorithms
Algorithm and Results / Application Performance

o Kinetic Monte Carlo methods amenable to parallelization on GPU clusters
o Benchmark model defined and accuracy tested
o Simulation of real chemical processes (oxidation)
o Distributed (MPI) version implemented
o 1000x speed-up compared to standard implementation
o Controlled approximation of the original Markov jump process

PIs: Katsoulakis, Plechac, Vlachos; Postdoc: Kalligiannaki

Application Performance

Simulation of oxidation process on the 2D lattice. Domain decomposition depicted together with the workload on GPU cells (bottom figure).

Dynamic load balancing: example of an algorithm in 2D.

New parallel algorithms for kinetic Monte Carlo simulations

Figure: Phase diagram of critical 2D Ising model used as a benchmark for accuracy (Onsager solution).

Figure: No load balancing

Figure: With load balancing


Flexibility in choosing suitable decompositions

Controlled error approximations for observables with similar toolbox as in CG

K., Plechac, Sopasakis, SIAM Num. Anal. ’06

Page 61: Accelerated kinetic Monte Carlo methods: Hierarchical ... · Accelerated kinetic Monte Carlo methods: Hierarchical Parallel Algorithms & Coarse-Graining Markos Katsoulakis University

Lattice Systems KMC Coarse-Graining Parallel KMC Conclusions

Algorithm Performance

Figure: GPU and sequential code execution time (log-log) vs. N (lattice size = N^2); y-axis: execution time in sec. Curves: sequential code and GPU runs (Fermi and Tesla architectures) with dt = 0.01, 0.1, 1, 10.

GPU simulation with various architectures (e.g. Fermi) vs. DNS

Page 62: Accelerated kinetic Monte Carlo methods: Hierarchical ... · Accelerated kinetic Monte Carlo methods: Hierarchical Parallel Algorithms & Coarse-Graining Markos Katsoulakis University

Lattice Systems KMC Coarse-Graining Parallel KMC Conclusions

Dynamic load balancing


Figure: No load balancing

Figure: With load balancing

Load balancing controlled by the number of jumps executed on each sub-domain

Mass transport towards a uniform histogram

Fractional Step approximation allows for tuning the balancing
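A minimal sketch of the rebalancing step (Python), assuming a 1-D strip decomposition in which boundaries[i] is the first lattice column owned by sub-domain i+1; the column shift and imbalance tolerance are illustrative parameters, not values used in the reported simulations.

def rebalance(jump_counts, boundaries, shift=1, tol=0.1):
    """Move domain boundaries so that the per-domain jump counts flatten out.

    jump_counts -- jumps executed on each sub-domain during the last fractional step
    boundaries  -- boundaries[i] is the first lattice column owned by sub-domain i+1
    shift       -- number of columns transferred per rebalancing step
    tol         -- relative imbalance below which no work is moved
    """
    target = sum(jump_counts) / len(jump_counts)
    for i in range(len(boundaries)):
        left, right = jump_counts[i], jump_counts[i + 1]
        if left > (1 + tol) * target and right < target:
            boundaries[i] -= shift        # shrink the overloaded domain on the left
        elif right > (1 + tol) * target and left < target:
            boundaries[i] += shift        # shrink the overloaded domain on the right
    return boundaries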


Page 63: Accelerated kinetic Monte Carlo methods: Hierarchical ... · Accelerated kinetic Monte Carlo methods: Hierarchical Parallel Algorithms & Coarse-Graining Markos Katsoulakis University

Lattice Systems KMC Coarse-Graining Parallel KMC Conclusions

Hierarchical Structure and multiple GPUs: 2D OpSpl on multi-GPUs

Hierarchical methods are well suited for current architectures which have sophisticated memory hierarchies, e.g. GPUs.

Hierarchical lattice partitioning on GPU cluster: macro-, meso-, micro-cells

MPI/OpenMP communication between GPUs
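A small Python sketch of the three-level bookkeeping, with illustrative cell sizes (one macro-cell per GPU, one meso-cell per thread block, one micro-cell per thread); the actual partition sizes are not specified in the slides.

def hierarchical_index(x, y, macro=256, meso=32, micro=4):
    """Map a lattice site (x, y) to its (GPU, thread block, thread) cell indices."""
    gpu = (x // macro, y // macro)                        # macro-cell: one per GPU
    block = ((x % macro) // meso, (y % macro) // meso)    # meso-cell: one per thread block
    thread = ((x % meso) // micro, (y % meso) // micro)   # micro-cell: one per thread
    return gpu, block, thread

# Example: hierarchical_index(1000, 77) -> ((3, 0), (7, 2), (2, 3))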


Page 67: Accelerated kinetic Monte Carlo methods: Hierarchical ... · Accelerated kinetic Monte Carlo methods: Hierarchical Parallel Algorithms & Coarse-Graining Markos Katsoulakis University

Lattice Systems KMC Coarse-Graining Parallel KMC Conclusions

Hierarchical Structure-Algorithm Performance

Figure: GPU and sequential code execution time (log-log) vs. N (lattice size = N^2); y-axis: execution time in sec. Curves: sequential codes (seq-1, seq-2), MPI runs on 64 GPUs and single-GPU (Fermi) runs with dt = 0.01, 0.1, 1, 10.

2D unimolecular reaction/diffusion particle system

Up to 10^5x speed-up compared to standard implementation

With 64 GPUs one can simulate with relative ease 10^8 particles (approx. 1 µm^2).

Page 68: Accelerated kinetic Monte Carlo methods: Hierarchical ... · Accelerated kinetic Monte Carlo methods: Hierarchical Parallel Algorithms & Coarse-Graining Markos Katsoulakis University

Lattice Systems KMC Coarse-Graining Parallel KMC Conclusions

Hierarchical Structure & Multiple Scales

1. Micromechanisms with (very) different time scales, e.g. fast diffusion in CO oxidation on Pt [Suchorski et al Phys. Rev. Lett. '99]

ε−1 Ldiff + Lreaction, ε ≪ 1

Combine the hierarchical structure with established uses of Trotter products for molecular systems with fast/slow mechanisms (a short sketch follows at the end of this slide). Molecular Dynamics: Tuckerman et al, J. Chem. Phys. '92.

2. Optimizing the algorithms: Computation vs. Communication. K., Plechac, Xu, Taufer (CS, UDel)

Central cells

Simulating white cells on the boundary
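A minimal sketch of the fast/slow splitting in the spirit of multiple-time-step MD (Python), assuming hypothetical diffusion_step and reaction_step routines that advance the configuration under the corresponding generator for a given time; the number of fast sub-steps would scale like 1/ε.

def fast_slow_step(state, dt, n_fast, diffusion_step, reaction_step):
    """One Strang-type step for the generator eps^-1 * L_diff + L_react.

    The slow reaction part takes two half-steps of length dt/2, while the stiff,
    fast diffusion part is resolved with n_fast inner sub-steps of length dt/n_fast
    (n_fast chosen on the order of 1/eps).
    """
    state = reaction_step(state, dt / 2)            # slow half-step
    for _ in range(n_fast):                         # fast inner loop
        state = diffusion_step(state, dt / n_fast)
    state = reaction_step(state, dt / 2)            # slow half-step
    return state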


Page 71: Accelerated kinetic Monte Carlo methods: Hierarchical ... · Accelerated kinetic Monte Carlo methods: Hierarchical Parallel Algorithms & Coarse-Graining Markos Katsoulakis University

Lattice Systems KMC Coarse-Graining Parallel KMC Conclusions

Conclusions - Further Work

Kinetic Monte Carlo methods amenable to parallelization on GPU clusters

Benchmark model defined and accuracy tested

Distributed (MPI) version implemented

Controlled approximation of the original Markov jump process

Capability to simulate realistic chemical processes at large spatiotemporal scales


Page 75: Accelerated kinetic Monte Carlo methods: Hierarchical ... · Accelerated kinetic Monte Carlo methods: Hierarchical Parallel Algorithms & Coarse-Graining Markos Katsoulakis University

Lattice Systems KMC Coarse-Graining Parallel KMC Conclusions

Conclusions - Further Work

However, challenges remain...

Systems in surface processes with short- and long-range interactions, patterning, etc.

Optimizing the algorithms: Computation vs. Communication

Central cells

Simulating white cells on the boundary
