
Parallelization of Tau-Leap Coarse-Grained Monte Carlo Simulations on GPUs

Lifan Xu, Michela Taufer, Stuart Collins, Dionisios G. Vlachos

Global Computing Lab, University of Delaware

Multiscale Modeling: Challenges

Lifan Xu, GCLab@UD

[Figure: the multiscale modeling hierarchy, plotted as length scale (m, 10^-10 to 10^-2) versus time scale (s, 10^-12 to 10^3). Quantum Mechanics/TST sits at the smallest scales, followed by MD and extended trajectory calculations; KMC, DSMC, LB, and DPD; mesoscopic theory/CGMC; and CFD at the lab scale. Arrows mark downscaling (top-down) and upscaling (bottom-up) information traffic.]

Our Goal: increase time and length scale for the CGMC by using GPUs


By courtesy of Dionisios G. Vlachos

Presenter
Presentation Notes
Multiscale models are very important for studying physical and chemical phenomena in nature. The challenges are linking phenomena and models across scales and increasing the length and time scales at each level. The ultimate goal is to develop algorithms and software for multiscale modeling that benefit from GPUs to address these challenges.

Outline

• Background: CGMC, GPU

• CGMC implementation on a single GPU: optimize memory use, optimize the algorithm

• CGMC implementation on multiple GPUs: combine OpenMP and CUDA

• Accuracy

• Conclusions

• Future work


Presenter
Presentation Notes
Before telling you how to benefit from GPUs, let me give you a short overview of the rest of my talk.

Coarse Grained Kinetic Monte Carlo (CGMC)

• Study phenomena such as catalysis, crystal growth, and surface diffusion

• Group neighboring microscopic sites (molecules) together into "coarse-grained" cells

8 X 8 microscopic sites, 2 X 2 cells

Mapping

Presenter
Presentation Notes
Coarse-grained Monte Carlo is a very important tool for studying phenomena like crystal growth and surface diffusion. Initially, we have microscopic sites which can be occupied by different types of molecules. In CGMC, we group neighboring microscopic sites into cells. Then we no longer need to care about the exact position of each molecule; all we care about is the number and species of molecules in each cell.
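The mapping described above can be sketched in a few lines. This is a minimal, hedged illustration, not the talk's code: the lattice size, species encoding, and function name are our own assumptions.

```python
import random

def coarse_grain(sites, cells_per_side):
    """Sum site occupancies into a cells_per_side x cells_per_side coverage grid."""
    n = len(sites)
    block = n // cells_per_side  # microscopic sites per cell edge
    coverage = [[0] * cells_per_side for _ in range(cells_per_side)]
    for r in range(n):
        for c in range(n):
            coverage[r // block][c // block] += sites[r][c]
    return coverage

random.seed(0)
# 0 = empty site, 1 = occupied site (a single species, for simplicity)
sites = [[random.randint(0, 1) for _ in range(8)] for _ in range(8)]
coverage = coarse_grain(sites, 2)
# Coarse-graining only regroups sites, so the total molecule count is conserved.
assert sum(map(sum, coverage)) == sum(map(sum, sites))
```

The key property is that coarse-graining discards positions but conserves counts, which is exactly why only the per-cell coverage needs to be tracked afterwards.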

Terminology

• Coverage: a 2D matrix containing the number of molecules of each species for each cell

• Events: reactions and diffusions

• Probability: a list of the probabilities of all possible events in every cell

Probability is calculated from the neighboring cells' coverage

Probability has to be updated if a neighbor's coverage changes

Events are selected based on probabilities
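One possible in-memory layout for these terms is sketched below. The data structures, rate constant, and the toy diffusion rate law (proportional to the source cell's count) are our illustrative assumptions; real CGMC propensities also involve the destination cell's coverage.

```python
def neighbors(i, j, n):
    """Four side neighbors of cell (i, j) on an n x n periodic cell grid."""
    return [((i - 1) % n, j), ((i + 1) % n, j), (i, (j - 1) % n), (i, (j + 1) % n)]

def diffusion_probabilities(coverage, i, j, rate=0.1):
    """Toy probability list for cell (i, j): one hop event per species per
    side neighbor, with a made-up rate proportional to the local count."""
    n = len(coverage)
    probs = []
    for (ni, nj) in neighbors(i, j, n):
        for species, count in coverage[i][j].items():
            probs.append(((i, j), (ni, nj), species, rate * count))
    return probs

# coverage[i][j] maps species name -> molecule count in that cell
cov = [[{"A": 4, "B": 1} for _ in range(2)] for _ in range(2)]
plist = diffusion_probabilities(cov, 0, 0)
assert len(plist) == 8  # 4 neighbors x 2 species
```

Because each entry depends on a neighbor's coverage, any change to a neighboring cell forces this list to be recomputed, which is the update rule the bullet points describe.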


Presenter
Presentation Notes
Here are some terms I want to introduce first.

Reaction

[Figure: a reaction event inside a cell]

Diffusion

[Figure: a diffusion event between neighboring cells]

CGMC Algorithm


Presenter
Presentation Notes
Coverage is the number of molecules of each type in each cell. Probability is the list of probabilities for all possible events. To calculate the probability for a cell, we must know its four side neighbors' coverage.
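The tau-leap idea underlying the algorithm can be sketched as follows: instead of firing one event at a time, each event fires a Poisson-distributed number of times per leap of length tau. Everything here (the sampler, rates, and the A/B test system) is an illustrative sketch, not the talk's implementation.

```python
import math
import random

def poisson(lam, rng):
    """Knuth's inverse-transform sampler for a Poisson(lam) variate."""
    l, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= l:
            return k
        k += 1

def tau_leap_step(counts, events, tau, rng):
    """events: list of (rate_fn, stoichiometry dict). Each event fires a
    Poisson(rate * tau) number of times; apply the net change to counts."""
    fires = [poisson(rate(counts) * tau, rng) for rate, _ in events]
    for n, (_, stoich) in zip(fires, events):
        for sp, delta in stoich.items():
            counts[sp] += n * delta
    return counts

rng = random.Random(42)
counts = {"A": 1000, "B": 0}
events = [
    (lambda c: 0.5 * c["A"], {"A": -1, "B": +1}),  # A -> B
    (lambda c: 0.5 * c["B"], {"B": -1, "A": +1}),  # B -> A
]
for _ in range(100):
    tau_leap_step(counts, events, 0.01, rng)
# A <-> B conversions conserve the total number of molecules.
assert counts["A"] + counts["B"] == 1000
```

On the GPU, many such leaps run concurrently, one cell per thread, which is why the probabilities must be kept consistent across neighboring cells between leaps.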

GPU Overview (I)


• NVIDIA Tesla C1060: 30 Streaming Multiprocessors (SMs), 8 Scalar Processors per SM; 30 8-way SIMD cores = 240 PEs

• Massively parallel, multithreaded: up to 30,720 active threads handled by the thread execution manager

• Processing power: 933 GigaFLOPS (single precision), 78 GigaFLOPS (double precision)

From: CUDA Programming Guide, NVIDIA

Presenter
Presentation Notes
I have a couple slides about GPUs.

GPU Overview (II)


• Memory types:

Read/write per thread: registers, local memory

Read/write per block: shared memory

Read/write per grid: global memory

Read-only per grid: constant memory, texture memory

• Communication among devices and with the CPU: through the PCI Express bus

From CUDA Programming Guide, NVIDIA

Presenter
Presentation Notes
Our goal in this paper is to reduce global memory access and make use of shared memory.

Multi-thread Approach


[Diagram: one τ leap with one GPU thread per cell. Each thread reads its coverage and probability from global memory, selects events, updates its coverage in global memory, synchronizes with the other threads, then updates its probability in global memory.]

Presenter
Presentation Notes
Remember that in CGMC we have many cells, so on the GPU we use one thread to simulate the events in one cell. All threads do the same thing in parallel. Each thread reads its coverage and probability from global memory, then selects and executes events. After that, each thread has to write its new coverage to global memory. Once all threads have finished the update, each thread reads its side neighbors' coverage in order to update its probability list.
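The "everyone writes, then everyone reads" structure above is essentially double-buffered: new coverage is built entirely from the old coverage, with the barrier separating the two phases. A sequential sketch (the one-dimensional grid and the fixed move fraction are our simplifications):

```python
def leap(coverage, move_fraction=0.25):
    """Toy leap: each cell sends a fraction of its molecules to its right
    neighbor (periodic boundary). The new grid is built only from the old
    one, mimicking the synchronization point between the update and the
    neighbor-read phases."""
    n = len(coverage)
    new = [0] * n
    for i in range(n):
        moved = int(coverage[i] * move_fraction)
        new[i] += coverage[i] - moved
        new[(i + 1) % n] += moved
    return new  # swap buffers: all "threads" are now synchronized

cov = [100, 0, 0, 0]
for _ in range(3):
    cov = leap(cov)
# Diffusion only moves molecules between cells; the total is conserved.
assert sum(cov) == 100
```

Without the barrier, a thread could read a neighbor's coverage mid-update and compute probabilities from a mix of old and new state, which is the race the synchronization prevents.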

Memory Optimization


[Diagram: one τ leap with one GPU thread per cell. Each thread reads its coverage from shared memory and its probability from global memory, selects and executes events, updates its coverage in shared memory, writes the coverage back to global memory, synchronizes with the other threads, then updates its probability in global memory.]

Presenter
Presentation Notes
In every leap many events can happen, and each event causes at least two global memory accesses to update the coverage, so we move the coverage to shared memory to reduce global memory traffic. Since shared memory is only visible to threads in the same block, not to all threads, we have to write the coverage back to global memory after the update, so that the other threads can fetch the other cells' coverage from global memory.

Performance


Platforms

GPU – Tesla C1060: 240 cores, 4 GB global memory, 1.44 GHz clock rate

CPU – Intel(R) Xeon(R) X5450: 4 cores, 4 GB memory, 3 GHz clock rate


Performance


Significant increase in performance

Time scale: CGMC simulations can be 100X faster than a CPU implementation

Presenter
Presentation Notes
Here is the performance result. The horizontal axis is the number of cells; the vertical axis is events per millisecond, the higher the better.

Rethinking the CGMC Algorithm for GPUs


• Redesign of cell structures: a macro-cell is a set of cells; the number of cells per macro-cell is flexible

• Redesign of event execution: one thread per macro-cell (coarse-grained parallelism) or one block per macro-cell (fine-grained parallelism); events are replicated across macro-cells

[Figure: cells in a 4x4 system, numbered 1-16, with a 2-layer macro-cell (5 cells) and a 3-layer macro-cell (13 cells) highlighted.]

Presenter
Presentation Notes
However, in CGMC, after each leap all threads have to write their coverage to global memory, and then each thread goes to global memory to fetch its neighbors' coverage to update its probability list. This is very expensive, especially when using multiple GPUs for large molecular systems.

Synchronization Frequency


Leap 1 Leap 2 Leap 3 Leap 4

Synchronization

Synchronization


First layer Second layer Third layer

Presenter
Presentation Notes
First layer, second layer, third layer. Here I have two examples to show the synchronization frequency of the multi-layer implementation. In the first leap of a 2-layer implementation, each thread simulates events in five cells: a central cell and its four side neighbors. By the end of the first leap, each thread holds the neighbors' coverage for its central cell locally, so it can update the central cell's probability list without any global memory access. The 3-layer implementation is similar, except each macro-cell has 13 cells. In the second leap, each 2-layer thread can continue the simulation for its central cell, while in the 3-layer implementation five cells can continue. By the end of the second leap, the 2-layer threads have to synchronize so that each thread gets the latest coverage of its neighbors, whereas the 3-layer central cell can continue without any global memory access or synchronization. So in the fourth leap, the 2-layer threads can again simulate five cells, while the 3-layer threads have to synchronize with each other.
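The macro-cell sizes quoted above (5 cells for 2 layers, 13 for 3) are von Neumann neighborhoods: a central cell plus r rings of side neighbors, with r = layers - 1. This little sketch, with our own function name and the ring-count formula, reproduces those numbers:

```python
def macro_cell_size(layers):
    """Cells in an L-layer macro-cell: 2r^2 + 2r + 1 with r = layers - 1."""
    r = layers - 1
    return 2 * r * r + 2 * r + 1

assert macro_cell_size(1) == 1   # a plain cell
assert macro_cell_size(2) == 5   # central cell + 4 side neighbors
assert macro_cell_size(3) == 13  # matches the 13 cells in the notes

# Each leap peels one ring off the macro-cell, so an L-layer macro-cell can
# run L leaps (L, L-1, ..., 1 rings of valid cells) before it must
# synchronize its coverage with neighboring macro-cells.
```

This is the trade-off the slide illustrates: more layers mean less frequent synchronization but more redundant per-leap work.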

Event Replication

Macro-cell 1 (block 0)          Macro-cell 2 (block 1)

Cell 1 (thread 0)  Cell 2 (thread 1)  …  Cell 2 (thread 5)  Cell 1 (thread 6)  Cell 3 (thread 7)

Leap 1: E1(A--) E1(A++)  E1(A++) E1(A--)
        E1(A--) E1(A++)  E1(A++) E1(A--)

Leap 2: E1(A--) E1(A++)

E1: A[cell 1] -> A[cell 2]

A is a molecular species

Macro-cell 1 Macro-cell 2


Presenter
Presentation Notes
I want to give an example of event replication. Suppose we calculate that E1 will happen two times in leap 1. The thread for cell 1 in macro-cell 1 executes A-- for each E1, and the thread for cell 2 in macro-cell 1 executes A++; the threads in charge of macro-cell 2 do the same. So at the end of the first leap, macro-cell 1 can recalculate the probability of E1 using the coverage of cell 1 and cell 2, and macro-cell 2 recalculates the same probability at the same time. In the second leap, based on that probability, we select E1 to happen only once; we execute E1 in both macro-cell 1 and macro-cell 2, and then synchronize.

Random Number Generator

• Generate the random numbers on the CPU and transfer them to the GPU: simple and easy; uniform distribution

• Have a dedicated kernel generate numbers and store them in global memory: many global memory accesses

• Have each thread generate random numbers as needed (Wallace Gaussian generator): shared memory requirement


Random Number Generator

1. Generate as many random numbers as there are possible events on the CPU

2. Store the random numbers in texture memory as read-only

3. Start one leap of simulation:

   1. Each event fetches its own random number from texture memory

   2. Select events

   3. Execute and update

4. End of the leap

5. Go to 1
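The steps above can be sketched with a read-only pool standing in for texture memory. The per-event indexing and the toy threshold selection rule are our illustrative assumptions, not the talk's actual selection logic:

```python
import random

def run_leap(probabilities, pool):
    """Select each event independently: event k fires if its dedicated
    random number pool[k] falls below its probability (a toy rule)."""
    return [k for k, p in enumerate(probabilities) if pool[k] < p]

rng = random.Random(7)
probabilities = [0.0, 1.0, 0.5, 0.25]
# Steps 1-2: pre-generate one number per possible event; treat as read-only.
pool = [rng.random() for _ in probabilities]
# Step 3: each event fetches its own number by index, then selection runs.
fired = run_leap(probabilities, pool)
assert 0 not in fired  # a zero-probability event never fires
assert 1 in fired      # a probability-1 event always fires
```

Pre-generating one number per possible event keeps the GPU kernel free of generator state, at the cost of refilling the pool between leaps.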


Coarse-Grained vs. Fine-Grained

• 2-layer coarse-grained: each thread is in charge of one macro-cell; each thread simulates five cells in every first leap and one central cell in every second leap

• 2-layer fine-grained: each block is in charge of one macro-cell; each block has five threads; each thread simulates one cell in every first leap, and the five threads simulate the central cell together in every second leap


Presenter
Presentation Notes
We can have different types of parallelism.

Performance


[Chart: events/ms versus number of cells (64 to 32768) for the 1-layer shared memory, 2-layer coarse-grained, 2-layer fine-grained, and 3-layer fine-grained implementations, spanning small to large molecular systems.]

Presenter
Presentation Notes
For a very large molecular system, a single GPU is not enough.

Multi-GPU Implementation

• Limit on the number of cells a single-GPU implementation can process in parallel: use multiple GPUs

• Synchronization between multiple GPUs is costly: take advantage of 2-layer fine-grained parallelism

• Combine OpenMP with CUDA for multi-GPU programming:

Use portable pinned memory for communication between CPU threads

Use mapped pinned memory for data copies between CPU and GPU (zero copy)

Use write-combined memory for fast data access


Pinned Memory

• Portable pinned memory (PPM): available to all host threads; can be freed by any host thread

• Mapped pinned memory (MPM): allocated on the host side; can be mapped into the device's address space; two addresses, one in host memory and one in device memory; no explicit memory copy between CPU and GPU

• Write-combined pinned memory (WCM): transfers across the PCI Express bus; the buffer is not snooped; high transfer performance but no data coherency guarantee


Combine OpenMP and CUDA


Presenter
Presentation Notes
One CPU thread generates random numbers while the other CPU threads run GPU simulations in parallel. At the end of each leap, the RNG thread stores random numbers in pinned memory, and each GPU thread stores its own coverage and probability in pinned memory. After a synchronization among all the CPU threads, all GPU threads fetch random numbers, coverage, and probability from pinned memory and start the next leap.

Multi-GPU Implementation

[Diagram, for the 2-layer implementation: the CPU's RNG thread writes random numbers to pinned memory, which also holds the CGMC parameters (random numbers, coverage, probability). GPU 1 and GPU 2 each run leap after leap — select events, execute events, update coverage, update probability — exchanging data through pinned memory between leaps.]

Performance with Different Number of GPUs

[Chart: events/ms versus number of cells (2048 to 81920) for the 1-layer 1-GPU, 2-layer 2-GPU, and 2-layer 4-GPU configurations.]

Presenter
Presentation Notes
For one GPU, 1-layer parallelism performs best; for two and four GPUs, 2-layer parallelism performs best.

Performance with Different Layers on 4 GPUs

[Chart: events/ms versus number of cells (2048 to 81920) for the 1-layer, 2-layer, and 3-layer OpenMP+PPM+RNG implementations on 4 GPUs.]

Accuracy


• Simulations of three different species of molecules, A, B, and C: A changes to B and vice versa; 2 B molecules change to C and vice versa; A, B, and C can diffuse

• Track the number of molecules in the system while reaching equilibrium and at equilibrium
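One cheap consistency check for a system like the one above: whatever trajectory the A <-> B and 2B <-> C reactions take, the total material A + B + 2C is invariant. The sketch below (with made-up initial counts and a uniformly random reaction choice instead of real rates, and diffusion omitted) demonstrates the invariant:

```python
import random

def step(counts, rng):
    """Fire one randomly chosen reaction if its reactants are available."""
    reactions = [
        {"A": -1, "B": +1},   # A -> B
        {"B": -1, "A": +1},   # B -> A
        {"B": -2, "C": +1},   # 2B -> C
        {"C": -1, "B": +2},   # C -> 2B
    ]
    stoich = rng.choice(reactions)
    if all(counts[sp] + d >= 0 for sp, d in stoich.items()):
        for sp, d in stoich.items():
            counts[sp] += d
    return counts

rng = random.Random(1)
counts = {"A": 10000, "B": 8000, "C": 2000}
material = counts["A"] + counts["B"] + 2 * counts["C"]
for _ in range(5000):
    step(counts, rng)
# Every reaction conserves A + B + 2C, so the invariant must hold exactly.
assert counts["A"] + counts["B"] + 2 * counts["C"] == material
```

Checking such an invariant on both the CPU and GPU trajectories catches bookkeeping bugs even before comparing the equilibrium molecule counts.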

[Charts: number of molecules (21000 to 22400) versus simulation step for the GPU, CPU-32, and CPU-64 runs, while reaching equilibrium and at equilibrium.]

Presenter
Presentation Notes
I got great performance, but the work would be useless if the results were not accurate.

Conclusion

• We present a parallel implementation of the tau-leap method for CGMC simulations on single and multiple GPUs

• We identify the most efficient parallelism granularity for a molecular system of a given size:

42X speedup for small systems using 2-layer fine-grained thread parallelism

100X speedup for average-size systems using 1-layer parallelism

120X speedup for large systems using 2-layer fine-grained thread parallelism across multiple GPUs

• We increase both the time scale (up to 120 times) and the length scale (up to 100,000)


Future Work

• Study molecular systems with heterogeneous distributions

• Combine MPI, OpenMP, and CUDA to use multiple GPUs across different machines

• Use MPI to implement CGMC parallelization on a CPU cluster; compare the performance of GPU and CPU clusters

• Link phenomena and models across scales, i.e., from QM to MD and to CGMC


Acknowledgements


Sponsors:

Collaborators:

Dionisios G. Vlachos and Stuart Collins (UDel)

More questions: [email protected]

GCL Members:

Trilce Estrada Boyu Zhang

Abel Licon Narayan Ganesan

Lifan Xu Philip Saponaro

Maria Ruiz Michela Taufer

GCL members in Spring 2010
