TRANSCRIPT
Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters
Kalyan S. Perumalla, Ph.D.
Senior R&D Manager, Oak Ridge National Laboratory
Adjunct Professor, Georgia Institute of Technology
SimuTools, Malaga, Spain
March 16, 2010
Managed by UT-Battelle for the U.S. Department of Energy. SimuTools10 Presentation – Perumalla (ORNL)
In a Nutshell
B2R Algorithm
• Hierarchical hardware: multi-GPU, multi-core, network
• Agent-based model execution: large scale, fine-grained
• Challenges: latency spectrum; a unified recursive solution

[Figure: hardware hierarchy (multi-node → node (multi-GPU) → GPU → block → thread) and an agent grid partitioned into B×B blocks Block_0,0 … Block_2,2 mapped to processing elements P_0,0 … P_2,2, each extended by R ghost layers to B+2R]
Dramatic improvements in speed
Outline
• ABMS: Definition, examples, larger sizes, demo, time-stepped, parallel style
• Computational Hierarchy: Multi-GPU, multi-CPU, MPI, CUDA, access times, latency problem
• B2R Algorithm: Basic idea, hierarchical framework, analysis equations, cubic nature, implementation
• Performance Study: CUDA, Pthreads, MPI, Lens cluster, Game of Life, Leadership, R vs. improvement
• Future Work: Multi-GPU per node, OpenCL, more benchmarks, unstructured inter-agent graphs
ABMS: Motivating Demonstrations
Agent-Based Modeling and Simulation (ABMS)
• Game of Life
• Afghan Leadership
GPU-based ABMS References
Examples:
K. S. Perumalla and B. Aaby, "Data Parallel Execution Challenges and Runtime Performance of Agent Simulations on GPUs," Agent-Directed Simulation Symposium, 2008.
R. D'Souza, M. Lysenko, and K. Rahmani, "SugarScape on Steroids: Simulating Over a Million Agents at Interactive Rates," AGENT Conference on Complex Interaction and Social Emergence, 2007.
Hierarchical GPU System Hardware
[Figure: hardware hierarchy: multi-node → node (multi-GPU) → GPU → block → thread]
Computation Kernels on Each GPU (e.g., CUDA Threads)

• Host initiates "launch" of many SIMD threads
• Threads get "scheduled" in batches on GPU hardware
• CUDA claims an extremely efficient thread-launch implementation
  – Millions of CUDA threads at once
GPU Memory Types (CUDA)
GPU memory comes in several flavors:
• Registers
• Local Memory
• Shared Memory
• Constant Memory
• Global Memory
• Texture Memory

An important challenge is organizing the application to make the most effective use of this hierarchy.
GPU Communication Latencies (CUDA)
Memory Type      Speed                        Scope    Lifetime
Registers        Fastest (~4 cycles)          Thread   Kernel
Shared Memory    Very fast (~4 cycles)        Block    Kernel
Global Memory    ~100x slower (400+ cycles)   Device   Process
Local Memory     ~150x slower (600 cycles)    Thread   Thread
Texture Memory   Fast (tens of cycles)        Device   Process
Constant Memory  Fairly fast (read-only)      Device   Process
CUDA + MPI
• An economical cluster solution
  – Affordable GPUs, each providing one-node CUDA
  – MPI over gigabit Ethernet for inter-node communication
• A memory speed-constrained system
  – Inter-memory transfers can dominate runtime
  – Runtime overhead can be severe
• Need a way to tie CUDA and MPI together
  – An algorithmic solution is needed
  – Must overcome the latency challenge
Analogous Networked Multi-core System
[Figure: analogous hierarchy: multi-node → multi-socket → multi-core → thread]
Parallel Execution: Conventional Method
[Figure: agent grid partitioned into B×B blocks Block_0,0 … Block_2,2, each mapped to a processing element P_0,0 … P_2,2]
Latency Challenge: Conventional Method
• High latency between GPU and CPU memories
  – CUDA inter-memory data transfer primitives
• Very high latency across CPU memories
  – MPI communication for data transfers
• The naïve method gives a very poor computation-to-communication ratio
  – Slow-downs instead of speedups
• Need a latency-resilient method …
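To make the baseline concrete, here is a minimal single-process NumPy sketch (an illustration, not code from the talk) of the conventional schedule on a toroidal Game of Life grid: at every time step, every block gathers a one-cell ghost layer (counted here as one communication) and then updates only its own B×B interior, so the number of exchanges grows with the number of iterations.

```python
import numpy as np

def life_interior(tile):
    """One Game of Life step on the interior of a tile (no wraparound)."""
    n = sum(tile[1 + di:tile.shape[0] - 1 + di, 1 + dj:tile.shape[1] - 1 + dj]
            for di in (-1, 0, 1) for dj in (-1, 0, 1)) - tile[1:-1, 1:-1]
    c = tile[1:-1, 1:-1]
    return ((n == 3) | ((c == 1) & (n == 2))).astype(tile.dtype)

def conventional_steps(grid, B, steps):
    """Conventional method: one ghost-layer gather per block per iteration."""
    N = grid.shape[0]
    exchanges = 0
    for _ in range(steps):
        new = np.empty_like(grid)
        for bi in range(0, N, B):
            for bj in range(0, N, B):
                ii = np.arange(bi - 1, bi + B + 1) % N  # one ghost layer,
                jj = np.arange(bj - 1, bj + B + 1) % N  # toroidal wrap
                tile = grid[np.ix_(ii, jj)]             # the "communication"
                new[bi:bi + B, bj:bj + B] = life_interior(tile)
                exchanges += 1
        grid = new
    return grid, exchanges
```

With B equal to the grid size there is a single block and the wrap-around gather reduces to a plain toroidal update, which serves as the reference; a P×P decomposition pays steps × P² exchanges, each of which would be a high-latency transfer on the hardware above.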
Our Solution: B2R Method
[Figure: the same grid of blocks Block_0,0 … Block_2,2 on processors P_0,0 … P_2,2, with each B×B block extended by R ghost layers on every side (B+2R total extent)]
B2R Algorithm

Let Te be the total number of iterations in the simulation.
1    For all blocks Block_ij in the given agent grid G:
1.1    Let (tl_i, tl_j) be the top-left index of Block_ij
1.2    Let (br_i, br_j) be the bottom-right index of Block_ij
1.3    For t = 0 to Te/R:
1.4      For r = R-1 down to 0:
1.5        Update(tl_i - r, tl_j - r, br_i + r, br_j + r)
1.6      Communicate(tl_i, tl_j, br_i, br_j, r)
1.7      Barrier()
[Figure: a B×B sub-block mapped to processing element p, surrounded by R layers of lagging cells (B+2R total extent); errors propagate inward by one layer per iteration over the R iterations]
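The schedule above can be sketched as a single-process NumPy program (an illustrative reconstruction, not the talk's CUDA/MPI code): each block gathers R ghost layers once per round, then performs R local updates on a tile that shrinks by one layer of lagging cells per step back down to B×B; the toroidal gather stands in for Communicate() and Barrier().

```python
import numpy as np

def shrink_step(tile):
    """One Game of Life step; returns a tile shrunk by one cell per side."""
    n = sum(tile[1 + di:tile.shape[0] - 1 + di, 1 + dj:tile.shape[1] - 1 + dj]
            for di in (-1, 0, 1) for dj in (-1, 0, 1)) - tile[1:-1, 1:-1]
    c = tile[1:-1, 1:-1]
    return ((n == 3) | ((c == 1) & (n == 2))).astype(tile.dtype)

def b2r(grid, B, R, rounds):
    """B2R: each block communicates once per R steps, then runs R local
    updates on a (B+2R)x(B+2R) tile that shrinks back to BxB."""
    N = grid.shape[0]
    exchanges = 0
    for _ in range(rounds):
        new = np.empty_like(grid)
        for bi in range(0, N, B):
            for bj in range(0, N, B):
                ii = np.arange(bi - R, bi + B + R) % N  # R ghost layers
                jj = np.arange(bj - R, bj + B + R) % N
                tile = grid[np.ix_(ii, jj)]             # one communication
                for _ in range(R):                      # R lagging updates
                    tile = shrink_step(tile)
                new[bi:bi + B, bj:bj + B] = tile
                exchanges += 1
        grid = new
    return grid, exchanges
```

Setting R = 1 recovers the conventional per-step exchange, so checking that R = 2 over 3 rounds matches R = 1 over 6 rounds both verifies correctness and shows the trade: B2R duplicates some computation on the (B+2R)² halo in exchange for R× fewer communications.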
Total Runtime Cost: Analytical Form
At any level in the hierarchy, the total runtime F is given by:

F_CPU = a[(4/3)R^3 + (2B - 2)R^2 + (B^2 - 2B + 2/3)R] + b[8BR] + c

F_GPU = a[(4/3)R^3 + (2B - 2)R^2 + (B^2 - 2B + 2/3)R] + b[(B + 2R)^2] + cR

Most interesting aspect: cubic in R!
Implications of Being Cubic in R

[Figure: sketch of total execution time vs. R]

• Benefits with B2R are not immediately seen for small R
  – In fact, degradation for small R!
• Dramatic improvement is possible after small R
  – Our experiments confirm this trend!
• Too large is too bad, too
  – Can't profit indefinitely!
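This trade-off can be explored numerically. The sketch below uses the CPU-side cost shape (a = per-cell update cost, b = per-cell transfer cost, c = fixed per-exchange latency; the constants and the 8BR exchange-volume term are illustrative assumptions). The compute term is exactly the sum of the expanding updates, Σ_{r=0}^{R-1} (B+2r)², which is the cubic polynomial in R; amortizing one round over the R steps it covers produces an interior minimum.

```python
def b2r_round_cost(R, B, a, b, c):
    """Per-round cost for one block: R expanding updates (the exact sum of
    (B+2r)^2, expanded into the cubic-in-R polynomial) plus one ghost
    exchange of ~8BR cells and a fixed latency c."""
    compute = a * ((4 / 3) * R**3 + (2 * B - 2) * R**2
                   + (B * B - 2 * B + 2 / 3) * R)
    return compute + b * 8 * B * R + c

def per_step_cost(R, B, a, b, c):
    """Amortized cost per simulated step: one round covers R steps."""
    return b2r_round_cost(R, B, a, b, c) / R

# Illustrative (assumed) constants where latency c dominates per-cell costs.
B, a, b, c = 32, 1.0, 1.0, 1.0e5
costs = {R: per_step_cost(R, B, a, b, c) for R in range(1, 257)}
best = min(costs, key=costs.get)
```

With these constants the minimum lies strictly between the endpoints: small R leaves the fixed latency poorly amortized, while very large R lets the cubic redundant-compute term dominate, matching the bullets above.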
Sub-division Across Levels (e.g., MPI to Blocks to Threads)

[Figure: hardware hierarchy: multi-node (MPI: R_m) → node (multi-GPU) → GPU → block (Block: R_b) → thread (Thread: R_t)]
Hierarchy and Recursive Use of B & R
B2R can be applied at all levels!
• A different R can be chosen at every level, e.g.
  – R_b for the block-level R
  – R_t for the thread-level R
• Simple constraints exist for the possible values of R
  – Between R and B
  – Between the R's at different levels
  – Details in our paper

[Figure: the CUDA hierarchy: multi-node → node (multi-GPU) → GPU → block → thread]
B2R Implementation within CUDA
[Figure: the grid in global memory is split into b×b logical blocks; each block stages a (B+2R)×(B+2R) tile in its per-block shared memory, performs R state updates there, and writes the B×B result back to global memory]
Performance
[Chart: Multi-Node GPU GOL, 16 million agents; speedup (0-150) vs. MPI-level R (R_m = 1, 2, 4, 8), for R_t = 1, 2, 4]

Over 100× speedup with MPI+CUDA (speedup relative to the naïve method with no latency hiding)

[Chart: Multi-Node GPU LDR, 16 million agents; speedup (0-40) vs. MPI-level R (R_m = 2, 4, 8), for R_t = 2, 4, 8]
Multi-GPU MPI+CUDA – Game of Life
[Chart: Multi-Node GPU LDR, 16 million agents; improvement level (0%-3500%) vs. MPI-level R (R_m = 2, 4, 8), for R_t = 1, 2, 4, 8]
Multi-core MPI+pthreads – Game of Life

[Chart: Multi-Node CPU GOL, 1 billion agents; improvement level (0%-250%) vs. MPI-level R (R_m = 1, 2, 3, 4), for R_t = 1, 2, 4]
Multi-core MPI+Pthreads – Game of Life
[Chart: Multi-Node CPU GOL, 1 billion agents; improvement level (0%-250%) vs. MPI-level R (R_m from 1 to 10,000, log scale)]
Multi-core MPI+pthreads – Leadership
[Chart: Multi-Node CPU LDR, 1 billion agents; improvement level (0%-250%) vs. MPI-level R (R_m = 2, 4, 8), for R_t = 1, 2, 4, 8]
Summary
• The B2R algorithm applies across heterogeneous, hierarchical platforms
  – Deep GPU hierarchies
  – Deep CPU multi-core systems
• The cubic nature of the runtime's dependence on R is a remarkable aspect
  – A maximum and a minimum exist
  – The optimal (minimum) can be dramatically low
• Results show clear performance improvement
  – Up to 150× in the best (fine-grained) case
Future Work
• Generate cross-platform code
  – E.g., implement in OpenCL
• Add to the CUDA-MPI levels
  – Multi-GPU per node
• Implement and test with more benchmarks
  – E.g., from the existing ABMS suites NetLogo and Repast
• Generalize to unstructured inter-agent graphs
  – E.g., social networks
• Potential to apply to other domains
  – E.g., stencil computations
Thank you! Questions?
Additional material at our webpage:
Discrete Computing Systems
www.ornl.gov/~2ip