Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters Kalyan S. Perumalla, Ph.D. Senior R&D Manager Oak Ridge National Laboratory Adjunct Professor Georgia Institute of Technology SimuTools, Malaga, Spain March 16, 2010


Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters

Kalyan S. Perumalla, Ph.D.

Senior R&D Manager, Oak Ridge National Laboratory

Adjunct Professor, Georgia Institute of Technology

SimuTools, Malaga, Spain

March 16, 2010


Managed by UT-Battelle for the U.S. Department of Energy. SimuTools10 Presentation – Perumalla (ORNL)

In a Nutshell

Hierarchical Hardware
• Multi-GPU
• Multi-core
• Network

Agent-based Model Execution
• Large scale
• Fine-grained

Challenges
• Latency spectrum
• Unified recursive solution

B2R Algorithm

[Figure: hardware hierarchy Multi-Node → Node (Multi-GPU) → GPU → Block → Thread, next to a 3×3 grid of blocks Block(i,j), each mapped to a processor P(i,j); each block of side B is padded by R on every side to a region of side B+2R]

Dramatic improvements in speed


Outline

• ABMS: Definition, Examples, Larger sizes, Demo, Time stepped, Parallel style

• Computational Hierarchy: Multi-GPU, Multi-CPU, MPI, CUDA, Access times, Latency problem

• B2R Algorithm: Basic idea, Hierarchical framework, Analysis equations, Cubic nature, Implementation

• Performance Study: CUDA, Pthreads, MPI, Lens cluster, Game of Life, Leadership, R vs. Improvement

• Future Work: Multi-GPU per node, OpenCL, More benchmarks, Unstructured inter-agent graphs


ABMS: Motivating Demonstrations

Agent Based Modeling and Simulation (ABMS)

• Game of Life (GOL)

• Afghan Leadership (LDR)


GPU-based ABMS References

Examples:

• K. S. Perumalla and B. Aaby, "Data Parallel Execution Challenges and Runtime Performance of Agent Simulations on GPUs," in Agent-Directed Simulation Symposium, 2008

• R. D'Souza, M. Lysenko, and K. Rehmani, "SugarScape on Steroids: Simulating Over a Million Agents at Interactive Rates," in AGENT Conference on Complex Interaction and Social Emergence, 2007


Hierarchical GPU System Hardware

[Figure: hardware hierarchy Multi-Node → Node (Multi-GPU) → GPU → Block → Thread]


Computation Kernels on each GPU
E.g., CUDA Threads

• Host initiates “launch” of many SIMD threads

• Threads get “scheduled” in batches on GPU hardware

• CUDA claims extremely efficient thread-launch implementation
  – Millions of CUDA threads at once


GPU Memory Types (CUDA)

GPU memory comes in several flavors:
• Registers
• Local Memory
• Shared Memory
• Constant Memory
• Global Memory
• Texture Memory

An important challenge is organizing the application to make the most effective use of the hierarchy


GPU Communication Latencies (CUDA)

Memory Type     | Speed                     | Scope  | Lifetime | Size
Registers       | Fastest (4 cycles)        | Thread | Kernel   |
Shared Memory   | Very fast (4 -? cycles)   | Block  | Thread   |
Global Memory   | 100x slower (400- cycles) | Device | Process  |
Local Memory    | 150x slower (600 cycles)  | Block  | Thread   |
Texture Memory  | Fast (10s of cycles)      | Device | Process  |
Constant Memory | Fairly fast (read-only)   | Device | Process  |


CUDA + MPI

• An economical cluster solution
  – Affordable GPUs, each providing one-node CUDA
  – MPI on gigabit Ethernet for inter-node communication

• Memory speed-constrained system
  – Inter-memory transfers can dominate runtime
  – Runtime overhead can be severe

• Need a way to tie CUDA and MPI
  – Algorithmic solution needed
  – Need to overcome latency challenge


Analogous Networked Multi-core System

[Figure: analogous hierarchy Multi-Node → Multi-Socket → Multi-Core → Thread]


Parallel Execution: Conventional Method

[Figure: agent grid partitioned into a 3×3 arrangement of blocks Block(i,j), i, j = 0..2, each of side B and each mapped to a processor P(i,j)]


Latency Challenge: Conventional Method

• High latency between GPU and CPU memories
  – CUDA inter-memory data transfer primitives

• Very high latency across CPU memories
  – MPI communication for data transfers

• Naïve method gives very poor computation to communication ratio
  – Slow-downs instead of speedups

• Need latency-resilient method …


Our Solution: B2R Method

[Figure: the same 3×3 arrangement of blocks Block(i,j) mapped to processors P(i,j); each block of side B is extended by a margin of width R on each side]


B2R Algorithm

Let Te be total number of iterations in the simulation
1 For all blocks Blockij in the given agent grid G
  1.1 Let (tli, tlj) be the top left index of Blockij
  1.2 Let (bri, brj) be the bottom right index of Blockij
  1.3 For t=0 to Te/R
    1.4 For r=R-1 down to 0
      1.5 Update( tli-r, tlj-r, bri+r, brj+r )
    1.6 Communicate( tli, tlj, bri, brj, r )
    1.7 Barrier()

[Figure: a B×B sub-block mapped to processing element p, surrounded by R layers of lagging cells, giving a region of side B+2R; arrows mark the direction of error propagation in R iterations]
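The steps above can be sketched in plain Python for the Game of Life benchmark. This is a minimal single-process illustration, not the authors' CUDA/MPI implementation: it assumes a square toroidal grid whose side is a multiple of B, it models Communicate as one copy of a (B+2R)×(B+2R) tile per round, and the names next_state, naive_step, and b2r_round are hypothetical.

```python
def next_state(alive, neighbors):
    """Conway's rule for a single cell."""
    return 1 if neighbors == 3 or (alive and neighbors == 2) else 0


def naive_step(grid):
    """Reference: one iteration over the whole toroidal grid."""
    n = len(grid)
    return [[next_state(grid[i][j],
                        sum(grid[(i + di) % n][(j + dj) % n]
                            for di in (-1, 0, 1) for dj in (-1, 0, 1)) - grid[i][j])
             for j in range(n)] for i in range(n)]


def b2r_round(grid, B, R):
    """Advance every BxB block by R iterations using one halo fetch per block."""
    n = len(grid)
    W = B + 2 * R  # tile side: block plus R lagging layers on each side
    out = [[0] * n for _ in range(n)]
    for bi in range(0, n, B):
        for bj in range(0, n, B):
            # "Communicate" (assumed): copy the block and its R-wide halo once.
            tile = [[grid[(bi - R + i) % n][(bj - R + j) % n]
                     for j in range(W)] for i in range(W)]
            # "Update": the valid region shrinks by one layer per iteration.
            for r in range(R - 1, -1, -1):
                lo, hi = R - r, R + B + r  # tile coordinates of block extended by r
                new = [row[:] for row in tile]
                for i in range(lo, hi):
                    for j in range(lo, hi):
                        s = sum(tile[i + di][j + dj]
                                for di in (-1, 0, 1) for dj in (-1, 0, 1)) - tile[i][j]
                        new[i][j] = next_state(tile[i][j], s)
                tile = new
            # Only the inner BxB cells are written back; the halo is discarded.
            for i in range(B):
                for j in range(B):
                    out[bi + i][bj + j] = tile[R + i][R + j]
    return out
```

Calling b2r_round(grid, B, R) once should give the same grid as R calls to naive_step(grid), while touching neighbor data only once per block. The price is the redundant updates performed in the halo layers.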


Total Runtime Cost: Analytical Form

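At any level in the hierarchy, total runtime F is given by:

F_{CPU} = a\left[\tfrac{4}{3}R^3 + (2B-2)R^2 + \left(B^2 - 2B + \tfrac{2}{3}\right)R\right] + [8 \ldots]

F_{GPU} = a\left[\tfrac{4}{3}R^3 + (2B-2)R^2 + \left(B^2 - 2B + \tfrac{2}{3}\right)R\right] + [\ldots]

The second bracket on each line holds the terms in b and c (with B and R).

Most interesting aspect: Cubic in R!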
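The cubic dependence on R can be checked by counting cell updates directly. The sketch below assumes that the factor multiplying a is the number of cell updates in one round of R iterations, where the iteration with offset r updates a square of side B+2r (as in the Update step of the algorithm). The function names are hypothetical.

```python
from fractions import Fraction


def updates_direct(B, R):
    """Cell updates in one round: squares of side B+2r for r = 0 .. R-1."""
    return sum((B + 2 * r) ** 2 for r in range(R))


def updates_closed_form(B, R):
    """Closed form of the same sum: (4/3)R^3 + (2B-2)R^2 + (B^2 - 2B + 2/3)R."""
    return (Fraction(4, 3) * R ** 3
            + (2 * B - 2) * R ** 2
            + (B * B - 2 * B + Fraction(2, 3)) * R)
```

Exact rational arithmetic is used so the two counts can be compared with ==, without floating-point tolerance.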


Implications of being Cubic in R

[Figure: Total Execution Time vs. R]

• Benefits with B2R not immediately seen for small R
  – In fact, degradation for small R!

• Dramatic improvement possible after small R
  – Our experiments confirm this trend!

• Too large is too bad too
  – Can’t profit indefinitely!
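Why R cannot grow indefinitely can be illustrated with a deliberately simplified cost model. This is not the runtime formula from the slides: it assumes only a per-cell update cost a and a fixed latency c paid once per round of R iterations, and it does not reproduce the degradation at small R mentioned above. All names and parameter values are hypothetical.

```python
from fractions import Fraction


def updates_per_round(B, R):
    """Cell updates in one round, including redundant halo updates."""
    return sum((B + 2 * r) ** 2 for r in range(R))


def cost_per_iteration(B, R, a, c):
    """Assumed model: (a * updates + one fixed latency c) spread over R iterations."""
    return Fraction(a * updates_per_round(B, R) + c, R)


def best_R(B, a, c, candidates):
    """Candidate R with the lowest modeled cost per iteration."""
    return min(candidates, key=lambda R: cost_per_iteration(B, R, a, c))
```

With B = 32, a = 1, and c = 100000, the candidates 1, 2, 4, 8, 16, 32, 64 give a minimum at R = 32: smaller R pays the latency too often, and larger R spends too much on redundant halo updates.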


Sub-division Across Levels
E.g., MPI to Blocks to Threads

[Figure: hardware hierarchy Multi-Node → Node (Multi-GPU) → GPU → Block → Thread, with a separate R at each level]

• MPI: Rm
• Block: Rb
• Thread: Rt


Hierarchy and Recursive Use of B & R

B2R can be applied at all levels!

• A different R can be chosen at every level, e.g.
  – Rb for block-level R
  – Rt for thread-level R

• Simple constraints exist for possible values of R
  – Between R and B
  – Between R’s at different levels
  – Details in our paper

E.g., CUDA Hierarchy

[Figure: hardware hierarchy Multi-Node → Node (Multi-GPU) → GPU → Block → Thread]


B2R Implementation within CUDA

[Figure: global memory split into b×b logical blocks; shared memory per block holds a region of side B+2R; R state updates are applied per block]


Performance

[Figure: Multi-Node GPU GOL - 16 mil Agents. Speedup (axis 0 to 150) vs. MPI Level R (Rm = 1, 2, 4, 8); series Rt=1, Rt=2, Rt=4]

Over 100× speedup with MPI+CUDA

Speedup relative to naïve method with no latency-hiding

[Figure: Multi-Node GPU LDR - 16 mil Agents. Speedup (axis 0 to 40) vs. MPI Level R (Rm = 2, 4, 8); series Rt=2, Rt=4, Rt=8]


Multi-GPU MPI+CUDA – Game of Life

[Figure: panel title "Multi-Node GPU LDR - 16 mil Agents". Improvement Level (axis 0% to 3500%) vs. MPI Level R (Rm = 2, 4, 8); legend entries Rt=1, Rt=2, Rt=4, Rt=8]


Multi-core MPI+Pthreads – Game of Life

[Figure: Multi-Node CPU GOL - 1 bil Agents. Improvement Level (axis 0% to 250%) vs. MPI Level R (Rm, axis ticks 1 to 4); series Rt=1, Rt=2, Rt=4]


Multi-core MPI+Pthreads – Game of Life

[Figure: Multi-Node CPU GOL - 1 Billion Agents. Improvement Level (axis 0% to 250%) vs. MPI Level R (Rm, axis ticks 1, 10, 100, 1000, 10000)]


Multi-core MPI+Pthreads – Leadership

[Figure: Multi-Node CPU LDR - 1 bil Agents. Improvement Level (axis 0% to 250%) vs. MPI Level R (Rm = 2, 4, 8); legend entries Rt=1, Rt=2, Rt=4]


Summary

• B2R Algorithm applies across heterogeneous, hierarchical platforms
  – Deep GPU hierarchies
  – Deep CPU multi-core systems

• Cubic nature of runtime dependence on R is a remarkable aspect
  – A maximum and minimum exist
  – Optimal (minimum) can be dramatically low

• Results show clear performance improvement
  – Up to 150x in the best case (fine grained)


Future Work

• Generate cross-platform code
  – E.g., Implement in OpenCL

• Add to CUDA-MPI levels
  – Multi-GPU per node

• Implement and test with more benchmarks
  – E.g., From existing ABMS suites NetLogo & Repast

• Generalize to unstructured inter-agent graphs
  – E.g., Social networks

• Potential to apply to other domains
  – E.g., Stencil computations


Thank you!

Questions?

Additional material at our webpage:

Discrete Computing Systems

www.ornl.gov/~2ip