TRANSCRIPT
Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters
Kalyan S. Perumalla, Ph.D.
Senior R&D Manager, Oak Ridge National Laboratory
Adjunct Professor, Georgia Institute of Technology
SimuTools, Malaga, Spain
March 16, 2010
Managed by UT-Battelle for the U.S. Department of Energy. SimuTools10 Presentation – Perumalla (ORNL)
In a Nutshell
B2R Algorithm
• Hierarchical hardware: multi-GPU, multi-core, network
• Agent-based model execution: large scale, fine-grained
• Challenges: latency spectrum; a unified recursive solution

[Figure: hardware hierarchy (multi-node → node (multi-GPU) → GPU → block → thread) and an agent grid partitioned into B×B blocks Block_0,0 … Block_2,2 mapped to processing elements P_0,0 … P_2,2, each extended by R ghost layers to B+2R]
Dramatic improvements in speed
Outline
• ABMS: Definition, examples, larger sizes, demo, time-stepped, parallel style
• Computational Hierarchy: Multi-GPU, multi-CPU, MPI, CUDA, access times, latency problem
• B2R Algorithm: Basic idea, hierarchical framework, analysis equations, cubic nature, implementation
• Performance Study: CUDA, Pthreads, MPI, Lens cluster, Game of Life, Leadership, R vs. improvement
• Future Work: Multi-GPU per node, OpenCL, more benchmarks, unstructured inter-agent graphs
ABMS: Motivating Demonstrations
Agent-Based Modeling and Simulation (ABMS)
• Game of Life
• Afghan Leadership
GPU-based ABMS References
Examples:
K. S. Perumalla and B. Aaby, "Data Parallel Execution Challenges and Runtime Performance of Agent Simulations on GPUs," Agent-Directed Simulation Symposium, 2008.
R. D'Souza, M. Lysenko, and K. Rahmani, "SugarScape on Steroids: Simulating Over a Million Agents at Interactive Rates," AGENT Conference on Complex Interaction and Social Emergence, 2007.
Hierarchical GPU System Hardware
[Figure: hardware hierarchy: multi-node → node (multi-GPU) → GPU → block → thread]
Computation Kernels on Each GPU (e.g., CUDA Threads)

• Host initiates "launch" of many SIMD threads
• Threads get "scheduled" in batches on GPU hardware
• CUDA claims an extremely efficient thread-launch implementation
  – Millions of CUDA threads at once
GPU Memory Types (CUDA)
GPU memory comes in several flavors:
• Registers
• Local Memory
• Shared Memory
• Constant Memory
• Global Memory
• Texture Memory

An important challenge is organizing the application to make the most effective use of this hierarchy.
GPU Communication Latencies (CUDA)
Memory Type      Speed                        Scope    Lifetime
Registers        Fastest (~4 cycles)          Thread   Kernel
Shared Memory    Very fast (~4 cycles)        Block    Kernel
Global Memory    ~100x slower (400+ cycles)   Device   Process
Local Memory     ~150x slower (600 cycles)    Thread   Thread
Texture Memory   Fast (tens of cycles)        Device   Process
Constant Memory  Fairly fast (read-only)      Device   Process
CUDA + MPI
• An economical cluster solution
  – Affordable GPUs, each providing one-node CUDA
  – MPI over gigabit Ethernet for inter-node communication
• A memory speed-constrained system
  – Inter-memory transfers can dominate runtime
  – Runtime overhead can be severe
• Need a way to tie CUDA and MPI together
  – An algorithmic solution is needed
  – Must overcome the latency challenge
Analogous Networked Multi-core System
[Figure: analogous hierarchy: multi-node → multi-socket → multi-core → thread]
Parallel Execution: Conventional Method
[Figure: agent grid partitioned into B×B blocks Block_0,0 … Block_2,2, each mapped to a processing element P_0,0 … P_2,2]
Latency Challenge: Conventional Method
• High latency between GPU and CPU memories
  – CUDA inter-memory data transfer primitives
• Very high latency across CPU memories
  – MPI communication for data transfers
• The naïve method gives a very poor computation-to-communication ratio
  – Slow-downs instead of speedups
• Need a latency-resilient method …
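To make the baseline concrete, here is a minimal single-process NumPy sketch (an illustration, not code from the talk) of the conventional schedule on a toroidal Game of Life grid: at every time step, every block gathers a one-cell ghost layer (counted here as one communication) and then updates only its own B×B interior, so the number of exchanges grows with the number of iterations.

```python
import numpy as np

def life_interior(tile):
    """One Game of Life step on the interior of a tile (no wraparound)."""
    n = sum(tile[1 + di:tile.shape[0] - 1 + di, 1 + dj:tile.shape[1] - 1 + dj]
            for di in (-1, 0, 1) for dj in (-1, 0, 1)) - tile[1:-1, 1:-1]
    c = tile[1:-1, 1:-1]
    return ((n == 3) | ((c == 1) & (n == 2))).astype(tile.dtype)

def conventional_steps(grid, B, steps):
    """Conventional method: one ghost-layer gather per block per iteration."""
    N = grid.shape[0]
    exchanges = 0
    for _ in range(steps):
        new = np.empty_like(grid)
        for bi in range(0, N, B):
            for bj in range(0, N, B):
                ii = np.arange(bi - 1, bi + B + 1) % N  # one ghost layer,
                jj = np.arange(bj - 1, bj + B + 1) % N  # toroidal wrap
                tile = grid[np.ix_(ii, jj)]             # the "communication"
                new[bi:bi + B, bj:bj + B] = life_interior(tile)
                exchanges += 1
        grid = new
    return grid, exchanges
```

With B equal to the grid size there is a single block and the wrap-around gather reduces to a plain toroidal update, which serves as the reference; a P×P decomposition pays steps × P² exchanges, each of which would be a high-latency transfer on the hardware above.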
Our Solution: B2R Method
[Figure: the same grid of blocks Block_0,0 … Block_2,2 on processors P_0,0 … P_2,2, with each B×B block extended by R ghost layers on every side (B+2R total extent)]
B2R Algorithm

Let Te be the total number of iterations in the simulation.
1    For all blocks Block_ij in the given agent grid G:
1.1    Let (tl_i, tl_j) be the top-left index of Block_ij
1.2    Let (br_i, br_j) be the bottom-right index of Block_ij
1.3    For t = 0 to Te/R:
1.4      For r = R-1 down to 0:
1.5        Update(tl_i - r, tl_j - r, br_i + r, br_j + r)
1.6      Communicate(tl_i, tl_j, br_i, br_j, r)
1.7      Barrier()
[Figure: a B×B sub-block mapped to processing element p, surrounded by R layers of lagging cells (B+2R total extent); errors propagate inward by one layer per iteration over the R iterations]
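The schedule above can be sketched as a single-process NumPy program (an illustrative reconstruction, not the talk's CUDA/MPI code): each block gathers R ghost layers once per round, then performs R local updates on a tile that shrinks by one layer of lagging cells per step back down to B×B; the toroidal gather stands in for Communicate() and Barrier().

```python
import numpy as np

def shrink_step(tile):
    """One Game of Life step; returns a tile shrunk by one cell per side."""
    n = sum(tile[1 + di:tile.shape[0] - 1 + di, 1 + dj:tile.shape[1] - 1 + dj]
            for di in (-1, 0, 1) for dj in (-1, 0, 1)) - tile[1:-1, 1:-1]
    c = tile[1:-1, 1:-1]
    return ((n == 3) | ((c == 1) & (n == 2))).astype(tile.dtype)

def b2r(grid, B, R, rounds):
    """B2R: each block communicates once per R steps, then runs R local
    updates on a (B+2R)x(B+2R) tile that shrinks back to BxB."""
    N = grid.shape[0]
    exchanges = 0
    for _ in range(rounds):
        new = np.empty_like(grid)
        for bi in range(0, N, B):
            for bj in range(0, N, B):
                ii = np.arange(bi - R, bi + B + R) % N  # R ghost layers
                jj = np.arange(bj - R, bj + B + R) % N
                tile = grid[np.ix_(ii, jj)]             # one communication
                for _ in range(R):                      # R lagging updates
                    tile = shrink_step(tile)
                new[bi:bi + B, bj:bj + B] = tile
                exchanges += 1
        grid = new
    return grid, exchanges
```

Setting R = 1 recovers the conventional per-step exchange, so checking that R = 2 over 3 rounds matches R = 1 over 6 rounds both verifies correctness and shows the trade: B2R duplicates some computation on the (B+2R)² halo in exchange for R× fewer communications.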
Total Runtime Cost: Analytical Form
At any level in the hierarchy, the total runtime F is given by:

F_CPU = a[(4/3)R^3 + (2B - 2)R^2 + (B^2 - 2B + 2/3)R] + b[8BR] + c

F_GPU = a[(4/3)R^3 + (2B - 2)R^2 + (B^2 - 2B + 2/3)R] + b[(B + 2R)^2] + cR

Most interesting aspect: cubic in R!
Implications of Being Cubic in R

[Figure: sketch of total execution time vs. R]

• Benefits with B2R are not immediately seen for small R
  – In fact, degradation for small R!
• Dramatic improvement is possible after small R
  – Our experiments confirm this trend!
• Too large is too bad, too
  – Can't profit indefinitely!
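This trade-off can be explored numerically. The sketch below uses the CPU-side cost shape (a = per-cell update cost, b = per-cell transfer cost, c = fixed per-exchange latency; the constants and the 8BR exchange-volume term are illustrative assumptions). The compute term is exactly the sum of the expanding updates, Σ_{r=0}^{R-1} (B+2r)², which is the cubic polynomial in R; amortizing one round over the R steps it covers produces an interior minimum.

```python
def b2r_round_cost(R, B, a, b, c):
    """Per-round cost for one block: R expanding updates (the exact sum of
    (B+2r)^2, expanded into the cubic-in-R polynomial) plus one ghost
    exchange of ~8BR cells and a fixed latency c."""
    compute = a * ((4 / 3) * R**3 + (2 * B - 2) * R**2
                   + (B * B - 2 * B + 2 / 3) * R)
    return compute + b * 8 * B * R + c

def per_step_cost(R, B, a, b, c):
    """Amortized cost per simulated step: one round covers R steps."""
    return b2r_round_cost(R, B, a, b, c) / R

# Illustrative (assumed) constants where latency c dominates per-cell costs.
B, a, b, c = 32, 1.0, 1.0, 1.0e5
costs = {R: per_step_cost(R, B, a, b, c) for R in range(1, 257)}
best = min(costs, key=costs.get)
```

With these constants the minimum lies strictly between the endpoints: small R leaves the fixed latency poorly amortized, while very large R lets the cubic redundant-compute term dominate, matching the bullets above.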
Sub-division Across Levels (e.g., MPI to Blocks to Threads)

[Figure: hardware hierarchy: multi-node (MPI: R_m) → node (multi-GPU) → GPU → block (Block: R_b) → thread (Thread: R_t)]
Hierarchy and Recursive Use of B & R
B2R can be applied at all levels!
• A different R can be chosen at every level, e.g.
  – R_b for the block-level R
  – R_t for the thread-level R
• Simple constraints exist for the possible values of R
  – Between R and B
  – Between the R's at different levels
  – Details in our paper

[Figure: the CUDA hierarchy: multi-node → node (multi-GPU) → GPU → block → thread]
B2R Implementation within CUDA
[Figure: the grid in global memory is split into b×b logical blocks; each block stages a (B+2R)×(B+2R) tile in its per-block shared memory, performs R state updates there, and writes the B×B result back to global memory]
Performance
[Chart: Multi-Node GPU GOL, 16 million agents; speedup (0-150) vs. MPI-level R (R_m = 1, 2, 4, 8), for R_t = 1, 2, 4]

Over 100× speedup with MPI+CUDA (speedup relative to the naïve method with no latency hiding)

[Chart: Multi-Node GPU LDR, 16 million agents; speedup (0-40) vs. MPI-level R (R_m = 2, 4, 8), for R_t = 2, 4, 8]
Multi-GPU MPI+CUDA – Game of Life
[Chart: Multi-Node GPU LDR, 16 million agents; improvement level (0%-3500%) vs. MPI-level R (R_m = 2, 4, 8), for R_t = 1, 2, 4, 8]
Multi-core MPI+pthreads – Game of Life

[Chart: Multi-Node CPU GOL, 1 billion agents; improvement level (0%-250%) vs. MPI-level R (R_m = 1, 2, 3, 4), for R_t = 1, 2, 4]
Multi-core MPI+Pthreads – Game of Life
[Chart: Multi-Node CPU GOL, 1 billion agents; improvement level (0%-250%) vs. MPI-level R (R_m from 1 to 10,000, log scale)]
Multi-core MPI+pthreads – Leadership
[Chart: Multi-Node CPU LDR, 1 billion agents; improvement level (0%-250%) vs. MPI-level R (R_m = 2, 4, 8), for R_t = 1, 2, 4, 8]
Summary
• The B2R algorithm applies across heterogeneous, hierarchical platforms
  – Deep GPU hierarchies
  – Deep CPU multi-core systems
• The cubic nature of the runtime's dependence on R is a remarkable aspect
  – A maximum and a minimum exist
  – The optimal (minimum) can be dramatically low
• Results show clear performance improvement
  – Up to 150× in the best (fine-grained) case
Future Work
• Generate cross-platform code
  – E.g., implement in OpenCL
• Add to the CUDA-MPI levels
  – Multi-GPU per node
• Implement and test with more benchmarks
  – E.g., from the existing ABMS suites NetLogo and Repast
• Generalize to unstructured inter-agent graphs
  – E.g., social networks
• Potential to apply to other domains
  – E.g., stencil computations
Thank you! Questions?
Additional material at our webpage:
Discrete Computing Systems
www.ornl.gov/~2ip