On Dynamic Load Balancing on Graphics Processors
Daniel Cederman and Philippas Tsigas
Chalmers University of Technology
Overview
• Motivation
• Methods
• Experimental evaluation
• Conclusion
The problem setting
[Figure: work divided into a set of tasks; labels: Offline, Online]
Static Load Balancing
[Figure sequence: four processors, four tasks, and four subtasks]
Dynamic Load Balancing
[Figure: four processors, four tasks, and four subtasks]
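Task sharing
[Flowchart: each thread block loops. Work done? If yes, done. If not, try to get a task from the task set. Got task? If no, retry. If yes, perform the task. New tasks? If yes, add them to the task set; if no, continue. The task set supports three operations: Check condition, Acquire Task, Add Task.]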
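The loop in this flowchart can be sketched in C++ as follows. This is a sequential sketch only: `Task`, `TaskSet`, and `run_task_sharing` are hypothetical names, and the workload (a task larger than `leaf_size` splits into two halves) is an assumption made for illustration.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical task: an amount of work that is split until it is small enough.
struct Task { int size; };

// Hypothetical task set offering the three operations from the flowchart.
// Sequential on purpose; a shared task set needs synchronization.
class TaskSet {
public:
    void add(Task t) { tasks_.push_back(t); }           // add task
    std::optional<Task> acquire() {                     // acquire task
        if (tasks_.empty()) return std::nullopt;
        Task t = tasks_.back();
        tasks_.pop_back();
        return t;
    }
    bool done() const { return tasks_.empty(); }        // check condition
private:
    std::vector<Task> tasks_;
};

// Runs the loop until no work remains and returns the number of leaf tasks.
// Assumes leaf_size >= 1, otherwise splitting never terminates.
// With several workers, "set is empty" alone is not a safe termination test,
// because another worker may still add tasks.
int run_task_sharing(TaskSet& set, int leaf_size) {
    int leaves = 0;
    while (!set.done()) {                               // work done?
        std::optional<Task> t = set.acquire();          // try to get task
        if (!t) continue;                               // got no task: retry
        if (t->size > leaf_size) {                      // new tasks?
            int half = t->size / 2;
            set.add(Task{half});
            set.add(Task{t->size - half});
        } else {
            ++leaves;                                   // perform the task
        }
    }
    return leaves;
}
```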
System Model
• CUDA
• Global Memory
• Gather and scatter
• Compare-And-Swap
• Fetch-And-Inc
• Multiprocessors
• Maximum number of concurrent thread blocks
[Figure: three multiprocessors, each running three thread blocks, all connected to global memory]
Synchronization
• Blocking
• Uses mutual exclusion to only allow one process at a time to access the object.
• Lock-free
• Multiple processes can access the object concurrently. At least one operation in a set of concurrent operations finishes in a finite number of its own steps.
• Wait-free
• Multiple processes can access the object concurrently. Every operation finishes in a finite number of its own steps.
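The difference between the last two categories can be illustrated with a shared counter. The sketch below uses C++ `std::atomic` as a host-side stand-in for the GPU's Compare-And-Swap and Fetch-And-Inc; the function names are hypothetical.

```cpp
#include <atomic>
#include <cassert>

// Lock-free increment: retry a compare-and-swap until it succeeds.
// A failed CAS means another thread's CAS succeeded, so some operation
// always makes progress, but one particular thread may retry many times.
int increment_with_cas(std::atomic<int>& counter) {
    int seen = counter.load();
    while (!counter.compare_exchange_weak(seen, seen + 1)) {
        // On failure, 'seen' was refreshed with the current value; try again.
    }
    return seen;  // value before the increment
}

// Increment with a single fetch-and-add. This is wait-free when the
// hardware provides the operation as one atomic instruction (an assumption;
// some platforms implement it with an internal retry loop).
int increment_with_fetch_add(std::atomic<int>& counter) {
    return counter.fetch_add(1);  // value before the increment
}
```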
Load Balancing Methods
• Blocking Task Queue
• Non-blocking Task Queue
• Task Stealing
• Static Task List
Blocking queue
[Figure sequence: thread blocks TB 1, TB 2, ..., TB n share a queue with Free, Head, and Tail fields; task T1 is enqueued]
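A blocking queue of this kind can be sketched as a circular buffer guarded by a spin lock. This is a minimal illustration, not the authors' implementation: the class name, the `int` task type, and the way the lock flag is taken with a compare-and-swap are all assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical fixed-capacity task queue protected by a spin lock.
// Only one thread at a time can be inside enqueue or dequeue.
template <std::size_t Capacity>
class BlockingQueue {
public:
    // Returns false if the queue is full.
    bool enqueue(int task) {
        lock();
        bool ok = (tail_ - head_) < Capacity;
        if (ok) {
            slots_[tail_ % Capacity] = task;
            ++tail_;
        }
        unlock();
        return ok;
    }

    // Returns std::nullopt if the queue is empty.
    std::optional<int> dequeue() {
        lock();
        std::optional<int> task;
        if (head_ != tail_) {
            task = slots_[head_ % Capacity];
            ++head_;
        }
        unlock();
        return task;
    }

private:
    void lock() {
        bool expected = false;
        // Spin until the CAS flips the flag from false (free) to true (taken).
        while (!locked_.compare_exchange_weak(expected, true,
                                              std::memory_order_acquire)) {
            expected = false;  // a failed CAS overwrote 'expected'
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }

    std::atomic<bool> locked_{false};
    std::size_t head_ = 0;   // index of the next task to dequeue
    std::size_t tail_ = 0;   // index of the next free slot
    int slots_[Capacity] = {};
};
```

Every thread that fails to take the lock spins without doing useful work, which is the cost this design pays under contention.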
Non-blocking Queue
[Figure sequence: thread blocks TB 1, TB 2, ..., TB n operate concurrently on a queue holding T1 to T4, with Head and Tail pointers; T5 is enqueued]
Reference: P. Tsigas and Y. Zhang, A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems [SPAA01]
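The idea of replacing the lock with compare-and-swap on the queue itself can be sketched as below. This is a deliberately simplified queue, not the Tsigas and Zhang algorithm: slots are never reused, so at most `Capacity` tasks can ever be enqueued, and tasks are assumed to be non-negative `int` values so that `-1` can mark an empty slot. All names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>

constexpr int kEmpty = -1;  // sentinel; tasks are assumed to be >= 0

// Simplified lock-free array queue. Each slot is written at most once.
template <std::size_t Capacity>
class LockFreeArrayQueue {
public:
    LockFreeArrayQueue() {
        for (auto& s : slots_) s.store(kEmpty);
    }

    // Returns false once all slots have been used.
    bool enqueue(int task) {
        for (;;) {
            std::size_t t = tail_.load();
            if (t >= Capacity) return false;
            int expected = kEmpty;
            bool written = slots_[t].compare_exchange_strong(expected, task);
            // Slot t is now full, whether we or another thread filled it:
            // help move the tail forward so nobody waits on a stalled thread.
            tail_.compare_exchange_strong(t, t + 1);
            if (written) return true;
        }
    }

    // Returns std::nullopt if no task is available.
    std::optional<int> dequeue() {
        for (;;) {
            std::size_t h = head_.load();
            if (h >= Capacity) return std::nullopt;
            int task = slots_[h].load();
            if (task == kEmpty) return std::nullopt;  // nothing enqueued here yet
            // Claim slot h; if another thread claimed it first, retry.
            if (head_.compare_exchange_strong(h, h + 1)) return task;
        }
    }

private:
    std::atomic<int> slots_[Capacity];
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};
```

A retry happens only when another thread's compare-and-swap succeeded, so some operation always completes, which is the lock-free property.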
Task stealing
[Figure sequence: each thread block (TB 1, TB 2, ..., TB n) has its own task set. One set grows from T1 to T1 T4 T5 and then shrinks until it is empty; a task is then taken from the other set, which holds T3 T2, leaving T2.]
Reference: N. S. Arora, R. D. Blumofe and C. G. Plaxton, Thread Scheduling for Multiprogrammed Multiprocessors [SPAA 98]
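A per-thread-block deque in the style of the cited Arora, Blumofe and Plaxton paper can be sketched as follows. It is a reading of that algorithm, not the authors' GPU code: the owner pushes and pops at the bottom, other thread blocks steal from the top with a compare-and-swap. The class name, the `int` task type, and the packing of top index and tag into one 64-bit word are assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

template <std::size_t Capacity>
class StealDeque {
public:
    // Owner only. Assumes the bottom index stays below Capacity.
    void push_bottom(int task) {
        std::uint32_t b = bot_.load();
        slots_[b].store(task);
        bot_.store(b + 1);
    }

    // Owner only. Returns std::nullopt if the deque is empty
    // or a thief took the last task first.
    std::optional<int> pop_bottom() {
        std::uint32_t b = bot_.load();
        if (b == 0) return std::nullopt;
        --b;
        bot_.store(b);
        int task = slots_[b].load();
        std::uint64_t old_age = age_.load();
        if (b > top_of(old_age)) return task;   // no thief can reach this task
        // Possibly the last task: reset the deque and race with thieves.
        bot_.store(0);
        std::uint64_t new_age = pack(tag_of(old_age) + 1, 0);
        if (b == top_of(old_age) &&
            age_.compare_exchange_strong(old_age, new_age)) {
            return task;
        }
        age_.store(new_age);
        return std::nullopt;
    }

    // Any thread. Returns std::nullopt if empty or if the CAS lost a race.
    std::optional<int> steal_top() {
        std::uint64_t old_age = age_.load();
        std::uint32_t b = bot_.load();
        if (b <= top_of(old_age)) return std::nullopt;
        int task = slots_[top_of(old_age)].load();
        std::uint64_t new_age = pack(tag_of(old_age), top_of(old_age) + 1);
        if (age_.compare_exchange_strong(old_age, new_age)) return task;
        return std::nullopt;
    }

private:
    // The tag changes on every reset so a stale CAS by a thief fails.
    static std::uint32_t top_of(std::uint64_t a) { return static_cast<std::uint32_t>(a); }
    static std::uint32_t tag_of(std::uint64_t a) { return static_cast<std::uint32_t>(a >> 32); }
    static std::uint64_t pack(std::uint32_t tag, std::uint32_t top) {
        return (static_cast<std::uint64_t>(tag) << 32) | top;
    }

    std::atomic<int> slots_[Capacity] = {};
    std::atomic<std::uint32_t> bot_{0};
    std::atomic<std::uint64_t> age_{0};
};
```

The owner touches the shared `age_` word with a compare-and-swap only when it may be taking the last task, so the common case of working on local tasks stays cheap.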
Static Task List
[Figure sequence: an In list holding T1 to T4, one task per thread block TB 1 to TB 4; an Out list is added, and new tasks T5, T6, and T7 appear one by one]
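One reading of the In and Out labels is a two-list scheme: each thread block reads its task from a read-only input list, appends any new tasks to an output list, and the lists are swapped between passes. The sketch below follows that reading and is an assumption, as are the names and the workload (a task of size greater than 1 splits in two). The loop over `block` stands in for thread blocks running in parallel.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// One pass over the input list. Reads need no synchronization because the
// input list is never modified during a pass. New tasks reserve their output
// slots with a single fetch-and-add.
// Assumes 'out' is large enough for every task created in this pass.
void run_pass(const std::vector<int>& in, std::vector<int>& out,
              std::atomic<std::size_t>& out_count) {
    for (std::size_t block = 0; block < in.size(); ++block) {
        int size = in[block];
        if (size <= 1) continue;                    // leaf task: nothing to add
        std::size_t slot = out_count.fetch_add(2);  // reserve two output slots
        out[slot] = size / 2;
        out[slot + 1] = size - size / 2;
    }
}

// Repeats passes, swapping the lists, until a pass creates no new tasks.
// Returns the number of passes executed.
int run_static_list(std::vector<int> in, std::size_t capacity) {
    int passes = 0;
    while (!in.empty()) {
        std::vector<int> out(capacity);
        std::atomic<std::size_t> out_count{0};
        run_pass(in, out, out_count);
        out.resize(out_count.load());
        in.swap(out);                               // output becomes next input
        ++passes;
    }
    return passes;
}
```

The only contended operation is the counter on the output list; the price is one pass (on the GPU, one kernel launch) per level of newly created tasks.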
Octree Partitioning
• Bandwidth bound
Four-in-a-row
• Computation intensive
Graphics Processors
8800GT
• 14 Multiprocessors
• 57 GB/sec bandwidth
9600GT
• 8 Multiprocessors
• 57 GB/sec bandwidth
Blocking Queue – Octree/9600GT
[3D plot; axes: Threads (16 to 128), Blocks (16 to 128), Time (ms)]
Blocking Queue – Octree/8800GT
[3D plot; axes: Threads (16 to 128), Blocks (16 to 128), Time (ms)]
Blocking Queue – Four-in-a-row
[3D plot; axes: Threads (16 to 128), Blocks (16 to 128), Time (ms)]
Non-blocking Queue – Octree/9600GT
[3D plot; axes: Threads (16 to 128), Blocks (16 to 128), Time (ms)]
Non-blocking Queue – Octree/8800GT
[3D plot; axes: Threads (16 to 128), Blocks (16 to 128), Time (ms)]
Non-blocking Queue – Four-in-a-row
[3D plot; axes: Threads (16 to 128), Blocks (16 to 128), Time (ms)]
Task stealing – Octree/9600GT
[3D plot; axes: Threads (16 to 128), Blocks (16 to 128), Time (ms)]
Task stealing – Octree/8800GT
[3D plot; axes: Threads (16 to 128), Blocks (16 to 128), Time (ms)]
Task stealing – Four-in-a-row
[3D plot; axes: Threads (16 to 128), Blocks (16 to 128), Time (ms)]
Static List
[Plot: Time (ms) against Threads/Block (8 to 128); series: Octree 9600GT, Octree 8800GTS, Four-in-a-row]
Octree Comparison
[Plot: Time (ms) against Particles (thousands, 100 to 500); series: Blocking Queue, Non-Blocking Queue, Static List, Work Stealing]
Previous work
• Korch M., Rauber T., A comparison of task pools for dynamic load balancing of irregular algorithms, Concurrency and Computation: Practice & Experience, 16, 2003
• Heirich A., Arvo J., A competitive analysis of load balancing strategies for parallel ray tracing, Journal of Supercomputing, 12, 1998
• Foley T., Sugerman J., KD-tree acceleration structures for a GPU raytracer, Graphics Hardware 2005
Conclusion
• Synchronization plays a significant role in dynamic load-balancing
• Lock-free data structures and synchronization scale well and look promising for general-purpose GPU programming as well
• Locks perform poorly
• It is good that operations such as CAS and FAA (fetch-and-add) have been introduced in the new GPUs
• Work stealing could outperform static load balancing
Thank you!
http://www.cs.chalmers.se/~dcs