prefix sums on gpus - gpu.cs.uct.ac.za
Prefix sums on GPUs
Bruce Merry
Definition and Applications: Motivating Problem, Definitions, Other Applications
Parallel Algorithms: Kogge-Stone, Brent-Kung
GPU Strategies: Reduce-then-Scan, Two-Level Prefix Sum
Summary
Prefix sums on GPUs
Bruce Merry
Department of Computer Science, University of Cape Town
GPGPU2 Workshop 2014
Outline
1 Definition and Applications: Motivating Problem, Definitions, Other Applications
2 Parallel Algorithms: Kogge-Stone, Brent-Kung
3 GPU Strategies: Reduce-then-Scan, Two-Level Prefix Sum
Problem Statement
For every object in a set, output a list of the other objects that differ by less than some amount.
This is deliberately vague: it could be for n-body simulation, clustering, or scattered data interpolation.
Output Format
The lists should be packed together contiguously.
A0 A1 A2 B0 B1 D0 E0 E1
(Here object C has no records, so it occupies no space in the output.)
Assuming one workitem per object, how do the workitems know where to start?
Solution
This can be solved with a multi-pass approach:
1 Every workitem counts how many records to emit, and writes this number to a buffer.
2 The buffer is processed to determine the start position for each object, and these positions are written to a buffer.
3 Each workitem reads this buffer, and emits its records in the right place.
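The three passes above can be sketched on the CPU as follows. This is only an illustration in Python, not the author's implementation: the function name `pack_neighbours`, the use of plain integers as objects and the absolute-difference test are assumptions made for the example, and each iteration over `i` stands in for one workitem.

```python
from itertools import accumulate


def pack_neighbours(values, threshold):
    """Return (starts, packed), where packed holds every object's list back to back."""
    n = len(values)

    def is_near(i, j):
        # Hypothetical "differs by less than some amount" test.
        return i != j and abs(values[i] - values[j]) < threshold

    # Pass 1: each workitem counts how many records it will emit.
    counts = [sum(1 for j in range(n) if is_near(i, j)) for i in range(n)]

    # Pass 2: an exclusive prefix sum of the counts gives the start positions.
    starts = list(accumulate(counts, initial=0))[:-1]

    # Pass 3: each workitem writes its records starting at its own offset.
    packed = [None] * sum(counts)
    for i in range(n):
        pos = starts[i]
        for j in range(n):
            if is_near(i, j):
                packed[pos] = j
                pos += 1
    return starts, packed


if __name__ == "__main__":
    print(pack_neighbours([10, 15, 100, 12], 10))
```

Because the start positions come from a prefix sum, the output keeps the original object order and is the same on every run.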
Exclusive Prefix Sum
Given an operator ⊕ and an identity element I, the exclusive prefix sum of $(a_0, a_1, \ldots, a_{n-1})$ is
$(I,\ a_0,\ a_0 \oplus a_1,\ a_0 \oplus a_1 \oplus a_2,\ \ldots,\ a_0 \oplus \cdots \oplus a_{n-2}) = \left(\bigoplus_{j=0}^{i-1} a_j\right)_{i=0}^{n-1}$
In other words, element i is the sum of all elements strictly before i.
Input: 4 3 7 9 2 3
Output: 0 4 7 14 23 25
Inclusive Prefix Sum
Given an operator ⊕ and an identity element I, the inclusive prefix sum of $(a_0, a_1, \ldots, a_{n-1})$ is
$(a_0,\ a_0 \oplus a_1,\ a_0 \oplus a_1 \oplus a_2,\ \ldots,\ a_0 \oplus \cdots \oplus a_{n-1}) = \left(\bigoplus_{j=0}^{i} a_j\right)_{i=0}^{n-1}$
In other words, element i is the sum of all elements before and including i.
Input: 4 3 7 9 2 3
Output: 4 7 14 23 25 28
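As a serial reference for the two definitions, here is a small Python sketch built on the standard library's `itertools.accumulate`. The function names `inclusive_scan` and `exclusive_scan` are chosen for this example only; the operator defaults to addition with identity 0, but any associative operator and its identity can be passed in.

```python
from itertools import accumulate
import operator


def inclusive_scan(a, op=operator.add):
    """Element i is a[0] op a[1] op ... op a[i]."""
    return list(accumulate(a, op))


def exclusive_scan(a, op=operator.add, identity=0):
    """Element i combines all elements strictly before i; element 0 is the identity."""
    # accumulate with initial=identity yields n + 1 values; drop the final total.
    return list(accumulate(a, op, initial=identity))[:-1]


if __name__ == "__main__":
    data = [4, 3, 7, 9, 2, 3]
    print(exclusive_scan(data))  # the exclusive example above
    print(inclusive_scan(data))  # the inclusive example above
```

The two results differ only by a shift: the exclusive scan is the inclusive scan moved one place to the right with the identity inserted at the front.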
Other Applications
- Compaction: select all objects that satisfy a predicate
- Partitioning: rearrange objects that satisfy a predicate before the others
- Sorting: radix sort is just repeated partitioning
- Visibility: an object is visible if it is not preceded by a taller one (using the max operator instead of +)
- Meshing: each cell produces a variable number of triangles
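Two of these applications are easy to sketch serially in Python. The names `compact` and `visible` are made up for this example, and the loops that would run one workitem per element on a GPU are ordinary loops here. Compaction scans 0/1 flags with +, while visibility scans heights with max, whose identity is taken to be negative infinity.

```python
from itertools import accumulate


def compact(items, predicate):
    """Keep the items that satisfy predicate, preserving their order."""
    flags = [1 if predicate(x) else 0 for x in items]
    # offsets[i] is the exclusive prefix sum of the flags; offsets[-1] is the total.
    offsets = list(accumulate(flags, initial=0))
    out = [None] * offsets[-1]
    for i, x in enumerate(items):  # one workitem per element on a GPU
        if flags[i]:
            out[offsets[i]] = x
    return out


def visible(heights):
    """An object is visible if no earlier object is taller."""
    # Exclusive max-scan: the tallest object strictly before each position.
    tallest_before = list(accumulate(heights, max, initial=float("-inf")))[:-1]
    return [h >= t for h, t in zip(heights, tallest_before)]


if __name__ == "__main__":
    print(compact([5, 2, 8, 1, 9], lambda x: x > 4))
    print(visible([3, 1, 4, 4, 2]))
```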
Atomics
Atomics offer an alternative way to allocate unique memory per work-item, but they
- Suffer heavy contention, which is slow (but getting better all the time)
- Do not preserve the original ordering
- Do not give reproducible ordering
Atomics have the advantage of allowing for single-pass algorithms.
Kogge-Stone: Idea
Let $s_i^t$ be the sum of the (up to) $t$ inputs ending with $a_i$. Then
$s_i^{2t} = s_{i-t}^t \oplus s_i^t$ (for $i < t$ there is nothing to add, so $s_i^{2t} = s_i^t$).
We start with $(s_i^1) = (a_i)$, then compute $(s_i^2)$, $(s_i^4)$, $(s_i^8)$ and so on, up to $(s_i^N)$, in $O(\log_2 N)$ iterations, to give an inclusive prefix sum.
Kogge-Stone: Example
6  9  2  3  6  1  6  3    $s^1$
6 15 11  5  9  7  7  9    $s^2$
6 15 17 20 20 12 16 16    $s^4$
6 15 17 20 26 27 33 36    $s^8$
Kogge-Stone: Pseudo-code
foreach power-of-two $t$ from 1 to $N$ do
    for $i \leftarrow t$ to $N - 1$ do in parallel
        $a_i \leftarrow a_{i-t} \oplus a_i$;
Kogge-Stone: Work-item Pseudo-code
$i \leftarrow$ workitem ID;
foreach power-of-two $t$ from 1 to $N$ do
    $x \leftarrow a_i$;
    if $t \le i$ then
        $x \leftarrow a_{i-t} \oplus x$;
    barrier();
    $a_i \leftarrow x$;
    barrier();
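The work-item loop can be simulated serially to check its behaviour. In this Python sketch (the name `kogge_stone_scan` is an assumption for the example) each list comprehension plays the role of all workitems reading before the first barrier, and the assignment that follows plays the role of the write between the two barriers.

```python
import operator


def kogge_stone_scan(a, op=operator.add):
    """Inclusive prefix sum by repeated doubling of the summed span."""
    a = list(a)
    n = len(a)
    t = 1
    while t < n:
        # "Read phase": every workitem i computes x from the old array.
        x = [op(a[i - t], a[i]) if t <= i else a[i] for i in range(n)]
        # "Write phase": only after all reads does the array get replaced.
        a = x
        t *= 2
    return a


if __name__ == "__main__":
    print(kogge_stone_scan([6, 9, 2, 3, 6, 1, 6, 3]))
```

The earlier element is placed on the left of the operator, so the sketch also works for associative operators that are not commutative, such as string concatenation.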
Kogge-Stone: Optimizations
- The working register x can be reused between loop iterations without reloading.
- The if statement can be eliminated by padding at the front with zeros.
- Shared memory can be used to reduce global memory accesses.
Kogge-Stone: Properties
- It is work-inefficient: it performs $O(N \log N)$ operations in total
- About $2 \log_2 N$ barriers
- About $N \log N$ reads and $N \log N$ writes
- Memory access pattern is good: sequential accesses
Brent-Kung: Idea
For an exclusive scan:
- Add pairs of adjacent elements: $p_i = a_{2i} \oplus a_{2i+1}$
- Recursively scan these sums: $q_i = \bigoplus_{j=0}^{i-1} p_j = \bigoplus_{j=0}^{2i-1} a_j$
- Use these sums to compute the result: $s_{2i} = q_i$, $s_{2i+1} = q_i \oplus a_{2i}$
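The recursion translates almost line for line into a serial Python sketch. The name `brent_kung_exclusive` and the restriction to power-of-two lengths are assumptions made to keep the example short; `p`, `q` and `s` follow the symbols used above.

```python
import operator


def brent_kung_exclusive(a, op=operator.add, identity=0):
    """Exclusive prefix sum via pairwise sums, a recursive scan, and a fix-up."""
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two length")
    if n == 1:
        return [identity]
    # Add pairs of adjacent elements.
    p = [op(a[2 * i], a[2 * i + 1]) for i in range(n // 2)]
    # Recursively scan the pair sums.
    q = brent_kung_exclusive(p, op, identity)
    # Even outputs copy q; odd outputs add the left element of the pair.
    s = [identity] * n
    for i in range(n // 2):
        s[2 * i] = q[i]
        s[2 * i + 1] = op(q[i], a[2 * i])
    return s


if __name__ == "__main__":
    print(brent_kung_exclusive([6, 9, 2, 3, 6, 1, 6, 3]))
```

Each level halves the problem, so the total number of ⊕ operations is proportional to N rather than N log N.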
Brent-Kung: Example
Input: 6 9 2 3 6 1 6 3
Reduction, bottom-up (sum adjacent pairs, level by level):
    15 5 7 9
    20 16
    36
Scan, top-down (replace the root with 0, then push the prefixes back down):
    0
    0 20
    0 15 20 27
    0 6 15 17 20 26 27 33
The last row is the exclusive prefix sum of the input.
Memory Arrangement
- In-place: Each sum replaces the second element of the pair being summed. No extra memory, but has bad bank conflicts.
- Out-of-place: Each level of the tree is stored contiguously. Requires double the memory, but conflicts are only 2-way.
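Brent-Kung: Pseudo-code
Out-of-place exclusive sum, for $N = 2^n$:
Copy $a_i$ to $b_{i+N}$ for $i \in [0, N)$;
for $t \leftarrow n - 1$ downto 1 do
    for $i \leftarrow 0$ to $2^t - 1$ do in parallel
        $b_{2^t+i} \leftarrow b_{2^{t+1}+2i} \oplus b_{2^{t+1}+2i+1}$;
// Exclusive prefix sum of two elements
$b_3 \leftarrow b_2$;
$b_2 \leftarrow I$;
for $t \leftarrow 1$ to $n - 1$ do
    for $i \leftarrow 0$ to $2^t - 1$ do in parallel
        $b_{2^{t+1}+2i+1} \leftarrow b_{2^t+i} \oplus b_{2^{t+1}+2i}$;
        $b_{2^{t+1}+2i} \leftarrow b_{2^t+i}$;
Copy $b_{i+N}$ to $s_i$ for $i \in [0, N)$;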
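A serial Python rendering of the out-of-place scheme is sketched below. It assumes the usual heap layout in which node $2^t + i$ on level $t$ has children $2^{t+1} + 2i$ and $2^{t+1} + 2i + 1$, with the inputs stored in the second half of `b`. The name `brent_kung_out_of_place` and the requirement that N be a power of two of at least 2 are assumptions of this example.

```python
import operator


def brent_kung_out_of_place(a, op=operator.add, identity=0):
    """Exclusive prefix sum using a 2N-element tree buffer b (b[0], b[1] unused)."""
    N = len(a)
    if N < 2 or N & (N - 1):
        raise ValueError("this sketch assumes N is a power of two, N >= 2")
    n = N.bit_length() - 1  # N == 2 ** n
    b = [identity] * (2 * N)
    b[N:] = a  # leaves live in b[N .. 2N-1]

    # Reduction: each node becomes the sum of its two children.
    for t in range(n - 1, 0, -1):
        for i in range(2 ** t):  # "in parallel" on a GPU
            b[2 ** t + i] = op(b[2 ** (t + 1) + 2 * i], b[2 ** (t + 1) + 2 * i + 1])

    # Exclusive prefix sum of the two elements on level 1.
    b[3] = b[2]
    b[2] = identity

    # Scan: push each node's prefix down to its children.
    for t in range(1, n):
        for i in range(2 ** t):  # "in parallel" on a GPU
            parent = b[2 ** t + i]
            left = 2 ** (t + 1) + 2 * i
            b[left + 1] = op(parent, b[left])
            b[left] = parent

    return b[N:]


if __name__ == "__main__":
    print(brent_kung_out_of_place([6, 9, 2, 3, 6, 1, 6, 3]))
```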
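Brent-Kung: Per-workitem Pseudo-code
Uses $N/2$ work-items:
$i \leftarrow$ work-item ID;
$b_{i+N} \leftarrow a_i$;
$b_{i+N+N/2} \leftarrow a_{i+N/2}$;
barrier();
for $t \leftarrow n - 1$ downto 1 do
    if $i < 2^t$ then
        $b_{2^t+i} \leftarrow b_{2^{t+1}+2i} \oplus b_{2^{t+1}+2i+1}$;
    barrier();
if $i = 0$ then
    $b_3 \leftarrow b_2$;
    $b_2 \leftarrow I$;
barrier();
for $t \leftarrow 1$ to $n - 1$ do
    if $i < 2^t$ then
        $b_{2^{t+1}+2i+1} \leftarrow b_{2^t+i} \oplus b_{2^{t+1}+2i}$;
        $b_{2^{t+1}+2i} \leftarrow b_{2^t+i}$;
    barrier();
$s_i \leftarrow b_{i+N}$;
$s_{i+N/2} \leftarrow b_{i+N+N/2}$;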
Brent-Kung: Properties
- Work-efficient: $O(N)$ addition operations
- Still requires about $2 \log_2 N$ barriers
- Requires about $4N$ reads and $3N$ writes
- Only $N/2$ work-items required
- Has branching, but it is coherent
Motivation
Applying one of these at a larger (multi-workgroup) scale has issues:
- Synchronisation: no inter-workgroup synchronisation, so barriers must be kernel-instance boundaries
- Memory usage: need $O(N)$ working space
- Bandwidth: requires $O(N \log N)$ for Kogge-Stone, about $7N$ for Brent-Kung
Generalizing Brent-Kung
The Brent-Kung tree doesn’t have to be binary. Example with four children per node:
Input: 6 9 2 3 | 6 1 6 3 | 3 1 5 4 | 9 7 3 3
Block sums: 20 16 13 22
Root (total): 71
Root replaced with 0, then block sums scanned: 0 20 36 49
Result: 0 6 15 17 | 20 26 27 33 | 36 39 40 45 | 49 58 65 68
Reduce-then-Scan strategy
1 Divide elements into blocks of size M.
2 Use a workgroup per block to compute the sum of each block.
3 Recursively prefix-sum the block sums.
4 Use a workgroup per block to prefix-sum each block, starting from the result from the previous level.
Steps 2 and 4 can use any parallel reduction/prefix sum algorithm.
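The four steps can be sketched serially in Python as below. The function name `reduce_then_scan` is hypothetical, and the per-block reduction and scan use the standard library instead of a parallel algorithm, which the strategy allows since steps 2 and 4 can use any method. On a GPU, each iteration over the blocks would be one workgroup.

```python
from functools import reduce
from itertools import accumulate
import operator


def reduce_then_scan(a, M, op=operator.add, identity=0):
    """Exclusive prefix sum of a, processed in blocks of size M."""
    if M < 2:
        raise ValueError("block size M must be at least 2")
    if len(a) <= M:
        # Small enough for a single workgroup: scan directly.
        return list(accumulate(a, op, initial=identity))[:-1]
    # Step 1: divide the elements into blocks of size M.
    blocks = [a[i:i + M] for i in range(0, len(a), M)]
    # Step 2: one workgroup per block computes the block sum.
    sums = [reduce(op, block) for block in blocks]
    # Step 3: recursively prefix-sum the block sums.
    offsets = reduce_then_scan(sums, M, op, identity)
    # Step 4: scan each block, seeded with the offset from step 3.
    out = []
    for block, offset in zip(blocks, offsets):
        out.extend(list(accumulate(block, op, initial=offset))[:-1])
    return out


if __name__ == "__main__":
    data = [6, 9, 2, 3, 6, 1, 6, 3, 3, 1, 5, 4, 9, 7, 3, 3]
    print(reduce_then_scan(data, 4))
```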
Reduce-then-Scan: Analysis
Assuming that M is reasonably large:
- About $\log_M N$ kernel instances
- Most memory accesses can be to local memory
- Slightly over $2N$ global reads
- Slightly over $N$ global writes
- $O(N \log M)$ barrier instructions
Reducing Parallelism
The more parallelism one uses in a prefix sum, the higher the overheads become.
Therefore, only use as much parallelism as is necessary to saturate the hardware.
Fixed Block Count
Use the same reduce-then-scan strategy, but
- Fix the number of blocks at C, set $M = N/C$
- Fix a work-group size G
- C should be tuned so that C × G workitems saturate the device
- C should be small enough that only 2 levels are required
Prefix-Summing a Block
Each block has size M but workgroups only have G workitems. How does a workgroup prefix-sum a block?
Serially. In sub-blocks of size G or 2G.
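A serial Python sketch of the two-level scheme follows. The name `two_level_scan`, the rounding of M up to a whole number and the choice of sub-blocks of exactly G elements are assumptions for illustration. The three numbered phases correspond to three kernel launches, and the inner loop shows a workgroup walking its block one sub-block at a time while carrying a running total.

```python
from functools import reduce
from itertools import accumulate
import operator


def two_level_scan(a, C, G, op=operator.add, identity=0):
    """Exclusive prefix sum with a fixed block count C and work-group size G."""
    if C < 1 or G < 1:
        raise ValueError("C and G must be positive")
    M = max(1, -(-len(a) // C))  # block size: N / C rounded up
    blocks = [a[k * M:(k + 1) * M] for k in range(C)]

    # Kernel 1: each of the C workgroups reduces its own block.
    sums = [reduce(op, block, identity) for block in blocks]

    # Kernel 2: one small scan over the C block sums.
    offsets = list(accumulate(sums, op, initial=identity))[:-1]

    # Kernel 3: each workgroup scans its block serially, G elements at a time.
    out = []
    for block, carry in zip(blocks, offsets):
        for j in range(0, len(block), G):
            scanned = list(accumulate(block[j:j + G], op, initial=carry))
            out.extend(scanned[:-1])
            carry = scanned[-1]  # running total passed to the next sub-block
    return out


if __name__ == "__main__":
    data = [6, 9, 2, 3, 6, 1, 6, 3, 3, 1, 5, 4, 9, 7, 3, 3]
    print(two_level_scan(data, C=4, G=2))
```

Only the C block sums and their offsets need extra global storage, whatever the input size.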
Two-Level Prefix Sum: Advantages
- Only three kernel instances, two of which use the full GPU
- Only $O(N \log G)$ barrier instructions
- Only $O(C)$ extra global memory
Summary
- Parallel prefix sum is hard work
- GPUs need parallelism, but the algorithm works best with the least parallelism
- With a good implementation, it can be bandwidth-limited