prefix sums on gpus - gpu.cs.uct.ac.za

88
Prefix sums on GPUs Bruce Merry Definition and Applications Motivating Problem Definitions Other Applications Parallel Algorithms Kogge-Stone Brent-Kung GPU Strategies Reduce-then-Scan Two-Level Prefix Sum Summary Prefix sums on GPUs Bruce Merry Department of Computer Science, University of Cape Town GPGPU2 Workshop 2014

Upload: others

Post on 19-Oct-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Prefix sums on GPUs

Bruce Merry

Department of Computer Science, University of Cape Town

GPGPU2 Workshop 2014

Page 2: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Outline

1 Definition and ApplicationsMotivating ProblemDefinitionsOther Applications

2 Parallel AlgorithmsKogge-StoneBrent-Kung

3 GPU StrategiesReduce-then-ScanTwo-Level Prefix Sum

Page 3: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Outline

1 Definition and ApplicationsMotivating ProblemDefinitionsOther Applications

2 Parallel AlgorithmsKogge-StoneBrent-Kung

3 GPU StrategiesReduce-then-ScanTwo-Level Prefix Sum

Page 4: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Problem Statement

For every object in a set, output a list of the other objectsthat differ by less than some amount.This is deliberately vague: could be for n-body simulation,clustering, scattered data interpolation.

Page 5: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Problem Statement

For every object in a set, output a list of the other objectsthat differ by less than some amount.This is deliberately vague: could be for n-body simulation,clustering, scattered data interpolation.

Page 6: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Output Format

The lists should be packed together contiguously.

A0 A1 A2

Assuming one workitem per object, how do the workitemsknow where to start?

Page 7: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Output Format

The lists should be packed together contiguously.

A0 A1 A2 B0 B1

Assuming one workitem per object, how do the workitemsknow where to start?

Page 8: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Output Format

The lists should be packed together contiguously.

A0 A1 A2 B0 B1 D0

Assuming one workitem per object, how do the workitemsknow where to start?

Page 9: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Output Format

The lists should be packed together contiguously.

A0 A1 A2 B0 B1 D0 E0 E1

Assuming one workitem per object, how do the workitemsknow where to start?

Page 10: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Output Format

The lists should be packed together contiguously.

A0 A1 A2 B0 B1 D0 E0 E1

Assuming one workitem per object, how do the workitemsknow where to start?

Page 11: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Solution

This can be solved with a multi-pass approach:1 Every workitem counts how many records to emit, and

writes this number to a buffer.2 The buffer is processed to determine the start position

for each object, and writes this position to a buffer.3 Each workitem reads this buffer, and emits its records

in the right place.

Page 12: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Solution

This can be solved with a multi-pass approach:1 Every workitem counts how many records to emit, and

writes this number to a buffer.2 The buffer is processed to determine the start position

for each object, and writes this position to a buffer.3 Each workitem reads this buffer, and emits its records

in the right place.

Page 13: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Solution

This can be solved with a multi-pass approach:1 Every workitem counts how many records to emit, and

writes this number to a buffer.2 The buffer is processed to determine the start position

for each object, and writes this position to a buffer.3 Each workitem reads this buffer, and emits its records

in the right place.

Page 14: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Outline

1 Definition and ApplicationsMotivating ProblemDefinitionsOther Applications

2 Parallel AlgorithmsKogge-StoneBrent-Kung

3 GPU StrategiesReduce-then-ScanTwo-Level Prefix Sum

Page 15: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Exclusive Prefix Sum

Given an operator ⊕ and an identity element I, the exclusiveprefix sum of (a0,a1, . . . ,an−1) is

(I,a0,a0 ⊕ a1,a0 ⊕ a1 ⊕ a2, . . . ,a0 ⊕ · · · ⊕ an−2) =

i−1⊕j=0

aj

In other words, element i is the sum of all elements strictlybefore i .

4 3 7 9 2 3

0 4 7 14 23 25

Page 16: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Exclusive Prefix Sum

Given an operator ⊕ and an identity element I, the exclusiveprefix sum of (a0,a1, . . . ,an−1) is

(I,a0,a0 ⊕ a1,a0 ⊕ a1 ⊕ a2, . . . ,a0 ⊕ · · · ⊕ an−2) =

i−1⊕j=0

aj

In other words, element i is the sum of all elements strictlybefore i .

4 3 7 9 2 3

0 4 7 14 23 25

Page 17: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Inclusive Prefix Sum

Given an operator ⊕ and an identity element I, the inclusiveprefix sum of (a0,a1, . . . ,an−1) is

(a0,a0 ⊕ a1,a0 ⊕ a1 ⊕ a2, . . . ,a0 ⊕ · · · ⊕ an−1) =

i⊕j=0

aj

In other words, element i is the sum of all elements beforeand including i .

4 3 7 9 2 3

4 7 14 23 25 28

Page 18: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Inclusive Prefix Sum

Given an operator ⊕ and an identity element I, the inclusiveprefix sum of (a0,a1, . . . ,an−1) is

(a0,a0 ⊕ a1,a0 ⊕ a1 ⊕ a2, . . . ,a0 ⊕ · · · ⊕ an−1) =

i⊕j=0

aj

In other words, element i is the sum of all elements beforeand including i .

4 3 7 9 2 3

4 7 14 23 25 28

Page 19: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Outline

1 Definition and ApplicationsMotivating ProblemDefinitionsOther Applications

2 Parallel AlgorithmsKogge-StoneBrent-Kung

3 GPU StrategiesReduce-then-ScanTwo-Level Prefix Sum

Page 20: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Other Applications

Compaction: select all objects that satisfy a predicatePartitioning: rearrange objects that satisfy a predicatebefore the othersSorting: radix sort is just repeated partitioningVisibility: an object is visible if it is not preceded by ataller one (using max operator instead of +)Meshing: each cell produces an variable number oftriangles

Page 21: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Other Applications

Compaction: select all objects that satisfy a predicatePartitioning: rearrange objects that satisfy a predicatebefore the othersSorting: radix sort is just repeated partitioningVisibility: an object is visible if it is not preceded by ataller one (using max operator instead of +)Meshing: each cell produces an variable number oftriangles

Page 22: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Other Applications

Compaction: select all objects that satisfy a predicatePartitioning: rearrange objects that satisfy a predicatebefore the othersSorting: radix sort is just repeated partitioningVisibility: an object is visible if it is not preceded by ataller one (using max operator instead of +)Meshing: each cell produces an variable number oftriangles

Page 23: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Other Applications

Compaction: select all objects that satisfy a predicatePartitioning: rearrange objects that satisfy a predicatebefore the othersSorting: radix sort is just repeated partitioningVisibility: an object is visible if it is not preceded by ataller one (using max operator instead of +)Meshing: each cell produces an variable number oftriangles

Page 24: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Other Applications

Compaction: select all objects that satisfy a predicatePartitioning: rearrange objects that satisfy a predicatebefore the othersSorting: radix sort is just repeated partitioningVisibility: an object is visible if it is not preceded by ataller one (using max operator instead of +)Meshing: each cell produces an variable number oftriangles

Page 25: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Atomics

Atomics offer an alternative way to allocate unique memoryper work-item, but

Suffer heavy contention, which is slow (but gettingbetter all the time)Do not preserve the original orderingDo not give reproducible ordering

Atomics have the advantage of allowing for single-passalgorithms.

Page 26: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Outline

1 Definition and ApplicationsMotivating ProblemDefinitionsOther Applications

2 Parallel AlgorithmsKogge-StoneBrent-Kung

3 GPU StrategiesReduce-then-ScanTwo-Level Prefix Sum

Page 27: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Idea

Let sti be the sum of the (up to) t inputs ending with ai . Then

s2ti = st

i−t ⊕ sti .

We start with (s1i ) = (ai), then compute (s2

i ), (s4i ), (s

8i ) and

so on, up to (sNi ), in O(log2 N) iterations, to give an

inclusive prefix sum.

Page 28: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Example

6 9 2 3 6 1 6 3 s1

6 15 11 5 9 7 7 9 s2

6 15 17 20 20 12 16 16 s3

6 15 17 20 26 27 33 36 s4

Page 29: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Pseudo-code

foreach power-of-two t from 1 to N dofor i ← t to N − 1 do in parallel

ai ← ai−t ⊕ ai ;

Page 30: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Work-item Pseudo-code

i ← workitem ID;foreach power-of-two t from 1 to N do

x ← ai ;if t ≤ i then

x ← x ⊕ ai−t ;barrier();ai ← x ;barrier();

Page 31: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Optimizations

The working register x can be reused between loopiterations without reloading.The if statement can be eliminated by padding at thefront with zeros.Shared memory can be used to reduce global memoryaccesses.

Page 32: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Optimizations

The working register x can be reused between loopiterations without reloading.The if statement can be eliminated by padding at thefront with zeros.Shared memory can be used to reduce global memoryaccesses.

Page 33: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Optimizations

The working register x can be reused between loopiterations without reloading.The if statement can be eliminated by padding at thefront with zeros.Shared memory can be used to reduce global memoryaccesses.

Page 34: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Properties

It is work-inefficient: it performs O(N log N) operationsin totalAbout 2 log2 N barriersAbout N log N reads and N log N writesMemory access pattern is good: sequential accesses

Page 35: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Properties

It is work-inefficient: it performs O(N log N) operationsin totalAbout 2 log2 N barriersAbout N log N reads and N log N writesMemory access pattern is good: sequential accesses

Page 36: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Properties

It is work-inefficient: it performs O(N log N) operationsin totalAbout 2 log2 N barriersAbout N log N reads and N log N writesMemory access pattern is good: sequential accesses

Page 37: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Properties

It is work-inefficient: it performs O(N log N) operationsin totalAbout 2 log2 N barriersAbout N log N reads and N log N writesMemory access pattern is good: sequential accesses

Page 38: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Outline

1 Definition and ApplicationsMotivating ProblemDefinitionsOther Applications

2 Parallel AlgorithmsKogge-StoneBrent-Kung

3 GPU StrategiesReduce-then-ScanTwo-Level Prefix Sum

Page 39: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Idea

For an exclusive scan:Add pairs of adjacent elements:

pi = a2i ⊕ a2i+1

Recursively scan these sums:

qi =i−1⊕j=0

pi =2i−1⊕j=0

aj

Use these sums to compute the result:

s2i = qi , s2i+1 = qi ⊕ a2i

Page 40: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Example

6 9 2 3 6 1 6 3

Page 41: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Example

15

6 9

5

2 3

7

6 1

9

6 3

Page 42: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Example

20

15

6 9

5

2 3

16

7

6 1

9

6 3

Page 43: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Example

36

20

15

6 9

5

2 3

16

7

6 1

9

6 3

Page 44: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Example

0

20

15

6 9

5

2 3

16

7

6 1

9

6 3

Page 45: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Example

0

0

15

6 9

5

2 3

20

7

6 1

9

6 3

Page 46: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Example

0

0

0

6 9

15

2 3

20

20

6 1

27

6 3

Page 47: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Example

0

0

0

0 6

15

15 17

20

20

20 26

27

27 33

Page 48: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Memory Arrangement

In-place Each sum replaces the second element of thepair being summed. No extra memory, but hasbad bank conflicts.

Out-of-place Each level of the tree stored contiguously.Requires double the memory, but conflicts areonly 2-way.

Page 49: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Memory Arrangement

In-place Each sum replaces the second element of thepair being summed. No extra memory, but hasbad bank conflicts.

Out-of-place Each level of the tree stored contiguously.Requires double the memory, but conflicts areonly 2-way.

Page 50: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Pseudo-code

Out-of-place exclusive sum, for N = 2n:Copy ai to bi+N for i ∈ [0,N);for t ← n − 1 downto 1 do

for i ← 0 to 2t − 1 do in parallelb2t+i ← b2t+1+i ⊕ b2t+1+i+1;

// Exclusive prefix sum of two elementsb3 ← b2;b2 ← I;for t ← 1 to n − 1 do

for i ← 0 to 2t − 1 do in parallelb2t+1+i+1 ← b2t+i ⊕ b2t+1+i ;b2t+1+i ← b2t+i ;

Copy bi+N to si for i ∈ [0,N);

Page 51: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Per-workitem Pseudo-code

Uses N2 work-items:

i ← work-item ID;bi+N ← ai ;bi+N+ N

2← ai+ N

2;

barrier();for t ← n − 1 downto 1 do

if i < 2t thenb2t+i ← b2t+1+2i ⊕ b2t+1+2i+1;

barrier();if i = 0 then

a3 ← a2;a2 ← I;

barrier();for t ← 1 to n − 1 do

if i < 2t thenb2t+1+2i+1 ← b2t+i ⊕ b2t+1+2i ;b2t+1+2i ← b2t+i ;

barrier();si ← bi+N ;si+ N

2← bi+N+ N

2;

Page 52: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Properties

Work-efficient: O(N) addition operationsStill requires about 2 log2 N barriersRequires about 4N reads and 3N writesOnly N

2 work-items requiredHas branching, but it is coherent

Page 53: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Properties

Work-efficient: O(N) addition operationsStill requires about 2 log2 N barriersRequires about 4N reads and 3N writesOnly N

2 work-items requiredHas branching, but it is coherent

Page 54: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Properties

Work-efficient: O(N) addition operationsStill requires about 2 log2 N barriersRequires about 4N reads and 3N writesOnly N

2 work-items requiredHas branching, but it is coherent

Page 55: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Properties

Work-efficient: O(N) addition operationsStill requires about 2 log2 N barriersRequires about 4N reads and 3N writesOnly N

2 work-items requiredHas branching, but it is coherent

Page 56: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Properties

Work-efficient: O(N) addition operationsStill requires about 2 log2 N barriersRequires about 4N reads and 3N writesOnly N

2 work-items requiredHas branching, but it is coherent

Page 57: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Motivation

Applying one of these at a larger (multi-workgroup) scalehas issues:

Synchronisation: no inter-workgroup synchronisation,so barriers must be kernel-instance boundariesMemory usage: need O(N) working spaceBandwidth: requires O(N log N) for Kogge-Stone, about7N for Brent-Kung

Page 58: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Outline

1 Definition and ApplicationsMotivating ProblemDefinitionsOther Applications

2 Parallel AlgorithmsKogge-StoneBrent-Kung

3 GPU StrategiesReduce-then-ScanTwo-Level Prefix Sum

Page 59: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Generalizing Brent-Kung

The Brent-Kung tree doesn’t have to be binary:

6 9 2 3 6 1 6 3 3 1 5 4 9 7 3 3

Page 60: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Generalizing Brent-Kung

The Brent-Kung tree doesn’t have to be binary:

20

6 9 2 3

16

6 1 6 3

13

3 1 5 4

22

9 7 3 3

Page 61: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Generalizing Brent-Kung

The Brent-Kung tree doesn’t have to be binary:

71

20

6 9 2 3

16

6 1 6 3

13

3 1 5 4

22

9 7 3 3

Page 62: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Generalizing Brent-Kung

The Brent-Kung tree doesn’t have to be binary:

0

20

6 9 2 3

16

6 1 6 3

13

3 1 5 4

22

9 7 3 3

Page 63: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Generalizing Brent-Kung

The Brent-Kung tree doesn’t have to be binary:

0

0

6 9 2 3

20

6 1 6 3

36

3 1 5 4

49

9 7 3 3

Page 64: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Generalizing Brent-Kung

The Brent-Kung tree doesn’t have to be binary:

0

0

0 6 15 17

20

20 26 27 33

36

36 39 40 45

49

49 58 65 68

Page 65: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Reduce-then-Scan strategy

1 Divide elements into blocks of size M.2 Use a workgroup per block to compute sum of each

block.3 Recursively prefix-sum the block sums.4 Use a workgroup per block to prefix-sum each block,

starting from the result from the previous level.

Steps 2 and 4 can use any parallel reduction/prefix sumalgorithm.

Page 66: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Reduce-then-Scan strategy

1 Divide elements into blocks of size M.2 Use a workgroup per block to compute sum of each

block.3 Recursively prefix-sum the block sums.4 Use a workgroup per block to prefix-sum each block,

starting from the result from the previous level.

Steps 2 and 4 can use any parallel reduction/prefix sumalgorithm.

Page 67: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Reduce-then-Scan strategy

1 Divide elements into blocks of size M.2 Use a workgroup per block to compute sum of each

block.3 Recursively prefix-sum the block sums.4 Use a workgroup per block to prefix-sum each block,

starting from the result from the previous level.

Steps 2 and 4 can use any parallel reduction/prefix sumalgorithm.

Page 68: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Reduce-then-Scan strategy

1 Divide elements into blocks of size M.2 Use a workgroup per block to compute sum of each

block.3 Recursively prefix-sum the block sums.4 Use a workgroup per block to prefix-sum each block,

starting from the result from the previous level.

Steps 2 and 4 can use any parallel reduction/prefix sumalgorithm.

Page 69: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Reduce-then-Scan strategy

1 Divide elements into blocks of size M.2 Use a workgroup per block to compute sum of each

block.3 Recursively prefix-sum the block sums.4 Use a workgroup per block to prefix-sum each block,

starting from the result from the previous level.

Steps 2 and 4 can use any parallel reduction/prefix sumalgorithm.

Page 70: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Analysis

Assuming that M is reasonably large:About logM N kernel instancesMost memory accesses can be to local memorySlightly over 2N global readsSlightly over N global writesSlightly over O(N log M) barrier instructions

Page 71: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Analysis

Assuming that M is reasonably large:About logM N kernel instancesMost memory accesses can be to local memorySlightly over 2N global readsSlightly over N global writesSlightly over O(N log M) barrier instructions

Page 72: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Analysis

Assuming that M is reasonably large:About logM N kernel instancesMost memory accesses can be to local memorySlightly over 2N global readsSlightly over N global writesSlightly over O(N log M) barrier instructions

Page 73: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Analysis

Assuming that M is reasonably large:About logM N kernel instancesMost memory accesses can be to local memorySlightly over 2N global readsSlightly over N global writesSlightly over O(N log M) barrier instructions

Page 74: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Analysis

Assuming that M is reasonably large:About logM N kernel instancesMost memory accesses can be to local memorySlightly over 2N global readsSlightly over N global writesSlightly over O(N log M) barrier instructions

Page 75: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Outline

1 Definition and ApplicationsMotivating ProblemDefinitionsOther Applications

2 Parallel AlgorithmsKogge-StoneBrent-Kung

3 GPU StrategiesReduce-then-ScanTwo-Level Prefix Sum

Page 76: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Reducing Parallelism

The more parallelism one uses in a prefix sum, the higherthe overheads become.Therefore, only use as much parallelism as is necessary tosaturate the hardware.

Page 77: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Reducing Parallelism

The more parallelism one uses in a prefix sum, the higherthe overheads become.Therefore, only use as much parallelism as is necessary tosaturate the hardware.

Page 78: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Fixed Block Count

Use the same reduce-then-scan strategy, butFix the number of blocks at C, set M = N

C

Fix a work-group size GC should be tuned so that C × G workitems saturatethe deviceC should be small enough that only 2 levels arerequired

Page 79: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Prefix-Summing a Block

Each block has size M but workgroups only have Gworkitems. How does a workgroup prefix-sum a block?Serially. In sub-blocks of size G or 2G.

Page 80: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Prefix-Summing a Block

Each block has size M but workgroups only have Gworkitems. How does a workgroup prefix-sum a block?Serially. In sub-blocks of size G or 2G.

Page 81: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Prefix-Summing a Block

Each block has size M but workgroups only have Gworkitems. How does a workgroup prefix-sum a block?Serially. In sub-blocks of size G or 2G.

Page 82: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Advantages

Only three kernel instances, two of which use the fullGPUOnly O(N log G) barrier instructionsOnly O(C) extra global memory

Page 83: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Advantages

Only three kernel instances, two of which use the fullGPUOnly O(N log G) barrier instructionsOnly O(C) extra global memory

Page 84: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Advantages

Only three kernel instances, two of which use the fullGPUOnly O(N log G) barrier instructionsOnly O(C) extra global memory

Page 85: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Summary

Parallel prefix sum is hard workGPUs need parallelism, but algorithm works best withleast parallelismWith good implementation, can be bandwidth-limited

Page 86: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Summary

Parallel prefix sum is hard workGPUs need parallelism, but algorithm works best withleast parallelismWith good implementation, can be bandwidth-limited

Page 87: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Summary

Parallel prefix sum is hard workGPUs need parallelism, but algorithm works best withleast parallelismWith good implementation, can be bandwidth-limited

Page 88: Prefix sums on GPUs - gpu.cs.uct.ac.za

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

References

Guy E. Blelloch.Prefix sums and their applications.Technical Report CMU-CS-90-190, Computer ScienceDepartment, Carnegie Mellon University, November1990.

Duane Merrill and Andrew Grimshaw.Parallel scan for stream architectures.Technical Report CS2009-14, Department of ComputerScience, University of Virginia, December 2009.