
OpenCL Optimization Case Study: Simple Reductions

Reductions, which take a vector of data and reduce it to a single element, are widely used in data-parallel programming. In this article, we examine strategies for efficiently mapping reductions onto the ATI Radeon HD 5870 GPU and the AMD Phenom II X4 965 CPU. Taking advantage of properties of the reduction being performed, as well as matching the style of reduction to the hardware platform, can result in performance improvements of up to 15x, compared to naive code.

    Bryan Catanzaro 8/24/2010

    Introduction

OpenCL allows developers to write portable, high-performance code that can target all varieties of parallel processing platforms, including AMD CPUs and GPUs. As with any parallel programming model, achieving good efficiency requires careful attention to how the computation is mapped to the hardware platform and executed. Since performance is a prime motivation for using OpenCL, understanding the issues which arise when optimizing OpenCL code is a natural part of learning how to use OpenCL itself.

    This article discusses simple reductions. A reduction is a very simple operation that takes an array of data and reduces it down to a single element,

    for example by summing all the elements in the array. Consider this simple C code, which sums all the elements in an array:

    float reduce_sum(float* input, int length) {
        float accumulator = input[0];
        for (int i = 1; i < length; i++)
            accumulator += input[i];
        return accumulator;
    }

This code is completely sequential! There's no way to parallelize the loop, since every iteration of the loop depends on the iteration before it. How

    can we parallelize this code?

It turns out that we can parallelize many reductions by taking advantage of the properties of the reduction we're trying to perform. As counterintuitive as it may seem, reductions are a fundamental data-parallel primitive used in many applications from databases to physical simulation

    and machine learning. There are many different kinds of reductions, depending on the type of data being reduced and the operator which is being

    used to perform the reduction. For example, reductions can be used to find the sum of all elements in a vector, find the maximum or minimum

    element of a vector, or find the index of the maximum or minimum element of a vector.

    The performance of parallel reductions can strongly depend on the details of how the reduction is mapped to a parallel platform. In this article, we

will see how selecting the right reduction strategy can be an order of magnitude faster than a naive reduction algorithm, both on CPU devices, represented by the AMD Phenom II X4 965 CPU, and on GPU devices, represented by the ATI Radeon HD 5870 GPU.

    Reduction

The simple sequential sum reduction we just saw is not parallel at all: there's a sequential dependency on the accumulator variable that requires this reduction be done in a particular order, from front to back of the input array. However, if the reduction operator gives us some flexibility in the order of operations, we can perform the reduction in a variety of ways. We'll take a look at how to do this by starting from the bottom and going up.

    Associativity and Commutativity

As we just mentioned, if our reduction operator gives us some flexibility in the order in which the operations are performed, we can parallelize a sequential reduction. Addition is a common reduction operator that gives us a lot of flexibility; let's consider summing a vector of three numbers: [10, 20, 30]. The sequential sum would do two additions: ((10 + 20) + 30). But we'd get the same answer if we had grouped the additions differently: (10 + (20 + 30)), or even if we had reordered the additions: ((30 + 10) + 20). You've probably heard of these properties before: if an operator allows us to regroup the operations and still get the same result, we call it associative, and if it allows us to reorder the operations and still get the same result, we call it commutative.

    It turns out that these properties are key to parallelizing a reduction. We can take advantage of associativity to divide up the reduction into

    independent pieces, and then combine results from the independent pieces. For example, a+b+c+d = (a+b)+(c+d). (a+b) can be computed in parallel

    with (c+d), and then the two partial reductions combined to complete the reduction. This can be generalized to reductions on vectors of arbitrary

    size by recursively dividing the input vectors, computing partial reductions, and then reducing the partial reductions to form the result.

    Figure 1: Associative Reduction Tree and SIMD Mapping

Figure 1 illustrates an associative reduction tree on an eight-element vector. This reduction tree does not assume commutativity of the operator, since none of the additions are reordered; they are just regrouped.
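To make the tree concrete, here is a minimal sequential C sketch of this regrouping (our illustration, not code from the original article; the helper name tree_reduce_sum is ours). Each pass combines partial results that are a fixed stride apart, exactly the pairwise regrouping of Figure 1, and the iterations within a pass are independent, so a parallel platform could run them concurrently:

    /* Pairwise (tree) sum: regroups additions as ((a+b) + (c+d)) + ...,
       relying only on associativity; operands are never reordered.
       Reduces in place in ceil(log2(length)) passes. */
    float tree_reduce_sum(float* data, int length) {
        for (int stride = 1; stride < length; stride *= 2) {
            /* Each iteration of this inner loop is independent of the
               others, so they can all execute in parallel. */
            for (int i = 0; i + stride < length; i += 2 * stride) {
                data[i] += data[i + stride];
            }
        }
        return data[0];
    }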

    Building a reduction for GPU Devices

Let's build a parallel reduction for the GPU, starting at the OpenCL work-group level. We'll take advantage of associativity to break the vector into small chunks, build an independent reduction tree for each chunk, and execute the trees independently, in parallel. We'll make sure each chunk is small enough to fit in local memory, and then we'll assign one work-item per element.

At each stage of the reduction tree, we'll be loading and storing partial reductions as we compute, so it's crucial to use local memory to communicate between work-items in the work-group. We'll then execute the reduction tree by using a for loop in conjunction with OpenCL barriers. For example, see the following code, which performs a min reduction to find the smallest element in a vector:

    __kernel
    void reduce(__global float* buffer,
                __local float* scratch,
                __const int length,
                __global float* result) {

        int global_index = get_global_id(0);
        int local_index = get_local_id(0);
        // Load one element per work-item into local memory
        if (global_index < length) {
            scratch[local_index] = buffer[global_index];
        } else {
            // Infinity is the identity element for the min operation
            scratch[local_index] = INFINITY;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int offset = 1;
             offset < get_local_size(0);
             offset <<= 1) {
            // Only work-items whose index is a multiple of 2*offset combine;
            // operands keep their original order, so only associativity is used.
            int mask = (offset << 1) - 1;
            if ((local_index & mask) == 0) {
                float other = scratch[local_index + offset];
                float mine = scratch[local_index];
                scratch[local_index] = (mine < other) ? mine : other;
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (local_index == 0) {
            result[get_group_id(0)] = scratch[0];
        }
    }

As this associative reduction tree executes, the active work-items become progressively sparser and more spread out across the work-group, so many SIMD lanes do no useful work. If the operator is also commutative, we can reorder the operands and restructure the tree so that the active work-items stay packed together.

    Figure 2: Commutative Reduction and SIMD Mapping

With a commutative reduction tree, the active work-items are compacted into contiguous blocks. Assuming the work-group will be mapped onto multiple SIMD wavefronts, as a wavefront becomes completely unused during the reduction, it will be dropped from execution. This leads to better SIMD efficiency: if we assume a wavefront size of four work-items for the tree shown in Figure 2, our SIMD efficiency is 58%, which is about twice the efficiency of the associative reduction. The code needed to do this is basically identical to the associative version, just with the for loop restructured as follows:

    for (int offset = get_local_size(0) / 2;
         offset > 0;
         offset >>= 1) {
        if (local_index < offset) {
            float other = scratch[local_index + offset];
            float mine = scratch[local_index];
            scratch[local_index] = (mine < other) ? mine : other;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    Multi-stage Reduction

    Figure 3: Multi-stage Reduction

So far, we've been discussing parallel reductions on vectors which are short enough to fit in a single work-group. In order to reduce over larger vectors, we need to move to a multi-stage reduction, where we do local reductions followed by larger reductions. Three possible ways of doing this are:

1. Recursive multi-stage reductions. In this approach, illustrated in Figure 3, the results produced by each local reduction are gathered into a new, smaller vector, which is then reduced in the same way, repeating until a single result remains. This approach expresses the largest amount of parallelism, and it can be written to take advantage of associativity alone, for operators which are not commutative. However, it can be less efficient.

2. Two-stage reductions. In this approach, which we will explain in greater detail later, we express just enough parallelism to fill the machine, and then follow with a final global reduction. Taking advantage of commutativity, we can then perform most of the work sequentially, which improves efficiency compared to the fully-parallel multi-stage reduction. Additionally, we only have to launch two kernels per reduction.

3. Reductions using atomics. Instead of using an explicitly multi-stage algorithm, you can use atomic memory operations in OpenCL, such as atom_add(), to reduce the partial results from each local reduction. Of course, atomic transactions will limit you to the operators and data types which are supported by the platform you're targeting. Many applications need reductions which are not supported by atomic operations, so we'll just mention that they can be useful in some situations (see the sketch after this list), but won't give details in this article.
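As an aside, a minimal sketch of the atomics approach for an integer min reduction might look like the following. The kernel name is ours, and we use int data because atomic_min is only defined for integer types in core OpenCL; the host would initialize *result to INT_MAX before launch:

    __kernel
    void reduce_atomic_min(__global const int* buffer,
                           __const int length,
                           __global int* result) {
        int i = get_global_id(0);
        if (i < length) {
            // Each work-item folds its element directly into the single
            // global result; the hardware serializes conflicting updates.
            atomic_min(result, buffer[i]);
        }
    }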

    Recursive Multi-stage Reduction Performance

The recursive, multi-stage reduction is often the first reduction approach people try, since it's maximally parallel. We often think that since GPUs are highly parallel processors that can process thousands of work-items, we should parallelize things as much as possible. As we'll see, sometimes it's better to choose a more serial approach.

Figure 4 shows the performance achieved on the ATI Radeon HD 5870 GPU for multi-stage reduction kernels which find the minimum element in vectors of floats. The purely associative reduction averages 1.6 GigaReductions/second for large input sizes, while the multi-stage commutative kernel, which permutes the reduction tree to better fit the SIMD nature of the GPU, averages 2.1 GigaReductions/second.

    Figure 4: Multi-stage GPU Performance

To analyze how well this performs, we note that for a vector of n elements, there are n-1 operations which must be performed to complete the reduction. This means that our reduction kernels are performing at 1.6 or 2.1 GFLOP/s. All n elements need to be loaded from off-chip memory, which means that the peak performance we could expect if we were off-chip memory bandwidth limited would be:

    157.6 GB/s * 1 SP FLOP / 4 B = 39.4 SP GFLOP/s

Accordingly, our multi-stage reductions are performing at about 5% of peak performance, which shows we've got quite a bit of room to improve

    performance.

    Two-stage Reduction

All those parallel reductions in the multi-stage reductions we've been discussing are fairly expensive, because they involve lots of synchronization and communication between work-items, while a sequential running reduction kept by a single work-item costs almost nothing.

This observation motivates the two-stage reduction, where the input is divided up into p chunks, where p is large enough to keep all of our processors busy. In OpenCL, each chunk will be processed by a work-group. Taking advantage of commutativity, we can have each work-group process its chunk by iterating over work-group sized pieces, and having every work-item keep a running reduction as it goes. After we've processed the entire array, each work-group writes out a single reduction result, which we assemble into another array and then reduce with a final reduction call.

Technically, we could do a two-stage reduction without taking advantage of commutativity, by having each work-item sequentially reduce a large contiguous block of the array, and then finishing with a parallel reduction in each work-group, followed by a final reduction call. However, it is difficult to make this approach efficient: each work-item would be loading data from a separate region of memory, so the loads from a wavefront of work-items would not be contiguous, which reduces bandwidth utilization substantially. We'll leave it as an exercise for the reader to create an efficient two-stage, associative reduction, but will note that since a great number of reduction operators are commutative anyway, this problem is mostly just a curiosity.

    Figure 5: Two-stage Reduction

Figure 5 illustrates how the two-stage reduction works. The vector is divided into work-group sized chunks, which are then parceled out to the compute units on the device. In this example, we show a parallelization for two work-groups, a blue work-group and a green work-group. The work-groups loop over their chunks of the input vector, performing sequential reductions. Once the entire vector has been processed, the work-groups perform parallel reductions to finish out the first stage. The second stage of the reduction performs a parallel reduction on all the partial results from each work-group.

    __kernel
    void reduce(__global float* buffer,
                __local float* scratch,
                __const int length,
                __global float* result) {

        int global_index = get_global_id(0);
        float accumulator = INFINITY;
        // Loop sequentially over chunks of the input vector
        while (global_index < length) {
            float element = buffer[global_index];
            accumulator = (accumulator < element) ? accumulator : element;
            global_index += get_global_size(0);
        }

        // Perform parallel reduction within the work-group
        int local_index = get_local_id(0);
        scratch[local_index] = accumulator;
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int offset = get_local_size(0) / 2;
             offset > 0;
             offset >>= 1) {
            if (local_index < offset) {
                float other = scratch[local_index + offset];
                float mine = scratch[local_index];
                scratch[local_index] = (mine < other) ? mine : other;
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (local_index == 0) {
            result[get_group_id(0)] = scratch[0];
        }
    }

    By launching only enough work-groups to fill the compute device, we ensure that most of the reductions happen sequentially, which maximizes

    SIMD efficiency and drastically improves performance. In the fully-parallel, recursive reduction style mentioned above, during the first reduction

phase, for a vector of length n, the number of parallel reductions we have to do is proportional to n. Using the two-stage reduction style,

    we only perform a constant number of parallel reductions, regardless of how large our input is, since we only do parallel reductions for as many

    work-groups as we need to fill the compute device. This makes the reduction much more efficient.

For this experiment, we found that launching 80 work-groups was a good choice for the ATI Radeon HD 5870 GPU. There are 20 compute units

    on the GPU, so launching 80 work-groups provides for simultaneous scheduling of multiple work-groups on the same compute unit, which can help

    cover memory latencies during execution.
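For illustration, a host-side launch under these assumptions might look like the sketch below; the work-group size of 256 and all the variable names are our choices, not values given in the article:

    #include <CL/cl.h>

    /* Sketch: enqueue the two-stage kernel with 80 work-groups of 256
       work-items each; queue, kernel, and buffers are created elsewhere. */
    void launch_first_stage(cl_command_queue queue, cl_kernel kernel,
                            cl_mem input, cl_mem partials, cl_int length) {
        size_t local_size = 256;              /* assumed work-group size */
        size_t global_size = 80 * local_size; /* 80 groups fill 20 compute units */

        clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
        clSetKernelArg(kernel, 1, local_size * sizeof(float), NULL); /* __local scratch */
        clSetKernelArg(kernel, 2, sizeof(cl_int), &length);
        clSetKernelArg(kernel, 3, sizeof(cl_mem), &partials);  /* one result per group */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, &local_size, 0, NULL, NULL);
        /* A second launch with a single work-group then reduces the 80 partials. */
    }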

    Two-stage Reduction Performance

    Figure 6: Two-stage GPU Performance

Figure 6 shows the performance we achieve using the two-stage approach. As you can see, it's quite a bit faster, averaging 27.3 GigaReductions/second for large vectors. This is 70% of our peak bandwidth-bound performance, quite a big difference. Vectorizing this kernel so that it performs operations on float4 vectors instead of scalar float elements improved the performance even more, bringing us to 30.3 GigaReductions/second, or 77% of our bound. It's important to note that vectorizing reductions requires that the reduction operator be commutative, since the vectorization process inherently reorders the reduction tree.
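As a rough sketch of what the vectorized accumulation loop could look like (our illustration; the article does not show its float4 code), each work-item keeps a float4 running minimum and collapses the four lanes at the end. fmin is the OpenCL built-in that applies component-wise to vector types, and for simplicity we assume length is a multiple of 4:

    // Inside the kernel: buffer4 is the input viewed as __global float4*.
    float4 acc4 = (float4)(INFINITY);
    while (global_index < length / 4) {
        acc4 = fmin(acc4, buffer4[global_index]);  // component-wise min
        global_index += get_global_size(0);
    }
    // Collapsing the four lanes reorders operands, which is why the
    // operator must be commutative for vectorization to be valid.
    float accumulator = fmin(fmin(acc4.x, acc4.y), fmin(acc4.z, acc4.w));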

    This result is perhaps somewhat counterintuitive: switching from the most parallel algorithm we could devise for reduction to a more serial

    algorithm brought us great efficiency benefits. Despite the fact that GPU hardware needs a lot of parallelism to perform well, it turns out that for

some algorithms, it's best to choose only a degree of parallelism which fills the compute device, but no more.

    Reductions on CPU Devices


    Figure 7: CPU Performance

Figure 7 shows the realized performance of several reduction kernels on the AMD Phenom II X4 965 processor, which has four compute units.

    Running the two-stage kernel which provided the greatest performance on the GPU brings us 79 MegaReductions/second for large inputs. Our

    bound is 21.3 GB/s * 1 SPFLOP/4 B = 5.3 SPGFLOP/s.

    This means that the two-stage kernel is yielding only 1.5% of peak performance, which clearly shows room for improvement.

As we've shown, serial reductions are more efficient than parallel reductions, so the following code attempts to serialize the reduction as much as possible. It is still a two-stage reduction, but it expects that each work-group is a serial work-group with only one work-item. Each work-group performs the reduction serially, and then writes out the reduction for its chunk of the input vector. We can then instantiate a few work-groups, just enough to fill all the compute units of our CPU device, and finish the reduction with a very small, sequential reduction in host code (a sketch of that host-side step follows the kernel). The following is the OpenCL kernel for this serial reduction.

    __kernel
    void reduce(__global float* buffer,
                __const int block,
                __const int length,
                __global float* result) {

        int global_index = get_global_id(0) * block;
        float accumulator = INFINITY;
        // Clamp this work-item's block to the end of the vector
        int upper_bound = (get_global_id(0) + 1) * block;
        if (upper_bound > length) upper_bound = length;
        while (global_index < upper_bound) {
            float element = buffer[global_index];
            accumulator = (accumulator < element) ? accumulator : element;
            global_index++;
        }
        result[get_group_id(0)] = accumulator;
    }
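The final host-side step is then a very small sequential loop over the per-work-group partial results. A minimal sketch (the names are ours) could be:

    /* Finish the reduction on the host: fold the per-work-group partial
       minima, already read back into `partials`, into a single value. */
    float finish_min(const float* partials, int num_groups) {
        float result = partials[0];
        for (int i = 1; i < num_groups; i++) {
            if (partials[i] < result)
                result = partials[i];
        }
        return result;
    }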

    This code performs much better on our CPU device. When we launch a kernel with only one work-group, using just one of the four compute units

the AMD Phenom CPU provides, we average 1.2 GigaReductions/second, or 22% of our bound for this device. When we parallelize the reduction

across all the compute units of our CPU device, we average 3.0 GigaReductions/second, or 57% of our bound. Vectorizing the reductions improves performance further, but only after we replace the conditional assignment we had been using in these code examples. The ?: operator, when used on the CPU on a SIMD vector, introduces control flow into the SIMD vector, causing the OpenCL compiler to emit non-vectorized code, resulting in performance losses instead of performance gains.
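One branch-free alternative, which is our suggestion since the article does not show its replacement, is the OpenCL built-in fmin, which can compile to a SIMD min with no control flow:

    // Branch-free accumulation: fmin works on scalars and vector types alike.
    accumulator = fmin(accumulator, element);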

Overall, getting 62% of our bandwidth-limited performance shows that OpenCL can provide good performance on CPU devices as well as GPU

    devices.

    Conclusion

When we started this article, we were faced with a challenge: how to take a sequential reduction loop and parallelize it. We took a look at how associativity and commutativity allow us to restructure a sequential loop into reduction trees, and then looked at several strategies for building

    efficient reduction trees. Perhaps surprisingly, we found that the most parallel reduction trees were also very inefficient, because they required lots

    of communication and synchronization, which is expensive on parallel platforms. We then found that performing the reduction as serially as

    possible provided the best performance, both on the CPU as well as the GPU. We saw a 15x performance improvement on the GPU by taking

    advantage of commutativity to reduce the number of local parallel reductions we executed, compared to the fully parallel, recursive reduction. We

    saw a 2.8x performance improvement on the CPU by using all the cores and the SIMD units, compared to a sequential reduction.

Since reductions are such an important part of data-parallel programming, many OpenCL programmers will encounter the need to write them at

    some point. Hopefully this article has given you some ideas about what approaches will work well for your problem, whether on the CPU or the

    GPU.

© 2014 Advanced Micro Devices, Inc. OpenCL and the OpenCL logo are trademarks of Apple, Inc., used with permission by Khronos.