

    OpenCL Optimization Case Study: Simple Reductions

    Reductions, which take a vector of data and reduce it to a single element, are widely used in data-parallel programming. In this article, we examine strategies for efficiently mapping reductions onto the ATI Radeon HD 5870 GPU and AMD Phenom II X4 965 CPU. Taking advantage of properties of the reduction being performed, as well as matching the style of reduction to the hardware platform, can result in performance improvements of up to 15x, compared to naive code.

    Bryan Catanzaro 8/24/2010


    OpenCL allows developers to write portable, high-performance code that can target all varieties of parallel processing platforms, including AMD CPUs and GPUs. As with any parallel programming model, achieving good efficiency requires careful attention to how the computation is mapped to the hardware platform and executed. Since performance is a prime motivation for using OpenCL, understanding the issues that arise when optimizing OpenCL code is a natural part of learning how to use OpenCL itself.

    This article discusses simple reductions. A reduction is a very simple operation that takes an array of data and reduces it down to a single element,

    for example by summing all the elements in the array. Consider this simple C code, which sums all the elements in an array:

    float reduce_sum(float* input, int length) {
      // Start from the first element, then fold in the rest front to back
      float accumulator = input[0];
      for(int i = 1; i < length; i++)
        accumulator += input[i];
      return accumulator;
    }


    This code is completely sequential! There's no way to parallelize the loop as written, since every iteration of the loop depends on the iteration before it. How can we parallelize this code?

    It turns out that we can parallelize many reductions by taking advantage of the properties of the reduction we're trying to perform. As counterintuitive as it may seem, reductions are a fundamental data-parallel primitive used in many applications, from databases to physical simulation and machine learning. There are many different kinds of reductions, depending on the type of data being reduced and the operator used to perform the reduction. For example, reductions can be used to find the sum of all elements in a vector, find the maximum or minimum element of a vector, or find the index of the maximum or minimum element of a vector.

    The performance of parallel reductions can strongly depend on the details of how the reduction is mapped to a parallel platform. In this article, we will see how selecting the right strategy for reduction can be an order of magnitude faster than using a naive reduction algorithm, on both CPU devices, represented by the AMD Phenom II X4 965 CPU, and GPU devices, represented by the ATI Radeon HD 5870 GPU.


    The simple sequential sum reduction we just saw is not parallel at all: there's a sequential dependency on the accumulator variable that requires this reduction be done in a particular order, from front to back of the input array. However, if the reduction operator gives us some flexibility in the order in which the operations are performed, we can restructure the reduction in a variety of ways. We'll take a look at how to do this by starting from the bottom and going up.

    Associativity and Commutativity

    As we just mentioned, if our reduction operator gives some flexibility in terms of what order the operations must be performed, we can parallelize a sequential reduction. Addition is a common reduction operator that gives us a lot of flexibility. Let's consider summing a vector of three numbers: [10, 20, 30]. The sequential sum would do two additions: ((10 + 20) + 30). But we'd get the same answer if we had grouped the additions differently: (10 + (20 + 30)), or even if we had reordered the additions: ((30 + 10) + 20).

    You've probably heard of these properties before: if an operator allows us to regroup the operations and still get the same result, we call it associative, and if it allows us to reorder the operations and still get the same result, we call it commutative.

    It turns out that these properties are key to parallelizing a reduction. We can take advantage of associativity to divide up the reduction into

    independent pieces, and then combine results from the independent pieces. For example, a+b+c+d = (a+b)+(c+d). (a+b) can be computed in parallel

    with (c+d), and then the two partial reductions combined to complete the reduction. This can be generalized to reductions on vectors of arbitrary

    size by recursively dividing the input vectors, computing partial reductions, and then reducing the partial reductions to form the result.
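The recursive division described above can be sketched on the host in plain C. This is a minimal illustration rather than the article's code: the function name `tree_reduce_sum` and its half-open range arguments are assumptions made for the example. The two recursive calls work on disjoint halves of the input, which is what makes them candidates for parallel execution.

```c
#include <assert.h>
#include <stddef.h>

/* Sum input[lo..hi) by splitting the range in half, reducing each half,
 * and combining the two partial results. Only the grouping of the
 * additions changes; their left-to-right order is preserved, so this
 * relies on associativity alone. The two recursive calls touch disjoint
 * data and could therefore run in parallel. */
static float tree_reduce_sum(const float *input, size_t lo, size_t hi) {
    assert(hi > lo);              /* the range must be non-empty */
    if (hi - lo == 1) {
        return input[lo];         /* a single element is its own sum */
    }
    size_t mid = lo + (hi - lo) / 2;
    float left = tree_reduce_sum(input, lo, mid);
    float right = tree_reduce_sum(input, mid, hi);
    return left + right;
}
```

Calling `tree_reduce_sum(v, 0, 4)` on a four-element array evaluates (v[0]+v[1])+(v[2]+v[3]), the grouping used in the a+b+c+d example.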

    Figure 1: Associative Reduction Tree and SIMD Mapping

    Figure 1 illustrates an associative reduction tree on an eight element vector. This reduction tree does not assume commutativity of the operator,

    since none of the additions are reordered, they are just regrouped.

    Building a reduction for GPU Devices

    Let's build a parallel reduction for the GPU, starting at the OpenCL work-group level. We'll take advantage of associativity to break the vector into small chunks, build an independent reduction tree for each chunk, and execute the trees independently, in parallel. We'll make sure each of the chunks is small enough that it fits in local memory, and then we'll assign one work-item per element.

    At each stage of the reduction tree, we'll be loading and storing partial reductions as we compute, so it's crucial to use local memory to communicate between work-items in the work-group. We'll then execute the reduction tree by using a for loop in conjunction with OpenCL barriers. For example, see the following code, which performs a min reduction to find the smallest element in a vector:


    __kernel
    void reduce(
        __global float* buffer,
        __local float* scratch,
        const int length,
        __global float* result) {
      int global_index = get_global_id(0);
      int local_index = get_local_id(0);
      if (global_index < length) {
        scratch[local_index] = buffer[global_index];
      } else {
        // Infinity is the identity element for the min operation
        scratch[local_index] = INFINITY;
      }
      barrier(CLK_LOCAL_MEM_FENCE);
      for(int offset = 1; offset < get_local_size(0); offset <<= 1) {
        int mask = (offset << 1) - 1;  // active items are multiples of 2 * offset
        if ((local_index & mask) == 0) {
          float other = scratch[local_index + offset];
          float mine = scratch[local_index];
          scratch[local_index] = (mine < other) ? mine : other;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
      }
      if (local_index == 0) {
        result[get_group_id(0)] = scratch[0];
      }
    }



    Figure 2: Commutative Reduction and SIMD Mapping

    With a commutative reduction tree, the active work-items are compacted into contiguous blocks. Assuming the work-group will be mapped onto multiple SIMD wavefronts, as a wavefront becomes completely unused during the reduction, it will be dropped from execution. This leads to better SIMD efficiency: if we assume a wavefront size of four work-items for the tree shown in Figure 2, our SIMD efficiency is 58%, which is about twice the efficiency of the associative reduction. The code needed to do this is basically identical to the associative version, just with the for loop restructured as follows:

    for(int offset = get_local_size(0) / 2;
        offset > 0;
        offset >>= 1) {
      // Only the first `offset` work-items stay active at each step
      if (local_index < offset) {
        float other = scratch[local_index + offset];
        float mine = scratch[local_index];
        scratch[local_index] = (mine < other) ? mine : other;
      }
      barrier(CLK_LOCAL_MEM_FENCE);
    }
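The SIMD efficiency figure can be checked with a small host-side model in C. This is a sketch under stated assumptions, not the article's measurement method: efficiency is taken to be useful operations divided by lane slots occupied, a wavefront is charged for all of its lanes whenever at least one of its work-items is active, and the tree in Figure 2 is assumed to have eight work-items like the one in Figure 1. The function names and the interleaved form of the associative tree (work-item active when its index is a multiple of 2 * offset) are likewise assumptions made for the example.

```c
#include <assert.h>

/* Lane slots used in one step: each wavefront holding at least one
 * active work-item occupies all `wave` of its lanes for that step. */
static int slots_for_step(const int *active, int n, int wave) {
    int slots = 0;
    for (int first = 0; first < n; first += wave) {
        int any_active = 0;
        for (int i = first; i < first + wave && i < n; i++) {
            any_active |= active[i];
        }
        if (any_active) {
            slots += wave;
        }
    }
    return slots;
}

/* SIMD efficiency = useful operations / lane slots occupied, summed over
 * all steps of a tree on n work-items (n must be a power of two).
 * commutative != 0: compacted tree, active when local_index < offset.
 * commutative == 0: interleaved tree, active when local_index is a
 * multiple of 2 * offset (an assumed form of the associative tree). */
static double simd_efficiency(int n, int wave, int commutative) {
    assert(n >= 2 && n <= 1024 && (n & (n - 1)) == 0 && wave > 0);
    int active[1024];
    int useful = 0;
    int slots = 0;
    int offset = commutative ? n / 2 : 1;
    while (offset > 0 && offset < n) {
        for (int i = 0; i < n; i++) {
            active[i] = commutative ? (i < offset)
                                    : ((i & (2 * offset - 1)) == 0);
            useful += active[i];
        }
        slots += slots_for_step(active, n, wave);
        offset = commutative ? offset / 2 : offset * 2;
    }
    return (double)useful / (double)slots;
}
```

Under this model, `simd_efficiency(8, 4, 1)` evaluates to 7/12, which is about 58% and agrees with the figure quoted for the commutative tree; the interleaved tree scores lower because its active work-items stay spread across both wavefronts for longer.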




    Multi-stage Reduction

    Figure 3: Multi-stage Reduction

    So far, we've been discussing parallel reductions on vectors which are short enough to fit in a single work-group. In order to reduce over larger vectors, we need to move to a multi-stage reduction, where we do local reductions, followed by larger reductions. Three possible ways of doing this are:


    1. Recursive multi-stage reductions. In this approach, illustrated in Figure 3, the results produced by each local reduction are gathered into a new array, which is then reduced again, recursively, until a single result remains. This approach expresses the largest amount of parallelism, and it can be written to only take advantage of associativity, for operators which are not commutative. However, it can be less efficient.

    2. Two-stage reductions. In this approach, which we will explain in greater detail later, we express just enough parallelism to fill the machine, and then follow with a final global reduction. Taking advantage of commutativity, we can then perform most of the work sequentially, which improves efficiency compared to the fully-parallel multi-stage reduction. Additionally, we only have to launch two kernels per reduction.

    3. Reductions using atomics. Instead of using an explicitly multi-stage algorithm, you can use atomic memory operations in OpenCL, such as atom_add(), to reduce the partial results from each local reduction. Of course, atomic transactions will limit you to the operators and data types which are supported by the platform you're targeting. Many applications need reductions which are not supported by atomic operations, so we'll just mention that they can be useful in some situations, but won't give details in this article.
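The first approach in the list above, the recursive multi-stage reduction, can be emulated on the host as a loop of passes. This is a rough sketch and not the article's implementation: `min_pass` stands in for one kernel launch in which every work-group reduces its own chunk, and the function names, the `group` parameter, and the in-place reuse of the buffer are assumptions made for the example.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* One pass: every chunk of `group` elements is reduced to a single
 * minimum, standing in for one work-group's local reduction. Returns
 * the number of partial results written to `out`. */
static size_t min_pass(const float *in, size_t n, size_t group, float *out) {
    size_t groups = (n + group - 1) / group;
    for (size_t g = 0; g < groups; g++) {
        float m = INFINITY;   /* identity element for min */
        for (size_t i = g * group; i < n && i < (g + 1) * group; i++) {
            m = (m < in[i]) ? m : in[i];
        }
        out[g] = m;
    }
    return groups;
}

/* Recursive multi-stage reduction: keep reducing the array of partial
 * results until one value remains. `buf` is overwritten. Writing the
 * partials back into `buf` is safe here because out[g] is stored only
 * after chunk g has been read, and later chunks start beyond index g.
 * `passes` receives the number of passes (kernel launches) needed. */
static float multistage_min(float *buf, size_t n, size_t group, int *passes) {
    assert(n > 0 && group > 1);
    *passes = 0;
    while (n > 1) {
        n = min_pass(buf, n, group, buf);
        (*passes)++;
    }
    return buf[0];
}
```

With ten elements and a group size of four, the first pass leaves three partial results and the second pass leaves one, so the number of launches grows with the logarithm of the input size.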

    Recursive Multi-stage Reduction Performance

    The recursive, multi-stage reduction is often the first reduction approach people try, since it's maximally parallel. We often think that since GPUs are highly parallel processors that can process thousands of work-items, we should parallelize things as much as possible. As we'll see, sometimes it's better to choose a more serial approach.

    Figure 4 shows the performance achieved on the ATI Radeon HD 5870 GPU, for multi-stage reduction kernels which find the minimum element in vectors of floats. The purely associative reduction averages 1.6 GigaReductions/second for large input sizes, while the multi-stage commutative kernel, which permutes the reduction tree to better fit the SIMD nature of the GPU, averages 2.1 GigaReductions/second.

    Figure 4: Multi-stage GPU Performance

    To analyze how well this performs, we note that for a vector of n elements, there are n-1 operations which must be performed to complete the reduction. This means that our reduction kernels are performing at 1.6 or 2.1 GFLOP/s. All n elements need to be loaded from off-chip memory, which means that the peak performance we could expect if we were limited by off-chip memory bandwidth would be:

    157.6 GB/s * 1 SPFLOP/4 B = 39.4 SPGFLOP/s

    Accordingly, our multi-stage reductions are performing at about 5% of peak performance, which shows we've got quite a bit of room to improve.
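The bandwidth-bound arithmetic above can be written as a pair of small C helpers. This is only a sketch of the calculation; the function names are assumptions made for the example, and the inputs are the figures quoted in the text.

```c
#include <assert.h>

/* Upper bound on reduction throughput, in giga-elements per second,
 * when every element must be read once from off-chip memory. */
static double bandwidth_bound(double bandwidth_gb_per_s,
                              double bytes_per_element) {
    assert(bytes_per_element > 0.0);
    return bandwidth_gb_per_s / bytes_per_element;
}

/* Fraction of that bound achieved by a measured reduction rate. */
static double fraction_of_bound(double measured, double bound) {
    assert(bound > 0.0);
    return measured / bound;
}
```

For example, `bandwidth_bound(157.6, 4.0)` gives the 39.4 bound for 4-byte floats, and `fraction_of_bound(2.1, 39.4)` gives roughly 0.05 for the commutative multi-stage kernel.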


    Two-stage Reduction

    All those parallel reductions in the multi-stage reductions we've been discussing are fairly expensive, because they involve lots of synchronization ...

    This observation motivates the two-stage reduction, where the input is divided up into p chunks, where p is large enough to keep all of our processors busy. In OpenCL, each chunk will be processed by a work-group. Taking advantage of commutativity, we can have each work-group process its chunk by iterating over work-group sized pieces, and having every work-item keep a running reduction as it goes. After we've processed the entire array, each work-group writes out a single reduction result, which we assemble into another array, and then reduce with a final reduction.


    Technically, we could do a two-stage reduction without taking advantage of commutativity by having each work-item sequentially reduce a large contiguous block of the array, and then finishing with a parallel reduction in each work-group, followed by a final reduction call. However, it is difficult to make this approach efficient, since each work-item would be loading data from a separate region of memory. Because the loads from a wavefront of work-items are then not contiguous, bandwidth utilization drops substantially. We'll leave it as an exercise to the reader how to create an efficient two-stage, associative reduction, but will note that since a great number of reduction operators are commutative anyway, this problem is mostly just a curiosity.

    Figure 5: Two-stage Reduction

    Figure 5 illustrates how the two-stage reduction works. The vector is divided into work-group size chunks, which are then parceled out to the compute units on the device. In this example, we show a parallelization for two work-groups, a blue work-group and a green work-group. The work-groups loop over their chunks of the input vector, performing sequential reductions. Once the entire vector has been processed, the work-groups perform parallel reductions to finish out the first stage. The second stage of the reduction performs a parallel reduction on all the partial results from each work-group.


    __kernel
    void reduce(__global float* buffer,
                __local float* scratch,
                const int length,
                __global float* result) {

      int global_index = get_global_id(0);
      float accumulator = INFINITY;
      // Loop sequentially over chunks of input vector
      while (global_index < length) {
        float element = buffer[global_index];
        accumulator = (accumulator < element) ? accumulator : element;
        global_index += get_global_size(0);
      }

      // Perform parallel reduction
      int local_index = get_local_id(0);
      scratch[local_index] = accumulator;
      barrier(CLK_LOCAL_MEM_FENCE);
      for(int offset = get_local_size(0) / 2;
          offset > 0;
          offset >>= 1) {
        if (local_index < offset) {
          float other = scratch[local_index + offset];
          float mine = scratch[local_index];
          scratch[local_index] = (mine < other) ? mine : other;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
      }
      // One result per work-group, reduced again in the second stage
      if (local_index == 0) {
        result[get_group_id(0)] = scratch[0];
      }
    }



    By launching only enough work-groups to fill the compute device, we ensure that most of the reductions happen sequentially, which maximizes SIMD efficiency and drastically improves performance. In the fully-parallel, recursive reduction style mentioned above, during the first reduction phase, for a vector of length n elements, the number of parallel reductions we have to do is proportional to n. Using the two-stage reduction style, we only perform a constant number of parallel reductions, regardless of how large our input is, since we only do parallel reductions for as many work-groups as we need to fill the compute device. This makes the reduction much more efficient.

    For this experiment, we found that launching 80 work-groups was a good choice for the ATI Radeon HD 5870 GPU. There are 20 compute units on the GPU, so launching 80 work-groups provides for simultaneous scheduling of multiple work-groups on the same compute unit, which can help cover memory latencies during execution.
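To make the data flow of the two-stage scheme concrete, here is a host-side emulation in C. It is a sketch, not the article's host code: the function name `two_stage_min`, the `num_groups` and `group_size` parameters, and the use of one flat scratch array in place of per-group local memory are assumptions made for the example. Each emulated work-item strides through the input by the global size, each group then combines its accumulators with a halving tree, and a final loop reduces the per-group results.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdlib.h>

/* Host-side emulation of a two-stage min reduction.
 * group_size must be a power of two for the halving tree. */
static float two_stage_min(const float *in, size_t n,
                           size_t num_groups, size_t group_size) {
    assert(num_groups > 0 && group_size > 0);
    assert((group_size & (group_size - 1)) == 0);
    size_t global_size = num_groups * group_size;
    float *scratch = malloc(global_size * sizeof *scratch);
    assert(scratch != NULL);

    /* Stage 1a: every work-item keeps a running min over a strided
     * slice of the input (elements gid, gid + global_size, ...). */
    for (size_t gid = 0; gid < global_size; gid++) {
        float acc = INFINITY;
        for (size_t i = gid; i < n; i += global_size) {
            acc = (acc < in[i]) ? acc : in[i];
        }
        scratch[gid] = acc;
    }

    /* Stage 1b: one halving tree per work-group. */
    for (size_t g = 0; g < num_groups; g++) {
        float *s = scratch + g * group_size;
        for (size_t offset = group_size / 2; offset > 0; offset >>= 1) {
            for (size_t l = 0; l < offset; l++) {
                s[l] = (s[l] < s[l + offset]) ? s[l] : s[l + offset];
            }
        }
    }

    /* Stage 2: reduce the one partial result left by each group. */
    float result = INFINITY;
    for (size_t g = 0; g < num_groups; g++) {
        float partial = scratch[g * group_size];
        result = (result < partial) ? result : partial;
    }
    free(scratch);
    return result;
}
```

In this emulation, the 80 work-groups mentioned above would correspond to `num_groups = 80`; the number of tree reductions stays at 80 no matter how large `n` becomes, while the strided loops absorb the rest of the work.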

    Two-stage Reduction Performance

    Figure 6: Two-stage GPU Performance

    Figure 6 shows the performance we achieve using the two-stage approach. As you can see, it's quite a bit faster, averaging 27.3 GigaReductions/second for large vectors. This is 70% of our peak bandwidth-bound performance, quite a big difference. Vectorizing this kernel so that it performs operations on float4 vectors instead of just float elements improved the performance even more, bringing us to 30.3 GigaReductions/second, or 77% of our bound. It's important to note that vectorizing reductions requires that the reduction operator be commutative, since the vectorization process inherently reorders the reduction tree.
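The reordering introduced by vectorization can be seen in a small C emulation of a four-lane accumulator. This is a sketch only: the `vec4` struct stands in for an OpenCL float4, and the type and function names are assumptions made for the example rather than the article's vectorized kernel.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Four-lane accumulator standing in for an OpenCL float4. */
typedef struct { float lane[4]; } vec4;

/* Min-reduce n floats four at a time. Lane k accumulates elements
 * k, k + 4, k + 8, ..., so elements are combined in a different order
 * than a front-to-back loop would use. That reordering is only valid
 * when the operator is commutative as well as associative. */
static float vectorized_min(const float *in, size_t n) {
    assert(in != NULL);
    vec4 acc = {{INFINITY, INFINITY, INFINITY, INFINITY}};
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (int k = 0; k < 4; k++) {
            float e = in[i + k];
            acc.lane[k] = (acc.lane[k] < e) ? acc.lane[k] : e;
        }
    }
    /* Leftover elements that do not fill a whole vector. */
    float result = INFINITY;
    for (; i < n; i++) {
        result = (result < in[i]) ? result : in[i];
    }
    /* Horizontal reduction across the four lanes. */
    for (int k = 0; k < 4; k++) {
        result = (result < acc.lane[k]) ? result : acc.lane[k];
    }
    return result;
}
```

For min the reordering is harmless, which is why the float4 version can be used here; an operator that is associative but not commutative could not be split across lanes this way.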

    This result is perhaps somewhat counterintuitive: switching from the most parallel algorithm we could devise for reduction to a more serial algorithm brought us great efficiency benefits. Despite the fact that GPU hardware needs a lot of parallelism to perform well, it turns out that for some algorithms, it's best to choose only a degree of parallelism which fills the compute device, but no more.

    Reductions on CPU Devices



    Figure 7: CPU Performance

    Figure 7 shows the realized performance of several reduction kernels on the AMD Phenom II X4 965 processor, which has four compute units. Running the two-stage kernel which provided the greatest performance on the GPU brings us 79 MegaReductions/second for large inputs. Our bound is 21.3 GB/s * 1 SPFLOP/4 B = 5.3 SPGFLOP/s.

    This means that the two-stage kernel is yielding only 1.5% of peak performance, which clearly shows room for improvement.

    As we've shown, serial reductions are more efficient than parallel reductions, so the following code attempts to serialize the reduction as much as possible. It is still a two-phase reduction, but it expects that each work-group is a serial work-group with only one work-item. Each work-group performs the reduction serially, and then writes out the reduction for its chunk of the input vector. We can then instantiate a few work-groups, just enough to fill all the compute units of our CPU device, and finish the reduction with a very small, sequential reduction in host code. The following is the OpenCL kernel for this serial reduction.


    __kernel
    void reduce(__global float* buffer,
                const int block,
                const int length,
                __global float* result) {

      // Each single-work-item work-group reduces one contiguous block
      int global_index = get_global_id(0) * block;
      float accumulator = INFINITY;
      int upper_bound = (get_global_id(0) + 1) * block;
      if (upper_bound > length) upper_bound = length;
      while (global_index < upper_bound) {
        float element = buffer[global_index];
        accumulator = (accumulator < element) ? accumulator : element;
        global_index++;
      }
      result[get_group_id(0)] = accumulator;
    }
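The host side of this scheme, choosing a block size and finishing with a short sequential reduction, can be sketched in plain C. This is an emulation under stated assumptions rather than the article's host code: `serial_block_min` mirrors what one single-work-item work-group computes, `cpu_style_min` plays the role of the host, and both names, along with the round-up rule for the block size, are assumptions made for the example.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Emulates one single-work-item work-group: reduce the contiguous
 * block [id * block, min((id + 1) * block, length)). */
static float serial_block_min(const float *buffer, size_t id,
                              size_t block, size_t length) {
    size_t index = id * block;
    size_t upper = (id + 1) * block;
    if (upper > length) {
        upper = length;
    }
    float acc = INFINITY;   /* identity element for min */
    while (index < upper) {
        acc = (acc < buffer[index]) ? acc : buffer[index];
        index++;
    }
    return acc;
}

/* Split the input into one block per compute unit (rounding the block
 * size up so the whole input is covered), then finish with a short
 * sequential reduction over the per-block results on the host. */
static float cpu_style_min(const float *buffer, size_t length,
                           size_t compute_units) {
    assert(compute_units > 0);
    size_t block = (length + compute_units - 1) / compute_units;
    float result = INFINITY;
    for (size_t id = 0; id < compute_units; id++) {
        float partial = serial_block_min(buffer, id, block, length);
        result = (result < partial) ? result : partial;
    }
    return result;
}
```

With four compute units and ten elements, the blocks cover index ranges [0,3), [3,6), [6,9), and [9,10), and the final host loop combines just four partial results.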


    This code performs much better on our CPU device. When we launch a kernel with only one work-group, using just one of the four compute units the AMD Phenom CPU provides, we average 1.2 GigaReductions/second, or 22% of our bound for this device. When we parallelize the reduction across all the compute units of our CPU device, we average 3.0 GigaReductions/second, or 57% of our bound. Vectorizing the reductions improves ... been using in these code examples. The ?: operator, when used on the CPU on a SIMD vector, introduces control flow into the SIMD vector, causing the OpenCL compiler to emit non-vectorized code, resulting in performance losses instead of performance gains.

    Overall, getting 62% of our bandwidth-limited performance shows that OpenCL can provide good performance on CPU devices as well as GPU devices.



    When we started this article, we were faced with a challenge: how to take a sequential reduction loop and parallelize it. We took a look at how associativity and commutativity allow us to restructure a sequential loop into reduction trees, and then looked at several strategies for building efficient reduction trees. Perhaps surprisingly, we found that the most parallel reduction trees were also very inefficient, because they required lots of communication and synchronization, which is expensive on parallel platforms. We then found that performing the reduction as serially as possible provided the best performance, both on the CPU and on the GPU. We saw a 15x performance improvement on the GPU by taking advantage of commutativity to reduce the number of local parallel reductions we executed, compared to the fully parallel, recursive reduction. We saw a 2.8x performance improvement on the CPU by using all the cores and the SIMD units, compared to a sequential reduction.

    Since reductions are such an important part of data-parallel programming, many OpenCL programmers will encounter the need to write them at some point. Hopefully this article has given you some ideas about what approaches will work well for your problem, whether on the CPU or the GPU.


    OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.


    © 2014 Advanced Micro Devices, Inc.
