Cell Broadband Engine
Software Development Kit 2.1
Accelerated Library Framework
Programmer’s Guide and API Reference
Version 1.1
SC33-8333-01
Note
Before using this information and the product it supports, read the information in “Notices” on page 101.
Second Edition (March 2007)
This edition applies to version 2.1 of the Cell Broadband Engine Software Development Kit and to all
subsequent releases and modifications until otherwise indicated in new editions.
© Copyright International Business Machines Corporation 2006, 2007. All rights reserved.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract
with IBM Corp.
Contents
About this publication . . . . . . . . v
How to send your comments . . . . . . . . . v
Part 1. ALF components . . . . . . . 1
Chapter 1. Overview of ALF external components . . . . . . . . . . 3
Chapter 2. Compute task . . . . . . . 5
Chapter 3. Data transfer list . . . . . . 7
Chapter 4. Work blocks . . . . . . . . 9
Chapter 5. Buffers . . . . . . . . . . 11
Buffer types . . . . . . . . . . . . . . 11
Local memory allocation and address calculations 14
Memory constraints . . . . . . . . . . . 16
Chapter 6. Accelerator data partitioning 17
Chapter 7. Synchronization points . . . 19
Chapter 8. Error handling . . . . . . . 21
Part 2. ALF API reference . . . . . 23
Chapter 9. Host API . . . . . . . . . 25
Basic framework API . . . . . . . . . . . 25
alf_handle_t . . . . . . . . . . . . . 25
ALF_ERR_POLICY_T . . . . . . . . . . 25
alf_configure . . . . . . . . . . . . . 26
alf_query_system_info . . . . . . . . . . 26
alf_init . . . . . . . . . . . . . . . 28
alf_exit . . . . . . . . . . . . . . . 29
alf_register_error_handler . . . . . . . . . 30
Compute task API . . . . . . . . . . . . 31
alf_task_handle_t . . . . . . . . . . . 31
alf_task_context_handle_t . . . . . . . . . 31
alf_task_info_t . . . . . . . . . . . . 31
alf_task_create . . . . . . . . . . . . 32
alf_task_context_create . . . . . . . . . 33
alf_task_context_add_entry . . . . . . . . 35
alf_task_context_register . . . . . . . . . 36
alf_task_query . . . . . . . . . . . . 36
alf_task_wait . . . . . . . . . . . . . 37
alf_task_destroy . . . . . . . . . . . . 38
Work block API . . . . . . . . . . . . . 39
Data structures . . . . . . . . . . . . 39
alf_wb_create . . . . . . . . . . . . . 39
alf_wb_enqueue . . . . . . . . . . . . 40
alf_wb_add_parm . . . . . . . . . . . 41
alf_wb_add_io_buffer . . . . . . . . . . 41
alf_wb_sync . . . . . . . . . . . . . 42
sync_callback_func . . . . . . . . . . . 44
alf_wb_sync_wait . . . . . . . . . . . 44
Chapter 10. Accelerator API . . . . . . 47
alf_comp_kernel . . . . . . . . . . . . . 47
alf_prepare_input_list . . . . . . . . . . . 47
alf_prepare_output_list . . . . . . . . . . 48
ALF_DT_LIST_CREATE . . . . . . . . . . 49
ALF_DT_LIST_ADD_ENTRY . . . . . . . . 49
Chapter 11. Cell/B.E. architecture platform-dependent API . . . . . . 51
alf_task_info_t_CBEA . . . . . . . . . . . 51
Part 3. Programming with ALF . . . 53
Chapter 12. Understand the problem 55
Chapter 13. Data layout and partition design for the ALF implementation on Cell/B.E. . . . . . . . . . . . . 57
Chapter 14. Double buffering on ALF 59
Chapter 15. ALF host application and data transfer lists . . . . . . . . 61
Chapter 16. Debugging and tuning . . . 63
Chapter 17. Matrix addition example . . 65
Partition scheme . . . . . . . . . . . . . 66
Example compute kernel . . . . . . . . . . 68
The main thread and data transfer lists . . . . . 68
Chapter 18. Matrix transpose example 71
Partition scheme . . . . . . . . . . . . . 71
Example compute kernel . . . . . . . . . . 73
The main thread and data transfer lists . . . . . 73
Debugging and tuning . . . . . . . . . . . 76
Chapter 19. Vector min-max example 79
Partition scheme . . . . . . . . . . . . . 80
Task context buffer . . . . . . . . . . . . 80
Overlapped I/O buffer . . . . . . . . . . 81
Barrier . . . . . . . . . . . . . . . . 81
The code list . . . . . . . . . . . . . . 81
© Copyright IBM Corp. 2006, 2007 iii
Part 4. Platform specific constraints for the ALF implementation on Cell/B.E. architecture . . . . . . . . . . 87
Chapter 20. SPU resource reserved and used . . . . . . . . . . . . 89
Chapter 21. Memory constraints . . . . 91
Chapter 22. Data transfer list limitations . . . . . . . . . . . . 93
Part 5. Compile time options . . . . 95
Part 6. Appendixes . . . . . . . . . 97
Appendix. Accessibility features . . . . 99
Notices . . . . . . . . . . . . . . 101
Trademarks . . . . . . . . . . . . . . 103
Terms and conditions . . . . . . . . . . . 104
Related documentation . . . . . . . 105
Glossary . . . . . . . . . . . . . 107
Index . . . . . . . . . . . . . . . 109
iv ALF Programmer’s Guide and API Reference
About this publication
This book provides detailed information regarding the use of the Accelerated
Library Framework APIs. It contains an overview of the Accelerated Library
Framework, detailed reference information about the APIs, and usage information
for programming with the APIs.
For information about the accessibility features of this product, see “Accessibility
features,” on page 99.
Who should use this book
This book is intended for use by accelerated library developers and compute
kernel developers.
Related information
See “Related documentation” on page 105.
How to send your comments
Your feedback is important in helping to provide the most accurate and highest
quality information. If you have any comments about this publication, send your
comments using Resource Link™ at http://www.ibm.com/servers/resourcelink.
Click Feedback on the navigation pane. Be sure to include the name of the book,
the form number of the book, and the specific location of the text you are
commenting on (for example, a page number or table number).
Part 1. ALF components
The Accelerated Library Framework (ALF) application programming interface
(API) provides a set of functions to solve parallel problems on multi-core memory
hierarchy systems. This programmer’s guide addresses the ALF implementation on
the Cell Broadband Engine™ (Cell/B.E.™) architecture.
Overview of ALF
ALF supports the single-program-multiple-data (SPMD) programming style with a
single program running on all allocated accelerator elements at one time. ALF
provides an interface to write data parallel applications without requiring
architecturally dependent code. The ALF APIs are designed to be platform
independent, and currently only the Cell/B.E. implementation is supported.
Features of ALF include data transfer management, parallel task management,
double buffering, and data partitioning.
ALF considers a natural division of labor between the two types of processing
elements in a hybrid system: the host element and the accelerator element. Also,
two different types of tasks are defined in a typical parallel program: the control
task and the compute task. The control task resides on the host element, while the
compute task resides on the accelerator element. The PowerPC® Processing
Element (PPE) is considered the host, and the Synergistic Processor Elements (SPEs)
are considered the accelerators. This division of labor enables programmers to
specialize in different parts of a given parallel workload.
ALF defines three different types of work that can be assigned to the following
types of programmers:
Application developer
At the highest level, the application developer programs only at the host
level. Application programmers can use the provided accelerated libraries
without direct knowledge of the inner workings of the hybrid system.
Accelerated library developer
Using the provided ALF APIs, the accelerated library developers provide
the library interfaces to invoke the compute kernels on the accelerators.
Accelerated library developers are responsible for breaking the problem
into the control process running on the host and the compute kernel
running on the accelerators. Accelerated library developers then partition
the input and output into work blocks that ALF can schedule to run on
different accelerators.
Compute kernel developer
At the accelerator level, the compute kernel developer writes optimized
accelerator code. The ALF API provides a common interface for the
compute task to be invoked automatically by the framework.
The ALF APIs were inspired by the observation that many applications targeted for
Cell/B.E. or multi-core computing follow the general usage pattern of breaking up
a set of data into a set of independent tasks, creating a list of data to be computed
by code on the SPE, and then managing the distribution of that data to the various
SPE processes. This type of control process/compute process usage scenario, along
with the corresponding work queue definition, are the fundamental abstractions in
ALF. The framework design also enables a separation of work. Compute kernel
developers focus on the compute process, while the accelerated library developers
focus on the data partitioning strategy. Because the runtime framework handles the
underlying task management, data movement, and error handling, the focus is on
the kernel and the data partitioning, not the direct memory access (DMA) list
creation or the lock management on the work queue.
Chapter 1. Overview of ALF external components
With the provided ALF API, you can create work blocks, put them on a queue, and
the ALF runtime on the host can assign the work blocks to the accelerators.
Figure 1 provides an overview of the different external components in the ALF. The
main programming construct is a compute task that is run in parallel on the
accelerators. Different input data is entered into this compute task, and the
accelerators run the task and return the output data based on the given input. To
run the compute tasks in parallel, the input data and the corresponding output
data are divided into separate portions, called work blocks. The accelerators
process the assigned work blocks and send the output, per work block, back to the
host element.
Figure 1. Overview of ALF (the main application and acceleration library use the host API and ALF runtime on the host; input data is partitioned into work blocks on a work queue, processed by the compute kernel through the ALF runtime and accelerator API on each accelerator, and the output data is returned)
Chapter 2. Compute task
A compute task is constructed by linking the compute kernel code with the ALF
accelerator runtime code. The ALF accelerator runtime code provides the main
entry point and calls the compute kernel code when input data is ready. The
runtime assumes a single default entry point into the compute kernel, alf_comp_kernel.
When data partition descriptions need to be generated on the accelerators, the
runtime supports APIs that enable you to generate your own data transfer
descriptions on accelerators.
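As a sketch of this structure, the following compute kernel sums adjacent pairs of input values. The parameter names (p_task_context, p_parm_ctx_buffer, p_input_buffer, p_output_buffer) follow the buffer names used later in this guide, but the exact prototype, the element type, and the use of the parameter buffer for the element count are illustrative assumptions, not the ALF-defined signature.

```c
#include <stddef.h>

/* Illustrative compute kernel: sums adjacent pairs of floats from the
 * input buffer into the output buffer. The element count is assumed to
 * arrive through the work block parameter buffer. The real entry-point
 * prototype is defined by the ALF headers; this signature is a sketch. */
int alf_comp_kernel(void *p_task_context,
                    void *p_parm_ctx_buffer,
                    void *p_input_buffer,
                    void *p_output_buffer)
{
    (void)p_task_context;                     /* no task context used here */
    size_t n = *(size_t *)p_parm_ctx_buffer;  /* output element count */
    const float *in = (const float *)p_input_buffer;
    float *out = (float *)p_output_buffer;
    for (size_t i = 0; i < n; i++)
        out[i] = in[2 * i] + in[2 * i + 1];   /* one sum per output element */
    return 0;                                 /* 0 on success */
}
```

The ALF accelerator runtime would call this entry point once per work block (or per iteration of a multi-use work block) after staging the buffers.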
Chapter 3. Data transfer list
A data transfer list contains entries that consist of the data size and a pointer to the
host memory location of the data.
For many applications, the input data for a single compute kernel cannot be stored
contiguously in the host memory. For example, in the case of a multi-dimensional
matrix, the matrix is usually partitioned into smaller submatrices for the
accelerators to process. For certain data partitioning schemes, the data of a
submatrix is scattered to different memory locations in the data space of the large
matrix. Accelerator memory is usually limited, and the most efficient way to store
the submatrix is contiguously. Data for each row or column of the submatrix is put
together in a contiguous buffer. For input data, these scattered pieces are gathered into the local
memory of the accelerator from scattered host memory locations. With output data,
the above situation is reversed, and the data in the local memory of the accelerator
is scattered to different locations in host memory.
The complexity of data movement patterns can vary. For example, in the case of a
two-dimensional matrix, the movement pattern can be described by a base pointer,
a column width, a row count, and a stride length; while for certain fast Fourier
transform (FFT) kernels, the host addresses of data are derived from very complex
butterfly exchanging paths. To address these complexities, these operations are
represented as a data transfer list. The data in the local memory of the accelerator
is always packed and is organized in the order of the entries in the list. For input
data, the data transfer list describes a data gathering operation. For output data,
the data transfer list describes a scattering operation. See Figure 2 for a diagram of
a data transfer list.
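The gather direction can be sketched as follows. The dt_entry layout is an illustrative assumption (the real ALF entry format is platform defined); the point of the sketch is only that entries are copied into the packed accelerator buffer in list order, and that the scatter direction reverses the copy.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative data transfer list entry: the data size and a pointer to
 * the host memory location of the data (the actual ALF entry layout is
 * platform defined; this struct is an assumption for the sketch). */
typedef struct {
    void  *host_addr;   /* where this piece lives in host memory */
    size_t size;        /* size of this piece in bytes */
} dt_entry;

/* Gather: pack every entry, in list order, into one contiguous
 * accelerator buffer. Returns the total number of bytes packed. */
size_t dt_list_gather(const dt_entry *list, int n_entries,
                      unsigned char *local_buf)
{
    size_t off = 0;
    for (int i = 0; i < n_entries; i++) {
        memcpy(local_buf + off, list[i].host_addr, list[i].size);
        off += list[i].size;   /* local data stays packed in list order */
    }
    return off;
}
```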
Figure 2. Data transfer list (data pieces A through H, scattered across host memory, are gathered by the data transfer list into packed accelerator memory in list order)
Chapter 4. Work blocks
A work block represents related input data, output data, and parameters. The
input and output data are described by corresponding data transfer lists. The
parameters are provided through ALF APIs. Depending on the application, the
data transfer list can either be generated on the host (host data partition) or by the
accelerators (accelerator data partition).
Before calling the compute kernel, the ALF accelerator runtime retrieves the
parameters and the input data based on the input data transfer list from the input
buffer in host memory. After calling the compute kernel, the ALF accelerator
runtime puts the output result back into the host memory. The ALF accelerator
runtime manages the memory of the accelerator to accommodate the input and
output data. The ALF accelerator runtime also supports overlapping data transfers
and computations transparently through double buffering techniques if there is
enough free memory.
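The double-buffering technique mentioned above can be sketched as a simple ping-pong between two local buffers; the bookkeeping below is illustrative, not ALF internals.

```c
/* Illustrative ping-pong schedule for double buffering: the compute
 * kernel works on one buffer while the runtime transfers the next work
 * block's input into the other. 'schedule' records which of the two
 * buffers each block is computed in. */
void double_buffer_schedule(int n_blocks, int *schedule)
{
    int cur = 0;                  /* buffer currently owned by the kernel */
    for (int blk = 0; blk < n_blocks; blk++) {
        schedule[blk] = cur;      /* compute on 'cur'; meanwhile the next
                                     block's input lands in buffer 1 - cur */
        cur = 1 - cur;            /* swap roles for the next block */
    }
}
```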
Single-use work block
A single-use work block is processed only once. A single-use work block enables
you to generate input and output data transfer lists on either the host or the
accelerator.
Multi-use work block
A multi-use work block is processed repeatedly for a specified number of iterations. Unlike
the single-use work block, the multi-use work block does not enable you to
generate input and output data transfer lists from the host process. For multi-use
work blocks, all input and output data transfer lists must be generated on the
accelerators each time a work block is processed by the ALF runtime. The ALF
runtime passes the parameters, total number of iterations, and current iteration
count to the accelerator data partition subroutines. See Chapter 6, “Accelerator data
partitioning,” on page 17 for more information about single-use work blocks and
multi-use work blocks.
Chapter 5. Buffers
On the accelerator, the ALF accelerator runtime manages the data of the work
blocks for the compute kernel. The compute kernel developers only need to focus
on the organization of data and the actual compute code. Buffer management and
data movement are handled by the ALF accelerator runtime. However, it is still
important that the programmers have a good understanding of the usage of each
buffer and their relationship with the compute task.
Buffer types
The ALF accelerator runtime code provides handles to the following five different
buffers for each instance of a compute task:
Task context buffer
A task context buffer is used by applications that require common persistent data
buffers that can be referenced by all work blocks. It is also useful for merge or
all-reduce operations. It is a shared buffer that is allocated when an instance of the
compute task is started on the accelerator. The task context consists of two optional
sections that are concatenated into a contiguous buffer. One section is for read-only
access, the other section is writable and can be modified by the compute task while
processing the work blocks. If there is a read-only section, it will be placed before
the writable section. See Figure 3 on page 12. The writable task context is returned
to host memory after the ALF runtime code finishes processing a compute task. If
other processes update this buffer in host memory when the task context buffer is
returned to host memory, ALF does not ensure data consistency. To avoid data
coherency problems, create a unique writable context buffer in host memory for
each task instance.
Work block parameter and context buffer
The work block parameter and context buffer serves two purposes:
v It passes work-block-specific constants or pass-by-value parameters.
v It reserves storage space for the compute task to save the specific context data of
the work block.
This buffer can be used by the alf_comp_kernel accelerator routine, the
alf_prepare_input_list accelerator routine, or the alf_prepare_output_list
accelerator routine. The parameters are copied to an internal buffer associated with
the work block data structure in host memory when the alf_wb_add_parm
host routine is invoked.
Figure 4 gives an illustration of the buffer layout when the task only has dedicated
input and output buffers. The buffers in this case are not guaranteed to be adjacent
in memory.
Figure 3. Task context buffer (work blocks flow from the host/task main thread to each accelerator/task instance; the read-only context is shared, while each task instance has its own writable context buffer)
Figure 4. Work block with only input and output buffers (the parameter data, input data, output data, and read-only/writable task context each occupy their own contiguous local memory region, reached through the parameter, input, output, and task context buffer pointers)
Work block input data buffer
The work block input data buffer contains the input data for each work block (or
each iteration of a multi-use work block) for the compute kernel. For each instance
of the ALF compute kernel, there is a single contiguous input data buffer.
However, the input buffer can consist of a collection of data from distinct memory
segments set in host memory. These data buffers are gathered into the input data
buffer on the accelerators. The ALF runtime code minimizes performance overhead
by not duplicating input data unnecessarily. When the contents of the work block
are constructed by the alf_wb_add_io_buffer routine, only the pointers to the input
data are saved to the internal data structure of the work block. This data is
transferred to the memory of the accelerator when the work block is processed. A
pointer to the contiguous input buffer in the memory of the accelerator is passed
to the compute kernel. For more information about data scattering and gathering,
see Chapter 3, “Data transfer list,” on page 7.
Work block output data buffer
This buffer is used to save the output of the compute kernel. It is a single
contiguous buffer in the memory of the accelerator. Output data can be transferred
to distinct memory segments in host memory. After the compute kernel returns
from processing one work block, the data in this buffer is moved to the host
memory locations specified by the alf_wb_add_io_buffer routine when the work
block is constructed.
Work block overlapped input and output data buffer
This buffer contains both input and output data. It is dynamically allocated for
each work block. However, when this buffer is declared, the buffer organization on
the accelerator is different. The input, overlapped, and output buffer are
concatenated as one contiguous buffer where the dedicated input buffer is the first
section, the overlapped buffer follows, and the dedicated output buffer is the third
section. Only two pointers are passed to the compute kernel: the input data pointer
and the output data pointer. The input data pointer points to the beginning of the
contiguous buffer. The output data pointer points to the beginning of the
overlapped data buffer.
There are now two contiguous buffers:
v The dedicated input buffer plus the overlapped buffer for input data
v The overlapped buffer plus the dedicated output buffer for output data
Remember that the two buffers are overlapped, similar to the implementation of
the memmove C runtime function.
Figure 5 on page 14 shows the buffer layout when the task has all three types of
data buffers. The input, overlapped, and output buffers are concatenated to a
single contiguous buffer. If you need a pointer to the dedicated output buffer, you
can calculate it based on the output data pointer and the known size of the
overlapped data buffer.
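The pointer arithmetic described above can be written out directly. The size fields mirror those of alf_task_info_t, but the struct and helper below are a sketch, not part of the ALF API.

```c
#include <stddef.h>

/* Buffer sizes as the task would declare them; the field names mirror
 * alf_task_info_t, but this struct exists only for the sketch. */
typedef struct {
    size_t input_buffer_size;
    size_t overlapped_buffer_size;
    size_t output_buffer_size;
} buf_sizes;

/* Derive the three pointers described in the text from the base of the
 * single contiguous buffer: the input pointer at the start, the output
 * pointer at the start of the overlapped section, and the dedicated
 * output section immediately after the overlapped section. */
void wb_pointers(unsigned char *base, const buf_sizes *s,
                 unsigned char **p_input,
                 unsigned char **p_output,
                 unsigned char **p_dedicated_output)
{
    *p_input = base;                          /* input data pointer */
    *p_output = base + s->input_buffer_size;  /* overlapped section */
    *p_dedicated_output = *p_output + s->overlapped_buffer_size;
}
```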
There can be another special case with the overlapped buffers when there is no
dedicated input buffer. In this special case, the input and output data pointers all
point to the beginning of the overlapped buffer. Figure 6 shows the buffer layout
when there is no dedicated input buffer.
For accelerator data partition, all input and output data transfer lists for the
overlapped buffer can be defined through the alf_prepare_input_list function
and the alf_prepare_output_list function. This provides you with the necessary
flexibility to organize data in situations where accelerator memory is limited.
Overlapped buffers allow you to maximize the use of local memory in computing
task scenarios where the temporarily copied input data can be overwritten or the
output buffer is also the input to the computation. For example, to compute C = A
+ B when you know that the input data B can be overwritten by the result C
during the computations, define the data buffers of B and C as one overlapped
buffer. This eliminates the requirement of a dedicated output buffer for C in the
local memory, which can save a significant amount of memory. You can now
increase the size of the work blocks to support double buffering if the local
memory is too limited to do that when C is in a dedicated buffer.
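The C = A + B case can be sketched as follows, with B and C declared as one overlapped buffer; the function is illustrative, not ALF code.

```c
#include <stddef.h>

/* Illustrative C = A + B with an overlapped buffer: A is the dedicated
 * input, while B and C share one buffer, so the result overwrites B in
 * place and no dedicated output buffer for C is needed. */
void add_in_place(const float *a, float *b_and_c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        b_and_c[i] = a[i] + b_and_c[i];  /* C overwrites B element by element */
}
```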
Local memory allocation and address calculations
Buffers are allocated according to the size given by the alf_task_info_t data
structure. When the corresponding data transfer lists do not cover the whole
buffer, there could be unused memory regions at the end of the corresponding
section of the buffer.
Task context buffer
All of the data added by the alf_task_context_add_entry
(ALF_TASK_CONTEXT_READ) function is written to local memory starting at the
alf_comp_kernel (p_task_context) address. The size of the data will not exceed
alf_task_info_t.task_context_buffer_read_only_size.
Figure 5. Work block with input data buffer, overlapped input and output data buffer, and output data buffer (the input data, overlapped input and output data, and output data form one contiguous local memory region; the parameter data and read-only/writable task context occupy separate contiguous regions)
Figure 6. Work block with overlapped I/O buffer and no dedicated input buffer (the input and output data pointers both point to the start of the overlapped region, which is followed by the dedicated output data)
All of the data added by the alf_task_context_add_entry
(ALF_TASK_CONTEXT_WRITABLE) function is written to or retrieved from local
memory starting at the alf_comp_kernel (p_task_context) +
alf_task_info_t.task_context_buffer_read_only_size address. The size of the
data will not exceed alf_task_info_t.task_context_buffer_writable_size.
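The two task context address calculations can be sketched as follows; the helper functions are illustrative, not ALF APIs.

```c
#include <stddef.h>

/* The read-only section starts at the task context pointer itself. */
unsigned char *ctx_read_only(unsigned char *p_task_context)
{
    return p_task_context;
}

/* The writable section immediately follows the read-only section, so
 * its address is the context pointer plus the read-only size (the
 * task_context_buffer_read_only_size field of alf_task_info_t). */
unsigned char *ctx_writable(unsigned char *p_task_context,
                            size_t read_only_size)
{
    return p_task_context + read_only_size;
}
```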
Work block parameter and context buffer
All of the data added by the alf_wb_add_parm function is written to local memory
starting at the alf_comp_kernel (p_parm_ctx_buffer) address. The size of the data
will not exceed alf_task_info_t.parm_ctx_buffer_size.
Work block input data buffer
All of the data added by the alf_wb_add_io_buffer (ALF_BUFFER_INPUT) function is
written to local memory starting at the alf_comp_kernel (p_input_buffer)
address. The size of the data will not exceed alf_task_info_t.input_buffer_size.
When an accelerator data partition is used, the ALF_DT_LIST_CREATE (io_buffer
offset) address offset value is based on alf_comp_kernel (p_input_buffer). The
size of the data will not exceed alf_task_info_t.input_buffer_size +
alf_task_info_t.overlapped_buffer_size. There can be multiple data transfer lists
to retrieve data from different offsets of the input buffer. These data lists might
target the same location on the host memory. However, if you write to the same
host memory location with different data transfer lists, ALF does not guarantee
data consistency.
Work block output data buffer
All of the data added by the alf_wb_add_io_buffer (ALF_BUFFER_OUTPUT) function
is written to local memory starting at the alf_comp_kernel (p_output_buffer) +
alf_task_info_t.overlapped_buffer_size address. The size of the data will not
exceed alf_task_info_t.output_buffer_size.
When an accelerator data partition is used, the ALF_DT_LIST_CREATE (io_buffer
offset) address offset value is based on alf_comp_kernel (p_output_buffer). The
size of the data will not exceed
alf_task_info_t.overlapped_buffer_size + alf_task_info_t.output_buffer_size.
There can be multiple data transfer lists to write data to host memory. Each of
these data transfer lists can start from a different offset of the output buffer. These
data lists might overlap each other.
Work block overlapped input and output data buffer
All of the data added by the alf_wb_add_io_buffer (ALF_BUFFER_INOUT) function is
written to local memory starting at the alf_comp_kernel (p_output_buffer)
address. The size of the data will not exceed
alf_task_info_t.overlapped_buffer_size.
When an accelerator data partition is used, there are no dedicated APIs for the
overlapped buffer. For the input part, it is combined with the input data buffer
API. For the output part, it is combined with the output data API.
Memory constraints
To make the most efficient use of accelerator memory, the ALF runtime needs to
know the memory usage requirements of the compute task. The ALF runtime
requires that you specify the memory resources each compute task uses. The
runtime can then allocate the requested memory for the compute task.
Chapter 6. Accelerator data partitioning
When the data partition schemes are complex and require a lot of computing
resources, it can be more efficient to generate the data transfer lists on the
accelerators. This is especially useful if the host computing resources can be used
for other work or if the host does not have enough computing resources to
compute data transfer lists for all of its work blocks.
Data partition subroutines
Accelerated library developers must provide the alf_prepare_input_list
subroutine and the alf_prepare_output_list subroutine to do the data partition
for input and output and generate the corresponding data transfer list. The
alf_prepare_input_list is the input data partitioning subroutine and the
alf_prepare_output_list is the output data partitioning subroutine.
Number of data transfer list entries
Because dynamic memory allocation can be inefficient on the accelerator, it is
important to explicitly specify the maximum number of entries that the data
transfer list on the accelerator occupies. Then the ALF runtime can help allocate
the buffer to save the data transfer lists properly before calling the data partition
subroutines. Input and output data transfer lists are generated and used at
different times, so by specifying the size of the larger list, the buffers can be reused
between input and output data transfer lists.
Host memory addresses
The host does not generate the data transfer lists when using accelerator data
partitioning, so the addresses of input and output data buffers must be explicitly
passed to the accelerator through the parameter and context buffer.
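One possible way to pass those addresses is sketched below. The wb_parms layout is purely an assumption for illustration; ALF does not define this structure, it only transports the bytes placed in the parameter and context buffer.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical parameter layout for passing host buffer addresses to
 * the accelerator-side partition code. */
typedef struct {
    uint64_t input_ea;    /* host effective address of the input data */
    uint64_t output_ea;   /* host effective address of the output data */
    uint32_t n_elems;     /* problem size for this work block */
} wb_parms;

/* Host side: serialize the addresses into the parameter buffer
 * (conceptually, the bytes handed to alf_wb_add_parm). */
void pack_parms(unsigned char *parm_buf,
                uint64_t in_ea, uint64_t out_ea, uint32_t n)
{
    wb_parms p = { in_ea, out_ea, n };
    memcpy(parm_buf, &p, sizeof p);
}

/* Accelerator side: recover the addresses inside
 * alf_prepare_input_list / alf_prepare_output_list. */
void unpack_parms(const unsigned char *parm_buf, wb_parms *out)
{
    memcpy(out, parm_buf, sizeof *out);
}
```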
Single-use and multi-use work blocks
Based on the characteristics of an application, you can use single-use work blocks
or multi-use work blocks to efficiently implement data partitioning on the
accelerators. For a given task that can be partitioned into N work blocks, the
following illustrates how the two different types of work blocks can be used:
v Single-use work block: N work blocks with only the parameter and context
buffer are created on the host. The input parameters must contain the necessary
information to generate the corresponding input and output data transfer lists
for each small block of data. The ALF runtime calls the alf_prepare_input_list
function, the alf_comp_kernel function, and the alf_prepare_output_list
function sequentially for each work block.
Figure 7. Single-use work block (the runtime calls alf_prepare_input_list, alf_comp_kernel, and alf_prepare_output_list once for each work block)
v Multi-use work block: One multi-use work block that is processed N times is
created on the host. The ALF runtime then calls the alf_prepare_input_list
function, the alf_comp_kernel function, and the alf_prepare_output_list
function N times for the multi-use work block.
Modification of the parameter and context buffer during
multi-use work blocks
The parameter and context buffer of a multi-use work block is shared by multiple
invocations of the alf_prepare_input_list accelerator function and the
alf_prepare_output_list accelerator function. Use care when changing the
contents of this buffer. Because the ALF runtime does double buffering
transparently, it is possible that the current_count arguments for succeeding calls
to the alf_prepare_input_list function, the alf_comp_kernel function, and the
alf_prepare_output_list function are not strictly incremented when a multi-use
work block is processed. Because of this, modifying the parameter and context
buffer according to the current_count in one of the subroutines might cause
unexpected effects to other subroutines when they are called with different
current_count values at a later time.
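A safe pattern is to keep the partition subroutines stateless with respect to the parameter buffer, deriving each transfer from the unchanged parameters and current_count alone. A sketch (the function name and arguments are illustrative):

```c
#include <stdint.h>

/* Stateless partitioning for a multi-use work block: derive the host
 * offset of the current iteration purely from unchanged parameters and
 * the current_count argument. Because double buffering can reorder the
 * calls, the calculation must not depend on any earlier invocation. */
uint64_t block_offset(uint64_t base_ea, uint64_t block_bytes,
                      unsigned int current_count)
{
    return base_ea + (uint64_t)current_count * block_bytes;
}
```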
Figure 8. Multi-use work block (the alf_prepare_input_list, alf_comp_kernel, alf_prepare_output_list sequence repeats N times)
Chapter 7. Synchronization points
Synchronization points introduce ordering into the processing flow of the work
blocks. Two types of synchronization points are supported: barrier and notify. Both
support synchronous event notification by callback and asynchronous event
notification by query.
Barrier
A barrier ensures that all work blocks enqueued before this point are finished
before new work blocks added after the barrier can be processed on any of the
accelerators. If a callback function is registered to this synchronization point, the
work queue processing will not continue until the callback function returns.
Notify
A notify synchronization point provides a mechanism to query and run a specific
piece of code when this point is reached. The ALF runtime will generate a
notification message that can be queried on the host and it will also invoke the
registered callback function when this synchronization point is reached. However,
it does not ensure the order of work block completion.
Callback and query
Callback and query features are listed below:
v You can do work block-related memory region modification in barrier callbacks
because no work block is processed during that time. For example, the callback
function can write to an input or output data area that might be referred to by
later work blocks.
v A query returns the current results only after the corresponding callback has
returned. This is true for both notify and barrier synchronization points.
v Only the alf_task_query API is supported in the callback function. Calls to
other APIs might result in errors or undefined behavior.
v All callbacks, including notify callbacks, are serialized. This means a new
callback is not allowed when an existing callback has not returned, and the
callbacks will be invoked in the order that the corresponding synchronization
points are added into the work queue.
Chapter 8. Error handling
ALF supports limited capability to handle runtime errors. Upon encountering an
error, the ALF runtime tries to free up resources, then exits by default. To allow the
accelerated library developers to handle errors in a more graceful manner, you can
register a callback error handler function to the ALF runtime. Depending on the
type of error, the error handler function can direct the ALF runtime to retry the
current operation, stop the current operation, or shut down. These are controlled
by the return values of the callback error handler function.
When several errors happen in a short time or at the same time, the ALF runtime
attempts to invoke the error handler in sequential order.
Possible runtime errors include the following:
v Compute task runtime errors such as bus errors, memory allocation failures,
deadlocks, and others
v Detectable internal data structure corruption errors, which might be caused by
improper data transfers or access boundary issues
v Application detectable/catchable errors
Standard error codes on supported platforms are used for return values when an
error occurs. For this implementation, the standard C/C++ header file "errno.h" is
used. The API definitions in Part 2, “ALF API reference,” on page 23 list the
possible error codes.
Part 2. ALF API reference
Conventions
ALF and alf are the prefixes for the ALF namespace. Normal function
prototypes and data structure declarations use all lowercase characters with
underscores (_) separating the words. Macro definitions use all uppercase
characters with underscores separating the words.
Data type assumptions
int This data type is assumed to be signed by default on both the host
and accelerator. The size of this data type is defined by the
Application Binary Interface (ABI) of the architecture. However, the
minimum size of this data type is 32 bits. Note that the actual size of
this data type might differ between the host and the accelerator
architectures.
unsigned int This data type is assumed to be the same size as that of int.
char This data type is not assumed to be signed or unsigned. The size of
this data type, however, must be 8 bits.
long This data type is not used in the API definitions because it might not
be uniformly defined across platforms.
void * The size of this data type is defined by the ABI of the corresponding
architecture and compiler implementation. Note that the actual size of
this data type might differ between the host and accelerator
architectures.
Platform-dependent auxiliary APIs or data structures
The basic APIs and data structures of ALF are designed with cross-platform
portability in mind. Platform-dependent implementation details are not exposed in
the core APIs.
Common data structures
The enumeration type ALF_DATA_TYPE_T defines the data types for data movement
operations between the hosts and the accelerators. The ALF runtime does byte
swapping automatically if the endianness of the host and the accelerators
differs.
ALF_DATA_BYTE For data types that are independent of byte order
ALF_DATA_INT16 For 2-byte signed or unsigned integer types
ALF_DATA_INT32 For 4-byte signed or unsigned integer types
ALF_DATA_INT64 For 8-byte signed or unsigned integer types
ALF_DATA_FLOAT For 4-byte floating-point types
ALF_DATA_DOUBLE For 8-byte floating-point types
The constant ALF_NULL_HANDLE is used to indicate a non-initialized handle in the
ALF runtime environment. All handles should be initialized to this value to avoid
ambiguity in code semantics.
ALF runtime APIs that create handles always return results through pointers to
handles. After the API call is successful, the original content of the handle is
overwritten. Otherwise, the content is kept unchanged. ALF runtime APIs that
destroy handles modify the contents of handle pointers and initialize the contents
to ALF_NULL_HANDLE.
Chapter 9. Host API
The host API includes the basic framework API, the compute task API, and the
work block API.
Basic framework API
The following API definitions are the basic framework APIs.
alf_handle_t
This data structure is used as a reference to one instance of the ALF runtime. The
data structure is initialized by calling alf_init and is destroyed by calling
alf_exit.
Example
{
alf_handle_t half = ALF_NULL_HANDLE; // initialize
// do something here
if(ALF_NULL_HANDLE == half) // and check
{
fprintf(stderr, "The ALF handle is not initialized!\n");
}
}
ALF_ERR_POLICY_T
This is a callback function prototype that can be registered to the ALF runtime for
customized error handling.
Synopsis
ALF_ERR_POLICY_T (*alf_error_handler_t)(void *p_context_data, int
error_type, int error_code, char *error_string)
Parameters
p_context_data [IN] A pointer given to the ALF runtime when the error handler is
registered. The ALF runtime passes it to the error handler
when the error handler is invoked. The error handler can use
this pointer to keep its private data.
error_type [IN] A system-wide definition of error type codes, including the
following:
v ALF_ERR_FATAL: Cannot continue, the framework must shut
down.
v ALF_ERR_EXCEPTION: You can choose to retry or skip the
current operation.
v ALF_ERR_WARNING: You can choose to continue by ignoring
the error.
error_code [IN] A type-specific error code.
error_string [IN] A C string that holds a printable text string that provides
information about the error.
Return values
ALF_ERR_POLICY_RETRY Indicates that the ALF runtime should retry the operation that
caused the error. If a severe error occurs and the ALF runtime
cannot retry this operation, it will report an error and shut
down.
ALF_ERR_POLICY_SKIP Indicates that the ALF runtime should stop the operation that
caused the error and continue processing. If the error is severe
and the ALF runtime cannot continue, it will report an error
and shut down.
ALF_ERR_POLICY_ABORT Indicates that the ALF runtime must stop the operations and
shut down.
ALF_ERR_POLICY_IGNORE Indicates that the ALF runtime will ignore the error and
continue. If the error is severe and the ALF runtime cannot
continue, it will report an error and shut down.
Example
See “alf_register_error_handler” on page 30 for an example of this function.
alf_configure
This function configures the ALF runtime code according to the system
configuration provided. This API must be the first one called in any instance of
applications using the ALF runtime code. Depending on the platforms, ALF might
automatically detect the system configuration or might require the application
developer to provide some strings or a pointer to configuration files where the
ALF runtime code can get information about the system.
Synopsis
int alf_configure(alf_sys_config_t *p_configuration)
Parameters
p_configuration
[IN]
A platform-dependent configuration information place holder that ALF
uses to get the necessary system configuration data. In the current
Cell/B.E. architecture implementation, this argument is not used. A
NULL pointer should be given. On other platforms, specific definitions of
the data structure must be documented accordingly and the caller is
responsible to fill the data structure.
Return values
>= 0 Successful.
< 0 Errors occurred:
v -EINVAL: Invalid input parameter
v -ENODATA: Some system configuration data is not available
v -EBADR: Generic internal errors
alf_query_system_info
This function queries basic configuration information for the specific system on
which ALF is running.
Synopsis
int alf_query_system_info(int query_id, unsigned int *p_result)
Parameters
query_id [IN] A query identification that indicates the item to be queried:
v ALF_INFO_NUM_ACCL_NODES: Returns the number of accelerators in the
system.
v ALF_INFO_CTRL_NODE_MEM_SIZE: Returns the memory size of the hosts up
to 4 TB, in KB. When the size of memory is more than 4 TB, the total
reported memory size is (ALF_INFO_CTRL_NODE_MEM_SIZE_EXT*4 TB +
ALF_INFO_CTRL_NODE_MEM_SIZE*1KB) bytes. In a system where virtual
memory is supported, this should be the maximum size of one
contiguous memory block that a single user space application can
allocate.
v ALF_INFO_CTRL_NODE_MEM_SIZE_EXT: Returns the memory size of the
host, in units of 4 TB.
v ALF_INFO_ACCL_NODE_MEM_SIZE: Returns the memory size of the
accelerators up to 4 TB, in KB. When the size of memory is more than
4 TB, the total reported memory size is
(ALF_INFO_ACCL_NODE_MEM_SIZE_EXT*4 TB +
ALF_INFO_ACCL_NODE_MEM_SIZE*1KB) bytes. In a system where virtual
memory is supported, this should be the maximum size of one
contiguous memory block that a single user space application can
allocate.
v ALF_INFO_ACCL_NODE_MEM_SIZE_EXT: Returns the memory size of the
accelerators, in units of 4 TB.
v ALF_INFO_CTRL_NODE_ADDR_ALIGN: Returns the basic requirement of
memory address alignment on the host as an exponent of 2. A value of
zero indicates a byte-aligned address; a value of 4 indicates alignment
on 16-byte boundaries.
v ALF_INFO_ACCL_NODE_ADDR_ALIGN: Returns the basic requirement of
memory address alignment on the accelerator as an exponent of 2. A
value of zero indicates a byte-aligned address; a value of 8 indicates
alignment on 256-byte boundaries.
v ALF_INFO_DT_LIST_ADDR_ALIGN: Returns the address alignment of the
data transfer list entries in units of bytes.
p_result [OUT] Pointer to a buffer where the return value of the query is saved. If the
query fails, the result is undefined. If a NULL pointer is provided, the
query value is not returned, but the call returns zero.
Return values
0 Successful, the result of query is returned by p_result if that pointer is
not NULL
< 0 Errors occurred:
v -EINVAL: Unsupported query
v -EPERM: The ALF runtime is not properly configured
v -EBADR: Generic internal errors
Example
{
unsigned long long memsize;
unsigned int nodes, memsize_low, memsize_ext;
alf_configure(NULL);
alf_query_system_info(ALF_INFO_NUM_ACCL_NODES, &nodes);
alf_query_system_info(ALF_INFO_ACCL_NODE_MEM_SIZE, &memsize_low);
alf_query_system_info(ALF_INFO_ACCL_NODE_MEM_SIZE_EXT, &memsize_ext);
memsize = (unsigned long long) memsize_low +
((unsigned long long) memsize_ext << 32);
printf("We have %llu KB memory on 1 of %u accelerator nodes\n",
memsize, nodes);
}
alf_init
This function initializes the ALF runtime. It allocates accelerator resources and sets
up global data for ALF.
Synopsis
int alf_init(alf_handle_t *p_alf_handle, unsigned int
number_of_accelerators, ALF_STARTUP_POLICY_T policy)
Parameters
p_alf_handle [OUT] A pointer to a buffer that receives the contents of the data
structure that represents an instance of the ALF runtime.
This buffer is initialized with proper data if the call is
successful. Otherwise, the content is not modified.
number_of_accelerators [IN] Specifies the number of accelerators to allocate. When this
parameter is zero, the runtime tries to allocate all available
accelerators. If there are not enough accelerator resources,
the behavior is defined according to the policy parameter.
policy [IN] Defines the behavior of the function if there are not enough
accelerator resources as specified by the
number_of_accelerators parameter. Possible options are:
v ALF_INIT_PERSIST: Waits until the requested accelerators
are available.
v ALF_INIT_COMPROMISE: Obtains all available accelerators
and continues, even if the number of accelerators is less
than requested. Even if more accelerators are available
after this function returns, ALF will not dynamically add
more accelerators to satisfy the initial request.
v ALF_INIT_TRY: Stops, with an error code, if the number of
accelerators can not be satisfied.
Return values
> 0 The actual number of accelerators allocated for the runtime.
0 Not defined.
< 0 Error occurred:
v -EINVAL: Invalid input argument.
v -EPERM: The process does not have sufficient privileges to
fulfill the requirements.
v -ENOMEM: Out of memory or system resource.
v -ENOSYS: The required policy is not supported.
v -EBADR: Generic internal errors.
Example
{
alf_handle_t half = ALF_NULL_HANDLE; // initialize
unsigned int nodes;
int rtn;
alf_configure(NULL);
alf_query_system_info(ALF_INFO_NUM_ACCL_NODES, &nodes);
rtn = alf_init(&half, nodes, ALF_INIT_PERSIST);
if(rtn < 0) // and check
{
fprintf(stderr, "alf_init failed with code %d !\n", rtn);
}
}
alf_exit
This function shuts down the ALF runtime. It frees allocated accelerator resources
and stops all running or pending work queues and tasks, depending on the policy
parameter.
Synopsis
int alf_exit(alf_handle_t *p_alf_handle, ALF_SHUTDOWN_POLICY_T policy)
Parameters
p_alf_handle
[IN/OUT]
A pointer to a buffer that holds the contents of the data structure that
represents an instance of the ALF runtime. On exit, this buffer is set to
ALF_NULL_HANDLE to avoid future misuse if the call is successful.
policy [IN] Defines the shut down behavior:
v ALF_SHUTDOWN_FORCE: Shuts down immediately and stops all unfinished
tasks.
v ALF_SHUTDOWN_WAIT: Waits for all tasks to be processed and then shuts
down.
v ALF_SHUTDOWN_TRY: Returns with a failure if there are unfinished tasks.
Return values
>= 0 The shutdown succeeded. The number of unfinished work blocks is
returned.
< 0 The shutdown failed:
v -EINVAL: Invalid input argument
v -EBADF: Invalid ALF handle
v -EPERM: Process does not have sufficient privileges to fulfill the
requirements
v -ENOSYS: The required policy is not supported
v -EBUSY: There are still running tasks
v -EBADR: Generic internal errors
Example
{
alf_handle_t half = ALF_NULL_HANDLE;
unsigned int nodes;
int rtn;
alf_configure(NULL);
alf_query_system_info(ALF_INFO_NUM_ACCL_NODES, &nodes);
rtn = alf_init(&half, nodes, ALF_INIT_PERSIST);
// do something here
// and finally
rtn = alf_exit(&half, ALF_SHUTDOWN_WAIT);
// check return now
}
alf_register_error_handler
This function registers a global error handler function to the ALF runtime code. If
an error handler has already been registered, the new one replaces it.
Synopsis
int alf_register_error_handler(alf_handle_t alf_handle, alf_error_handler_t
error_handler_function, void *p_context)
Parameters
alf_handle [IN] A handle to the ALF runtime code.
error_handler_function
[IN]
A pointer to the user-defined error handler function. A NULL
value resets the error handler to the ALF default handler.
p_context [IN] A pointer to the user-defined context data for the error handler
function. This pointer is passed to the user-defined error
handler function when it is invoked.
Return values
0 Successful.
< 0 Errors occurred:
v -EINVAL: Invalid input argument
v -EBADF: Invalid ALF handle
v -EBADR: Generic internal errors
Example
static char my_context[256];
ALF_ERR_POLICY_T my_alf_error_handler(void *p_context_data,
int error_type, int error_code,
char *error_string)
{
if(ALF_ERR_FATAL == error_type)
{
fprintf(stderr, "Fatal error %d : '%s'\n", error_code,
error_string);
return ALF_ERR_POLICY_ABORT;
}
return ALF_ERR_POLICY_SKIP;
}
int main(void)
{
alf_handle_t half = ALF_NULL_HANDLE;
unsigned int nodes;
int rtn;
alf_configure(NULL);
alf_query_system_info(ALF_INFO_NUM_ACCL_NODES, &nodes);
rtn = alf_init(&half, nodes, ALF_INIT_PERSIST);
rtn = alf_register_error_handler(half, my_alf_error_handler,
my_context);
// check return now
}
Compute task API
The following API definitions are the compute task APIs.
alf_task_handle_t
This data structure is a handle to a specific compute task running on the
accelerators. It is created by calling the alf_task_create function and destroyed by
either calling the alf_task_destroy function or when the alf_exit function is
called. Call the alf_task_wait function to wait for the task to finish processing all
queued work blocks. The alf_task_wait API is also an indication to the ALF
runtime that no new work blocks will be added to the work queue of the
corresponding task in the future.
alf_task_context_handle_t
This data structure is a handle to access the task context of one task instance.
Context buffers are only available when the task is created with a nonzero
task_context_buffer_size. The handle is returned by alf_task_context_create.
Context buffer entries can then be added by calling alf_task_context_add_entry.
Then the context buffer is registered to the task instance by calling
alf_task_context_register. This handle will be destroyed by the runtime
automatically when the task has completed or has been destroyed explicitly.
alf_task_info_t
This data structure is used to hold the task creation information for the
alf_task_create function.
For more information about memory usage, see Chapter 5, “Buffers,” on page 11.
typedef struct
{
void * p_task_info;
/* This is a pointer to information regarding compute tasks that are
critical for starting the task on the accelerators. */
unsigned int task_context_buffer_read_only_size;
/* The size of the task context buffer section that should only be
referenced by all work blocks of a specific task on a specific
accelerator. This parameter can be zero if the task does not need
a task context buffer. For each instance of a compute task on an
accelerator, a context buffer will be created. */
unsigned int task_context_buffer_writable_size;
/* The size of the task context buffer that can be written to by all
work blocks of a specific task on a specific accelerator. This
parameter can be zero if the task does not need a task context
buffer. For each instance of a compute task on an accelerator,
a context buffer will be created. */
unsigned int parm_ctx_buffer_size;
/* The size of the parameter and context buffer that the work block needs,
specified in bytes. This parameter can be zero if the work blocks do
not need this buffer. */
unsigned int input_buffer_size;
/* The maximum size of the input data buffer of the work blocks, specified
in bytes. This parameter can be zero if the work blocks do not have
input data or do not need dedicated input buffers. */
unsigned int overlapped_buffer_size;
/* The maximum size allowed for overlapped input and output data buffers
for work blocks, specified in number of bytes. This parameter can be
zero when the overlapped buffer feature is not used. */
unsigned int output_buffer_size;
/* The maximum size of the output data buffer of the work blocks, specified
in bytes. This parameter can be zero if the work blocks do not contain
output data or do not need dedicated output buffers. */
unsigned int dt_list_entries;
/* The maximum number of entries in input and output data transfer lists
when the accelerator data partition is used. This value can be
zero when the host data partition is used. For the ALF runtime to
manage memory resources on the accelerators more efficiently, an
approximate value for this number must be given to the runtime even
when host data partitioning is used. */
unsigned int task_attr;
/* A logical OR of the following values.
ALF_TASK_ATTR_PARTITION_ON_ACCEL: the accelerator functions that are
provided by programmers are invoked to generate data transfer lists
for input and output data. By default, the host APIs are used to do
data partitioning. */
} alf_task_info_t;
Cell/B.E. architecture specific implementation details
The p_task_info pointer points to an initialized alf_task_info_t_CBEA data
structure.
alf_task_create
This function spawns or schedules the compute task described by the
p_computing_task_info parameter on all accelerators and enables the addition of
work blocks to the work queue of the task.
ALF uses an SPMD model, so it is possible to create multiple tasks in a batch.
However, the tasks are run sequentially in the order they are created. The
corresponding alf_task_wait should be called to ensure the completion of the
specified task and then alf_task_destroy is called to free the task handle resource.
The runtime automatically spawns new tasks when the currently running task
completes.
For tasks with a task_context_buffer_size larger than zero, the task will not
begin to process work blocks until the task context buffer is assigned to each
instance of the task on an accelerator by invoking alf_task_context_assign.
Synopsis
int alf_task_create(alf_task_handle_t *p_task_handle, alf_handle_t
alf_handle, alf_task_info_t *p_computing_task_info)
Parameters
p_task_handle [OUT] Returns a handle to the created task. The content of the pointer
is not modified if the call fails.
alf_handle [IN] The handle to the ALF runtime.
p_computing_task_info
[IN]
A pointer to a data structure that contains critical information
that the ALF runtime uses to spawn a compute task on the
accelerators. Contents of this data structure are not referenced
by the ALF runtime after the function returns.
Return values
> 0 The number of task instances that have been or will be
spawned.
0 Not defined.
< 0 Errors occurred:
v -EINVAL: Invalid input argument.
v -EBADF: Invalid ALF handle.
v -ENOMEM: Out of memory or system resource.
v -EPERM: The process does not have sufficient privileges to
fulfill the requirements.
v -ENOEXEC: Invalid task image format or description
information.
v -E2BIG: Memory requirement cannot be satisfied.
v -ENOSYS: The required task attribute is not supported.
v -EBADR: Generic internal errors.
Example
{
alf_task_info_t tinfo;
alf_task_info_t_CBEA spe_tsk;
// init ALF ...
memset(&tinfo, 0, sizeof(tinfo));
spe_tsk.spe_task_image = my_spe_task_image;
spe_tsk.max_stack_size = 4096;
tinfo.p_task_info = &spe_tsk;
tinfo.parm_ctx_buffer_size = sizeof(my_parm_data_structure);
tinfo.input_buffer_size = 128*1024;
tinfo.output_buffer_size = 64*1024;
tinfo.dt_list_entries = 128;
tinfo.task_attr = ALF_TASK_ATTR_PARTITION_ON_ACCEL;
rtn = alf_task_create(&htask, half, &tinfo);
if(rtn < 0)
fprintf(stderr, "Failed to create task\n");
else
printf("A total of %d instances are / will be created.\n",
rtn);
}
alf_task_context_create
This function creates a context buffer handle to enable accesses to the context
buffer of the specified task instance.
A task can only have context buffers when it is started with
task_context_buffer_size set to nonzero. The context buffer is on the accelerator
that runs the task instance. For each instance of the task, only one context buffer
handle is created. The programmer calls alf_task_context_add_entry to add
references to host memory locations to the context buffer. After the entries are
added, alf_task_context_register registers the context buffer handle to the ALF
runtime. These host memory references are copied to the context buffer before the
task instance begins to process work blocks. After the task completes, the contents
of the context buffer are written back to the original locations in the host memory.
Synopsis
int alf_task_context_create(alf_task_context_handle_t *p_tc_handle,
alf_task_handle_t task_handle, unsigned int accelerator_index)
Parameters
p_tc_handle [OUT] The pointer to a buffer where the created handle is returned.
The contents are not modified if this call returns an error.
task_handle [IN] The handle to the compute task.
accelerator_index [IN] The index of the accelerator. This value ranges from zero to the
number of allowed instances of the task as returned by
alf_task_create. If the value is zero, the ALF runtime selects one
of the instances that does not yet have a context buffer
allocated. Otherwise, the context buffer is allocated for the
specific instance that is defined by the internal order of ALF
when the task is created.
Return values
0 Success.
< 0 Errors occurred:
v -EINVAL: Invalid input argument.
v -EPERM: Operation not allowed (one context buffer is already
allocated for this instance or the task is not declared as
supporting context buffers).
v -EBADF: Invalid task handle.
v -ENOMEM: Out of memory or system resource.
v -EBADR: Generic internal errors.
Example
{
alf_task_info_t tinfo;
alf_task_info_t_CBEA spe_tsk;
my_task_context_t *pctx;
// init ALF ...
memset(&tinfo, 0, sizeof(tinfo));
spe_tsk.spe_task_image = my_spe_task_image;
spe_tsk.max_stack_size = 4096;
tinfo.p_task_info = &spe_tsk;
tinfo.task_context_buffer_read_only_size = sizeof(read_only_data);
tinfo.task_context_buffer_writable_size = sizeof(my_task_context_t);
tinfo.parm_ctx_buffer_size = sizeof(my_parm_data_structure);
tinfo.input_buffer_size = 128*1024;
tinfo.output_buffer_size = 64*1024;
tinfo.dt_list_entries = 128;
tinfo.task_attr = ALF_TASK_ATTR_PARTITION_ON_ACCEL;
rtn = alf_task_create(&htask, half, &tinfo);
if(rtn < 0)
fprintf(stderr, "Failed to create task\n");
else
printf("A total of %d instances are / will be created.\n", rtn);
pctx = malloc_align(sizeof(my_task_context_t)*rtn, 128);
memset(pctx, 0, sizeof(my_task_context_t)*rtn);
for(i=0; i<rtn; i++)
{
alf_task_context_handle_t hctx;
alf_task_context_create(&hctx, htask, 0 /* i+1 is ok too */);
alf_task_context_add_entry(hctx, read_only_data, sizeof(read_only_data),
ALF_DATA_BYTE, ALF_TASK_CONTEXT_READ);
alf_task_context_add_entry(hctx, pctx+i, sizeof(my_task_context_t),
ALF_DATA_BYTE, ALF_TASK_CONTEXT_WRITABLE);
alf_task_context_register(hctx);
}
// ...
}
alf_task_context_add_entry
This function adds an entry to the context buffer of the corresponding task
instance. The entry describes a single piece of data that is transferred in from the
host memory before the task instance starts to process work blocks. This data is
transferred to the original location when the task instance finishes normally. For a
specific context buffer, further calls to this API will return an error after the context
buffer is registered by calling the alf_task_context_register function.
Synopsis
int alf_task_context_add_entry(alf_task_context_handle_t tc_handle, void
*p_address, unsigned int size_of_data, ALF_DATA_TYPE_T data_type,
ALF_TASK_CONTEXT_T entry_type)
Parameters
tc_handle [IN] The task context buffer handle.
p_address [IN] The pointer to the data in remote memory.
size_of_data
[IN]
The size of data in units of the data type.
data_type [IN] The type of data. This value is required if data endianness conversion is
necessary when moving the data.
entry_type [IN] Type of entry:
v ALF_TASK_CONTEXT_READ: Add this entry to the read-only section of the
context buffer. The content of this entry will not be written back to the
original location when the task is finished.
v ALF_TASK_CONTEXT_WRITABLE: Add this entry to the writable section of
the context buffer. The content of this entry can be modified by the
accelerator, so it will be written back to the original location when the
task is finished to ensure consistency.
Return values
0 Success.
< 0 Errors occurred:
v -EINVAL: Invalid input argument.
v -EBADF: Invalid task context buffer handle.
v -ENOBUFS: The size or offset of the entry is outside of the allowed range.
v -EBADR: Generic internal errors.
Example
See an example of this function in “alf_task_context_create” on page 33.
alf_task_context_register
This function registers the given context buffer handle to the corresponding task
instance. It should only be called when all related calls to the
alf_task_context_add_entry function have returned. The task instance will not
begin to process work blocks before the context buffer handle is registered.
Synopsis
int alf_task_context_register(alf_task_context_handle_t tc_handle)
Parameters
tc_handle [IN] The task context buffer handle.
Return values
0 Success.
< 0 Errors occurred:
v -EBADF: Invalid task context buffer handle.
v -EBADR: Generic internal errors.
Example
See an example of this function in “alf_task_context_create” on page 33.
alf_task_query
This function queries the current status of a task.
Synopsis
int alf_task_query(alf_task_handle_t task_handle, unsigned int
*p_unfinished_wbs, unsigned int *p_total_wbs)
Parameters
task_handle [IN] The task to be queried.
p_unfinished_wbs [OUT] A pointer to an integer buffer where the number of unfinished
work blocks of this task is returned. When a NULL pointer is
given, the return value is ignored.
p_total_wbs [OUT] A pointer to an integer buffer where the total number of
submitted work blocks of this task is returned. When a NULL
pointer is given, the return value is ignored.
Return values
> 1 A return value of N indicates that the task is the (N-1)th pending
task in the task queue.
1 The task is currently running.
0 The task is finished and can be safely destroyed.
< 0 Errors occurred:
v -EINVAL: Invalid input argument.
v -EBADF: Invalid task handle.
v -EBADR: Generic internal errors.
Example
{
unsigned int unfinished;
rtn = alf_task_create(&htask, half, &tinfo);
// do things here
rtn = alf_task_query(htask, &unfinished, NULL);
if(rtn > 1)
printf("we are waiting in the queue !\n");
else if(rtn == 1)
printf("we are running and %d work blocks pending !\n", unfinished);
else if(rtn == 0)
printf("done !\n");
else
printf("Why ? \n");
}
alf_task_wait
This function declares that no work blocks will be added to the specified task and
waits for the completion of the spawned task instances on all accelerators.
When this function is called, new work blocks cannot be added to the work queue
of the specified task. Thus, further calls to alf_wb_enqueue will return an error.
This function provides a timeout mechanism that you can use to implement
synchronous or asynchronous work.
Note: The task handle and all of its related resources continue to be valid after this
function finishes. You must call alf_task_destroy to release the resources
associated with the task.
Synopsis
int alf_task_wait(alf_task_handle_t task_handle, int time_out)
Parameters
task_handle
[IN/OUT]
A task handle that is returned by the alf_task_create API.
time_out [IN] A timeout input with the following options for values:
v > 0: Waits for up to the number of milliseconds specified before a
timeout error occurs.
v 0: Checks the status of the accelerator and returns immediately.
v < 0: Waits until all of the accelerators finish processing.
Return values
> 0 The accelerators are still running and the number of unfinished work
blocks is returned. This value is only possible when the time_out
argument is zero.
1 As a special case, when all work blocks are finished but the runtime is
still cleaning up the task environment (for example, writing back the
context), the return value is 1. This indicates that the task should
not be destroyed yet.
0 All of the accelerators finished the job.
< 0 Errors occurred:
v -EINVAL: Invalid input argument.
v -EBADF: Invalid task handle.
v -ESRCH: Already closed task handle.
v -ETIME: Time out.
v -EBADR: Generic internal errors.
Example
{
rtn = alf_task_create(&htask, half, &tinfo);
// do things here
rtn = alf_task_wait(htask, 10*1000);
if(rtn > 0)
printf("Still running and %d work blocks pending !\n", rtn);
else if(rtn == 0)
printf("done !\n");
else if(-ETIME == rtn)
printf("Timeout\n");
else
printf("Something bad %d\n", rtn);
}
alf_task_destroy
This function destroys the specified task and releases the resources used by the
task. If there are work blocks that are still being processed, this routine forcibly
stops the processing of work blocks. Pending tasks are also destroyed. To release
the task resources without losing the computing results, ensure that calls to the
alf_task_wait function return zero.
Synopsis
int alf_task_destroy(alf_task_handle_t* p_task_handle)
Parameters
p_task_handle [IN/OUT] The pointer to a task handle that is returned by the
alf_task_create API. When this call returns successfully, the
content it points to is set to ALF_NULL_HANDLE.
Return values
>= 0 Success and the number of unfinished work blocks is returned.
< 0 Errors occurred:
v -EINVAL: Invalid input argument.
v -EBADF: Invalid task handle.
v -EBUSY: Resource busy.
v -ENOSYS: Feature not implemented.
v -EBADR: Generic internal errors.
Example
{
rtn = alf_task_create(&htask, half, &tinfo);
// do things here
rtn = alf_task_wait(htask, 10*1000);
if(rtn == 0)
{
printf("done !\n");
alf_task_destroy(&htask);
}
}
Work block API
The following API definitions are the work block APIs.
Data structures
alf_wb_handle_t
This data structure is a handle to a work block.
alf_wb_sync_handle_t
This data structure refers to the synchronization point.
alf_wb_create
This function creates a new work block for the specified compute task. The work
block is added to the work queue of the task. The caller can only update the
contents of a work block before it is added to the work queue. After the work
block is added to the work queue, the life span of the data structure is determined
by the ALF runtime.
The ALF runtime is responsible for releasing any resources allocated for the work
block. The ALF runtime releases the allocated resources for the work block after
the runtime finishes processing it. This function can only be called before the
alf_task_wait function is invoked. After the alf_task_wait function is called,
additional calls to this function will return an error.
Synopsis
int alf_wb_create(alf_wb_handle_t *p_wb_handle, alf_task_handle_t
task_handle, ALF_WORK_BLOCK_TYPE_T work_block_type, unsigned int
repeat_count)
Parameters
p_wb_handle
[OUT]
The pointer to a buffer where the created handle is returned. The contents
are not modified if this call fails.
task_handle
[IN]
The handle to the compute task.
work_block_type
[IN]
The type of work block to be created. Choose from the following types:
v ALF_WB_SINGLE: Creates a single-use work block
v ALF_WB_MULTI: Creates a multi-use work block. This work block type is
only supported when the task is created with the
ALF_TASK_ATTR_PARTITION_ON_ACCEL attribute.
repeat_count
[IN]
Specifies the number of iterations for a multi-use work block. This
parameter is ignored when a single-use work block is created.
Return values
>= 0 Success.
< 0 Errors occurred:
v -EINVAL: Invalid input argument.
v -EPERM: Operation not allowed.
v -EBADF: Invalid task handle.
v -ENOMEM: Out of memory.
v -EBADR: Generic internal errors.
Example
See “alf_wb_enqueue” for an example of this function.
alf_wb_enqueue
This function adds the work block to the work queue of the specified task handle.
The caller can only update the contents of a work block before it is added to the
work queue. After it is added to the work queue, you cannot access the wb_handle.
Synopsis
int alf_wb_enqueue(alf_wb_handle_t wb_handle)
Parameters
wb_handle [IN] The handle of the work block to be put into the work queue.
Return values
0 Success.
< 0 Errors occurred:
v -EINVAL: Invalid input argument.
v -EBADF: Invalid task handle.
v -EBUSY: An internal resource is occupied.
v -EBADR: Generic internal errors.
Example
{
alf_task_create(&htask, half, &tinfo);
for(X=0; X<1024; X+=M)
{
parm.x=X;
parm.y=0;
alf_wb_create (&hwb, htask, ALF_WB_SINGLE, 1);
alf_wb_add_parm (hwb, &parm, sizeof(parm),
ALF_DATA_BYTE, 0);
alf_wb_add_io_buffer (hwb,
data_a[X], M*N*sizeof(float),
ALF_DATA_FLOAT, ALF_BUFFER_INPUT);
alf_wb_add_io_buffer (hwb,
&mat_b[X][0], M*N*sizeof(float),
ALF_DATA_FLOAT, ALF_BUFFER_INPUT);
alf_wb_add_io_buffer (hwb,
&mat_c[X][0], M*N*sizeof(float),
ALF_DATA_FLOAT, ALF_BUFFER_OUTPUT);
alf_wb_enqueue(hwb);
}
alf_task_wait(htask, -1);
}
alf_wb_add_parm
This function adds the given parameter to the parameter and context buffer of the
work block in the order that this function is called. The starting address is from
offset zero. The added data is copied to the internal parameter and context buffer
immediately. The relative address of the data can be aligned as specified. For a
specific work block, additional calls to this API will return an error after the work
block is put into the work queue by calling the alf_wb_enqueue function.
Synopsis
int alf_wb_add_parm(alf_wb_handle_t wb_handle, void *pdata, unsigned int
size_of_data, ALF_DATA_TYPE_T data_type, unsigned int address_alignment)
Parameters
wb_handle [IN] The work block handle.
pdata [IN] A pointer to the data to be copied.
size_of_data [IN] The size of the data in units of the data type.
data_type [IN] The type of data. This value is required if data endianness
conversion is necessary when moving the data.
address_alignment [IN] Address alignment requirement, expressed as a power of
2. The valid range is from 0 to 8. A zero indicates a
byte-aligned address. An 8 indicates alignment on 256-byte boundaries.
Return values
0 Success.
< 0 Errors occurred:
v -EINVAL: Invalid input argument.
v -EPERM: Operation not allowed.
v -EBADF: Invalid work block handle.
v -ENOBUFS: The size of the data to be added is too large.
v -EBADR: Generic internal errors.
Example
See “alf_wb_enqueue” on page 40 for an example of this function.
alf_wb_add_io_buffer
This function adds an entry to the input or output data transfer lists of a single-use
work block. The entry describes a single piece of data transferred from or to
remote memory.
For a specific work block, additional calls to this API return an error after the work
block is put into the work queue by calling the alf_wb_enqueue function. This
function can only be called if the compute task is not created with the
ALF_TASK_ATTR_PARTITION_ON_ACCEL attribute.
Synopsis
int alf_wb_add_io_buffer(alf_wb_handle_t wb_handle, void *p_address,
unsigned int size_of_data, ALF_DATA_TYPE_T data_type, ALF_BUFFER_TYPE_T
io_type)
Parameters
wb_handle [IN] The work block handle.
p_address [IN] A pointer to the data in remote memory.
size_of_data [IN] The size of the data in units of the data type.
data_type [IN] The type of data. This value is required if data endianness
conversion is necessary when doing the data movement.
io_type [IN] The type of buffer. Choose from the following values:
v ALF_BUFFER_INPUT: Input buffer
v ALF_BUFFER_OUTPUT: Output buffer
v ALF_BUFFER_INOUT: Buffer used for both input and output
Return values
0 Success.
< 0 Errors occurred:
v -EINVAL: Invalid input argument.
v -EPERM: Operation not allowed.
v -EBADF: Invalid work block handle.
v -E2BIG: The ALF runtime cannot accommodate the number
of I/O buffers requested.
v -ENOBUFS: The ALF runtime cannot accommodate the amount
of data requested.
v -EBADR: Generic internal errors.
Cell/B.E. architecture implementation details
For this function, the ALF runtime handles the 16 KB DMA limitation transparently.
You must ensure the data is aligned properly because the ALF runtime will not do
data padding and data duplication to satisfy the address and data size alignment
requirements of the memory flow controller (MFC). An -EINVAL error will be
returned when the input data does not meet the alignment requirements.
Example
See “alf_wb_enqueue” on page 40 for an example of this function.
alf_wb_sync
This function adds a synchronization point to the current work queue for the
specified task. You can register a callback function for notification of this
synchronization point. The ALF runtime will invoke the callback function when the
synchronization condition is met. This API can only be called before alf_task_wait
is invoked. After alf_task_wait is called, further calls to the function will return
an error.
Synopsis
int alf_wb_sync(alf_wb_sync_handle_t *p_sync_handle, alf_task_handle_t
task_handle, ALF_SYNC_TYPE_T sync_type, int
(*sync_callback_func)(alf_wb_sync_handle_t sync_handle, void* p_context),
void *p_context, unsigned int context_size)
Parameters
p_sync_handle [IN/OUT] Pointer to buffer where the handle to the created
synchronization point is returned.
task_handle [IN] Task handle.
sync_type [IN] This can be set to one of the following values:
v ALF_SYNC_BARRIER: When the ALF runtime reaches this
synchronization point, all work blocks enqueued before this
point must be finished before any new work blocks added
after the synchronization point can be processed on any of
the accelerators. If a callback function is registered to this
synchronization point, the work queue will continue running
only when the callback function returns.
v ALF_SYNC_NOTIFY: The ALF accelerator runtime will send a
notification to the ALF host runtime and invoke the
registered callback function when this synchronization point
is reached. However, it does not ensure the order of work
block completion.
sync_callback_func [IN] The pointer to the callback function that is registered for
this synchronization point. This parameter can be NULL if you
do not want a callback function.
p_context [IN] A pointer to a context buffer. The pointer to the internal buffer
will be passed to the callback function if there is a callback
function registered. The content of the context buffer is copied
by value only. A NULL value indicates no context buffer.
context_size [IN] The size of the context buffer in bytes. Zero indicates no
context buffer.
Return values
0 Success.
< 0 Errors occurred:
v -EINVAL: Invalid input argument.
v -EPERM: Operation not allowed.
v -EBADF: Invalid task handle.
v -ENOMEM: Out of memory.
v -EBADR: Generic internal errors.
Note: A synchronization point without an associated callback function is always
non-blocking. In this case, use alf_wb_sync_wait to check the status of the
synchronization point. If a callback function is associated with the
synchronization point, it is always blocking, and the ALF runtime does not
assign new work blocks to the accelerators until the callback function has
returned. In either case, alf_wb_sync_wait is always supported.
sync_callback_func
This is the prototype of the callback function for the synchronization point. The
callback function might be invoked in a different thread context than the main
application.
Synopsis
int (*sync_callback_func)(alf_wb_sync_handle_t sync_handle, void
*p_context)
Parameters
sync_handle [IN] The handle to the synchronization point.
p_context [IN] A pointer to the buffer where the programmer-supplied
context values are duplicated. The contents of this buffer are
not kept after the callback function returns.
Return values
0 No errors.
< 0 Errors occurred during the callback. An internal error with
type ALF_ERR_EXCEPTION will be raised.
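A minimal callback sketch is shown below. The typedef stub for the opaque handle type and the context layout (a single int written at alf_wb_sync time) are assumptions made so the fragment is self-contained; the real definitions come from the ALF headers.

```c
#include <stdio.h>
#include <assert.h>

/* Stub for the opaque ALF handle type, for illustration only;
   the real definition comes from the ALF headers. */
typedef void *alf_wb_sync_handle_t;

/* Hypothetical callback: report the progress value that was copied into
   the context buffer when the synchronization point was created. */
static int my_sync_callback(alf_wb_sync_handle_t sync_handle, void *p_context)
{
    (void)sync_handle;
    int blocks_done = *(int *)p_context; /* context was copied by value */
    printf("sync point reached after %d work blocks\n", blocks_done);
    return 0; /* non-negative return: no ALF_ERR_EXCEPTION is raised */
}
```

A pointer to such a function would be passed as the sync_callback_func argument of alf_wb_sync.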
alf_wb_sync_wait
This function waits for the arrival of a synchronization point. Timeout is given in
milliseconds.
Synopsis
int alf_wb_sync_wait (alf_wb_sync_handle_t sync_handle, int time_out)
Parameters
sync_handle [IN] The handle to the synchronization point. This is the value
returned from alf_wb_sync.
time_out [IN] Timeout value in milliseconds.
v > 0: The function will wait for up to time_out milliseconds
before a time out error occurs.
v 0: The function will check the status of the synchronization
point and return immediately.
v < 0: The function will wait until the synchronization point is
reached.
Return values
> 0 The synchronization point has not been reached. This value is
only available when the time_out value is zero.
0 The synchronization operation completed successfully.
< 0 Errors occurred:
v -EINVAL: Invalid input argument.
v -EBADF: Invalid task handle.
v -ETIME: Timed out.
v -EBADR: Generic internal errors.
Note: The status of a synchronization point can be queried multiple times. When a
synchronization point has been reached, future calls to this API will always return
success until the corresponding task has completed.
Chapter 10. Accelerator API
The following API definitions are the accelerator APIs.
alf_comp_kernel
This is the entry point to the compute kernel. The ALF runtime moves in the user
data and input data before invoking this call.
Synopsis
int alf_comp_kernel(void* p_task_context, void *p_parm_ctx_buffer, void
*p_input_buffer, void *p_output_buffer, unsigned int current_count,
unsigned int total_count)
Parameters
p_task_context [IN] A pointer to the local memory block where the task context
buffer is kept.
p_parm_ctx_buffer [IN] A pointer to the local memory block where the parameter and
context data are kept.
p_input_buffer [IN] A pointer to the local memory block where the input data is
loaded.
p_output_buffer [IN] A pointer to the local memory block where the output data is
written.
current_count [IN] The current iteration count of multi-use work blocks. This
value starts at 0. For single-use work blocks, this value is
always 0.
total_count [IN] The total number of iterations of multi-use work blocks. For
single-use work blocks, this value is always 1.
Return values
0 The computation finished correctly.
< 0 An error occurred during the computation. The error code is
passed back to the library developer to be handled.
For overlapped I/O buffers, when this API is called, p_input_buffer refers to
the memory region that includes the dedicated input buffer and the overlapped
buffer, and p_output_buffer refers to the memory region that includes the
overlapped buffer and the dedicated output buffer. See Figure 5 on page 14 and
Figure 6 on page 14.
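The following sketch shows one way such a kernel might be written, using element-wise float addition. The buffer layouts (the element count as the first word of the parameter buffer, and the input buffer holding array A followed by array B) are assumptions for illustration, not part of the ALF API.

```c
#include <assert.h>

/* Sketch of a compute kernel: element-wise float addition.
   Layout assumptions (illustrative only): the parameter buffer starts
   with the element count n; the input buffer holds array A followed by
   array B; the output buffer receives A + B. */
int alf_comp_kernel(void *p_task_context, void *p_parm_ctx_buffer,
                    void *p_input_buffer, void *p_output_buffer,
                    unsigned int current_count, unsigned int total_count)
{
    (void)p_task_context; (void)current_count; (void)total_count;
    unsigned int n = *(unsigned int *)p_parm_ctx_buffer;
    const float *in = (const float *)p_input_buffer;
    float *out = (float *)p_output_buffer;
    for (unsigned int i = 0; i < n; i++)
        out[i] = in[i] + in[n + i]; /* C[i] = A[i] + B[i] */
    return 0; /* 0 tells the ALF runtime the computation finished correctly */
}
```

The ALF runtime supplies all six arguments; the kernel only reads the input and parameter buffers and writes the output buffer.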
alf_prepare_input_list
The ALF runtime calls this function in order to define the input data partition on
the accelerator.
Because ALF might be doing double buffering, the function should refer only to
the context and memory buffers provided by p_parm_ctx_buffer. This function
is only called if the compute task is created with the alf_task_info_t.task_attr
parameter set to ALF_TASK_ATTR_PARTITION_ON_ACCEL.
Synopsis
int alf_prepare_input_list(void *p_task_context, void *p_parm_ctx_buffer,
void *p_dt_list_buffer, unsigned int current_count, unsigned int
total_count)
Parameters
p_task_context [IN] A pointer to the local memory block where the task context
buffer is kept.
p_parm_ctx_buffer [IN] A pointer to the local memory block where the parameter and
context of the work block are kept. The data partition is only
based on these contents.
p_dt_list_buffer [IN] A pointer to the buffer where the generated data transfer list is
saved.
current_count [IN] The current iteration count of multi-use work blocks. This
value starts at 0. For single-use work blocks, this value is
always 0.
total_count [IN] The total number of iterations of multi-use work blocks. For
single-use work blocks, this value is always 1.
Return values
0 The computation finished correctly.
< 0 An error occurred during the computation. The error code is
passed to the library developer to be handled.
Note: This API does not need to be implemented when data partitioning is
performed by the host and the compiler and linker support weak symbols.
alf_prepare_output_list
The ALF runtime calls this function in order to define the output data partition on
the accelerator.
Because ALF might be doing double buffering, the function should only refer to
the context and memory buffers provided by p_parm_ctx_buffer. Invoke this API
only when the compute task is spawned with the alf_task_info_t.task_attr
parameter set to ALF_TASK_ATTR_PARTITION_ON_ACCEL.
Synopsis
int alf_prepare_output_list(void *p_task_context, void *p_parm_ctx_buffer,
void *p_dt_list_buffer, unsigned int current_count, unsigned int
total_count)
Parameters
p_task_context [IN] A pointer to the local memory block where the task context
buffer is kept.
p_parm_ctx_buffer [IN] A pointer to the local memory block where the parameter and
context of the work block are kept. The data partition is based
on these contents.
p_dt_list_buffer [IN] A pointer to the buffer where the generated data transfer list is
saved.
current_count [IN] The current iteration count for multi-use work blocks. This
value starts at 0. For single-use work blocks, this value is
always 0.
total_count [IN] The total number of iterations for multi-use work blocks. For
single-use work blocks, this value is always 1.
Return values
0 The computation finished correctly.
< 0 An error occurred during the computation. The error code is
passed to the calling program to be handled.
Note: This API does not need to be implemented when data partitioning is
performed by the host and the compiler and linker support weak symbols.
ALF_DT_LIST_CREATE
This macro creates the data transfer list data structure for input or output data
transfers.
Synopsis
ALF_DT_LIST_CREATE (void *p_dt_list_buffer, unsigned int io_buffer_offset)
Parameters
p_dt_list_buffer [IN] A pointer to the buffer for the data transfer list data structure.
io_buffer_offset [IN] The offset into the input or output buffer in accelerator
memory to which the data transfer list refers.
Return values
Not specified.
This API creates the data transfer list data structure. It might be implemented
as a macro on some platforms. For overlapped I/O buffers, when this API is
called in alf_prepare_input_list, the io_buffer_offset refers to the memory
region that includes the dedicated input buffer and the overlapped buffer. When
it is called in alf_prepare_output_list, the io_buffer_offset refers to the
memory region that includes the overlapped buffer and the dedicated output
buffer. See Figure 5 on page 14 and Figure 6 on page 14.
ALF_DT_LIST_ADD_ENTRY
This macro fills the data transfer list entry.
Synopsis
ALF_DT_LIST_ADD_ENTRY(void *p_dt_list_buffer, unsigned int data_size,
ALF_DATA_TYPE_T data_type, void *p_remote_address)
Parameters
p_dt_list_buffer [IN] A pointer to the buffer for the data transfer list data structure.
data_size [IN] The size of the data to be transferred, in units of the data
type.
data_type [IN] The type of data. This parameter is required if data endianness
conversion is necessary when doing the data movement.
p_remote_address [IN] The address of the remote memory.
Note: The Cell/B.E. implementation uses the parameter addr64 remote_address in
place of p_remote_address.
Return values
Not specified.
This API generates the data transfer list entries.
Cell/B.E. architecture implementation details
For this macro, the ALF runtime does not manage the 16 KB limitations
transparently. Cell/B.E. architecture programmers must overcome that limitation
by dividing the entry into multiple entries not larger than 16 KB. Cell/B.E.
architecture also has a data transfer list size limitation of 2048 entries. The ALF
runtime will not do a strict check on these constraints by default due to
performance requirements. However, some compile time options can be enabled to
support strict checks to these constraints. The Cell/B.E. implementation uses the
addr64 remote_address parameter in place of the void *p_remote_address
parameter. For more information about compile time options, see Part 5, “Compile
time options,” on page 95.
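The entry-splitting the 16 KB limit requires can be sketched as a few lines of arithmetic. Here split_transfer and add_entry are hypothetical helpers, not ALF APIs; add_entry stands in for a call to ALF_DT_LIST_ADD_ENTRY with a fixed list buffer and data type.

```c
#include <assert.h>

/* Sketch: divide one logical transfer into data transfer list entries of
   at most 16 KB each, as the Cell/B.E. notes above require. */
enum { DMA_MAX_BYTES = 16 * 1024 };

static unsigned int split_transfer(unsigned long long remote_addr,
                                   unsigned int total_bytes,
                                   void (*add_entry)(unsigned long long addr,
                                                     unsigned int bytes))
{
    unsigned int entries = 0;
    while (total_bytes > 0) {
        unsigned int chunk =
            total_bytes > DMA_MAX_BYTES ? DMA_MAX_BYTES : total_bytes;
        add_entry(remote_addr, chunk); /* one DT list entry per chunk */
        remote_addr += chunk;
        total_bytes -= chunk;
        entries++;
    }
    return entries; /* caller must keep this at or below 2048 entries */
}
```

A real implementation would also check the 2048-entry list limit and the MFC alignment requirements before emitting each entry.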
Chapter 11. Cell/B.E. architecture platform-dependent API
This API is platform-dependent.
alf_task_info_t_CBEA
This data structure holds the task creation information for alf_task_create on
Cell/B.E. architecture.
typedef struct
{
spe_program_handle_t *spe_task_image;
/* libspe SPE image handle */
unsigned int max_stack_size;
/* The maximum stack size the image requests when it is run. */
} alf_task_info_t_CBEA;
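Filling in this structure might look like the following sketch. The typedef stubs for the libspe type and the my_spe_image variable are placeholders repeated here only so the fragment is self-contained; the 16 KB stack size is an arbitrary illustrative value.

```c
#include <assert.h>

/* Stub stand-ins for libspe, for illustration only. */
typedef struct spe_program_handle { int unused; } spe_program_handle_t;
static spe_program_handle_t my_spe_image; /* normally an embedded SPE image */

typedef struct
{
    spe_program_handle_t *spe_task_image; /* libspe SPE image handle */
    unsigned int max_stack_size;          /* maximum stack size requested */
} alf_task_info_t_CBEA;

/* Hypothetical setup: request a 16 KB stack for the SPE task image. */
static alf_task_info_t_CBEA make_task_info(void)
{
    alf_task_info_t_CBEA tinfo;
    tinfo.spe_task_image = &my_spe_image;
    tinfo.max_stack_size = 16 * 1024;
    return tinfo;
}
```

The resulting structure would then be passed to alf_task_create as its task-information argument.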
Part 3. Programming with ALF
There are several things to consider when programming with ALF.
Basic structure of an ALF application
The basic structure of an ALF application is shown in Figure 9. On the host, you
initialize the ALF runtime and then create a compute task. After the task is created,
you can begin to add work blocks to the work queue of the task. Then, you can
wait for the task to complete and shut down the ALF runtime to release the
allocated resources. On the accelerator, after an instance of the task is spawned, it
waits for pending work blocks to be added to the work queue. Then the
alf_comp_kernel function is called for each work block. If the partition location
attribute of a task is ALF_TASK_ATTR_PARTITION_ON_ACCEL, then the
alf_prepare_input_list function is called before the invocation of the compute
kernel and the alf_prepare_output_list function is called after the compute
kernel exits.
Figure 9. ALF application structure and process flow (host: initialization,
create task, create work blocks, wait for task, termination and exit;
accelerator, driven by the ALF runtime: prepare input DTL, compute kernel,
prepare output DTL)
Chapter 12. Understand the problem
The primary class of problems that ALF is well suited to solve is data parallel
problems. To decide if a problem is suitable for ALF, answer the following three
questions. If the answers are all YES, you can use ALF to solve the problem.
1. Can the problem be parallelized? Certain problems might seem to be
inherently serial at first. However, try to find alternative approaches to divide
the problem into sub-problems. Some or all of the sub-problems can often be
parallelized.
2. Is the parallel problem SPMD-capable? ALF supports the SPMD
parallel-programming style, where one program runs on all accelerators with a
different set of data for each accelerator. The ALF runtime does not guarantee
on which accelerator a specific work block is processed, nor does it keep the
order of completion based on the queuing order. Each work block is a stateless
computation procedure; no context is preserved across succeeding work blocks.
3. Can the SPMD-parallel problem be supported on the current architecture?
Check that the problem is suitable for the specific architectures that ALF
supports. For example, each SPE on the Cell/B.E. processor has a local memory
size of 256 KB. If the data set of the problem cannot be divided into work blocks that fit
into local storage, you should not use ALF for that problem.
Chapter 13. Data layout and partition design for the ALF
implementation on Cell/B.E.
Data partitioning is crucial to the ALF programming model. Improper data
partitioning and data layout design either prevents ALF from being applicable or
results in degraded performance. Data partition and layout is closely coupled with
the design and implementation of the compute kernel, and they should be
considered simultaneously.
Use the following considerations for data layout and partition design:
v Use the proper size for data partitions.
Often, the local memory or data cache of the accelerator is limited. Performance
can degrade if the partitioned data cannot fit into the available memory. For
example, on Cell/B.E. architecture, if a single block of partitioned data is larger
than 128 KB, it might not be possible to support double buffering on the SPE.
This might result in up to 50% performance loss.
v Minimize the amount of data movement.
A large amount of data movement can cause performance loss in applications.
Improve performance by avoiding unnecessary data movements.
v Simplify data movement patterns.
Although the data transfer list feature of ALF enables flexible data gathering and
scattering patterns, it is better to keep the data movement patterns as simple as
possible. Some good examples are sequential access and using contiguous
movements instead of small discrete movements.
v Avoid data reorganization.
Data reorganization requires extra work. It is better to organize data in a way
that suits the usage pattern of the algorithm than to write extra code to
reorganize the data when it is used.
v Know address alignment limitations on Cell/B.E.
Chapter 14. Double buffering on ALF
When transferring data can be done in parallel with the computation, double
buffering can reduce the time lost to data transfer by overlapping it with the
computation time. The ALF runtime implementation on Cell/B.E. architecture
supports three different kinds of double buffering schemes.
See Figure 10 for an illustration of how double buffering works inside ALF. The
ALF runtime evaluates each work block and decides which buffering scheme is
most efficient. At each decision point, if the conditions are met, that buffering
scheme is used. The ALF runtime first checks if the work block uses the
overlapped I/O buffer. If the overlapped I/O buffer is not used, the ALF runtime
next checks the conditions for the four-buffer scheme, then the conditions of the
three-buffer scheme. If the conditions for neither scheme are met, the ALF runtime
does not use double buffering. If the work block uses the overlapped I/O buffer,
the ALF runtime first checks the conditions for the overlapped I/O buffer scheme,
and if those conditions are not met, double buffering is not used.
These examples use the following assumptions:
1. All SPUs have 256 KB of local memory.
2. 16 KB of memory is used for code and runtime data including stack, the task
context buffer, and the data transfer list. This leaves 240 KB of local storage for
the work block buffers.
3. Transferring data in or out of accelerator memory takes one unit of time and
each computation takes two units of time.
4. The input buffer size of the work block is represented as in_size, the output
buffer size as out_size, and the overlapped I/O buffer size as overlap_size.
5. There are three computations to be done on three inputs, which produce three
outputs.
Buffer schemes
The conditions and decision tree are further explained in the examples below.
v Four-buffer scheme: In the four-buffer scheme, two buffers are dedicated for
input data and two buffers are dedicated for output data. This buffer use is
shown in the Four-buffer scheme section of Figure 10.
Figure 10. ALF double buffering (timeline diagram showing the DMA-in,
compute-kernel, and DMA-out activity of each buffer under the four-buffer,
three-buffer, and overlapped I/O buffer schemes)
– Conditions satisfied: The ALF runtime chooses the four-buffer scheme if the
work block does not use the overlapped I/O buffer and the buffer sizes
satisfy the following condition: 2*(in_size + out_size) <= 240 KB.
– Conditions not satisfied: If the buffer sizes do not satisfy the four-buffer
scheme condition, the ALF runtime will check if the buffer sizes satisfy the
conditions of the three-buffer scheme.
v Three-buffer scheme: In the three-buffer scheme, the buffer is divided into three
equally sized buffers of the size max(in_size, out_size). The buffers in this
scheme are used for both input and output as shown in the Three-buffer scheme
section of Figure 10 on page 59. This scheme requires the output data movement
of the previous result to be finished before the input data movement of the next
work block starts, so the DMA operations must be done in order. The advantage
of this approach is that for a specific work block, if the input and output buffer
are almost the same size, the total effective buffer size can be 2*240/3 = 160 KB.
– Conditions satisfied: The ALF runtime chooses the three-buffer scheme if the
work block does not use the overlapped I/O buffer and the buffer sizes
satisfy the following condition: 3*max(in_size, out_size) <= 240 KB.
– Conditions not satisfied: If the conditions are not satisfied, the single-buffer
scheme is used.
v Overlapped I/O buffer scheme: In the overlapped I/O buffer scheme, two
contiguous buffers are allocated as shown in the Overlapped I/O buffer scheme
section of Figure 10 on page 59. The overlapped I/O buffer scheme requires the
output data movement of the previous result to be finished before the input data
movement of the next work block starts.
– Conditions satisfied: The ALF runtime chooses the overlapped I/O buffer
scheme if the work block uses the overlapped I/O buffer and the buffer sizes
satisfy the following condition: 2*(in_size + overlap_size + out_size) <= 240
KB.
– Conditions not satisfied: If the conditions are not satisfied, the single-buffer
scheme is used.
v Single-buffer scheme: If none of the cases outlined above can be satisfied,
double buffering is not used, but performance might not be optimal.
When creating buffers and data partitions, remember the conditions of these
buffering schemes. If your buffer sizes can meet the conditions required for double
buffering, it can result in a performance gain, but double buffering does not
double performance in all cases. When the times required by data movement
and computation are significantly different, the problem becomes either I/O-bound
or computing-bound. In this case, enlarging the buffers to allow more data for a
single computation might improve the performance even with a single buffer.
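The decision tree in this chapter can be summed up in a few lines of arithmetic. The choose_scheme helper and the 240 KB constant below follow the assumptions stated earlier and are illustrative only, not an ALF API:

```c
#include <assert.h>
#include <stddef.h>

/* Free local store left for work block buffers under this chapter's
   assumptions (256 KB minus 16 KB for code and runtime data). */
#define FREE_LOCAL_STORE (240 * 1024)

typedef enum {
    SCHEME_FOUR_BUFFER,
    SCHEME_THREE_BUFFER,
    SCHEME_OVERLAPPED_IO,
    SCHEME_SINGLE_BUFFER
} buffer_scheme_t;

static size_t max_size(size_t a, size_t b) { return a > b ? a : b; }

/* Mirror of the runtime's decision tree described above. */
static buffer_scheme_t choose_scheme(size_t in_size, size_t out_size,
                                     size_t overlap_size, int uses_overlap)
{
    if (uses_overlap) {
        if (2 * (in_size + overlap_size + out_size) <= FREE_LOCAL_STORE)
            return SCHEME_OVERLAPPED_IO;
        return SCHEME_SINGLE_BUFFER;
    }
    if (2 * (in_size + out_size) <= FREE_LOCAL_STORE)
        return SCHEME_FOUR_BUFFER;
    if (3 * max_size(in_size, out_size) <= FREE_LOCAL_STORE)
        return SCHEME_THREE_BUFFER;
    return SCHEME_SINGLE_BUFFER;
}
```

For example, equal 50 KB input and output buffers fit the four-buffer scheme, while equal 70 KB buffers fall through to the three-buffer scheme.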
Chapter 15. ALF host application and data transfer lists
One important decision to make is whether to use accelerator data transfer list
generation.
See Figure 9 on page 53 for the flow diagram of ALF applications on the host.
Because a compute task might use a large number of accelerators and the data
transfer lists might be complex, the host might not be able to generate work
blocks as fast as the accelerators can process them. In that case, you can supply the
data needed for data transfer list generation in the parameters of the work block
and use the accelerators to generate the data transfer lists based on these
parameters. You can also start a new task to be queued while another task is
running. You can then prepare the work blocks for the new task before the task
runs. However, the ALF programming model is SPMD, so the newly created task
starts when the current task is finished.
Chapter 16. Debugging and tuning
For easier debugging, first create a simple compute kernel that gives the correct
results, then focus on optimizing the algorithm to get better performance.
Debugging and tuning on the host are simpler than on the accelerator. An
advantage to programming on ALF is scalability. For most applications, you can
debug them by using a single accelerator and then verifying that everything runs
properly before moving to multiple accelerators. This approach can also help you
recreate bug scenarios where the issue happens only with a specific work block
sequence, because the whole sequence might not be preserved when multiple
accelerators are used.
To improve performance, try any of the following:
v Variants of data partition schemes.
The size of the data partition can have a significant impact on performance. On
memory-constrained architectures like Cell/B.E., this is especially important.
Data partitions that are large compared to the available accelerator memory will
prevent you from using double buffering. Data partitions that are very small in
relation to the available accelerator memory will increase the overhead of
processing work blocks and degrade performance. Try a partition scheme that
requires less than 50% of the total memory size minus the code footprint.
Remember to add the runtime stack size and any other overheads, such as
global data, when sizing the code footprint.
v Accelerator data partitioning.
When there is a large number of accelerators and the work blocks are small or
data layout is complex, the host might become a bottleneck when generating the
data movement descriptions for work blocks. Consider offloading the generation
of data movement descriptions to the accelerators to improve performance.
v Multi-use work blocks.
When there is a large number of small work blocks, the scheduler running on
the host might be heavily loaded and the performance might degrade. In this
case, use multi-use work blocks to reduce the load on the scheduler and
improve the overall performance. Multi-use work blocks actually group several
small work blocks together and allow them to run sequentially on one
accelerator. You might also need to adjust the iteration counts to get the best
performance.
v New data structures.
Data structures and layout schemes can significantly affect the complexity of
data movement and data access speed on the host and accelerator. Typical
considerations include array of structure (AOS) versus structure of array (SOA)
conversion, and row and column transpose. This, however, is closely related to
the algorithm implemented by the compute kernel.
v New algorithms.
The performance of algorithms can differ from architecture to architecture.
Algorithms also affect the way data is organized and moved around. The data
movement can become a performance bottleneck. Using a less advanced
algorithm that requires simpler data movement might improve the overall
performance.
Chapter 17. Matrix addition example
This is a simple application that adds two two-dimensional matrixes and stores
the result in a third matrix. For two-dimensional matrix addition, the
mathematical definition is as follows:
C = A + B, where c(i,j) = a(i,j) + b(i,j), as shown in Figure 11.
Simple solution
In the following analysis, assume the data to be a 1024x512 single-precision
floating point matrix. The following is a piece of plain C code that solves the
problem:
float mat_a[1024][512];
float mat_b[1024][512];
float mat_c[1024][512];
int main(void)
{
int i,j;
for (i=0; i<1024; i++)
for (j=0; j<512; j++)
mat_c[i][j] = mat_a[i][j] + mat_b[i][j];
return 0;
}
The limitation of this simple approach is that it cannot be made to run faster on
a system with many accelerators that could process the c(i,j) = a(i,j) + b(i,j)
operations in parallel. There are parallel programming languages and models that
can speed up the program.
Potential solution for parallel speed increase
In general, most matrix math operations can be decomposed into similar
operations on many submatrixes. The operations on these submatrixes can be done
in parallel if there are no dependencies between them. For example, take a
1024x512 matrix and divide the matrix into 128 submatrixes, each of which has
64x64 elements. Then the operation can be done on each 64x64 submatrix in
parallel. In theory, the computation of the 1024x512 matrix addition can be
completed in 1/128th of the time of the simple serialized code.
Figure 11. Two-dimensional matrix addition
Partition scheme
Two-dimensional matrixes are usually represented in two-dimensional arrays in C
code.
The actual memory layout of a two-dimensional array in C code is a single
one-dimensional array in which the rows are stored contiguously, one after
another, as shown in Figure 12 and Figure 13.
In the matrix addition example, the submatrixes were the basic unit of data. In
this C matrix data structure, a
submatrix is part of the whole array as shown in Figure 12 and Figure 13. This
provides the following partition schemes to choose from:
v Partition Scheme A: With this partition scheme, the submatrixes are a part of the
whole column or row of the matrix. One of the submatrixes of "a[m][n]" is
defined as "sa[h][v]", where h < m and v <= n.
Figure 12. Memory organization of a two-dimensional array "a[m][n]", part A
Figure 13. Memory organization of a two-dimensional array "a[m][n]", part B
v Partition Scheme B: With this partition scheme, the submatrixes are defined as a
set of adjacent full-length rows of the matrix. One of the submatrixes of
"a[m][n]" is defined as "sa[h][v]", where h < m and v == n.
The data of the submatrix in partition scheme A is collected from disjoint
segments in the data buffer of the matrix. For partition scheme B, the submatrix is
from one contiguous segment of the matrix. Mathematically, this makes no
significant difference, but the data movement in our matrix addition example is
significantly more complex in partition scheme A than in partition scheme B, as
can be seen in the following example code.
float a[m][n], b[m][n], c[m][n];
{
int i,j,k;
float sa[h][v], sb[h][v], sc[h][v];
// Partition Scheme A
for (i=0; i<m; i+=h)
for (j=0; j<n; j+=v)
{
for(k=0; k<h; k++)
{
data_move(&sa[k][0], &a[i+k][j], v*sizeof(float));
data_move(&sb[k][0], &b[i+k][j], v*sizeof(float));
}
call_mat_add_kernel(sa, sb, sc, h,v);
for(k=0; k<h; k++)
data_move(&c[i+k][j], &sc[k][0], v*sizeof(float));
}
// Partition Scheme B
for (i=0; i<m; i+=h)
{
data_move(&sa[0][0], &a[i][0], v*h*sizeof(float));
data_move(&sb[0][0], &b[i][0], v*h*sizeof(float));
call_mat_add_kernel(sa, sb, sc, h,v);
data_move(&c[i][0], &sc[0][0], v*h*sizeof(float));
}
}
Figure 14. Partition scheme A: Data partition of a two-dimensional submatrix
Figure 15. Partition scheme B: Data partition of a two-dimensional submatrix
Based on the above analysis, partition scheme B is preferred in this matrix addition
example. Remember that this situation might change in some real world scenarios
where large contiguous data movement might not be supported.
Example compute kernel
After the data partition scheme has been defined, implement the compute kernel.
The following is a simple example of a compute kernel:
FILE: my_header.h
typedef struct _add_parms_t
{
unsigned int h;
unsigned int v;
} add_parms_t;
FILE: my_kernel.c
#include <alf_accel.h>
#include "my_header.h"
int alf_comp_kernel(void *p_parm_ctx_buffer,
void *p_input_buffer, void *p_output_buffer,
unsigned int current_count, unsigned int total_count)
{
unsigned int i, cnt;
float *sa, *sb, *sc;
add_parms_t *p_parm = (add_parms_t *) p_parm_ctx_buffer;
cnt = p_parm->h * p_parm->v;
sa = (float *) p_input_buffer;
sb = sa + cnt;
sc = (float *) p_output_buffer;
for(i=0; i<cnt; i++)
sc[i] = sa[i] + sb[i];
return 0;
}
The main thread and data transfer lists
This example shows the implementation of the main thread using the ALF
implementation on Cell/B.E. architecture. To prevent errors caused by address
alignment in the example code of the simple compute kernel in the previous
section, all of the data is aligned when it is defined.
FILE: my_main.c
#include <alf.h>
#include <string.h>
#include "my_header.h"
#define H 16
#define V 512
#define MY_ALIGN(_my_var_def_, _my_al_) _my_var_def_ \
__attribute__((__aligned__(_my_al_)))
MY_ALIGN(float mat_a[1024][512], 128);
MY_ALIGN(float mat_b[1024][512], 128);
MY_ALIGN(float mat_c[1024][512], 128);
spe_program_handle_t spe_matrix_add;
int main(void)
{
alf_handle_t half;
alf_task_handle_t htask;
alf_wb_handle_t hwb;
alf_task_info_t tinfo;
alf_task_info_t_CBEA spe_tsk;
add_parms_t parm;
int i, nodes;
alf_configure(NULL);
alf_query_system_info(ALF_INFO_NUM_ACCL_NODES, &nodes);
alf_init(&half, nodes, ALF_INIT_PERSIST);
spe_tsk.spe_task_image = spe_matrix_add;
spe_tsk.max_stack_size = 4096;
memset(&tinfo, 0, sizeof(tinfo));
tinfo.p_task_info = &spe_tsk;
tinfo.parm_ctx_buffer_size = sizeof(add_parms_t);
tinfo.input_buffer_size = H*V*2*sizeof(float); //64k
tinfo.output_buffer_size = H*V*sizeof(float); // 32k
tinfo.dt_list_entries = 0; // let the runtime decide this
tinfo.task_attr = 0; // do partition on the host
alf_task_create(&htask, half, &tinfo);
parm.h=H;
parm.v=V;
for(i=0; i<1024; i+=H)
{
alf_wb_create (&hwb, htask, ALF_WB_SINGLE, 1);
alf_wb_add_parm (hwb, &parm, sizeof(parm),
ALF_DATA_BYTE, 0);
alf_wb_add_io_buffer (hwb,&mat_a[i][0], H*V*sizeof(float),
ALF_DATA_FLOAT, ALF_BUFFER_INPUT);
alf_wb_add_io_buffer (hwb,&mat_b[i][0], H*V*sizeof(float),
ALF_DATA_FLOAT, ALF_BUFFER_INPUT);
alf_wb_add_io_buffer (hwb,&mat_c[i][0], H*V*sizeof(float),
ALF_DATA_FLOAT, ALF_BUFFER_OUTPUT);
alf_wb_enqueue(hwb);
}
alf_task_wait(&htask, -1);
alf_task_destroy(&htask);
alf_exit(&half, ALF_SHUTDOWN_WAIT);
return 0;
}
Chapter 18. Matrix transpose example
A two-dimensional matrix transpose is a common operation in matrix
computations and is defined as flipping the columns and rows by swapping the
indexes of matrix elements.
Figure 16 shows the data movement patterns of a two-dimensional matrix
transpose.
Simple solution
This implementation assumes that the data represents a 1024x512 single-precision
floating point matrix. This piece of plain C code performs the two-dimensional
matrix transpose.
float mat_a[1024][512];
float mat_c[512][1024];
int main(void)
{
int i,j;
for (i=0; i<512; i++)
for (j=0; j<1024; j++)
mat_c[i][j] = mat_a[j][i];
return 0;
}
Potential solution for parallel speed increase
Similar to the matrix addition example, matrix transpose can also be decomposed
into transposes on many submatrixes. The operations on these submatrixes can be
done in parallel to increase the speed of the process. The submatrix approach is
used throughout this chapter.
Partition scheme
As with the matrix addition example, the data partition parameters (h and v) must
be determined, but there is one difference between a matrix transpose problem and
matrix addition: as the submatrix is transposed from the source to the destination,
contiguous data movement from the source results in very fragmented data
movements. To address this, select a compromise between the size of one
contiguous data segment and the number of data transfers.
Figure 16. Matrix transpose in detail (c(i,j) = a(j,i))
In Figure 17 and Figure 18, the matrixes "a[m][n]" and "c[n][m]" are partitioned
into submatrixes, and "Axy[h][v]" is transposed to "Cyx[v][h]".
The following example program describes the data movement patterns of the
problem. The goal is to maximize the memory usage of the accelerator so that the
size of the submatrixes is constant. The total number of data movements is the
sum of the input and output data movements. In mathematical language, the
problem can be expressed as: Since h * v ≈ Constant, what h and v result in the
minimum h + v? When you look for integer solutions, the optimum result is found
when h and v are equal, or as close as possible.
float a[m][n], c[n][m];
{
int i,j,k;
float sa[h][v], sc[v][h];
for (i=0; i<m; i+=h)
for (j=0; j<n; j+=v)
{
for(k=0; k<h; k++)
data_move(&sa[k][0], &a[i+k][j], v*sizeof(float));
call_mat_transpose_kernel(sa, sc, h,v);
for(k=0; k<v; k++)
data_move(&c[j+k][i], &sc[k][0], h*sizeof(float));
}
}
Figure 17. Data partition of matrix transpose, part A
Figure 18. Data partition of matrix transpose, part B
Example compute kernel
The compute kernel is implemented in C code without SPE Single Instruction
Multiple Data (SIMD) intrinsics for instructional purposes. You can optimize this
code using SIMD and loop unrolling for better performance.
FILE: my_header.h
typedef struct _trans_parms_t
{
unsigned int h;
unsigned int v;
} trans_parms_t;
FILE: my_kernel.c
#include <alf_accel.h>
#include "my_header.h"
int alf_comp_kernel(void *p_parm_ctx_buffer,
void *p_input_buffer, void *p_output_buffer,
unsigned int current_count, unsigned int total_count)
{
unsigned int i, j;
float *sa, *sc;
trans_parms_t *p_parm = (trans_parms_t *)p_parm_ctx_buffer;
sa = (float *) p_input_buffer;
sc = (float *) p_output_buffer;
for(i=0; i< p_parm->h; i++)
for(j=0; j< p_parm->v; j++)
*(sc+j*p_parm->h + i) /* sc[j][i] */
= *(sa+i*p_parm->v +j); /* sa[i][j] */
return 0;
}
The main thread and data transfer lists
There are two approaches to the implementation of the main thread by different
data transfer list generation policies. In the first approach, data transfer list
generation is accomplished on the host. In the second approach, the data transfer
list generation is done on the accelerator with the help of the input parameters.
The following sections compare these two approaches:
v Data transfer lists generated in the host: The following example shows the
implementation of the main thread with the data list generated on the host
using the ALF implementation on Cell/B.E. architecture. The data must be
aligned properly when it is defined.
– FILE: my_main.c
#include <string.h>
#include <alf.h>
#include "my_header.h"
#define H 128
#define V 128
#define MY_ALIGN(_my_var_def_, _my_al_) _my_var_def_ \
__attribute__((__aligned__(_my_al_)))
MY_ALIGN(float mat_a[1024][512], 128);
MY_ALIGN(float mat_c[512][1024], 128);
int main(void)
{
// ... same as before
tinfo.parm_ctx_buffer_size = sizeof(trans_parms_t);
tinfo.input_buffer_size = H*V*sizeof(float); //64k
tinfo.output_buffer_size = H*V*sizeof(float); // 64k
tinfo.dt_list_entries = max(H, V);
tinfo.task_attr = 0;
alf_task_create(&htask, half, &tinfo);
parm.h=H;
parm.v=V;
for(X=0; X<1024; X+=H)
for(Y=0; Y<512; Y+=V)
{
alf_wb_create (&hwb, htask, ALF_WB_SINGLE, 1);
alf_wb_add_parm (hwb, &parm, sizeof(parm),
ALF_DATA_BYTE, 0);
for(i=0; i<H; i++)
alf_wb_add_io_buffer (hwb, &mat_a[X+i][Y], V*sizeof(float),
ALF_DATA_FLOAT, ALF_BUFFER_INPUT);
for(j=0; j<V; j++)
alf_wb_add_io_buffer (hwb, &mat_c[Y+j][X], H*sizeof(float),
ALF_DATA_FLOAT, ALF_BUFFER_OUTPUT);
alf_wb_enqueue(hwb);
}
// ... same as before
return 0;
}
– In this solution, all data transfer list generation logic is run on the host. This
is reasonable when there are few work blocks. Otherwise, this can create a
performance bottleneck because the accelerators might be idle while waiting
for the data transfer list generation to finish on the host. Better performance
can be expected from the second solution in this situation, where the
capability of each accelerator is fully used.
v Data transfer lists generated in the accelerator: The following example shows
the implementation of the main thread with the data list generated on the
accelerator using the ALF implementation on Cell/B.E. architecture. This
solution does not call alf_wb_add_io_buffer. To prepare the data transfer list on
the accelerator, the task_attr field in the task information structure must be set
to ALF_TASK_ATTR_PARTITION_ON_ACCEL. Also, more information needs to be
passed to the accelerator in the parameters, so the trans_parms_t data structure is also
expanded to include more information.
– FILE: my_header.h
typedef struct _trans_parms_t
{
unsigned int h;
unsigned int v;
unsigned int DIMX, DIMY;
unsigned int X, Y;
float *p_mat_a, *p_mat_c;
} trans_parms_t;
FILE: my_main.c
#include <alf.h>
#include <string.h>
#include "my_header.h"
#define H 128
#define V 128
#define MY_ALIGN(_my_var_def_, _my_al_) _my_var_def_ \
__attribute__((__aligned__(_my_al_)))
MY_ALIGN(float mat_a[1024][512], 128);
MY_ALIGN(float mat_c[512][1024], 128);
int main(void)
{
// ... same as before
tinfo.parm_ctx_buffer_size = sizeof(trans_parms_t);
tinfo.input_buffer_size = H*V*sizeof(float); //64k
tinfo.output_buffer_size = H*V*sizeof(float); // 64k
tinfo.dt_list_entries = max(H, V);
tinfo.task_attr = ALF_TASK_ATTR_PARTITION_ON_ACCEL;
alf_task_create(&htask, half, &tinfo);
parm.h=H;
parm.v=V;
parm.DIMX = 1024;
parm.DIMY = 512;
parm.p_mat_a = &mat_a[0][0];
parm.p_mat_c = &mat_c[0][0];
for(X=0; X<1024; X+=H)
for(Y=0; Y<512; Y+=V)
{
parm.X = X;
parm.Y = Y;
alf_wb_create (&hwb, htask, ALF_WB_SINGLE, 1);
alf_wb_add_parm (hwb, &parm, sizeof(parm),
ALF_DATA_BYTE, 0);
alf_wb_enqueue(hwb);
}
// ... same as before
return 0;
}
Data transfer lists
Below are two accelerator APIs used to generate the data input or output transfer
lists respectively. See the API definition descriptions for more information on the
parameters.
v alf_prepare_input_list
v alf_prepare_output_list
These two macros are used to create the data transfer list structure and create the
data transfer list entries:
v ALF_DT_LIST_CREATE
v ALF_DT_LIST_ADD_ENTRY
Note that the data transfer list must be initialized with the first macro before
appending an element to it. The following is the implementation for this example:
FILE: my_kernel.c
// omitted the previous kernel section here
int alf_prepare_input_list(void *p_task_context,
void *p_parm_ctx_buffer,
void *p_dt_list_buffer,
unsigned int current_count,
unsigned int total_count)
{
trans_parms_t *p_parm = (trans_parms_t *)p_parm_ctx_buffer;
float *pA;
unsigned int i;
addr64 ea;
pA = p_parm->p_mat_a +
p_parm->DIMY*p_parm->X + p_parm->Y; // mat_a[X][Y]
ALF_DT_LIST_CREATE(p_dt_list_buffer,0);
ea.ui[0] = 0;
for(i=0; i<p_parm->h; i++)
{
ea.ui[1] = (unsigned int)(pA + p_parm->DIMY*i); // mat_a[X+i][Y]
ALF_DT_LIST_ADD_ENTRY(p_dt_list_buffer,
p_parm->v,
ALF_DATA_FLOAT,
ea);
}
return 0;
}
int alf_prepare_output_list(void *p_task_context,
void *p_parm_ctx_buffer,
void *p_dt_list_buffer,
unsigned int current_count,
unsigned int total_count)
{
trans_parms_t *p_parm = (trans_parms_t *)p_parm_ctx_buffer;
float *pC;
unsigned int j;
addr64 ea;
pC = p_parm->p_mat_c +
p_parm->DIMX*p_parm->Y + p_parm->X; // mat_c[Y][X]
ALF_DT_LIST_CREATE(p_dt_list_buffer,0);
ea.ui[0] = 0;
for(j=0; j<p_parm->v; j++)
{
ea.ui[1] = (unsigned int)(pC + p_parm->DIMX*j); // mat_c[Y+j][X]
ALF_DT_LIST_ADD_ENTRY(p_dt_list_buffer,
p_parm->h,
ALF_DATA_FLOAT,
ea);
}
return 0;
}
Debugging and tuning
In the matrix transpose example, tuning the main thread will yield little
improvement, so the focus is on optimizing the compute kernel. An appropriate
strategy is to use SIMD and loop unrolling.
For a more thorough guide on how to optimize SPE code, see the SDK 2.1
Programmer’s Guide, which is listed in “Related documentation” on page 105. The
following example shows the compute kernel from the matrix transpose example
optimized with SPE intrinsics and some loop unrolling:
FILE: my_kernel.c
#include <alf_accel.h>
#include "my_header.h"
int alf_comp_kernel(void *p_parm_ctx_buffer,
void *p_input_buffer, void *p_output_buffer,
unsigned int current_count, unsigned int total_count)
{
unsigned int i, j;
vector float *sa, *sc;
trans_parms_t *p_parm = (trans_parms_t *)p_parm_ctx_buffer;
const vector unsigned char step1_pattern1 =
{0, 1, 2, 3, 16, 17, 18, 19, 4, 5, 6, 7, 20, 21, 22, 23};
const vector unsigned char step1_pattern2 =
{8, 9, 10, 11, 24, 25, 26, 27, 12, 13, 14, 15, 28, 29, 30, 31};
const vector unsigned char step2_pattern1 =
{0, 1, 2, 3, 4, 5, 6, 7, 16, 17, 18, 19, 20, 21, 22, 23};
const vector unsigned char step2_pattern2 =
{8, 9, 10, 11, 12, 13, 14, 15, 24, 25, 26, 27, 28, 29, 30, 31};
vector float f1, f2, f3, f4;
vector float tmp1, tmp2, tmp3, tmp4;
sa = (vector float *) p_input_buffer;
sc = (vector float *) p_output_buffer;
for(i=0; i< p_parm->h; i+=4)
for(j=0; j< p_parm->v; j+=4)
{
f1 = *(sa+(i )*p_parm->v/4 + j/4);
f2 = *(sa+(i+1)*p_parm->v/4 + j/4);
f3 = *(sa+(i+2)*p_parm->v/4 + j/4);
f4 = *(sa+(i+3)*p_parm->v/4 + j/4);
tmp1 = spu_shuffle(f1, f2, step1_pattern1);
tmp2 = spu_shuffle(f1, f2, step1_pattern2);
tmp3 = spu_shuffle(f3, f4, step1_pattern1);
tmp4 = spu_shuffle(f3, f4, step1_pattern2);
f1 = spu_shuffle(tmp1, tmp3, step2_pattern1);
f2 = spu_shuffle(tmp1, tmp3, step2_pattern2);
f3 = spu_shuffle(tmp2, tmp4, step2_pattern1);
f4 = spu_shuffle(tmp2, tmp4, step2_pattern2);
*(sc+(j )*p_parm->h/4 + i/4) = f1;
*(sc+(j+1)*p_parm->h/4 + i/4) = f2;
*(sc+(j+2)*p_parm->h/4 + i/4) = f3;
*(sc+(j+3)*p_parm->h/4 + i/4) = f4;
}
}
return 0;
}
Chapter 19. Vector min-max example
The current version of ALF has several new features such as the task context
buffer, overlapped I/O buffers, and synchronization points. The example in this
section shows some of these new features.
Understand the problem
The problem is defined as finding the maximum and minimum element values in a
vector sequence. Solve amin = min(a(i,j)) and amax = max(a(i,j)), where a(i,j) are
elements of the vector sequence At = {a(0,t), a(1,t), ..., a(n-1,t)}, t <= T, as
defined:
v A0 = {a(0,0), a(1,0), ..., a(n-1,0)} is the given initial value;
v At = F(At-1), where a(j,t) = c0*a(j,t-1) + c1*a(j+1,t-1) + c2*a(j+2,t-1) +
c3*a(j+3,t-1), and references to a(j,t-1) are defined as zero for all j > n-1.
Simple solution
The sequential code that solves this problem is shown in the example below. From
the definition of F(A), the calculation of a[i] does not rely on previously calculated
values of a[i]. You can overwrite the old value of each a[i] with its new value
during the computation, so the same buffer could be used for all of the
calculations. Also, a boundary condition check would normally be needed as the
index of A approaches n-3, because the indexes of the elements used in the
calculation would then fall outside the bounds of the array. Because boundary
condition checks during the calculation degrade performance, it is better to
avoid them. To do this, extend the array that stores the vector with three extra
elements that are initialized to zero.
#include <time.h>
#include <stdlib.h>
#define N (1024*1024)
#define T (4096)
#define C0 ((float)0.30)
#define C1 ((float)0.30)
#define C2 ((float)0.20)
#define C3 ((float)0.20)
// define the array to have more elements to remove the need to
// perform boundary condition checks when the calculation approaches a[N-3]
float a[N+3];
float amin, amax;
int main(void)
{
unsigned int i,j;
// init the array for initial value
srand(time(NULL));
for(i=0; i<N; i++)
a[i] = (float)(rand()%16384 - 8192);
for(; i<N+3; i++)
a[i] = (float)0.0;
// init the amin, amax
amin = amax = a[0];
// for the initial A0
for(i=0; i<N; i++)
{
if(amin > a[i]) amin = a[i];
else if(amax < a[i]) amax = a[i];
}
for(j=1; j<=T; j++)
{
// calculate A[t]
for(i=0; i<N; i++)
{
a[i] = C0*a[i] + C1*a[i+1] + C2*a[i+2] + C3*a[i+3];
if(amin > a[i]) amin = a[i];
else if(amax < a[i]) amax = a[i];
}
}
// here is the result
}
Potential solution for parallel speed increase
The calculation of new element values of vector A is based only on old values, so
in theory, the values of these elements can be calculated at the same time for each
new At. In reality, the parallelism of a system is limited, and it is reasonable to
divide the vector into several blocks and compute the blocks in parallel. For the
vector min-max problem:
1. Divide work into multiple blocks
2. Calculate the result for each work block
3. Merge these results when all the calculations are complete
Partition scheme
The vector min-max problem can be solved by dividing the data buffer into work
blocks. The input data is a one-dimensional array, so the input array is divided
into smaller portions for the work blocks. In a parallel programming environment
like ALF, there is no strict control over the order of work block processing, so the
same buffer cannot be reused to store input and output.
In the vector min-max example, two buffers are used to hold the data for steps
t = 0,2,4,... and t = 1,3,5,... respectively. When calculating step t, the reference
data comes from buffer (t-1) modulo 2 and the result is written to buffer t modulo 2.
Task context buffer
In the vector min-max example, the task context buffer provides an opportunity to
accelerate the min-max process in a parallel model. Each instance of the task
running on the accelerators will have its own context, so it can save the most
recent min-max values found in the work blocks it processes. Then, after the task is
finished, the host program can do a trivial reduction of the local results to derive
the global results. Another usage of the task context is to communicate the global
parameters that apply to each work block. In this example, the parameters of the
function F() and the partition size can be passed to the accelerator by using the
task context buffer.
Overlapped I/O buffer
For a single work block, the input data can be overwritten during the computation
of the sequential code. The vector min-max example uses the overlapped I/O
buffer to reduce the memory requirement of the work block to make it possible to
enlarge the data partition to about half of the available memory.
Barrier
The ALF programming model provides a barrier synchronization point to ensure
that tasks are completed in the correct order.
In the figure below, you can see that one task depends on data from another task.
The computations of step t require the computations of step t-1 to be finished to
make sure the reference data is updated to the most recent values.
The code list
The following is a complete listing of the code, rather than a step-by-step
walkthrough like the previous examples. Note that the code listed here is not
fully optimized, to make it easier to read. The bold code lines show key parts of the code.
/***************************************************************
* my_header.h
* shared definitions header
***************************************************************/
typedef struct _my_parm_t
{
float * addr_a; //input/output data address
float * addr_b; //output/input data address
int size; //problem size
unsigned char dummy[16 - 12]; //dummy for alignment
} my_parm_t;
Figure 19. Data partition and task dependency of the vector computation example
// writable section of task context
typedef struct _my_task_context_w
{
float max; //max float
float min; //min float
unsigned char dummy[16 - 8]; //dummy for alignment
} my_task_context_w;
// read-only section of task context
typedef struct _my_task_context_r
{
// read-only section
float c0; //C0
float c1; //C1
float c2; //C2
float c3; //C3
} my_task_context_r;
/***************************************************************
* spu.c
* Accelerator code
***************************************************************/
#include <alf_accel.h>
#include "my_header.h"
int alf_comp_kernel(volatile void *p_task_context,
volatile void *p_parm_ctx_buffer,
volatile void *p_input_buffer,
volatile void *p_output_buffer,
unsigned int current_count,
unsigned int total_count)
{
my_task_context_r *p_ctx_r = (my_task_context_r *) p_task_context;
my_task_context_w *p_ctx_w =
(my_task_context_w *)((char *)p_task_context+sizeof(my_task_context_r));
my_parm_t *p_parm = (my_parm_t *) p_parm_ctx_buffer;
float *a = (float *)p_input_buffer;
float *b = (float *)p_output_buffer;
float c0 = p_ctx_r->c0;
float c1 = p_ctx_r->c1;
float c2 = p_ctx_r->c2;
float c3 = p_ctx_r->c3;
int size = p_parm->size;
int i;
for(i=0;i<size;i++)
{
b[i]=c0*a[i]+c1*a[i+1]+c2*a[i+2]+c3*a[i+3];
if(b[i]>p_ctx_w->max)
p_ctx_w->max = b[i];
if(b[i]<p_ctx_w->min)
p_ctx_w->min = b[i];
}
return 0;
}
int alf_prepare_input_list(void *p_task_context,
void *p_parm_ctx_buffer,
void *p_dt_list_buffer,
unsigned int current_count,
unsigned int total_count)
{
my_parm_t *p_parm = (my_parm_t *) p_parm_ctx_buffer;
float *addr = p_parm->addr_a;
int size;
int small_size;
int i,left;
addr64 ea;
size = p_parm->size + 4; // do not forget the boundary ones
// now decide the per DT list entry size
// this is because of the 16KB per DMA entry limitation of MFC
small_size = (size)/(16*1024/sizeof(float));
left = size%(16*1024/sizeof(float));
ALF_DT_LIST_CREATE(p_dt_list_buffer,0);
for(i=0;i<small_size;i++)
{
ea.ui[0] = 0;
ea.ui[1] = (unsigned int)addr+i*16*1024+current_count*size*sizeof(float);
ALF_DT_LIST_ADD_ENTRY(p_dt_list_buffer, 16*1024, ALF_DATA_BYTE, ea);
}
ea.ui[1] = (unsigned int)addr+small_size*16*1024
+current_count*size*sizeof(float);
ALF_DT_LIST_ADD_ENTRY(p_dt_list_buffer, left, ALF_DATA_FLOAT, ea);
return 0;
}
int alf_prepare_output_list(void* p_task_context,
void *p_parm_ctx_buffer,
void *p_dt_list_buffer,
unsigned int current_count,
unsigned int total_count)
{
my_parm_t *p_parm = (my_parm_t *) p_parm_ctx_buffer;
float *addr=p_parm->addr_b;
int size=p_parm->size;
int small_size;
int i,left;
addr64 ea;
// this is because of the 16KB per DMA entry limitation of MFC
small_size=size/(16*1024/sizeof(float));
left=size%(16*1024/sizeof(float));
ALF_DT_LIST_CREATE(p_dt_list_buffer,0);
for(i=0;i<small_size;i++)
{
ea.ui[0] = 0;
ea.ui[1] = (unsigned int)addr+i*16*1024+current_count*size*sizeof(float);
ALF_DT_LIST_ADD_ENTRY(p_dt_list_buffer, 16*1024, ALF_DATA_BYTE, ea);
}
ea.ui[1] = (unsigned int)addr+small_size*16*1024
+current_count*size*sizeof(float);
ALF_DT_LIST_ADD_ENTRY(p_dt_list_buffer, left, ALF_DATA_FLOAT, ea);
return 0;
}
/***************************************************************
* main.c
* Host code
***************************************************************/
#include <stdio.h>
#include <libmisc.h>
#include <float.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <alf.h>
#include "my_header.h"
#define N (1024*1024)
#define T (128)
#define BLOCK_SIZE (16*1024)
#define C0 ((float)0.30)
#define C1 ((float)0.30)
#define C2 ((float)0.20)
#define C3 ((float)0.20)
// define the array to have more elements for simplification
// of boundary condition check when the calculation comes to a[N-3]
float a[2][N+4] __attribute__ ((aligned (128)));
float amin, amax;
extern spe_program_handle_t spe_ops;
void prepare_data()
{
int i;
for(i=0; i<N; i++)
{
a[0][i] = (float)(rand()%8192 - 4096);
}
// init the padding
for(i=N; i<N+4; i++)
{
a[1][i] = a[0][i] = (float)0.0f;
}
// init the amin, amax
amin = amax = a[0][0];
// for the initial A0
for(i=0; i<N; i++)
{
if(amin > a[0][i]) amin = a[0][i];
else if(amax < a[0][i]) amax = a[0][i];
}
}
int main(void)
{
alf_handle_t half;
alf_task_handle_t htask;
alf_wb_handle_t hwb;
alf_task_info_t tinfo;
alf_task_info_t_CBEA spe_tsk;
my_parm_t parm __attribute__ ((aligned (128)));
my_task_context_w *pctx_w;
my_task_context_r *pctx_r;
alf_wb_sync_handle_t hsync;
int i,j,instances;
unsigned int nodes;
prepare_data();
alf_configure(NULL);
alf_query_system_info(ALF_INFO_NUM_ACCL_NODES, &nodes);
alf_init(&half, nodes, ALF_INIT_PERSIST);
spe_tsk.spe_task_image = &spe_ops;
spe_tsk.max_stack_size = 4096; // make your best guess :-)
memset(&tinfo, 0, sizeof(tinfo));
tinfo.p_task_info = &spe_tsk;
tinfo.task_context_buffer_read_only_size = sizeof(my_task_context_r);
tinfo.task_context_buffer_writable_size = sizeof(my_task_context_w);
tinfo.parm_ctx_buffer_size = sizeof(my_parm_t);
tinfo.input_buffer_size = 0;
tinfo.overlapped_buffer_size = (BLOCK_SIZE+4)*sizeof(float);
//+4 for adjacent reference values
tinfo.output_buffer_size = 0;
tinfo.dt_list_entries = (BLOCK_SIZE+4)/(16*1024/sizeof(float)) + 1;
tinfo.task_attr = ALF_TASK_ATTR_PARTITION_ON_ACCEL;
instances = alf_task_create(&htask, half, &tinfo);
// prepare the task contexts
pctx_w = malloc_align(sizeof(my_task_context_w)*instances, 4);
pctx_r = malloc_align(sizeof(my_task_context_r), 4);
// init and assign the context buffers
pctx_r->c0=C0;
pctx_r->c1=C1;
pctx_r->c2=C2;
pctx_r->c3=C3;
for(i=0; i<instances; i++)
{
alf_task_context_handle_t hctx;
alf_task_context_create(&hctx, htask, 0 /* i+1 is ok too */);
alf_task_context_add_entry(hctx, pctx_r,sizeof(my_task_context_r),
ALF_DATA_BYTE, ALF_TASK_CONTEXT_READ);
pctx_w[i].min =amin;
pctx_w[i].max =amax;
alf_task_context_add_entry(hctx, &pctx_w[i], sizeof(my_task_context_w),
ALF_DATA_BYTE, ALF_TASK_CONTEXT_WRITABLE);
alf_task_context_register(hctx);
}
// now comes the real computation
for(j=0;j<T;j++)
{
for(i=0; i<N/BLOCK_SIZE; i++)
{
parm.addr_a = &a[j%2][i*BLOCK_SIZE];
parm.addr_b = &a[(j+1)%2][i*BLOCK_SIZE];
parm.size = BLOCK_SIZE;
alf_wb_create(&hwb, htask, ALF_WB_SINGLE, 1);
alf_wb_add_parm(hwb, &parm, sizeof(my_parm_t), ALF_DATA_BYTE, 0);
alf_wb_enqueue(hwb);
}
// add a barrier to make sure the different steps do not overlap
alf_wb_sync(&hsync, htask, ALF_SYNC_BARRIER, NULL, NULL, 0);
}
alf_task_wait(htask, -1);
alf_task_destroy(&htask);
// all done, reduce the max value to get the final result
amax = pctx_w[0].max;
amin = pctx_w[0].min;
for(i=1; i<instances; i++)
{
if(pctx_w[i].max > amax)
amax = pctx_w[i].max;
if(pctx_w[i].min < amin)
amin = pctx_w[i].min;
}
free_align(pctx_w);
free_align(pctx_r);
printf("The maximum element value is %f, the minimum element value is %f\n",
amax,amin);
alf_exit(&half, ALF_SHUTDOWN_WAIT);
return 0;
}
Part 4. Platform specific constraints for the ALF
implementation on Cell/B.E. architecture
© Copyright IBM Corp. 2006, 2007 87
Chapter 20. SPU resources reserved and used
Tags
ALF reserves the MFC DMA tags from 15 to 23 for internal use. SPU applications
should avoid using these reserved tags.
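As a defensive measure, an SPU application can funnel its own tag allocation through a helper that never hands out a reserved tag. The helper below is purely illustrative (it is not part of the ALF API); it cycles through the tag IDs that remain available to the application, 0 to 14 and 24 to 31:

```c
/* Illustrative helper (not part of the ALF API): hand out MFC DMA tag
 * IDs for application use, skipping the range 15-23 reserved by ALF. */
static unsigned int next_tag = 0;

unsigned int app_alloc_tag(void)
{
    unsigned int tag = next_tag;
    next_tag = (next_tag + 1) % 32;     /* 32 tag IDs exist in total  */
    if (next_tag == 15)                 /* jump over the reserved run */
        next_tag = 24;
    return tag;
}
```

Each pass through the cycle yields the 23 non-reserved tags before wrapping around.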
Cache line reservation
ALF uses the cache line reservation feature. For more information on cache line
reservations, refer to the SDK 2.1 Programmer’s Guide, which is listed in “Related
documentation” on page 105. Because only one reservation is allowed at a time in
the Atomic Unit of MFC, a cache line reservation might not be guaranteed when
the execution context returns to ALF. For example, all cache line reservations made
within alf_prepare_input_list, alf_prepare_output_list, and alf_comp_kernel
might not be preserved when these functions return control to the ALF runtime.
Chapter 21. Memory constraints
Local memory
The size of local memory on the accelerator is 256 KB and is shared by code and
data. Memory is not virtualized and is not protected. See Figure 20 for a typical
memory map of an SPU program. There is a runtime stack above the global data
memory section. The stack grows from the higher address to the lower address
until it reaches the global data section. Due to limitations of the programming
languages and compiler/linker tools, the maximum stack usage cannot be
predicted, either while developing the application or when it is loaded. If the stack
requires more memory than was allocated, there will be a stack overflow
exception. When there is a stack overflow, the SPU application is shut down and a
message is sent to the PowerPC Processing Element (PPE).
ALF allocates the work block buffers directly from the memory region above the
runtime stack, as shown in Figure 21 on page 92. This is implemented by moving
the stack pointer (or equivalently by pushing a large amount of data into the
stack). To ALF, the larger the buffer is, the better it can optimize the performance
of a task by using techniques like double buffering. It is better to let ALF allocate
as much memory as possible from the runtime stack. Estimate the size of the stack
and use that value in the alf_task_info_t_CBEA data structure when the task is
created. If the estimated stack size is too small, a stack overflow occurs at
runtime, causing unpredictable failures such as incorrect results or a bus error.
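A rough budget calculation helps when choosing these sizes. The helper below is a sketch with illustrative numbers, not an ALF interface: it subtracts the code and data footprint and the stack estimate from the 256 KB local store to see what remains for ALF's work block buffers.

```c
/* Sketch: local store budget arithmetic (illustrative, not an ALF API).
 * The SPE local store is 256 KB, shared by code, data, the runtime
 * stack, and the buffers ALF manages above the stack. */
enum { LOCAL_STORE_SIZE = 256 * 1024 };

unsigned int buffer_budget(unsigned int code_and_data_size,
                           unsigned int stack_estimate)
{
    return LOCAL_STORE_SIZE - code_and_data_size - stack_estimate;
}
```

For example, 96 KB of code and global data plus a 4 KB stack estimate leave roughly 156 KB for work block buffers.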
0x3FFFF  +------------------------------+
         | Reserved (SPU ABI usage)     |
         +------------------------------+
         | Runtime Stack                |
         |  (grows toward low address)  |
         +------------------------------+
         | Global Data (data section)   |
         +------------------------------+
         | Code (text section)          |
0x00000  +------------------------------+
Figure 20. SPU local memory map of a common Cell/B.E. application
0x3FFFF  +--------------------------------------+
         | Reserved (SPU ABI usage)             |
         +--------------------------------------+
         | Runtime Stack                        |
         +--------------------------------------+
         | ALF's dynamically managed buffer     |
         |   for work blocks (work block data)  |
         +--------------------------------------+
         | ALF Global Data (data section)       |
         +--------------------------------------+
         | User Code + ALF Runtime (text)       |
0x00000  +--------------------------------------+
Figure 21. SPU local memory map of an ALF application
Chapter 22. Data transfer list limitations
Data transfer information describes the movement of input, output, and
combined input/output data. The ALF implementation on Cell/B.E. architecture
has the following constraints:
v Data transfer information for a single work block can consist of up to 8 data
transfer lists for each direction of transfer (accelerator memory to host memory
and the reverse).
v Each data transfer list consists of up to 2048 data transfer entries.
v Each entry can describe up to 16 KB of data transfer between the contiguous
area in host memory and accelerator memory.
v All of the entries within the same data transfer list share the same high 32 bits
of the effective address.
v The local store area described by each entry within the same data transfer list
must be contiguous.
v The transfer size and the low 32 bits of the effective address of each data
transfer entry must be 16-byte aligned.
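The per-entry constraints above can be folded into a small validity check. The helper below is an illustration of the rules, not part of the ALF implementation; `ea_low` stands for the low 32 bits of the effective address.

```c
#include <stdint.h>

/* Illustrative check of one data transfer entry against the Cell/B.E.
 * constraints: at most 16 KB per entry, and both the transfer size and
 * the low 32 bits of the effective address must be 16-byte aligned. */
int dt_entry_valid(uint32_t ea_low, uint32_t size)
{
    if (size == 0 || size > 16 * 1024) return 0; /* up to 16 KB per entry */
    if (ea_low % 16 != 0)              return 0; /* 16-byte aligned EA    */
    if (size % 16 != 0)                return 0; /* 16-byte aligned size  */
    return 1;
}
```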
Part 5. Compile time options
Several compile time options are available when you build the ALF runtime
library. Changing or enabling these options turns on internal debug features,
which are helpful when debugging the host or compute kernel code. Enabling
these extra debug features can significantly slow down your application due to
the amount of information dumped and the extra error checks, so disable them
when debugging is complete.
alf_config.h
Compile time options are located in the ALF global configuration header file. It is
stored in the same location as the other external header files, for example, in
alf/include/alf_config.h.
_ALF_DEBUG_LEVEL_
#define _ALF_DEBUG_LEVEL_ x // where x is a number between 0 and 9
0    Default value of this macro; no debug information is dumped.
1    Enables dumping of textual error information in the host runtime, in
     addition to returning the standard error codes.
2    Enables dumping of textual error information in both the host and
     accelerator runtimes, in addition to returning the standard error codes.
>=3  Dumps internal debug trace information.
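The levels gate diagnostics cumulatively: a message tagged with level n is emitted only when the library was built with a level of at least n. A minimal sketch of such gating follows; the macro name mirrors ALF's convention, but the dump logic is illustrative, not ALF's actual internals.

```c
#include <stdio.h>

/* Sketch of cumulative debug-level gating (not ALF's actual code). */
#define _ALF_DEBUG_LEVEL_ 1            /* as if built with level 1 */

static int dumped;                     /* count of emitted messages */

#define ALF_DBG(level, msg)                                        \
    do {                                                           \
        if ((level) <= _ALF_DEBUG_LEVEL_) {                        \
            dumped++;                                              \
            fputs((msg), stderr);                                  \
        }                                                          \
    } while (0)
```

With a build level of 1, a level-1 error message is emitted while a level-3 trace message is compiled in but suppressed.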
_ALF_CFG_CHECK_DEBUG_
The _ALF_CFG_CHECK_DEBUG_ macro enables strict argument checking. It is
disabled by default. When this option is enabled on Cell/B.E. architecture, all of
the DMA source or destination addresses and transfer size combinations will be
checked for address alignment and size before a DMA request is issued to help
identify bugs related to improper address or size alignment problems. However,
the resulting code size on the SPU will be increased and performance will be
degraded.
#define _ALF_CFG_CHECK_DEBUG_ // disabled by default
Part 6. Appendixes
Appendix. Accessibility features
Accessibility features help users who have a physical disability, such as restricted
mobility or limited vision, to use information technology products successfully.
The following list includes the major accessibility features:
v Keyboard-only operation
v Interfaces that are commonly used by screen readers
v Keys that are tactilely discernible and do not activate just by touching them
v Industry-standard devices for ports and connectors
v The attachment of alternative input and output devices
IBM® and accessibility
See the IBM Accessibility Center at http://www.ibm.com/able/ for more
information about the commitment that IBM has to accessibility.
Notices
This information was developed for products and services offered in the U.S.A.
The manufacturer may not offer the products, services, or features discussed in this
document in other countries. Consult the manufacturer’s representative for
information on the products and services currently available in your area. Any
reference to the manufacturer’s product, program, or service is not intended to
state or imply that only that product, program, or service may be used. Any
functionally equivalent product, program, or service that does not infringe any
intellectual property right of the manufacturer may be used instead. However, it is
the user’s responsibility to evaluate and verify the operation of any product,
program, or service.
The manufacturer may have patents or pending patent applications covering
subject matter described in this document. The furnishing of this document does
not give you any license to these patents. You can send license inquiries, in
writing, to the manufacturer.
For license inquiries regarding double-byte (DBCS) information, contact the
Intellectual Property Department in your country or send inquiries, in writing, to
the manufacturer.
The following paragraph does not apply to the United Kingdom or any other
country where such provisions are inconsistent with local law: THIS
INFORMATION IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND,
EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR
FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement may
not apply to you.
This information could include technical inaccuracies or typographical errors.
Changes are periodically made to the information herein; these changes will be
incorporated in new editions of the publication. The manufacturer may make
improvements and/or changes in the product(s) and/or the program(s) described
in this publication at any time without notice.
Any references in this information to Web sites not owned by the manufacturer are
provided for convenience only and do not in any manner serve as an endorsement
of those Web sites. The materials at those Web sites are not part of the materials for
this product and use of those Web sites is at your own risk.
The manufacturer may use or distribute any of the information you supply in any
way it believes appropriate without incurring any obligation to you.
Licensees of this program who wish to have information about it for the purpose
of enabling: (i) the exchange of information between independently created
programs and other programs (including this one) and (ii) the mutual use of the
information which has been exchanged, should contact the manufacturer.
Such information may be available, subject to appropriate terms and conditions,
including in some cases, payment of a fee.
The licensed program described in this information and all licensed material
available for it are provided by IBM under terms of the IBM Customer Agreement,
IBM International Program License Agreement, IBM License Agreement for
Machine Code, or any equivalent agreement between us.
Any performance data contained herein was determined in a controlled
environment. Therefore, the results obtained in other operating environments may
vary significantly. Some measurements may have been made on development-level
systems and there is no guarantee that these measurements will be the same on
generally available systems. Furthermore, some measurements may have been
estimated through extrapolation. Actual results may vary. Users of this document
should verify the applicable data for their specific environment.
Information concerning products not produced by this manufacturer was obtained
from the suppliers of those products, their published announcements or other
publicly available sources. This manufacturer has not tested those products and
cannot confirm the accuracy of performance, compatibility or any other claims
related to products not produced by this manufacturer. Questions on the
capabilities of products not produced by this manufacturer should be addressed to
the suppliers of those products.
All statements regarding the manufacturer’s future direction or intent are subject to
change or withdrawal without notice, and represent goals and objectives only.
The manufacturer’s prices shown are the manufacturer’s suggested retail prices, are
current and are subject to change without notice. Dealer prices may vary.
This information is for planning purposes only. The information herein is subject to
change before the products described become available.
This information contains examples of data and reports used in daily business
operations. To illustrate them as completely as possible, the examples include the
names of individuals, companies, brands, and products. All of these names are
fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which
illustrate programming techniques on various operating platforms. You may copy,
modify, and distribute these sample programs in any form without payment to the
manufacturer, for the purposes of developing, using, marketing or distributing
application programs conforming to the application programming interface for the
operating platform for which the sample programs are written. These examples
have not been thoroughly tested under all conditions. The manufacturer, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs.
CODE LICENSE AND DISCLAIMER INFORMATION:
The manufacturer grants you a nonexclusive copyright license to use all
programming code examples from which you can generate similar function
tailored to your own specific needs.
SUBJECT TO ANY STATUTORY WARRANTIES WHICH CANNOT BE
EXCLUDED, THE MANUFACTURER, ITS PROGRAM DEVELOPERS AND
SUPPLIERS, MAKE NO WARRANTIES OR CONDITIONS EITHER EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OR CONDITIONS OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE, AND NON-INFRINGEMENT, REGARDING THE PROGRAM OR
TECHNICAL SUPPORT, IF ANY.
UNDER NO CIRCUMSTANCES IS THE MANUFACTURER, ITS PROGRAM
DEVELOPERS OR SUPPLIERS LIABLE FOR ANY OF THE FOLLOWING, EVEN
IF INFORMED OF THEIR POSSIBILITY:
1. LOSS OF, OR DAMAGE TO, DATA;
2. SPECIAL, INCIDENTAL, OR INDIRECT DAMAGES, OR FOR ANY
ECONOMIC CONSEQUENTIAL DAMAGES; OR
3. LOST PROFITS, BUSINESS, REVENUE, GOODWILL, OR ANTICIPATED
SAVINGS.
SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION OR LIMITATION OF
DIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, SO SOME OR ALL
OF THE ABOVE LIMITATIONS OR EXCLUSIONS MAY NOT APPLY TO YOU.
Each copy or any portion of these sample programs or any derivative work, must
include a copyright notice as follows:
© (your company name) (year). Portions of this code are derived from IBM Corp.
Sample Programs. © Copyright IBM Corp. _enter the year or years_. All rights
reserved.
If you are viewing this information in softcopy, the photographs and color
illustrations may not appear.
Trademarks
The following terms are trademarks of International Business Machines
Corporation in the United States, other countries, or both:
IBM
developerWorks
PowerPC
PowerPC Architecture
Resource Link
Adobe, Acrobat, Portable Document Format (PDF), and PostScript are either
registered trademarks or trademarks of Adobe Systems Incorporated in the United
States, other countries, or both.
Cell Broadband Engine and Cell/B.E. are trademarks of Sony Computer
Entertainment, Inc., in the United States, other countries, or both, and are used
under license therefrom.
Linux® is a trademark of Linus Torvalds in the United States, other countries, or
both.
Other company, product or service names may be trademarks or service marks of
others.
Terms and conditions
Permissions for the use of these publications is granted subject to the following
terms and conditions.
Personal Use: You may reproduce these publications for your personal,
noncommercial use provided that all proprietary notices are preserved. You may
not distribute, display or make derivative works of these publications, or any
portion thereof, without the express consent of the manufacturer.
Commercial Use: You may reproduce, distribute and display these publications
solely within your enterprise provided that all proprietary notices are preserved.
You may not make derivative works of these publications, or reproduce, distribute
or display these publications or any portion thereof outside your enterprise,
without the express consent of the manufacturer.
Except as expressly granted in this permission, no other permissions, licenses or
rights are granted, either express or implied, to the publications or any data,
software or other intellectual property contained therein.
The manufacturer reserves the right to withdraw the permissions granted herein
whenever, in its discretion, the use of the publications is detrimental to its interest
or, as determined by the manufacturer, the above instructions are not being
properly followed.
You may not download, export or re-export this information except in full
compliance with all applicable laws and regulations, including all United States
export laws and regulations.
THE MANUFACTURER MAKES NO GUARANTEE ABOUT THE CONTENT OF
THESE PUBLICATIONS. THESE PUBLICATIONS ARE PROVIDED “AS-IS” AND
WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED,
INCLUDING BUT NOT LIMITED TO IMPLIED WARRANTIES OF
MERCHANTABILITY, NON-INFRINGEMENT, AND FITNESS FOR A
PARTICULAR PURPOSE.
Related documentation
All of the documentation listed in this section is available on the ISO image. The
latest versions of some documents may be available from the referenced web pages
or on your system after installing components of the SDK.
Cell/B.E. processor
There is a set of tutorial and reference documentation for the Cell/B.E. stored in
the IBM online technical library at:
http://www.ibm.com/chips/techlib/techlib.nsf/products/Cell_Broadband_Engine
v Cell Broadband Engine Architecture
v Cell Broadband Engine Programming Handbook
v Cell Broadband Engine Registers
v C/C++ Language Extensions for Cell Broadband Engine Architecture
v Synergistic Processor Unit (SPU) Instruction Set Architecture
v SPU Application Binary Interface Specification
v Assembly Language Specification
v Cell Broadband Engine Linux Reference Implementation Application Binary Interface
Specification
Cell/B.E. programming using the SDK
v SDK 2.1 Installation Guide
v SDK 2.1 Programmer’s Guide
v Cell Broadband Engine Programming Tutorial
v SIMD Math Library
v Accelerated Library Framework Programmer’s Guide and API Reference
After you have installed the SDK, you can also find the following PDFs in the
/opt/ibm/cell-sdk/prototype/docs directory:
v SDK Sample Library documentation
v IDL compiler documentation
The following documents are available as downloads from:
http://www.ibm.com/chips/techlib/techlib.nsf/products/Cell_Broadband_Engine
v Cell Broadband Engine Programming Tutorial documentation
v SPE Runtime Management library documentation Version 1.2
v SPE Runtime Management library documentation Version 2.1 (beta)
v SPE Runtime Management library Version 1.2 to Version 2.0 Migration Guide
IBM XL C/C++ Compiler
After you have installed the SDK, you can find the following PDFs in the
/opt/ibmcmp/xlc/8.2/doc directory.
v Getting Started with IBM XL C/C++ Compiler
v IBM XL C/C++ Compiler Language Reference
v IBM XL C/C++ Compiler Programming Guide
v IBM XL C/C++ Compiler Reference
v IBM XL C/C++ Compiler Installation Guide
IBM Full-System Simulator
After you have installed the SDK, you can also find the following PDFs in the
/opt/ibm/systemsim-cell/doc directory.
v IBM Full-System Simulator Users Guide
v IBM Full-System Simulator Command Reference
v Performance Analysis with the IBM Full-System Simulator
v IBM Full-System Simulator BogusNet HowTo
PowerPC Base
The following documents can be found on the developerWorks® Web site at:
http://www.ibm.com/developerworks/eserver/library
v PowerPC Architecture™ Book, Version 2.02
– Book I: PowerPC User Instruction Set Architecture
– Book II: PowerPC Virtual Environment Architecture
– Book III: PowerPC Operating Environment Architecture
v PowerPC Microprocessor Family: Vector/SIMD Multimedia Extension Technology
Programming Environments Manual Version 2.07c
Glossary
Accelerator
General or special purpose processing element in
a hybrid system. An accelerator might have a
multi-level architecture with both host elements
and accelerator elements. An accelerator, as
defined here, is a hierarchy with potentially
multiple layers of hosts and accelerators. An
accelerator element is always associated with one
host. Aside from its direct host, an accelerator
cannot communicate with other processing
elements in the system. The memory subsystem
of the accelerator can be viewed as distinct and
independent from a host. This is referred to as the
subordinate in a cluster collective.
All-reduce operation
Output from multiple accelerators is reduced and
combined into one output.
Compute kernel
The part of the accelerator code that performs a
stateless computation task on one piece of input
data and generates the corresponding output
results.
Compute task
An accelerator execution image that consists of a
compute kernel linked with the accelerated
library framework accelerator runtime library.
Host
A general purpose processing element in a hybrid
system. A host can have multiple accelerators
attached to it. This is often referred to as the
master node in a cluster collective.
Main thread
The main thread of the application. In many
cases, Cell/B.E. architecture programs are
multi-threaded using multiple SPEs running
concurrently. A typical scenario is that the
application consists of a main thread that creates
as many SPE threads as needed and the
application organizes them.
PPE
PowerPC Processor Element. The general-purpose
processor in the Cell/B.E. processor.
SIMD
Single Instruction Multiple Data. Processing in
which a single instruction operates on multiple
data elements that make up a vector data-type.
Also known as vector processing. This style of
programming implements data-level parallelism.
SPE
Synergistic Processor Element. SPEs extend the
PowerPC 64 architecture, acting as cooperative
offload processors (synergistic processors) with
direct memory access (DMA) and synchronization
mechanisms for communicating with them
(memory flow control), and with enhancements
for real-time management. There are 8 SPEs on
each Cell/B.E. processor.
SPMD
Single Program Multiple Data. A common style of
parallel computing. All processes use the same
program, but each has its own data.
SPU
Synergistic Processor Unit. The part of an SPE
that executes instructions from its local store (LS).
Work block
A basic unit of data to be managed by the
framework. It consists of one piece of the
partitioned data, the corresponding output buffer,
and related parameters. A work block is
associated with a task. A task can have as many
work blocks as necessary.
Work queue
An internal data structure of the accelerated
library framework that holds the lists of work
blocks to be processed by the active instances of
the compute task.
Index
A
Accelerator API 47
address calculations 14
ALF API reference 23
alf_comp_kernel 47
alf_configure 26
ALF_DT_LIST_ADD_ENTRY 49
ALF_DT_LIST_CREATE 49
ALF_ERR_POLICY_T 25
alf_exit 29
alf_handle_t 25
alf_init 28
alf_prepare_input_list 47
alf_prepare_output_list 48
alf_query_system_info 27
alf_register_error_handler 30
alf_task_context_add_entry 35
alf_task_context_create 33
alf_task_context_handle_t 31
alf_task_context_register 36
alf_task_create 32
alf_task_destroy 38
alf_task_handle_t 31
alf_task_info_t 31
alf_task_info_t_CBEA 51
alf_task_query 36
alf_task_wait 37
alf_wb_add_io_buffer 41
alf_wb_add_parm 41
alf_wb_create 39
alf_wb_enqueue 40
alf_wb_sync 43
alf_wb_sync_wait 44
application structure 53
B
basic framework API 25
buffer layouts 11
buffers 11
task context buffer 11, 81
work block input data buffer 11
work block output data buffer 11
work block overlapped input and
output data buffer 11
work block parameter and context
buffer 11
C
Cell/B.E. architecture platform-dependent
API 51
compute kernel 68, 73
compute task 1, 3, 7
compute task API 31
conventions 23
D
data partitioning 17
data structures 39
data transfer list 7, 68, 73, 93
debugging 63, 76
documentation 105
double buffering 59
E
error handling 21
H
host API 25
host memory 17
L
local memory allocation 14
M
main thread 68
matrix addition 65
matrix transpose 71, 72
memory
host memory 91
local memory 91
memory constraints 17, 91
O
overlapped I/O buffer 81
overview of ALF 1
P
partition scheme 80
performance tuning 63
S
SDK documentation 105
SPE 76
sync_callback_func 44
synchronization points 19
barrier 19
notify 19
T
two-dimensional array 66
V
vector min-max 79
W
work block API 39
work blocks 1, 3
multi-use 9, 17
single-use 9, 17
work queue 53
Printed in USA
SC33-8333-01