ScicomP 15 – Programming IBM PowerXcell 8i


Programming IBM PowerXcell 8i / QS22: libSPE2, ALF, DaCS

Jordi Caubet, IBM Spain. IBM Innovation Initiative @ BSC-CNS

ScicomP 15, 18-22 May 2009, Barcelona, Spain

Overview

• SPE Runtime Management Library
• Communication Mechanisms
  – Mailboxes, DMA, Signals
• SIMD
• Accelerator Library Framework (ALF)
• Data Communications and Synchronization (DaCS)

SPE Runtime Management Library v2.3 (libSPE2)

• The SPE runtime management library (libspe) contains the SPE thread programming model for Cell BE applications
• It is used to control SPE program execution from the PPE program
• Licensed under the GNU LGPL
• Its design is independent of the underlying operating system
  – can run on a variety of OS platforms
• The Linux implementation uses the SPU File System (SPUFS) as the Linux kernel interface
• Handles SPEs as virtual objects called SPE contexts
  – SPE programs can be loaded and executed by operating on SPE contexts
• elfspe enables running an SPE executable directly from a Linux shell, without a PPE application creating an SPE thread

SPE Runtime Management Library v2.3 – Terminology

• SPE context:
  – Holds all persistent information about a "logical SPE" used by the application.
• Gang context:
  – Holds all persistent information about a group of SPE contexts that should be treated as a gang, that is, executed together with certain properties.
  – Note: a context's data structure should not be accessed directly; instead, the application uses a pointer to an SPE (gang) context as an identifier for the SPE (gang) it is dealing with through libspe API calls.
• Main thread:
  – The application's main thread. A typical Cell BE application consists of a main thread that creates as many SPE threads as needed and "orchestrates" them.
• SPE thread:
  – A regular thread executing on the PPE that actually runs an SPE context.
  – The API call spe_context_run is a synchronous, blocking call from the perspective of the thread using it: while an SPE program is executed, the associated SPE thread blocks and is usually put to "sleep" by the operating system.
• SPE event:
  – A mechanism for asynchronous notification. Typically, the main thread sets up an event handler to receive notification about certain events caused by the asynchronously running SPE threads.
  – The current library supports events indicating that an SPE has stopped execution, that mailbox messages have been written or read by an SPE, and that PPE-initiated DMA operations have completed.

SPE Runtime Management Library v2.3 – Functionality

To provide this functionality, the SPE Runtime Management Library consists of various sets of PPE functions:

1. A set of PPE functions to create and destroy SPE and gang contexts.
2. A set of PPE functions to load SPE objects into SPE local store memory for execution.
3. A set of PPE functions to start the execution of SPE programs and to obtain information on the reasons why an SPE has stopped running.
4. A set of PPE functions to receive asynchronous events generated by an SPE.
5. A set of PPE functions used to access the MFC (Memory Flow Control) problem-state facilities, which include:
   a. MFC proxy command issue
   b. MFC proxy tag-group completion facility
   c. Mailbox facility
   d. SPE signal-notification facility
6. A set of PPE functions that enable direct application access to an SPE's local store and problem-state areas.
7. Means to register PPE-assisted library calls for an SPE program.

SPE Runtime Management Library v2.3 – Function list

• SPE Context Creation
  – spe_context_create
  – spe_context_destroy
  – spe_gang_context_create
  – spe_gang_context_destroy
• SPE Program Image Handling
  – spe_image_open
  – spe_image_close
  – spe_program_load
• SPE Run Control
  – spe_context_run
  – spe_stop_info_read
• SPE Event Handling (see the sketch after this list)
  – spe_event_handler_create
  – spe_event_handler_destroy
  – spe_event_handler_register
  – spe_event_handler_deregister
  – spe_event_wait
• SPE MFC Proxy Command Issue
  – spe_mfcio_put, spe_mfcio_putb, spe_mfcio_putf
  – spe_mfcio_get, spe_mfcio_getb, spe_mfcio_getf
• SPE MFC Proxy Tag-Group Completion Facility
  – spe_mfcio_tag_status_read
• SPE Mailbox Facility
  – spe_out_mbox_read
  – spe_out_mbox_status
  – spe_in_mbox_write
  – spe_in_mbox_status
  – spe_out_intr_mbox_read
  – spe_out_intr_mbox_status
• SPE SPU Signal Notification Facility
  – spe_signal_write
• Direct SPE Access for Applications
  – spe_ls_area_get
  – spe_ls_size_get
  – spe_ps_area_get
• PPE-assisted Library Calls
  – spe_callback_handler_register
  – spe_callback_handler_deregister
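To illustrate the event-handling functions above, here is a minimal hedged sketch that waits until a given SPE context stops. It assumes the spe_event_unit_t layout from libspe2.h (with events and spe fields) and that the context was created with the SPE_EVENTS_ENABLE flag:

#include <libspe2.h>

/* Sketch: block until the SPE running in "ctx" stops. */
void wait_for_spe_stop(spe_context_ptr_t ctx)
{
    spe_event_handler_ptr_t handler = spe_event_handler_create();
    spe_event_unit_t event, pending;

    event.events = SPE_EVENT_SPE_STOPPED;  /* event class to monitor */
    event.spe = ctx;                       /* context to monitor */
    spe_event_handler_register(handler, &event);

    /* wait for at most one event; timeout -1 means block indefinitely */
    if (spe_event_wait(handler, &pending, 1, -1) > 0) {
        /* pending.events indicates which registered event fired */
    }
    spe_event_handler_destroy(handler);
}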

SPE Runtime Management Library v2.3 – Single thread

• A simple application uses a single PPE thread, that is, the application's main PPE thread.
• The basic scheme for a simple application using an SPE is as follows:
  1. Create an SPE context.
  2. Load an SPE executable object into the SPE context's local store.
  3. Run the SPE context – this transfers control to the operating system, requesting the actual scheduling of the context to a physical SPE in the system.
  4. Destroy the SPE context.
• Note that step 3 is a synchronous call to the operating system. The calling application's PPE thread blocks until the SPE stops execution and the operating system returns from the system call that invoked the SPE execution.

SPE Runtime Management Library v2.3 – Hello World single-thread sample

SPE program (hello_spu):

#include <stdio.h>

int main(unsigned long long speid, unsigned long long argp,
         unsigned long long envp)
{
    printf("Hello World!\n");
    return 0;
}

PPE program:

#include <errno.h>
#include <libspe2.h>

extern spe_program_handle_t hello_spu;

int main(void)
{
    spe_context_ptr_t speid;   // structure for an SPE context
    unsigned int flags = 0;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    void *argp = NULL;
    void *envp = NULL;
    spe_stop_info_t stop_info;
    int rc;

    // Create an SPE context
    speid = spe_context_create(flags, NULL);
    if (speid == NULL) { perror("spe_context_create"); return -2; }

    // Load an SPE executable object into the SPE context local store
    if (spe_program_load(speid, &hello_spu))
    { perror("spe_program_load"); return -3; }

    // Run the SPE context
    rc = spe_context_run(speid, &entry, 0, argp, envp, &stop_info);
    if (rc < 0) perror("spe_context_run");

    spe_context_destroy(speid);   // Destroy the SPE context
    return 0;
}

API reference:

spe_context_ptr_t spe_context_create(unsigned int flags, spe_gang_context_ptr_t gang)
  flags: a bit-wise OR of modifiers that are applied when the SPE context is created.
  gang: associate the new SPE context with this gang context. If NULL is specified, the new SPE context is not associated with any gang.
  On success, a pointer to the newly created SPE context is returned.

int spe_program_load(spe_context_ptr_t spe, spe_program_handle_t *program)
  spe: a valid pointer to the SPE context for which an SPE program should be loaded.
  program: a valid address of a mapped SPE program.
  On success, 0 (zero) is returned. spe_program_load loads an SPE main program that has been mapped to memory at the address pointed to by program into the local store of the SPE identified by the SPE context spe. This is mandatory before running the SPE context with spe_context_run.

int spe_context_run(spe_context_ptr_t spe, unsigned int *entry, unsigned int runflags, void *argp, void *envp, spe_stop_info_t *stopinfo)
  spe: a pointer to the SPE context that should be run.
  entry: input – the entry point, that is, the initial value of the SPU instruction pointer at which the SPE program should start executing. If the value of entry is SPE_DEFAULT_ENTRY, the entry point for the SPU main program is obtained from the loaded SPE image; this is usually the local store address of the initialization function crt0. Output – the SPU instruction pointer at the moment the SPU stopped execution, that is, the local store address of the next instruction that would have been executed.
  runflags: a bit mask that can be used to request certain specific behavior for the execution of the SPE context. A value of 0 indicates default behavior (see Usage).
  argp: an (optional) pointer to application-specific data; passed as the second parameter to the SPE program.
  envp: an (optional) pointer to environment-specific data; passed as the third parameter to the SPE program.
  stopinfo: an (optional) pointer to a structure of type spe_stop_info_t.
  On success, 0 (zero) or a positive number is returned.

SPE Runtime Management Library v2.3 – Requesting execution of an SPE context

• An SPE program must be loaded (using spe_program_load) before you can run the SPE context.
• The thread calling spe_context_run blocks and waits until the SPE stops, due to normal termination of the SPE program, an SPU stop-and-signal instruction, or an error condition. When spe_context_run returns, the calling thread must take appropriate action depending on the application logic.
• spe_context_run returns information about the termination of the SPE program in three ways, which lets applications deal with termination conditions at various levels:
  – First, the most common usage for many applications is covered by the return value of the function, with errno set appropriately.
  – Second, the optional stopinfo structure provides detailed information about the termination condition in a structured way that allows more fine-grained error handling and enables implementation of special scenarios (sketched below).
  – Third, the stopinfo structure contains the field spu_status, which holds the value of the CBEA SPU Status Register (SPU_Status), as specified in the Cell Broadband Engine Architecture, Version 1, section 8.5.2, upon termination of the SPE program. This can be very useful, especially in conjunction with the SPE_NO_CALLBACKS flag, for applications that run non-standard SPE programs and want to react flexibly to all possible conditions rather than rely on predefined conventions.
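As a hedged illustration of the second point, a caller might inspect stopinfo like this (field and constant names as defined by libspe2; treat the exact set of stop reasons as an assumption to verify against the header):

switch (stop_info.stop_reason) {
case SPE_EXIT:            /* SPE main() returned or exit() was called */
    printf("SPE exit code: %d\n", stop_info.result.spe_exit_code);
    break;
case SPE_STOP_AND_SIGNAL: /* SPU stop-and-signal instruction */
    printf("stop-and-signal code: 0x%x\n", stop_info.result.spe_signal_code);
    break;
default:                  /* runtime error, exception, etc. */
    printf("abnormal stop, SPU_Status = 0x%x\n", stop_info.spu_status);
    break;
}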

SPE Runtime Management Library v2.3 – Single-Thread Asynchronous Execution Example

• The application creates one PPE thread.
• The PPE thread runs one SPE context at a time.
• The basic scheme for a simple application running one SPE context asynchronously is:
  1. Create an SPE context.
  2. Load the appropriate SPE executable object into the SPE context's local store.
  3. Create a PPE thread:
     a. Run the SPE context in the PPE thread.
     b. Terminate the PPE thread.
  4. Wait for the PPE thread to terminate.
  5. Destroy the SPE context.

SPE Runtime Management Library v2.3 – Single-Thread Asynchronous Execution Example

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <libspe2.h>
#include <pthread.h>    // pthread header file

// Structure for an SPE thread: add a pthread_t to the SPE context data
typedef struct ppu_pthread_data {
    spe_context_ptr_t context;
    pthread_t pthread;
    unsigned int entry;
    unsigned int flags;
    void *argp;
    void *envp;
    spe_stop_info_t stopinfo;
} ppu_pthread_data_t;

// Define a pthread function so we can run the SPE context in a PPE thread
void *ppu_pthread_function(void *arg)
{
    ppu_pthread_data_t *datap = (ppu_pthread_data_t *)arg;
    int rc;
    rc = spe_context_run(datap->context, &datap->entry, datap->flags,
                         datap->argp, datap->envp, &datap->stopinfo);
    pthread_exit(NULL);
}

extern spe_program_handle_t hello_spu;

int main(void)
{
    ppu_pthread_data_t data;

    // Define data, fill it with the necessary context data,
    // and then load the SPE program image
    data.context = spe_context_create(0, NULL);
    spe_program_load(data.context, &hello_spu);
    data.entry = SPE_DEFAULT_ENTRY;
    data.flags = 0;
    data.argp = NULL;
    data.envp = NULL;

    // Call ppu_pthread_function to run the context in the PPE thread
    pthread_create(&data.pthread, NULL, &ppu_pthread_function, &data);

    // Wait until the PPE thread terminates
    pthread_join(data.pthread, NULL);

    spe_context_destroy(data.context);
    return 0;
}

SPE Runtime Management Library v2.3 – Multi-thread

• Many applications need to use multiple SPEs concurrently.
• In this case, the application must create at least as many PPE threads as concurrent SPE contexts are required.
• Each of these PPE threads may run a single SPE context at a time.
• If N concurrent SPE contexts are needed, it is common to have a main application thread plus N PPE threads dedicated to SPE context execution.
• The basic scheme for a simple application running N SPE contexts is:
  1. Create N SPE contexts.
  2. Load the appropriate SPE executable object into each SPE context's local store.
  3. Create N PPE threads:
     a. In each of these PPE threads, run one of the SPE contexts.
     b. Terminate the PPE thread.
  4. Wait for all N PPE threads to terminate.
  5. Destroy all N SPE contexts.

SPE Runtime Management Library v2.3 – Multi-Thread Asynchronous Execution Example

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <libspe2.h>
#include <pthread.h>

extern spe_program_handle_t simple_spu;

#define SPU_THREADS 6

// Define a pthread function so we can run an SPE context per PPE thread
void *ppu_pthread_function(void *arg) {
    spe_context_ptr_t ctx;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    ctx = *((spe_context_ptr_t *)arg);
    if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0) {
        perror("Failed running context");
        exit(1);
    }
    pthread_exit(NULL);
}

SPE Runtime Management Library v2.3 – Multi-Thread Asynchronous Execution Example (continued)

int main()
{
    int i;
    spe_context_ptr_t ctxs[SPU_THREADS];
    pthread_t threads[SPU_THREADS];

    /* Create several SPE threads to execute 'simple_spu'. */
    for (i = 0; i < SPU_THREADS; i++) {
        /* Create context */
        if ((ctxs[i] = spe_context_create(0, NULL)) == NULL) {
            perror("Failed creating context");
            exit(1);
        }
        /* Load program into context */
        if (spe_program_load(ctxs[i], &simple_spu)) {
            perror("Failed loading program");
            exit(1);
        }
        /* Fork a pthread for each SPE context, giving thread id, function, and data */
        if (pthread_create(&threads[i], NULL, &ppu_pthread_function, &ctxs[i])) {
            perror("Failed creating thread");
            exit(1);
        }
    }

    /* Wait for the SPU threads to complete execution. */
    for (i = 0; i < SPU_THREADS; i++) {
        if (pthread_join(threads[i], NULL)) {
            perror("Failed pthread_join");
            exit(1);
        }
    }

    printf("\nThe program has successfully executed.\n");
    return 0;
}

Overview

• SPE Runtime Management Library
• Communication Mechanisms
  – MFC / DMA, Mailboxes, Signals
• SIMD
• Accelerator Library Framework (ALF)
• Data Communications and Synchronization (DaCS)

Cell's Primary Communication Mechanisms

• DMA transfers, mailbox messages, and signal notification
• All three are implemented and controlled by the SPE's MFC

Storage Domain Communication Concepts

• An SPE program references its own LS using a Local Store Address (LSA). The LS of each SPE is also assigned a Real Address (RA) range within the system's memory map. The PPE maps the LS areas into the Effective Address (EA) space, where the PPE and other SPEs that generate EAs can access an LS like any other component of main storage. This allows SPEs to use DMA to transfer data directly between their own LS and another SPE's LS. This mode of data transfer is very efficient, because the DMA transfers go directly from SPE to SPE on the high-performance local bus without involving system memory.
• Code that runs on an SPU can only fetch instructions from its own LS, and loads and stores can only access that LS.
• Data transfers between an SPE's LS and main storage are primarily executed using DMA transfers controlled by the MFC DMA controller for that SPE.
• Each SPE's MFC serves as a data-transfer engine.
• DMA transfer requests contain both an LSA and an EA.

[Figure: eight SPEs, each with SPU/SXU, LS, and MFC, attached to the EIB (up to 96 B/cycle).]

MFC channels and MMIO interfaces and queues

Notes:
• Because channel accesses stay local within an SPE, they have low latency (about 6 cycles for non-blocking commands if the channel is not full) and consume no EIB bandwidth.
• When an SPE reads or writes a non-blocking channel, the operation executes without delay. However, when SPE software reads or writes a blocking channel, the SPE may stall for an arbitrary length of time if the associated channel count (its remaining capacity) is 0. In that case, the SPE remains stalled until the channel count becomes 1 or more.

MFC Commands

• The main mechanism for SPUs to:
  – access main storage (DMA commands)
  – maintain synchronization with other processors and devices in the system (synchronization commands)
• Can be issued either by the SPU via its MFC, or by the PPE or another device, as follows:
  – Code running on the SPU issues an MFC command by executing a series of writes and/or reads using channel instructions: read channel (rdch), write channel (wrch), and read channel count (rchcnt).
  – Code running on the PPE or other devices issues an MFC command by performing a series of stores and/or loads to memory-mapped I/O (MMIO) registers in the MFC.
• MFC commands are queued in one of two independent MFC command queues:
  – MFC SPU Command Queue – for channel-initiated commands from the associated SPU
  – MFC Proxy Command Queue – for MMIO-initiated commands from the PPE or another device

[Figure: SPE block diagram – SPU core (SXU), channel unit, local store, and MFC (DMA unit) connected to the Element Interconnect Bus.]

Sequences for Issuing MFC Commands

• All operations on a given channel are unidirectional (a given channel supports only read or only write operations, not both).
• Accesses to channel-interface resources through MMIO addresses do not stall.
• Channel operations are done in program order.
• Channel read operations to reserved channels return 0s.
• Channel write operations to reserved channels have no effect.
• Reading channel counts on reserved channels returns 0.
• Channel instructions use the 32-bit preferred slot in a 128-bit transfer.
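As a small illustration of the channel rules above, SPE code can probe a channel's remaining capacity before touching a blocking channel. A hedged sketch, using SPU_RdInMbox (the inbound-mailbox channel discussed later in this talk):

#include <spu_intrinsics.h>
#include <spu_mfcio.h>

/* Sketch: avoid stalling by checking the channel count first. */
if (spu_readchcnt(SPU_RdInMbox) > 0) {            /* remaining capacity > 0 */
    unsigned int msg = spu_readch(SPU_RdInMbox);  /* safe: will not stall */
    /* ... process msg ... */
}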


DMA Overview

DMA Commands

• MFC commands that transfer data are referred to as DMA commands.
• The transfer direction for DMA commands is referenced from the SPE:
  – into an SPE (from main storage to local store) → get
  – out of an SPE (from local store to main storage) → put

DMA Commands

mfc_get can be expressed at three levels – as an MFC command (a macro defined in spu_mfcio.h), as a composite intrinsic, or as a sequence of channel-control intrinsics:

// MFC command (macro defined in spu_mfcio.h)
#include "spu_mfcio.h"
mfc_get(lsa, ea, size, tag, tid, rid);
// wait until the DMA transfer is complete (or do other things before that)

// Implemented as the following composite intrinsic:
spu_mfcdma64(lsa, mfc_ea2h(ea), mfc_ea2l(ea), size, tag,
             ((tid<<24)|(rid<<16)|MFC_GET_CMD));

// Implemented as channel-control intrinsics:
spu_writech(MFC_LSA, lsa);
spu_writech(MFC_EAH, eah);
spu_writech(MFC_EAL, eal);
spu_writech(MFC_Size, size);
spu_writech(MFC_TagID, tag);
spu_writech(MFC_CMD, 0x0040);

DMA Get and Put Commands (SPU)

• DMA get from main memory into local store:
  (void) mfc_get(volatile void *ls, uint64_t ea, uint32_t size, uint32_t tag, uint32_t tid, uint32_t rid)
• DMA put into main memory from local store:
  (void) mfc_put(volatile void *ls, uint64_t ea, uint32_t size, uint32_t tag, uint32_t tid, uint32_t rid)
• To ensure the order in which DMA requests are executed (a usage sketch follows this list):
  – mfc_putf: fenced (all commands issued before the fenced command within the same tag group must finish first; commands issued later may still execute ahead of it)
  – mfc_putb: barrier (the barrier command and all commands issued after it are not executed until all previously issued commands in the same tag group have been performed)
• A 5-bit DMA tag accompanies all DMA commands (except getllar, putllc, and putlluc):
  – The tag can be used to determine the status of an entire tag group or command, or to check/wait on the completion of all queued commands in one or more tag groups.
  – Tagging is optional but can be useful when using barriers to control the ordering of MFC commands within a single command queue.
  – Synchronization of DMA commands within a tag group: fence and barrier.
  – Execution of a fenced command option is delayed until all previously issued commands within the same tag group have been performed.
  – Execution of a barrier command option and all subsequent commands is delayed until all previously issued commands in the same tag group have been performed.
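A common hedged pattern built on these options: put the payload, then issue a barriered put of a completion flag in the same tag group, so the flag can never reach memory before the data. Buffer names here are illustrative (flag_buf is assumed to be a small, suitably aligned LS buffer):

/* payload first, then a barriered 'done' flag in the same tag group */
mfc_put(data_buf, data_ea, data_size, 5, 0, 0);
mfc_putb(flag_buf, flag_ea, 16, 5, 0, 0);  /* not performed until the put above completes */
mfc_write_tag_mask(1 << 5);
mfc_read_tag_status_all();                 /* wait for both transfers */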


Barriers and Fences – Synchronization Commands

DMA Characteristics

• DMA transfers
  – transfer sizes can be 1, 2, 4, 8, or n*16 bytes (n an integer)
  – maximum of 16 KB per DMA transfer
  – 128-byte alignment is preferable
• DMA command queues per SPU
  – 16-element queue for SPU-initiated requests
  – 8-element queue for PPE-initiated requests
  – SPU-initiated DMA is always preferable
• DMA tags
  – each DMA command is tagged with a 5-bit identifier
  – the same identifier can be used for multiple commands
  – tags are used for polling status or waiting on the completion of DMA commands
• DMA lists (see the sketch below)
  – a single DMA command can cause execution of a list of transfer requests (stored in LS)
  – lists implement scatter-gather functions
  – a list can contain up to 2K transfer requests
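A hedged sketch of a DMA-list (scatter-gather) transfer, assuming the mfc_list_element_t layout from spu_mfcio.h (eal = low word of the effective address, size in bytes) and the mfc_getl list command; the list itself must reside in local store:

#include <spu_mfcio.h>

#define NBLOCKS 4

static mfc_list_element_t list[NBLOCKS] __attribute__((aligned(8)));
static char buf[NBLOCKS * 4096] __attribute__((aligned(128)));

/* Gather NBLOCKS 4 KB pieces named by their low EA words; ea_high
   supplies the shared upper 32 bits of the effective address. */
void gather(unsigned int ea_low[], unsigned long long ea_high)
{
    unsigned int i, tag = 3;
    for (i = 0; i < NBLOCKS; i++) {
        list[i].notify = 0;        /* no stall-and-notify on this element */
        list[i].size   = 4096;     /* bytes for this element */
        list[i].eal    = ea_low[i];
    }
    mfc_getl(buf, ea_high, list, sizeof(list), tag, 0, 0);  /* one command, many transfers */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}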


PPE – SPE DMA Transfer

PPE ↔ SPE: Transfer through the MMIO Interface

• Two methods for accessing an SPE from the PPE through the MMIO interface:
  – MFC functions
    • simple to use
    • less development time
  – Direct problem-state access (or direct SPE access)
    • offers better performance than the MFC functions

Example of using the MFC functions:

#include "libspe2.h"
spe_mfcio_get(spe_ctx, lsa, ea, size, tag, tid, rid);
// wait until the data has been transferred to the LS, or do other things...

spe_context_ptr_t spe_ctx: a pointer to the context of the relevant SPE. This context is created when the SPE thread is created.
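For the waiting step, a hedged sketch using the proxy tag-group completion facility listed earlier (the SPE_TAG_ALL behavior constant is assumed from libspe2.h):

unsigned int status;
/* block until all proxy commands with this tag have completed */
spe_mfcio_tag_status_read(spe_ctx, 1 << tag, SPE_TAG_ALL, &status);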

DMA Example: Read into Local Store

#include <spu_mfcio.h>   // MFC functions header file

inline void dma_mem_to_ls(unsigned int mem_addr, volatile void *ls_addr,
                          unsigned int size)
{
    unsigned int tag = 0;
    unsigned int mask = 1;

    // Initiate a DMA transfer – read the contents of mem_addr into ls_addr
    mfc_get(ls_addr, mem_addr, size, tag, 0, 0);

    // Set the tag mask to determine which tag IDs to check for completion
    mfc_write_tag_mask(mask);

    // Wait until all tagged DMA commands have completed
    mfc_read_tag_status_all();
}

DMA Example: Write to Main Memory

#include <spu_mfcio.h>   // MFC functions header file

inline void dma_ls_to_mem(unsigned int mem_addr, volatile void *ls_addr,
                          unsigned int size)
{
    unsigned int tag = 0;
    unsigned int mask = 1;

    // Initiate a DMA transfer – write the contents of ls_addr into mem_addr
    mfc_put(ls_addr, mem_addr, size, tag, 0, 0);

    // Set the tag mask to determine which tag IDs to check for completion
    mfc_write_tag_mask(mask);

    // Wait until all tagged DMA commands have completed
    mfc_read_tag_status_all();
}

Tips to Achieve Peak Bandwidth for DMAs

• The performance of a DMA data transfer is best when the source and destination addresses have the same quadword offsets within a PPE cache line.
• Quadword-offset-aligned data transfers generate full cache-line bus requests for every unrolling, except possibly the first and last.
• Transfers that start or end in the middle of a cache line transfer a partial cache line (less than 8 quadwords) in the first or last bus request, respectively.

Supported and Recommended Values for DMA Parameters

• Direction: data transfer may be in either direction, as referenced from the perspective of the SPE:
  – get commands transfer data into the LS from main storage.
  – put commands transfer data out of the LS to main storage.
• Size: transfer size should obey the following guidelines:
  – supported transfer sizes are 1, 2, 4, 8, or 16 bytes, and multiples of 16 bytes
  – maximum transfer size is 16 KB
  – peak performance is achieved when the transfer size is a multiple of 128 bytes
• Alignment: alignment of the LSA and the EA should obey the following guidelines:
  – source and destination addresses must have the same 4 least-significant bits
  – for transfer sizes less than 16 bytes, the address must be naturally aligned (bits 28 through 31 must provide natural alignment based on the transfer size)
  – for transfer sizes of 16 bytes or greater, the address must be aligned to at least a 16-byte boundary (bits 28 through 31 must be 0)
  – peak performance is achieved when both source and destination are aligned on a 128-byte boundary (bits 25 through 31 cleared to 0)

A hedged helper that checks these rules is sketched below.
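The rules fold into a small validity check. This is a purely illustrative helper built from the guidelines above, not part of any SDK:

/* Return nonzero if (lsa, ea, size) satisfies the DMA rules above. */
static inline int dma_args_ok(unsigned int lsa, unsigned long long ea,
                              unsigned int size)
{
    if (size == 0 || size > 16384) return 0;       /* max 16 KB */
    if (size >= 16)                                /* multiple of 16, 16-byte aligned */
        return (size % 16 == 0) && (lsa % 16 == 0) && (ea % 16 == 0);
    if (size != 1 && size != 2 && size != 4 && size != 8) return 0;
    return (lsa % size == 0) && (ea % size == 0)   /* natural alignment */
        && ((lsa & 15) == (ea & 15));              /* same low 4 bits */
}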

Data Transfer Example

PPE program:

#include <stdio.h>
#include <string.h>
#include <libspe2.h>

// SPU program
extern spe_program_handle_t getbuf_spu;

// local buffer
unsigned char buffer[128] __attribute__ ((aligned(128)));

// SPE context
spe_context_ptr_t speid;
unsigned int flags = 0;
unsigned int entry = SPE_DEFAULT_ENTRY;
spe_stop_info_t stop_info;
int rc;

int main(void)
{
    strcpy(buffer, "Good morning!");
    printf("Original buffer is %s\n", buffer);

    speid = spe_context_create(flags, NULL);
    spe_program_load(speid, &getbuf_spu);
    rc = spe_context_run(speid, &entry, 0, buffer, NULL, &stop_info);
    spe_context_destroy(speid);

    printf("New modified buffer is %s\n", buffer);
    return 0;
}

SPE program:

#include <stdio.h>
#include <string.h>
#include <spu_mfcio.h>

unsigned char buffer[128] __attribute__ ((aligned(128)));

int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
{
    unsigned int tag = 31, tag_mask = 1u << tag;

    // DMA in buffer from the PPE
    mfc_get(buffer, (unsigned long long)argp, 128, tag, 0, 0);
    mfc_write_tag_mask(tag_mask);
    mfc_read_tag_status_any();
    printf("SPE received buffer \"%s\"\n", buffer);

    // modify buffer
    strcpy(buffer, "Guten Morgen!");
    printf("SPE sent to PPU buffer \"%s\"\n", buffer);

    // DMA out buffer to the PPE
    mfc_put(buffer, (unsigned long long)argp, 128, tag, 0, 0);
    mfc_write_tag_mask(tag_mask);
    mfc_read_tag_status_any();

    return 0;
}


Mailboxes

Mailboxes

• Used to communicate messages up to 32 bits in length, such as buffer-completion flags or program status.
  – Example: the SPE places computational results in main storage via DMA. After requesting the DMA transfer, the SPE waits for the transfer to complete and then writes to an outbound mailbox to notify the PPE that its computation is complete.
• Can be used for any short-data transfer purpose, such as sending storage addresses, function parameters, command parameters, and state-machine parameters.
• Can also be used for communication between an SPE and other SPEs, processors, or devices.
  – Privileged software needs to allow one SPE to access the mailbox register in another SPE by mapping the target SPE's problem-state area into the EA space of the source SPE. If software does not allow this, then only atomic operations and signal notifications are available for SPE-to-SPE communication.

Mailboxes

Each MFC provides three mailbox queues, each 32 bits wide:

• Two outbound mailboxes, to send messages from the local SPE to the PPE or other SPEs:
  – SPU Write Outbound mailbox (SPU_WrOutMbox)
  – SPU Write Outbound Interrupt mailbox (SPU_WrOutIntrMbox)
• One inbound mailbox, to send messages to the local SPE from the PPE or other SPEs:
  – SPU Read Inbound mailbox (SPU_RdInMbox)
• Each mailbox entry is a fullword.

Mailboxes – Characteristics

Mailboxes API – libspe2 (see the libspe2 documentation for more)

MFC mailbox / SPU side (spu_mfcio.h) / PPU side (libspe2.h):

• PPE mailbox (out_mbox) – dataflow SPU → PPU:
  – SPU: spu_stat_out_mbox, spu_write_out_mbox
  – PPU: spe_out_mbox_status(<speid>), spe_out_mbox_read(<speid>, &<data>)
• PPE interrupt mailbox (out_intr_mbox) – dataflow SPU → PPU:
  – SPU: spu_stat_out_intr_mbox, spu_write_out_intr_mbox
  – PPU: spe_out_intr_mbox_status(<speid>), spe_get_event
• SPE mailbox (in_mbox) – dataflow PPU → SPU:
  – SPU: spu_stat_in_mbox, spu_read_in_mbox
  – PPU: spe_in_mbox_status(<speid>), spe_in_mbox_write(<speid>, <data>)
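Putting the SPU and PPU sides together, a hedged sketch of an SPU → PPU notification through the outbound mailbox:

/* SPU side (spu_mfcio.h): post a 32-bit status word;
   blocks if the outbound mailbox is full */
spu_write_out_mbox(1);

/* PPU side (libspe2.h): poll until a message is available, then read it */
unsigned int data;
while (spe_out_mbox_status(speid) == 0)
    ;   /* nothing unread yet */
spe_out_mbox_read(speid, &data, 1);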

Mailboxes MFC Functions

Note: to access a mailbox of a local SPU from another SPU, the following steps are needed:
1. The PPU code maps the target SPU's control area into main storage using the spe_ps_area_get function from libspe2.h with the SPE_CONTROL_AREA flag set.
2. The PPE forwards the SPU's control-area base address to the other SPU.
3. The other SPU uses ordinary DMA transfers to access the mailbox. The effective address is the control-area base plus the offset of the specific mailbox register. A hedged sketch follows.
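A hedged SPU-side sketch of step 3. Here ctl_ea is the control-area base EA obtained and forwarded by the PPE, and in_mbox_offset is the offset of the SPU_In_Mbox register within the control area; both are assumptions for illustration and should be taken from the mapped spe_spu_control_area_t in a real program:

#include <spu_mfcio.h>

static volatile unsigned int msg __attribute__((aligned(16)));

/* Write one 32-bit message into another SPE's inbound mailbox via DMA. */
void send_to_other_spe(unsigned long long ctl_ea,
                       unsigned int in_mbox_offset, unsigned int value)
{
    unsigned int tag = 7;
    msg = value;
    mfc_put(&msg, ctl_ea + in_mbox_offset, 4, tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}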

Example of Using Mailboxes – PPE Code: Adding Two Arrays on the SPU

#include <libspe2.h>
#include <…

#define ARRAY_SIZE 1024
#define NUM 512

#define MY_ALIGN(_my_var_def_, _my_al_) _my_var_def_ __attribute__((__aligned__(_my_al_)))

MY_ALIGN(float array_a[ARRAY_SIZE], 128);
MY_ALIGN(float array_b[ARRAY_SIZE], 128);
MY_ALIGN(float array_c[ARRAY_SIZE], 128);

extern spe_program_handle_t array_spu_add;

int status;

typedef struct ppu_thread_data {
    spe_context_ptr_t context;
    pthread_t pthread;
    unsigned int entry;
    unsigned int flags;
    void *argp;
    void *envp;
    spe_stop_info_t stopinfo;
} ppu_pthread_data_t;

void init_array() {
    int i;
    for (i = 0; i < ARRAY_SIZE; i++) {
        array_a[i] = (float)(i * NUM);
        array_b[i] = (float)((i * NUM) * 2);
        array_c[i] = 0.0f;
    }
}

Example of Using Mailboxes – PPE Code (continued)

/* pthread function to run the SPE context */
void *ppu_pthread_function(void *arg)
{
    ppu_pthread_data_t *datap = (ppu_pthread_data_t *)arg;
    spe_context_run(datap->context, &datap->entry, datap->flags,
                    datap->argp, datap->envp, &datap->stopinfo);
    pthread_exit(NULL);
}

int main(void)
{
    int i, retVal;
    ppu_pthread_data_t threadData;
    unsigned int *mbox_dataPtr;
    unsigned int mbox_data[3];

    init_array();

    /* check for working hardware */
    if (spe_cpu_info_get(SPE_COUNT_PHYSICAL_SPES, -1) < 1) {
        fprintf(stderr, "System has no working SPEs. Exiting\n");
        return -1;
    }

    /* Create an SPE context that we'll use to load our SPE program into */
    threadData.context = spe_context_create(0, NULL);

    /* Load our SPE program: array_spu_add */
    spe_program_load(threadData.context, &array_spu_add);

    /* Set up our SPE program parameters */
    threadData.entry = SPE_DEFAULT_ENTRY;
    threadData.flags = 0;
    threadData.argp = NULL;
    threadData.envp = NULL;

    /* Now we can create a thread to run on an SPE */
    pthread_create(&threadData.pthread, NULL, &ppu_pthread_function, &threadData);

    // Set up mailbox messages (32-bit ONLY)
    mbox_data[0] = (unsigned int)&array_a[0];
    mbox_data[1] = (unsigned int)&array_b[0];
    mbox_data[2] = (unsigned int)&array_c[0];

(continues on the next page)

Example of Using Mailboxes – PPE Code (continued)

    /* Send the effective addresses of the arrays to the SPE through its
       inbound mailbox */
    while (spe_in_mbox_status(threadData.context) == 0);
    do {
        mbox_dataPtr = &mbox_data[0];
        retVal = spe_in_mbox_write(threadData.context, mbox_dataPtr, 1,
                                   SPE_MBOX_ANY_NONBLOCKING);
    } while (retVal != 1);

    while (spe_in_mbox_status(threadData.context) == 0);
    do {
        mbox_dataPtr = &mbox_data[1];
        retVal = spe_in_mbox_write(threadData.context, mbox_dataPtr, 1,
                                   SPE_MBOX_ANY_NONBLOCKING);
    } while (retVal != 1);

    while (spe_in_mbox_status(threadData.context) == 0);
    do {
        mbox_dataPtr = &mbox_data[2];
        retVal = spe_in_mbox_write(threadData.context, mbox_dataPtr, 1,
                                   SPE_MBOX_ANY_NONBLOCKING);
    } while (retVal != 1);

    /* wait for the SPE to finish */
    pthread_join(threadData.pthread, NULL);
    spe_context_destroy(threadData.context);

    __asm__ __volatile__ ("sync" : : : "memory");

    printf("Array Addition complete. Verifying results...\n");

    /* verify */
    for (i = 0; i < ARRAY_SIZE; i++) {
        if (array_c[i] != (float)((i * NUM) * 3)) {
            printf("ERROR in matrix addition\n");
            return 0;
        }
    }
    printf(" Correct!\n");
    return 0;
}

Example of Using Mailboxes – SPE Code

#include <stdio.h>
#include <spu_mfcio.h>

#define ARRAY_SIZE 1024
#define MY_ALIGN(_my_var_def_, _my_al_) _my_var_def_ __attribute__((__aligned__(_my_al_)))

MY_ALIGN(float array_a[ARRAY_SIZE], 128);
MY_ALIGN(float array_b[ARRAY_SIZE], 128);
MY_ALIGN(float array_c[ARRAY_SIZE], 128);

int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
{
    int i;
    unsigned int Aaddr, Baddr, Caddr;

    /* First: read the mailbox 3 times to get our array effective addresses */
    Aaddr = (unsigned int) spu_read_in_mbox();
    Baddr = (unsigned int) spu_read_in_mbox();
    Caddr = (unsigned int) spu_read_in_mbox();

    /* Now that we have the array EAs we can DMA our data over */
    mfc_get(&array_a, Aaddr, (sizeof(float) * ARRAY_SIZE), 31, 0, 0);
    mfc_get(&array_b, Baddr, (sizeof(float) * ARRAY_SIZE), 31, 0, 0);
    mfc_write_tag_mask(1u << 31);
    mfc_read_tag_status_all();

    /* array add */
    for (i = 0; i < ARRAY_SIZE; i++) {
        array_c[i] = array_a[i] + array_b[i];
    }

    /* Finally, DMA the computed array back to main memory */
    mfc_put(&array_c, Caddr, (sizeof(float) * ARRAY_SIZE), 31, 0, 0);
    mfc_write_tag_mask(1u << 31);
    mfc_read_tag_status_all();

    return 0;
}

SPU Write Outbound Mailbox

– The value written to the SPU Write Outbound Mailbox channel (SPU_WrOutMbox) is entered into the outbound mailbox in the MFC if the mailbox has capacity to accept the value.
– If the mailbox can accept the value, the channel count for SPU_WrOutMbox is decremented by 1.
– If the outbound mailbox is full, the channel count reads as 0.
– If SPE software writes a value to SPU_WrOutMbox when the channel count is 0, the SPU stalls on the write.
– The SPU remains stalled until the PPE or another device reads a message from the outbound mailbox by reading the MMIO address of the mailbox.
– When the mailbox is read through the MMIO address, the channel count is incremented by 1.

SPU Write Outbound Interrupt Mailbox

– The value written to the SPU Write Outbound Interrupt Mailbox channel (SPU_WrOutIntrMbox) is entered into the outbound interrupt mailbox if the mailbox has capacity to accept the value.
– If the mailbox can accept the message, the channel count for SPU_WrOutIntrMbox is decremented by 1, and an interrupt is raised in the PPE or other device, depending on interrupt enabling and routing.
– There is no ordering between the interrupt and previously issued MFC commands.
– If the outbound interrupt mailbox is full, the channel count reads as 0.
– If SPE software writes a value to SPU_WrOutIntrMbox when the channel count is 0, the SPU stalls on the write.
– The SPU remains stalled until the PPE or another device reads a mailbox message from the outbound interrupt mailbox by reading the MMIO address of the mailbox.
– When this is done, the channel count is incremented by 1.

PPU Reads SPU Outbound Mailboxes

• The PPU must check the Mailbox Status Register first:
  – check that unread data is available in the SPU Outbound Mailbox or SPU Outbound Interrupt Mailbox
  – otherwise, stale or undefined data may be returned
• To determine that unread data is available:
  – the PPE reads the Mailbox Status Register
  – and extracts the count value from the SPU_Out_Mbox_Count field
• If the count is:
  – non-zero → at least one unread value is present
  – zero → the PPE should not read yet, but keep polling the Mailbox Status Register

SPU Read Inbound Mailbox Channel

• The mailbox is a FIFO queue:
  – if the SPU Read Inbound Mailbox channel (SPU_RdInMbox) has a message, the value read from the mailbox is the oldest message written to it.
• Mailbox status (empty: channel count = 0):
  – if the inbound mailbox is empty, the SPU_RdInMbox channel count reads as 0.
• The SPU stalls on reading an empty mailbox:
  – if SPE software reads from SPU_RdInMbox when the channel count is 0, the SPU stalls on the read. The SPU remains stalled until the PPE or another device writes a message to the mailbox by writing to its MMIO address.
• When the mailbox is written through the MMIO address, the channel count is incremented by 1.
• When the mailbox is read by the SPU, the channel count is decremented by 1.
• The SPU Read Inbound Mailbox can be overrun by a PPE, in which case mailbox message data is lost.
• A PPE writing to the SPU Read Inbound Mailbox does not stall when the mailbox is full.

MFC Synchronization Commands

MFC synchronization commands are used to control the order in which DMA storage accesses are performed. There are:
• four atomic commands (getllar, putllc, putlluc, and putqlluc),
• three send-signal commands (sndsig, sndsigf, and sndsigb), and
• three barrier commands (barrier, mfcsync, and mfceieio).

Overview

• SPE Runtime Management Library
• Communication Mechanisms
  – Mailboxes, DMA, Signals
• SIMD
• Accelerator Library Framework (ALF)
• Data Communications and Synchronization (DaCS)

Cell Multi-core Programming Effort Roadmap

Cell BE porting and optimizing steps:
1. Port the app to Linux, if needed.
2. Port the app to Power; run on the PPE. (Up to this point, the effort is mostly the same as porting to any multi-core architecture.)
3. Begin moving function to the SPEs – exploit parallelism at the task level.
4. Optimize function on the SPEs – local store management; exploit parallelism at the instruction/data level.
5. Tune SIMD – data and instruction locality tuning.

Writing for Cell BE speeds up code on all multi-core architectures because it uses the same parallel best practices – the Cell architecture just gains more from them because of its design.

SIMD Architecture

• SIMD = "single instruction, multiple data"
• SIMD exploits data-level parallelism:
  – a single instruction can apply the same operation to multiple data elements in parallel
• SIMD units employ "vector registers", each holding multiple data elements
• SIMD is pervasive in the BE:
  – the PPE includes VMX (SIMD extensions to the PowerPC architecture)
  – the SPE is a native SIMD architecture (VMX-like)
• SIMD in VMX and the SPE:
  – 128-bit-wide datapath
  – 128-bit-wide registers
  – 4-wide fullwords, 8-wide halfwords, 16-wide bytes
  – the SPE adds support for 2-wide doublewords

SIMD (Single Instruction, Multiple Data) Processing

Scalar code (repeated 4 times):

  read the next instruction and decode it
  get this number
  get that number
  add them
  put the result here

Vector code (done once):

  read the instruction and decode it
  get these 4 numbers
  get those 4 numbers
  add them
  put the results here

• Data-level parallelism: operate on multiple data elements simultaneously.
• (Significantly) smaller amount of code => improved execution efficiency.
• Number of elements processed in parallel = (size of vector / size of element).

[Figure: a single instruction decode drives four parallel additions (Numbers 1a-1d + Numbers 2a-2d = Results 1-4), versus four separate decode/add iterations for the scalar code.]

SPU Instruction Set

• Operates primarily on SIMD vector operands, both fixed-point and floating-point
• Supports some scalar operands
• 128 general-purpose registers (GPRs), each 128 bits wide, which can be used to store all data types
• Supports big-endian data ordering:
  – the lowest-address byte and lowest-numbered bit are the most-significant (high) byte and bit
• Supported data types:
  – byte – 8 bits
  – halfword – 16 bits
  – word – 32 bits
  – doubleword – 64 bits
  – quadword – 128 bits

[Figure: vector data types.]

SPE – Pipelines and Dual-Issue Rules

• The SPU has two pipelines:
  – even (pipeline 0)
  – odd (pipeline 1)
• The SPU can issue and complete up to two instructions per cycle, one in each pipeline.
• Dual-issue occurs when a fetch group has two issue-able instructions in which the first instruction can be executed on the even pipeline and the second on the odd pipeline.

SPE and Scalar Code

• The SPU loads and stores only a quadword at a time.
• The value of a scalar operand (including addresses) is kept in the preferred slot of a SIMD register.
• Scalar (sub-quadword) loads and stores require several instructions to format the data for use on the SIMD architecture of the SPE:
  – e.g., a scalar store requires a read, a scalar insert, and a write operation
• Strategies to make operations on scalar data more efficient:
  – change the scalars to quadword vectors, eliminating the three extra instructions associated with loading and storing scalars
  – cluster scalars into groups and load multiple scalars at a time using a quadword memory access; manually extract or insert the scalars as needed, eliminating redundant loads and stores

Intrinsics for changing between scalar and vector data types are illustrated below.
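A minimal sketch of those intrinsics (spu_promote, spu_insert, spu_extract), moving a scalar into and back out of a vector register:

#include <spu_intrinsics.h>

float f = 1.5f;
vector float v = spu_promote(f, 0);   /* scalar into element 0 (the preferred slot) */
v = spu_insert(2.5f, v, 3);           /* replace element 3 */
float g = spu_extract(v, 3);          /* read element 3 back: 2.5f */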

Register Layout of Data Types and the Preferred Slot

• The left-most word (bytes 0, 1, 2, and 3) of a register is called the preferred slot.
• When instructions use or produce scalar operands or addresses, the values are in the preferred scalar slot.

SIMD Programming

• "Native SIMD" programming:
  – the algorithm is vectorized by the programmer
  – coding in a high-level language (e.g., C, C++) using intrinsics
  – intrinsics provide access to SIMD assembler instructions
    • e.g., c = spu_add(a,b) → add vc,va,vb
• "Traditional" programming:
  – the algorithm is coded "normally" in scalar form
  – the compiler does auto-vectorization
  – but auto-vectorization capabilities remain limited

SPE C/C++ Language Extensions (Intrinsics)

• Specific intrinsics: intrinsics that have a one-to-one mapping with a single assembly-language instruction (si_<assembly-instruction-name>); provided for all instructions except some branch- and interrupt-related ones.
• Generic / built-in intrinsics:
  – constant formation (spu_splats)
  – conversion (spu_convtf, spu_convts, ...)
  – arithmetic (spu_add, spu_madd, spu_nmadd, ...)
  – byte operations (spu_absd, spu_avg, ...)
  – compare and branch (spu_cmpeq, spu_cmpgt, ...)
  – bits and masks (spu_shuffle, spu_sel, ...)
  – logical (spu_and, spu_or, ...)
  – shift and rotate (spu_rlqwbyte, spu_rlqw, ...)
  – control (spu_stop, spu_ienable, spu_idisable, ...)
  – channel control (spu_readch, spu_writech, ...)
  – scalar (spu_insert, spu_extract, spu_promote)
• Composite intrinsics: DMA (spu_mfcdma32, spu_mfcdma64, spu_mfcstat)
• Vector data types: vector [unsigned] {char, short, int, float, double}
  – e.g., "vector float", "vector signed short", "vector unsigned int", ...
• Vector pointers:
  – e.g., "vector float *p" (p+1 points to the next vector, 16 bytes after the one pointed to by p)

Simple SIMD Operations

• Example of basic SIMD code:

#include <spu_intrinsics.h>

vector float vec1 = {8.0, 8.0, 8.0, 8.0}, vec2 = {2.0, 4.0, 8.0, 16.0};

vec1 = vec1 + vec2;

• Example of SIMD intrinsics:

#include <spu_intrinsics.h>

vector float vec1 = {8.0, 8.0, 8.0, 8.0}, vec2 = {2.0, 4.0, 8.0, 16.0};

vec1 = spu_sub((vector float)spu_splats((float)3.5), vec1);
vec1 = spu_mul(vec1, vec2);

Vectorizing a Loop – A Simple Example

• The loop does a term-by-term multiply of two arrays.
  – the arrays are assumed here to remain scalar outside of the subroutine
• If the arrays are not quadword-aligned, extra work is necessary.
• If the array size is not a multiple of 4, extra work is necessary (see the variant after this example).

/* scalar version */
int mult1(float *in1, float *in2, float *out, int N)
{
    int i;
    for (i = 0; i < N; i++) {
        out[i] = in1[i] * in2[i];
    }
    return 0;
}

/* vectorized version */
int vmult1(float *in1, float *in2, float *out, int N)
{
    /* assumes that the arrays are quadword-aligned */
    /* assumes that N is divisible by 4 */
    int i, Nv;
    Nv = N >> 2;   /* N/4 vectors */
    vector float *vin1 = (vector float *) (in1);
    vector float *vin2 = (vector float *) (in2);
    vector float *vout = (vector float *) (out);

    for (i = 0; i < Nv; i++) {
        vout[i] = spu_mul(vin1[i], vin2[i]);
    }
    return 0;
}
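For the case where N is not a multiple of 4, one hedged variant (still assuming quadword-aligned arrays) finishes the tail with scalar code:

/* vectorized version with a scalar remainder loop */
int vmult2(float *in1, float *in2, float *out, int N)
{
    int i;
    int Nv = N >> 2;   /* number of full 4-float vectors */
    vector float *vin1 = (vector float *) (in1);
    vector float *vin2 = (vector float *) (in2);
    vector float *vout = (vector float *) (out);

    for (i = 0; i < Nv; i++)
        vout[i] = spu_mul(vin1[i], vin2[i]);

    for (i = Nv << 2; i < N; i++)   /* 0..3 leftover elements */
        out[i] = in1[i] * in2[i];

    return 0;
}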

Overview

• SPE Runtime Management Library
• Communication Mechanisms
  – Mailboxes, DMA, Signals
• SIMD
• Accelerator Library Framework (ALF)
• Data Communications and Synchronization (DaCS)

ALF and DaCS: IBM's Software Enablement Strategy for Multicore Memory-Hierarchy Systems

ALF – Accelerator Library Framework

• Aims at workloads that are highly parallelizable – e.g., raycasting, FFT, Monte Carlo, video codecs.
• Provides a simple user-level programming framework for Cell library developers that can be extended to other hybrid systems.
• Division-of-labor approach:
  – ALF provides wrappers for computational kernels
  – frees programmers from writing their own architecture-dependent code, including data transfer, task management, double buffering, and data communication
• Manages data partitioning:
  – provides efficient scatter/gather implementations via CBE DMA
  – extensible to a variety of data-partitioning patterns
  – host and accelerator describe the scatter/gather operations
  – accelerators gather the input data from, and scatter the output data to, the host's memory
  – manages input/output buffers to/from the SPEs
• Remote error handling.
• Utilizes the DaCS library for some low-level operations (on Hybrid).

ALF Data Partitioning

Basic Structure of an ALF Application

ALF – Work Block

• The work block is the basic data unit of ALF.
• Work block = partition information (input data, output data) + parameters.
• The input and output data for a workload can be divided into many work blocks.

ALF – Data Transfer List

• Work block input and output data descriptions are stored as a Data Transfer List (DTL).
• The input data transfer list is used to gather data from host memory.
• The output data transfer list is used to scatter data to host memory.

ALF – Work Block

• Create NUM_ROW work blocks for the matrix addition and enqueue them to be processed:

alf_wb_handle_t wb_handle;
add_parms_t parm __attribute__((aligned(128)));
int i;

parm.h = H; /* horizontal size of the block */
parm.v = V; /* vertical size of the block */

/* create work blocks, adding parameters and I/O buffers */
for (i = 0; i < NUM_ROW; i += H) {
    alf_wb_create(task_handle, ALF_WB_SINGLE, 0, &wb_handle);

    /* begin a new Data Transfer List for INPUT */
    alf_wb_dtl_set_begin(wb_handle, ALF_BUF_IN, 0);
    /* add H*V elements of matrix_a as input */
    alf_wb_dtl_set_entry_add(wb_handle, &matrix_a[i][0], H * V, ALF_DATA_FLOAT);
    /* add H*V elements of matrix_b as input */
    alf_wb_dtl_set_entry_add(wb_handle, &matrix_b[i][0], H * V, ALF_DATA_FLOAT);
    alf_wb_dtl_set_end(wb_handle);

    /* begin a new Data Transfer List for OUTPUT */
    alf_wb_dtl_set_begin(wb_handle, ALF_BUF_OUT, 0);
    /* add H*V elements of matrix_c as output */
    alf_wb_dtl_set_entry_add(wb_handle, &matrix_c[i][0], H * V, ALF_DATA_FLOAT);
    alf_wb_dtl_set_end(wb_handle);

    /* pass parameters H and V to the SPU */
    alf_wb_parm_add(wb_handle, (void *)(&parm), sizeof(parm), ALF_DATA_BYTE, 0);

    /* enqueue the work block */
    alf_wb_enqueue(wb_handle);
}

alf_task_finalize(task_handle);

ALF – Compute Task

• A compute task processes a work block.
• A task takes in the input data, context, and parameters, and produces the output data.

ALF – Compute Task

• Define the compute task:

/* variable declarations */
alf_task_desc_handle_t task_desc_handle;
alf_task_handle_t task_handle;
const char *spe_image_name;
const char *library_path_name;
const char *comp_kernel_name = "alf_accel_comp_kernel";

/* describe a task that is executable on the SPE */
alf_task_desc_create(alf_handle, ALF_ACCEL_TYPE_SPE, &task_desc_handle);
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_TSK_CTX_SIZE, 0);
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_PARM_CTX_BUF_SIZE, sizeof(add_parms_t));
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_IN_BUF_SIZE, H * V * 2 * sizeof(float));
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_OUT_BUF_SIZE, H * V * sizeof(float));
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_NUM_DTL_ENTRIES, 8);
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_MAX_STACK_SIZE, 4096);

/* provide the SPE executable name */
alf_task_desc_set_int64(task_desc_handle, ALF_TASK_DESC_ACCEL_IMAGE_REF_L, (unsigned long long) spe_image_name);
alf_task_desc_set_int64(task_desc_handle, ALF_TASK_DESC_ACCEL_LIBRARY_REF_L, (unsigned long long) library_path_name);
alf_task_desc_set_int64(task_desc_handle, ALF_TASK_DESC_ACCEL_KERNEL_REF_L, (unsigned long long) comp_kernel_name);

• Compute kernel on the SPE:

int alf_accel_comp_kernel(void *p_task_context, void *p_parm_context,
                          void *p_input_buffer, void *p_output_buffer,
                          void *p_inout_buffer, unsigned int current_count,
                          unsigned int total_count)
{
    unsigned int i, cnt;
    vector float *sa, *sb, *sc;
    add_parms_t *p_parm = (add_parms_t *)p_parm_context;

    cnt = p_parm->h * p_parm->v / 4;   /* number of 4-float vectors */
    sa = (vector float *) p_input_buffer;
    sb = sa + cnt;
    sc = (vector float *) p_output_buffer;

    for (i = 0; i < cnt; i += 4) {     /* unrolled by 4 */
        sc[i]     = spu_add(sa[i],     sb[i]);
        sc[i + 1] = spu_add(sa[i + 1], sb[i + 1]);
        sc[i + 2] = spu_add(sa[i + 2], sb[i + 2]);
        sc[i + 3] = spu_add(sa[i + 3], sb[i + 3]);
    }
    return 0;
}

ALF – Task Context

• Provides a persistent data buffer across work blocks.
• Can be used for all-reduce operations such as min, max, sum, average, etc.
• Can have both a read-only section and a writable section:
  – the writable section can be returned to host memory once the task is finished
  – the ALF runtime does not provide data-coherency support if there is a conflict in writing the section back to host memory
  – programmers should create a unique task context for each instance of a compute kernel on an accelerator

/* the context merge function */
int ctx_merge(void *p_task_context_to_be_merged, void *p_task_context)
{
    my_task_context_t *p_ctx = (my_task_context_t *) p_task_context;
    my_task_context_t *p_mgr_ctx = (my_task_context_t *) p_task_context_to_be_merged;

    if (p_mgr_ctx->max > p_ctx->max)
        p_ctx->max = p_mgr_ctx->max;
    if (p_mgr_ctx->min < p_ctx->min)
        p_ctx->min = p_mgr_ctx->min;
    return 0;
}

ALF – Queues

• Two different queues are important to programmers:
  – a work-block queue for each task:
    • pending work blocks are issued to the work queue
    • the task instance on each accelerator node fetches from this queue
  – a task queue for each ALF runtime:
    • multiple tasks can execute at one time, except where the programmer specifies dependencies
    • future tasks can be issued; they are placed on the task queue awaiting execution
• The ALF runtime manages both queues.

ALF – Buffer Management on Accelerators

• ALF manages buffer allocation in the accelerators' local memory.
• The ALF runtime provides pointers to 5 different buffers to a computational kernel:
  – task context buffer (RO and RW sections)
  – work block parameters
  – input buffer
  – input/output buffer
  – output buffer
• ALF implements a best-effort double-buffering scheme:
  – ALF determines whether there is enough local memory for double buffering
• Double-buffer scenarios supported by ALF:
  – 4 buffers: [In0, Out0; In1, Out1]
  – 3 buffers: [In0, Out0; In1] : [Out1; In2, Out2]

ALF – Synchronization Support

• Barrier:
  – all work blocks enqueued before the barrier are guaranteed to finish before any work block added after the barrier can be processed on any of the accelerators
  – programmers can register a callback function invoked once the barrier is encountered; the main PPE application thread cannot proceed until the callback function returns
• Notification:
  – allows programmers to query for the completion of a specific work block
  – allows programmers to register a callback function invoked once this particular work block has finished
  – does not provide any ordering

ALF – Dependencies

• Task dependencies can be specified by:
  alf_task_depends_on(tsk2, tsk1);

ALF Example (PPU code)

/* Compute PI by the Buffon-Needle method, based on ALF */
#include "alf.h"
#include "pi.h"

result_t result;
char spu_image_path[PATH_BUF_SIZE]; // holds the complete path to the SPU image
char library_name[PATH_BUF_SIZE];   // holds the name of the SPU library
char spu_image_name[] = "alf_pi_spu";
char kernel_name[] = "comp_kernel";
char ctx_setup_name[] = "context_setup";
char ctx_merge_name[] = "context_merge";

int main(int argc __attribute__ ((__unused__)), char *argv[] __attribute__ ((__unused__)))
{
    /* Declare variables for the ALF runtime */
    void *config_parms = NULL;
    alf_handle_t alf_handle;
    alf_task_desc_handle_t desc_info_handle;
    alf_task_handle_t task_handle;
    alf_wb_handle_t wb_handle;
    int i, nodes;        /* loop index and number of available SPEs
                            (declarations omitted on the original slide) */
    udata_t my_udata;    /* per-work-block parameters (type from pi.h) */
    double pi;

    if (alf_init(config_parms, &alf_handle) < 0) {
        printf("Failed to call alf_init\n");
        return 1;
    }
    if (alf_query_system_info(alf_handle, ALF_QUERY_NUM_ACCEL, ALF_ACCEL_TYPE_SPE, &nodes) < 0) {
        printf("Failed to call alf_query_system_info.\n");
        return 1;
    } else if (nodes <= 0) {
        printf("Cannot allocate any SPE to use.\n");
        return 1;
    }
    alf_num_instances_set(alf_handle, nodes);

    /* Create a task descriptor for the task */
    alf_task_desc_create(alf_handle, ALF_ACCEL_TYPE_SPE, &desc_info_handle);

(continues on the next page)

ALF Example (PPU code, continued)

    alf_task_desc_set_int32(desc_info_handle, ALF_TASK_DESC_WB_PARM_CTX_BUF_SIZE, sizeof(udata_t));
    alf_task_desc_set_int32(desc_info_handle, ALF_TASK_DESC_TSK_CTX_SIZE, sizeof(result_t));
    alf_task_desc_set_int64(desc_info_handle, ALF_TASK_DESC_ACCEL_IMAGE_REF_L, (unsigned long long) spu_image_name);
    alf_task_desc_set_int64(desc_info_handle, ALF_TASK_DESC_ACCEL_LIBRARY_REF_L, (unsigned long long) library_name);
    alf_task_desc_set_int64(desc_info_handle, ALF_TASK_DESC_ACCEL_KERNEL_REF_L, (unsigned long long) kernel_name);
    alf_task_desc_set_int64(desc_info_handle, ALF_TASK_DESC_ACCEL_CTX_SETUP_REF_L, (unsigned long long) ctx_setup_name);
    alf_task_desc_set_int64(desc_info_handle, ALF_TASK_DESC_ACCEL_CTX_MERGE_REF_L, (unsigned long long) ctx_merge_name);
    alf_task_desc_ctx_entry_add(desc_info_handle, ALF_DATA_DOUBLE, 1); /* result_t.value */
    alf_task_desc_ctx_entry_add(desc_info_handle, ALF_DATA_INT32, 2);  /* result_t.wbs and result_t.pad */

    // Create the task
    alf_task_create(desc_info_handle, &result, 0, 0, 1, &task_handle);

    result.value = 0.0;
    my_udata.iter = ITERATIONS;

    for (i = 0; i < WBS; i++) {
        alf_wb_create(task_handle, ALF_WB_SINGLE, 1, &wb_handle);
        my_udata.seed = rand();
        alf_wb_parm_add(wb_handle, (void *)&(my_udata.seed), 1, ALF_DATA_INT32, 2);
        alf_wb_parm_add(wb_handle, (void *)&(my_udata.iter), 1, ALF_DATA_INT32, 2);
        alf_wb_enqueue(wb_handle);
    }

    alf_task_finalize(task_handle);

    /* Wait until the task is finished */
    alf_task_wait(task_handle, -1);

    /* Cleanup */
    alf_task_desc_destroy(desc_info_handle);
    alf_exit(alf_handle, ALF_EXIT_POLICY_FORCE, 0);

    pi = result.value / result.wbs;
    printf("PI = %f\n", pi);
    return 0;
}


ALF Example (SPU code)

#include <stdlib.h>
#include "alf_accel.h"
#include "../pi.h"

int comp_kernel(void *p_task_ctx,
                void *p_parm_ctx_buffer,
                void *p_input_buffer __attribute__ ((unused)),
                void *p_output_buffer __attribute__ ((unused)),
                void *p_inout_buffer __attribute__ ((unused)),
                unsigned int current_count __attribute__ ((unused)),
                unsigned int total_count __attribute__ ((unused)))
{
    udata_t *my_udata_ptr = (udata_t *)p_parm_ctx_buffer;
    result_t *tsk_ctx_ptr = (result_t *)p_task_ctx;
    unsigned int i, result = 0;   /* hits inside the unit quarter circle */
    double x, y;

    srand(my_udata_ptr->seed);
    for (i = 0; i < my_udata_ptr->iter; i++) {
        x = rand() * 1.0 / RAND_MAX;
        y = rand() * 1.0 / RAND_MAX;
        if (x*x + y*y <= 1.0) result += 1;
    }
    tsk_ctx_ptr->value += (result * 4.0 / ITERATIONS);  /* this block's PI estimate */
    tsk_ctx_ptr->wbs++;                                 /* count of completed blocks */
    return 0;
}

int context_setup(void *p_task_ctx)
{
    result_t *tsk_ctx_ptr = (result_t *)p_task_ctx;
    tsk_ctx_ptr->value = 0.0;
    tsk_ctx_ptr->wbs = 0;
    return 0;
}

int context_merge(void *p_dst_task_ctx, void *p_task_ctx)
{
    result_t *tsk_ctx_ptr = (result_t *)p_task_ctx;
    result_t *dst_tsk_ctx_ptr = (result_t *)p_dst_task_ctx;
    tsk_ctx_ptr->value += dst_tsk_ctx_ptr->value;
    tsk_ctx_ptr->wbs += dst_tsk_ctx_ptr->wbs;
    return 0;
}

ALF_ACCEL_EXPORT_API_LIST_BEGIN
ALF_ACCEL_EXPORT_API("", comp_kernel);
ALF_ACCEL_EXPORT_API("", context_setup);
ALF_ACCEL_EXPORT_API("", context_merge);
ALF_ACCEL_EXPORT_API_LIST_END


Overview

� SPE Runtime Management Library

� Communication Mechanisms

– Mailboxes, DMA, Signals

� SIMD

� Accelerator Library Framework (ALF)

� Data Communications and Synchronization (DaCS)


DaCS - Data Communications and Synchronization

� Focused on data movement primitives
  – DMA-like interfaces (put, get)

– Message interfaces (send/recv)

– Mailbox

– Endian Conversion

� Provides Process Management, Accelerator Topology Services

� Based on Remote Memory windows and Data channels architecture

� Common API for intra-accelerator (Cell BE) and host-accelerator communication

� Double and multi-buffering
  – Efficient data transfer to maximize available bandwidth and minimize inherent latency
  – Hides the complexity of asynchronous compute/communicate from the developer

� Supports ALF; used directly by a US national laboratory for a host-accelerator HPC environment

� Thin layer of API support on native Cell BE hardware

� Hybrid DaCS (not applicable to BSC Prototype)
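
The remote-memory-window flow that the example later in this section implements can be summarized as follows (the calls are the ones listed on the next slide):

    /* Typical remote-memory-window flow:
       PPE:  dacs_remote_mem_create()   - expose a local buffer as a window
             dacs_remote_mem_share()    - hand the window to a child process
       SPE:  dacs_remote_mem_accept()   - obtain the shared window handle
             dacs_get() / dacs_put()    - DMA data out of / into the window
             dacs_remote_mem_release()  - done with the window
       PPE:  dacs_remote_mem_destroy()  - tear the window down */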


DaCS - APIs

� Init / Term
  – dacs_runtime_init
  – dacs_runtime_exit

� Reservation Service
  – dacs_get_num_avail_children
  – dacs_reserve_children
  – dacs_release_de_list

� Process Management
  – dacs_de_start
  – dacs_num_processes_supported
  – dacs_num_processes_running
  – dacs_de_wait
  – dacs_de_test

� Group Functions
  – dacs_group_init
  – dacs_group_add_member
  – dacs_group_close
  – dacs_group_destroy
  – dacs_group_accept
  – dacs_group_leave
  – dacs_barrier_wait

� Data Communication
  – dacs_remote_mem_create
  – dacs_remote_mem_share
  – dacs_remote_mem_accept
  – dacs_remote_mem_release
  – dacs_remote_mem_destroy
  – dacs_remote_mem_query
  – dacs_put
  – dacs_get
  – dacs_put_list
  – dacs_get_list
  – dacs_send
  – dacs_recv
  – dacs_mailbox_write
  – dacs_mailbox_read
  – dacs_mailbox_test
  – dacs_wid_reserve
  – dacs_wid_release
  – dacs_test
  – dacs_wait

� Locking Primitives
  – dacs_mutex_init
  – dacs_mutex_share
  – dacs_mutex_accept
  – dacs_mutex_lock
  – dacs_mutex_try_lock
  – dacs_mutex_unlock
  – dacs_mutex_release
  – dacs_mutex_destroy

� Error Handling
  – dacs_errhandler_reg
  – dacs_strerror
  – dacs_error_num
  – dacs_error_code
  – dacs_error_str
  – dacs_error_de
  – dacs_error_pid
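
Of these, the mailbox calls are the simplest way to exchange small 32-bit messages; the example below uses send/recv and put/get instead, so here is a minimal mailbox sketch. The argument order is believed to follow the DaCS reference, but treat the exact signatures as an assumption and verify them there:

    /* Host (PPE) side: post a 32-bit message to a reserved child DE.
       Signatures assumed from the DaCS reference -- verify before use. */
    uint32_t msg = 42;
    rc = dacs_mailbox_write(&msg, child_de, child_pid);

    /* Accelerator (SPE) side: block until the parent's message arrives */
    uint32_t msg;
    rc = dacs_mailbox_read(&msg, DACS_DE_PARENT, DACS_PID_PARENT);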


DaCS Example (PPU code)

program main
  integer, parameter :: N=64   ! must match DATA_BUFFER_ENTRIES in control_blk.h
  real, dimension(N) :: arr
  integer(kind=8) :: aptr
  integer :: i
  interface
    subroutine update_array(x, i)
      real x(*)
      integer :: i
    end subroutine
  end interface
  pointer(aptr, arr)
  aptr = malloc(N*4+16)
  write(*,'(A,z8,A,z8,A)') 'loc arr[', loc(arr), '] aptr[', aptr, ']'
  arr = (/ (real(i), i=1, N) /)
  call update_array(arr, N)
  write(*,'(8f6.1)') arr
end program

#include <stdio.h>
#include <dacs.h>
#include "init_cb.h"   /* assumed to define control_block, MAX_UNIT_COUNT, ERRCHK,
                          PRINTF, DATA_BUFFER_SIZE, STREAM_ID, num_rsvd_child,
                          unit_count, unit_buffer_size and cb1 used below */

extern dacs_program_handle_t array_spu;
uint32_t local_error_handler(dacs_error_t);

void update_array_(float *carr, int *nptr) {
    DACS_ERR_T rc;
    int32_t status;
    uint32_t i, count_spes;
    de_id_t rsvd_child_des[MAX_UNIT_COUNT];
    dacs_process_id_t rsvd_child_pids[MAX_UNIT_COUNT];
    dacs_remote_mem_t carr_remote_mem;
    dacs_wid_t ppu_wid[MAX_UNIT_COUNT];

    rc = dacs_init(0); ERRCHK("PPE: dacs_init", rc);
    rc = dacs_errhandler_reg(local_error_handler, 0);
    ERRCHK("PPE: dacs_errhandler_reg", rc);

    /* reserve the maximum number of SPEs we want */
    count_spes = MAX_UNIT_COUNT;
    rc = dacs_reserve_children(DACS_DE_SPE, &count_spes, rsvd_child_des);
    ERRCHK("PPE: dacs_reserve_children", rc);

    /* release extra children */
    if (count_spes > num_rsvd_child) {
        rc = dacs_release_de_list(count_spes - num_rsvd_child,
                                  &rsvd_child_des[num_rsvd_child]);
    }

    /* expose the Fortran array as a remote memory window */
    rc = dacs_remote_mem_create(carr, DATA_BUFFER_SIZE,
                                DACS_READ_WRITE,
                                &carr_remote_mem);
    ERRCHK("PPE: dacs_remote_mem_create", rc);

(continues on next page)


DaCS Example (PPU code, continued)

    for (i = 0; i < unit_count; i++) {
        /* load the address into the control block */
        cb1[i].offset = i * unit_buffer_size;
        cb1[i].chunk_size = unit_buffer_size;
        cb1[i].index = i;
    }

    /* start the SPE tasks */
    for (i = 0; i < unit_count; i++) {
        int index = cb1[i].index;
        PRINTF("ppu unit.%i starting\n", index);
        rc = dacs_de_start(rsvd_child_des[i], &array_spu, NULL, NULL,
                           DACS_PROC_EMBEDDED, &rsvd_child_pids[i]);
        ERRCHK("PPE: dacs_de_start", rc);
        rc = dacs_wid_reserve(&ppu_wid[i]);
        ERRCHK("PPE: dacs_wid_reserve", rc);
        PRINTF("ppu unit.%i sending cb\n", index);
        rc = dacs_send(&cb1[i], sizeof(cb1[i]), rsvd_child_des[i],
                       rsvd_child_pids[i], STREAM_ID, ppu_wid[i],
                       DACS_BYTE_SWAP_DISABLE);
        ERRCHK("PPE: dacs_send", rc);
    }

    /* share the remote memory window with each child */
    for (i = 0; i < unit_count; i++) {
        int index = cb1[i].index;
        rc = dacs_wait(ppu_wid[i]); ERRCHK("PPE: dacs_wait", rc);
        PRINTF("ppu unit.%i sharing mem\n", index);
        rc = dacs_remote_mem_share(rsvd_child_des[i], rsvd_child_pids[i], carr_remote_mem);
        ERRCHK("PPE: dacs_remote_mem_share", rc);
    }

    PRINTF("SPEs 0 - %d are now computing...\n", unit_count);

    /* wait for all SPEs to finish */
    for (i = 0; i < unit_count; ++i) {
        int index = cb1[i].index;
        PRINTF("ppu unit.%i waiting on completion\n", index);
        rc = dacs_de_wait(rsvd_child_des[i], rsvd_child_pids[i], &status);
        if (rc != DACS_STS_PROC_FINISHED) {
            PRINTF("dacs_de_wait rc[%d] - %s\n", rc, dacs_strerror(rc));
        }
    }

    PRINTF("ppu destroying memory share\n");
    rc = dacs_remote_mem_destroy(&carr_remote_mem);
    ERRCHK("PPE: dacs_remote_mem_destroy", rc);

    PRINTF("ppu exiting %i\n", i);
    dacs_exit();
    PRINTF("SPEs done.\n");
}


DaCS Example (SPU code)

#include <dacs.h>
#include "init_cb.h"

extern void compute_f_(float *, int *);   /* Fortran compute kernel below */

/* local copy of the data array, to be filled by the DMA */
float *data;

/* local copy of the control block, to be filled by the receive */
control_block cb1 __attribute__ ((aligned(128)));

int main(unsigned long long speid, addr64 argp, addr64 envp) {
    DACS_ERR_T rc;
    dacs_wid_t spu_wid;
    dacs_remote_mem_t spu_remote_mem;
    int n;   /* number of floats in this unit's chunk */

    rc = dacs_init(0); ERRCHK("array_spu: dacs_init", rc);
    rc = dacs_wid_reserve(&spu_wid); ERRCHK("array_spu: dacs_wid_reserve", rc);

    /* receive the control block describing this unit's chunk */
    PRINTF("array_spu: receiving cb\n");
    rc = dacs_recv(&cb1, sizeof(cb1), DACS_DE_PARENT, DACS_PID_PARENT, STREAM_ID,
                   spu_wid, DACS_BYTE_SWAP_DISABLE);
    ERRCHK("array_spu: dacs_recv", rc);
    rc = dacs_wait(spu_wid);
    ERRCHK("array_spu: dacs_wait", rc);

    /* accept the remote memory window shared by the parent */
    rc = dacs_remote_mem_accept(DACS_DE_PARENT, DACS_PID_PARENT, &spu_remote_mem);

    n = cb1.chunk_size / sizeof(float);
    data = (float *)_malloc_align(cb1.chunk_size, 7);  /* 2^7 = 128-byte aligned */
    rc = dacs_get(data, spu_remote_mem, cb1.offset, cb1.chunk_size, spu_wid,
                  DACS_ORDER_ATTR_NONE, DACS_BYTE_SWAP_DISABLE);
    rc = dacs_wait(spu_wid);

    PRINTF("array_spu: unit.%i invoking compute\n", cb1.index);
    compute_f_(data, &n);

    /* write the updated chunk back into the parent's window */
    rc = dacs_put(spu_remote_mem, cb1.offset, data, cb1.chunk_size, spu_wid,
                  DACS_ORDER_ATTR_NONE, DACS_BYTE_SWAP_DISABLE);
    rc = dacs_wait(spu_wid);

    rc = dacs_remote_mem_release(&spu_remote_mem);
    rc = dacs_wid_release(&spu_wid);
    dacs_exit();
    return 0;
}

subroutine compute_f(x, n)
  real :: x(*)
  integer :: n, i
  real :: factr(n)
  factr = (/ (rand(), i=1, n) /)
  x(1:n) = x(1:n) * factr
end subroutine

