Application Acceleration with the Cell Broadband Engine



Novel Architectures

    Editors: Volodymyr Kindratenko, [email protected]

    Pedro Trancoso, [email protected]

By Guochun Shi, Volodymyr Kindratenko, Frederico Pratas, Pedro Trancoso, and Michael Gschwind

The Cell Broadband Engine is a heterogeneous chip multiprocessor that combines a PowerPC processor core with eight single-instruction multiple-data accelerator cores and delivers high performance on many computationally intensive codes.

The Cell Broadband Engine Architecture (CBEA),1 jointly developed by Sony, Toshiba, and IBM, was conceived as a next-generation chip architecture for multimedia and compute-intensive processing. The first target system for the new architecture was the Sony PlayStation 3 game console. The CBEA is a heterogeneous chip multiprocessing architecture designed to provide the flexibility and performance that game applications require. Key design goals were to

• support a large degree of parallelism, as the target applications are highly parallel; and
• offer high memory bandwidth, as the target applications process large amounts of data.

Because all modern processor designs are increasingly bandwidth limited, it was of paramount importance to efficiently use the available off-chip memory bandwidth to fully exploit the CBEA's performance potential. The CBEA's design gives programmers the opportunity to better utilize available chip bandwidth by performing explicit data transfers in parallel with computation, exploiting application knowledge about data reference patterns. These design targets resonate well with computational scientists' needs in high-performance computing, and the CBEA's applicability to general-purpose computation has been widely recognized in the scientific computing community.2

Currently, IBM offers two processors based on the CBEA: the Cell Broadband Engine (Cell/B.E.) processor, which is used in Sony's PlayStation 3 and IBM's first Cell blades, and the PowerXCell 8i processor, which is at the core of Roadrunner, the world's first petaflop supercomputer.

Cell/B.E. Architecture

The CBEA is a system architecture that extends the industry-standard 64-bit IBM Power Architecture with an accelerator architecture for compute-intensive workloads. To optimize data access for compute-intensive workloads and avoid the performance penalties associated with cache misses in traditional architectures, the CBEA includes novel high-bandwidth data movement engines to transfer data blocks between system memory and on-chip storage. The CBEA also includes advanced communication and synchronization primitives between the processor elements.

Figure 1 shows a Cell/B.E. processor's major components:

• the main processing element (the Power processor element, or PPE),
• the parallel processing accelerators (synergistic processor elements, or SPEs),
• the on-chip interconnect (a bidirectional data ring known as the element interconnect bus, or EIB), and
• the I/O interfaces (the memory interface controller, or MIC, and the Cell Broadband Engine interface, or BEI).

The PPE's main function is to run the operating system, manage the system resources and SPE threads, and act as controller of the SPEs, which handle the computational workload. The PPE contains a 64-bit PowerPC processor unit (PPU), two separate 32-Kbyte level 1 caches for instructions and data, and a unified 512-Kbyte level 2 cache for instructions and data. The PPU supports two-way simultaneous multithreading. The PPE can complete two double-precision operations per clock cycle, resulting in a peak performance of 6.4 Gflops at 3.2 GHz.

Each SPE is an accelerator core based on a single-instruction, multiple-data (SIMD) reduced instruction set computer processor. This processor is composed of a dual-issue pipelined synergistic processor unit (SPU) and a memory flow controller (MFC) that can perform computation and data transfer in parallel. The SPU architecture implements a new SIMD instruction set operating on 128-bit SIMD vectors of integer or floating point elements. The SPU operation semantics are similar to those of the PowerPC SIMD extensions. The SPE has a unified 128-entry, 128-bit-wide SIMD register file to store operands of all data types. The SPE compute instructions always operate on a full vector of data. If a scalar result is needed, the first vector element (the preferred slot) can be used in scalar computations or as a memory address in the local store. Each SPE can directly address a local store of 256 Kbytes for instruction and data references. This local store is explicitly managed by software (that is, the compiler or programmer) by way of MFC block transfers between system memory and the local store. The MFC can issue up to 16 simultaneous direct memory access (DMA) operations of up to 16 Kbytes each between the local store and system memory. Unlike traditional hardware caches, this organization lets the system prefetch large memory operands into on-chip memory in parallel with program execution, avoiding the performance-degrading effects of frequent cache misses commonly associated with hardware caches. Because large data-transfer blocks can be efficiently scheduled using the MFCs, this organization also makes efficient use of off-chip memory bandwidth, one of the primary bottlenecks in modern systems.
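To make the explicit data movement concrete, here is a minimal SPU-side sketch using the SDK's spu_mfcio.h interface; the buffer name, tag value, and transfer size are illustrative assumptions rather than code from any particular application.

#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK_BYTES 16384   /* 16 Kbytes: the largest single DMA transfer */

/* Local-store buffer; 128-byte alignment gives the best DMA bandwidth. */
static volatile char buffer[CHUNK_BYTES] __attribute__((aligned(128)));

void fetch_block(uint64_t ea)   /* ea: effective address in system memory */
{
    unsigned int tag = 0;       /* DMA tag group (0-31) */

    /* Queue a get: system memory -> local store. The transfer proceeds
       in parallel with SPU computation. */
    mfc_get(buffer, ea, CHUNK_BYTES, tag, 0, 0);

    /* Block only when the data is actually needed. */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}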

The PPE and SPEs communicate with each other and with main storage and I/O through the EIB. The EIB implements a bidirectional data ring with four communication channels between the various system units: the PPE, the memory controller, the eight SPEs, and the two off-chip I/O interfaces (BEI and MIC). The EIB consists of four 16-byte-wide data rings, each capable of transferring 128 bytes at a time, with a maximum bandwidth of 96 bytes per processor clock cycle.

    The memory-coherent EIB has two external interfaces:

• The MIC provides the interface between the EIB and the main memory, allowing memory accesses of 1 to 8, 16, 32, 64, or 128 bytes.
• The BEI manages data transfers between the EIB and I/O devices via two Rambus FlexIO external I/O channels.

For I/O operations, the BEI translates addresses, processes commands, implements an internal interrupt controller, and provides bus interfacing. It can also be configured to build systems consisting of multiple Cell/B.E. processors, a function exploited in IBM's Cell blades to provide Cell systems comprising two PPEs and 16 SPEs.

The PowerXCell 8i processor is an enhanced implementation of the Cell/B.E. architecture designed for high-performance, double-precision floating-point-intensive workloads that benefit from a large-capacity main memory. The PowerXCell 8i implementation is also optimized for scientific workloads. The enhancements improve double-precision floating point performance by approximately a factor of eight, achieving an aggregate throughput of 108 Gflops per chip, or 12.8 Gflops per SPE. This is achieved by fully pipelining the double-precision arithmetic unit, as well as by reducing each double-precision operation's latency from 13 cycles in the original implementation to only nine cycles. Moreover, replacing the original processor's Rambus XDR bus with a double data rate (DDR2) memory interface has expanded the system's memory capacity from 2 to 16 Gbytes.

Figure 1. The Cell/B.E. processor. The processor's key components are the Power processor element (PPE), eight synergistic processor elements (SPEs), the element interconnect bus (EIB), and two input/output units: the memory interface controller (MIC) and the Cell/B.E. interface (BEI). Each SPE contains a synergistic processor unit (SPU) with a synergistic execution unit (SXU) and local store (LS), plus a memory flow controller (MFC) attached to the EIB; the PPE contains a Power processor unit (PPU) with a Power execution unit (PXU), a level 1 cache, and a level 2 cache. The MIC connects to dual XDR/DDR2 (extreme data rate/double data rate 2) memory, and the BEI connects to FlexIO I/O links.


Numerical Computing with Cell/B.E.

To help illustrate how we can use the Cell/B.E. for numerical computing, we'll use an example of a 3 × 3 complex matrix-vector multiplication, mult_su3_mat_vec. Figure 2 shows how we can implement this multiplication on a single-threaded CPU without data-parallel SIMD extensions. In a typical application, mult_su3_mat_vec is called multiple times to process an input dataset consisting of two arrays: an array of 3 × 3 complex matrices and an array of 3 × 1 complex vectors.

Converting this code to exploit the Cell/B.E.'s capabilities consists of partitioning the code into two parts:

• the PPE part, which handles the creation of SPE threads and other general tasks such as memory allocation and data initialization; and
• the SPE part, which implements the compute part of the application.3

The IBM Cell/B.E. software developer's kit (www.ibm.com/developerworks/power/cell) includes an SPE management library, libspe, that we use to access and manage SPEs. Alternatively, the SDK's accelerated library framework has a set of interfaces to simplify the offloading of computationally intensive work to the SPEs.

The PPE creates a separate thread to schedule work for each SPE. The input dataset is usually divided into partitions, and each SPE processes one of them. As Figure 3 shows, the matrix and vector input arrays are evenly divided into N partitions of equal size, where N is the number of SPEs, and each partition is assigned to one SPE for computation. The computations on each data partition are independent of each other; therefore, no synchronization or communication is necessary among SPEs during code execution. After the computations are done, the SPEs send the results back to main memory.
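As a sketch of this PPE-side structure, the following uses the SDK's libspe2 interface (the successor to libspe); the embedded SPE program handle spu_mult_kernel, the partition descriptors, and the eight-thread limit are illustrative assumptions, not code from the MILC port.

#include <libspe2.h>
#include <pthread.h>

extern spe_program_handle_t spu_mult_kernel;   /* hypothetical embedded SPE program */

typedef struct {
    spe_context_ptr_t ctx;
    void *argp;                 /* points at this SPE's partition descriptor */
} spe_thread_arg_t;

static void *spe_thread(void *arg)
{
    spe_thread_arg_t *a = (spe_thread_arg_t *)arg;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    /* Blocks until the SPE program has processed its partition. */
    spe_context_run(a->ctx, &entry, 0, a->argp, NULL, NULL);
    return NULL;
}

void run_on_spes(int nspes, void *partition_args[])
{
    pthread_t tid[8];
    spe_thread_arg_t targ[8];
    int i;

    for (i = 0; i < nspes; i++) {
        targ[i].ctx  = spe_context_create(0, NULL);
        targ[i].argp = partition_args[i];
        spe_program_load(targ[i].ctx, &spu_mult_kernel);
        pthread_create(&tid[i], NULL, spe_thread, &targ[i]);
    }
    for (i = 0; i < nspes; i++) {
        pthread_join(tid[i], NULL);              /* wait for all partitions */
        spe_context_destroy(targ[i].ctx);
    }
}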

As we now show, we can further optimize the C code in Figure 2 to exploit the SPE architecture's capabilities. Programming SPEs for high performance requires code optimization to exploit the SIMD processing units, typically with intrinsics or auto-vectorizing compilers. To exploit the SPE's data-parallel vector capabilities, we must use appropriately formatted vectors. At times, this can require data reformatting to make good use of the data-parallel vector operations.

Take the SPE intrinsic spu_madd(a, b, c) as an example: it multiplies each element of vector a by the corresponding element of vector b and adds it to the corresponding element of vector c.

Figure 2. A 3 × 3 complex matrix-vector multiplication kernel example taken from a math library used in quantum chromodynamics applications. The kernel performs 72 floating point operations while processing 120 bytes of data.

struct su3_matrix { complex e[3][3]; };
struct su3_vector { complex c[3]; };

void mult_su3_mat_vec(su3_matrix *a, su3_vector *b, su3_vector *c)
{
  int i, j;
  complex x, y;
  for (i = 0; i < 3; i++) {
    x.real = 0.0;
    x.imag = 0.0;
    for (j = 0; j < 3; j++) {
      /* y = a->e[i][j] * b->c[j] (complex multiply) */
      y.real = a->e[i][j].real * b->c[j].real - a->e[i][j].imag * b->c[j].imag;
      y.imag = a->e[i][j].real * b->c[j].imag + a->e[i][j].imag * b->c[j].real;
      x.real += y.real;
      x.imag += y.imag;
    }
    c->c[i] = x;
  }
}

Figure 3. Each synergistic processor unit (SPU) independently computes on a part of the input dataset. The input array of matrices and the input array of vectors are partitioned across SPE0 through SPE7, and the results are assembled into the output array of vectors on the main processing unit, the Power processor element (PPE). The SPE-side loop shown in the figure is:

// DMA input data: mat_src and vec_src
for (i = start_i; i < end_i; i++) {
  mult_su3_mat_vec(mat_src, vec_src, vec_dest);
}
// DMA out output data: vec_dest


Each vector is 16 bytes long and can thus hold four single-precision floating point values or two 8-byte double-precision floating point values. When spu_madd is called, it performs computations on all elements of vectors a, b, and c at once, producing, for example, four single-precision floating point results per call. In contrast, simply looping through all four vector elements and computing a[i] * b[i] + c[i] one element at a time takes four times as long.
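As a minimal sketch (with illustrative values), a single call replaces four scalar multiply-adds:

#include <spu_intrinsics.h>

vector float madd_example(void)
{
    vector float a = {1.0f, 2.0f, 3.0f, 4.0f};
    vector float b = {0.5f, 0.5f, 0.5f, 0.5f};
    vector float c = {1.0f, 1.0f, 1.0f, 1.0f};

    /* One instruction computes all four a[i] * b[i] + c[i] results:
       the returned vector is {1.5f, 2.0f, 2.5f, 3.0f}. */
    return spu_madd(a, b, c);
}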

In our complex matrix-vector multiplier example, one 128-bit vector can hold two complex variables, each of which has a real and an imaginary part expressed as single-precision floating point numbers. To fully use the vector capabilities, we can't reference the real and imaginary parts directly as we did in the scalar CPU code. Instead, we must either reformat the register file's data using appropriate SPE intrinsics (shuffling the data on the fly) to form vectors that hold only the real or only the imaginary parts of four complex numbers, or use an alternative data layout in memory.

Data shuffling is straightforward to implement using SPE intrinsics; Figure 4 shows the process for our matrix-vector multiplication example. Each complex variable's real and imaginary parts are shuffled into separate vectors for both inputs using the spu_shuffle(a, b, pattern) intrinsic. This instruction constructs a vector based on the pattern from the combined 32 bytes of input vectors a and b. After the real and imaginary parts are separated into different vectors, the multiplication and addition operations necessary to implement complex arithmetic are performed on these vectors using the spu_madd, spu_mul, and spu_nmsub SPE intrinsics. The results, which effectively multiply the elements of one complex vector by those of another, are again stored as vectors that hold the real and imaginary parts of the computed values. Each such vector is then summed up to obtain the dot product of two complex vectors. To compute the rest of the matrix-vector product, we repeat the same steps for the remaining two rows of the input matrix and shuffle the computed results to generate the final output, in which real and imaginary parts are stored together.
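The following is a minimal sketch of that shuffle-and-multiply step for single-precision data stored as {real, imaginary} pairs; the byte-selection patterns and variable names are illustrative and not necessarily those of the NCSA implementation.

#include <spu_intrinsics.h>

/* Byte-selection patterns for spu_shuffle: indices 0x00-0x0f select bytes
   from the first operand, 0x10-0x1f from the second. */
static const vector unsigned char pick_real = {
    0x00, 0x01, 0x02, 0x03,  0x08, 0x09, 0x0a, 0x0b,
    0x10, 0x11, 0x12, 0x13,  0x18, 0x19, 0x1a, 0x1b };
static const vector unsigned char pick_imag = {
    0x04, 0x05, 0x06, 0x07,  0x0c, 0x0d, 0x0e, 0x0f,
    0x14, 0x15, 0x16, 0x17,  0x1c, 0x1d, 0x1e, 0x1f };

/* a0, a1 and b0, b1 each hold two interleaved complex numbers {re, im, re, im}.
   On return, *re and *im hold the real and imaginary parts of the four
   element-wise complex products. */
void cmul4(vector float a0, vector float a1,
           vector float b0, vector float b1,
           vector float *re, vector float *im)
{
    vector float va_r = spu_shuffle(a0, a1, pick_real);
    vector float va_i = spu_shuffle(a0, a1, pick_imag);
    vector float vb_r = spu_shuffle(b0, b1, pick_real);
    vector float vb_i = spu_shuffle(b0, b1, pick_imag);

    /* (ar + i*ai)(br + i*bi) = (ar*br - ai*bi) + i*(ar*bi + ai*br) */
    vector float v_r = spu_mul(va_r, vb_r);
    *re = spu_nmsub(va_i, vb_i, v_r);   /* v_r - ai*bi */
    vector float v_i = spu_mul(va_r, vb_i);
    *im = spu_madd(va_i, vb_r, v_i);    /* v_i + ai*br */
}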

In addition to code vectorization, several other optimizations are important when tuning for high performance on the Cell/B.E.3 Among these, improving data alignment helps both data transfer and computation efficiency. It helps data transfer efficiency because the Cell/B.E. processor delivers the best memory-to-local-store bandwidth when both source and destination addresses are aligned at 128-byte boundaries and the amount of data to be transferred is a multiple of 128 bytes; non-aligned DMA requests run at slower speeds. Alignment also contributes to improved computation efficiency because the use of SPE intrinsics requires the data to be treated as 16-byte vectors.

In Figure 2's example, su3_matrix stores a 3 × 3 complex single-precision matrix with a total size of 72 bytes, and su3_vector stores three complex single-precision values with a total size of 24 bytes; thus, they're not aligned at 128-byte boundaries.

Figure 4. The synergistic processor element (SPE) implementation of the compute kernel. The implementation involves data shuffling to reassemble the data into short vectors of four single-precision elements, which are then used in single-instruction multiple-data computations to generate the final output. The real and imaginary parts of the input matrix row (mat) and source vector (src, accessed as vector float *vb = (vector float *)src) are shuffled into separate vectors (va_r, va_i, vb_r, vb_i); the complex product is computed as

v_r = spu_mul(va_r, vb_r);
v_r = spu_nmsub(va_i, vb_i, v_r);
v_i = spu_mul(va_r, vb_i);
v_i = spu_madd(va_i, vb_r, v_i);

and the partial results are reduced with

v_r_sum_0 = vsum(v_r);
v_i_sum_0 = vsum(v_i);

before being shuffled into the output vector (vout0) held in vc.


Because the su3_matrix and su3_vector are typically contained in large aggregates, one approach to meet the alignment requirement is to

• gather the matrix and vector data into aligned and contiguous memory regions in system memory before transferring it to the SPEs using DMA block transfers,
• perform the computations in the SPEs,
• transfer the results back to system memory using DMA block transfers, and
• scatter the data to their discrete locations in the host memory.

However, an initial estimate of this solution suggested that the overhead of gathering and scattering the data was too high for this application. To address this, while simultaneously minimizing code changes and improving memory bandwidth utilization, we changed the su3_matrix definition from a 3 × 3 matrix to a 4 × 4 matrix, thus changing its size to 128 bytes, and changed the su3_vector definition from a vector of three complex variables to a vector of four complex variables. The latter vector is typically used in an array of four such vectors and, after the padding, the size of this array is also 128 bytes (we filled the unused slots in these padded memory structures with zeros).
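A minimal sketch of such padded definitions is shown below; the field names mirror Figure 2, but the exact layout used in the NCSA port may differ.

typedef struct { float real, imag; } complex;   /* 8 bytes */

/* Padded from 3 x 3 (72 bytes) to 4 x 4: 16 complex values = 128 bytes,
   so each matrix fills exactly one optimally sized, aligned DMA block.
   The extra row and column stay zero. */
typedef struct {
    complex e[4][4];
} su3_matrix_pad __attribute__((aligned(128)));

/* Padded from three to four complex values (32 bytes); the application
   groups these in arrays of four, giving 128 bytes per group. */
typedef struct {
    complex c[4];
} su3_vector_pad;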

Other important code optimizations include double buffering, loop unrolling, and software pipelining. Double buffering refers to software pipelining of memory transfer requests, wherein memory transfers are issued ahead of time to overlap the data transfer with a prior data block's computation. Double buffering is an important code transformation that exploits the SPE's ability to perform data transfer and computation in parallel, thereby reducing overall execution time. These techniques increase each loop iteration's computational content, eliminate unnecessary stalls, and help improve instruction scheduling for the dual-issue execution pipelines.
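A minimal SPU-side double-buffering sketch, again assuming the spu_mfcio.h DMA interface and using illustrative names, looks like this:

#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 16384

static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

extern void compute_on(volatile char *data, int nbytes);  /* hypothetical kernel */

void process_stream(uint64_t ea, int nchunks)
{
    int cur = 0;
    int i;

    /* Prefetch the first block. */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

    for (i = 0; i < nchunks; i++) {
        int next = cur ^ 1;

        /* Start fetching block i+1 into the other buffer while block i
           is still in flight or being computed on. */
        if (i + 1 < nchunks)
            mfc_get(buf[next], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, next, 0, 0);

        /* Wait only for the current buffer's tag, then compute on it. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        compute_on(buf[cur], CHUNK);

        cur = next;
    }
}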

Case Study: MILC

Our complex matrix-vector multiplication code example is drawn from the MILC (multiple instruction, multiple data lattice computation) quantum chromodynamics application. MILC simulates four-dimensional SU(3) lattice gauge theory. At the US National Center for Supercomputing Applications (NCSA), we've ported one of the MILC distribution's applications, called clover_dynamical, to the Cell/B.E. using a bottom-up approach,4 identifying all time-consuming compute kernels and reimplementing them to execute on the SPEs. Altogether, 27 kernels are responsible for over 98 percent of overall execution time. Figure 5 shows performance comparisons of NCSA's MILC Cell/B.E. implementation running on a 3.2-GHz Cell processor with a parallel implementation running on a quad-core 2.33-GHz Intel Xeon CPU.

In MILC, compute kernel performance is often limited not by the ability to run calculations on the SPEs, but by the ability to deliver data fast enough to sustain full-speed calculations. In the Cell/B.E., the peak SPE-to-main-memory bandwidth is 25.6 Gbytes per second, one of the highest memory-to-CPU bandwidths available today. However, the SPEs have even more compute bandwidth than the MILC application can exploit given its compute-to-memory bandwidth ratio. With double buffering to overlap communication and computation, the Cell/B.E. has a compute-to-memory bandwidth ratio of 8 flops per byte of data fetched from system memory (204.8 Gflops / 25.6 Gbytes per second). In comparison, for many important MILC kernels, the ratio of flops per byte of data is much lower. As an example, the complex matrix-vector multiplication subroutine we described earlier exhibits a compute-to-memory bandwidth ratio of approximately 0.6 flops per byte prior to data padding.

The Cell/B.E. architecture is supported by a complete SDK that you can download from IBM's Cell Broadband Engine resource center (www.ibm.com/developerworks/power/cell). The SDK provides a full range of development tools, including a compiler, a debugger, performance analysis tools, and sample code. In addition, it includes a simulator that you can use to develop code on other platforms before executing it natively on the Cell/B.E. To compile code for the Cell/B.E., you can use either the IBM xlc or the GNU gcc (ppu-gcc and spu-gcc) compilers. (For more information on the Cell/B.E. architecture and programming, see the IBM Redbook Programming the Cell Broadband Engine Architecture.3)

In addition to the original Pthread-based programming model for developing Cell/B.E. applications, developers have ported several other programming models to this architecture, including OpenMP,5 the message-passing interface,6 Google's MapReduce,7 the data-parallel RapidMind model,8 and the data-driven Cell Superscalar (CellSs) model.9

Figure 5. Performance of the clover_dynamical MILC (multiple instruction, multiple data lattice computation quantum chromodynamics) application for different lattice configurations. Measurements were obtained by researchers at the US National Center for Supercomputing Applications (NCSA) at the University of Illinois and previously reported by Guochun Shi and his colleagues.4 Intel Xeon measurements are based on the original MILC implementation; the Cell/B.E. measurements are based on NCSA's MILC implementation for the Cell/B.E.

                            8×8×16×16 lattice    16×16×16×16 lattice
Intel Xeon execution time   15.4 sec             100.2 sec
Cell/B.E. execution time    4.5 sec              17.5 sec
Speedup                     3.4                  5.7


The application range available on the Cell/B.E. is diverse and reflects the broad appeal of compute-intensive applications. In the game console space, various applications take advantage of the new architecture's advanced features to deliver cutting-edge entertainment. The Cell/B.E. is also used in medical applications and for medical research. Finally, developers have ported a wide range of scientific applications to Cell/B.E. systems covering a spectrum of application areas, including bioinformatics,10 physics,4 and cosmology,11 to name a few. In addition to Cell-blade-based systems, some researchers are using clusters based on Sony's PlayStation 3 in scientific research, such as the UMass Gravity Grid (http://gravity.phy.umassd.edu/ps3.html).

In 2008, the PowerXCell 8i-based Roadrunner system became the world's first petaflops supercomputer, reflecting the processor's attractive high-performance architecture and its utility for high-performance applications. As testament to the Cell/B.E. architecture's efficiency, Cell/B.E.-based systems are also the top energy-efficient systems on the Green500 list (www.green500.org).

References

1. M. Gschwind, "Chip Multiprocessing and the Cell Broadband Engine," Proc. 3rd ACM Conf. Computing Frontiers, ACM Press, 2006, pp. 1-8.
2. S. Williams et al., "The Potential of the Cell Processor for Scientific Computing," Proc. 3rd ACM Conf. Computing Frontiers, ACM Press, 2006, pp. 9-20.
3. IBM Redbooks, Programming the Cell Broadband Engine Architecture: Examples and Best Practices, IBM, 2008; www.redbooks.ibm.com/redpieces/abstracts/sg247575.html.
4. G. Shi, V. Kindratenko, and S. Gottlieb, "The Bottom-Up Implementation of One MILC Lattice QCD Application on the Cell Blade," Int'l J. Parallel Programming, vol. 37, no. 5, 2009, pp. 488-507.
5. K. O'Brien et al., "Supporting OpenMP on Cell," Int'l J. Parallel Programming, vol. 36, no. 3, 2008, pp. 289-311.
6. M. Ohara et al., "MPI Microtask for Programming the Cell Broadband Engine Processor," IBM Systems J., vol. 45, no. 1, 2006, pp. 85-102.
7. M. de Kruijf and K. Sankaralingam, "MapReduce for the Cell Broadband Engine Architecture," IBM J. Research and Development, vol. 53, no. 3, 2009.
8. M.D. McCool, "Data-Parallel Programming on the Cell BE and the GPU Using the RapidMind Development Platform," Proc. GSPx Multicore Applications Conf., 2006; www.cs.ucla.edu/~palsberg/course/cs239/papers/mccool.pdf.
9. P. Bellens et al., "CellSs: A Programming Model for the Cell BE Architecture," Proc. ACM/IEEE Supercomputing Conf. (SC06), IEEE Press, 2006, p. 5.
10. F. Blagojevic et al., "RAxML-Cell: Parallel Phylogenetic Tree Inference on the Cell Broadband Engine," Proc. IEEE Int'l Parallel and Distributed Processing Symp., IEEE Press, 2007, pp. 1-10.
11. S. Habib et al., "Hybrid Petacomputing Meets Cosmology: The Roadrunner Universe Project," J. Phys. Conf. Series, vol. 180, 2009; doi:10.1088/1742-6596/180/1/012019.

Guochun Shi is a research programmer at the US National Center for Supercomputing Applications at the University of Illinois. His research interests are in high-performance computing. Shi has an MS in computer science from the University of Illinois at Urbana-Champaign. Contact him at [email protected].

Volodymyr Kindratenko is a senior research scientist at the US National Center for Supercomputing Applications at the University of Illinois. His research interests include high-performance computing and special-purpose computing architectures. Kindratenko has a DSc in analytical chemistry from the University of Antwerp. He is a senior member of the IEEE and the ACM. Contact him at [email protected].

Frederico Pratas is a PhD student at INESC-ID/IST at the Technical University of Lisbon, Portugal. His research interests include computer architectures, high-performance computing, parallel and distributed computing, and reconfigurable computing. Pratas has an MSc in electrical and computer engineering from the Technical University of Lisbon. He is a member of the IEEE. Contact him at [email protected].

Pedro Trancoso is an assistant professor in the Department of Computer Science at the University of Cyprus, Cyprus. His research interests include computer architecture, multicore architectures, memory hierarchy, parallel programming models, database workloads, and high-performance computing. Trancoso has a PhD in computer science from the University of Illinois at Urbana-Champaign. He is a member of the IEEE, the IEEE Computer Society, and the ACM. Contact him at [email protected].

Michael Gschwind is manager of systems architecture in IBM's Systems and Technology Group at Poughkeepsie, New York, where his team is responsible for mainframe, PowerPC, and I/O architecture. His research interests include computer architecture, microarchitecture, and compilation technology. Gschwind has a PhD in computer engineering from Technische Universität Wien. He is an IBM Master Inventor, a member of the IBM Academy of Technology, and a fellow of the IEEE. Contact him at [email protected].
