DLR Braunschweig, 14th and 15th of October 2009, W. Homberg
Energy-Efficient Many-Core Systems at JSC
Symposium: CFD on Future Architectures – Many-Cores, GPUs, FPGAs, …
October 14-15, 2009 Slide 2
Outline
Top500 / Green500
JUICE
■ Cell BE / PowerXCell 8i
JUPACE
■ QPACE (SFB TR 55)
■ FPGA-based Torus Network
■ PRACE/WP8 Prototype eQPACE
JUGIPSY
■ GPU System NVIDIA Tesla S1070
Outlook
October 14-15, 2009 Slide 3
Jülich Supercomputing Centre
Jugene:
IBM BlueGene/P
294912 processors
1 Petaflops
2.3 MW
Juropa:
Bull, Sun
3288 procs / 26394 cores
308 Teraflops
1.5 MW
October 14-15, 2009 Slide 4
Top500 – June 2009 Top 10
October 14-15, 2009 Slide 5
Green500 – June 2009 Top 10
October 14-15, 2009 Slide 6
The JUICE (JUelich Initiative CEll cluster) Project
By 2006, usage of the Cell processor was gradually widening to scientific applications.
One of the first Cell-based systems was IBM's QS20 blade, equipped with two processors with 512 MB of memory each; external connectivity was provided by Gigabit Ethernet and an optional InfiniBand network interface.
At the beginning of 2007, a cluster of 12 QS20 blades was established at JSC, providing a single-precision floating-point peak performance of about 5 TFLOPS.
A performance of 4.5 TFLOPS (90% of peak) could be attained with a matrix multiplication using all nodes via the InfiniBand network.
The Cell Cluster Meeting 2007 in Jülich, with international attendance, was organized at JSC:
http://www.fz-juelich.de/jsc/juice/cell_cluster_meeting
October 14-15, 2009 Slide 7
JUICE: Activities at JSC
Lanczos Implementation for Hubbard Models (Andreas Dolfen, Erik Koch)
MPI on Cell (Norbert Eicker)
Benchmark Matrix-Matrix Multiplication (Inge Gutheil)
A Multigrid Method for the Solution of the Poisson Equation (Matthias Bolten)
CellSs and Triple-Matrix-Multiply (Liang Yang, Annika Schiller, Godehard Sutmann, B.Wylie)
A Fast Wavelet Based Implementation to Calculate Coulomb Potentials (Annika Schiller, Godehard Sutmann, Liang Yang)
Comparison of CellSs and native programming with a Jacobi solver and triple-matrix-multiply on Cell/B.E. (Liang Yang, Annika Schiller, Godehard Sutmann, Brian J. N. Wylie, Ralph Altenfeld, and Felix Wolf)
JUICE - Jülich Initiative Cell Cluster - Report 2007 (FZJ-JSC-IB-2007-13)
October 14-15, 2009 Slide 8
JUICE: External Partners
50 registered users from 30 sites
Master's Theses:
Porting the NestStep Run-time System to the CELL Broadband Engine, Daniel Johansson, Linköping 2007
A Fast Multigrid Solver for Molecular Dynamics on the Cell Broadband Engine, Daniel Ritter, Erlangen 2008
Über die Cell/B.E.-Architektur: Optionen zur Generierung von Programm-Traces (On the Cell/B.E. architecture: options for generating program traces), Daniel Hackenberg, Dresden 2008
Beschleunigung von Bildrekonstruktionsverfahren in der Positronen-Emissions-Tomographie unter Einsatz der Cell Broadband Engine (Acceleration of image reconstruction methods in positron emission tomography using the Cell Broadband Engine), Michael Sievers, Paderborn
October 14-15, 2009 Slide 9
JUICEnext
Hardware:
35 QS22 blades
70 PowerXCell 8i
7 TFLOPS peak
Main memory: 35 * 8 GB (aggregate 280 GB)
Infiniband 4x SDR
BladeCenter management / terminal server
ParaStation:
MPI stack, GridMonitor
Torque/Maui job management for interactive work and batch processing
October 14-15, 2009 Slide 10
Activities on JUICEnext
Stochastic Rotation Dynamics (SRD)
Energy Function for Protein Folding (ECEPP/3)
High Performance LINPACK (HPL) using ParaStation MPI
Evaluation of IBM BLAS/LAPACK Library
October 14-15, 2009 Slide 11
Stochastic Rotation Dynamics on Cell/B.E. (Annika Schiller, Godehard Sutmann)
Stochastic Rotation Dynamics (SRD):
also called Multiparticle Collision Dynamics (MPC)
is a particle-based mesoscale algorithm for the simulation of fluids and flows
solves the linearized Navier-Stokes equations
employs a discrete-time dynamics with continuous velocities and local multiparticle collisions
basic steps:
● Free Streaming Step: particles are moved without interaction with each other according to their velocities for the duration of one time step dt
● Cell Filling Step: a virtual grid is superimposed over the simulation box and the particles are assigned to the grid cells, the grid is shifted randomly at each time step
● Multiparticle Collision Step: particle velocities are rotated relative to the centre of mass velocity of the collision cell
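These three steps translate into very compact per-particle updates. The following is a minimal serial sketch in C of the streaming and collision steps (illustrative only, not the JSC/CellSs code; the 2D geometry, the fixed rotation angle and all names such as Particle and srd_collide_cell are assumptions):

/* Minimal 2D SRD sketch (illustrative only). */
#include <math.h>

#define ALPHA (M_PI / 2.0)               /* fixed SRD rotation angle */

typedef struct { double x[2], v[2]; } Particle;

/* Free streaming: move particles ballistically for one time step dt. */
static void srd_stream(Particle *p, int n, double dt)
{
    for (int i = 0; i < n; i++) {
        p[i].x[0] += p[i].v[0] * dt;
        p[i].x[1] += p[i].v[1] * dt;
    }
}

/* Multiparticle collision in one (non-empty) grid cell: rotate the
 * velocities of the m particles listed in idx[] by +/-ALPHA around the
 * centre-of-mass velocity of the cell (sign chosen at random per cell). */
static void srd_collide_cell(Particle *p, const int *idx, int m, int sign)
{
    double vcm[2] = { 0.0, 0.0 };
    for (int k = 0; k < m; k++) {
        vcm[0] += p[idx[k]].v[0];
        vcm[1] += p[idx[k]].v[1];
    }
    vcm[0] /= m;
    vcm[1] /= m;

    double c = cos(ALPHA), s = sign * sin(ALPHA);
    for (int k = 0; k < m; k++) {
        double dv0 = p[idx[k]].v[0] - vcm[0];   /* velocity relative to vcm */
        double dv1 = p[idx[k]].v[1] - vcm[1];
        p[idx[k]].v[0] = vcm[0] + c * dv0 - s * dv1;
        p[idx[k]].v[1] = vcm[1] + s * dv0 + c * dv1;
    }
}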
October 14-15, 2009 Slide 12
Stochastic Rotation Dynamics on Cell/B.E. (Annika Schiller, Godehard Sutmann)
Cell/B.E. implementation via CellSs: data locality and independence are achieved by storing the particle information cell-wise instead of using linked-cell lists (see the sketch below)
Free streaming step and multiparticle collision step are calculated on the SPEs in parallel
Current status: speedup could be optimized to nearly linear scaling by reducing the data transfer ⇒ communication is no longer dominating over computation; linear scaling shows that SRD is a suitable application for Cell/B.E.
Future work: parallelization of the cell filling step with CellSs in cooperation with BSC
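One way to picture the cell-wise storage mentioned above: each grid cell owns a small contiguous buffer of its particles, so that a cell and its neighbours can be shipped to an SPE Local Store as one DMA block instead of chasing linked-cell pointers. A hypothetical layout, not the actual CellSs data structures:

#define NX 32
#define NY 32
#define NZ 32
#define MAX_PER_CELL 64            /* assumed upper bound per cell */

/* All data of one grid cell kept contiguous so that the cell (and its
 * neighbours) can be moved to an SPE Local Store with a single DMA. */
typedef struct {
    int   n;                       /* particles currently in this cell */
    float x[MAX_PER_CELL][3];      /* positions  */
    float v[MAX_PER_CELL][3];      /* velocities */
} SrdCell;

/* The simulation box is then simply an array of independent cells,
 * instead of one global particle array plus linked-cell lists. */
static SrdCell grid[NX * NY * NZ];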
October 14-15, 2009 Slide 13
Protein Folding with ECEPP/3 on Cell/B.E. (Lauretta Schubert, Jan Meinke)
Protein Folding:
model
■ classic, all atom
■ fixed bond lengths and angles, free torsion angles
force field
■ ECEPP/3 (Empirical Conformational Energy for Peptides and Proteins)
■ includes Coulomb, van der Waals, hydrogen bonding and torsion terms with selection rules
algorithms
■ Parallel Tempering
■ Monte Carlo Simulation with Metropolis algorithm
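The Monte Carlo part is the usual Metropolis acceptance test applied to the ECEPP/3 energy. A minimal sketch, assuming a hypothetical energy routine ecepp3_energy() over the free torsion angles; the parallel-tempering temperature exchange and any optimizations are omitted:

#include <math.h>
#include <stdlib.h>

/* Hypothetical routine: total ECEPP/3 energy of a conformation given its
 * free torsion angles (bond lengths and angles are fixed in the model). */
double ecepp3_energy(const double *torsion, int n);

/* One Metropolis step at inverse temperature beta: perturb one randomly
 * chosen torsion angle and accept or reject the move. Returns 1 on accept.
 * (Recomputing the full energy each step is only for clarity.) */
int metropolis_step(double *torsion, int n, double beta, double max_move)
{
    int    i     = rand() % n;
    double old   = torsion[i];
    double e_old = ecepp3_energy(torsion, n);

    torsion[i] += max_move * (2.0 * rand() / (double)RAND_MAX - 1.0);
    double e_new = ecepp3_energy(torsion, n);

    /* Accept with probability min(1, exp(-beta * (E_new - E_old))). */
    if (e_new <= e_old ||
        rand() / (double)RAND_MAX < exp(-beta * (e_new - e_old)))
        return 1;

    torsion[i] = old;              /* rejected: restore the old angle */
    return 0;
}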
October 14-15, 2009 Slide 14
Protein Folding with ECEPP/3 on Cell/B.E. (Lauretta Schubert, Jan Meinke)
Implementation:
■ calculate all interactions first (ignoring selection rules)
■ subtract non-interacting pairs → reduce memory use → avoid time-consuming branching (see the sketch after this list)
● current status:
■ linear speedup
■ power efficient
● further steps:
■ reduce amount of transfer data
■ switch of parallelisation method
■ double buffering
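The sketch referenced in the implementation bullet above: sum the pair energy over every atom pair without testing the selection rules inside the hot loop, then subtract the excluded pairs in a second, much shorter loop. Names and signatures are illustrative only:

/* pair_energy() stands for the Coulomb / van der Waals / hydrogen-bond
 * terms between atoms i and j (hypothetical helper). */
double pair_energy(int i, int j);

/* Sum over all pairs without branching on the selection rules inside the
 * hot loop, then subtract the (few) pairs the rules actually exclude. */
double energy_all_minus_excluded(int natoms,
                                 const int (*excluded)[2], int nexcluded)
{
    double e = 0.0;

    for (int i = 0; i < natoms; i++)          /* 1) all interactions */
        for (int j = i + 1; j < natoms; j++)
            e += pair_energy(i, j);

    for (int k = 0; k < nexcluded; k++)       /* 2) subtract non-interacting pairs */
        e -= pair_energy(excluded[k][0], excluded[k][1]);

    return e;
}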
October 14-15, 2009 Slide 15
HP LINPACK
HP LINPACK measured with MPI-1 under SDK-3.0
■ not all nodes on same SDK level
■ only up to 24 blades and 48 MPI processes
■ only ~40% of peak performance
Implementation of MPI-2 with SDK-3.1: performance degradation
Analysis with VampirTrace for small problems
October 14-15, 2009 Slide 16
Performance of HPL on JUICEnext with MPI 1
October 14-15, 2009 Slide 17
Analysis of HPL with VampirTrace
October 14-15, 2009 Slide 18
Evaluation of BLAS/LAPACK Library
Easy way to use Cell SPEs
Good scaling for DGEMM; DGEMV is memory bound
October 14-15, 2009 Slide 19
QPACE (QCD Parallel Computing on the Cell)
Collaborative development effort
■ Main funding: DFG (SFB TR 55)
■ Main partners: Universities of Regensburg and Wuppertal, University of Ferrara, DESY-Zeuthen, and the IBM development lab in Böblingen
Installed systems in Jülich and Wuppertal
Main features:
■ PowerXCell 8i processor
■ Custom FPGA-based network processor (3D-Torus)
■ SPE-centric nearest neighbour communication
■ Communication from Local Store to Local Store
October 14-15, 2009 Slide 20
JUPACE – QPACE System at Jülich
4 QPACE racks, each with:
■ 256 node cards
■ 26 TFLOPS peak
■ 1 TB memory
■ 35 kW power consumption
Cooltrans rack
Front-end system rack
■ login/master node
■ 3 TB Lustre filesystem
■ 2 Meta Data Servers
■ 4 Object Storage Servers
October 14-15, 2009 Slide 21
The PRACE Project
EU approved the PRACE Preparatory Phase Project (Grant: INFSO-RI-211528)
• 16 partners from 14 countries
• Project duration: January 2008 – December 2009
• Project budget: 20 M€, EC funding: 10 M€
• Kickoff: Jülich, January 29-30, 2008
October 14-15, 2009 Slide 22
PRACE Objectives
Provide world-class systems for world-class science
Create a single European entity
Deploy 3 – 5 systems of the highest performance level (tier-0)
Ensure diversity of architectures
Provide support and training
October 14-15, 2009 Slide 23
PRACE Work Packages
WP1 Management
WP2 Organizational concept
WP3 Dissemination, outreach and training
WP4 Distributed computing
WP5 Deployment of prototype systems
WP6 Software enabling for prototype systems
WP7 Petaflop/s systems for 2009/2010
WP8 Future petaflop/s technologies
October 14-15, 2009 Slide 24
Future Petaflop/s Computer Technologies beyond 2010
Assessment and evaluation of emerging multi-petascale technology following the requirements of HPC users
Implementation of a strategy that guarantees a continuous HPC technology evaluation and system evolution within the PRACE Research Infrastructure
Fostering the development of components for future multi-petascale production systems in cooperation with European and international HPC industry
Creation of STRATOS - PRACE advisory group for Strategic Technologies
October 14-15, 2009 Slide 25
PRACE WP8 Prototypes
Sites / Hardware & Software / Porting effort:

CEA "GPU/CAPS"
  Hardware/Software: 1U Tesla Server T1070 (CUDA, CAPS, DDT), Intel Harpertown nodes
  Porting effort: "Evaluate GPU accelerators and GPGPU programming models and middleware." (e.g., pollutant migration code (ray tracing algorithm) to CUDA and HMPP)

CINES-LRZ "LRB/CS"
  Hardware/Software: Hybrid SGI ICE2/UV/Nehalem-EP & Nehalem-EX/ClearSpeed/Larrabee
  Porting effort: Gadget, SPECFEM3D_GLOBE, RaXml, Rinf, RandomAccess, ApexMap, Intel MPI BM

CSCS "UPC/CAF"
  Hardware/Software: Prototype PGAS language compilers (CAF + UPC for Cray XT systems)
  Porting effort: "The applications chosen for this analysis will include some of those already selected as benchmark codes"

EPCC "FPGA"
  Hardware/Software: Maxwell – FPGA prototype (VHDL support & consultancy + software licenses (e.g., Mitrion-C))
  Porting effort: "We wish to port several of the PRACE benchmark codes to the system. The codes will be chosen based on their suitability for execution on such a system."
October 14-15, 2009 Slide 26
PRACE WP8 Prototypes (cont'd)
Sites / Hardware & Software / Porting effort:

FZJ (BSC) "Cell & FPGA interconnect"
  Hardware/Software: eQPACE (PowerXCell cluster with special network processor)
  Porting effort: Extend FPGA-based interconnect beyond QCD applications.

LRZ "RapidMind"
  Hardware/Software: RapidMind (stream processing programming paradigm) on x86, GPGPU, Cell
  Porting effort: ApexMap, Multigrid, FZJ (QCD), CINECA (linear algebra kernels involved in solvers for ordinary differential equations), SNIC

NCF "ClearSpeed"
  Hardware/Software: ClearSpeed CATS 700 units
  Porting effort: Astronomical many-body simulation, iterative sparse solvers with preconditioning, finite element code, cryomicrotome image analysis

CINECA
  Hardware/Software: I/O Subsystem (SSD, Lustre, pNFS)
  Porting effort: –
October 14-15, 2009 Slide 27
QPACE-green / eQPACE
Highlight activities towards energy-efficient HPC systems
Prototype of leadership class system defined in PRACE WP8
Top500 / Green500 submission planned for SC09
HPL communication requirements differ from QCD
Make system suitable for a wider class of applications
Study of HPC requirements on QPACE
Extensions to network processor protocol necessary
Cooperation between IBM and JSC
October 14-15, 2009 Slide 28
QPACE Torus Network Processor
October 14-15, 2009 Slide 29
TNW Configuration: Red=Z-Direction, PHY=2,5
Network reconfiguration possible by selecting primary and secondary interface
October 14-15, 2009 Slide 30
TNW Configuration: Green=Y-Direction, PHY=0,3
1 backplane (32 node cards): X × Y × Z = 1 × (1,2,4) × (1,2,4,8)
1 rack (256 node cards): X × Y × Z = (1,2) × (1,2,4,8,16) × (1,2,4,8)
October 14-15, 2009 Slide 31
TNW-Setup: Ring of 32 Nodecards in a Backplane
Example: X x Y x Z = 1 x 4 x 8, ring of 32 nodes of backplane 6
tnwsetup 6:0 ... 6:7 6:15 … 6:8 6:24 … 6:31 6:23 .. 6:16 –action start
October 14-15, 2009 Slide 32
Torus Network Ping-Pong (PPE-Side)
October 14-15, 2009 Slide 33
Torus Network Ping-Pong (SPE-Side)
October 14-15, 2009 Slide 34
JUPACE: 1st Test Application (Wave Equation)
Motion of a swinging string
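For reference, the test application solves the textbook 1D wave equation for a string fixed at both ends, and the PPE- and SPE-centric codes on the next slides iterate the standard explicit finite-difference update (a generic scheme stated here for orientation, not copied from the slides):

\[
\frac{\partial^2 u}{\partial t^2} = c^2 \frac{\partial^2 u}{\partial x^2},
\qquad u(0,t) = u(L,t) = 0,
\]
\[
u_i^{\,n+1} = 2u_i^{\,n} - u_i^{\,n-1}
  + \left(\frac{c\,\Delta t}{\Delta x}\right)^{2}
    \left(u_{i+1}^{\,n} - 2u_i^{\,n} + u_{i-1}^{\,n}\right),
\]
where $u_i^{\,n}$ approximates the string displacement at grid point $i$ and time step $n$.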
October 14-15, 2009 Slide 35
Wave Equation on JUICEnext: PPE-Centric
for (t=2*dt; t<=t_end; t+=dt) {
    ++iter;
    // set halo points
    halo[1][0] = lblocal;
    halo[0][NUM_SPES+1] = rblocal;
    for (num=0; num<NUM_SPES; num++) {
        bstart = num*SPE_BUFF_SIZE;
        halo[0][num+1] = u[bstart];
        halo[1][num+1] = u[bstart+SPE_N-1];
    }
    // send message to SPEs to synchronise each iteration start
    for (num=0; num<NUM_SPES; num++) {
        spe_in_mbox_write(data[num].spe_ctx, (uint32_t *)&iter, 1,
                          SPE_MBOX_ALL_BLOCKING);
    }
    // ... now SPE performs iteration step
    // read message from SPEs to synchronise each iteration end
    for (num=0; num<NUM_SPES; num++) {
        while (!spe_out_mbox_status(data[num].spe_ctx));
        spe_out_mbox_read(data[num].spe_ctx, (uint32_t *)&mbx[num], 1);
    }

    // exchange MPI boundary values
    if (C_size > 1) {
        /* send to the right */
        if (C_rank != 0)
            MPI_Recv(&lblocal, 1, MPI_FLOAT, C_rank-1, right_copy,
                     MPI_COMM_WORLD, &status);
        if (C_rank != C_size-1)
            MPI_Send((void *)&u[BUFF_SIZE-SPE_BUFF_SIZE + SPE_N-1], 1, MPI_FLOAT,
                     C_rank+1, right_copy, MPI_COMM_WORLD);
        /* send to the left */
        if (C_rank != C_size-1)
            MPI_Recv(&rblocal, 1, MPI_FLOAT, C_rank+1, left_copy,
                     MPI_COMM_WORLD, &status);
        if (C_rank != 0)
            MPI_Send((void *)&u[0], 1, MPI_FLOAT, C_rank-1, left_copy,
                     MPI_COMM_WORLD);
    }
}
October 14-15, 2009 Slide 36
Wave Equation on JUPACE: SPE-Centric
tnw_credit_base(zminus, chid);
tnw_notify_base(zminus, chid, notify_buf);
tnw_credit_base(zplus, chid);
tnw_notify_base(zplus, chid, notify_buf);
do {
    iter = spu_read_in_mbox();
    if (iter == ITER_STOP) break;
    // get halo values
    mfc_get(halo, ctx.ea_halo, 32*sizeof(float), tag_id, 0, 0);
    waitag(tag_id);
    // perform iteration
    string(u_l, u_old_l, u_new_l, SPE_N, eps,
           halo[1][myid], halo[0][myid+2], myid);
    u_temp = u_old_l; u_old_l = u_l; u_l = u_new_l; u_new_l = u_temp;
    // put output buffer u
    mfc_put(u_l, ctx.ea_u, (SPE_BUFF_SIZE)*sizeof(float), tag_id, 0, 0);
    waitag(tag_id);
    // inform PPE that u values are up-to-date
    spu_write_out_mbox(myid);

    if (myid == NUM_SPES-1)
        tnw_put(zplus, 0, u_l, 0, MSGSIZE, 5);
    if (myid == 0) {
        // TNW: send to the right
        tnw_credit(zminus, chid, (uint)recv_buf0, MSGSIZE, 0);
        tnw_wait_recv(notify_buf, 0);
        if (C_rank != 0) halo[1][0] = recv_buf0[SPE_N-1];
        // send to the left
        tnw_credit(zplus, chid, (uint)recv_buf0, MSGSIZE, 0);
        tnw_put(zminus, chid, u_l, 0, MSGSIZE, 5);
        tnw_wait_recv(notify_buf, 0);
        if (C_rank != C_size-1) halo[0][NUM_SPES+1] = recv_buf0[0];
        // put halo buffer
        mfc_put(halo, ctx.ea_halo, 32*sizeof(float), tag_id, 0, 0);
        waitag(tag_id);
    }
    // inform PPE that iteration is done
    spu_write_out_mbox(myid);
} while (1);
October 14-15, 2009 Slide 37
QPACE-green / eQPACE
First modifications of FPGA done (F. Schifano, B.Krill)
■ reduce FPGA resource consumption
■ DMA controller for main-memory to main-memory transfers
October 14-15, 2009 Slide 38
High Performance LINPACK
October 14-15, 2009 Slide 39
QPACE-green / eQPACE
HPL for QPACE (S. Rinke, H. Böttiger): support a performance-critical subset of MPI
Open MPI BTL and COLL components
MPI uses torus nearest-neighbour communication where possible, Ethernet otherwise (see the sketch after the list below)
MPI communication operations in HPL (S.Wunderlich)
MPI_Send
MPI_Recv
MPI_Allgatherv
MPI_Scatterv
MPI_Allreduce
MPI_Bcast
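For orientation, nearest-neighbour exchange on a 3D torus as mentioned above can be expressed with standard MPI-1 calls alone; the sketch below is such a generic example (grid dimensions and payload are placeholders) and is not the eQPACE Open MPI component itself:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Map the processes onto a periodic 3D torus (placeholder 2 x 4 x 8,
     * i.e. 64 processes; QPACE partitions are configured differently). */
    int dims[3]    = { 2, 4, 8 };
    int periods[3] = { 1, 1, 1 };
    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &torus);

    /* Ranks of the two neighbours along the z direction (dimension 2). */
    int zminus, zplus;
    MPI_Cart_shift(torus, 2, 1, &zminus, &zplus);

    /* Exchange one boundary value with both z neighbours. */
    double send = 1.0, from_minus, from_plus;
    MPI_Status st;
    MPI_Sendrecv(&send, 1, MPI_DOUBLE, zplus, 0,
                 &from_minus, 1, MPI_DOUBLE, zminus, 0, torus, &st);
    MPI_Sendrecv(&send, 1, MPI_DOUBLE, zminus, 1,
                 &from_plus, 1, MPI_DOUBLE, zplus, 1, torus, &st);

    MPI_Finalize();
    return 0;
}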
October 14-15, 2009 Slide 40
Open MPI Modular Component Architecture (MCA)
Organized in layers
Frameworks, components, modules
Architecture allows development of components independently
New Torus Network device requires:
■ Byte Transfer Layer module
■ Short messages (eager protocol)
■ Long messages (rendez-vous protocol)
■ Collective component
October 14-15, 2009 Slide 41
QPACE-green / eQPACE
Ping-pong torus benchmark for main-memory to main-memory transfers (a minimal MPI sketch follows below)
HPL based on QS22 patch has been run successfully on QPACE architecture using torus network nearest-neighbour whenever possible
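As noted above, such a ping-pong measurement boils down to a handful of MPI calls; a minimal sketch between ranks 0 and 1 (message size and repetition count are arbitrary placeholders, and the benchmark actually used on QPACE may differ):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NBYTES 1048576      /* message size (placeholder)        */
#define NREPS  100          /* number of ping-pong round trips   */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(NBYTES);
    MPI_Status st;

    double t0 = MPI_Wtime();
    for (int i = 0; i < NREPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = (MPI_Wtime() - t0) / (2.0 * NREPS);   /* one-way time */

    if (rank == 0)
        printf("one-way time %g s, bandwidth %g MB/s\n",
               t, NBYTES / t / 1.0e6);

    free(buf);
    MPI_Finalize();
    return 0;
}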
October 14-15, 2009 Slide 42
Planned Activities
Porting applications from JUICEnext to QPACE
Stochastic Rotation Dynamics (simulation of fluids and flows)
ECEPP/3 force field (protein folding)
Multigrid method
Extension of FPGA capabilities for better application support
DMA engine
Routing feature
October 14-15, 2009 Slide 43
JuGiPSY: Jülich's GPU System
NVIDIA Tesla S1070
4 x GPU / 1U blade, 1.45 GHz
4 x 4 GB DDR3 memory
4 x 102 GB/s memory bandwidth
4 TFLOPS single-precision
345 GFLOPS double-precision
700 W
Host System (2x)
Xeon E5430 @ 2.66 GHz quad-core
32 GB Memory
1 TB disk
Host Interface cards
2x PCIe 16x
(Transtec 1000R S1070)
October 14-15, 2009 Slide 44
Tesla S1070 Features
A multiprocessor (Thread Processing Array, TPA) includes:
8 processor cores (floating point / integer unit; move, compare, logic, branch unit)
1 double-precision unit (IEEE 754 floating point)
Hardware driver information:
Product Name: NVIDIA Tesla S1070-400
GPU 0: Tesla C1060, GPU 1: Tesla C1060
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 4294705152 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512

Bandwidth (transfer size 33554432 bytes):
Host to Device:   2157.0 MB/s
Device to Host:   1886.4 MB/s
Device to Device: 73458.1 MB/s
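The driver information above can be queried programmatically through the CUDA runtime API; a small sketch (the fields printed are a subset of cudaDeviceProp, and the derived core count assumes 8 cores per multiprocessor as on the Tesla C1060):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    for (int d = 0; d < ndev; d++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);

        printf("GPU %d: %s\n", d, prop.name);
        printf("  CUDA capability:        %d.%d\n", prop.major, prop.minor);
        printf("  Global memory:          %lu bytes\n",
               (unsigned long)prop.totalGlobalMem);
        printf("  Multiprocessors:        %d (%d cores)\n",
               prop.multiProcessorCount, 8 * prop.multiProcessorCount);
        printf("  Shared memory / block:  %lu bytes\n",
               (unsigned long)prop.sharedMemPerBlock);
        printf("  Registers / block:      %d\n", prop.regsPerBlock);
        printf("  Warp size:              %d\n", prop.warpSize);
        printf("  Max threads / block:    %d\n", prop.maxThreadsPerBlock);
    }
    return 0;
}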
October 14-15, 2009 Slide 45
CUDA Example (Matrix Multiply): Host Function

// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the device multiplication function
__global__ void Muld(float*, float*, int, int, float*);

// Host multiplication function: Compute C = A * B
// hA is the height of A, wA is the width of A, wB is the width of B
void Mul(const float* A, const float* B, int hA, int wA, int wB, float* C)
{
    int size;

    // Load A and B to the device
    float* Ad;
    size = hA * wA * sizeof(float);
    cudaMalloc((void**)&Ad, size);
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
    float* Bd;
    size = wA * wB * sizeof(float);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

    // Allocate C on the device
    float* Cd;
    size = hA * wB * sizeof(float);
    cudaMalloc((void**)&Cd, size);

    // Compute the execution configuration assuming
    // the matrix dimensions are multiples of BLOCK_SIZE
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

    // Launch the device computation
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);

    // Read C from the device
    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
}
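Calling the host function above is then an ordinary CPU-side call; a hypothetical usage sketch with matrix dimensions that are multiples of BLOCK_SIZE (sizes and initial values are arbitrary):

#include <stdlib.h>

int main(void)
{
    const int hA = 512, wA = 512, wB = 512;          /* multiples of BLOCK_SIZE */

    float *A = (float*)malloc(hA * wA * sizeof(float));
    float *B = (float*)malloc(wA * wB * sizeof(float));
    float *C = (float*)malloc(hA * wB * sizeof(float));

    for (int i = 0; i < hA * wA; i++) A[i] = 1.0f;   /* arbitrary test data */
    for (int i = 0; i < wA * wB; i++) B[i] = 2.0f;

    Mul(A, B, hA, wA, wB, C);                        /* C = A * B on the GPU */

    free(A); free(B); free(C);
    return 0;
}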
October 14-15, 2009 Slide 46
CUDA Example (Matrix Multiply): Device Function
// Device multiplication function called by Mul()
// Compute C = A * B
// wA is the width of A
// wB is the width of B
__global__ void Muld(float* A, float* B, int wA, int wB, float* C)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    // The element of the block sub-matrix that is computed by the thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {

        // Shared memory for the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

        // Shared memory for the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from global memory to shared memory;
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure that the preceding computation is done
        // before loading two new sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to global memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
October 14-15, 2009 Slide 47
CUDA Example (Matrix Multiply): Performance
Matrix Dimensions N were chosen as multiples of block size (16)
Using shared memory instead of global memory strongly improves performance
Available BLAS library for CUDA (CUBLAS) is highly optimized and should be used whenever possible
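With CUBLAS, the same C = A * B collapses to a single SGEMM call. A sketch against the original (CUDA 2.x era) CUBLAS interface, assuming column-major storage as BLAS expects; error checking is omitted:

#include <cublas.h>     /* legacy CUBLAS interface (CUDA 2.x era) */

/* Compute C = A * B with SGEMM; A is m x k, B is k x n, C is m x n,
 * all stored column-major on the host. */
void mul_cublas(const float *A, const float *B, float *C,
                int m, int k, int n)
{
    float *dA, *dB, *dC;

    cublasInit();
    cublasAlloc(m * k, sizeof(float), (void **)&dA);
    cublasAlloc(k * n, sizeof(float), (void **)&dB);
    cublasAlloc(m * n, sizeof(float), (void **)&dC);

    cublasSetMatrix(m, k, sizeof(float), A, m, dA, m);
    cublasSetMatrix(k, n, sizeof(float), B, k, dB, k);

    /* 'n','n' = no transpose, alpha = 1, beta = 0 */
    cublasSgemm('n', 'n', m, n, k, 1.0f, dA, m, dB, k, 0.0f, dC, m);

    cublasGetMatrix(m, n, sizeof(float), dC, m, C, m);

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
}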
October 14-15, 2009 Slide 48
Outlook
GPU-based projects:
■ T. Neuhaus: Ising Model
■ J. Meinke: Protein Folding
■ O. Zimmermann: Support Vector Machine
■ L. Westphal: Stochastic Rotation Dynamics (SRD), simulation of fluids and flows (CUDA, OpenCL)
■ J. Kreutz: Matrix Multiplication Performance; Black-Scholes Model (assessment of stock prices)
CELL-based projects:
■ S. Rinke: OpenMPI subset for eQPACE
■ J. Koutsou: QCD
October 14-15, 2009 Slide 49
Further Information
http://www.fz-juelich.de/jsc/juice
http://www.physik.uni-regensburg.de/sfbtr55
http://en.wikipedia.org/wiki/QPACE
People involved, contacts:
■ [email protected]
Appendix
October 14-15, 2009 Slide 51
Cell Architecture
Element Interconnect Bus (EIB)
■ 16-byte ring buses, 96 bytes/cycle in total
■ up to 128 open requests
Memory Interface Controller (MIC)
■ Access to memory: 25.6 GB/s
■ max. 512 MB per CBE processor
Bus Interface Controller (BIC)
■ I/O or dual CBE interconnect
■ 12 FlexIO links, each 6.4 GB/s
October 14-15, 2009 Slide 52
Cell Cluster
35 QS22 blades @ 3.2 GHz
■ 2 Cell processors (CBE)
■ 8 GB Memory
4x InfiniBand adapter (SDR)
Frontend: x86 compatible
Interconnect switch: IB (Cisco)
Peak performance:
■ 1 blade: 217.6 GFLOPS peak (64-bit float)
■ 35 blades: 7 TFLOPS peak (64-bit float)
October 14-15, 2009 Slide 53
Programming Model (1st Generation Cell Processor)
large ratio between SP and DP speed
e.g. the first implementation of the HPL benchmark on Cell uses mixed precision (J. Kurzak, J. Dongarra)
■ compute-intensive operations in SP
■ necessary iterative refinement in DP
■ 1.5 times memory consumption
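The mixed-precision approach is classic iterative refinement: factorize and solve in fast single precision, then improve the solution with residuals computed in double precision. A compact sketch, with hypothetical single-precision helpers lu_factor_sp()/lu_solve_sp() standing in for SGETRF/SGETRS:

#include <stdlib.h>

/* Hypothetical single-precision helpers (e.g. wrappers around SGETRF /
 * SGETRS): factorize A in place and solve with the stored factors. */
void lu_factor_sp(float *A, int n, int *ipiv);
void lu_solve_sp(const float *A, const int *ipiv, int n, float *x);

/* Solve A x = b to double-precision accuracy while doing the expensive
 * O(n^3) factorization and the triangular solves in single precision. */
void mixed_precision_solve(const double *A, const double *b,
                           double *x, int n, int iters)
{
    float *As   = malloc((size_t)n * n * sizeof(float));
    int   *ipiv = malloc(n * sizeof(int));
    float *corr = malloc(n * sizeof(float));

    for (int i = 0; i < n * n; i++) As[i] = (float)A[i];  /* SP copy of A */
    lu_factor_sp(As, n, ipiv);                            /* SP LU        */

    for (int i = 0; i < n; i++) x[i] = 0.0;

    for (int it = 0; it < iters; it++) {
        /* residual r = b - A x, accumulated in double precision */
        for (int i = 0; i < n; i++) {
            double r = b[i];
            for (int j = 0; j < n; j++) r -= A[i * n + j] * x[j];
            corr[i] = (float)r;
        }
        lu_solve_sp(As, ipiv, n, corr);               /* SP correction solve */
        for (int i = 0; i < n; i++) x[i] += corr[i];  /* DP update           */
    }

    free(As); free(ipiv); free(corr);
}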
October 14-15, 2009 Slide 54
DGEMV BLAS using SPEs <--> GotoBLAS
Matrix not transposed Matrix transposed
October 14-15, 2009 Slide 55
TNW Configuration: Blue=X-Direction, PHY=1,4
October 14-15, 2009 Slide 56
CUDA Example (Matrix Multiply): Performance
Matrix Dimensions N were chosen as multiples of block size (16)
Using shared memory instead of global memory strongly improves performance
Available BLAS library for CUDA (CUBLAS) is highly optimized and should be used whenever possible
October 14-15, 2009 Slide 57
Accelerator Performance Results
October 14-15, 2009 Slide 58
Development Time