DLR Braunschweig, 14th and 15th of October 2009, W. Homberg
Energy-Efficient Many-Core Systems at JSC
Symposium: CFD on Future Architectures – Many-Cores, GPUs, FPGAs, …
October 14-15, 2009 Slide 2
Outline
Top500 / Green500
JUICE
■ Cell BE / PowerXCell 8i
JUPACE
■ QPACE (SFB TR 55)
■ FPGA-based Torus Network
■ PRACE/WP8 Prototype eQPACE
JUGIPSY
■ GPU System NVIDIA Tesla S1070
Outlook
October 14-15, 2009 Slide 3
Jülich Supercomputing Centre
Jugene:
IBM BlueGene/P
294912 processors
1 Petaflops
2.3 MW
Juropa:
Bull, Sun
3288 procs / 26394 cores
308 Teraflops
1.5 MW
October 14-15, 2009 Slide 4
Top500 – June 2009 Top 10
October 14-15, 2009 Slide 5
Green500 – June 2009 Top 10
October 14-15, 2009 Slide 6
The JUICE (JUelich Initiative CEll cluster) Project
By 2006, usage of the Cell processor was gradually widening to scientific applications.
One of the first Cell-based systems was IBM's QS20 blade, equipped with two processors with 512 MB of memory each; external connectivity was provided by Gigabit Ethernet and an optional InfiniBand network interface.
At the beginning of 2007, a cluster of 12 QS20 blades was established at JSC, providing a single-precision floating-point peak performance of about 5 TFLOPS.
A performance of 4.5 TFLOPS (90% of peak) could be attained with a matrix multiplication using all nodes via the InfiniBand network.
The Cell Cluster Meeting 2007 in Jülich, with international attendance, was organized at JSC:
http://www.fz-juelich.de/jsc/juice/cell_cluster_meeting
October 14-15, 2009 Slide 7
JUICE: Activities at JSC
Lanczos Implementation for Hubbard Models (Andreas Dolfen, Erik Koch)
MPI on Cell (Norbert Eicker)
Benchmark Matrix-Matrix Multiplication (Inge Gutheil)
A Multigrid Method for the Solution of the Poisson Equation (Matthias Bolten)
CellSs and Triple-Matrix-Multiply (Liang Yang, Annika Schiller, Godehard Sutmann, B.Wylie)
A Fast Wavelet Based Implementation to Calculate Coulomb Potentials (Annika Schiller, Godehard Sutmann, Liang Yang)
Comparison of CellSs and native programming with a Jacobi solver and triple-matrix-multiply on Cell/B.E. (Liang Yang, Annika Schiller, Godehard Sutmann, Brian J. N. Wylie, Ralph Altenfeld, and Felix Wolf)
JUICE - Jülich Initiative Cell Cluster - Report 2007 (FZJ-JSC-IB-2007-13)
October 14-15, 2009 Slide 8
JUICE: External Partners
50 registered users from 30 sites
Master's Theses:
Porting the NestStep Run-time System to the CELL Broadband Engine, Daniel Johansson, Linköping 2007
A Fast Multigrid Solver for Molecular Dynamics on the Cell Broadband Engine, Daniel Ritter, Erlangen 2008
Über die Cell/B.E.-Architektur: Optionen zur Generierung von Programm-Traces (On the Cell/B.E. architecture: options for generating program traces), Daniel Hackenberg, Dresden 2008
Beschleunigung von Bildrekonstruktionsverfahren in der Positronen-Emissions-Tomographie unter Einsatz der Cell Broadband Engine (Acceleration of image reconstruction methods in positron emission tomography using the Cell Broadband Engine), Michael Sievers, Paderborn
October 14-15, 2009 Slide 9
JUICEnext
Hardware:
35 QS22 blades
70 PowerXCell 8i
7 TFLOPS peak
Main memory: 35 * 8 GB (aggregate 280 GB)
Infiniband 4x SDR
BladeCenter management / terminal server
ParaStation:
MPI stack, GridMonitor
Torque/Maui job management for interactive work and batch processing
October 14-15, 2009 Slide 10
Activities on JUICEnext
Stochastic Rotation Dynamics (SRD)
Energy Function for Protein Folding (ECEPP/3)
High Performance LINPACK (HPL) using ParaStation MPI
Evaluation of IBM BLAS/LAPACK Library
October 14-15, 2009 Slide 11
Stochastic Rotation Dynamics on Cell/B.E. (Annika Schiller, Godehard Sutmann)
Stochastic Rotation Dynamics (SRD):
also called Multiparticle Collision Dynamics (MPC)
is a particle-based mesoscale algorithm for the simulation of fluids and flows
solves the linearized Navier-Stokes equations
employs a discrete-time dynamics with continuous velocities and local multiparticle collisions
basic steps:
● Free Streaming Step: particles are moved without interaction with each other according to their velocities for the duration of one time step dt
● Cell Filling Step: a virtual grid is superimposed over the simulation box and the particles are assigned to the grid cells, the grid is shifted randomly at each time step
● Multiparticle Collision Step: particle velocities are rotated relative to the centre of mass velocity of the collision cell
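These three steps translate into very compact per-particle updates. The following is a minimal serial sketch in C of the streaming and collision steps (illustrative only, not the JSC/CellSs code; the 2D geometry, the fixed rotation angle and all names such as Particle and srd_collide_cell are assumptions):

/* Minimal 2D SRD sketch (illustrative only). */
#include <math.h>

#define ALPHA (M_PI / 2.0)               /* fixed SRD rotation angle */

typedef struct { double x[2], v[2]; } Particle;

/* Free streaming: move particles ballistically for one time step dt. */
static void srd_stream(Particle *p, int n, double dt)
{
    for (int i = 0; i < n; i++) {
        p[i].x[0] += p[i].v[0] * dt;
        p[i].x[1] += p[i].v[1] * dt;
    }
}

/* Multiparticle collision in one (non-empty) grid cell: rotate the
 * velocities of the m particles listed in idx[] by +/-ALPHA around the
 * centre-of-mass velocity of the cell (sign chosen at random per cell). */
static void srd_collide_cell(Particle *p, const int *idx, int m, int sign)
{
    double vcm[2] = { 0.0, 0.0 };
    for (int k = 0; k < m; k++) {
        vcm[0] += p[idx[k]].v[0];
        vcm[1] += p[idx[k]].v[1];
    }
    vcm[0] /= m;
    vcm[1] /= m;

    double c = cos(ALPHA), s = sign * sin(ALPHA);
    for (int k = 0; k < m; k++) {
        double dv0 = p[idx[k]].v[0] - vcm[0];   /* velocity relative to vcm */
        double dv1 = p[idx[k]].v[1] - vcm[1];
        p[idx[k]].v[0] = vcm[0] + c * dv0 - s * dv1;
        p[idx[k]].v[1] = vcm[1] + s * dv0 + c * dv1;
    }
}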
October 14-15, 2009 Slide 12
Stochastic Rotation Dynamics on Cell/B.E. (Annika Schiller, Godehard Sutmann)
Cell/B.E. implementation via CellSs: data locality and independence are achieved by storing the particle information cell-wise instead of using linked-cell lists (see the sketch below)
Free streaming step and multiparticle collision step are calculated on the SPEs in parallel
Current status: speedup could be optimized to nearly linear scaling by reducing the data transfer ⇒ communication is no longer dominating over computation; linear scaling shows that SRD is a suitable application for Cell/B.E.
Future work: parallelization of the cell filling step with CellSs in cooperation with BSC
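One way to picture the cell-wise storage mentioned above: each grid cell owns a small contiguous buffer of its particles, so that a cell and its neighbours can be shipped to an SPE Local Store as one DMA block instead of chasing linked-cell pointers. A hypothetical layout, not the actual CellSs data structures:

#define NX 32
#define NY 32
#define NZ 32
#define MAX_PER_CELL 64            /* assumed upper bound per cell */

/* All data of one grid cell kept contiguous so that the cell (and its
 * neighbours) can be moved to an SPE Local Store with a single DMA. */
typedef struct {
    int   n;                       /* particles currently in this cell */
    float x[MAX_PER_CELL][3];      /* positions  */
    float v[MAX_PER_CELL][3];      /* velocities */
} SrdCell;

/* The simulation box is then simply an array of independent cells,
 * instead of one global particle array plus linked-cell lists. */
static SrdCell grid[NX * NY * NZ];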
October 14-15, 2009 Slide 13
Protein Folding with ECEPP/3 on Cell/B.E. (Lauretta Schubert, Jan Meinke)
Protein Folding:
model
■ classic, all atom
■ fixed bond lengths and angles, free torsion angles
force field
■ ECEPP/3 (Empirical Conformational Energy for Peptides and Proteins)
■ includes Coulomb, van der Waals, hydrogen bonding and torsion terms with selection rules
algorithms
■ Parallel Tempering
■ Monte Carlo Simulation with Metropolis algorithm
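The Monte Carlo part is the usual Metropolis acceptance test applied to the ECEPP/3 energy. A minimal sketch, assuming a hypothetical energy routine ecepp3_energy() over the free torsion angles; the parallel-tempering temperature exchange and any optimizations are omitted:

#include <math.h>
#include <stdlib.h>

/* Hypothetical routine: total ECEPP/3 energy of a conformation given its
 * free torsion angles (bond lengths and angles are fixed in the model). */
double ecepp3_energy(const double *torsion, int n);

/* One Metropolis step at inverse temperature beta: perturb one randomly
 * chosen torsion angle and accept or reject the move. Returns 1 on accept.
 * (Recomputing the full energy each step is only for clarity.) */
int metropolis_step(double *torsion, int n, double beta, double max_move)
{
    int    i     = rand() % n;
    double old   = torsion[i];
    double e_old = ecepp3_energy(torsion, n);

    torsion[i] += max_move * (2.0 * rand() / (double)RAND_MAX - 1.0);
    double e_new = ecepp3_energy(torsion, n);

    /* Accept with probability min(1, exp(-beta * (E_new - E_old))). */
    if (e_new <= e_old ||
        rand() / (double)RAND_MAX < exp(-beta * (e_new - e_old)))
        return 1;

    torsion[i] = old;              /* rejected: restore the old angle */
    return 0;
}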
October 14-15, 2009 Slide 14
Protein Folding with ECEPP/3 on Cell/B.E. (Lauretta Schubert, Jan Meinke)
Implementation:
■ calculate all interactions first (ignoring selection rules)
■ subtract non-interacting pairs → reduce memory use → avoid time-consuming branching (see the sketch after this list)
● current status:
■ linear speedup
■ power efficient
● further steps:
■ reduce amount of transfer data
■ switch of parallelisation method
■ double buffering
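The sketch referenced in the implementation bullet above: sum the pair energy over every atom pair without testing the selection rules inside the hot loop, then subtract the excluded pairs in a second, much shorter loop. Names and signatures are illustrative only:

/* pair_energy() stands for the Coulomb / van der Waals / hydrogen-bond
 * terms between atoms i and j (hypothetical helper). */
double pair_energy(int i, int j);

/* Sum over all pairs without branching on the selection rules inside the
 * hot loop, then subtract the (few) pairs the rules actually exclude. */
double energy_all_minus_excluded(int natoms,
                                 const int (*excluded)[2], int nexcluded)
{
    double e = 0.0;

    for (int i = 0; i < natoms; i++)          /* 1) all interactions */
        for (int j = i + 1; j < natoms; j++)
            e += pair_energy(i, j);

    for (int k = 0; k < nexcluded; k++)       /* 2) subtract non-interacting pairs */
        e -= pair_energy(excluded[k][0], excluded[k][1]);

    return e;
}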
October 14-15, 2009 Slide 15
HP LINPACK
HP LINPACK measured with MPI-1 under SDK-3.0
■ not all nodes on same SDK level
■ only up to 24 blades and 48 MPI processes
■ only ~40% of peak performance
Implementation of MPI-2 with SDK-3.1: performance degradation
Analysis with VampirTrace for small problems
October 14-15, 2009 Slide 16
Performance of HPL on JUICEnext with MPI 1
October 14-15, 2009 Slide 17
Analysis of HPL with VampirTrace
October 14-15, 2009 Slide 18
Evaluation of BLAS/LAPACK Library
Easy way to use Cell SPEs
Good scaling for DGEMM; DGEMV is memory bound
October 14-15, 2009 Slide 19
QPACE (QCD Parallel Computing on the Cell)
Collaborative development effort
■ Main funding: DFG (SFB TR 55)
■ Main partners: Universities of Regensburg and Wuppertal, University of Ferrara, DESY-Zeuthen, and the IBM development lab in Böblingen
Installed systems in Jülich and Wuppertal
Main features:
■ PowerXCell 8i processor
■ Custom FPGA-based network processor (3D-Torus)
■ SPE-centric nearest neighbour communication
■ Communication from Local Store to Local Store
October 14-15, 2009 Slide 20
JUPACE – QPACE System at Jülich
4 QPACE racks, each with:
■ 256 node cards
■ 26 TFLOPS peak
■ 1 TB memory
■ 35 kW power consumption
Cooltrans rack
Front-end system rack
■ login/master node
■ 3 TB Lustre filesystem
■ 2 Meta Data Servers
■ 4 Object Storage Servers
October 14-15, 2009 Slide 21
The PRACE Project
EU approved the PRACE Preparatory Phase Project (Grant: INFSO-RI-211528)
• 16 partners from 14 countries
• Project duration: January 2008 – December 2009
• Project budget: 20 M€, EC funding: 10 M€
• Kickoff: Jülich, January 29-30, 2008
October 14-15, 2009 Slide 22
PRACE Objectives
Provide world-class systems for world-class science
Create a single European entity
Deploy 3 – 5 systems of the highest performance level (tier-0)
Ensure diversity of architectures
Provide support and training
October 14-15, 2009 Slide 23
PRACE Work Packages
WP1 Management
WP2 Organizational concept
WP3 Dissemination, outreach and training
WP4 Distributed computing
WP5 Deployment of prototype systems
WP6 Software enabling for prototype systems
WP7 Petaflop/s systems for 2009/2010
WP8 Future petaflop/s technologies
October 14-15, 2009 Slide 24
Future Petaflop/s Computer Technologies beyond 2010
Assessment and evaluation of emerging multi-petascale technology following the requirements of HPC users
Implementation of a strategy that guarantees a continuous HPC technology evaluation and system evolution within the PRACE Research Infrastructure
Fostering the development of components for future multi-petascale production systems in cooperation with European and international HPC industry
Creation of STRATOS - PRACE advisory group for Strategic Technologies
October 14-15, 2009 Slide 25
PRACE WP8 Prototypes
Sites / Hardware & Software / Porting effort:

CEA "GPU/CAPS"
  Hardware/Software: 1U Tesla Server T1070 (CUDA, CAPS, DDT), Intel Harpertown nodes
  Porting effort: "Evaluate GPU accelerators and GPGPU programming models and middleware." (e.g., pollutant migration code (ray tracing algorithm) to CUDA and HMPP)

CINES-LRZ "LRB/CS"
  Hardware/Software: Hybrid SGI ICE2/UV/Nehalem-EP & Nehalem-EX/ClearSpeed/Larrabee
  Porting effort: Gadget, SPECFEM3D_GLOBE, RaXml, Rinf, RandomAccess, ApexMap, Intel MPI BM

CSCS "UPC/CAF"
  Hardware/Software: Prototype PGAS language compilers (CAF + UPC for Cray XT systems)
  Porting effort: "The applications chosen for this analysis will include some of those already selected as benchmark codes"

EPCC "FPGA"
  Hardware/Software: Maxwell – FPGA prototype (VHDL support & consultancy + software licenses (e.g., Mitrion-C))
  Porting effort: "We wish to port several of the PRACE benchmark codes to the system. The codes will be chosen based on their suitability for execution on such a system."
October 14-15, 2009 Slide 26
PRACE WP8 Prototypes (cont'd)
Sites / Hardware & Software / Porting effort:

FZJ (BSC) "Cell & FPGA interconnect"
  Hardware/Software: eQPACE (PowerXCell cluster with special network processor)
  Porting effort: Extend FPGA-based interconnect beyond QCD applications.

LRZ "RapidMind"
  Hardware/Software: RapidMind (stream processing programming paradigm) on x86, GPGPU, Cell
  Porting effort: ApexMap, Multigrid, FZJ (QCD), CINECA (linear algebra kernels involved in solvers for ordinary differential equations), SNIC

NCF "ClearSpeed"
  Hardware/Software: ClearSpeed CATS 700 units
  Porting effort: Astronomical many-body simulation, iterative sparse solvers with preconditioning, finite element code, cryomicrotome image analysis

CINECA
  Hardware/Software: I/O Subsystem (SSD, Lustre, pNFS)
  Porting effort: –
October 14-15, 2009 Slide 27
QPACE-green / eQPACE
Highlight activities towards energy-efficient HPC systems
Prototype of leadership class system defined in PRACE WP8
Top500 / Green500 submission planned for SC09
HPL communication requirements differ from QCD
Make system suitable for a wider class of applications
Study of HPC requirements on QPACE
Extensions to network processor protocol necessary
Cooperation between IBM and JSC
October 14-15, 2009 Slide 28
QPACE Torus Network Processor
October 14-15, 2009 Slide 29
TNW Configuration: Red=Z-Direction, PHY=2,5
Network reconfiguration possible by selecting primary and secondary interface
October 14-15, 2009 Slide 30
TNW Configuration: Green=Y-Direction, PHY=0,3
1 backplane (32 node cards): X × Y × Z = 1 × (1,2,4) × (1,2,4,8)
1 rack (256 node cards): X × Y × Z = (1,2) × (1,2,4,8,16) × (1,2,4,8)
October 14-15, 2009 Slide 31
TNW-Setup: Ring of 32 Nodecards in a Backplane
Example: X x Y x Z = 1 x 4 x 8, ring of 32 nodes of backplane 6
tnwsetup 6:0 ... 6:7 6:15 … 6:8 6:24 … 6:31 6:23 .. 6:16 –action start
October 14-15, 2009 Slide 32
Torus Network Ping-Pong (PPE-Side)
October 14-15, 2009 Slide 33
Torus Network Ping-Pong (SPE-Side)
October 14-15, 2009 Slide 34
JUPACE: 1st Test Application (Wave Equation)
Motion of a swinging string
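For reference, the test application solves the textbook 1D wave equation for a string fixed at both ends, and the PPE- and SPE-centric codes on the next slides iterate the standard explicit finite-difference update (a generic scheme stated here for orientation, not copied from the slides):

\[
\frac{\partial^2 u}{\partial t^2} = c^2 \frac{\partial^2 u}{\partial x^2},
\qquad u(0,t) = u(L,t) = 0,
\]
\[
u_i^{\,n+1} = 2u_i^{\,n} - u_i^{\,n-1}
  + \left(\frac{c\,\Delta t}{\Delta x}\right)^{2}
    \left(u_{i+1}^{\,n} - 2u_i^{\,n} + u_{i-1}^{\,n}\right),
\]
where $u_i^{\,n}$ approximates the string displacement at grid point $i$ and time step $n$.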
October 14-15, 2009 Slide 35
Wave Equation on JUICEnext: PPE-Centric
for (t=2*dt; t<=t_end; t+=dt) {
    ++iter;
    // set halo points
    halo[1][0] = lblocal;
    halo[0][NUM_SPES+1] = rblocal;
    for (num=0; num<NUM_SPES; num++) {
        bstart = num*SPE_BUFF_SIZE;
        halo[0][num+1] = u[bstart];
        halo[1][num+1] = u[bstart+SPE_N-1];
    }
    // send message to SPEs to synchronise each iteration start
    for (num=0; num<NUM_SPES; num++) {
        spe_in_mbox_write(data[num].spe_ctx, (uint32_t *)&iter, 1,
                          SPE_MBOX_ALL_BLOCKING);
    }
    // ... now SPE performs iteration step
    // read message from SPEs to synchronise each iteration end
    for (num=0; num<NUM_SPES; num++) {
        while (!spe_out_mbox_status(data[num].spe_ctx));
        spe_out_mbox_read(data[num].spe_ctx, (uint32_t *)&mbx[num], 1);
    }

    // exchange MPI boundary values
    if (C_size > 1) {
        /* send to the right */
        if (C_rank != 0)
            MPI_Recv(&lblocal, 1, MPI_FLOAT, C_rank-1, right_copy,
                     MPI_COMM_WORLD, &status);
        if (C_rank != C_size-1)
            MPI_Send((void *)&u[BUFF_SIZE-SPE_BUFF_SIZE + SPE_N-1], 1, MPI_FLOAT,
                     C_rank+1, right_copy, MPI_COMM_WORLD);
        /* send to the left */
        if (C_rank != C_size-1)
            MPI_Recv(&rblocal, 1, MPI_FLOAT, C_rank+1, left_copy,
                     MPI_COMM_WORLD, &status);
        if (C_rank != 0)
            MPI_Send((void *)&u[0], 1, MPI_FLOAT, C_rank-1, left_copy,
                     MPI_COMM_WORLD);
    }
}
October 14-15, 2009 Slide 36
Wave Equation on JUPACE: SPE-Centric
tnw_credit_base(zminus, chid);
tnw_notify_base(zminus, chid, notify_buf);
tnw_credit_base(zplus, chid);
tnw_notify_base(zplus, chid, notify_buf);
do {
    iter = spu_read_in_mbox();
    if (iter == ITER_STOP) break;
    // get halo values
    mfc_get(halo, ctx.ea_halo, 32*sizeof(float), tag_id, 0, 0);
    waitag(tag_id);
    // perform iteration
    string(u_l, u_old_l, u_new_l, SPE_N, eps,
           halo[1][myid], halo[0][myid+2], myid);
    u_temp = u_old_l; u_old_l = u_l; u_l = u_new_l; u_new_l = u_temp;
    // put output buffer u
    mfc_put(u_l, ctx.ea_u, (SPE_BUFF_SIZE)*sizeof(float), tag_id, 0, 0);
    waitag(tag_id);
    // inform PPE that u values are up-to-date
    spu_write_out_mbox(myid);

    if (myid == NUM_SPES-1)
        tnw_put(zplus, 0, u_l, 0, MSGSIZE, 5);
    if (myid == 0) {
        // TNW: send to the right
        tnw_credit(zminus, chid, (uint)recv_buf0, MSGSIZE, 0);
        tnw_wait_recv(notify_buf, 0);
        if (C_rank != 0) halo[1][0] = recv_buf0[SPE_N-1];
        // send to the left
        tnw_credit(zplus, chid, (uint)recv_buf0, MSGSIZE, 0);
        tnw_put(zminus, chid, u_l, 0, MSGSIZE, 5);
        tnw_wait_recv(notify_buf, 0);
        if (C_rank != C_size-1) halo[0][NUM_SPES+1] = recv_buf0[0];
        // put halo buffer
        mfc_put(halo, ctx.ea_halo, 32*sizeof(float), tag_id, 0, 0);
        waitag(tag_id);
    }
    // inform PPE that iteration is done
    spu_write_out_mbox(myid);
} while (1);
October 14-15, 2009 Slide 37
QPACE-green / eQPACE
First modifications of FPGA done (F. Schifano, B.Krill)
■ reduce FPGA resource consumption
■ DMA controller for main-memory to main-memory transfers
October 14-15, 2009 Slide 38
High Performance LINPACK
October 14-15, 2009 Slide 39
QPACE-green / eQPACE
HPL for QPACE (S. Rinke, H. Böttiger): support a performance-critical subset of MPI
Open MPI BTL and COLL components
MPI uses torus nearest-neighbour communication where possible, Ethernet otherwise (see the sketch after the list below)
MPI communication operations in HPL (S.Wunderlich)
MPI_Send
MPI_Recv
MPI_Allgatherv
MPI_Scatterv
MPI_Allreduce
MPI_Bcast
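For orientation, nearest-neighbour exchange on a 3D torus as mentioned above can be expressed with standard MPI-1 calls alone; the sketch below is such a generic example (grid dimensions and payload are placeholders) and is not the eQPACE Open MPI component itself:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Map the processes onto a periodic 3D torus (placeholder 2 x 4 x 8,
     * i.e. 64 processes; QPACE partitions are configured differently). */
    int dims[3]    = { 2, 4, 8 };
    int periods[3] = { 1, 1, 1 };
    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &torus);

    /* Ranks of the two neighbours along the z direction (dimension 2). */
    int zminus, zplus;
    MPI_Cart_shift(torus, 2, 1, &zminus, &zplus);

    /* Exchange one boundary value with both z neighbours. */
    double send = 1.0, from_minus, from_plus;
    MPI_Status st;
    MPI_Sendrecv(&send, 1, MPI_DOUBLE, zplus, 0,
                 &from_minus, 1, MPI_DOUBLE, zminus, 0, torus, &st);
    MPI_Sendrecv(&send, 1, MPI_DOUBLE, zminus, 1,
                 &from_plus, 1, MPI_DOUBLE, zplus, 1, torus, &st);

    MPI_Finalize();
    return 0;
}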
October 14-15, 2009 Slide 40
Open MPI Modular Component Architecture (MCA)
Organized in layers
Frameworks, components, modules
Architecture allows development of components independently
New Torus Network device requires:
■ Byte Transfer Layer module
■ Short messages (eager protocol)
■ Long messages (rendez-vous protocol)
■ Collective component
October 14-15, 2009 Slide 41
QPACE-green / eQPACE
Ping-pong torus benchmark for main-memory to main-memory transfers (a minimal MPI sketch follows below)
HPL based on QS22 patch has been run successfully on QPACE architecture using torus network nearest-neighbour whenever possible
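As noted above, such a ping-pong measurement boils down to a handful of MPI calls; a minimal sketch between ranks 0 and 1 (message size and repetition count are arbitrary placeholders, and the benchmark actually used on QPACE may differ):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NBYTES 1048576      /* message size (placeholder)        */
#define NREPS  100          /* number of ping-pong round trips   */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(NBYTES);
    MPI_Status st;

    double t0 = MPI_Wtime();
    for (int i = 0; i < NREPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = (MPI_Wtime() - t0) / (2.0 * NREPS);   /* one-way time */

    if (rank == 0)
        printf("one-way time %g s, bandwidth %g MB/s\n",
               t, NBYTES / t / 1.0e6);

    free(buf);
    MPI_Finalize();
    return 0;
}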
October 14-15, 2009 Slide 42
Planned Activities
Porting applications from JUICEnext to QPACE
Stochastic Rotation Dynamics (simulation of fluids and flows)
ECEPP/3 force field (protein folding)
Multigrid method
Extension of FPGA capabilities for better application support
DMA engine
Routing feature
October 14-15, 2009 Slide 43
JuGiPSY: Jülich's GPU System
NVIDIA Tesla S1070
4 x GPU / 1U blade, 1.45 GHz
4 x 4 GB DDR3 memory
4 x 102 GB/s memory bandwidth
4 TFLOPS single-precision
345 GFLOPS double-precision
700 W
Host System (2x)
Xeon E5430 @ 2.66 GHz quad-core
32 GB Memory
1 TB disk
Host Interface cards
2x PCIe 16x
(Transtec 1000R S1070)
October 14-15, 2009 Slide 44
Tesla S1070 Features
A multiprocessor (Thread Processing Array, TPA) includes:
8 processor cores (floating point / integer unit; move, compare, logic, branch unit)
1 double-precision unit (IEEE 754 floating point)
Hardware driver information:
Product Name: NVIDIA Tesla S1070-400
GPU 0: Tesla C1060, GPU 1: Tesla C1060
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 4294705152 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512

Bandwidth (transfer size 33554432 bytes):
Host to Device:   2157.0 MB/s
Device to Host:   1886.4 MB/s
Device to Device: 73458.1 MB/s
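The driver information above can be queried programmatically through the CUDA runtime API; a small sketch (the fields printed are a subset of cudaDeviceProp, and the derived core count assumes 8 cores per multiprocessor as on the Tesla C1060):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    for (int d = 0; d < ndev; d++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);

        printf("GPU %d: %s\n", d, prop.name);
        printf("  CUDA capability:        %d.%d\n", prop.major, prop.minor);
        printf("  Global memory:          %lu bytes\n",
               (unsigned long)prop.totalGlobalMem);
        printf("  Multiprocessors:        %d (%d cores)\n",
               prop.multiProcessorCount, 8 * prop.multiProcessorCount);
        printf("  Shared memory / block:  %lu bytes\n",
               (unsigned long)prop.sharedMemPerBlock);
        printf("  Registers / block:      %d\n", prop.regsPerBlock);
        printf("  Warp size:              %d\n", prop.warpSize);
        printf("  Max threads / block:    %d\n", prop.maxThreadsPerBlock);
    }
    return 0;
}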
October 14-15, 2009 Slide 45
CUDA Example (Matrix Multiply): Host Function

// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the device multiplication function
__global__ void Muld(float*, float*, int, int, float*);

// Host multiplication function: Compute C = A * B
// hA is the height of A, wA is the width of A, wB is the width of B
void Mul(const float* A, const float* B, int hA, int wA, int wB, float* C)
{
    int size;

    // Load A and B to the device
    float* Ad;
    size = hA * wA * sizeof(float);
    cudaMalloc((void**)&Ad, size);
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
    float* Bd;
    size = wA * wB * sizeof(float);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

    // Allocate C on the device
    float* Cd;
    size = hA * wB * sizeof(float);
    cudaMalloc((void**)&Cd, size);

    // Compute the execution configuration assuming
    // the matrix dimensions are multiples of BLOCK_SIZE
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

    // Launch the device computation
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);

    // Read C from the device
    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
}
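Calling the host function above is then an ordinary CPU-side call; a hypothetical usage sketch with matrix dimensions that are multiples of BLOCK_SIZE (sizes and initial values are arbitrary):

#include <stdlib.h>

int main(void)
{
    const int hA = 512, wA = 512, wB = 512;          /* multiples of BLOCK_SIZE */

    float *A = (float*)malloc(hA * wA * sizeof(float));
    float *B = (float*)malloc(wA * wB * sizeof(float));
    float *C = (float*)malloc(hA * wB * sizeof(float));

    for (int i = 0; i < hA * wA; i++) A[i] = 1.0f;   /* arbitrary test data */
    for (int i = 0; i < wA * wB; i++) B[i] = 2.0f;

    Mul(A, B, hA, wA, wB, C);                        /* C = A * B on the GPU */

    free(A); free(B); free(C);
    return 0;
}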
October 14-15, 2009 Slide 46
CUDA Example (Matrix Multiply): Device Function
// Device multiplication function called by Mul()
// Compute C = A * B
// wA is the width of A
// wB is the width of B
__global__ void Muld(float* A, float* B, int wA, int wB, float* C)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    // The element of the block sub-matrix that is computed by the thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {

        // Shared memory for the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

        // Shared memory for the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from global memory to shared memory;
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure that the preceding computation is done
        // before loading two new sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to global memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
October 14-15, 2009 Slide 47
CUDA Example (Matrix Multiply): Performance
Matrix Dimensions N were chosen as multiples of block size (16)
Using shared memory instead of global memory strongly improves performance
Available BLAS library for CUDA (CUBLAS) is highly optimized and should be used whenever possible
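With CUBLAS, the same C = A * B collapses to a single SGEMM call. A sketch against the original (CUDA 2.x era) CUBLAS interface, assuming column-major storage as BLAS expects; error checking is omitted:

#include <cublas.h>     /* legacy CUBLAS interface (CUDA 2.x era) */

/* Compute C = A * B with SGEMM; A is m x k, B is k x n, C is m x n,
 * all stored column-major on the host. */
void mul_cublas(const float *A, const float *B, float *C,
                int m, int k, int n)
{
    float *dA, *dB, *dC;

    cublasInit();
    cublasAlloc(m * k, sizeof(float), (void **)&dA);
    cublasAlloc(k * n, sizeof(float), (void **)&dB);
    cublasAlloc(m * n, sizeof(float), (void **)&dC);

    cublasSetMatrix(m, k, sizeof(float), A, m, dA, m);
    cublasSetMatrix(k, n, sizeof(float), B, k, dB, k);

    /* 'n','n' = no transpose, alpha = 1, beta = 0 */
    cublasSgemm('n', 'n', m, n, k, 1.0f, dA, m, dB, k, 0.0f, dC, m);

    cublasGetMatrix(m, n, sizeof(float), dC, m, C, m);

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
}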
October 14-15, 2009 Slide 48
Outlook
GPU-based projects:
■ T. Neuhaus: Ising Model
■ J. Meinke: Protein Folding
■ O. Zimmermann: Support Vector Machine
■ L. Westphal: Stochastic Rotation Dynamics (SRD), simulation of fluids and flows (CUDA, OpenCL)
■ J. Kreutz: Matrix Multiplication Performance; Black-Scholes Model (assessment of stock prices)
CELL-based projects:
■ S. Rinke: OpenMPI subset for eQPACE
■ J. Koutsou: QCD
October 14-15, 2009 Slide 49
Further Information
http://www.fz-juelich.de/jsc/juice
http://www.physik.uni-regensburg.de/sfbtr55
http://en.wikipedia.org/wiki/QPACE
People involved, contacts:
■ [email protected]
Appendix
October 14-15, 2009 Slide 51
Cell Architecture
Element Interconnect Bus (EIB)
■ 16-byte ring buses, 96 bytes/cycle in total
■ up to 128 open requests
Memory Interface Controller (MIC)
■ Access to memory: 25.6 GB/s
■ max. 512 MB per CBE processor
Bus Interface Controller (BIC)
■ I/O or dual CBE interconnect
■ 12 FlexIO links, each 6.4 GB/s
October 14-15, 2009 Slide 52
Cell Cluster
35 QS22 blades @ 3.2 GHz
■ 2 Cell processors (CBE)
■ 8 GB Memory
4x InfiniBand adapter (SDR)
Frontend: x86 compatible
Interconnect switch: IB (Cisco)
Peak performance:
■ 1 blade: 217.6 GFLOPS peak (64-bit float)
■ 35 blades: 7 TFLOPS peak (64-bit float)
October 14-15, 2009 Slide 53
Programming Model (1st Generation Cell Processor)
large ratio between SP and DP speed
e.g. the first implementation of the HPL benchmark on Cell uses mixed precision (J. Kurzak, J. Dongarra)
■ compute-intensive operations in SP
■ necessary iterative refinement in DP
■ 1.5 times memory consumption
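The mixed-precision approach is classic iterative refinement: factorize and solve in fast single precision, then improve the solution with residuals computed in double precision. A compact sketch, with hypothetical single-precision helpers lu_factor_sp()/lu_solve_sp() standing in for SGETRF/SGETRS:

#include <stdlib.h>

/* Hypothetical single-precision helpers (e.g. wrappers around SGETRF /
 * SGETRS): factorize A in place and solve with the stored factors. */
void lu_factor_sp(float *A, int n, int *ipiv);
void lu_solve_sp(const float *A, const int *ipiv, int n, float *x);

/* Solve A x = b to double-precision accuracy while doing the expensive
 * O(n^3) factorization and the triangular solves in single precision. */
void mixed_precision_solve(const double *A, const double *b,
                           double *x, int n, int iters)
{
    float *As   = malloc((size_t)n * n * sizeof(float));
    int   *ipiv = malloc(n * sizeof(int));
    float *corr = malloc(n * sizeof(float));

    for (int i = 0; i < n * n; i++) As[i] = (float)A[i];  /* SP copy of A */
    lu_factor_sp(As, n, ipiv);                            /* SP LU        */

    for (int i = 0; i < n; i++) x[i] = 0.0;

    for (int it = 0; it < iters; it++) {
        /* residual r = b - A x, accumulated in double precision */
        for (int i = 0; i < n; i++) {
            double r = b[i];
            for (int j = 0; j < n; j++) r -= A[i * n + j] * x[j];
            corr[i] = (float)r;
        }
        lu_solve_sp(As, ipiv, n, corr);               /* SP correction solve */
        for (int i = 0; i < n; i++) x[i] += corr[i];  /* DP update           */
    }

    free(As); free(ipiv); free(corr);
}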
October 14-15, 2009 Slide 54
DGEMV BLAS using SPEs <--> GotoBLAS
Matrix not transposed Matrix transposed
October 14-15, 2009 Slide 55
TNW Configuration: Blue=X-Direction, PHY=1,4
October 14-15, 2009 Slide 56
CUDA Example (Matrix Multiply): Performance
Matrix Dimensions N were chosen as multiples of block size (16)
Using shared memory instead of global memory strongly improves performance
Available BLAS library for CUDA (CUBLAS) is highly optimized and should be used whenever possible
October 14-15, 2009 Slide 57
Accelerator Performance Results
October 14-15, 2009 Slide 58
Development Time