
Energy-Efficient Many-Core Systems at JSC

Symposium: CFD on Future Architectures – Many-Cores, GPUs, FPGAs, …

DLR Braunschweig, 14th and 15th of October 2009, W. Homberg


October 14-15, 2009 Slide 2

Outline

Top500 / Green500

JUICE

■ Cell BE / PowerXCell 8i

JUPACE

■ QPACE (SFB TR 55)

■ FPGA-based Torus Network

■ PRACE/WP8 Prototype eQPACE

JUGIPSY

■ GPU System NVIDIA Tesla S1070

Outlook


October 14-15, 2009 Slide 3

Jülich Supercomputing Centre

Jugene:

IBM BlueGene/P

294912 processors

1 Petaflops

2.3 MW

Juropa:

Bull, Sun

3288 procs / 26394 cores

308 Teraflops

1.5 MW


October 14-15, 2009 Slide 4

Top500 – June 2009 Top 10


October 14-15, 2009 Slide 5

Green500 – June 2009 Top 10


October 14-15, 2009 Slide 6

The JUICE (JUelich Initiative CEll cluster) Project

By 2006, use of the Cell processor was gradually widening to scientific applications.

One of the first Cell-based systems was IBM's QS20 blade, equipped with two Cell processors with 512 MB memory each; external connectivity was provided by Gigabit Ethernet and an optional InfiniBand network interface.

At the beginning of 2007, a cluster of 12 QS20 blades was set up at JSC, providing a single-precision floating-point peak performance of about 5 TFLOPS.

A performance of 4.5 TFLOPS (90% of peak) was attained with a matrix multiplication using all nodes over the InfiniBand network.

The Cell Cluster Meeting 2007, with international attendance, was organized at JSC in Jülich:

http://www.fz-juelich.de/jsc/juice/cell_cluster_meeting


October 14-15, 2009 Slide 7

JUICE: Activities at JSC

Lanczos Implementation for Hubbard Models (Andreas Dolfen, Erik Koch)

MPI on Cell (Norbert Eicker)

Benchmark Matrix-Matrix Multiplication (Inge Gutheil)

A Multigrid Method for the Solution of the Poisson Equation (Matthias Bolten)

CellSs and Triple-Matrix-Multiply (Liang Yang, Annika Schiller, Godehard Sutmann, B.Wylie)

A Fast Wavelet Based Implementation to Calculate Coulomb Potentials (Annika Schiller, Godehard Sutmann, Liang Yang)

Comparison of CellSs and native programming with a Jacobi solver and triple-matrix-multiply on Cell/B.E. (Liang Yang, Annika Schiller, Godehard Sutmann, Brian J. N. Wylie, Ralph Altenfeld, and Felix Wolf)

JUICE - Jülich Initiative Cell Cluster - Report 2007 (FZJ-JSC-IB-2007-13)


October 14-15, 2009 Slide 8

JUICE: External Partners

50 registered users at 30 sites

Master's Theses:

Porting the NestStep Run-time System to the CELL Broadband Engine, Daniel Johansson, Linköping 2007

A Fast Multigrid Solver for Molecular Dynamics on the Cell Broadband Engine, Daniel Ritter, Erlangen 2008

Über die Cell/B.E.-Architektur: Optionen zur Generierung von Programm-Traces (On the Cell/B.E. architecture: options for generating program traces), Daniel Hackenberg, Dresden 2008

Beschleunigung von Bildrekonstruktionsverfahren in der Positronen-Emissions-Tomographie unter Einsatz der Cell Broadband Engine (Accelerating image-reconstruction methods in positron emission tomography using the Cell Broadband Engine), Michael Sievers, Paderborn


October 14-15, 2009 Slide 9

JUICEnext

Hardware:

35 QS22 blades

70 PowerXCell 8i

7 TFLOPS peak

Main memory: 35 * 8 GB (aggregate 280 GB)

InfiniBand 4x SDR

BladeCenter management / terminal server

ParaStation:

MPI stack, GridMonitor

Torque/Maui job management for interactive work and batch processing


October 14-15, 2009 Slide 10

Activities on JUICEnext

Stochastic Rotation Dynamics (SRD)

Energy Function for Protein Folding (ECEPP/3)

High Performance LINPACK (HPL) using ParaStation MPI

Evaluation of IBM BLAS/LAPACK Library


October 14-15, 2009 Slide 11

Stochastic Rotation Dynamics on Cell/B.E. (Annika Schiller, Godehard Sutmann)

Stochastic Rotation Dynamics (SRD):

also called Multiparticle Collision Dynamics (MPC)

is a particle-based mesoscale algorithm for the simulation of fluids and flows

solves the linearized Navier-Stokes equations

employs a discrete-time dynamics with continuous velocities and local multiparticle collisions

basic steps:

● Free Streaming Step: particles are moved without interaction with each other according to their velocities for the duration of one time step dt

● Cell Filling Step: a virtual grid is superimposed over the simulation box and the particles are assigned to the grid cells, the grid is shifted randomly at each time step

● Multiparticle Collision Step: particle velocities are rotated relative to the centre of mass velocity of the collision cell
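
To make the collision step concrete, here is a minimal serial 2D sketch in plain C. It is an illustration only: the struct layout and function name are invented for this example, and the actual JSC implementation stores particle data cellwise and runs the loops on the SPEs via CellSs.

    #include <math.h>
    #include <stdlib.h>

    typedef struct { float vx, vy; int cell; } Particle;   /* 2D for brevity */

    /* Rotate each particle velocity by +/-alpha around the centre-of-mass
     * velocity of its collision cell (random rotation sense per cell). */
    void srd_collide(Particle *p, int n, int ncells, float alpha)
    {
        float *vcx = calloc(ncells, sizeof(float));
        float *vcy = calloc(ncells, sizeof(float));
        float *sgn = malloc(ncells * sizeof(float));
        int   *cnt = calloc(ncells, sizeof(int));

        /* centre-of-mass velocity per collision cell */
        for (int i = 0; i < n; i++) {
            vcx[p[i].cell] += p[i].vx;
            vcy[p[i].cell] += p[i].vy;
            cnt[p[i].cell]++;
        }
        for (int c = 0; c < ncells; c++) {
            if (cnt[c]) { vcx[c] /= cnt[c]; vcy[c] /= cnt[c]; }
            sgn[c] = (rand() & 1) ? 1.0f : -1.0f;   /* rotation sense per cell */
        }

        /* multiparticle collision: v_i' = v_cm + R(+/-alpha) * (v_i - v_cm) */
        for (int i = 0; i < n; i++) {
            int   c  = p[i].cell;
            float ca = cosf(alpha), sa = sgn[c] * sinf(alpha);
            float dx = p[i].vx - vcx[c], dy = p[i].vy - vcy[c];
            p[i].vx = vcx[c] + ca * dx - sa * dy;
            p[i].vy = vcy[c] + sa * dx + ca * dy;
        }
        free(vcx); free(vcy); free(sgn); free(cnt);
    }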


October 14-15, 2009 Slide 12

Stochastic Rotation Dynamics on Cell/B.E.(Annika Schiller, Godehard Sutmann)

Cell/BE implementation via CellSs: data locality and independence are achieved by storing the particle information cellwise instead of using linked-cell lists

free streaming step and multiparticle collision step are calculated on the SPEs in parallel

Current status: speedup could be optimized to nearly linear scaling by reducing the data transfer ⇒ communication no longer dominates computation; the near-linear scaling shows that SRD is a suitable application for the Cell/BE.

Future work: parallelization of the cell-filling step with CellSs in cooperation with BSC


October 14-15, 2009 Slide 13

Protein Folding with ECEPP/3 on Cell/B.E. (Lauretta Schubert, Jan Meinke)

Protein Folding:

model

■ classic, all atom

■ fixed bond lengths and angles, free torsion angles

force field

■ ECEPP/3 (Empirical Conformational Energy for Peptides and Proteins)

■ includes Coulomb, van der Waals, hydrogen-bonding and torsion terms with selection rules

algorithms

■ Parallel Tempering

■ Monte Carlo Simulation with Metropolis algorithm


October 14-15, 2009 Slide 14

Protein Folding with ECEPP/3 on Cell/B.E.(Lauretta Schubert, Jan Meinke)

Implementation:

■ calculate all interactions first (ignores selection rules)

■ subtract non-interacting pairs → reduces memory use → avoids time-consuming branching (see the sketch after this list)

● current status:

■ linear speedup

■ power efficient

● further steps:

■ reduce amount of transfer data

■ switch of parallelisation method

■ double buffering
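
As a rough illustration of the branch-free pattern described above: evaluate all pair terms in a tight, SIMD-friendly loop and then subtract the few excluded pairs, instead of testing selection rules inside the inner loop. The names and the simple 1/r pair term below are placeholders, not the actual ECEPP/3 force-field terms.

    #include <math.h>

    /* Energy over ALL pairs: no selection-rule branches in the inner loop. */
    double energy_all_pairs(const float *x, const float *y, const float *z, int n)
    {
        double e = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                double dx = x[i]-x[j], dy = y[i]-y[j], dz = z[i]-z[j];
                e += 1.0 / sqrt(dx*dx + dy*dy + dz*dz);
            }
        return e;
    }

    /* Energy of the (few) excluded, i.e. non-interacting, pairs. */
    double energy_excluded(const float *x, const float *y, const float *z,
                           const int (*excl)[2], int nexcl)
    {
        double e = 0.0;
        for (int k = 0; k < nexcl; k++) {
            int i = excl[k][0], j = excl[k][1];
            double dx = x[i]-x[j], dy = y[i]-y[j], dz = z[i]-z[j];
            e += 1.0 / sqrt(dx*dx + dy*dy + dz*dz);
        }
        return e;
    }

    /* total = all pairs minus the non-interacting pairs */
    double energy(const float *x, const float *y, const float *z, int n,
                  const int (*excl)[2], int nexcl)
    {
        return energy_all_pairs(x, y, z, n) - energy_excluded(x, y, z, excl, nexcl);
    }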


October 14-15, 2009 Slide 15

HP LINPACK

HP LINPACK measured with MPI-1 under SDK-3.0

■ not all nodes on same SDK level

■ only up to 24 blades and 48 MPI processes

■ only ~40% of peak performance

Implementation of MPI-2 with SDK 3.1: performance degradation

Analysis with VampirTrace for small problems


October 14-15, 2009 Slide 16

Performance of HPL on JUICEnext with MPI 1


October 14-15, 2009 Slide 17

Analysis of HPL with VampirTrace


October 14-15, 2009 Slide 18

Evaluation of BLAS/LAPACK Library

Easy way to use Cell SPEs

Good scaling for DGEMM; DGEMV is memory-bound (see the note below)
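
A back-of-the-envelope estimate, using the Cell figures quoted in the appendix, shows why: DGEMV performs roughly 2 flops per matrix element that is read once from memory, i.e. about 2 flops per 8-byte double, or 0.25 flop/byte. At 25.6 GB/s memory bandwidth this caps DGEMV at roughly 25.6 GB/s × 0.25 flop/byte ≈ 6.4 GFLOPS per processor, far below the PowerXCell 8i's ~100 GFLOPS double-precision peak, whereas DGEMM reuses each element O(N) times from on-chip memory and can stay compute-bound.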


October 14-15, 2009 Slide 19

QPACE (QCD Parallel Computing on the Cell)

Collaborative development effort

■ Main funding: DFG (SFB TR 55)

■ Main partners: Universities of Regensburg and Wuppertal, University of Ferrara, DESY-Zeuthen, and the IBM development lab in Böblingen

Installed systems in Jülich and Wuppertal

Main features:

■ PowerXCell 8i processor

■ Custom FPGA-based network processor (3D-Torus)

■ SPE-centric nearest neighbour communication

■ Communication from Local Store to Local Store


October 14-15, 2009 Slide 20

JUPACE – QPACE System at Jülich

4 QPACE racks, each with:

■ 256 node cards

■ 26 TFLOPS peak

■ 1 TB memory

■ 35 kW power consumption

Cooltrans rack

Front-end system rack

■ login/master node

■ 3 TB Lustre filesystem

■ 2 Meta Data Servers

■ 4 Object Storage Servers


October 14-15, 2009 Slide 21

The PRACE Project

EU approved the PRACE Preparatory Phase Project (Grant: INFSO-RI-211528)

• 16 partners from 14 countries
• Project duration: January 2008 – December 2009
• Project budget: 20 M€, EC funding: 10 M€
• Kickoff: Jülich, January 29-30, 2008


October 14-15, 2009 Slide 22

PRACE Objectives

Provide world-class systems for world-class science
Create a single European entity
Deploy 3 – 5 systems of the highest performance level (tier-0)
Ensure diversity of architectures
Provide support and training


October 14-15, 2009 Slide 23

PRACE Work Packages

WP1 Management
WP2 Organizational concept
WP3 Dissemination, outreach and training
WP4 Distributed computing
WP5 Deployment of prototype systems
WP6 Software enabling for prototype systems
WP7 Petaflop/s systems for 2009/2010
WP8 Future petaflop/s technologies


October 14-15, 2009 Slide 24

Future Petaflop/s Computer Technologies beyond 2010

Assessment and evaluation of emerging multi-petascale technology following the requirements of HPC users

Implementation of a strategy that guarantees a continuous HPC technology evaluation and system evolution within the PRACE Research Infrastructure

Fostering the development of components for future multi-petascale production systems in cooperation with European and international HPC industry

Creation of STRATOS - PRACE advisory group for Strategic Technologies


October 14-15, 2009 Slide 25

PRACE WP8 Prototypes

Site: CEA ("GPU/CAPS")
Hardware/Software: 1U Tesla Server T1070 (CUDA, CAPS, DDT); Intel Harpertown nodes
Porting effort: "Evaluate GPU accelerators and GPGPU programming models and middleware." (e.g., pollutant migration code (ray-tracing algorithm) to CUDA and HMPP)

Site: CINES-LRZ ("LRB/CS")
Hardware/Software: Hybrid SGI ICE2/UV/Nehalem-EP & Nehalem-EX/ClearSpeed/Larrabee
Porting effort: Gadget, SPECFEM3D_GLOBE, RAxML, Rinf, RandomAccess, ApexMap, Intel MPI BM

Site: CSCS ("UPC/CAF")
Hardware/Software: Prototype PGAS language compilers (CAF + UPC for Cray XT systems)
Porting effort: "The applications chosen for this analysis will include some of those already selected as benchmark codes"

Site: EPCC ("FPGA")
Hardware/Software: Maxwell – FPGA prototype (VHDL support & consultancy + software licenses (e.g., Mitrion-C))
Porting effort: "We wish to port several of the PRACE benchmark codes to the system. The codes will be chosen based on their suitability for execution on such a system."


October 14-15, 2009 Slide 26

PRACE WP8 Prototypes (cont'd)

Site: FZJ (BSC) ("Cell & FPGA interconnect")
Hardware/Software: eQPACE (PowerXCell cluster with special network processor)
Porting effort: Extend FPGA-based interconnect beyond QCD applications.

Site: LRZ ("RapidMind")
Hardware/Software: RapidMind (stream-processing programming paradigm) on x86, GPGPU, Cell
Porting effort: ApexMap, Multigrid, FZJ (QCD), CINECA (linear algebra kernels involved in solvers for ordinary differential equations), SNIC

Site: NCF ("ClearSpeed")
Hardware/Software: ClearSpeed CATS 700 units
Porting effort: Astronomical many-body simulation, iterative sparse solvers with preconditioning, finite element code, cryomicrotome image analysis

Site: CINECA
Hardware/Software: I/O subsystem (SSD, Lustre, pNFS)
Porting effort: –


October 14-15, 2009 Slide 27

QPACE-green / eQPACE

Highlight activities towards energy-efficient HPC systems

Prototype of leadership class system defined in PRACE WP8

Top500 / Green500 publication planned for SC09

HPL communication requirements differ from QCD

Make system suitable for a wider class of applications

Study of HPC requirements on QPACE

Extensions to network processor protocol necessary

Cooperation between IBM and JSC


October 14-15, 2009 Slide 28

QPACE Torus Network Processor


October 14-15, 2009 Slide 29

TNW Configuration: Red=Z-Direction, PHY=2,5

Network reconfiguration possible by selecting primary and secondary interface


October 14-15, 2009 Slide 30

TNW Configuration: Green=Y-Direction, PHY=0,3

1 backplane (32 node cards): X x Y x Z = 1 x (1,2,4) x (1,2,4,8)

1 rack (256 node cards): X x Y x Z = (1,2) x (1,2,4,8,16) x (1,2,4,8)


October 14-15, 2009 Slide 31

TNW-Setup: Ring of 32 Nodecards in a Backplane

Example: X x Y x Z = 1 x 4 x 8, ring of 32 nodes of backplane 6

tnwsetup 6:0 … 6:7 6:15 … 6:8 6:24 … 6:31 6:23 … 6:16 -action start


October 14-15, 2009 Slide 32

Torus Network Ping-Pong (PPE-Side)


October 14-15, 2009 Slide 33

Torus Network Ping-Pong (SPE-Side)


October 14-15, 2009 Slide 34

Jupace: 1st Test Application (Wave Equation)

Motion of a vibrating string
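
For reference, the string() kernel on the following slides presumably implements the standard explicit finite-difference update for the 1D wave equation, u_i(t+dt) = 2·u_i(t) − u_i(t−dt) + eps·(u_{i+1}(t) − 2·u_i(t) + u_{i−1}(t)) with eps = (c·dt/dx)²; the halo points exchanged between SPEs (and, across node boundaries, via MPI or the torus network) supply the u_{i−1} and u_{i+1} neighbours at the edges of each SPE's block.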


October 14-15, 2009 Slide 35

Wave Equation on JUICEnext: PPE-Centric

for (t = 2*dt; t <= t_end; t += dt) {
    ++iter;

    // set halo points
    halo[1][0] = lblocal;
    halo[0][NUM_SPES+1] = rblocal;
    for (num = 0; num < NUM_SPES; num++) {
        bstart = num * SPE_BUFF_SIZE;
        halo[0][num+1] = u[bstart];
        halo[1][num+1] = u[bstart + SPE_N - 1];
    }

    // send message to SPEs to synchronise each iteration start
    for (num = 0; num < NUM_SPES; num++) {
        spe_in_mbox_write(data[num].spe_ctx, (uint32_t *)&iter, 1,
                          SPE_MBOX_ALL_BLOCKING);
    }

    // ... now the SPEs perform the iteration step

    // read message from SPEs to synchronise each iteration end
    for (num = 0; num < NUM_SPES; num++) {
        while (!spe_out_mbox_status(data[num].spe_ctx));
        spe_out_mbox_read(data[num].spe_ctx, (uint32_t *)&mbx[num], 1);
    }

    // exchange MPI boundary values between nodes
    if (C_size > 1) {
        /* send to the right */
        if (C_rank != 0)
            MPI_Recv(&lblocal, 1, MPI_FLOAT, C_rank-1, right_copy,
                     MPI_COMM_WORLD, &status);
        if (C_rank != C_size-1)
            MPI_Send((void *)&u[BUFF_SIZE - SPE_BUFF_SIZE + SPE_N - 1], 1,
                     MPI_FLOAT, C_rank+1, right_copy, MPI_COMM_WORLD);

        /* send to the left */
        if (C_rank != C_size-1)
            MPI_Recv(&rblocal, 1, MPI_FLOAT, C_rank+1, left_copy,
                     MPI_COMM_WORLD, &status);
        if (C_rank != 0)
            MPI_Send((void *)&u[0], 1, MPI_FLOAT, C_rank-1, left_copy,
                     MPI_COMM_WORLD);
    }
}


October 14-15, 2009 Slide 36

Wave Equation on Jupace: SPE-Centric

tnw_credit_base(zminus, chid);
tnw_notify_base(zminus, chid, notify_buf);
tnw_credit_base(zplus, chid);
tnw_notify_base(zplus, chid, notify_buf);

do {
    iter = spu_read_in_mbox();
    if (iter == ITER_STOP) break;

    // get halo values
    mfc_get(halo, ctx.ea_halo, 32*sizeof(float), tag_id, 0, 0);
    waitag(tag_id);

    // perform iteration
    string(u_l, u_old_l, u_new_l, SPE_N, eps,
           halo[1][myid], halo[0][myid+2], myid);
    u_temp = u_old_l; u_old_l = u_l; u_l = u_new_l; u_new_l = u_temp;

    // put output buffer u
    mfc_put(u_l, ctx.ea_u, (SPE_BUFF_SIZE)*sizeof(float), tag_id, 0, 0);
    waitag(tag_id);

    // inform PPE that u values are up-to-date
    spu_write_out_mbox(myid);

    if (myid == NUM_SPES-1)
        tnw_put(zplus, 0, u_l, 0, MSGSIZE, 5);
    if (myid == 0) {
        // TNW: send to the right
        tnw_credit(zminus, chid, (uint)recv_buf0, MSGSIZE, 0);
        tnw_wait_recv(notify_buf, 0);
        if (C_rank != 0) halo[1][0] = recv_buf0[SPE_N-1];

        // send to the left
        tnw_credit(zplus, chid, (uint)recv_buf0, MSGSIZE, 0);
        tnw_put(zminus, chid, u_l, 0, MSGSIZE, 5);
        tnw_wait_recv(notify_buf, 0);
        if (C_rank != C_size-1) halo[0][NUM_SPES+1] = recv_buf0[0];

        // put halo buffer
        mfc_put(halo, ctx.ea_halo, 32*sizeof(float), tag_id, 0, 0);
        waitag(tag_id);
    }

    // inform PPE that iteration is done
    spu_write_out_mbox(myid);
} while (1);


October 14-15, 2009 Slide 37

QPACE-green / eQPACE

First modifications of FPGA done (F. Schifano, B.Krill)

■ reduce FPGA resource consumption

■ DMA controller for main-memory to main-memory transfers


October 14-15, 2009 Slide 38

High Performance LINPACK


October 14-15, 2009 Slide 39

QPACE-green / eQPACE

HPL for QPACE (S. Rinke, H. Böttiger): support a performance-critical subset of MPI

Open MPI BTL and COLL components

MPI uses torus nearest neighbour communication where possible, Ethernet otherwise

MPI communication operations in HPL (S.Wunderlich)

MPI_Send

MPI_Recv

MPI_Allgatherv

MPI_Scatterv

MPI_Allreduce

MPI_Bcast


October 14-15, 2009 Slide 40

Open MPI Modular Component Architecture

Organized in layers

Frameworks, components, modules

Architecture allows development of components independently

New Torus Network device requires:

■ Byte Transfer Layer module

■ Short messages (eager protocol)

■ Long messages (rendez-vous protocol)

■ Collective component
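
At run time such a component would be selected through Open MPI's standard MCA parameters, for example (with a purely hypothetical component name tnw):

    mpirun --mca btl tnw,sm,self -np 64 ./xhpl

Here sm and self are the standard shared-memory and loopback BTLs that remain available alongside the torus component.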


October 14-15, 2009 Slide 41

QPACE-green / eQPACE

Ping-pong torus benchmark for main-memory to main-memory transfers

HPL based on the QS22 patch has been run successfully on the QPACE architecture, using torus-network nearest-neighbour communication whenever possible


October 14-15, 2009 Slide 42

Planned Activities

Porting applications from JUICEnext to QPACE

Stochastic Rotation Dynamics (simulation of fluids and flows)

ECEPP/3 force field (protein folding)

Multigrid method

Extension of FPGA capabilities for better application support

DMA engine

Routing feature


October 14-15, 2009 Slide 43

JuGiPSY: Jülich's GPU System

NVIDIA Tesla S1070

4 x GPU / 1U blade, 1.45 GHz

4 x 4 GB DDR3 memory

4 x 102 GB/s memory bandwidth

4 TFLOPS single-precision

345 GFLOPS double-precision

700 W

Host System (2x)

Xeon E5430 @ 2.66 GHz quad-core

32 GB Memory

1 TB disk

Host Interface cards

2x PCIe 16x

(Transtec 1000R S1070)


October 14-15, 2009 Slide 44

Tesla S1070 Features

A multiprocessor (Thread Processing Array, TPA) includes:

8 processor cores (floating point / integer unit; move, compare, logic, branch unit)

1 double-precision unit (IEEE 754 floating point)

Hardware driver information:
Product Name: NVIDIA Tesla S1070-400
GPU 0: Tesla C1060, GPU 1: Tesla C1060

CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 4294705152 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512

Bandwidth (transfer size 33554432 bytes):
Host to Device: 2157.0 MB/s
Device to Host: 1886.4 MB/s
Device to Device: 73458.1 MB/s


October 14-15, 2009 Slide 45

CUDA Example (Matrix Multiply): Host Function

// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the device multiplication function
__global__ void Muld(float*, float*, int, int, float*);

// Host multiplication function: compute C = A * B
// hA is the height of A, wA is the width of A, wB is the width of B
void Mul(const float* A, const float* B, int hA, int wA, int wB, float* C)
{
    int size;

    // Load A and B to the device
    float* Ad;
    size = hA * wA * sizeof(float);
    cudaMalloc((void**)&Ad, size);
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);

    float* Bd;
    size = wA * wB * sizeof(float);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

    // Allocate C on the device
    float* Cd;
    size = hA * wB * sizeof(float);
    cudaMalloc((void**)&Cd, size);

    // Compute the execution configuration assuming
    // the matrix dimensions are multiples of BLOCK_SIZE
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

    // Launch the device computation
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);

    // Read C from the device
    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
}
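
A minimal host-side driver for the Mul() wrapper above might look as follows (compiled together with the code above; the matrix sizes and fill values are arbitrary test choices):

    #include <stdlib.h>

    int main(void)
    {
        const int hA = 512, wA = 512, wB = 512;   /* multiples of BLOCK_SIZE */
        float *A = (float*)malloc(hA * wA * sizeof(float));
        float *B = (float*)malloc(wA * wB * sizeof(float));
        float *C = (float*)malloc(hA * wB * sizeof(float));

        for (int i = 0; i < hA * wA; i++) A[i] = 1.0f;   /* arbitrary test data */
        for (int i = 0; i < wA * wB; i++) B[i] = 2.0f;

        Mul(A, B, hA, wA, wB, C);        /* C = A * B on the GPU */

        /* every entry of C should now equal wA * 1.0f * 2.0f = 1024 */
        free(A); free(B); free(C);
        return 0;
    }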


October 14-15, 2009 Slide 46

CUDA Example (Matrix Multiply): Device Function

// Device multiplication function called by Mul()
// Compute C = A * B
// wA is the width of A
// wB is the width of B
__global__ void Muld(float* A, float* B, int wA, int wB, float* C)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    // The element of the block sub-matrix that is computed by the thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
         a <= aEnd;
         a += aStep, b += bStep) {

        // Shared memory for the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

        // Shared memory for the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from global memory to shared memory;
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure that the preceding computation is done
        // before loading two new sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to global memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}


October 14-15, 2009 Slide 47

CUDA Example (Matrix Multiply): Performance

Matrix Dimensions N were chosen as multiples of block size (16)

Using shared memory instead of global memory strongly improves performance

Available BLAS library for CUDA (CUBLAS) is highly optimized and should be used whenever possible
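
As a sketch of how the legacy CUBLAS interface shipped with CUDA 2.x could replace the hand-written kernel for square matrices, assuming column-major storage and omitting error checking:

    #include <cublas.h>

    /* C = A * B for N x N matrices stored column-major.
     * Uses the legacy (non-handle) CUBLAS API of the CUDA 2.x era. */
    void mul_cublas(const float *A, const float *B, float *C, int N)
    {
        float *dA, *dB, *dC;

        cublasInit();
        cublasAlloc(N * N, sizeof(float), (void **)&dA);
        cublasAlloc(N * N, sizeof(float), (void **)&dB);
        cublasAlloc(N * N, sizeof(float), (void **)&dC);

        cublasSetMatrix(N, N, sizeof(float), A, N, dA, N);
        cublasSetMatrix(N, N, sizeof(float), B, N, dB, N);

        /* column-major SGEMM: C = 1.0 * A * B + 0.0 * C */
        cublasSgemm('n', 'n', N, N, N, 1.0f, dA, N, dB, N, 0.0f, dC, N);

        cublasGetMatrix(N, N, sizeof(float), dC, N, C, N);

        cublasFree(dA); cublasFree(dB); cublasFree(dC);
        cublasShutdown();
    }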


October 14-15, 2009 Slide 48

Outlook

GPU-based projects:

■ T. Neuhaus: Ising Model

■ J. Meinke: Protein Folding

■ O. Zimmermann: Support Vector Machine

■ L. Westphal: Stochastic Rotation Dynamics (SRD) – simulation of fluids and flows (CUDA, OpenCL)

■ J. Kreutz: Matrix multiplication performance; Black-Scholes model (option pricing)

CELL-based projects:

■ S. Rinke: OpenMPI subset for eQPACE

■ J. Koutsou: QCD


October 14-15, 2009 Slide 49

Further Information

http://www.fz-juelich.de/jsc/juice

http://www.physik.uni-regensburg.de/sfbtr55

http://en.wikipedia.org/wiki/QPACE

People involved, contacts:

■ [email protected]

[email protected]

[email protected]


Appendix


October 14-15, 2009 Slide 51

Cell Architecture

Element Interconnect Bus (EIB)

■ four 16-byte ring buses, 96 bytes/cycle in total

■ up to 128 open requests

Memory Interface Controller (MIC)

■ Access to memory: 25.6 GB/s

■ max. 512 MB per CBE processor

Bus Interface Controller (BIC)

■ I/O or dual CBE interconnect

■ 12 FlexIO links, each 6.4 GB/s


October 14-15, 2009 Slide 52

Cell Cluster

35 QS22 blades @ 3.2 GHz

■ 2 Cell processors (CBE)

■ 8 GB Memory

4x InfiniBand adapter (SDR)

Frontend: x86 compatible

Interconnect switch: IB (Cisco)

Peak performance:

■ 1 blade: 217.6 GFLOPS peak (64-bit float)

■ 35 blades: 7 TFLOPS peak (64-bit float)


October 14-15, 2009 Slide 53

Programming Model (1st Generation Cell Processor)

large ratio between SP/DP speed

e.g. the first implementation of the HPL benchmark on Cell used mixed precision (J. Kurzak, J. Dongarra)

■ compute-intensive operations in SP

■ the necessary iterative refinement in DP

■ 1.5 times the memory consumption
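
A compact, self-contained C sketch of that idea (single-precision solve, double-precision residual and correction). The naive Gaussian elimination and the tiny test system are for illustration only; a real code would factorize once in SP and reuse the factors instead of re-solving each step.

    #include <stdio.h>

    #define N 4

    /* Solve A*x = b in SINGLE precision with naive Gaussian elimination
       (no pivoting; fine for the diagonally dominant test matrix below). */
    static void solve_sp(const float A0[N][N], const float b0[N], float x[N])
    {
        float A[N][N], b[N];
        for (int i = 0; i < N; i++) {
            b[i] = b0[i];
            for (int j = 0; j < N; j++) A[i][j] = A0[i][j];
        }
        for (int k = 0; k < N; k++)
            for (int i = k + 1; i < N; i++) {
                float f = A[i][k] / A[k][k];
                for (int j = k; j < N; j++) A[i][j] -= f * A[k][j];
                b[i] -= f * b[k];
            }
        for (int i = N - 1; i >= 0; i--) {
            float s = b[i];
            for (int j = i + 1; j < N; j++) s -= A[i][j] * x[j];
            x[i] = s / A[i][i];
        }
    }

    int main(void)
    {
        /* arbitrary diagonally dominant test system */
        double A[N][N] = {{10,1,2,0},{1,12,0,3},{2,0,9,1},{0,3,1,11}};
        double b[N] = {1, 2, 3, 4}, x[N] = {0};
        float  Af[N][N], rf[N], df[N];

        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) Af[i][j] = (float)A[i][j];

        for (int it = 0; it < 5; it++) {
            double r[N];
            for (int i = 0; i < N; i++) {          /* residual in DOUBLE precision */
                r[i] = b[i];
                for (int j = 0; j < N; j++) r[i] -= A[i][j] * x[j];
                rf[i] = (float)r[i];
            }
            solve_sp(Af, rf, df);                  /* correction in SINGLE precision */
            for (int i = 0; i < N; i++) x[i] += df[i];
        }
        for (int i = 0; i < N; i++) printf("x[%d] = %.15f\n", i, x[i]);
        return 0;
    }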


October 14-15, 2009 Slide 54

DGEMV BLAS using SPEs <--> GotoBLAS

Panels: matrix not transposed / matrix transposed


October 14-15, 2009 Slide 55

TNW Configuration: Blue=X-Direction, PHY=1,4


October 14-15, 2009 Slide 56

CUDA Example (Matrix Multiply): Performance

Matrix Dimensions N were chosen as multiples of block size (16)

Using shared memory instead of global memory strongly improves performance

Available BLAS library for CUDA (CUBLAS) is highly optimized and should be used whenever possible


October 14-15, 2009 Slide 57

Accelerator Performance Results


October 14-15, 2009 Slide 58

Development Time