Message Passing Architecture
Dr. Martin Land
Advanced Computer Architecture — Hadassah College — Fall 2016



Message Passing Multiprocessors

Collection of N nodes
  Node = CPU with cache and private memory space
  Node i has address space 0, … , Ai − 1
  No shared memory locations
Processes communicate by exchange of structured messages
Switching fabric — network

[Figure: N nodes (0 … N − 1), each a CPU with private memory addressed 0, … , A − 1, connected through a switching fabric; one node provides I/O and a user interface to an external network.]


Message Passing Systems

Interprocessor communication
  Send / receive transactions
Message Passing Interface (MPI)
  API for message passing parallel programming
  Transaction primitives between processor and interprocessor network
  Implementation depends on OS and hardware
Interprocessor network performance
  General computer network
  Communication protocol and message types
  Routing, network access, segmentation/reassembly (SAR)
  Data requests and transfers
Scalable to large systems
  Depends on network capacity
  No geographical restriction
  Grows naturally into cluster computing


Message Passing Issues

Basic message types
  Send data
  Send request (for data)
  Respond to request (data or status)
  Send system command
Structured messages (a C sketch follows this list)
  Header
    Source ID + destination ID + timestamp
  Content
    Data / request / response
  Message management ⇒ overhead
Sequential consistency
  No shared memory ⇒ no snooping ⇒ no snooping overhead
  Code expects specific content from specific source
  Message header + content ⇒ consistent write order
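As a sketch of what such a structured message could look like in C (the field names, widths, and payload size here are illustrative assumptions, not from the slides or any standard):

#include <stdint.h>

/* Illustrative message types from the list above */
enum msg_type { MSG_DATA, MSG_REQUEST, MSG_RESPONSE, MSG_SYSTEM };

struct msg_header {
    uint32_t source_id;        /* sending node */
    uint32_t destination_id;   /* receiving node */
    uint64_t timestamp;        /* time of creation */
};

struct message {
    struct msg_header header;  /* management fields = overhead */
    enum msg_type type;        /* data / request / response / system command */
    uint8_t payload[256];      /* content; size chosen arbitrarily */
};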


Message Passing Example — Scalar Product

Compute $\sum_{i=0}^{3} a[i] \cdot b[i]$ on data pre-distributed to nodes 0–3

P0:
  load Ra, a
  load Rb, b
  Ra ← Ra * Rb
  send P1, Ra

P1:
  load Ra, a
  load Rb, b
  Ra ← Ra * Rb
  recv P0, Rb
  Ra ← Ra + Rb
  send P3, Ra

P2:
  load Ra, a
  load Rb, b
  Ra ← Ra * Rb
  send P3, Ra

P3:
  load Ra, a
  load Rb, b
  Ra ← Ra * Rb
  recv P2, Rb
  Ra ← Ra + Rb
  recv P1, Rb
  Ra ← Ra + Rb
  store p, Ra

Message overhead
  Source or destination
  Time of creation
Sequential consistency guaranteed by message overhead
  P3 distinguishes P1 data from P2 data by source ID
  No data hazard


Blocking versus Non-Blocking Models

Program must pair
  Send instruction in one thread
  Receive instruction in another thread
Synchronous (blocking) model
  Threads synchronize on send/receive pair
  Send and receive instructions block in code
    Receive waits for RTS
    Send waits for CTS
  Creates synchronization barrier at transfer
Asynchronous (non-blocking) model
  Sender executes send instruction and continues
    Message buffered at sender until sent through network
  Receiver accepts message when executing receive instruction
    Message buffered at receiver until needed

[Figure: handshake between sending process PS and receiving process PR. PS issues RTS (Request to Send), PR answers CTS (Clear to Send), and the data then transfers.]
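In MPI terms (detailed on the following slides), the synchronous model corresponds to a blocking send/receive pair. A minimal sketch, assuming two processes; the value, tag, and datatype are arbitrary:

#include <mpi.h>

/* Blocking pair: MPI_Send returns only when the buffer may be reused,
   MPI_Recv returns only when the message has arrived. */
void blocking_pair(int rank)
{
    double x = 42.0;
    if (rank == 0)
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}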


Some MPI Environment Messages

Ref: https://computing.llnl.gov/tutorials/mpi/

MPI_Init (&argc,&argv)
  Initializes MPI execution environment
  Broadcasts command line arguments to all processes

MPI_COMM_WORLD
  Predefined communicator covering all MPI-aware processes

MPI_Comm_rank (comm,&rank)
  Returns process number (rank) of calling process

MPI_Comm_size (comm,&size)
  Returns number of processes in group

MPI_Finalize ()
  Terminates MPI execution environment
  Last MPI routine in every MPI program
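Put together, these routines frame every MPI program; a minimal sketch:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                /* initialize the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's number */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* number of processes in group */
    printf("process %d of %d\n", rank, size);
    MPI_Finalize();                        /* last MPI routine in the program */
    return 0;
}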


Some MPI Point‐to‐Point Messages

Ref: https://computing.llnl.gov/tutorials/mpi/

MPI_Send(buffer,count,data_type,destination_task,id_tag,comm)
  Blocking send
  id_tag — defined by user to distinguish specific message
  comm — identifies group of related tasks (usually MPI_COMM_WORLD)

MPI_Recv(buffer,count,data_type,source_task,id_tag,comm,&status)
  Blocking receive
  status — collection of error flags

MPI_Isend(buffer,count,data_type,destination_task,id_tag,comm,&request)
  Non-blocking send
  request — system returns request number for subsequent synchronization

MPI_Irecv(buffer,count,data_type,source_task,id_tag,comm,&request)
  Non-blocking receive

MPI_Wait(&request,&status)
  Blocks until a specified non-blocking send or receive operation has completed
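The non-blocking calls split a transfer into posting and completion, so computation can overlap communication. A minimal sketch assuming two processes; the tag, count, and work loop are illustrative:

#include <mpi.h>

/* Rank 0 posts a send, computes while the message is in flight,
   then waits; rank 1 does the same on the receive side. */
void overlap_example(int rank, double *work, int n)
{
    double msg = 3.14;
    MPI_Request request;
    MPI_Status status;

    if (rank == 0) {
        MPI_Isend(&msg, 1, MPI_DOUBLE, 1, 7, MPI_COMM_WORLD, &request);
        for (int i = 0; i < n; i++)    /* overlapped computation */
            work[i] *= 2.0;
        MPI_Wait(&request, &status);   /* msg may be reused after this */
    } else if (rank == 1) {
        MPI_Irecv(&msg, 1, MPI_DOUBLE, 0, 7, MPI_COMM_WORLD, &request);
        for (int i = 0; i < n; i++)
            work[i] *= 2.0;
        MPI_Wait(&request, &status);   /* msg is valid only after this */
    }
}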


Some Collective Communication Messages

Ref: https://computing.llnl.gov/tutorials/mpi/

MPI_Bcast(buffer,count,data_type,root,comm)
  Sends broadcast message from process root to all other processes in group

MPI_Scatter(sendbuf,sendcnt,sendtype,recvbuf,recvcnt,recvtype,root,comm)
  Distributes distinct messages from a single root task to each task in group

MPI_Gather(sendbuf,sendcnt,sendtype,recvbuf,recvcnt,recvtype,root,comm)
  Gathers distinct messages from each task in group to a single destination task

MPI_Reduce(sendbuf,recvbuf,count,data_type,operation,root,comm)
  Applies a reduction operation on all tasks in group and places result in one root task

MPI_Barrier(comm)
  Each task reaching the MPI_Barrier call blocks until all tasks in group reach the same MPI_Barrier
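A minimal sketch combining two of these collectives: the root broadcasts a parameter to the group, then a barrier closes the phase (the parameter and its use are illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double threshold = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        threshold = 0.5;               /* root defines the parameter */
    /* every process, root included, receives the root's value */
    MPI_Bcast(&threshold, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("rank %d works with threshold %g\n", rank, threshold);

    /* no process continues past this point until all have reached it */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}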


Scatter and Gather

[Figure: Scatter: the root's send buffer (1, 2, 3, 4) is split so that tasks 0–3 each receive one element in their destination buffers. Gather: the one-element send buffers (A, B, C, D) of tasks 0–3 are collected into the root's destination buffer.]


Reduce

[Figure: Reduce with ADD: the send buffers of tasks 0–3 hold 1, 2, 3, 4; the reduction places their sum, 10, in the root task's destination buffer.]


MPI "Hello World"#include "mpi.h"main( argc, argv )int argc;char **argv;{char message[20];int myrank; /* myrank = this process number */MPI_Status status; /* MPI_Status = error flags */MPI_Init( &argc, &argv );MPI_Comm_rank( MPI_COMM_WORLD, &myrank );

/* MPI_COMM_WORLD = list of active MPI processes */if (myrank == 0) /* code for process zero */{strcpy(message,"Hello, there");MPI_Send(message, strlen(message)+1, MPI_CHAR, 1, 99, MPI_COMM_WORLD);

}else /* code for process one */{MPI_Recv(message, 20, MPI_CHAR, 0, 99, MPI_COMM_WORLD, &status);printf("received :%s:\n", message);

}MPI_Finalize();

}

Ref: "MPI: A Message-Passing Interface Standard Version 1.3"


Send/Receive Instructions

Send instruction
  MPI_Send — instruction name
  message — pointer to message buffer
  strlen(message)+1 — length of message
  MPI_CHAR — data type in message
  1 — destination process
  99 — message identifier (tags specific message from source)
  MPI_COMM_WORLD — list of running MPI-aware processes

Receive instruction
  MPI_Recv — instruction name
  message — pointer to message buffer
  20 — maximum length of message
  MPI_CHAR — data type in message
  0 — source process
  99 — message identifier (filters on tag)
  MPI_COMM_WORLD — list of running MPI-aware processes
  &status — error flags


Scalar Product with MPI Constructs

Compute $\sum_{i=0}^{3} a[i] \cdot b[i]$ on 4 nodes (a runnable version follows below)

/* scatter data from root node 0 */
/* each node receives one component of a and one of b */
MPI_Scatter(a, 1, MPI_INT, a, 1, MPI_INT, 0, MPI_COMM_WORLD)
MPI_Scatter(b, 1, MPI_INT, b, 1, MPI_INT, 0, MPI_COMM_WORLD)

/* each node calculates */
p = a * b

/* OS for root node adds 1 integer from each node and places sum in root 0 */
MPI_Reduce(p, p, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD)
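Filled out into a complete program, the outline might read as follows. One caveat: MPI forbids passing the same buffer as both send and receive argument (short of MPI_IN_PLACE), so this sketch scatters into separate local variables; the data values are assumed for illustration and the job must run with exactly 4 processes.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int a[4] = {1, 2, 3, 4};   /* illustrative data, used at root 0 */
    int b[4] = {5, 6, 7, 8};
    int my_a, my_b, p, sum, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* each node receives one component of a and one of b from root 0 */
    MPI_Scatter(a, 1, MPI_INT, &my_a, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Scatter(b, 1, MPI_INT, &my_b, 1, MPI_INT, 0, MPI_COMM_WORLD);

    p = my_a * my_b;           /* each node calculates its product */

    /* add one integer from each node and place the sum in root 0 */
    MPI_Reduce(&p, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("scalar product = %d\n", sum);  /* 1*5+2*6+3*7+4*8 = 70 */

    MPI_Finalize();
    return 0;
}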


Implementation Issues for Processor

Processor directly supporting message passing
  Synchronous model
    Message send/receive port
    Hardware blocking on port access
  Asynchronous model
    Message send/receive port or equivalent
    Internal send/receive buffer for messages
Interconnect support for message passing
  No direct support in processor
  Interconnect buffers messages for processor
  Synchronous model
    Processor blocks on send/receive instruction
    Interconnect sends/receives messages
  Asynchronous model
    Interconnect buffers and sends messages for processor
    Interconnect buffers received messages and interrupts processor


Message Passing Support in Alpha 21364

Ref: Kevin Krewell, "Alpha EV7 Processor", Microprocessor Report, 2002

[Figure: Alpha 21364 block diagram: RISC processor core, L2 cache, memory and I/O interfaces, message buffers, and an on-chip message-passing router.]


Message Passing Multiprocessor Configurations 


Cluster Computing

Large message passing distributed memory system
  Exploits MPI scalability
  Up to millions of nodes
Typical node
  Standard workstation
Node-to-node scale
  Physical bus
  Crossbar switch
  LAN
  WAN
Interconnection network
  LAN / WAN
  MPI_COMM_WORLD includes network addresses

[Figure: N workstation nodes (0 … N − 1), each with private memory addressed 0, … , A − 1, connected through a general network; one node provides I/O and a user interface to an external network.]


Benefits of Clusters

Scalability in parallel computing
  Message passing parallel algorithms scale to large numbers of nodes
  Typical communication / computation ratio is low to medium
  Less sensitive to latency than shared memory
Very high scalability for thread-replicating servers
  Web servers
  Database access servers
  On-line transaction processing (OLTP)
Availability
  Nodes provide backup to each other
  On device failure, load is transferred to another node within the cluster
  Transparent to user
Manageability
  Single system image ⇒ simple system restoration
  Automatic load balancing


MPI and Socket API

Socket API maps endpoints to socket numbers
  Socket number identifies connection to remote endpoint
  Endpoint = IP address:port
  Client — socket identifies connection to server endpoint
  Server — socket identifies connection from client endpoint
Systems with MPI and sockets (see the sketch after this list)
  MPI cluster implemented as client/server system
    Primary node (client) opens connection to each remote server
    Secondary nodes run MPI server listening on service port number
  MPI maps socket numbers to MPI rank construct
    Socket number provides ID number for node in MPI_COMM_WORLD
  Node performs MPI send to remote node
    Writes to socket for remote node
  Node performs MPI receive
    Reads from socket for remote node
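Schematically, that mapping amounts to the sketch below. It only illustrates the idea: a real MPI library adds message framing, tags, and matching, and the rank-to-socket table here is a hypothetical helper.

#include <unistd.h>
#include <sys/types.h>

extern int socket_for_rank[];  /* hypothetical: one connected TCP socket per rank */

/* MPI-style send mapped onto a socket write to the destination node */
ssize_t send_to_rank(const void *buf, size_t len, int dest_rank)
{
    return write(socket_for_rank[dest_rank], buf, len);
}

/* MPI-style receive mapped onto a socket read from the source node */
ssize_t recv_from_rank(void *buf, size_t len, int src_rank)
{
    return read(socket_for_rank[src_rank], buf, len);
}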


Network Latency in Cluster Computing

Communications overhead

Networking software

Network hardware

CPU clock cycle (CC) ≈ 10⁻⁹ sec
Message latency ≈ 150 × 10⁻⁶ sec ≈ 1.5 × 10⁵ CC
  Comparable to a page fault

Amdahl's law with communication overhead:

$$S = \frac{1}{\left(1 - F_P\right) + \dfrac{F_P}{N} + F_{overhead}}$$

$$F_P = \frac{\text{parallel instructions}}{\text{all instructions}} \qquad F_{overhead} = \frac{CPI_{comm}}{CPI} = \frac{\text{communication activity (CC per instruction)}}{\text{computation activity (CC per instruction)}}$$

Cluster computing works well for $F_P$ and $N$ large and $F_{overhead}$ small.

Ref: Cisco Systems, "Cluster Computing"
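For a feel of the numbers (values assumed for illustration, not from the slides): with $F_P = 0.95$, $N = 64$, and $F_{overhead} = 0.02$,

$$S = \frac{1}{(1 - 0.95) + \dfrac{0.95}{64} + 0.02} = \frac{1}{0.05 + 0.0148 + 0.02} \approx 11.8$$

so the serial fraction and the communication overhead, not the processor count, bound the achievable speedup.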


Hardware Support for Cluster Processing

TCP offload engine (TOE)
  Dedicated communication processor
  Processes TCP software for CPU(s) in node
    Segmentation/reassembly (SAR)
    Error checking
  Feature of mainframes since the 1960s
  Returning as asymmetric multicore processors
Remote Direct Memory Access (RDMA)
  Performs DMA functions between nodes

Ref: Cisco Systems, "Cluster Computing"


Typical Cluster Configuration

Node
  Shared memory SMP running tightly coupled code over OpenMP
High speed interconnect
  TCP/IP over Gigabit Ethernet (1 Gbps serial transmission)
Cluster of SMPs (CluMP)
  Nodes exchange data for loosely coupled code over MPI

https://computing.llnl.gov/tutorials/linux_clusters/


Thunder Cluster

CluMP-type cluster computer
  1024 Linux nodes
  Each node is a 4-CPU Itanium 2 "Madison Tiger4" SMP
Built at Lawrence Livermore National Laboratory (LLNL)
  US Department of Energy National Nuclear Security Administration
  Second fastest supercomputer in June 2004 (http://www.top500.org)
  Peak performance: 23 TFLOPS

https://computing.llnl.gov/tutorials/linux_clusters/thunder.html


Node Technical Details

1024 nodes
  4-CPU 1.4 GHz Itanium 2 "Madison Tiger4"
  VLIW machine with explicitly parallel RISC instruction set
  8 GB DDR per node (8 TB total)
  16 KB L1 data cache
  16 KB L1 instruction cache
  256 KB L2 cache
  4 MB L3 cache
Intel E8870 chipset
  6.4 GB/sec switch
  64-bit 133 MHz PCI-X


Interconnect Details

Quadrics QsNetII
  PCI-to-PCI switch
  900 MB/s of user space to user space bandwidth
  Optical interconnect extends maximum link length to 100 m
  Fat tree topology
    Up to 4096 nodes in 5 switching stages
Switch built from Elan4 ASIC
  8-port custom switch
  Integrated DMA controller
  Supports collective communication
    Barrier synchronization
    Broadcast
    Multicast

[Figures: fat-tree examples: a 1-stage 4-node system, a 2-stage 16-node system, and a 3-stage 128-node system.]

http://www.quadrics.com


Blue Gene/L

Massively parallel supercomputer
  65,536 dual-processor nodes
  32 TB (32,768 GB) main memory
  Based on IBM system-on-a-chip (SOC) technology
  Peak performance of 596 teraflops
Built at Lawrence Livermore National Laboratory (LLNL)
  US Department of Energy National Nuclear Security Administration
  2nd fastest supercomputer in June 2008 (1st in 2007)
Target applications
  Large compute-intensive problems
  Simulation of physical phenomena
  Offline data analysis
Architectural goals
  High performance on target applications
  Cost/performance of typical server

Ref: Gara et al., "Overview of the Blue Gene/L system architecture", IBM Journal of Research and Development


Target Applications

Simulations of physical phenomena
  Represent physical state variables on a lattice
  Time evolution through sequence of iterations
    Exchange state variables among lattice points
    Update local state variables according to physical laws
Short-range physical interactions
  Exchange state variables within small volume of lattice points
Spectral methods (FFT and others)
  Require long-range communication between lattice points
  Preferable for good convergence behavior
Mapping lattice sites onto multiple processors straightforward
  Each node allocated subset of problem
  Spatial decomposition ⇒ nodes communicate with small group of nodes
Global reduction
  Normalization over full data set
  Decision step in algorithm


Scaling Issues in Very Large Systems

Cost / performance ratio scales badly at massive scale
Equipment overhead
  Equipment racks
  Floor space
  Electrical power
  Cooling
Communication complexity
  Fully connected network grows as (number of processors)²
  Raises communication overhead $F_{overhead}$
Computing power versus electrical power
  Performance / rack = (Performance / Watt) × (Watts / rack)
  Watts / rack ~ constant ~ 20 kW / rack
  Rack performance limited by Performance / Watt
  Performance / Watt scales badly with performance


Architectural Decisions

High integration and hardware density
  Fewer racks ⇒ lower rack overhead
  Higher performance per rack
  Shorter and simpler communication paths
Requirements
  High performance / Watt
  Medium performance processors
    High performance processor ⇒ high ILP at high power
    TLP more significant than ILP in target applications
  Highly integrated system on a chip (SOC)
    Multiple cores
    Processor + memory + I/O on single integrated circuit
    High performance interconnect
    Application specific integrated circuit (ASIC)


Implementation

65,536 nodes
  130-nm copper IBM CMOS 8SFG technology
  Single ASIC
    Two processors
    Nine DDR SDRAM chips
Nodes physically small
  2 to 10 times higher density than high-frequency uniprocessors
1,024 dual-processor nodes per rack
  27.5 kW of total power
  85% of inter-node connectivity contained within racks
Five interconnection networks
  High-connectivity static switching fabric for bulk communications
  Collective network supporting scatter/gather/broadcast/reduction
  Barrier network supporting barrier synchronization
  Control system network
  Gigabit Ethernet network supporting I/O


Node ASIC

[Figure: Blue Gene/L node ASIC block diagram.]


Interconnect Details

Static interconnection network
  64 × 32 × 32 three-dimensional torus for bulk communication
  Nodes communicate with neighboring nodes
  Same bandwidth and nearly same latency to all nodes
  No edges in torus configuration
    Simplifies programming model
  Nearest-neighbor link rate is 1.4 Gb/s
Collective network
  Global broadcast/scatter/gather of data
  Improvement over 3D torus network: 1% to 10% of typical supercomputer latency on collective operations
  Built-in ALU hardware on network
    Supports reduction operations: min, max, sum, OR, AND, XOR
Barrier network
  Propagates OR of nodes to trigger interrupts
  Propagates AND of nodes for synchronization

[Figure: a 2 × 2 × 2 torus.]


IBM Sequoia — Blue Gene/Q

Massively parallel supercomputer
  96 K (98,304) 16-core nodes
  1.6 PB (1.6 × 1024 × 1024 GB) main memory
  Based on IBM POWER system-on-a-chip (SOC)
  Peak performance of 16 petaflops
Built at Lawrence Livermore National Laboratory (LLNL)
  US Department of Energy National Nuclear Security Administration
  Fastest supercomputer in June 2012
Operating systems
  Red Hat Enterprise Linux on I/O nodes
    Connect to file system
  Compute Node Linux (CNL) on application processors
    Runtime environment based on Linux kernel
Target applications
  Advanced Simulation and Computing Program
  Simulated testing of US nuclear arsenal
    Nuclear detonations banned since 1992


Example of Cluster Application

Parallel multiplication of N × N matrix by vector:

$$c[i] = \sum_{j=0}^{N-1} A[i][j] \cdot b[j] = A[i] \cdot b$$

Medium coupled computation
  Works well on shared memory SMP
  Dominated by communication overhead on cluster

Manager code (a runnable version follows below)
  setup
  initialize matrix A[i][j]
  for (i=0; i<N; i++)
      MPI_Send row A[i] to rank i
  for (i=0; i<N; i++)
      MPI_Recv c[i] as row * vector

Worker code (rank i)
  setup
  initialize vector b[j]
  MPI_Recv row A[i]
  dotp = 0.0
  for (j=0; j<N; j++)
      dotp += A[i][j] * b[j]
  MPI_Send dotp
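Filled out into a complete program, the scheme might read as follows; a sketch assuming the job is launched with N + 1 processes (rank 0 as manager, ranks 1 … N as workers) and illustrative matrix data:

#include <stdio.h>
#include <mpi.h>

#define N 4

int main(int argc, char **argv)
{
    int rank;
    double b[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int j = 0; j < N; j++)       /* every process initializes b */
        b[j] = 1.0;

    if (rank == 0) {                  /* manager */
        double A[N][N], c[N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                A[i][j] = i + j;      /* illustrative matrix */
        for (int i = 0; i < N; i++)   /* send row i to worker i+1 */
            MPI_Send(A[i], N, MPI_DOUBLE, i + 1, 0, MPI_COMM_WORLD);
        for (int i = 0; i < N; i++)   /* collect c[i] = A[i] dot b */
            MPI_Recv(&c[i], 1, MPI_DOUBLE, i + 1, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        for (int i = 0; i < N; i++)
            printf("c[%d] = %g\n", i, c[i]);
    } else if (rank <= N) {           /* worker: one row, one dot product */
        double row[N], dotp = 0.0;
        MPI_Recv(row, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int j = 0; j < N; j++)
            dotp += row[j] * b[j];
        MPI_Send(&dotp, 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}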


Speedup?

Computations
  Each row of vector result
    N multiplications
    N − 1 additions
    2N − 1 flops
Communications
  Send N × N matrix ⇒ N² floats
  Return N vector components ⇒ N floats
  Total = N² + N floats sent
  Send (N² + N) / (2N − 1) floats per flop

Overhead factor:

$$F_{overhead} = \frac{CPI_{comm}}{CPI_{processor}} = \frac{(N^2 + N) \times CC_{communication}}{(2N - 1) \times CC_{processor}} \sim \frac{N}{2} \quad \text{even on a fast interconnect } (CC_{communication} \approx CC_{processor})$$

Amdahl's law with communication overhead, with $F_P \approx 1$:

$$S = \frac{1}{\left(1 - F_P\right) + \dfrac{F_P}{N} + F_{overhead}} \approx \frac{1}{\dfrac{1}{N} + F_{overhead}} \sim \frac{2}{N} \to 0 \quad \text{for } N \text{ large}$$


Example of Reasonable Cluster Application

Calculate numerically:

$$\pi = \int_0^1 \frac{4}{1 + x^2}\,dx$$

Sequential version
  N steps

  step = 1.0 / N;
  for (i = 0; i < N; i++) {
      x = (i + 0.5) * step;
      sum = sum + 4.0 / (1.0 + x*x);
  }
  pi = step * sum;

Loop computations
  N × (sum, product, product, sum, division, sum) + product
  ~ 6N + 1 flops

Cluster version (a runnable sketch follows below)
  N processors; processor i computes

  x = (i + 0.5) * step;
  sum = 4.0 / (1.0 + x*x);
  MPI_Reduce(&sum, &p, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  pi = step * p;

Computations
  sum, product, product, sum, division + reduce_add
  ~ 6 flops
Communications
  Send / receive 1 float per 6 flops
  $F_{overhead} \sim 2 \times \tfrac{1}{6} = \tfrac{1}{3} \;\Rightarrow\; S \sim 3$
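A runnable sketch of the cluster version, generalized so that P processes share the N integration steps rather than taking one step per processor:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int N = 1000000;        /* number of integration steps */
    int rank, size;
    double step, x, sum = 0.0, p = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    step = 1.0 / N;
    for (int i = rank; i < N; i += size) {  /* each process takes every size-th step */
        x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    /* add the partial sums from all processes into p on root 0 */
    MPI_Reduce(&sum, &p, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi ~= %.12f\n", step * p);

    MPI_Finalize();
    return 0;
}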


Examples of Good Cluster Applications

Thread-replicating servers
  Web servers
  Database access servers
  On-line transaction processing (OLTP)
Number crunching on separable data
  Numerical integration
  Ray tracing
  VLSI layout planning
Numerical solutions for 2nd order partial differential equations
  Electrodynamics and optics
  Quantum mechanics and quantum chemistry
  Thermodynamics, hydrodynamics, aerodynamics
  Plasma magnetohydrodynamics


Learning Curve in Parallel Programming

Small scale projects
  Single programmer
  "Embarrassingly" parallel
    Obvious task/data decomposition
  Follow design rules
    Vector constructs
    OpenMP constructs
    Structure optimization
    Sequential consistency
Medium scale projects
  Coordination — folklore or science
  Debugging consistency across team of programmers
Large scale projects
  High performance computing (HPC) for difficult mathematical problems
  Requires expertise in programming and heuristics of mathematical problem
  Programmers become specialists in specific HPC field


Empirical Results

Study of student programmers
  Graduate-level courses in parallel programming
  Students implement "game of life" program in C on PC cluster
  Compare serial with parallel solutions (OpenMP and MPI)

Speedup on parallel version with N = 8 compared to parallel version with N = 1:

  Standard Deviation   Mean
  2.5                  5.6
  2.0                  4.8
  1.9                  2.8
  3.0                  5.7   OpenMP
  2.1                  5.0   MPI

Speedup on 8 processors, parallel version compared to sequential:

  Programming model   Mean   Standard Deviation
  OpenMP              6.7    9.1
  MPI                 4.7    1.97