
Page 1: PCOD: Lecture 1

Per Stenström © 2008, Sally A. McKee © 2009

7.5 credit points

Instructor: Sally A. McKee

[Figure: IBM SP-2 node — Power 2 CPU with L2 $ on the memory bus, memory controller with 4-way interleaved DRAM, and a MicroChannel I/O bus carrying DMA, an i860-based NI, and the NIC; nodes are joined by a general interconnection network formed from 8-port switches]

Page 2: Info

INFO
• I need your names and email addresses!
• Class on MONDAY 1/24 at 3:15, place TBA
• NO CLASS T/Th 1/25, 1/27 (SAM away)
• NO LAB/EXERCISES THIS WEEK
• NO LAB THURSDAY 2/3 (we'll start week 3)
• Exercises next WEDNESDAY 2/2 at 10:00, place TBA
• Class on MONDAY 1/31 at 3:15, place TBA
• Web page coming over the weekend: http://www.cse.chalmers.se/~mckee/courses/EDA281.html
• Books are being copied; will distribute on Monday
• NO EXAM; final survey papers instead


Page 3: What is a parallel computer?

A parallel computer is a collection of processing elements that cooperate to solve large problems (fast)

Page 4: Why parallel computers?

New performance-demanding applications

Killer microprocessors

A collection of killer microprocessors (integrated on the same chip)

Economics

Technology trends

Page 5: Broad issues

Programming issues: What programming model?
Performance — cross-cutting issues: the impact of system design tradeoffs on application performance, and the impact of application design on performance
Architectural model issues: How big a collection? How powerful are the elements? How do the elements cooperate?

System interface

A parallel computer is a collection of processing elements that cooperate to solve large problems (fast)

Page 6: Goal and overview

The goal of this course is to provide knowledge of
• programming models and techniques for the design of high-performance parallel programs: the data parallel model, the shared address-space model, the message-passing model
• design principles for parallel computers: small-scale system design tradeoffs, scalable system design tradeoffs, interconnection networks

Page 7: Overview of parallel computer technology — what is it, what is it for, and what are the issues?

• Driving forces behind parallel computers (1.1)
• Evolution behind today's parallel computers (1.2)
• Fundamental design issues (1.3)
• Methodology for designing parallel programs (2.1–2.2)

Page 8: Three driving forces

Application demands (coarse-grain parallelism abounds):
• Scientific computing (e.g., modeling of phenomena in science)
• Engineering computing (e.g., CAD and design analysis)
• Commercial computing (e.g., media and information processing)
Technology trends:
• Transistor density growth is high
• Clock frequency improvement is moderate
Architecture trends:
• Diminishing returns on instruction-level parallelism

Page 9: Parallelism in sequential programs

A sequential program on a superscalar processor:
• Programming model: sequential
• Architecture: instruction-level parallelism, register (memory) communication, pipeline interlocking for synchronization
The gap between model and architecture has increased.

for i = 0 to N-1
    a[(i+1) mod N] := b[i] + c[i];
for i = 0 to N-1
    d[i] := C*a[i];

Iteration:  0     1     ...  N-1
Loop 1:     a[1]  a[2]  ...  a[0]
Loop 2:     a[0]  a[1]  ...  a[N-1]
(data dependencies between the loops)
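To make the cross-loop dependence concrete, here is a minimal C sketch of the same two loops (N, C, and the array contents are illustrative assumptions, not from the slides):

    /* Loop 1 writes a[(i+1) mod N]; loop 2 iteration i then reads a[i],
       which loop 1 produced in iteration (i-1+N) mod N. */
    #include <stdio.h>
    #define N 8
    int main(void) {
        float a[N], b[N], c[N], d[N];
        const float C = 2.0f;
        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 1.0f; a[i] = 0.0f; }
        for (int i = 0; i < N; i++)        /* loop 1 */
            a[(i + 1) % N] = b[i] + c[i];
        for (int i = 0; i < N; i++)        /* loop 2 */
            d[i] = C * a[i];
        printf("d[0] = %g\n", d[0]);       /* uses a[0], written in loop 1's last iteration */
        return 0;
    }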

Page 10: A parallel programming model

Extended semantics to express
• units of parallelism at the instruction level, thread level, or program level
• communication and coordination between units of parallelism at the register level, memory level, or I/O level

Page 11: Programming model vs. parallel architecture

[Figure: layered view — parallel applications (CAD, databases, scientific modeling) atop the programming models (multiprogramming, shared address, message passing, data parallel); the communication abstraction forms the user/system boundary, implemented by a compiler or library and operating system support over the communication hardware and physical communication medium, with the hardware/software boundary in between]

Three key concepts:
• Communication abstraction supports programming models
• Communication architecture (ISA plus primitives for communication/synchronization)
• Hardware/software boundary defines which parts of the communication architecture are implemented in hardware and which in software

Page 12: Shared address space (SAS) model

Programming model:
• Parallelism: parts of a program, called threads
• Communication and coordination among threads through a shared global address space

for_all i = 0 to P-1
    for j = i0[i] to in[i]
        a[(j+1) mod N] := b[j] + c[j];
barrier;
for_all i = 0 to P-1
    for j = i0[i] to in[i]
        d[j] := C*a[j];

[Figure: P processors attached to a single shared Memory]

Communication abstraction supported by the HW/SW interface
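As a concrete, hedged illustration of the same pattern: with OpenMP, threads share the arrays through one address space, and the implicit barrier at the end of each parallel for plays the role of barrier (the runtime, rather than the i0/in tables above, chooses each thread's iterations):

    /* Minimal OpenMP sketch of the SAS version; N and the arrays are assumptions. */
    #include <omp.h>
    #define N 1024
    float a[N], b[N], c[N], d[N];
    const float C = 2.0f;

    void sweep(void) {
        #pragma omp parallel
        {
            #pragma omp for                   /* threads split the iterations of loop 1 */
            for (int j = 0; j < N; j++)
                a[(j + 1) % N] = b[j] + c[j];
            /* implicit barrier here: no thread starts loop 2 before loop 1 finishes */
            #pragma omp for                   /* loop 2 reads a[] written by other threads */
            for (int j = 0; j < N; j++)
                d[j] = C * a[j];
        }
    }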

Page 13: Message passing model

Programming model:
• Process-level parallelism (private address spaces)
• Communication and coordination via explicit messages

A first variant sends every updated element:

for_all i = 0 to P-1
    for j = i0[i] to in[i]
        index = (j+1) mod N;
        a[index] := b[j] + c[j];
        send(a[index], (j+1) mod P);
    end_for
barrier;
for_all k = 0 to P-1
    for j = i0[k] to in[k]
        recv(tmp, (P+j-1) mod P);
        d[j] := C*tmp;
    end_for

A refined variant communicates only the boundary elements the neighbor actually needs:

for_all i = 0 to P-1
    for j = i0[i] to in[i]
        index = (j+1) mod N;
        a[index] := b[j] + c[j];
        if j = in[i] then send(a[index], (j+1) mod P, a[j]);
    end_for
barrier;
for_all i = 0 to P-1
    for j = i0[i] to in[i]
        if j = i0[i] then recv(tmp, (P+j-1) mod P, a[j]);
        d[j] := C * tmp;
    end_for
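For comparison, a hedged MPI sketch of the refined variant's boundary exchange (block size, tags, and the constant C are illustrative assumptions, not the book's code):

    /* Each process updates its private block, then passes its boundary element
       to the next process, which needs it for the second loop. */
    #include <mpi.h>
    #define NLOCAL 256
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int pid, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &pid);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        float a[NLOCAL + 1], b[NLOCAL], c[NLOCAL], d[NLOCAL], tmp;
        for (int j = 0; j < NLOCAL; j++) { b[j] = j; c[j] = 1.0f; }
        for (int j = 0; j < NLOCAL; j++)      /* loop 1 on the private block */
            a[j + 1] = b[j] + c[j];
        int next = (pid + 1) % nprocs, prev = (pid + nprocs - 1) % nprocs;
        MPI_Sendrecv(&a[NLOCAL], 1, MPI_FLOAT, next, 0,   /* send my boundary */
                     &tmp,       1, MPI_FLOAT, prev, 0,   /* receive neighbor's */
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        a[0] = tmp;
        for (int j = 0; j < NLOCAL; j++)      /* loop 2 uses the received value */
            d[j] = 2.0f * a[j];
        MPI_Finalize();
        return 0;
    }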

Page 14: Data parallel systems (SIMD)

Programming model:
• Operations performed in parallel on each element of a data structure
• Logically, a single thread of control performing sequential or parallel steps
• Conceptually, a processor associated with each data element

Architectural model:
• Array of many simple, cheap processors, each with little memory (processors don't sequence through instructions)
• Attached to a control processor that issues instructions
• Specialized and general communication, cheap global synchronization

Original motivations:
• Matches simple differential equation solvers
• Centralizes the high cost of instruction fetch/sequencing

[Figure: control processor driving a grid of PEs]

Page 15: Pros and cons of data parallelism

Example:

parallel (i: 0 -> N-1)
    a[(i+1) mod N] := b[i] + c[i];
parallel (i: 0 -> N-1)
    d[i] := C * a[i];

Evolution and convergence:
• Popular when the cost savings of a centralized sequencer are high
• Parallelism is limited to specialized, regular computations
• Much of the parallelism can be exploited at the instruction level
• Coarser levels of parallelism can be exposed for multiprocessors and message-passing machines

New data parallel programming model: SPMD (Single-Program Multiple-Data)
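A hedged C sketch of the SPMD idea: every process executes the same program text and uses its id to select its share of the data (pid/nprocs would come from the runtime; all names are placeholders):

    /* SPMD: same program for all processes; behavior differs only by pid. */
    #define N 1024
    void spmd_body(int pid, int nprocs, float *a, float *b, float *c, float *d) {
        int chunk = N / nprocs;                /* assume nprocs divides N */
        int lo = pid * chunk, hi = lo + chunk;
        for (int i = lo; i < hi; i++)          /* my slice of loop 1 */
            a[(i + 1) % N] = b[i] + c[i];
        /* a global barrier would go here before the second phase */
        for (int i = lo; i < hi; i++)          /* my slice of loop 2 */
            d[i] = 2.0f * a[i];
    }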

Page 16: A generic parallel architecture

A generic modern multiprocessor (shared address or message passing architecture):

Node: processor(s), memory system, plus a communication assist

• Network interface and communication controller

• Scalable network

• Convergence allows lots of innovation, now within framework

• Integration of assist with node, what operations, how efficiently...

[Figure: node containing processor (P), cache ($), memory (Mem), and a communication assist (CA), attached to a scalable network]

Page 17: Why study parallel programming issues?

From a software/algorithm designer's point of view:
• High-performance software is the key motivation for parallel computers
• Parallel compiler technology is far from being as mature as compiler technology for single processors (uniprocessors)
From a system designer's point of view:
• Understanding hardware/software interaction is key to making architectural tradeoffs for high performance

It is also important to understand the tradeoff between performance and the programming effort involved.

Page 18: Creating a parallel program

Assumption: a sequential algorithm is given. Pieces of the job:
• Identify work that can be done in parallel
• Partition work and perhaps data among processes
• Manage data access, communication, and synchronization
Note: work includes computation, data access, and I/O

Main goal: Speedup (plus low programming effort and resources needed)

Speedup(p) = Performance(p) / Performance(1)

For a fixed problem:

Speedup(p) = Time(1) / Time(p)

Page 19: Steps in creating a parallel program

Four steps: Decomposition, Assignment, Orchestration, Mapping
• Done by the programmer or by system software (compiler, runtime, ...)
• The issues are the same, so just assume the programmer does it

[Figure: sequential computation → Decomposition → tasks → Assignment → processes p0–p3 → Orchestration → parallel program → Mapping → processors P0–P3; decomposition plus assignment together form partitioning. Decomposition and assignment are largely architecture-independent; orchestration and mapping are largely architecture-dependent]

Page 20: Some important concepts

Task: an arbitrary piece of the total work in a parallel computation
• Executed sequentially — concurrency is only across tasks
• Fine-grained versus coarse-grained tasks

Process (thread): an entity that is eventually executed by a CPU
• An abstract entity that performs the tasks assigned to it
• Processes communicate and synchronize to perform their tasks

Processor: the physical engine on which a process executes
• Processes virtualize the machine to the programmer: first write the program in terms of processes, then map processes to processors

Page 21: Decomposition

Purpose: break up the computation into tasks to be divided among processes
• Tasks may become available dynamically
• The number of available tasks may vary with time
i.e., identify concurrency and decide the level at which to exploit it

Goal: enough tasks to keep processes busy, but not too many (keep task management reasonable)
• The number of tasks available at a time is an upper bound on the achievable speedup

Page 22: Limited concurrency: Amdahl's Law

Amdahl's Law is the most fundamental limitation on parallel speedup: if a fraction s of the sequential execution is inherently serial, then speedup ≤ 1/s. (For example, if s = 0.2, the speedup can never exceed 5, no matter how many processors are used.)

Example: a two-phase calculation
• Phase 1: sweep over an n-by-n grid and do some independent computation (time: n²/p)
• Phase 2: sweep again and add each value into a global sum (time: n², since updates to the single global sum serialize)

Improved version — the trick is to divide the second phase in two:
• accumulate into a private sum during the sweep
• add the per-process private sums into the global sum
Parallel time is n²/p + n²/p + p, and the speedup is at best 2n²p / (2n² + p²)
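As a quick numeric sanity check of that bound (n = 1000 is an illustrative choice):

    /* Evaluate parallel time 2n^2/p + p and the resulting speedup bound. */
    #include <stdio.h>
    int main(void) {
        double n = 1000.0;
        for (int p = 1; p <= 1024; p *= 4) {
            double t_par = 2.0 * n * n / p + p;       /* n^2/p + n^2/p + p */
            double speedup = (2.0 * n * n) / t_par;   /* = 2n^2 p / (2n^2 + p^2) */
            printf("p = %4d  speedup = %.1f\n", p, speedup);
        }
        return 0;
    }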

Page 23: Graphical representation of example

[Figure: work done concurrently vs. time — (a) one processor does all 2n² work serially; (b) phase 1 runs at concurrency p for time n²/p, but phase 2 is a serial n²; (c) with private sums, both sweeps run at concurrency p for time n²/p each, followed by a short serial accumulation of length p]

Page 24: Assignment

Specify a mechanism to divide the work up among processes
Goal: balance the work among processes, and reduce communication and management costs

Structured approaches usually work well:
• Code inspection (parallel loops) or understanding of the application
• Well-known heuristics
• Static versus dynamic assignment

Programmers worry about decomposition and assignment first
• Largely independent of architecture or programming model
• But the cost and complexity of using primitives may affect decisions
As architects, we assume the program does a reasonable job of it

Page 25: Orchestration

Purpose:
• Naming data, structuring communication and synchronization
• Organizing data structures, scheduling tasks temporally

Goals:
• Reduce the costs of communication and synchronization as seen by processors
• Enhance the locality of data references
• Reduce the overhead of parallelism management

Orchestration is closest to the architecture (and programming model and language):
• Choices depend heavily on the communication abstraction and the efficiency of primitives
• Architects must provide appropriate, efficient primitives

Page 26: Mapping

After orchestration, a parallel program exists. Two aspects of mapping:
• Which processes will run on the same processor, if necessary
• Which process runs on which particular processor

One extreme: space-sharing
• The machine is divided into subsets, with only one application at a time in a subset
• Processes can be pinned to processors, or the OS can balance workloads
Another extreme: complete resource management control to the OS
• The OS uses the performance techniques we will discuss later
The real world is between the two: the user specifies desires in some aspects, but the system may ignore them

Page 27: High-level goals

High performance (speedup over the equivalent sequential program), but with low resource usage and development effort
Implications for algorithm designers and architects:
• Algorithm designers: high performance, low resource needs
• Architects: high performance, low cost, reduced programming effort

Table 2.1: Steps in the Parallelization Process and Their Goals

Step          | Architecture-Dependent? | Major Performance Goals
Decomposition | Mostly no               | Expose enough concurrency, but not too much
Assignment    | Mostly no               | Balance workload; reduce communication volume
Orchestration | Yes                     | Reduce noninherent communication via data locality; reduce communication and synchronization cost as seen by the processor; reduce serialization at shared resources; schedule tasks to satisfy dependences early
Mapping       | Yes                     | Put related processes on the same processor if necessary; exploit locality in network topology

Page 28: Partitioning (Ch. 3.1) [in 1.5 weeks] (= decomposition + assignment)

[Figure: processes P with communication costs between them]

Part II — Textbook Reference: Ch. 2.3

Applying the methodology to an equation solver

Page 29: Sequential implementation

10. procedure Solve (A)          /* solve the equation system */
11.   float **A;                 /* A is an (n+2)-by-(n+2) array */
12. begin
13.   int i, j, done = 0;
14.   float diff = 0, temp;
15.   while (!done) do           /* outermost loop over sweeps */
16.     diff = 0;                /* initialize maximum difference to 0 */
17.     for i ← 1 to n do        /* sweep over nonborder points of grid */
18.       for j ← 1 to n do
19.         temp = A[i,j];       /* save old value of element */
20.         A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                         A[i,j+1] + A[i+1,j]);  /* compute average */
22.         diff += abs(A[i,j] - temp);
23.       end for
24.     end for
25.     if (diff/(n*n) < TOL) then done = 1;
26.   end while
27. end procedure

Page 30: Decomposition 1(3)

Inherent concurrency in the loop structure:
• Dependences with the north and west grid points
• The loops are inherently sequential
Inherent concurrency ignoring the loop structure:
• Concurrency O(n) along anti-diagonals, serialization O(n) across anti-diagonals
• Result: load imbalance and many synchronizations

Page 31: Decomposition 2(3)

Inherent concurrency in the algorithm: the red-black ordering
• A different ordering of updates: may converge more quickly or more slowly
• The red sweep and the black sweep are each fully parallel
• Global synchronization between them (conservative but convenient)

[Figure: checkerboard grid of red and black points]
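A hedged C sketch of one red-black sweep (the grid size and the (i+j) parity convention are assumptions): points of one color depend only on the other color's values, so all points of a color can be updated in parallel.

    /* One red-black sweep over the interior of an (n+2)-by-(n+2) grid. */
    void red_black_sweep(int n, float A[n + 2][n + 2]) {
        for (int color = 0; color < 2; color++) {   /* 0 = red, 1 = black */
            for (int i = 1; i <= n; i++)
                for (int j = 1; j <= n; j++)
                    if ((i + j) % 2 == color)       /* all same-color points are independent */
                        A[i][j] = 0.2f * (A[i][j] + A[i][j - 1] + A[i - 1][j] +
                                          A[i][j + 1] + A[i + 1][j]);
            /* a global barrier would separate the red and black sweeps in parallel code */
        }
    }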

Page 32: Decomposition 3(3)

Decomposition into elements: degree of concurrency n²
To decompose into rows instead, make the loop at line 18 sequential: degree of concurrency n
for_all leaves the assignment to the system, but implies a global synchronization at the end of each for_all loop

15. while (!done) do             /* a sequential loop */
16.   diff = 0;
17.   for_all i ← 1 to n do      /* a parallel loop nest */
18.     for_all j ← 1 to n do
19.       temp = A[i,j];
20.       A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                       A[i,j+1] + A[i+1,j]);
22.       diff += abs(A[i,j] - temp);
23.     end for_all
24.   end for_all
25.   if (diff/(n*n) < TOL) then done = 1;
26. end while

Ignore the dependences — the solution will converge anyway

Page 33: Assignment

Static assignment (given the decomposition into rows):
• Block assignment (see figure): reduces communication, but may introduce load imbalance
• Cyclic assignment: process i is assigned rows i, i+p, i+2p, ...
Dynamic assignment (let the system do it):
• Each process grabs a new row when finished with its current row

[Figure: block assignment of contiguous bands of rows to processes P0, P1, P2, ...]
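A minimal C sketch of the two static schemes (update_row is a placeholder for the per-row sweep; assume nprocs divides n):

    void update_row(int i);                         /* placeholder for the row sweep */

    void block_rows(int pid, int nprocs, int n) {
        int rows = n / nprocs;
        int mymin = 1 + pid * rows, mymax = mymin + rows - 1;
        for (int i = mymin; i <= mymax; i++)        /* one contiguous band of rows */
            update_row(i);
    }

    void cyclic_rows(int pid, int nprocs, int n) {
        for (int i = 1 + pid; i <= n; i += nprocs)  /* rows pid, pid+p, pid+2p, ... */
            update_row(i);
    }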

Page 34: Solver under data parallel model

10. procedure Solve(A)           /* solve the equation system */
11.   float **A;                 /* A is an (n+2)-by-(n+2) array */
12. begin
13.   int i, j, done = 0;
14.   float mydiff = 0, temp;
14a.  DECOMP A[BLOCK,*, nprocs];
15.   while (!done) do           /* outermost loop over sweeps */
16.     mydiff = 0;              /* initialize maximum difference to 0 */
17.     for_all i ← 1 to n do    /* sweep over nonborder points of grid */
18.       for_all j ← 1 to n do
19.         temp = A[i,j];       /* save old value of element */
20.         A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                         A[i,j+1] + A[i+1,j]);  /* compute average */
22.         mydiff += abs(A[i,j] - temp);
23.       end for_all
24.     end for_all
24a.    REDUCE (mydiff, diff, ADD);
25.     if (diff/(n*n) < TOL) then done = 1;
26.   end while
27. end procedure

Important observations:
• The matrix is shared across processes
• All processes do the same operation in parallel, in lock-step
• Orchestration is easy: no explicit communication or synchronization

Three primitives:
• DECOMP does the assignment
• for_all distributes the work
• REDUCE accumulates the local sums into a global sum

Page 35: Solver under SAS model 1(3)

• All processes have separate control (in this example they do the same operations: the SPMD model)
• Assignment is controlled by loop indices
• All processes share the matrix but do not work in lock-step — orchestration focuses on synchronization

[Figure: processes each run Solve; in every iteration they sweep, then jointly test convergence]

Page 36: Solver under SAS model 2(3)

10. procedure Solve(A)
11.   float **A;                            /* A is entire n+2-by-n+2 shared array,
                                               as in the sequential program */
12. begin
13.   int i, j, pid, done = 0;
14.   float temp, mydiff = 0;               /* private variables */
14a.  int mymin = 1 + (pid * n/nprocs);     /* assume that n is exactly divisible by */
14b.  int mymax = mymin + n/nprocs - 1;     /* nprocs for simplicity here */
15.   while (!done) do                      /* outermost loop over sweeps */
16.     mydiff = diff = 0;                  /* set global diff to 0 (okay for all to do it) */
16a.    BARRIER(bar1, nprocs);              /* ensure all reach here before anyone modifies diff */
17.     for i ← mymin to mymax do           /* for each of my rows */
18.       for j ← 1 to n do                 /* for all nonborder elements in that row */
19.         temp = A[i,j];
20.         A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                         A[i,j+1] + A[i+1,j]);
22.         mydiff += abs(A[i,j] - temp);
23.       endfor
24.     endfor
25a.    LOCK(diff_lock);                    /* update global diff if necessary */
25b.    diff += mydiff;
25c.    UNLOCK(diff_lock);
25d.    BARRIER(bar1, nprocs);              /* ensure all reach here before checking if done */
25e.    if (diff/(n*n) < TOL) then done = 1;  /* check convergence; all get same answer */
25f.    BARRIER(bar1, nprocs);
26.   endwhile
27. end procedure

Code for a single process. Main changes to the program:
• Assignment
• Synchronization

Synchronization primitives:
• BARRIER: catch all processes before any is allowed to proceed
• LOCK/UNLOCK: enforce mutual exclusion

Page 37: Mutual exclusion 1(2)

Code each process executes:
    load  r1 <- diff
    add   r1, r2, r1
    store diff <- r1

A possible interleaving (assume diff = 0 and each process's r2 = 1):
P1: r1 <- diff        {P1 gets 0 in its r1}
P2: r1 <- diff        {P2 also gets 0}
P1: r1 <- r1 + r2     {P1 sets its r1 to 1}
P2: r1 <- r1 + r2     {P2 sets its r1 to 1}
P1: diff <- r1        {P1 sets diff to 1}
P2: diff <- r1        {P2 also sets diff to 1}
One update is lost: diff ends up at 1 instead of 2.

The sets of operations must be atomic (mutually exclusive)

Page 38: Mutual exclusion 2(2)

Provided by LOCK–UNLOCK around critical sections (code segments requiring mutual exclusion):
• The set of operations we want to execute atomically
• LOCK/UNLOCK implementations must guarantee mutual exclusion (atomicity)

Can lead to significant serialization under contention:
• Non-local accesses are expected in the critical section
• Another reason to use a private mydiff variable for partial accumulation
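A hedged POSIX-threads sketch of the LOCK/UNLOCK idiom (names mirror the solver; error handling omitted):

    #include <pthread.h>

    static pthread_mutex_t diff_lock = PTHREAD_MUTEX_INITIALIZER;
    static float diff = 0.0f;

    void accumulate(float mydiff) {
        pthread_mutex_lock(&diff_lock);    /* LOCK(diff_lock) */
        diff += mydiff;                    /* critical section: one updater at a time */
        pthread_mutex_unlock(&diff_lock);  /* UNLOCK(diff_lock) */
    }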

Page 39: Global event synchronization

BARRIER(nprocs): wait here until nprocs processes arrive
• Built using lower-level primitives
• Used to separate phases of computation

Process P_1              Process P_2              Process P_nprocs
set up eqn system        set up eqn system        set up eqn system
Barrier(name, nprocs)    Barrier(name, nprocs)    Barrier(name, nprocs)
solve eqn system         solve eqn system         solve eqn system
Barrier(name, nprocs)    Barrier(name, nprocs)    Barrier(name, nprocs)
apply results            apply results            apply results
Barrier(name, nprocs)    Barrier(name, nprocs)    Barrier(name, nprocs)

A conservative way of preserving dependences, but easy to use

Point-to-point event synchronization possible — see text
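A minimal POSIX sketch of barrier-separated phases (the phase functions and nprocs are placeholders):

    #include <pthread.h>

    void set_up_eqn_system(void);
    void solve_eqn_system(void);
    void apply_results(void);

    static pthread_barrier_t bar;          /* initialized elsewhere with
                                              pthread_barrier_init(&bar, NULL, nprocs) */
    void *worker(void *arg) {
        set_up_eqn_system();
        pthread_barrier_wait(&bar);        /* everyone finishes setup first */
        solve_eqn_system();
        pthread_barrier_wait(&bar);        /* everyone finishes solving next */
        apply_results();
        pthread_barrier_wait(&bar);
        return 0;
    }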

Page 40: Solver under message passing 1(2)

Private rather than shared address spaces cause many differences from the SAS (shmem) model:
• A can no longer be declared as a shared array
• It must be composed logically from per-process private arrays, usually allocated in accordance with the assignment of work: a process assigned a set of rows allocates them locally
• Entire rows are transferred between traversals

Structurally similar to SAS, but the orchestration differs:
• data structures and data access/naming
• communication
• synchronization

Page 41: Solver under message passing 2(2)

10. procedure Solve()
11. begin
13.   int i, j, pid, n' = n/nprocs, done = 0;
14.   float temp, tempdiff, mydiff = 0;     /* private variables */
6.    myA ← malloc(a 2-d array of size [n/nprocs + 2] by n+2);
                                            /* my assigned rows of A */
7.    initialize(myA);                      /* initialize my rows of A, in an unspecified way */
15.   while (!done) do
16.     mydiff = 0;                         /* set local diff to 0 */
16a.    if (pid != 0) then SEND(&myA[1,0], n*sizeof(float), pid-1, ROW);
16b.    if (pid != nprocs-1) then SEND(&myA[n',0], n*sizeof(float), pid+1, ROW);
16c.    if (pid != 0) then RECEIVE(&myA[0,0], n*sizeof(float), pid-1, ROW);
16d.    if (pid != nprocs-1) then RECEIVE(&myA[n'+1,0], n*sizeof(float), pid+1, ROW);
        /* border rows of neighbors have now been copied into myA[0,*] and myA[n'+1,*] */
17.     for i ← 1 to n' do                  /* for each of my (nonghost) rows */
18.       for j ← 1 to n do                 /* for all nonborder elements in that row */
19.         temp = myA[i,j];
20.         myA[i,j] = 0.2 * (myA[i,j] + myA[i,j-1] + myA[i-1,j] +
21.                           myA[i,j+1] + myA[i+1,j]);
22.         mydiff += abs(myA[i,j] - temp);
23.       endfor
24.     endfor
        /* communicate local diff values and determine if done;
           can be replaced by reduction and broadcast */
25a.    if (pid != 0) then                  /* process 0 holds global total diff */
25b.      SEND(mydiff, sizeof(float), 0, DIFF);
25c.      RECEIVE(done, sizeof(int), 0, DONE);
25d.    else                                /* pid 0 does this */
25e.      for i ← 1 to nprocs-1 do          /* for each other process */
25f.        RECEIVE(tempdiff, sizeof(float), *, DIFF);
25g.        mydiff += tempdiff;             /* accumulate into total */
25h.      endfor
25i.      if (mydiff/(n*n) < TOL) then done = 1;
25j.      for i ← 1 to nprocs-1 do          /* for each other process */
25k.        SEND(done, sizeof(int), i, DONE);
25l.      endfor
25m.    endif
26.   endwhile
27. end procedure

Main changes to the program:
• Assignment and distribution of data
• Explicit communication of results (naming)
• Synchronization

Primitives:
• SEND: copies data from local to remote
• RECEIVE: copies data from remote to local

Page 42: Send and Receive alternatives

These choices affect event synchronization, ease of programming, and performance:
• Synchronous messages provide built-in synchronization through matching
• Synchronous messages are prone to deadlock
• Functionality extensions: stride, scatter-gather, groups
• Semantics are based on when control is returned, and dictate when data structures or buffers can be reused at either end

Send/Receive
  • Synchronous
  • Asynchronous
    - Blocking asynchronous
    - Nonblocking asynchronous
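In MPI terms (a hedged mapping onto this taxonomy, since the slide predates MPI's exact vocabulary): MPI_Ssend is synchronous; MPI_Send is blocking asynchronous (it returns once the buffer may be reused, possibly via internal buffering); MPI_Isend is nonblocking asynchronous (the buffer is reusable only after MPI_Wait):

    #include <mpi.h>

    void send_variants(float *buf, int count, int peer, MPI_Comm comm) {
        MPI_Request req;
        MPI_Ssend(buf, count, MPI_FLOAT, peer, 0, comm);  /* completes only after the
                                                             matching receive has started */
        MPI_Send(buf, count, MPI_FLOAT, peer, 1, comm);   /* returns when buf is reusable */
        MPI_Isend(buf, count, MPI_FLOAT, peer, 2, comm, &req);
        /* ...overlap computation here... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);                /* only now may buf be reused */
    }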

Page 43: Summary

• Decomposition and assignment are similar in SAS and message passing
• Orchestration is different: data structures, data access/naming, communication, synchronization

Requirements for performance are another story — stay tuned for more

                                      SAS        Msg-Passing
Explicit global data structure?       Yes        No
Assignment indep. of data layout?     Yes        No
Communication                         Implicit   Explicit
Synchronization                       Explicit   Implicit
Explicit replication of border rows?  No         Yes

Page 44: Info (repeated)

INFO — the same announcements as Page 2, plus contact addresses: [email protected], [email protected]
