
Page 1: PCOD: Lecture 1

Per Stenström © 2008, Sally A. McKee © 2009

7.5 credit points

Instructor: Sally A. McKee

[Figure: IBM SP-2 node — Power 2 CPU with L2 $ on the memory bus, memory controller with 4-way interleaved DRAM, and a MicroChannel I/O bus carrying DMA, an i860-based NI, and the NIC; nodes are joined by a general interconnection network formed from 8-port switches]

Page 2: Info

INFO
• I need your names and email addresses!
• Class on MONDAY 1/24 at 3:15, place TBA
• NO CLASS T/Th 1/25, 1/27 (SAM away)
• NO LAB/EXERCISES THIS WEEK
• NO LAB THURSDAY 2/3 (we'll start week 3)
• Exercises next WEDNESDAY 2/2 at 10:00, place TBA
• Class on MONDAY 1/31 at 3:15, place TBA
• Web page coming over the weekend: http://www.cse.chalmers.se/~mckee/courses/EDA281.html
• Books are being copied; will distribute on Monday
• NO EXAM; final survey papers instead


Page 3: What is a parallel computer?

A parallel computer is a collection of processing elements that cooperate to solve large problems (fast)

Page 4: Why parallel computers?

New performance-demanding applications

Killer microprocessors

A collection of killer microprocessors (integrated on the same chip)

Economics

Technology trends

Page 5: Broad issues

Programming issues: What programming model?
Performance — cross-cutting issues: the impact of system design tradeoffs on application performance, and the impact of application design on performance
Architectural model issues: How big a collection? How powerful are the elements? How do the elements cooperate?

System interface

A parallel computer is a collection of processing elements that cooperate to solve large problems (fast)

Page 6: Goal and overview

The goal of this course is to provide knowledge of
• programming models and techniques for the design of high-performance parallel programs: the data parallel model, the shared address-space model, the message-passing model
• design principles for parallel computers: small-scale system design tradeoffs, scalable system design tradeoffs, interconnection networks

Page 7: Overview of parallel computer technology — what is it, what is it for, and what are the issues?

• Driving forces behind parallel computers (1.1)
• Evolution behind today's parallel computers (1.2)
• Fundamental design issues (1.3)
• Methodology for designing parallel programs (2.1–2.2)

Page 8: Three driving forces

Application demands (coarse-grain parallelism abounds):
• Scientific computing (e.g., modeling of phenomena in science)
• Engineering computing (e.g., CAD and design analysis)
• Commercial computing (e.g., media and information processing)
Technology trends:
• Transistor density growth is high
• Clock frequency improvement is moderate
Architecture trends:
• Diminishing returns on instruction-level parallelism

Page 9: Parallelism in sequential programs

A sequential program on a superscalar processor:
• Programming model: sequential
• Architecture: instruction-level parallelism, register (memory) communication, pipeline interlocking for synchronization
The gap between model and architecture has increased.

for i = 0 to N-1
    a[(i+1) mod N] := b[i] + c[i];
for i = 0 to N-1
    d[i] := C*a[i];

Iteration:  0     1     ...  N-1
Loop 1:     a[1]  a[2]  ...  a[0]
Loop 2:     a[0]  a[1]  ...  a[N-1]
(data dependencies between the loops)
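To make the cross-loop dependence concrete, here is a minimal C sketch of the same two loops (N, C, and the array contents are illustrative assumptions, not from the slides):

    /* Loop 1 writes a[(i+1) mod N]; loop 2 iteration i then reads a[i],
       which loop 1 produced in iteration (i-1+N) mod N. */
    #include <stdio.h>
    #define N 8
    int main(void) {
        float a[N], b[N], c[N], d[N];
        const float C = 2.0f;
        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 1.0f; a[i] = 0.0f; }
        for (int i = 0; i < N; i++)        /* loop 1 */
            a[(i + 1) % N] = b[i] + c[i];
        for (int i = 0; i < N; i++)        /* loop 2 */
            d[i] = C * a[i];
        printf("d[0] = %g\n", d[0]);       /* uses a[0], written in loop 1's last iteration */
        return 0;
    }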

Page 10: A parallel programming model

Extended semantics to express
• units of parallelism at the instruction level, thread level, or program level
• communication and coordination between units of parallelism at the register level, memory level, or I/O level

Page 11: Programming model vs. parallel architecture

[Figure: layered view — parallel applications (CAD, databases, scientific modeling) atop the programming models (multiprogramming, shared address, message passing, data parallel); the communication abstraction forms the user/system boundary, implemented by a compiler or library and operating system support over the communication hardware and physical communication medium, with the hardware/software boundary in between]

Three key concepts:
• Communication abstraction supports programming models
• Communication architecture (ISA plus primitives for communication/synchronization)
• Hardware/software boundary defines which parts of the communication architecture are implemented in hardware and which in software

Page 12: Shared address space (SAS) model

Programming model:
• Parallelism: parts of a program, called threads
• Communication and coordination among threads through a shared global address space

for_all i = 0 to P-1
    for j = i0[i] to in[i]
        a[(j+1) mod N] := b[j] + c[j];
barrier;
for_all i = 0 to P-1
    for j = i0[i] to in[i]
        d[j] := C*a[j];

[Figure: P processors attached to a single shared Memory]

Communication abstraction supported by the HW/SW interface
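As a concrete, hedged illustration of the same pattern: with OpenMP, threads share the arrays through one address space, and the implicit barrier at the end of each parallel for plays the role of barrier (the runtime, rather than the i0/in tables above, chooses each thread's iterations):

    /* Minimal OpenMP sketch of the SAS version; N and the arrays are assumptions. */
    #include <omp.h>
    #define N 1024
    float a[N], b[N], c[N], d[N];
    const float C = 2.0f;

    void sweep(void) {
        #pragma omp parallel
        {
            #pragma omp for                   /* threads split the iterations of loop 1 */
            for (int j = 0; j < N; j++)
                a[(j + 1) % N] = b[j] + c[j];
            /* implicit barrier here: no thread starts loop 2 before loop 1 finishes */
            #pragma omp for                   /* loop 2 reads a[] written by other threads */
            for (int j = 0; j < N; j++)
                d[j] = C * a[j];
        }
    }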

Page 13: Message passing model

Programming model:
• Process-level parallelism (private address spaces)
• Communication and coordination via explicit messages

A first variant sends every updated element:

for_all i = 0 to P-1
    for j = i0[i] to in[i]
        index = (j+1) mod N;
        a[index] := b[j] + c[j];
        send(a[index], (j+1) mod P);
    end_for
barrier;
for_all k = 0 to P-1
    for j = i0[k] to in[k]
        recv(tmp, (P+j-1) mod P);
        d[j] := C*tmp;
    end_for

A refined variant communicates only the boundary elements the neighbor actually needs:

for_all i = 0 to P-1
    for j = i0[i] to in[i]
        index = (j+1) mod N;
        a[index] := b[j] + c[j];
        if j = in[i] then send(a[index], (j+1) mod P, a[j]);
    end_for
barrier;
for_all i = 0 to P-1
    for j = i0[i] to in[i]
        if j = i0[i] then recv(tmp, (P+j-1) mod P, a[j]);
        d[j] := C * tmp;
    end_for
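For comparison, a hedged MPI sketch of the refined variant's boundary exchange (block size, tags, and the constant C are illustrative assumptions, not the book's code):

    /* Each process updates its private block, then passes its boundary element
       to the next process, which needs it for the second loop. */
    #include <mpi.h>
    #define NLOCAL 256
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int pid, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &pid);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        float a[NLOCAL + 1], b[NLOCAL], c[NLOCAL], d[NLOCAL], tmp;
        for (int j = 0; j < NLOCAL; j++) { b[j] = j; c[j] = 1.0f; }
        for (int j = 0; j < NLOCAL; j++)      /* loop 1 on the private block */
            a[j + 1] = b[j] + c[j];
        int next = (pid + 1) % nprocs, prev = (pid + nprocs - 1) % nprocs;
        MPI_Sendrecv(&a[NLOCAL], 1, MPI_FLOAT, next, 0,   /* send my boundary */
                     &tmp,       1, MPI_FLOAT, prev, 0,   /* receive neighbor's */
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        a[0] = tmp;
        for (int j = 0; j < NLOCAL; j++)      /* loop 2 uses the received value */
            d[j] = 2.0f * a[j];
        MPI_Finalize();
        return 0;
    }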

Page 14: Data parallel systems (SIMD)

Programming model:
• Operations performed in parallel on each element of a data structure
• Logically, a single thread of control performing sequential or parallel steps
• Conceptually, a processor associated with each data element

Architectural model:
• Array of many simple, cheap processors, each with little memory (processors don't sequence through instructions)
• Attached to a control processor that issues instructions
• Specialized and general communication, cheap global synchronization

Original motivations:
• Matches simple differential equation solvers
• Centralizes the high cost of instruction fetch/sequencing

[Figure: control processor driving a grid of PEs]

Page 15: Pros and cons of data parallelism

Example:

parallel (i: 0 -> N-1)
    a[(i+1) mod N] := b[i] + c[i];
parallel (i: 0 -> N-1)
    d[i] := C * a[i];

Evolution and convergence:
• Popular when the cost savings of a centralized sequencer are high
• Parallelism is limited to specialized, regular computations
• Much of the parallelism can be exploited at the instruction level
• Coarser levels of parallelism can be exposed for multiprocessors and message-passing machines

New data parallel programming model: SPMD (Single-Program Multiple-Data)
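A hedged C sketch of the SPMD idea: every process executes the same program text and uses its id to select its share of the data (pid/nprocs would come from the runtime; all names are placeholders):

    /* SPMD: same program for all processes; behavior differs only by pid. */
    #define N 1024
    void spmd_body(int pid, int nprocs, float *a, float *b, float *c, float *d) {
        int chunk = N / nprocs;                /* assume nprocs divides N */
        int lo = pid * chunk, hi = lo + chunk;
        for (int i = lo; i < hi; i++)          /* my slice of loop 1 */
            a[(i + 1) % N] = b[i] + c[i];
        /* a global barrier would go here before the second phase */
        for (int i = lo; i < hi; i++)          /* my slice of loop 2 */
            d[i] = 2.0f * a[i];
    }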

Page 16: A generic parallel architecture

A generic modern multiprocessor (shared address or message passing architecture):

Node: processor(s), memory system, plus a communication assist

• Network interface and communication controller

• Scalable network

• Convergence allows lots of innovation, now within framework

• Integration of assist with node, what operations, how efficiently...

[Figure: node containing processor (P), cache ($), memory (Mem), and a communication assist (CA), attached to a scalable network]

Page 17: Why study parallel programming issues?

From a software/algorithm designer's point of view:
• High-performance software is the key motivation for parallel computers
• Parallel compiler technology is far from being as mature as compiler technology for single processors (uniprocessors)
From a system designer's point of view:
• Understanding hardware/software interaction is key to making architectural tradeoffs for high performance

It is also important to understand the tradeoff between performance and the programming effort involved.

Page 18: Creating a parallel program

Assumption: a sequential algorithm is given. Pieces of the job:
• Identify work that can be done in parallel
• Partition work and perhaps data among processes
• Manage data access, communication, and synchronization
Note: work includes computation, data access, and I/O

Main goal: Speedup (plus low programming effort and resources needed)

Speedup(p) = Performance(p) / Performance(1)

For a fixed problem:

Speedup(p) = Time(1) / Time(p)

Page 19: Steps in creating a parallel program

Four steps: Decomposition, Assignment, Orchestration, Mapping
• Done by the programmer or by system software (compiler, runtime, ...)
• The issues are the same, so just assume the programmer does it

[Figure: sequential computation → Decomposition → tasks → Assignment → processes p0–p3 → Orchestration → parallel program → Mapping → processors P0–P3; decomposition plus assignment together form partitioning. Decomposition and assignment are largely architecture-independent; orchestration and mapping are largely architecture-dependent]

Page 20: Some important concepts

Task: an arbitrary piece of the total work in a parallel computation
• Executed sequentially — concurrency is only across tasks
• Fine-grained versus coarse-grained tasks

Process (thread): an entity that is eventually executed by a CPU
• An abstract entity that performs the tasks assigned to it
• Processes communicate and synchronize to perform their tasks

Processor: the physical engine on which a process executes
• Processes virtualize the machine to the programmer: first write the program in terms of processes, then map processes to processors

Page 21: Decomposition

Purpose: break up the computation into tasks to be divided among processes
• Tasks may become available dynamically
• The number of available tasks may vary with time
i.e., identify concurrency and decide the level at which to exploit it

Goal: enough tasks to keep processes busy, but not too many (keep task management reasonable)
• The number of tasks available at a time is an upper bound on the achievable speedup

Page 22: Limited concurrency: Amdahl's Law

Amdahl's Law is the most fundamental limitation on parallel speedup: if a fraction s of the sequential execution is inherently serial, then speedup ≤ 1/s. (For example, if s = 0.2, the speedup can never exceed 5, no matter how many processors are used.)

Example: a two-phase calculation
• Phase 1: sweep over an n-by-n grid and do some independent computation (time: n²/p)
• Phase 2: sweep again and add each value into a global sum (time: n², since updates to the single global sum serialize)

Improved version — the trick is to divide the second phase in two:
• accumulate into a private sum during the sweep
• add the per-process private sums into the global sum
Parallel time is n²/p + n²/p + p, and the speedup is at best 2n²p / (2n² + p²)
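As a quick numeric sanity check of that bound (n = 1000 is an illustrative choice):

    /* Evaluate parallel time 2n^2/p + p and the resulting speedup bound. */
    #include <stdio.h>
    int main(void) {
        double n = 1000.0;
        for (int p = 1; p <= 1024; p *= 4) {
            double t_par = 2.0 * n * n / p + p;       /* n^2/p + n^2/p + p */
            double speedup = (2.0 * n * n) / t_par;   /* = 2n^2 p / (2n^2 + p^2) */
            printf("p = %4d  speedup = %.1f\n", p, speedup);
        }
        return 0;
    }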

Page 23: Graphical representation of example

[Figure: work done concurrently vs. time — (a) one processor does all 2n² work serially; (b) phase 1 runs at concurrency p for time n²/p, but phase 2 is a serial n²; (c) with private sums, both sweeps run at concurrency p for time n²/p each, followed by a short serial accumulation of length p]

Page 24: Assignment

Specify a mechanism to divide the work up among processes
Goal: balance the work among processes, and reduce communication and management costs

Structured approaches usually work well:
• Code inspection (parallel loops) or understanding of the application
• Well-known heuristics
• Static versus dynamic assignment

Programmers worry about decomposition and assignment first
• Largely independent of architecture or programming model
• But the cost and complexity of using primitives may affect decisions
As architects, we assume the program does a reasonable job of it

Page 25: Orchestration

Purpose:
• Naming data, structuring communication and synchronization
• Organizing data structures, scheduling tasks temporally

Goals:
• Reduce the costs of communication and synchronization as seen by processors
• Enhance the locality of data references
• Reduce the overhead of parallelism management

Orchestration is closest to the architecture (and programming model and language):
• Choices depend heavily on the communication abstraction and the efficiency of primitives
• Architects must provide appropriate, efficient primitives

Page 26: Mapping

After orchestration, a parallel program exists. Two aspects of mapping:
• Which processes will run on the same processor, if necessary
• Which process runs on which particular processor

One extreme: space-sharing
• The machine is divided into subsets, with only one application at a time in a subset
• Processes can be pinned to processors, or the OS can balance workloads
Another extreme: complete resource management control to the OS
• The OS uses the performance techniques we will discuss later
The real world is between the two: the user specifies desires in some aspects, but the system may ignore them

Page 27: High-level goals

High performance (speedup over the equivalent sequential program), but with low resource usage and development effort
Implications for algorithm designers and architects:
• Algorithm designers: high performance, low resource needs
• Architects: high performance, low cost, reduced programming effort

Table 2.1: Steps in the Parallelization Process and Their Goals

Step          | Architecture-Dependent? | Major Performance Goals
Decomposition | Mostly no               | Expose enough concurrency, but not too much
Assignment    | Mostly no               | Balance workload; reduce communication volume
Orchestration | Yes                     | Reduce noninherent communication via data locality; reduce communication and synchronization cost as seen by the processor; reduce serialization at shared resources; schedule tasks to satisfy dependences early
Mapping       | Yes                     | Put related processes on the same processor if necessary; exploit locality in network topology

Page 28: Partitioning (Ch. 3.1) [in 1.5 weeks] (= decomposition + assignment)

[Figure: processes P with communication costs between them]

Part II — Textbook Reference: Ch. 2.3

Applying the methodology to an equation solver

Page 29: Sequential implementation

10. procedure Solve (A)          /* solve the equation system */
11.   float **A;                 /* A is an (n+2)-by-(n+2) array */
12. begin
13.   int i, j, done = 0;
14.   float diff = 0, temp;
15.   while (!done) do           /* outermost loop over sweeps */
16.     diff = 0;                /* initialize maximum difference to 0 */
17.     for i ← 1 to n do        /* sweep over nonborder points of grid */
18.       for j ← 1 to n do
19.         temp = A[i,j];       /* save old value of element */
20.         A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                         A[i,j+1] + A[i+1,j]);  /* compute average */
22.         diff += abs(A[i,j] - temp);
23.       end for
24.     end for
25.     if (diff/(n*n) < TOL) then done = 1;
26.   end while
27. end procedure

Page 30: Decomposition 1(3)

Inherent concurrency in the loop structure:
• Dependences with the north and west grid points
• The loops are inherently sequential
Inherent concurrency ignoring the loop structure:
• Concurrency O(n) along anti-diagonals, serialization O(n) across anti-diagonals
• Result: load imbalance and many synchronizations

Page 31: Decomposition 2(3)

Inherent concurrency in the algorithm: the red-black ordering
• A different ordering of updates: may converge more quickly or more slowly
• The red sweep and the black sweep are each fully parallel
• Global synchronization between them (conservative but convenient)

[Figure: checkerboard grid of red and black points]
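A hedged C sketch of one red-black sweep (the grid size and the (i+j) parity convention are assumptions): points of one color depend only on the other color's values, so all points of a color can be updated in parallel.

    /* One red-black sweep over the interior of an (n+2)-by-(n+2) grid. */
    void red_black_sweep(int n, float A[n + 2][n + 2]) {
        for (int color = 0; color < 2; color++) {   /* 0 = red, 1 = black */
            for (int i = 1; i <= n; i++)
                for (int j = 1; j <= n; j++)
                    if ((i + j) % 2 == color)       /* all same-color points are independent */
                        A[i][j] = 0.2f * (A[i][j] + A[i][j - 1] + A[i - 1][j] +
                                          A[i][j + 1] + A[i + 1][j]);
            /* a global barrier would separate the red and black sweeps in parallel code */
        }
    }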

Page 32: Decomposition 3(3)

Decomposition into elements: degree of concurrency n²
To decompose into rows instead, make the loop at line 18 sequential: degree of concurrency n
for_all leaves the assignment to the system, but implies a global synchronization at the end of each for_all loop

15. while (!done) do             /* a sequential loop */
16.   diff = 0;
17.   for_all i ← 1 to n do      /* a parallel loop nest */
18.     for_all j ← 1 to n do
19.       temp = A[i,j];
20.       A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                       A[i,j+1] + A[i+1,j]);
22.       diff += abs(A[i,j] - temp);
23.     end for_all
24.   end for_all
25.   if (diff/(n*n) < TOL) then done = 1;
26. end while

Ignore the dependences — the solution will converge anyway

Page 33: Assignment

Static assignment (given the decomposition into rows):
• Block assignment (see figure): reduces communication, but may introduce load imbalance
• Cyclic assignment: process i is assigned rows i, i+p, i+2p, ...
Dynamic assignment (let the system do it):
• Each process grabs a new row when finished with its current row

[Figure: block assignment of contiguous bands of rows to processes P0, P1, P2, ...]
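A minimal C sketch of the two static schemes (update_row is a placeholder for the per-row sweep; assume nprocs divides n):

    void update_row(int i);                         /* placeholder for the row sweep */

    void block_rows(int pid, int nprocs, int n) {
        int rows = n / nprocs;
        int mymin = 1 + pid * rows, mymax = mymin + rows - 1;
        for (int i = mymin; i <= mymax; i++)        /* one contiguous band of rows */
            update_row(i);
    }

    void cyclic_rows(int pid, int nprocs, int n) {
        for (int i = 1 + pid; i <= n; i += nprocs)  /* rows pid, pid+p, pid+2p, ... */
            update_row(i);
    }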

Page 34: Solver under data parallel model

10. procedure Solve(A)           /* solve the equation system */
11.   float **A;                 /* A is an (n+2)-by-(n+2) array */
12. begin
13.   int i, j, done = 0;
14.   float mydiff = 0, temp;
14a.  DECOMP A[BLOCK,*, nprocs];
15.   while (!done) do           /* outermost loop over sweeps */
16.     mydiff = 0;              /* initialize maximum difference to 0 */
17.     for_all i ← 1 to n do    /* sweep over nonborder points of grid */
18.       for_all j ← 1 to n do
19.         temp = A[i,j];       /* save old value of element */
20.         A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                         A[i,j+1] + A[i+1,j]);  /* compute average */
22.         mydiff += abs(A[i,j] - temp);
23.       end for_all
24.     end for_all
24a.    REDUCE (mydiff, diff, ADD);
25.     if (diff/(n*n) < TOL) then done = 1;
26.   end while
27. end procedure

Important observations:
• The matrix is shared across processes
• All processes do the same operation in parallel, in lock-step
• Orchestration is easy: no explicit communication or synchronization

Three primitives:
• DECOMP does the assignment
• for_all distributes the work
• REDUCE accumulates the local sums into a global sum

Page 35: Solver under SAS model 1(3)

• All processes have separate control (in this example they do the same operations: the SPMD model)
• Assignment is controlled by loop indices
• All processes share the matrix but do not work in lock-step — orchestration focuses on synchronization

[Figure: processes each run Solve; in every iteration they sweep, then jointly test convergence]

Page 36: Solver under SAS model 2(3)

10. procedure Solve(A)
11.   float **A;                            /* A is entire n+2-by-n+2 shared array,
                                               as in the sequential program */
12. begin
13.   int i, j, pid, done = 0;
14.   float temp, mydiff = 0;               /* private variables */
14a.  int mymin = 1 + (pid * n/nprocs);     /* assume that n is exactly divisible by */
14b.  int mymax = mymin + n/nprocs - 1;     /* nprocs for simplicity here */
15.   while (!done) do                      /* outermost loop over sweeps */
16.     mydiff = diff = 0;                  /* set global diff to 0 (okay for all to do it) */
16a.    BARRIER(bar1, nprocs);              /* ensure all reach here before anyone modifies diff */
17.     for i ← mymin to mymax do           /* for each of my rows */
18.       for j ← 1 to n do                 /* for all nonborder elements in that row */
19.         temp = A[i,j];
20.         A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                         A[i,j+1] + A[i+1,j]);
22.         mydiff += abs(A[i,j] - temp);
23.       endfor
24.     endfor
25a.    LOCK(diff_lock);                    /* update global diff if necessary */
25b.    diff += mydiff;
25c.    UNLOCK(diff_lock);
25d.    BARRIER(bar1, nprocs);              /* ensure all reach here before checking if done */
25e.    if (diff/(n*n) < TOL) then done = 1;  /* check convergence; all get same answer */
25f.    BARRIER(bar1, nprocs);
26.   endwhile
27. end procedure

Code for a single process. Main changes to the program:
• Assignment
• Synchronization

Synchronization primitives:
• BARRIER: catch all processes before any is allowed to proceed
• LOCK/UNLOCK: enforce mutual exclusion

Page 37: Mutual exclusion 1(2)

Code each process executes:
    load  r1 <- diff
    add   r1, r2, r1
    store diff <- r1

A possible interleaving (assume diff = 0 and each process's r2 = 1):
P1: r1 <- diff        {P1 gets 0 in its r1}
P2: r1 <- diff        {P2 also gets 0}
P1: r1 <- r1 + r2     {P1 sets its r1 to 1}
P2: r1 <- r1 + r2     {P2 sets its r1 to 1}
P1: diff <- r1        {P1 sets diff to 1}
P2: diff <- r1        {P2 also sets diff to 1}
One update is lost: diff ends up at 1 instead of 2.

The sets of operations must be atomic (mutually exclusive)

Page 38: Mutual exclusion 2(2)

Provided by LOCK–UNLOCK around critical sections (code segments requiring mutual exclusion):
• The set of operations we want to execute atomically
• LOCK/UNLOCK implementations must guarantee mutual exclusion (atomicity)

Can lead to significant serialization under contention:
• Non-local accesses are expected in the critical section
• Another reason to use a private mydiff variable for partial accumulation
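A hedged POSIX-threads sketch of the LOCK/UNLOCK idiom (names mirror the solver; error handling omitted):

    #include <pthread.h>

    static pthread_mutex_t diff_lock = PTHREAD_MUTEX_INITIALIZER;
    static float diff = 0.0f;

    void accumulate(float mydiff) {
        pthread_mutex_lock(&diff_lock);    /* LOCK(diff_lock) */
        diff += mydiff;                    /* critical section: one updater at a time */
        pthread_mutex_unlock(&diff_lock);  /* UNLOCK(diff_lock) */
    }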

Page 39: Global event synchronization

BARRIER(nprocs): wait here until nprocs processes arrive
• Built using lower-level primitives
• Used to separate phases of computation

Process P_1              Process P_2              Process P_nprocs
set up eqn system        set up eqn system        set up eqn system
Barrier(name, nprocs)    Barrier(name, nprocs)    Barrier(name, nprocs)
solve eqn system         solve eqn system         solve eqn system
Barrier(name, nprocs)    Barrier(name, nprocs)    Barrier(name, nprocs)
apply results            apply results            apply results
Barrier(name, nprocs)    Barrier(name, nprocs)    Barrier(name, nprocs)

A conservative way of preserving dependences, but easy to use

Point-to-point event synchronization possible — see text
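A minimal POSIX sketch of barrier-separated phases (the phase functions and nprocs are placeholders):

    #include <pthread.h>

    void set_up_eqn_system(void);
    void solve_eqn_system(void);
    void apply_results(void);

    static pthread_barrier_t bar;          /* initialized elsewhere with
                                              pthread_barrier_init(&bar, NULL, nprocs) */
    void *worker(void *arg) {
        set_up_eqn_system();
        pthread_barrier_wait(&bar);        /* everyone finishes setup first */
        solve_eqn_system();
        pthread_barrier_wait(&bar);        /* everyone finishes solving next */
        apply_results();
        pthread_barrier_wait(&bar);
        return 0;
    }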

Page 40: Solver under message passing 1(2)

Private rather than shared address spaces cause many differences from the SAS (shmem) model:
• A can no longer be declared as a shared array
• It must be composed logically from per-process private arrays, usually allocated in accordance with the assignment of work: a process assigned a set of rows allocates them locally
• Entire rows are transferred between traversals

Structurally similar to SAS, but the orchestration differs:
• data structures and data access/naming
• communication
• synchronization

Page 41: Solver under message passing 2(2)

10. procedure Solve()
11. begin
13.   int i, j, pid, n' = n/nprocs, done = 0;
14.   float temp, tempdiff, mydiff = 0;     /* private variables */
6.    myA ← malloc(a 2-d array of size [n/nprocs + 2] by n+2);
                                            /* my assigned rows of A */
7.    initialize(myA);                      /* initialize my rows of A, in an unspecified way */
15.   while (!done) do
16.     mydiff = 0;                         /* set local diff to 0 */
16a.    if (pid != 0) then SEND(&myA[1,0], n*sizeof(float), pid-1, ROW);
16b.    if (pid != nprocs-1) then SEND(&myA[n',0], n*sizeof(float), pid+1, ROW);
16c.    if (pid != 0) then RECEIVE(&myA[0,0], n*sizeof(float), pid-1, ROW);
16d.    if (pid != nprocs-1) then RECEIVE(&myA[n'+1,0], n*sizeof(float), pid+1, ROW);
        /* border rows of neighbors have now been copied into myA[0,*] and myA[n'+1,*] */
17.     for i ← 1 to n' do                  /* for each of my (nonghost) rows */
18.       for j ← 1 to n do                 /* for all nonborder elements in that row */
19.         temp = myA[i,j];
20.         myA[i,j] = 0.2 * (myA[i,j] + myA[i,j-1] + myA[i-1,j] +
21.                           myA[i,j+1] + myA[i+1,j]);
22.         mydiff += abs(myA[i,j] - temp);
23.       endfor
24.     endfor
        /* communicate local diff values and determine if done;
           can be replaced by reduction and broadcast */
25a.    if (pid != 0) then                  /* process 0 holds global total diff */
25b.      SEND(mydiff, sizeof(float), 0, DIFF);
25c.      RECEIVE(done, sizeof(int), 0, DONE);
25d.    else                                /* pid 0 does this */
25e.      for i ← 1 to nprocs-1 do          /* for each other process */
25f.        RECEIVE(tempdiff, sizeof(float), *, DIFF);
25g.        mydiff += tempdiff;             /* accumulate into total */
25h.      endfor
25i.      if (mydiff/(n*n) < TOL) then done = 1;
25j.      for i ← 1 to nprocs-1 do          /* for each other process */
25k.        SEND(done, sizeof(int), i, DONE);
25l.      endfor
25m.    endif
26.   endwhile
27. end procedure

Main changes to the program:
• Assignment and distribution of data
• Explicit communication of results (naming)
• Synchronization

Primitives:
• SEND: copies data from local to remote
• RECEIVE: copies data from remote to local

Page 42: Send and Receive alternatives

These choices affect event synchronization, ease of programming, and performance:
• Synchronous messages provide built-in synchronization through matching
• Synchronous messages are prone to deadlock
• Functionality extensions: stride, scatter-gather, groups
• Semantics are based on when control is returned, and dictate when data structures or buffers can be reused at either end

Send/Receive
  • Synchronous
  • Asynchronous
    - Blocking asynchronous
    - Nonblocking asynchronous
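In MPI terms (a hedged mapping onto this taxonomy, since the slide predates MPI's exact vocabulary): MPI_Ssend is synchronous; MPI_Send is blocking asynchronous (it returns once the buffer may be reused, possibly via internal buffering); MPI_Isend is nonblocking asynchronous (the buffer is reusable only after MPI_Wait):

    #include <mpi.h>

    void send_variants(float *buf, int count, int peer, MPI_Comm comm) {
        MPI_Request req;
        MPI_Ssend(buf, count, MPI_FLOAT, peer, 0, comm);  /* completes only after the
                                                             matching receive has started */
        MPI_Send(buf, count, MPI_FLOAT, peer, 1, comm);   /* returns when buf is reusable */
        MPI_Isend(buf, count, MPI_FLOAT, peer, 2, comm, &req);
        /* ...overlap computation here... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);                /* only now may buf be reused */
    }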

Page 43: Summary

• Decomposition and assignment are similar in SAS and message passing
• Orchestration is different: data structures, data access/naming, communication, synchronization

Requirements for performance are another story — stay tuned for more

                                      SAS        Msg-Passing
Explicit global data structure?       Yes        No
Assignment indep. of data layout?     Yes        No
Communication                         Implicit   Explicit
Synchronization                       Explicit   Implicit
Explicit replication of border rows?  No         Yes

Page 44: Info (repeated)

INFO — the same announcements as Page 2, plus contact addresses: [email protected], [email protected]
