Slide 1 (PCOD: Lecture 1)
Per Stenström © 2008, Sally A. McKee © 2009
7.5 credit points
Instructor: Sally A. McKee
[Figure: an IBM SP-2 node as an example building block: a Power 2 CPU with L2 cache on the memory bus, a memory controller with 4-way interleaved DRAM, and a MicroChannel I/O bus carrying DMA engines, an i860-based network interface (NI) with its own DRAM, and a NIC connecting to a general interconnection network formed from 8-port switches]
Slide 2 (PCOD: Lecture 1)
INFO
• I need your names and email addresses!
• Class on MONDAY 1/24 at 3:15, place TBA
• NO CLASS T/Th 1/25, 1/27 (SAM away)
• NO LAB/EXERCISES THIS WEEK
• NO LAB THURSDAY 2/3 (we'll start week 3)
• Exercises next WEDNESDAY 2/2 at 10:00, place TBA
• Class on MONDAY 1/31 at 3:15, place TBA
• Web page coming over the weekend: http://www.cse.chalmers.se/~mckee/courses/EDA281.html
• Books are being copied; will distribute on Monday
• NO EXAM; final survey papers instead
Slide 3 (PCOD: Lecture 1)
What is a parallel computer?
A parallel computer is a collection of processing elements that cooperate to solve large problems (fast)
Slide 4 (PCOD: Lecture 1)
Why parallel computers?
• New performance-demanding applications
• Killer microprocessors
• A collection of killer microprocessors (integrated on the same chip)
• Economics
• Technology trends
Slide 5 (PCOD: Lecture 1)
Broad issues
• Programming issues: What programming model?
• Performance (cross-cutting issues): impact of system design tradeoffs on application performance; impact of application design on performance
• Architectural model issues: How big a collection? How powerful are the elements? How do the elements cooperate?
System interface
A parallel computer is a collection of processing elements that cooperate to solve large problems (fast)
Slide 6 (PCOD: Lecture 1)
Goal and overview
The goal of this course is to provide knowledge on:
• Programming models and techniques for the design of high-performance parallel programs: the data parallel model, the shared address-space model, the message-passing model
• Design principles for parallel computers: small-scale system design tradeoffs, scalable system design tradeoffs, interconnection networks
Slide 8 (PCOD: Lecture 1)
Overview of parallel computer technology: What is it? What is it for? What are the issues?
• Driving forces behind parallel computers (1.1)
• Evolution behind today's parallel computers (1.2)
• Fundamental design issues (1.3)
• Methodology for designing parallel programs (2.1–2.2)
Slide 9 (PCOD: Lecture 1)
Three driving forces
• Application demands (coarse-grain parallelism abounds): scientific computing (e.g., modeling of phenomena in science), engineering computing (e.g., CAD and design analysis), commercial computing (e.g., media and information processing)
• Technology trends: transistor density growth high; clock frequency improvement moderate
• Architecture trends: diminishing returns on instruction-level parallelism
Slide 10 (PCOD: Lecture 1)
Parallelism in sequential programs
A sequential program on a superscalar processor:
• Programming model: sequential
• Architecture: instruction-level parallelism; register (memory) communication; pipeline interlocking for synchronization
The gap between model and architecture has increased.

for i = 0 to N-1
    a[(i+1) mod N] := b[i] + c[i];
for i = 0 to N-1
    d[i] := C * a[i];

Data dependencies (element of a written in loop 1, read in loop 2):

Iteration:  0     1     …  N-1
Loop 1:     a[1]  a[2]  …  a[0]      (written)
Loop 2:     a[0]  a[1]  …  a[N-1]    (read)
Slide 11 (PCOD: Lecture 1)
A parallel programming model
Extended semantics to express:
• units of parallelism, at the instruction level, thread level, or program level
• communication and coordination between units of parallelism, at the register level, memory level, or I/O level
Slide 12 (PCOD: Lecture 1)
Programming model vs. parallel architecture

[Figure: layered view of a parallel system, top to bottom:
  Parallel applications (CAD, databases, scientific modeling)
  Programming models (multiprogramming, shared address, message passing, data parallel)
  Communication abstraction   <- user/system boundary
  Compiler or library
  Operating system support
  Communication hardware      <- hardware/software boundary
  Physical communication medium]

Three key concepts:
• Communication abstraction supports programming models
• Communication architecture (ISA plus primitives for communication/synchronization)
• Hardware/software boundary defines which parts of the communication architecture are implemented in hardware or software
Slide 13 (PCOD: Lecture 1)
Shared address space (SAS) model
Programming model:
• Parallelism among parts of a program, called threads
• Communication and coordination among threads through a shared global address space

for_all i = 0 to P-1
    for j = i0[i] to in[i]
        a[(j+1) mod N] := b[j] + c[j];
barrier;
for_all i = 0 to P-1
    for j = i0[i] to in[i]
        d[j] := C * a[j];

[Figure: P processors connected to a shared memory]
Communication abstraction supported by the HW/SW interface
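A minimal POSIX-threads sketch of this fragment in C (an illustration, not from the slides): the P worker threads, the contiguous index ranges, and pthread_barrier_wait stand in for the slide's for_all, its i0[i]..in[i] ranges, and its barrier primitive. N, P, and the constant C are arbitrary illustrative values.

#include <pthread.h>
#define N 1024                     /* problem size (illustrative) */
#define P 4                        /* number of threads */
static double a[N], b[N], c[N], d[N];
static const double C = 3.0;       /* arbitrary scaling constant */
static pthread_barrier_t bar;

static void *worker(void *arg) {
    long id = (long)arg;                          /* thread id 0..P-1 */
    int lo = id * N / P, hi = (id + 1) * N / P;   /* my block of iterations */
    for (int j = lo; j < hi; j++)
        a[(j + 1) % N] = b[j] + c[j];
    pthread_barrier_wait(&bar);    /* all writes to a[] complete before any thread reads */
    for (int j = lo; j < hi; j++)
        d[j] = C * a[j];
    return NULL;
}

int main(void) {
    pthread_t t[P];
    pthread_barrier_init(&bar, NULL, P);
    for (long i = 0; i < P; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < P; i++) pthread_join(t[i], NULL);
    pthread_barrier_destroy(&bar);
    return 0;
}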
Slide 14 (PCOD: Lecture 1)
Message passing model
Programming model:
• Process-level parallelism (private addresses)
• Communication and coordination via explicit messages

First version (a message per element):

for_all i = 0 to P-1
    for j = i0[i] to in[i]
        index = (j+1) mod N;
        a[index] := b[j] + c[j];
        send(a[index], (j+1) mod P);
    end_for
barrier;
for_all k = 0 to P-1
    for j = i0[k] to in[k]
        recv(tmp, (P+j-1) mod P);
        d[j] := C * tmp;
    end_for

Refined version (only the boundary element crosses processes):

for_all i = 0 to P-1
    for j = i0[i] to in[i]
        index = (j+1) mod N;
        a[index] := b[j] + c[j];
        if j = in[i] then send(a[index], (j+1) mod P, a[j]);
    end_for
barrier;
for_all i = 0 to P-1
    for j = i0[i] to in[i]
        if j = i0[i] then recv(tmp, (P+j-1) mod P, a[j]);
        d[j] := C * tmp;
    end_for
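For comparison, a minimal MPI sketch in C of the refined pattern (an illustration; the slide's send/recv are generic primitives, not MPI). Assuming contiguous blocks, each rank computes its slice of a, exchanges the single boundary element with its neighbors, then computes d. MPI_Sendrecv plays the slide's paired send/recv and avoids deadlock.

#include <mpi.h>
#define N 1024                     /* problem size (illustrative) */
static double a[N], b[N], c[N], d[N];
static const double C = 3.0;

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int lo = rank * N / nprocs, hi = (rank + 1) * N / nprocs;

    for (int j = lo; j < hi; j++)          /* loop 1 on my block */
        a[(j + 1) % N] = b[j] + c[j];

    /* my last iteration wrote a[hi mod N]; the next rank reads it as its a[lo] */
    int next = (rank + 1) % nprocs, prev = (rank + nprocs - 1) % nprocs;
    MPI_Sendrecv(&a[hi % N], 1, MPI_DOUBLE, next, 0,
                 &a[lo],     1, MPI_DOUBLE, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    for (int j = lo; j < hi; j++)          /* loop 2 on my block */
        d[j] = C * a[j];

    MPI_Finalize();
    return 0;
}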
Slide 15 (PCOD: Lecture 1)
Data parallel systems (SIMD)
Programming model:
• Operations performed in parallel on each element of a data structure
• Logically, a single thread of control performing sequential or parallel steps
• Conceptually, a processor associated with each data element
Architectural model:
• Array of many simple, cheap processors, each with little memory (processors don't sequence through instructions)
• Attached to a control processor that issues instructions
• Specialized and general communication, cheap global synchronization
Original motivations:
• Matches simple differential equation solvers
• Centralizes the high cost of instruction fetch/sequencing

[Figure: a control processor driving a grid of processing elements (PEs)]
Slide 16 (PCOD: Lecture 1)
Pros and cons of data parallelism
Example:

parallel (i: 0->N-1)
    a[(i+1) mod N] := b[i] + c[i];
parallel (i: 0->N-1)
    d[i] := C * a[i];

Evolution and convergence:
• Popular when the cost savings of a centralized sequencer are high
• Parallelism is limited to specialized, regular computations
• Much of the parallelism can be exploited at the instruction level
• Coarser levels of parallelism can be exposed for multiprocessors and message-passing machines
New data parallel programming model: SPMD (Single-Program Multiple-Data)
Slide 17 (PCOD: Lecture 1)
A generic parallel architecture
A generic modern multiprocessor (shared address or message passing architecture):
• Node: processor(s), memory system, plus a communication assist (CA): network interface and communication controller
• Scalable network
• Convergence allows lots of innovation, now within a common framework: integration of the assist with the node, what operations it supports, how efficiently...

[Figure: nodes, each containing processor (P), cache ($), memory (Mem), and communication assist (CA), connected by a scalable network]
Slide 19 (PCOD: Lecture 2)
Why study parallel programming issues?
From a software/algorithm designer's point of view:
• High-performance software is the key motivation for parallel computers
• Parallel compiler technology is far from being as mature as compiler technology for single processors (uniprocessors)
From a system designer's point of view:
• Understanding hardware/software interaction is key to making architectural tradeoffs for high performance
• Important to understand tradeoffs in performance versus the programming effort involved
Slide 20 (PCOD: Lecture 2)
Creating a parallel program
Assumption: a sequential algorithm is given.
Pieces of the job:
• Identify work that can be done in parallel
• Partition work, and perhaps data, among processes
• Manage data access, communication, and synchronization
Note: work includes computation, data access, and I/O.
Main goal: speedup (plus low programming effort and resource needs)

Speedup(p) = Performance(p) / Performance(1)

For a fixed problem:

Speedup(p) = Time(1) / Time(p)
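For example (numbers invented for illustration): if the sequential program takes Time(1) = 100 s and the parallel version on p = 4 processors takes Time(4) = 40 s, then Speedup(4) = 100/40 = 2.5, well short of the ideal 4.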
Slide 21 (PCOD: Lecture 2)
Steps in creating a parallel program
4 steps: decomposition, assignment, orchestration, mapping
• Done by the programmer or by system software (compiler, runtime, ...)
• Issues are the same, so just assume the programmer does it

[Figure: sequential computation -> (decomposition) -> tasks -> (assignment) -> processes P0..P3 -> (orchestration) -> parallel program on p0..p3 -> (mapping) -> processors p0..p3. Partitioning = decomposition + assignment. Decomposition and assignment are largely architecture independent; orchestration and mapping are largely architecture dependent]
Slide 22 (PCOD: Lecture 2)
Some important concepts
Task: arbitrary piece of the total work in a parallel computation
• Executed sequentially; concurrency is only across tasks
• Fine-grained versus coarse-grained tasks
Process (thread): entity that is eventually executed by a CPU
• Abstract entity that performs the tasks assigned to it
• Processes communicate and synchronize to perform their tasks
Processor: physical engine on which a process executes
• Processes virtualize the machine to the programmer: first write the program in terms of processes, then map processes to processors
Slide 23 (PCOD: Lecture 2)
Decomposition
Purpose: break up the computation into tasks to be divided among processes
• Tasks may become available dynamically
• The number of available tasks may vary with time
i.e., identify concurrency and decide the level at which to exploit it
Goal: enough tasks to keep processes busy, but not too many (keep task management reasonable)
• The number of tasks available at a time is an upper bound on achievable speedup
Slide 24 (PCOD: Lecture 2)
Limited concurrency: Amdahl's Law
The most fundamental limitation on parallel speedup: if a fraction s of the sequential execution is inherently serial, then speedup <= 1/s.
Example: 2-phase calculation
• Phase 1: sweep over an n-by-n grid and do some independent computation (time: n²/p)
• Phase 2: sweep again and add each value into a global sum (time: n²)
Improved version; the trick is to divide the second phase into two:
• accumulate into a private sum during the sweep
• add the per-process private sums into the global sum
Parallel time is n²/p + n²/p + p, and speedup is at best

    2pn² / (2n² + p²)
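A quick sanity check with invented numbers, n = p = 1000: the total sequential work is 2n² = 2×10⁶. The naive version takes n²/p + n² ≈ 1.001×10⁶, so its speedup stays near 2 no matter how large p grows (phase 2 is the serial fraction, s ≈ 1/2, so Amdahl's bound is 1/s = 2). The improved version gives 2pn²/(2n² + p²) = 2×10⁹ / 3×10⁶ ≈ 667.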
Slide 25 (PCOD: Lecture 2)
Graphical representation of example
[Figure: concurrency profiles (work done concurrently vs. time) for the three versions of the 2-phase example: (a) entirely sequential, 2n² work at concurrency 1; (b) phase 1 at concurrency p for time n²/p, then phase 2 at concurrency 1 for time n²; (c) both sweeps at concurrency p for time n²/p each, followed by the p-term global-sum accumulation]
Slide 26 (PCOD: Lecture 2)
Assignment
Specify a mechanism to divide the work up among processes.
Goal: balance the work among processes; reduce communication and management costs.
Structured approaches usually work well:
• Code inspection (parallel loops) or understanding of the application
• Well-known heuristics
• Static versus dynamic assignment
Programmers worry about decomposition and assignment first:
• Largely independent of architecture or programming model
• But the cost and complexity of using primitives may affect decisions
As architects, we assume the program does a reasonable job of it.
Slide 27 (PCOD: Lecture 2)
Orchestration
Purpose:
• Naming data, structuring communication and synchronization
• Organizing data structures, scheduling tasks temporally
Goals:
• Reduce the costs of communication and synchronization as seen by processors
• Enhance locality of data references
• Reduce the overhead of parallelism management
Closest to the architecture (and programming model and language):
• Choices depend heavily on the communication abstraction and the efficiency of primitives
• Architects must provide appropriate, efficient primitives
Slide 28 (PCOD: Lecture 2)
Mapping
After orchestration, a parallel program exists. Two aspects of mapping:
• Which processes will run on the same processor, if necessary
• Which process runs on which particular processor
One extreme: space-sharing
• Machine divided into subsets; only one application at a time in a subset
• Processes can be pinned to processors, or the OS can balance workloads
Another extreme: complete resource management control to the OS
• The OS uses the performance techniques we will discuss later
The real world is between the two: the user specifies desires in some aspects, but the system may ignore them.
Slide 29 (PCOD: Lecture 2)
High-level goals
High performance (speedup over the equivalent sequential program), but low resource usage and development effort.
Implications for algorithm designers and architects:
• Algorithm designers: high performance, low resource needs
• Architects: high performance, low cost, reduced programming effort

Table 2.1 Steps in the Parallelization Process and Their Goals

Step           Architecture-Dependent?  Major Performance Goals
Decomposition  Mostly no                Expose enough concurrency, but not too much
Assignment     Mostly no                Balance workload; reduce communication volume
Orchestration  Yes                      Reduce noninherent communication via data locality; reduce communication and synchronization cost as seen by the processor; reduce serialization at shared resources; schedule tasks to satisfy dependences early
Mapping        Yes                      Put related processes on the same processor if necessary; exploit locality in network topology
Slide 30 (PCOD: Lecture 2)
Part II — Textbook Reference: Ch. 2.3
Applying the methodology to an equation solver
Partitioning (Ch. 3.1) [in 1.5 weeks] (= decomposition + assignment)
[Figure: processes (P) exchanging data, with edges labeled "communication cost"]
Slide 31 (PCOD: Lecture 2)
Sequential implementation

10. procedure Solve (A)        /*solve the equation system*/
11.   float **A;               /*A is an (n + 2)-by-(n + 2) array*/
12. begin
13.   int i, j, done = 0;
14.   float diff = 0, temp;
15.   while (!done) do         /*outermost loop over sweeps*/
16.     diff = 0;              /*initialize maximum difference to 0*/
17.     for i ← 1 to n do      /*sweep over nonborder points of grid*/
18.       for j ← 1 to n do
19.         temp = A[i,j];     /*save old value of element*/
20.         A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                  A[i,j+1] + A[i+1,j]);  /*compute average*/
22.         diff += abs(A[i,j] - temp);
23.       end for
24.     end for
25.     if (diff/(n*n) < TOL) then done = 1;
26.   end while
27. end procedure
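The same procedure as compilable C, a sketch under the assumption that the caller allocates and initializes the (n+2)-by-(n+2) grid (including the fixed border rows and columns) and that the tolerance TOL is defined; names mirror the pseudocode.

#include <math.h>
#define TOL 1e-3                 /* convergence tolerance (illustrative value) */

void solve(double **A, int n) {  /* A is (n+2)-by-(n+2); rows/cols 0 and n+1 are the border */
    int done = 0;
    while (!done) {                          /* outermost loop over sweeps */
        double diff = 0.0;
        for (int i = 1; i <= n; i++) {       /* sweep over nonborder points */
            for (int j = 1; j <= n; j++) {
                double temp = A[i][j];       /* save old value of element */
                A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j]
                                 + A[i][j+1] + A[i+1][j]);  /* five-point average */
                diff += fabs(A[i][j] - temp);
            }
        }
        if (diff / ((double)n * n) < TOL)    /* average change small enough? */
            done = 1;
    }
}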
Slide 32 (PCOD: Lecture 2)
Decomposition 1(3)
Inherent concurrency in the loop structure:
• Dependences with the north and west grid points
• The loops are inherently sequential
Inherent concurrency ignoring the loop structure:
• Concurrency O(n) along anti-diagonals, serialization O(n) across anti-diagonals
• Result: load imbalance and many synchronizations
Slide 33 (PCOD: Lecture 2)
Decomposition 2(3)
Inherent concurrency in the algorithm: the red-black ordering
• Different ordering of updates: may converge more quickly or more slowly
• Red sweep and black sweep are each fully parallel
• Global synchronization between them (conservative but convenient)
[Figure: grid checkerboard of red points and black points]
Slide 34 (PCOD: Lecture 2)
Decomposition 3(3)
Ignore the dependences; the solution will converge anyway.
• Decomposition into elements: degree of concurrency n²
• To decompose into rows instead, make the loop on line 18 sequential; degree of concurrency n
• for_all leaves assignment to the system, but with an implicit global synchronization at the end of each for_all loop

15. while (!done) do          /*a sequential loop*/
16.   diff = 0;
17.   for_all i ← 1 to n do   /*a parallel loop nest*/
18.     for_all j ← 1 to n do
19.       temp = A[i,j];
20.       A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                A[i,j+1] + A[i+1,j]);
22.       diff += abs(A[i,j] - temp);
23.     end for_all
24.   end for_all
25.   if (diff/(n*n) < TOL) then done = 1;
26. end while
Slide 35 (PCOD: Lecture 2)
Assignment
Static assignment (given the decomposition into rows):
• block assignment (see figure): reduces communication, may introduce load imbalance
• cyclic assignment: process i is assigned rows i, i+p, i+2p, ...
Dynamic assignment (let the system do it):
• each process grabs a new row when finished with its current row
[Figure: block assignment of contiguous bands of rows to processes P0, P1, P2, ...]
Slide 36 (PCOD: Lecture 2)
Solver under data parallel model

10.  procedure Solve(A)         /*solve the equation system*/
11.    float **A;               /*A is an (n + 2)-by-(n + 2) array*/
12.  begin
13.    int i, j, done = 0;
14.    float mydiff = 0, temp;
14a.   DECOMP A[BLOCK,*, nprocs];
15.    while (!done) do         /*outermost loop over sweeps*/
16.      mydiff = 0;            /*initialize maximum difference to 0*/
17.      for_all i ← 1 to n do  /*sweep over non-border points of grid*/
18.        for_all j ← 1 to n do
19.          temp = A[i,j];     /*save old value of element*/
20.          A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                   A[i,j+1] + A[i+1,j]);  /*compute average*/
22.          mydiff += abs(A[i,j] - temp);
23.        end for_all
24.      end for_all
24a.     REDUCE (mydiff, diff, ADD);
25.      if (diff/(n*n) < TOL) then done = 1;
26.    end while
27.  end procedure

Important observations:
• The matrix is shared across processes
• All processes do the same operation in parallel, in lock-step
• Orchestration is easy: no explicit communication or synchronization
Three primitives:
• DECOMP does assignment
• for_all distributes work
• REDUCE accumulates each local sum into the global sum
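A rough OpenMP analogue in C (my mapping, not the slide's data-parallel language): the parallel for plays for_all, reduction(+:d) plays REDUCE(mydiff, diff, ADD), and the distribution of iterations (the job of DECOMP) is left to the runtime's default schedule. Like the slide's version, it deliberately ignores the sweep's dependences.

#include <math.h>

/* one sweep over the grid; returns the accumulated change via *diff, as REDUCE would */
void sweep(double **A, int n, double *diff) {
    double d = 0.0;
    #pragma omp parallel for reduction(+:d)   /* for_all over rows */
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++) {
            double temp = A[i][j];
            A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j]
                             + A[i][j+1] + A[i+1][j]);
            d += fabs(A[i][j] - temp);
        }
    *diff = d;
}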
Slide 37 (PCOD: Lecture 2)
Solver under SAS model 1(3)
• All processes have separate control flow (in this example they do the same operations: the SPMD model)
• Assignment is controlled by loop indices
• All processes share the matrix but do not work in lock-step; orchestration focuses on synchronization
[Figure: one Solve process per processor sweeping its portion of the grid, then jointly testing convergence]
Slide 38 (PCOD: Lecture 2)
Solver under SAS model 2(3)
Code for a single process. Main changes to the program: assignment and synchronizations.
Synchronizations:
• BARRIER: catch all processes before any is allowed to proceed
• LOCK/UNLOCK: enforce mutual exclusion

10.  procedure Solve(A)
11.    float **A;               /*A is the entire (n+2)-by-(n+2) shared array, as in the sequential program*/
12.  begin
13.    int i, j, pid, done = 0;
14.    float temp, mydiff = 0;            /*private variables*/
14a.   int mymin = 1 + (pid * n/nprocs);  /*assume that n is exactly divisible by*/
14b.   int mymax = mymin + n/nprocs - 1;  /*nprocs for simplicity here*/
15.    while (!done) do                   /*outermost loop over sweeps*/
16.      mydiff = diff = 0;               /*set global diff to 0 (okay for all to do it)*/
16a.     BARRIER(bar1, nprocs);           /*ensure all reach here before anyone modifies diff*/
17.      for i ← mymin to mymax do        /*for each of my rows*/
18.        for j ← 1 to n do              /*for all nonborder elements in that row*/
19.          temp = A[i,j];
20.          A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                   A[i,j+1] + A[i+1,j]);
22.          mydiff += abs(A[i,j] - temp);
23.        endfor
24.      endfor
25a.     LOCK(diff_lock);                 /*update global diff if necessary*/
25b.     diff += mydiff;
25c.     UNLOCK(diff_lock);
25d.     BARRIER(bar1, nprocs);           /*ensure all reach here before checking if done*/
25e.     if (diff/(n*n) < TOL) then done = 1;  /*check convergence; all get same answer*/
25f.     BARRIER(bar1, nprocs);
26.    endwhile
27.  end procedure
Slide 39 (PCOD: Lecture 2)
Mutual exclusion 1(2)
Code each process executes:

load  r1 ← diff
add   r1, r2, r1
store diff ← r1

A possible interleaving:

P1                                        P2
r1 ← diff    {P1 gets 0 in its r1}
                                          r1 ← diff    {P2 also gets 0}
r1 ← r1+r2   {P1 sets its r1 to 1}
                                          r1 ← r1+r2   {P2 sets its r1 to 1}
diff ← r1    {P1 sets diff to 1}
                                          diff ← r1    {P2 also sets diff to 1}

Each process's set of operations must be atomic (mutually exclusive).
Slide 40 (PCOD: Lecture 2)
Mutual exclusion 2(2)
Provided by LOCK-UNLOCK around critical sections (code segments requiring mutual exclusion):
• the set of operations we want to execute atomically
• LOCK/UNLOCK implementations must guarantee mutual exclusion (atomicity)
Can lead to significant serialization under contention:
• since non-local accesses are expected in the critical section
• another reason to use a private mydiff variable for partial accumulation
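A minimal C sketch of the same idea with POSIX threads (my mapping of the slide's LOCK/UNLOCK onto pthread_mutex): accumulating into a private mydiff first shrinks the critical section to a single addition.

#include <pthread.h>

static double diff = 0.0;                                     /* shared */
static pthread_mutex_t diff_lock = PTHREAD_MUTEX_INITIALIZER;

/* called once per process at the end of its sweep */
void add_to_global_diff(double mydiff) {
    pthread_mutex_lock(&diff_lock);     /* LOCK(diff_lock)   */
    diff += mydiff;                     /* critical section  */
    pthread_mutex_unlock(&diff_lock);   /* UNLOCK(diff_lock) */
}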
Slide 41 (PCOD: Lecture 2)
Global event synchronization
BARRIER(nprocs): wait here until nprocs processes arrive
• Built using lower-level primitives
• Used to separate phases of computation

Process P_1            Process P_2            Process P_nprocs
set up eqn system      set up eqn system      set up eqn system
Barrier(name, nprocs)  Barrier(name, nprocs)  Barrier(name, nprocs)
solve eqn system       solve eqn system       solve eqn system
Barrier(name, nprocs)  Barrier(name, nprocs)  Barrier(name, nprocs)
apply results          apply results          apply results
Barrier(name, nprocs)  Barrier(name, nprocs)  Barrier(name, nprocs)

A conservative form of preserving dependences, but easy to use.
Point-to-point event synchronization is also possible (see text).
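The same phase structure as runnable C with POSIX threads (an illustration; the three phase functions are hypothetical stand-ins for the slide's phases):

#include <pthread.h>
#include <stdio.h>
#define NPROCS 4

static pthread_barrier_t bar;

/* hypothetical stand-ins for the slide's three phases */
static void set_up_eqn_system(long id) { printf("P%ld: set up\n", id); }
static void solve_eqn_system(long id)  { printf("P%ld: solve\n", id); }
static void apply_results(long id)     { printf("P%ld: apply\n", id); }

static void *worker(void *arg) {
    long id = (long)arg;
    set_up_eqn_system(id);
    pthread_barrier_wait(&bar);   /* nobody solves until everyone has set up */
    solve_eqn_system(id);
    pthread_barrier_wait(&bar);   /* nobody applies until everyone has solved */
    apply_results(id);
    pthread_barrier_wait(&bar);   /* mirrors the slide's third barrier */
    return NULL;
}

int main(void) {
    pthread_t t[NPROCS];
    pthread_barrier_init(&bar, NULL, NPROCS);
    for (long i = 0; i < NPROCS; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NPROCS; i++) pthread_join(t[i], NULL);
    pthread_barrier_destroy(&bar);
    return 0;
}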
Slide 42 (PCOD: Lecture 2)
Solver under message passing 1(2)
Private versus shared address space causes many differences from the SAS (shmem) model:
• Cannot declare A to be a shared array anymore
• Need to compose it logically from per-process private arrays, usually allocated in accordance with the assignment of work: a process assigned a set of rows allocates them locally
• Transfers of entire rows between traversals
Structurally similar to SAS, but the orchestration is different:
• data structures and data access/naming
• communication
• synchronization
Slide 43 (PCOD: Lecture 2)
Solver under message passing 2(2)
Main changes to the program:
• Assignment and distribution of data
• Explicit communication of results (naming)
• Synchronization
Primitives:
• SEND: copies data from local to remote
• RECEIVE: copies data from remote to local

10.  procedure Solve()
11.  begin
13.    int i, j, pid, n' = n/nprocs, done = 0;
14.    float temp, tempdiff, mydiff = 0;  /*private variables*/
6.     myA ← malloc(a 2-d array of size [n/nprocs + 2] by n+2);
                                          /*my assigned rows of A*/
7.     initialize(myA);                   /*initialize my rows of A, in an unspecified way*/
15.    while (!done) do
16.      mydiff = 0;                      /*set local diff to 0*/
16a.     if (pid != 0) then SEND(&myA[1,0],n*sizeof(float),pid-1,ROW);
16b.     if (pid != nprocs-1) then SEND(&myA[n',0],n*sizeof(float),pid+1,ROW);
16c.     if (pid != 0) then RECEIVE(&myA[0,0],n*sizeof(float),pid-1,ROW);
16d.     if (pid != nprocs-1) then RECEIVE(&myA[n'+1,0],n*sizeof(float),pid+1,ROW);
         /*border rows of neighbors have now been copied into myA[0,*] and myA[n'+1,*]*/
17.      for i ← 1 to n' do               /*for each of my (nonghost) rows*/
18.        for j ← 1 to n do              /*for all nonborder elements in that row*/
19.          temp = myA[i,j];
20.          myA[i,j] = 0.2 * (myA[i,j] + myA[i,j-1] + myA[i-1,j] +
21.                     myA[i,j+1] + myA[i+1,j]);
22.          mydiff += abs(myA[i,j] - temp);
23.        endfor
24.      endfor
         /*communicate local diff values and determine if done; can be replaced by reduction and broadcast*/
25a.     if (pid != 0) then               /*process 0 holds global total diff*/
25b.       SEND(mydiff,sizeof(float),0,DIFF);
25c.       RECEIVE(done,sizeof(int),0,DONE);
25d.     else                             /*pid 0 does this*/
25e.       for i ← 1 to nprocs-1 do       /*for each other process*/
25f.         RECEIVE(tempdiff,sizeof(float),*,DIFF);
25g.         mydiff += tempdiff;          /*accumulate into total*/
25h.       endfor
25i.       if (mydiff/(n*n) < TOL) then done = 1;
25j.       for i ← 1 to nprocs-1 do       /*for each other process*/
25k.         SEND(done,sizeof(int),i,DONE);
25l.       endfor
25m.     endif
26.    endwhile
27.  end procedure
Slide 44 (PCOD: Lecture 2)
Send and Receive alternatives

Send/Receive
  Synchronous
  Asynchronous
    Blocking asynchronous
    Nonblocking asynchronous

• The choice affects event synchronization, ease of programming, and performance
• Synchronous messages provide built-in synchronization through matching
• Synchronous messages are prone to deadlock
• Functionality extensions: stride, scatter-gather, groups
• Semantics: based on when control is returned; they dictate when data structures or buffers can be reused at either end
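This taxonomy maps directly onto MPI's send flavors; a C sketch with invented buffer/destination parameters (the MPI names are real, the mapping to the slide's terms is my reading):

#include <mpi.h>

void send_variants(double *buf, int n, int dest, int tag) {
    /* Synchronous: completes only once a matching receive has started,
       giving built-in pairwise synchronization (and deadlock if two
       ranks Ssend to each other first). */
    MPI_Ssend(buf, n, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);

    /* Blocking asynchronous: returns as soon as buf may be reused;
       the library may have buffered the message internally. */
    MPI_Send(buf, n, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);

    /* Nonblocking asynchronous: returns immediately; buf must not be
       modified until the request completes. */
    MPI_Request req;
    MPI_Isend(buf, n, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);
    /* ...overlap independent computation here... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}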
Slide 45 (PCOD: Lecture 2)
Summary
• Decomposition and assignment are similar in SAS and message passing
• Orchestration is different: data structures, data access/naming, communication, synchronization
• Requirements for performance are another story: stay tuned for more

                                      SAS       Msg-Passing
Explicit global data structure?       Yes       No
Assignment indept. of data layout?    Yes       No
Communication                         Implicit  Explicit
Synchronization                       Explicit  Implicit
Explicit replication of border rows?  No        Yes
Slide 46 (PCOD: Lecture 1)
INFO
• I need your names and email addresses!
• Class on MONDAY 1/24 at 3:15, place TBA
• NO CLASS T/Th 1/25, 1/27 (SAM away)
• NO LAB/EXERCISES THIS WEEK
• NO LAB THURSDAY 2/3 (we'll start week 3)
• Exercises next WEDNESDAY 2/2 at 10:00, place TBA
• Class on MONDAY 1/31 at 3:15, place TBA
• Web page coming over the weekend: http://www.cse.chalmers.se/~mckee/courses/EDA281.html
• Books are being copied; will distribute on Monday
• NO EXAM; final survey papers instead
• [email protected], [email protected]