Definitions

• A synchronous application is one where all processes must reach certain points before execution continues.
• Local synchronization is a requirement that a subset of processes (usually neighbors) reach a synchronization point before execution continues.
• A barrier is the basic message-passing mechanism for synchronizing processes.
• Deadlock occurs when groups of processes wait permanently for messages that can never be satisfied because the sending processes are themselves permanently waiting for messages.
Barrier Illustration

[Figure: five processes reach the barrier at different times; early arrivals wait until all have arrived, after which all resume executing.]

C:       MPI_Barrier(MPI_COMM_WORLD);
mpiJava: MPI.COMM_WORLD.Barrier();
Counter (Linear) Barrier

Master processor:
    FOR (i = 0; i < P; i++)    // Arrival phase
        Receive null message from any processor
    FOR (i = 0; i < P; i++)    // Departure phase
        Send null message to release slaves

Slave processors:
    Send null message to enter barrier
    Receive null message for barrier release

Note: The two-phase logic prevents a processor from arriving at the barrier again before the prior release completes.
Tree (Non-linear) Barrier

[Figure: eight processes P0–P7 synchronize in a binary tree. In the entry phase, arrival signals combine pairwise up the tree; in the release phase, release signals fan back down.]

Note: The implementation logic is similar to divide and conquer.
Butterfly Barrier

– Stage 1: P0↔P1; P2↔P3; P4↔P5; P6↔P7
– Stage 2: P0↔P2; P1↔P3; P4↔P6; P5↔P7
– Stage 3: P0↔P4; P1↔P5; P2↔P6; P3↔P7

[Figure: P0–P7 in a row, with the pairwise exchange links drawn for each of the three stages.]
Local Synchronization

• Even-numbered processors:
    Send null message to processor i-1
    Receive null message from processor i-1
    Send null message to processor i+1
    Receive null message from processor i+1
• Odd-numbered processors:
    Receive null message from processor i+1
    Send null message to processor i+1
    Receive null message from processor i-1
    Send null message to processor i-1
• Notes:
    – Local synchronization is an incomplete barrier: processors exit after receiving messages from their neighbors.
    – Deadlock can occur if the message-passing order is incorrect.
    – MPI_Sendrecv() and MPI_Sendrecv_replace() are deadlock free.

Synchronize with neighbors before proceeding
Local Synchronization Example

• Heat Distribution Problem
    – Goal: determine the final temperature at each point of an n x n grid.
    – Initial boundary condition: the temperatures along the grid edges are known.
    – A process cannot proceed to the next iteration until local synchronization completes.

DO
    Average each grid point with its neighbors
UNTIL temperature changes are small enough

New value = (∑ neighbors) / 4
Sequential Heat Distribution Code

Initialize rows 0,n and columns 0,n of g and h
iteration = 0
DO
    FOR (i = 1; i < n; i++)
        FOR (j = 1; j < n; j++)
            IF (iteration % 2) h[i][j] = (g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1]) / 4
            ELSE               g[i][j] = (h[i-1][j] + h[i+1][j] + h[i][j-1] + h[i][j+1]) / 4
    iteration++
UNTIL max(|g[i][j] – h[i][j]|) < tolerance or iteration > MAX
Block or Strip Partitioning

[Figure: a 4 x 4 arrangement of blocks assigned to p0–p15, and the same grid divided into eight column strips assigned to p0–p7.]

• Block partitioning
    – Eight messages exchanged at each iteration
    – Data exchanged per message is n/P^(1/2)
• Strip partitioning
    – Four messages exchanged at each iteration
    – Data exchanged per message is n
• Question: Which is better?

Assign portions of the grid to processors in the topology
Parallel Implementation

• Algorithm modifications
    – Declare “ghost” rows to hold adjacent data (a 10 x 10 array for an 8 x 8 block)
    – Exchange data with neighbor processors
    – Perform the calculation for the local grid cells

[Figure: processor Pi’s block surrounded by ghost cells exchanged with the cells to the north, south, east, and west.]
Heat Distribution Partitioning

SendRcv(row, col)
    IF row,col is not local
        IF myrank is even
            Send(point, p[row,col])
            Recv(point, p[row,col])
        ELSE
            Recv(point, p[row,col])
            Send(point, p[row,col])

Main logic
FOR each iteration
    FOR each point, compute new temperature
    SendRcv(row-1, col, point)
    SendRcv(row+1, col, point)
    SendRcv(row, col-1, point)
    SendRcv(row, col+1, point)
Fully Synchronized Example

• Data parallel computations: simultaneously apply the same operation to different data
• Sequential code:
    for (i = 1; i < n; i++) a[i] = someFunction(a[i]);
• Shared memory code:
    forall (i = 0; i < n; i++) { bodyOfInstructions }
    – In these cases, the for loop is a natural barrier
• Distributed processing:
    for local a[i] { someFunction(a[i]); }
    barrier();
Data Parallel Example

[Figure: processors p0 … pn each apply the update to their own element in lock step: A[0] += k, A[1] += k, …, A[n-1] += k; collectively, A[] += k.]

• All processors execute instructions in “lock step”
• forall (i = 0; i < n; i++) a[i] += k;

Note: Multicomputer configurations partition the numbers into blocks
Prefix Sum Problem

• Definition: Given numbers a[i], 0 <= i < n, the prefix sum sets a[i] = a[0] + a[1] + … + a[i]
• Application: radix sort
• Sequential code (i runs downward so each a[i - 2^j] read is still the unmodified value):
    for (j = 0; j < lg(n); j++)
        for (i = n-1; i >= 2^j; i--)
            a[i] += a[i - 2^j];
• Parallel shared memory code:
    for (j = 0; j < lg(n); j++)
        forall (i = 2^j; i < n; i++)
            a[i] += a[i - 2^j];
• Parallel distributed memory code:
    for (j = 1; j <= lg(n); j++) {
        if (myrank + 2^(j-1) < n)
            send(a[myrank], myrank + 2^(j-1));
        if (myrank >= 2^(j-1)) {
            receive(sum, myrank - 2^(j-1));
            a[myrank] += sum;
        }
    }

Note: The prefix sum algorithm works for any associative operation
Prefix Sum Illustration

[Figure: the doubling steps of the prefix-sum computation across the lg(n) stages.]
Synchronous Iteration

• Processes synchronize at each iteration step
• Example: simulation of natural processes
• Shared memory code:
    for (j = 0; j < n; j++)
        forall (i = 0; i < N; i++)
            body(i);
• Distributed memory code:
    for (j = 0; j < n; j++) {
        body(myRank);
        barrier();
    }
Example: n equations of n unknowns

a_{n-1,0}x_0 + a_{n-1,1}x_1 + … + a_{n-1,n-1}x_{n-1} = b_{n-1}
∙∙∙
a_{k,0}x_0 + a_{k,1}x_1 + … + a_{k,n-1}x_{n-1} = b_k
∙∙∙
a_{1,0}x_0 + a_{1,1}x_1 + … + a_{1,n-1}x_{n-1} = b_1
a_{0,0}x_0 + a_{0,1}x_1 + … + a_{0,n-1}x_{n-1} = b_0

• Or rewrite each equation to isolate x_k:
    x_k = (b_k – a_{k,0}x_0 – … – a_{k,k-1}x_{k-1} – a_{k,k+1}x_{k+1} – … – a_{k,n-1}x_{n-1}) / a_{k,k}
        = (b_k – ∑_{j≠k} a_{k,j} x_j) / a_{k,k}
Jacobi Iteration

xnew_i = initial guess
DO
    x_i = xnew_i
    xnew_i = calculated next guess
UNTIL ∑_i |xnew_i – x_i| < tolerance

• Jacobi iteration always converges if the matrix is diagonally dominant:
    |a_{k,k}| > ∑_{j≠k} |a_{k,j}|

[Figure: the error shrinking from iteration i to i+1 as x_i approaches the solution.]
Parallel Jacobi Code

x_i = b_i
DO for each i
    sum = -a_{i,i} * x_i
    FOR (j = 0; j < n; j++)
        sum += a_{i,j} * x_j
    xnew_i = (b_i – sum) / a_{i,i}
    allgather(xnew_i)
    barrier()
UNTIL iterations > MAX or ∑_i |xnew_i – x_i| < tolerance

[Figure: Allgather() distributes each processor’s xnew_i into every processor’s copy of x: xnew_0, xnew_1, …, xnew_{n-1} → x_i.]
Additional Jacobi Notes

• What if P (processor count) < n?
    – Answer: allocate blocks of variables to processors
• Block allocation: allocate consecutive x_i to processors
• Cyclic allocation:
    – Allocate x_0, x_P, … to p0
    – Allocate x_1, x_{P+1}, … to p1, etc.
• Question: Which allocation scheme is better?

[Figure: Jacobi performance — time versus processor count (4 to 24), with separate computation and communication curves.]
Cellular Automata

Definition:
• The system has a finite grid of cells
• Each cell can assume a finite number of states
• Neighbor cells affect a cell according to a rule set
• All cell state changes occur simultaneously
• The system iterates through a number of generations

Serious applications:
• Fluid and gas dynamics
• Biological growth
• Airplane wing airflow
• Erosion modeling
• Groundwater pollution
Conway’s Game of Life

• The grid is a two-dimensional array of cells
    – The grid ends can optionally wrap around (like a torus)
• Each cell
    – Can hold one “organism”
    – Has eight neighbor cells: north, northeast, east, southeast, south, southwest, west, northwest
• Rules (run the simulation over many generations)
    1. An organism dies from loneliness if 0 or 1 organisms live in neighbor cells
    2. An organism survives if 2 or 3 organisms live in neighbor cells
    3. An empty cell with exactly 3 living neighbors gives birth to a new organism
    4. An organism dies from overpopulation if 4 or more organisms live in neighbor cells
Sharks and Fishes

• The grid (ocean) is modeled by a three-dimensional array
    – The grid ends can optionally wrap around (like a torus)
• Each cell
    – Can hold either a fish or a shark, but not both
    – Has twenty-six neighbor cells
• Rules for fish
    1. Fish move randomly to empty adjacent cells
    2. If there are no empty adjacent cells, fish stay put
    3. Fish of breeding age leave a baby fish in the vacated cell
    4. Fish die after x generations
• Rules for sharks
    1. Sharks move randomly to adjacent cells containing fish, eating the fish
    2. If no adjacent cells contain fish, the shark moves randomly to an empty cell, staying put if there are no empty cells
    3. Sharks of breeding age leave a baby shark in the vacated cell
    4. Sharks die if they don’t eat a fish for y generations