Decomposition, Locality and (finishing synchronization lecture) Barriers
Kathy Yelick, [email protected]
www.cs.berkeley.edu/~yelick/cs194f07
Lecture Schedule for Next Two Weeks
• Fri 9/14 11-12:00
• Mon 9/17 10:30-11:30 Discussion (11:30-12 as needed)
• Wed 9/19 10:30-12:00 Decomposition and Locality
• Fri 9/21 10:30-12:00 NVIDIA Lecture
• Mon 9/24 10:30-12:00 NVIDIA lecture (David Kirk)
• Tue 9/25 3-4:30 NVIDIA research talk in Woz (optional)
• Wed 9/26 10:30-11:30 Discussion
• Fri 9/27 11-12 Lecture topic TBD
Outline
• Task Decomposition
  • Review parallel decomposition and task graphs
  • Styles of decomposition
  • Extending task graphs with interaction information
• Example: Sharks and Fish (Wator)
  • Parallel vs. sequential versions
• Data decomposition
  • Partitioning rectangular grids (like matrix multiply)
  • Ghost regions (unlike matrix multiply)
• Bulk-synchronous programming
  • Barrier synchronization
Designing Parallel Algorithms
• Parallel software design starts with decomposition
• Decomposition techniques
  • Recursive decomposition
  • Data decomposition: input, output, or intermediate data
  • And others
• Characteristics of tasks and interactions
  • Task generation, granularity, and context
  • Characteristics of task interactions
Recursive Decomposition: Example
• Consider parallel quicksort: once the array is partitioned, each subarray can be processed in parallel.
Source: Ananth Grama
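A minimal sketch (not from the slides) of this recursive decomposition, written in C with OpenMP tasks; OpenMP and the helper functions are assumptions made only for illustration.

    /* Sketch: recursive decomposition of quicksort using OpenMP tasks. */
    #include <omp.h>

    static void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }

    static int partition(int *a, int lo, int hi) {
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++)
            if (a[j] < pivot) swap(&a[i++], &a[j]);
        swap(&a[i], &a[hi]);
        return i;
    }

    static void quicksort(int *a, int lo, int hi) {
        if (lo >= hi) return;
        int p = partition(a, lo, hi);
        /* Once the array is partitioned, the two subarrays are
         * independent and can be processed in parallel as tasks. */
        #pragma omp task shared(a)
        quicksort(a, lo, p - 1);
        #pragma omp task shared(a)
        quicksort(a, p + 1, hi);
        #pragma omp taskwait
    }

    void sort(int *a, int n) {     /* one parallel region; one thread spawns the root task */
        #pragma omp parallel
        #pragma omp single
        quicksort(a, 0, n - 1);
    }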
Data Decomposition
• Identify the data on which computations are performed.
• Partition this data across the various tasks.
• This partitioning induces a decomposition of the computation, often using the following rule:
  • Owner-computes rule: the thread assigned to a particular data item is responsible for all computation associated with it.
• The owner-computes rule is especially common for output data. Why?
Source: Ananth Grama
Input Data Decomposition
• Recall: apply a function sqr (square) to each element of array A, then compute the sum of the results.
• Conceptualize the decomposition as a task dependency graph:
  • a directed graph with
    • nodes corresponding to tasks
    • edges indicating dependencies: the result of one task is required to process the next.
[Task graph: independent tasks sqr(A[0]), sqr(A[1]), sqr(A[2]), …, sqr(A[n]), all feeding a final sum task]
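A minimal code sketch (not in the slides) of the same task graph: the sqr tasks are independent and the final sum node becomes a reduction; OpenMP is an assumption here.

    /* n independent sqr tasks feeding one sum task, expressed as a reduction. */
    double sum_of_squares(const double *A, int n) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++) {
            double sq = A[i] * A[i];   /* independent task: sqr(A[i]) */
            sum += sq;                 /* edge into the sum task */
        }
        return sum;
    }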
Output Data Decomposition: Example
• Consider matrix multiply C = A * B: partition the output matrix C into submatrices and create one task per output block (Tasks 1-4 in the original figure), each computing its block of C from the corresponding blocks of A and B.
Source: Ananth Grama
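A hedged sketch of this output decomposition (the exact block assignments from the original figure are not reproduced): one task per block of C, following the owner-computes rule. The 2x2 blocking, even n, and OpenMP are assumptions for illustration.

    /* Output data decomposition for C = A * B: each task owns one block of C. */
    void block_task(int n, const double *A, const double *B, double *C,
                    int bi, int bj) {                  /* which block of C (0 or 1 each) */
        int half = n / 2;                              /* assumes n is even */
        for (int i = bi * half; i < (bi + 1) * half; i++)
            for (int j = bj * half; j < (bj + 1) * half; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];
                C[i*n + j] = sum;                      /* all work for the owned data */
            }
    }

    void matmul_output_decomposed(int n, const double *A, const double *B, double *C) {
        #pragma omp parallel for collapse(2)
        for (int bi = 0; bi < 2; bi++)                 /* four tasks, one per output block */
            for (int bj = 0; bj < 2; bj++)
                block_task(n, A, B, C, bi, bj);
    }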
A Model Problem: Sharks and Fish
• Illustration of parallel programming
• Original version (discrete event only) proposed by Geoffrey Fox; called WATOR
• Basic idea: sharks and fish living in an ocean
  • Rules for breeding, eating, and death
  • Could also add: forces in the ocean, between creatures, etc.
• Ocean is toroidal (a 2D donut)
Serial vs. Parallel
• Updating in left-to-right, row-wise order gives a serial algorithm
• It cannot be parallelized as-is because of the dependencies, so instead we use a “red-black” order:
      forall black points grid(i,j)
          update …
      forall red points grid(i,j)
          update …
• For 3D or a general graph, use graph coloring
  • Can use repeated Maximal Independent Sets to color
  • Graph(T) is bipartite => 2-colorable (red and black)
  • Nodes of each color can be updated simultaneously
• Can also use two copies (old and new), but in both cases note that we have changed the behavior of the original algorithm!
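A minimal sketch of the red-black sweep (the 5-point averaging update and OpenMP are assumptions; the real update rule depends on the application): cells of one color have no dependencies among themselves, so each forall becomes a parallel loop.

    /* Red-black sweep over an n x n grid stored with a boundary ring at
     * indices 0 and n+1; cells of one parity class are updated together. */
    void red_black_sweep(int n, double grid[n+2][n+2]) {
        for (int color = 0; color < 2; color++) {      /* black points, then red points */
            #pragma omp parallel for
            for (int i = 1; i <= n; i++)
                for (int j = 1 + (i + color) % 2; j <= n; j += 2)
                    grid[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                         grid[i][j-1] + grid[i][j+1]);
        }
    }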
Parallelism in Wator
• The simulation is synchronous
  • use two copies of the grid (old and new)
  • the value of each new grid cell depends only on 9 cells (itself plus 8 neighbors) in the old grid
  • the simulation proceeds in timesteps; each cell is updated at every step
• Easy to parallelize by dividing the physical domain: Domain Decomposition
  • Locality is achieved by using large patches of the ocean
  • Only boundary values from neighboring patches are needed
• How to pick the shapes of the domains?
[Figure: the ocean divided into a 3x3 grid of patches, assigned to processors P1-P9]
      Repeat
          compute locally to update local system
          barrier()
          exchange state info with neighbors
      until done simulating
Regular Meshes (e.g., Game of Life)
• Suppose the graph is an n x n mesh with connections to NSEW neighbors
• Which partitioning has less communication (less potential false sharing)?
  • 1D strips:  n*(p-1) edge crossings
  • 2D blocks:  2*n*(p^(1/2) - 1) edge crossings
• Minimizing communication on a mesh = minimizing the “surface-to-volume ratio” of the partition
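For a concrete comparison (the numbers here are chosen for illustration; they are not in the slides): with n = 1024 and p = 16, 1D strips give n*(p-1) = 1024*15 = 15,360 edge crossings, while 2D blocks give 2*n*(p^(1/2)-1) = 2*1024*3 = 6,144, so the block partition has the smaller surface-to-volume ratio.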
Ghost Nodes
• Overview of a (possible!) memory hierarchy optimization
• Normally done with a 1-wide ghost region
• Can you imagine a reason for a wider one?
[Figure: to compute the green cells, copy the yellow (ghost) cells from the neighboring patch, then compute the blue cells]
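A minimal sketch of one bulk-synchronous step with 1-wide ghost rows for a 1D row decomposition; the 9-point averaging rule, the small fixed size, and the omission of the left/right ghost columns are simplifications made only for illustration (in a message-passing setting the copies would arrive as messages from the neighboring patches).

    #include <string.h>
    #define N 8                        /* patch holds rows 1..N, ghost rows 0 and N+1 */

    /* Copy the neighbors' boundary rows into the ghost rows, then update
     * the interior using only local (old) data; a barrier would separate
     * this step from the next exchange. */
    void patch_step(double old[N+2][N+2], double new_grid[N+2][N+2],
                    const double north[N+2], const double south[N+2]) {
        memcpy(old[0],   north, sizeof(double) * (N+2));   /* fill ghost row 0   */
        memcpy(old[N+1], south, sizeof(double) * (N+2));   /* fill ghost row N+1 */
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++) {
                double s = 0.0;
                for (int di = -1; di <= 1; di++)            /* itself plus 8 neighbors */
                    for (int dj = -1; dj <= 1; dj++)
                        s += old[i+di][j+dj];
                new_grid[i][j] = s / 9.0;
            }
    }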
Bulk-Synchronous Computation
• With this decomposition, the computation is mostly independent
• Need to synchronize between phases
• Picking up from last time: barrier synchronization
Barriers
• Software algorithms implemented using locks, flags, counters
• Hardware barriers
  • Wired-AND line separate from address/data bus
    • Set input high on arrival, wait for output to go high before leaving
  • In practice, multiple wires to allow reuse
  • Useful when barriers are global and very frequent
  • Difficult to support an arbitrary subset of processors
    • even harder with multiple processes per processor
  • Difficult to dynamically change the number and identity of participants
    • e.g., the latter due to process migration
  • Not common today on bus-based machines
A Simple Centralized Barrier
• Shared counter maintains the number of processes that have arrived: increment on arrival, then check the flag until all p processes have arrived

    struct bar_type {
        int counter;
        struct lock_type lock;
        int flag = 0;
    } bar_name;

    BARRIER (bar_name, p) {
        LOCK(bar_name.lock);
        if (bar_name.counter == 0)
            bar_name.flag = 0;             /* reset flag if first to reach */
        mycount = ++bar_name.counter;      /* mycount is private */
        UNLOCK(bar_name.lock);
        if (mycount == p) {                /* last to arrive */
            bar_name.counter = 0;          /* reset for next barrier */
            bar_name.flag = 1;             /* release waiters */
        }
        else
            while (bar_name.flag == 0) {}  /* busy wait for release */
    }

• What is the problem if barriers are done back-to-back?
A Working Centralized Barrier
• Consecutively entering the same barrier doesn’t work
  • Must prevent a process from entering until all have left the previous instance
  • Could use another counter, but that increases latency and contention
• Sense reversal: wait for the flag to take a different value in consecutive barriers
  • Toggle this value only when all processes have reached the barrier

    BARRIER (bar_name, p) {
        local_sense = !(local_sense);      /* toggle private sense variable */
        LOCK(bar_name.lock);
        mycount = ++bar_name.counter;      /* mycount is private */
        if (bar_name.counter == p) {       /* last to arrive */
            UNLOCK(bar_name.lock);
            bar_name.counter = 0;          /* reset for next barrier */
            bar_name.flag = local_sense;   /* release waiters */
        }
        else {
            UNLOCK(bar_name.lock);
            while (bar_name.flag != local_sense) {}   /* busy wait */
        }
    }
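For context, a usage sketch in the same pseudocode style as the slide (compute_local_patch and exchange_with_neighbors are hypothetical names): local_sense is a private, per-process variable that persists across barrier calls.

    local_sense = 0;                    /* one private copy per process */
    while (!done) {
        compute_local_patch();          /* phase 1: independent local work */
        BARRIER(bar_name, p);           /* toggles local_sense and waits */
        exchange_with_neighbors();      /* phase 2: communication */
        BARRIER(bar_name, p);
    }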
Centralized Barrier Performance
• Latency
  • Centralized has critical path length at least proportional to p
• Traffic
  • About 3p bus transactions
• Storage cost
  • Very low: a centralized counter and flag
• Fairness
  • The same processor should not always be the last to exit the barrier
  • No such bias in the centralized scheme
• Key problems for the centralized barrier are latency and traffic
  • Especially with distributed memory, since all traffic goes to the same node
Improved Barrier Algorithms for a Bus
• Software combining tree
  • Only k processors access the same location, where k is the degree of the tree
  [Figure: flat structure (contention at one location) vs. tree structure (little contention)]
• Separate arrival and exit trees, and use sense reversal
• Valuable in a distributed network: communicate along different paths
• On a bus, all traffic goes on the same bus, and there is no less total traffic
• Higher latency (log p steps of work, and O(p) serialized bus transactions)
• Advantage on a bus is the use of ordinary reads/writes instead of locks
Combining Tree Barrier
Tree-Based Barrier Summary
Tree-Based Barrier Cost
Latency
Traffic
Memory
Dissemination Based Barrier
Dissemination Barrier
Latency
Traffic
Memory
Barrier Performance on SGI Challenge
• Centralized does quite well
• Will discuss fancier barrier algorithms for distributed machines
• Helpful hardware support: piggybacking of read misses on the bus
  • Also useful for spinning on highly contended locks
[Plot: barrier time (µs), 0-35, vs. number of processors, 1-8, comparing Centralized, Combining tree, Tournament, and Dissemination barriers]
Synchronization Summary
• Rich interaction of hardware-software tradeoffs
• Must evaluate hardware primitives and software algorithms together
  • the primitives determine which algorithms perform well
• Evaluation methodology is challenging
  • use of delays, microbenchmarks
  • should use both microbenchmarks and real workloads
• Simple software algorithms with common hardware primitives do well on a bus
• Will see more sophisticated techniques for distributed machines
• Hardware support is still a subject of debate
  • theoretical research argues for swap or compare&swap, not fetch&op
  • algorithms that ensure constant-time access, but complex
References
• John Mellor-Crummey’s 422 Lecture Notes
Tree-Based Computation
• The broadcast and reduction operations in MPI are a good example of tree-based algorithms
  • Reductions: take n inputs and produce 1 output
  • Broadcast: take 1 input and produce n outputs
• What can we say about such computations in general?
A log n lower bound to compute any function of n variables
• Assume we can only use binary operations, one per time unit
• After 1 time unit, an output can depend on only two inputs
• Use induction to show that after k time units, an output can depend on at most 2^k inputs
• After log2(n) time units, an output can depend on at most n inputs, so any function of all n variables needs at least log2(n) time
• A binary tree performs such a computation
Broadcasts and Reductions on Trees
Parallel Prefix, or Scan
• If “+” is an associative operator, and x[0],…,x[p-1] are the input data, then the parallel prefix operation computes
      y[j] = x[0] + x[1] + … + x[j]   for j = 0,1,…,p-1
• Notation: j:k means x[j] + x[j+1] + … + x[k]  (in the original figure, blue marks the final values)
Mapping Parallel Prefix onto a Tree - Details
• Up-the-tree phase (from leaves to root)
  1) Get values L and R from the left and right children
  2) Save L in a local register Lsave
  3) Pass the sum L+R to the parent
  • By induction, Lsave = sum of all leaves in the left subtree
• Down-the-tree phase (from root to leaves)
  1) Get value S from the parent (the root gets 0)
  2) Send S to the left child
  3) Send S + Lsave to the right child
  • By induction, S = sum of all leaves to the left of the subtree rooted at the parent
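A minimal sketch (sequential simulation, integer addition assumed) of these two phases; the recursion mirrors the tree, with Lsave stored per internal node, and it produces the inclusive prefix y[j] = x[0] + … + x[j].

    #include <stdio.h>

    /* Up phase: return the sum of x[lo..hi]; record the left-subtree sum. */
    static int up(int *x, int *lsave, int lo, int hi, int node) {
        if (lo == hi) return x[lo];
        int mid = (lo + hi) / 2;
        int L = up(x, lsave, lo, mid, 2*node);
        int R = up(x, lsave, mid + 1, hi, 2*node + 1);
        lsave[node] = L;                         /* save L locally, pass L+R to parent */
        return L + R;
    }

    /* Down phase: S = sum of all leaves to the left of this subtree. */
    static void down(int *x, int *lsave, int *y, int lo, int hi, int node, int S) {
        if (lo == hi) { y[lo] = S + x[lo]; return; }        /* inclusive prefix at leaf */
        int mid = (lo + hi) / 2;
        down(x, lsave, y, lo, mid, 2*node, S);              /* left child gets S */
        down(x, lsave, y, mid + 1, hi, 2*node + 1, S + lsave[node]); /* right gets S+Lsave */
    }

    int main(void) {
        int x[8] = {1,2,3,4,5,6,7,8}, y[8], lsave[8] = {0};
        up(x, lsave, 0, 7, 1);
        down(x, lsave, y, 0, 7, 1, 0);                      /* the root gets 0 */
        for (int i = 0; i < 8; i++) printf("%d ", y[i]);    /* 1 3 6 10 15 21 28 36 */
        printf("\n");
        return 0;
    }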
E.g., Fibonacci via Matrix Multiply Prefix
• Consider computing the Fibonacci numbers:  Fn+1 = Fn + Fn-1
• Each step can be viewed as a matrix multiplication:
      [ Fn+1 ]   [ 1  1 ] [ Fn   ]
      [ Fn   ] = [ 1  0 ] [ Fn-1 ]
• Can compute all Fn by matmul_prefix on
      [ M, M, M, M, M, M, M, M, M ]   where M = [ 1 1 ; 1 0 ],
  then select the upper left entry of each prefix product
Derived from: Alan Edelman, MIT
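A small sketch (a sequential loop stands in for matmul_prefix; long integers assumed) that forms the running products of M = [1 1; 1 0] and reads off the Fibonacci numbers from the upper left entry.

    #include <stdio.h>

    int main(void) {
        long P[2][2] = {{1, 0}, {0, 1}};                /* running prefix product, starts at I */
        const long M[2][2] = {{1, 1}, {1, 0}};
        for (int k = 1; k <= 10; k++) {
            long Q[2][2];
            for (int i = 0; i < 2; i++)                 /* P = P * M: the k-th prefix product M^k */
                for (int j = 0; j < 2; j++)
                    Q[i][j] = P[i][0]*M[0][j] + P[i][1]*M[1][j];
            P[0][0] = Q[0][0]; P[0][1] = Q[0][1];
            P[1][0] = Q[1][0]; P[1][1] = Q[1][1];
            printf("F(%d) = %ld\n", k + 1, P[0][0]);    /* upper left entry of M^k is F(k+1) */
        }
        return 0;
    }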
Adding two n-bit integers in O(log n) time
• Let a = a[n-1]a[n-2]…a[0] and b = b[n-1]b[n-2]…b[0] be two n-bit binary numbers
• We want their sum s = a + b = s[n]s[n-1]…s[0]
• The serial algorithm computes the carry bits c[i] one at a time:
      c[-1] = 0                                                        … rightmost carry bit
      for i = 0 to n-1
          c[i] = ( (a[i] xor b[i]) and c[i-1] ) or ( a[i] and b[i] )   … next carry bit
          s[i] = ( a[i] xor b[i] ) xor c[i-1]
• Challenge: compute all c[i] in O(log n) time via parallel prefix
      for all (0 <= i <= n-1)  p[i] = a[i] xor b[i]                    … propagate bit
      for all (0 <= i <= n-1)  g[i] = a[i] and b[i]                    … generate bit
      c[i] = ( p[i] and c[i-1] ) or g[i],  i.e.

      [ c[i] ]   [ p[i]  g[i] ] [ c[i-1] ]
      [  1   ] = [  0     1   ] [   1    ]  =  C[i] * [ c[i-1] ; 1 ]   … 2-by-2 Boolean matrix multiplication (associative)
                                            =  C[i] * C[i-1] * … * C[0] * [ 0 ; 1 ]
      … evaluate each P[i] = C[i] * C[i-1] * … * C[0] by parallel prefix
• Used in all computers to implement addition: carry look-ahead
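A small sketch (sequential, LSB-first bit arrays assumed) of the propagate/generate recurrence above; a carry look-ahead adder replaces this left-to-right carry loop with a parallel prefix over the 2-by-2 Boolean matrices C[i].

    #include <stdio.h>

    int main(void) {
        int a[4] = {1, 1, 0, 1};            /* a = 1011 in binary (LSB first) = 11 */
        int b[4] = {1, 0, 1, 0};            /* b = 0101 in binary (LSB first) = 5  */
        int n = 4, s[5], carry = 0;         /* carry plays the role of c[i-1]      */
        for (int i = 0; i < n; i++) {
            int p = a[i] ^ b[i];            /* propagate bit p[i] */
            int g = a[i] & b[i];            /* generate bit  g[i] */
            s[i]  = p ^ carry;              /* s[i] = (a[i] xor b[i]) xor c[i-1]   */
            carry = (p & carry) | g;        /* c[i] = (p[i] and c[i-1]) or g[i]    */
        }
        s[n] = carry;
        for (int i = n; i >= 0; i--) printf("%d", s[i]);   /* prints 10000 = 16 */
        printf("\n");
        return 0;
    }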
Other applications of scans
• There are several applications of scans, some more obvious than others
  • add multi-precision numbers (represented as arrays of numbers)
  • evaluate recurrences and expressions
  • solve tridiagonal systems (numerically unstable!)
  • implement bucket sort and radix sort
  • dynamically allocate processors
  • search for regular expressions (e.g., grep)
• Names: +\ (APL), cumsum (Matlab), MPI_SCAN
• Note: 2n operations are used when only n-1 are needed
Evaluating arbitrary expressions
• Let E be an arbitrary expression formed from +, -, *, /, parentheses, and n variables, where each appearance of each variable is counted separately
• Can think of E as arbitrary expression tree with n leaves (the variables) and internal nodes labeled by +, -, * and /
• Theorem (Brent): E can be evaluated in O(log n) time, if we reorganize it using laws of commutativity, associativity and distributivity
• Sketch of a (modern) proof: evaluate the expression tree E greedily by
  • collapsing all leaves into their parents at each time step
  • evaluating all “chains” in E with parallel prefix
Multiplying n-by-n matrices in O(log n) time
• For all (1 <= i,j,k <= n)   P(i,j,k) = A(i,k) * B(k,j)
  • cost = 1 time unit, using n^3 processors
• For all (1 <= i,j <= n)   C(i,j) = Σ k=1..n  P(i,j,k)
  • cost = O(log n) time, using a tree with n^3 / 2 processors
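A minimal sketch (OpenMP assumed; the slides do not specify an API) of the two phases above: the products P(i,j,k) are fully independent, and the sum over k is written as a reduction that, with enough processors, could be evaluated as a log-depth tree.

    /* Two-phase matrix multiply: independent products, then a reduction per C(i,j). */
    void matmul_logdepth(int n, const double *A, const double *B, double *C) {
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;                        /* C(i,j) = sum over k of P(i,j,k) */
                #pragma omp simd reduction(+:sum)
                for (int k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];      /* P(i,j,k) = A(i,k) * B(k,j) */
                C[i*n + j] = sum;
            }
    }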
Evaluating recurrences
• Let x[i] = f_i(x[i-1]), with f_i a rational function and x[0] given
• How fast can we compute x[n]?
• Theorem (Kung): Suppose degree(f_i) = d for all i
  • If d = 1, x[n] can be evaluated in O(log n) time using parallel prefix
  • If d > 1, evaluating x[n] takes Ω(n) time, i.e. no speedup is possible
• Sketch of proof when d = 1:
      x[i] = f_i(x[i-1]) = (a_i * x[i-1] + b_i) / (c_i * x[i-1] + d_i)   can be written as
      x[i] = num[i] / den[i] = (a_i * num[i-1] + b_i * den[i-1]) / (c_i * num[i-1] + d_i * den[i-1]),   or

      [ num[i] ]   [ a_i  b_i ] [ num[i-1] ]
      [ den[i] ] = [ c_i  d_i ] [ den[i-1] ]  =  M_i * [ num[i-1] ; den[i-1] ]  =  M_i * M_{i-1} * … * M_1 * [ num[0] ; den[0] ]

  • Can use parallel prefix with 2-by-2 matrix multiplication
• Sketch of proof when d > 1:
  • degree(x[i]) as a function of x[0] is d^i
  • after k parallel steps, degree(anything) <= 2^k
  • so computing x[i] takes Ω(i) steps
Summary
• Message-passing programming
  • Maps well to large-scale parallel hardware (clusters)
  • Most popular programming model for these machines
  • A few primitives are enough to get started
    • send/receive or broadcast/reduce plus initialization
  • More subtle semantics are needed to manage message buffers, to avoid copying and speed up communication
• Tree-based algorithms
  • Elegant model that is a key piece of data-parallel programming
  • Most common are broadcast/reduce
  • Parallel prefix (aka scan) produces partial answers and can be used for many surprising applications
  • Some of these are of more theoretical than practical interest
Extra Slides
Inverting Dense n-by-n Matrices in O(log^2 n) Time
• Lemma 1: Cayley-Hamilton Theorem
  • gives an expression for A^-1 via the characteristic polynomial of A
• Lemma 2: Newton’s Identities
  • a triangular system of equations for the coefficients of the characteristic polynomial, whose matrix entries are the s_k
• Lemma 3: s_k = trace(A^k) = Σ i=1..n  A^k[i,i] = Σ i=1..n  (λ_i(A))^k
• Csanky’s Algorithm (1976)
  1) Compute the powers A^2, A^3, …, A^(n-1) by parallel prefix                           cost = O(log^2 n)
  2) Compute the traces s_k = trace(A^k)                                                  cost = O(log n)
  3) Solve Newton’s identities for the coefficients of the characteristic polynomial      cost = O(log^2 n)
  4) Evaluate A^-1 using the Cayley-Hamilton Theorem                                      cost = O(log n)
• Completely numerically unstable
Summary of tree algorithms
• Lots of problems can be done quickly - in theory - using trees
• Some algorithms are widely used
  • broadcasts, reductions, parallel prefix
  • carry look-ahead addition
• Some are of theoretical interest only
  • Csanky’s method for matrix inversion
  • Solving general tridiagonals (without pivoting)
  • Both are numerically unstable
  • Csanky needs too many processors
• Embedded in various systems
  • CM-5 hardware control network
  • MPI, Split-C, Titanium, NESL, other languages