Decomposition, Locality and (finishing synchronization lecture) Barriers
Kathy Yelick, [email protected]
www.cs.berkeley.edu/~yelick/cs194f07
Lecture Schedule for Next Two Weeks
• Fri 9/14 11-12:00
• Mon 9/17 10:30-11:30 Discussion (11:30-12 as needed)
• Wed 9/19 10:30-12:00 Decomposition and Locality
• Fri 9/21 10:30-12:00 NVIDIA Lecture
• Mon 9/24 10:30-12:00 NVIDIA lecture (David Kirk)
• Tue 9/25 3-4:30 NVIDIA research talk in Woz (optional)
• Wed 9/26 10:30-11:30 Discussion
• Fri 9/27 11-12 Lecture topic TBD
Outline
• Task Decomposition
  • Review parallel decomposition and task graphs
  • Styles of decomposition
  • Extending task graphs with interaction information
• Example: Sharks and Fish (Wator)
  • Parallel vs. sequential versions
• Data decomposition
  • Partitioning rectangular grids (like matrix multiply)
  • Ghost regions (unlike matrix multiply)
• Bulk-synchronous programming
  • Barrier synchronization
Designing Parallel Algorithms
• Parallel software design starts with decomposition
• Decomposition techniques
  • Recursive decomposition
  • Data decomposition: input, output, or intermediate data
  • And others
• Characteristics of tasks and interactions
  • Task generation, granularity, and context
  • Characteristics of task interactions
Recursive Decomposition: Example
• Consider parallel quicksort: once the array is partitioned, each subarray can be processed in parallel.
Source: Ananth Grama
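A minimal sketch (not from the slides) of this recursive decomposition, written in C with OpenMP tasks; OpenMP and the helper functions are assumptions made only for illustration.

    /* Sketch: recursive decomposition of quicksort using OpenMP tasks. */
    #include <omp.h>

    static void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }

    static int partition(int *a, int lo, int hi) {
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++)
            if (a[j] < pivot) swap(&a[i++], &a[j]);
        swap(&a[i], &a[hi]);
        return i;
    }

    static void quicksort(int *a, int lo, int hi) {
        if (lo >= hi) return;
        int p = partition(a, lo, hi);
        /* Once the array is partitioned, the two subarrays are
         * independent and can be processed in parallel as tasks. */
        #pragma omp task shared(a)
        quicksort(a, lo, p - 1);
        #pragma omp task shared(a)
        quicksort(a, p + 1, hi);
        #pragma omp taskwait
    }

    void sort(int *a, int n) {     /* one parallel region; one thread spawns the root task */
        #pragma omp parallel
        #pragma omp single
        quicksort(a, 0, n - 1);
    }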
Data Decomposition
• Identify the data on which computations are performed.
• Partition this data across the various tasks.
• This partitioning induces a decomposition of the computation, often using the following rule:
  • Owner-computes rule: the thread assigned to a particular data item is responsible for all computation associated with it.
• The owner-computes rule is especially common for output data. Why?
Source: Ananth Grama
Input Data Decomposition
• Recall: apply a function sqr (square) to each element of array A, then compute the sum of the results.
• Conceptualize the decomposition as a task dependency graph:
  • a directed graph with
    • nodes corresponding to tasks
    • edges indicating dependencies: the result of one task is required to process the next.
[Task graph: independent tasks sqr(A[0]), sqr(A[1]), sqr(A[2]), …, sqr(A[n]), all feeding a final sum task]
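A minimal code sketch (not in the slides) of the same task graph: the sqr tasks are independent and the final sum node becomes a reduction; OpenMP is an assumption here.

    /* n independent sqr tasks feeding one sum task, expressed as a reduction. */
    double sum_of_squares(const double *A, int n) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++) {
            double sq = A[i] * A[i];   /* independent task: sqr(A[i]) */
            sum += sq;                 /* edge into the sum task */
        }
        return sum;
    }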
Output Data Decomposition: Example
• Consider matrix multiply C = A * B: partition the output matrix C into submatrices and create one task per output block (Tasks 1-4 in the original figure), each computing its block of C from the corresponding blocks of A and B.
Source: Ananth Grama
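A hedged sketch of this output decomposition (the exact block assignments from the original figure are not reproduced): one task per block of C, following the owner-computes rule. The 2x2 blocking, even n, and OpenMP are assumptions for illustration.

    /* Output data decomposition for C = A * B: each task owns one block of C. */
    void block_task(int n, const double *A, const double *B, double *C,
                    int bi, int bj) {                  /* which block of C (0 or 1 each) */
        int half = n / 2;                              /* assumes n is even */
        for (int i = bi * half; i < (bi + 1) * half; i++)
            for (int j = bj * half; j < (bj + 1) * half; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];
                C[i*n + j] = sum;                      /* all work for the owned data */
            }
    }

    void matmul_output_decomposed(int n, const double *A, const double *B, double *C) {
        #pragma omp parallel for collapse(2)
        for (int bi = 0; bi < 2; bi++)                 /* four tasks, one per output block */
            for (int bj = 0; bj < 2; bj++)
                block_task(n, A, B, C, bi, bj);
    }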
A Model Problem: Sharks and Fish
• Illustration of parallel programming
• Original version (discrete event only) proposed by Geoffrey Fox; called WATOR
• Basic idea: sharks and fish living in an ocean
  • Rules for breeding, eating, and death
  • Could also add: forces in the ocean, between creatures, etc.
• Ocean is toroidal (a 2D donut)
Serial vs. Parallel
• Updating in left-to-right, row-wise order gives a serial algorithm
• It cannot be parallelized as-is because of the dependencies, so instead we use a “red-black” order:
      forall black points grid(i,j)
          update …
      forall red points grid(i,j)
          update …
• For 3D or a general graph, use graph coloring
  • Can use repeated Maximal Independent Sets to color
  • Graph(T) is bipartite => 2-colorable (red and black)
  • Nodes of each color can be updated simultaneously
• Can also use two copies (old and new), but in both cases note that we have changed the behavior of the original algorithm!
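A minimal sketch of the red-black sweep (the 5-point averaging update and OpenMP are assumptions; the real update rule depends on the application): cells of one color have no dependencies among themselves, so each forall becomes a parallel loop.

    /* Red-black sweep over an n x n grid stored with a boundary ring at
     * indices 0 and n+1; cells of one parity class are updated together. */
    void red_black_sweep(int n, double grid[n+2][n+2]) {
        for (int color = 0; color < 2; color++) {      /* black points, then red points */
            #pragma omp parallel for
            for (int i = 1; i <= n; i++)
                for (int j = 1 + (i + color) % 2; j <= n; j += 2)
                    grid[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                         grid[i][j-1] + grid[i][j+1]);
        }
    }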
Parallelism in Wator
• The simulation is synchronous
  • use two copies of the grid (old and new)
  • the value of each new grid cell depends only on 9 cells (itself plus 8 neighbors) in the old grid
  • the simulation proceeds in timesteps; each cell is updated at every step
• Easy to parallelize by dividing the physical domain: Domain Decomposition
  • Locality is achieved by using large patches of the ocean
  • Only boundary values from neighboring patches are needed
• How to pick the shapes of the domains?
[Figure: the ocean divided into a 3x3 grid of patches, assigned to processors P1-P9]
      Repeat
          compute locally to update local system
          barrier()
          exchange state info with neighbors
      until done simulating
Regular Meshes (e.g., Game of Life)
• Suppose the graph is an n x n mesh with connections to NSEW neighbors
• Which partitioning has less communication (less potential false sharing)?
  • 1D strips:  n*(p-1) edge crossings
  • 2D blocks:  2*n*(p^(1/2) - 1) edge crossings
• Minimizing communication on a mesh = minimizing the “surface-to-volume ratio” of the partition
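For a concrete comparison (the numbers here are chosen for illustration; they are not in the slides): with n = 1024 and p = 16, 1D strips give n*(p-1) = 1024*15 = 15,360 edge crossings, while 2D blocks give 2*n*(p^(1/2)-1) = 2*1024*3 = 6,144, so the block partition has the smaller surface-to-volume ratio.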
Ghost Nodes
• Overview of a (possible!) memory hierarchy optimization
• Normally done with a 1-wide ghost region
• Can you imagine a reason for a wider one?
[Figure: to compute the green cells, copy the yellow (ghost) cells from the neighboring patch, then compute the blue cells]
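A minimal sketch of one bulk-synchronous step with 1-wide ghost rows for a 1D row decomposition; the 9-point averaging rule, the small fixed size, and the omission of the left/right ghost columns are simplifications made only for illustration (in a message-passing setting the copies would arrive as messages from the neighboring patches).

    #include <string.h>
    #define N 8                        /* patch holds rows 1..N, ghost rows 0 and N+1 */

    /* Copy the neighbors' boundary rows into the ghost rows, then update
     * the interior using only local (old) data; a barrier would separate
     * this step from the next exchange. */
    void patch_step(double old[N+2][N+2], double new_grid[N+2][N+2],
                    const double north[N+2], const double south[N+2]) {
        memcpy(old[0],   north, sizeof(double) * (N+2));   /* fill ghost row 0   */
        memcpy(old[N+1], south, sizeof(double) * (N+2));   /* fill ghost row N+1 */
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++) {
                double s = 0.0;
                for (int di = -1; di <= 1; di++)            /* itself plus 8 neighbors */
                    for (int dj = -1; dj <= 1; dj++)
                        s += old[i+di][j+dj];
                new_grid[i][j] = s / 9.0;
            }
    }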
Bulk-Synchronous Computation
• With this decomposition, the computation is mostly independent
• Need to synchronize between phases
• Picking up from last time: barrier synchronization
Barriers
• Software algorithms implemented using locks, flags, counters
• Hardware barriers
  • Wired-AND line separate from address/data bus
    • Set input high on arrival, wait for output to go high before leaving
  • In practice, multiple wires to allow reuse
  • Useful when barriers are global and very frequent
  • Difficult to support an arbitrary subset of processors
    • even harder with multiple processes per processor
  • Difficult to dynamically change the number and identity of participants
    • e.g., the latter due to process migration
  • Not common today on bus-based machines
A Simple Centralized Barrier
• Shared counter maintains the number of processes that have arrived: increment on arrival, then check the flag until all p processes have arrived

    struct bar_type {
        int counter;
        struct lock_type lock;
        int flag = 0;
    } bar_name;

    BARRIER (bar_name, p) {
        LOCK(bar_name.lock);
        if (bar_name.counter == 0)
            bar_name.flag = 0;             /* reset flag if first to reach */
        mycount = ++bar_name.counter;      /* mycount is private */
        UNLOCK(bar_name.lock);
        if (mycount == p) {                /* last to arrive */
            bar_name.counter = 0;          /* reset for next barrier */
            bar_name.flag = 1;             /* release waiters */
        }
        else
            while (bar_name.flag == 0) {}  /* busy wait for release */
    }

• What is the problem if barriers are done back-to-back?
A Working Centralized Barrier
• Consecutively entering the same barrier doesn’t work
  • Must prevent a process from entering until all have left the previous instance
  • Could use another counter, but that increases latency and contention
• Sense reversal: wait for the flag to take a different value in consecutive barriers
  • Toggle this value only when all processes have reached the barrier

    BARRIER (bar_name, p) {
        local_sense = !(local_sense);      /* toggle private sense variable */
        LOCK(bar_name.lock);
        mycount = ++bar_name.counter;      /* mycount is private */
        if (bar_name.counter == p) {       /* last to arrive */
            UNLOCK(bar_name.lock);
            bar_name.counter = 0;          /* reset for next barrier */
            bar_name.flag = local_sense;   /* release waiters */
        }
        else {
            UNLOCK(bar_name.lock);
            while (bar_name.flag != local_sense) {}   /* busy wait */
        }
    }
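For context, a usage sketch in the same pseudocode style as the slide (compute_local_patch and exchange_with_neighbors are hypothetical names): local_sense is a private, per-process variable that persists across barrier calls.

    local_sense = 0;                    /* one private copy per process */
    while (!done) {
        compute_local_patch();          /* phase 1: independent local work */
        BARRIER(bar_name, p);           /* toggles local_sense and waits */
        exchange_with_neighbors();      /* phase 2: communication */
        BARRIER(bar_name, p);
    }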
Centralized Barrier Performance
• Latency
  • Centralized has critical path length at least proportional to p
• Traffic
  • About 3p bus transactions
• Storage cost
  • Very low: a centralized counter and flag
• Fairness
  • The same processor should not always be the last to exit the barrier
  • No such bias in the centralized scheme
• Key problems for the centralized barrier are latency and traffic
  • Especially with distributed memory, since all traffic goes to the same node
Improved Barrier Algorithms for a Bus
• Software combining tree
  • Only k processors access the same location, where k is the degree of the tree
  [Figure: flat structure (contention at one location) vs. tree structure (little contention)]
• Separate arrival and exit trees, and use sense reversal
• Valuable in a distributed network: communicate along different paths
• On a bus, all traffic goes on the same bus, and there is no less total traffic
• Higher latency (log p steps of work, and O(p) serialized bus transactions)
• Advantage on a bus is the use of ordinary reads/writes instead of locks
Combining Tree Barrier
Tree-Based Barrier Summary
Tree-Based Barrier Cost
Latency
Traffic
Memory
Dissemination Based Barrier
Dissemination Barrier
Latency
Traffic
Memory
Barrier Performance on SGI Challenge
• Centralized does quite well
• Will discuss fancier barrier algorithms for distributed machines
• Helpful hardware support: piggybacking of read misses on the bus
  • Also useful for spinning on highly contended locks
[Plot: barrier time (µs), 0-35, vs. number of processors, 1-8, comparing Centralized, Combining tree, Tournament, and Dissemination barriers]
Synchronization Summary
• Rich interaction of hardware-software tradeoffs
• Must evaluate hardware primitives and software algorithms together
  • the primitives determine which algorithms perform well
• Evaluation methodology is challenging
  • use of delays, microbenchmarks
  • should use both microbenchmarks and real workloads
• Simple software algorithms with common hardware primitives do well on a bus
• Will see more sophisticated techniques for distributed machines
• Hardware support is still a subject of debate
  • theoretical research argues for swap or compare&swap, not fetch&op
  • algorithms that ensure constant-time access, but complex
References
• John Mellor-Crummey’s 422 Lecture Notes
Tree-Based Computation
• The broadcast and reduction operations in MPI are a good example of tree-based algorithms
  • Reductions: take n inputs and produce 1 output
  • Broadcast: take 1 input and produce n outputs
• What can we say about such computations in general?
A log n lower bound to compute any function of n variables
• Assume we can only use binary operations, one per time unit
• After 1 time unit, an output can depend on only two inputs
• Use induction to show that after k time units, an output can depend on at most 2^k inputs
• After log2(n) time units, an output can depend on at most n inputs, so any function of all n variables needs at least log2(n) time
• A binary tree performs such a computation
Broadcasts and Reductions on Trees
Parallel Prefix, or Scan
• If “+” is an associative operator, and x[0],…,x[p-1] are the input data, then the parallel prefix operation computes
      y[j] = x[0] + x[1] + … + x[j]   for j = 0,1,…,p-1
• Notation: j:k means x[j] + x[j+1] + … + x[k]  (in the original figure, blue marks the final values)
Mapping Parallel Prefix onto a Tree - Details
• Up-the-tree phase (from leaves to root)
  1) Get values L and R from the left and right children
  2) Save L in a local register Lsave
  3) Pass the sum L+R to the parent
  • By induction, Lsave = sum of all leaves in the left subtree
• Down-the-tree phase (from root to leaves)
  1) Get value S from the parent (the root gets 0)
  2) Send S to the left child
  3) Send S + Lsave to the right child
  • By induction, S = sum of all leaves to the left of the subtree rooted at the parent
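A minimal sketch (sequential simulation, integer addition assumed) of these two phases; the recursion mirrors the tree, with Lsave stored per internal node, and it produces the inclusive prefix y[j] = x[0] + … + x[j].

    #include <stdio.h>

    /* Up phase: return the sum of x[lo..hi]; record the left-subtree sum. */
    static int up(int *x, int *lsave, int lo, int hi, int node) {
        if (lo == hi) return x[lo];
        int mid = (lo + hi) / 2;
        int L = up(x, lsave, lo, mid, 2*node);
        int R = up(x, lsave, mid + 1, hi, 2*node + 1);
        lsave[node] = L;                         /* save L locally, pass L+R to parent */
        return L + R;
    }

    /* Down phase: S = sum of all leaves to the left of this subtree. */
    static void down(int *x, int *lsave, int *y, int lo, int hi, int node, int S) {
        if (lo == hi) { y[lo] = S + x[lo]; return; }        /* inclusive prefix at leaf */
        int mid = (lo + hi) / 2;
        down(x, lsave, y, lo, mid, 2*node, S);              /* left child gets S */
        down(x, lsave, y, mid + 1, hi, 2*node + 1, S + lsave[node]); /* right gets S+Lsave */
    }

    int main(void) {
        int x[8] = {1,2,3,4,5,6,7,8}, y[8], lsave[8] = {0};
        up(x, lsave, 0, 7, 1);
        down(x, lsave, y, 0, 7, 1, 0);                      /* the root gets 0 */
        for (int i = 0; i < 8; i++) printf("%d ", y[i]);    /* 1 3 6 10 15 21 28 36 */
        printf("\n");
        return 0;
    }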
E.g., Fibonacci via Matrix Multiply Prefix
• Consider computing the Fibonacci numbers:  Fn+1 = Fn + Fn-1
• Each step can be viewed as a matrix multiplication:
      [ Fn+1 ]   [ 1  1 ] [ Fn   ]
      [ Fn   ] = [ 1  0 ] [ Fn-1 ]
• Can compute all Fn by matmul_prefix on
      [ M, M, M, M, M, M, M, M, M ]   where M = [ 1 1 ; 1 0 ],
  then select the upper left entry of each prefix product
Derived from: Alan Edelman, MIT
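A small sketch (a sequential loop stands in for matmul_prefix; long integers assumed) that forms the running products of M = [1 1; 1 0] and reads off the Fibonacci numbers from the upper left entry.

    #include <stdio.h>

    int main(void) {
        long P[2][2] = {{1, 0}, {0, 1}};                /* running prefix product, starts at I */
        const long M[2][2] = {{1, 1}, {1, 0}};
        for (int k = 1; k <= 10; k++) {
            long Q[2][2];
            for (int i = 0; i < 2; i++)                 /* P = P * M: the k-th prefix product M^k */
                for (int j = 0; j < 2; j++)
                    Q[i][j] = P[i][0]*M[0][j] + P[i][1]*M[1][j];
            P[0][0] = Q[0][0]; P[0][1] = Q[0][1];
            P[1][0] = Q[1][0]; P[1][1] = Q[1][1];
            printf("F(%d) = %ld\n", k + 1, P[0][0]);    /* upper left entry of M^k is F(k+1) */
        }
        return 0;
    }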
Adding two n-bit integers in O(log n) time
• Let a = a[n-1]a[n-2]…a[0] and b = b[n-1]b[n-2]…b[0] be two n-bit binary numbers
• We want their sum s = a + b = s[n]s[n-1]…s[0]
• The serial algorithm computes the carry bits c[i] one at a time:
      c[-1] = 0                                                        … rightmost carry bit
      for i = 0 to n-1
          c[i] = ( (a[i] xor b[i]) and c[i-1] ) or ( a[i] and b[i] )   … next carry bit
          s[i] = ( a[i] xor b[i] ) xor c[i-1]
• Challenge: compute all c[i] in O(log n) time via parallel prefix
      for all (0 <= i <= n-1)  p[i] = a[i] xor b[i]                    … propagate bit
      for all (0 <= i <= n-1)  g[i] = a[i] and b[i]                    … generate bit
      c[i] = ( p[i] and c[i-1] ) or g[i],  i.e.

      [ c[i] ]   [ p[i]  g[i] ] [ c[i-1] ]
      [  1   ] = [  0     1   ] [   1    ]  =  C[i] * [ c[i-1] ; 1 ]   … 2-by-2 Boolean matrix multiplication (associative)
                                            =  C[i] * C[i-1] * … * C[0] * [ 0 ; 1 ]
      … evaluate each P[i] = C[i] * C[i-1] * … * C[0] by parallel prefix
• Used in all computers to implement addition: carry look-ahead
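A small sketch (sequential, LSB-first bit arrays assumed) of the propagate/generate recurrence above; a carry look-ahead adder replaces this left-to-right carry loop with a parallel prefix over the 2-by-2 Boolean matrices C[i].

    #include <stdio.h>

    int main(void) {
        int a[4] = {1, 1, 0, 1};            /* a = 1011 in binary (LSB first) = 11 */
        int b[4] = {1, 0, 1, 0};            /* b = 0101 in binary (LSB first) = 5  */
        int n = 4, s[5], carry = 0;         /* carry plays the role of c[i-1]      */
        for (int i = 0; i < n; i++) {
            int p = a[i] ^ b[i];            /* propagate bit p[i] */
            int g = a[i] & b[i];            /* generate bit  g[i] */
            s[i]  = p ^ carry;              /* s[i] = (a[i] xor b[i]) xor c[i-1]   */
            carry = (p & carry) | g;        /* c[i] = (p[i] and c[i-1]) or g[i]    */
        }
        s[n] = carry;
        for (int i = n; i >= 0; i--) printf("%d", s[i]);   /* prints 10000 = 16 */
        printf("\n");
        return 0;
    }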
Other applications of scans
• There are several applications of scans, some more obvious than others
  • add multi-precision numbers (represented as arrays of numbers)
  • evaluate recurrences and expressions
  • solve tridiagonal systems (numerically unstable!)
  • implement bucket sort and radix sort
  • dynamically allocate processors
  • search for regular expressions (e.g., grep)
• Names: +\ (APL), cumsum (Matlab), MPI_SCAN
• Note: 2n operations are used when only n-1 are needed
Evaluating arbitrary expressions
• Let E be an arbitrary expression formed from +, -, *, /, parentheses, and n variables, where each appearance of each variable is counted separately
• Can think of E as arbitrary expression tree with n leaves (the variables) and internal nodes labeled by +, -, * and /
• Theorem (Brent): E can be evaluated in O(log n) time, if we reorganize it using laws of commutativity, associativity and distributivity
• Sketch of a (modern) proof: evaluate the expression tree E greedily by
  • collapsing all leaves into their parents at each time step
  • evaluating all “chains” in E with parallel prefix
Multiplying n-by-n matrices in O(log n) time
• For all (1 <= i,j,k <= n)   P(i,j,k) = A(i,k) * B(k,j)
  • cost = 1 time unit, using n^3 processors
• For all (1 <= i,j <= n)   C(i,j) = Σ k=1..n  P(i,j,k)
  • cost = O(log n) time, using a tree with n^3 / 2 processors
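A minimal sketch (OpenMP assumed; the slides do not specify an API) of the two phases above: the products P(i,j,k) are fully independent, and the sum over k is written as a reduction that, with enough processors, could be evaluated as a log-depth tree.

    /* Two-phase matrix multiply: independent products, then a reduction per C(i,j). */
    void matmul_logdepth(int n, const double *A, const double *B, double *C) {
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;                        /* C(i,j) = sum over k of P(i,j,k) */
                #pragma omp simd reduction(+:sum)
                for (int k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];      /* P(i,j,k) = A(i,k) * B(k,j) */
                C[i*n + j] = sum;
            }
    }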
Evaluating recurrences
• Let x[i] = f_i(x[i-1]), with f_i a rational function and x[0] given
• How fast can we compute x[n]?
• Theorem (Kung): Suppose degree(f_i) = d for all i
  • If d = 1, x[n] can be evaluated in O(log n) time using parallel prefix
  • If d > 1, evaluating x[n] takes Ω(n) time, i.e. no speedup is possible
• Sketch of proof when d = 1:
      x[i] = f_i(x[i-1]) = (a_i * x[i-1] + b_i) / (c_i * x[i-1] + d_i)   can be written as
      x[i] = num[i] / den[i] = (a_i * num[i-1] + b_i * den[i-1]) / (c_i * num[i-1] + d_i * den[i-1]),   or

      [ num[i] ]   [ a_i  b_i ] [ num[i-1] ]
      [ den[i] ] = [ c_i  d_i ] [ den[i-1] ]  =  M_i * [ num[i-1] ; den[i-1] ]  =  M_i * M_{i-1} * … * M_1 * [ num[0] ; den[0] ]

  • Can use parallel prefix with 2-by-2 matrix multiplication
• Sketch of proof when d > 1:
  • degree(x[i]) as a function of x[0] is d^i
  • after k parallel steps, degree(anything) <= 2^k
  • so computing x[i] takes Ω(i) steps
Summary
• Message-passing programming
  • Maps well to large-scale parallel hardware (clusters)
  • Most popular programming model for these machines
  • A few primitives are enough to get started
    • send/receive or broadcast/reduce plus initialization
  • More subtle semantics are needed to manage message buffers, to avoid copying and speed up communication
• Tree-based algorithms
  • Elegant model that is a key piece of data-parallel programming
  • Most common are broadcast/reduce
  • Parallel prefix (aka scan) produces partial answers and can be used for many surprising applications
  • Some of these are of more theoretical than practical interest
Extra Slides
Inverting Dense n-by-n Matrices in O(log^2 n) Time
• Lemma 1: Cayley-Hamilton Theorem
  • gives an expression for A^-1 via the characteristic polynomial of A
• Lemma 2: Newton’s Identities
  • a triangular system of equations for the coefficients of the characteristic polynomial, whose matrix entries are the s_k
• Lemma 3: s_k = trace(A^k) = Σ i=1..n  A^k[i,i] = Σ i=1..n  (λ_i(A))^k
• Csanky’s Algorithm (1976)
  1) Compute the powers A^2, A^3, …, A^(n-1) by parallel prefix                           cost = O(log^2 n)
  2) Compute the traces s_k = trace(A^k)                                                  cost = O(log n)
  3) Solve Newton’s identities for the coefficients of the characteristic polynomial      cost = O(log^2 n)
  4) Evaluate A^-1 using the Cayley-Hamilton Theorem                                      cost = O(log n)
• Completely numerically unstable
Summary of tree algorithms
• Lots of problems can be done quickly - in theory - using trees
• Some algorithms are widely used
  • broadcasts, reductions, parallel prefix
  • carry look-ahead addition
• Some are of theoretical interest only
  • Csanky’s method for matrix inversion
  • Solving general tridiagonals (without pivoting)
  • Both are numerically unstable
  • Csanky needs too many processors
• Embedded in various systems
  • CM-5 hardware control network
  • MPI, Split-C, Titanium, NESL, other languages