© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 1
Introduction to Concurrency in Programming Languages: Chapter 12: Recursive Algorithms
Matthew J. Sottile
Timothy G. Mattson
Craig E Rasmussen
Chapter 12 Objectives
• Review the concept of recursion as a general algorithm pattern.
• Demonstrate recursion to implement parallel algorithms.
The Algorithm Design Patterns
• Result Parallelism
  – Geometric Decomposition
  – Task Parallelism
  – Divide and Conquer
  – Recursive Data
• Specialist Parallelism
  – Pipeline
  – Event Based Coordination
• Agenda Parallelism
  – Data Parallelism
  – Embarrassingly Parallel
  – Separable Dependencies
Start with a basic concurrency decomposition:
• A problem decomposed into a set of tasks
• A data decomposition aligned with the set of tasks … designed to minimize interactions between tasks and make concurrent updates to data safe
• Dependencies and ordering constraints between groups of tasks
Supporting Patterns
• Fork-join
  – A computation begins as a single thread of control. Additional threads are created as needed (forked) to execute functions and then terminate when complete (join). The computation continues as a single thread until a later time when more threads might be useful.
• SPMD
  – Multiple copies of a single program are launched, typically with their own view of the data. The path through the program is determined in part based on a unique ID (a rank). This is by far the most commonly used pattern with message passing APIs such as MPI.
• Loop parallelism
  – Parallelism is expressed in terms of loops that execute concurrently.
• Master-worker
  – A process or thread (the master) sets up a task queue and manages other threads (the workers) as they grab a task from the queue, carry out the computation, and then return for their next task. This continues until the master detects that a termination condition has been met, at which point the master ends the computation.
• SIMD
  – The computation is a single stream of instructions applied to the individual components of a data structure (such as an array).
• Functional parallelism
  – Concurrency is expressed as a distinct set of functions that execute concurrently. This pattern may be used with imperative semantics, in which case the way the functions execute is defined in the source code (e.g., event based coordination). Alternatively, this pattern can be used with declarative semantics, such as within a functional language, where the functions are defined but how (or when) they execute is dictated by the interaction of the data with the language model.
Outline
• Recursion concepts
• Recursion and the divide and conquer pattern
• Case study: sorting
• Case study: Sudoku
Recursion: general concepts
• Mathematically, a recursive function is defined in terms of the function itself, using:
  – A recursion relation
  – A base case
• Example: the factorial function
f(n) = n * f(n-1)    ← recursion relation
f(1) = 1             ← base case
Recursion: Computer science
• Recursion plays a key role in how we program.
• Recursive functions can be defined in most modern programming languages.
• Consider the function invocation process:
– Push the state of the caller (e.g. processor registers) onto a stack
– Push function arguments onto the stack (as values or references)
– Push the return address to jump to when the callee returns
Basic stack discipline for a function f() calling a function g()
• The use of the stack for function invocation defines a common discipline that can be used to support recursive functions.
Call graphs
• Recursive function invocation is managed by the stack discipline, but that is not always a useful way to picture what is occurring during a computation.
• A more useful model is the “call graph”.
• A call graph is a directed acyclic graph that shows the function invocations in a computation.
• Function invocations are the vertices of the graph, with edges showing caller/callee relationships.
int fib(int n) {
   if (n == 1 || n == 0)
      return 1;
   else
      return fib(n-1) + fib(n-2);
}
A call graph for invocation of fib(4)
Recursion and concurrency
• Recursion with multiple threads
  – A thread can be invoked for each recursive function call.
  – Easy to implement, as it leaves the details of managing concurrency to the OS.
  – Potentially high overhead … reduced scalability.
A call graph for invocation of fib(4)
Cactus stack … each box is a stack frame. Stack grows down with name of parent at the top.
• Recursion with a cactus stack (Cilk)
  – A tree with child nodes containing pointers to parent nodes.
  – A Cilk spawn generates a ref to the call stack with the child frame pushed to the top.
Outline
• Recursion concepts
• Recursion and the divide and conquer pattern
• Case study: sorting
• Case study: Sudoku
Divide and Conquer Pattern
• Use when:
  – A problem includes a method to divide it into subproblems and a way to recombine solutions of subproblems into a global solution.
• Solution:
  – Define a split operation.
  – Continue to split the problem until subproblems are small enough to solve directly.
  – Recombine solutions to subproblems to solve the original global problem.
• Note:
  – Computing may occur at each phase (split, leaves, recombine).
Source: Mattson and Keutzer, UCB CS294
Divide and conquer
• Split the problem into smaller sub-problems. Continue until the sub-problems can be solved directly.
• 3 options:
  – Do work as you split into sub-problems.
  – Do work only at the leaves.
  – Do work as you recombine.
FFT algorithm
[Figure: binary tree of FFT terms. FFT(0,1,2,…,15) = FFT(xxxx) splits into the even-indexed elements FFT(0,2,…,14) = FFT(xxx0) and the odd-indexed elements FFT(1,3,…,15) = FFT(xxx1); each half splits again on the next bit, down to the single-element transforms FFT(0) … FFT(15).]
Divide and conquer for the O(N log N) FFT algorithm
Binary tree of FFT terms from UCB CS267, 2007
Examples of Divide and conquer
• Backtracking
  – Depth first search to find the optimum.
  – On finding a provably sub-optimal value, backtrack and try another choice.
• Dynamic programming
  – Decompose into independent subproblems … but they overlap (the same subproblems appear across the splitting tree). Reuse solved subproblems to reduce total work.
• Branch and bound
  – A systematic enumeration of candidate solutions, where large subsets of fruitless candidates are discarded en masse by using upper and lower estimated bounds of the quantity being optimized.
Fork-Join for Divide and conquer
• The fork-join “supporting pattern” is ideal for divide and conquer.
  – Fork threads, each of which is using the same function … this creates a call graph of recursive function calls (as we showed earlier for the Fibonacci sequence).
  – Join “from the bottom” as you unwind the stack.
• Cilk is an ideal programming language for the fork-join pattern:
  – In many ways it acts like a high level framework for recursive algorithms.
Cilk in one slide:
• Extends C to create a parallel language but maintains serial semantics.
• A task oriented programming model perfect for recursive algorithms (e.g. branch-and-bound) … shared memory machines only!
• Solid theoretical foundation … can prove performance theorems.
• Keywords:
  – cilk: marks a function as a “cilk” function that can be spawned
  – spawn: spawns a cilk function … only 2 to 5 times the cost of a regular function call
  – sync: wait until immediate spawned children functions return
• “Advanced” keywords:
  – inlet: define a function to handle return values from a cilk task
  – cilk_fence: a portable memory fence
  – abort: terminate all currently existing spawned tasks
• Also includes locks and a few other odds and ends.
A simple Cilk example
• Compute Fibonacci numbers ... recursively split the problem until it's small enough to compute directly.
int fib (int n) {
   if (n<2) return (n);
   else {
      int x, y;
      x = fib(n-1);
      y = fib(n-2);
      return (x+y);
   }
}
C version
cilk int fib (int n) {
   if (n<2) return (n);
   else {
      int x, y;
      x = spawn fib(n-1);
      y = spawn fib(n-2);
      sync;
      return (x+y);
   }
}
Cilk version
Remove the cilk keywords and you produce the correct C program (the C elision).
Based on Charles E. Leiserson, multithreaded programming in Cilk, lecture 1, July 13, 2006
Cilk supports an incremental parallelism software methodology.
Recursion is at the heart of Cilk
• Cilk makes it inexpensive to spawn new tasks.
• Instead of loops, recursively generate lots of tasks.
• Creates nested queues of tasks. A scheduler intelligently uses work-stealing to keep all the cores busy as they work on these tasks.
With Cilk, the programmer worries about expressing concurrency, not the details of how it is implemented.
Common pattern for Cilk
• Start with a program with a loop.
• Convert to a recursive structure … splitting range in half until the remaining chunk is small enough to compute directly.
void vadd (real *A, real *B, int n) {
   int i;
   for (i=0; i<n; i++) A[i] += B[i];
}

void vadd (real *A, real *B, int n) {
   if (n<MIN) {
      int i;
      for (i=0; i<n; i++) A[i] += B[i];
   } else {
      vadd(A, B, n/2);
      vadd(A+n/2, B+n/2, n-n/2);
   }
}

• Add the Cilk keywords:

cilk void vadd (real *A, real *B, int n) {
   if (n<MIN) {
      int i;
      for (i=0; i<n; i++) A[i] += B[i];
   } else {
      spawn vadd(A, B, n/2);
      spawn vadd(A+n/2, B+n/2, n-n/2);
      sync;
   }
}
Recursive algorithms in OpenMP
• OpenMP 3.0 added constructs to support recursive algorithms.
• Consider the following example:
  – Count the incidence of a “key” in an array.
• We will solve this two different ways using OpenMP:
  – Geometric decomposition with SPMD
  – Divide and conquer with fork-join (fine grained)
Count keys: Main program

#define N 131072

int main()
{
   long a[N];
   int i;
   long key = 42, nkey = 0;

   // fill the array and make sure it has a few instances of the key
   for (i=0; i<N; i++) a[i] = random()%N;
   a[N%43] = key;  a[N%73] = key;  a[N%3] = key;

   // count key in a with geometric decomposition
   nkey = search_geom(N, a, key);

   // count key in a with divide and conquer (aka: recursive splitting)
   nkey = search_recur(N, a, key);
}
This is included for completeness … it just shows how we call the different functions to count instances of a key.
Count keys with OpenMP
// geometric decomposition implemented with the SPMD pattern
long search_geom(long Nlen, long *a, long key)
{
   long count = 0;
   #pragma omp parallel reduction(+:count)
   {
      int i, num_threads = omp_get_num_threads();
      int ID = omp_get_thread_num();
      int istart = ID * Nlen/num_threads;
      int iend = (ID+1) * Nlen/num_threads;
      if (ID == (num_threads-1)) iend = Nlen;
      for (i=istart; i<iend; i++)
         if (a[i]==key) count++;
   }
   return count;
}
• Design patterns used:
  – Geometric decomposition
  – SPMD
This is a common trick to handle the case when N is not evenly divided by the number of threads
Count keys: with OpenMP

long search_recur(long Nlen, long *a, long key)
{
   long count = 0;
   #pragma omp parallel reduction(+:count)
   {
      #pragma omp single
      count = search(Nlen, a, key);
   }
   return count;
}
• Design patterns used:
  – Divide and conquer
  – Fork-join
long search(long Nlen, long *a, long key)
{
   long count1 = 0, count2 = 0, Nl2;
   if (Nlen <= 2) {   /* base case: solve directly */
      if (Nlen >= 1 && a[0] == key) count1 = 1;
      if (Nlen == 2 && a[1] == key) count2 = 1;
      return count1 + count2;
   } else {
      Nl2 = Nlen/2;
      #pragma omp task shared(count1)
      count1 = search(Nl2, a, key);
      #pragma omp task shared(count2)
      count2 = search(Nlen-Nl2, a+Nl2, key);   /* upper half may be one larger */
      #pragma omp taskwait
      return count1 + count2;
   }
}
Count keys: Random number generator
// A simple random number generator
static unsigned long long MULTIPLIER = 764261123;
static unsigned long long PMOD = 2147483647;
unsigned long long random_last = 42;

long random()
{
   unsigned long long random_next;

   // compute an integer random number from zero to PMOD
   random_next = (unsigned long long)((MULTIPLIER * random_last) % PMOD);
   random_last = random_next;
   return (long) random_next;
}
I include this for completeness … it has nothing to do with any parallelism
Outline
• Recursion concepts
• Recursion and the divide and conquer pattern
• Case study: sorting
• Case study: Sudoku
Merge Sort
• Sorting: an important class of algorithms that take an input list and generate a sorted list.
• Merge sort
  – Split the list in two
  – Sort each half by a recursive call to merge sort
  – Continue until you hit a trivial base case
  – Unwind the recursive stack to generate the final sorted list
• Example
  – Starting point: [3 6 4 1 5 7 3 2]
  – Split in two: [3 6 4 1] [5 7 3 2]
  – Split in two: [3 6] [4 1] [5 7] [3 2]
  – Base case: [3 6] [1 4] [5 7] [2 3]
  – Sort on merge: [1 3 4 6] [2 3 5 7]
  – Sort on merge: [1 2 3 3 4 5 6 7]
Source: Mattson and Keutzer, UCB CS294
• Fill the stack of recursive calls to merge sort; after the base case (n<2), unwind the stack to generate the sorted list.
Serial Merge Sort:
void mergesort(int * X, int n, int * tmp)
{
   if (n < 2) return;

   /* recursively sort each half of the list */
   mergesort(X, n/2, tmp);
   mergesort(X+(n/2), n-(n/2), tmp);

   /* merge sorted halves into sorted list */
   merge(X, n, tmp);
}
Note: we include the merge function in a later slide … it’s the same for both the serial and parallel cases.
tmp points to space equal in size to X and is used as a buffer to sort into
• Each mergesort call is independent, so a parallel version is trivial to create.
Cilk parallel Merge Sort:
cilk void mergesort(int * X, int n, int * tmp)
{
   if (n < 2) return;

   /* recursively sort each half of the list */
   spawn mergesort(X, n/2, tmp);
   spawn mergesort(X+(n/2), n-(n/2), tmp);
   sync;

   /* merge sorted halves into sorted list */
   merge(X, n, tmp);
}
Parallel program created by inserting 4 cilk keywords
• OpenMP 3.0 tasks let you write the same algorithm as with Cilk, but …
  – OpenMP exposes threads … you must make the call inside a parallel region.
  – OpenMP has a more flexible data model … you must explicitly define how to scope data in the tasks.
OpenMP parallel Merge Sort:
void mergesort(int * X, int n, int * tmp)
{
   if (n < 2) return;

   #pragma omp task firstprivate (X, n, tmp)
   mergesort(X, n/2, tmp);

   #pragma omp task firstprivate (X, n, tmp)
   mergesort(X+(n/2), n-(n/2), tmp);

   #pragma omp taskwait

   /* merge sorted halves into sorted list */
   merge(X, n, tmp);
}
Main program for OpenMP merge sort
• The only way to create threads in OpenMP is with a parallel construct.
• Hence our parallel merge sort must occur within a parallel region.

#include <omp.h>
#define MAX_SIZE 1000

int main()
{
   int n = 100;
   int data[MAX_SIZE], tmp[MAX_SIZE];
   generate_list(data, n);
   #pragma omp parallel
   {
      #pragma omp single
      mergesort(data, n, tmp);
   }
}
Create a team of threads using the “default number” of threads
The single construct causes only one member of the team to call the first mergesort
Background for Merge Sort: The merge routine
#include <string.h>
#include <stdlib.h>

void merge(int * X, int n, int * tmp)
{
   int i = 0;
   int j = n/2;
   int ti = 0;

   while (i<n/2 && j<n) {
      if (X[i] < X[j]) {
         tmp[ti] = X[i]; ti++; i++;
      } else {
         tmp[ti] = X[j]; ti++; j++;
      }
   }
   while (i<n/2) { /* finish up lower half */
      tmp[ti] = X[i]; ti++; i++;
   }
   while (j<n) { /* finish up upper half */
      tmp[ti] = X[j]; ti++; j++;
   }
   memcpy(X, tmp, n*sizeof(int));
} // end of merge()
• This is the merge function used by both the serial and parallel versions of the program
Background for merge sort: Generate_list
• A function to generate a list of integers (included for completeness … this has nothing to do with sorting or parallelism)
void generate_list(int * x, int n)
{
   int i;
   srand(10000);
   for (i = 0; i < n; i++) {
      int val = n * ((double) rand() / ((double) RAND_MAX + (double) 1));
      x[i] = val;
   }
}
Outline
• Recursion concepts
• Recursion and the divide and conquer pattern
• Case study: sorting
• Case study: Sudoku
Sudoku
• A game where you fill in a grid with numbers
  – A number cannot appear more than once in any column
  – A number cannot appear more than once in any row
  – A number cannot appear more than once in any “region”
• Typically presented with a 9 by 9 grid … but for simplicity we'll consider a 4 by 4 grid
A 4 x 4 Sudoku puzzle with 11 open positions … we show three steps in the solution:
• Place a 1, since 1 is the only number missing in this column
• Place a 2, since 3 already appears in this region
• Place a 3, since 3 is the only number missing in this row
Sudoku Algorithm
• The two-dimensional Sudoku grid is flattened into a vector
  – Unsolved locations are filled with zeros
  – The first two rows of the initial 4 x 4 puzzle are shown
  – The current working location [loc=0] is shown in red; the subgrid size is 2, so each side of the grid has size*size = 4 cells
  – Initially call spawn solve(size=2, grid, loc=0)
grid: 3 0 0 4 0 0 0 2 …

• The first location already has a solution, so move to the next location
  – Recursively call spawn solve(size=2, grid, loc=loc+1)
Exhaustive Search
• The next location [loc=1] has no solution (‘0’ in the current cell) so …
  – Create 4 new grids and try each of the 4 possibilities (1,2,3,4) concurrently
  – Note: the search goes much faster if each guess is first tested to see if it is legal
  – Spawn a new search tree for each guess k
  – Call: spawn solve(size=2, grid[k], loc=loc+1)
new grids:
3 1 0 4 0 0 0 2 …
3 2 0 4 0 0 0 2 …
3 3 0 4 0 0 0 2 …   ← illegal, since 3 is already in the same row
3 4 0 4 0 0 0 2 …   ← illegal, since 4 is already in the same row
Cilk Sudoku solution (part 1 of 3)
cilk int solve(int size, int* grid, int loc)
{
   int i, k, solved, nGrids, solution[MAX_NUM];
   int* myGrid[MAX_NUM];
   int numNumbers = size*size;
   int Gridlen = numNumbers*numNumbers;

   if (loc == Gridlen) {
      /* maximum depth; reached the end of the puzzle */
      return check_solution(size, grid);
   }

   /* if this node has a solution (given by the puzzle) at this location */
   /* move to the next location */
   if (grid[loc] != 0) {
      solved = spawn solve(size, grid, loc+1);
      return solved;
   }
Cilk Sudoku solution (part 2 of 3)
   /* try each number (unique to row, col, sq) */
   nGrids = 0;
   for (i = 0, k = 0; i < MAX_NUM; i++) {
      k = next_guess(size, k, loc, grid);
      if (k == 0) break;   /* no more legal guesses at this location */

      /* need a new grid to work with */
      myGrid[i] = new_grid(size, grid);
      myGrid[i][loc] = k;
      solution[i] = spawn solve(size, myGrid[i], loc+1);
      nGrids += 1;
   }
   sync;
Cilk Sudoku solution (part 3 of 3)
   /* check to see if there is a solution */
   solved = 0;
   for (i = 0; i < nGrids; i++) {
      if (solution[i] == 1) {
         int n;
         /* found a solution, copy the result to the parent */
         for (n = loc; n < Gridlen; n++) {
            grid[n] = (myGrid[i])[n];
         }
         solved = 1;
      }
      free(myGrid[i]);
   }
   return solved;
}