© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 1
Introduction to Concurrency in Programming Languages: Chapter 12: Recursive Algorithms
Matthew J. Sottile
Timothy G. Mattson
Craig E Rasmussen
Chapter 12 Objectives
• Review the concept of recursion as a general algorithm pattern.
• Demonstrate recursion to implement parallel algorithms.
The Algorithm Design Patterns
• Result Parallelism
  – Geometric Decomposition
  – Task Parallelism
  – Divide and Conquer
  – Recursive Data
• Specialist Parallelism
  – Pipeline
  – Event Based Coordination
• Agenda Parallelism
  – Data Parallelism
  – Embarrassingly Parallel
  – Separable Dependencies
Start with a basic concurrency decomposition:
• A problem decomposed into a set of tasks
• A data decomposition aligned with the set of tasks … designed to minimize interactions between tasks and make concurrent updates to data safe
• Dependencies and ordering constraints between groups of tasks
Supporting Patterns
• Fork-join
  – A computation begins as a single thread of control. Additional threads are created as needed (forked) to execute functions and then terminate when complete (join). The computation continues as a single thread until a later time when more threads might be useful.
• SPMD
  – Multiple copies of a single program are launched, typically with their own view of the data. The path through the program is determined in part based on a unique ID (a rank). This is by far the most commonly used pattern with message passing APIs such as MPI.
• Loop parallelism
  – Parallelism is expressed in terms of loops that execute concurrently.
• Master-worker
  – A process or thread (the master) sets up a task queue and manages other threads (the workers) as they grab a task from the queue, carry out the computation, and then return for their next task. This continues until the master detects that a termination condition has been met, at which point the master ends the computation.
• SIMD
  – The computation is a single stream of instructions applied to the individual components of a data structure (such as an array).
• Functional parallelism
  – Concurrency is expressed as a distinct set of functions that execute concurrently. This pattern may be used with imperative semantics, in which case the way the functions execute is defined in the source code (e.g., event based coordination). Alternatively, this pattern can be used with declarative semantics, such as within a functional language, where the functions are defined but how (or when) they execute is dictated by the interaction of the data with the language model.
Outline
• Recursion concepts
• Recursion and the divide and conquer pattern
• Case study: sorting
• Case study: Sudoku
Recursion: general concepts
• Mathematically, a recursive function is defined in terms of the function itself, using:
  – A recursion relation
  – A base case
• Example: the factorial function
f(n) = n * f(n-1)    ← recursion relation
f(1) = 1             ← base case
Recursion: Computer science
• Recursion plays a key role in how we program.
• Recursive functions can be defined in most modern programming languages.
• Consider the function invocation process:
– Push the state of the caller (e.g. processor registers) onto a stack
– Push function arguments onto the stack (as values or references)
– Push the return address to jump to when the callee returns
Basic stack discipline for a function f() calling a function g()
• The use of the stack for function invocation defines a common discipline that can be used to support recursive functions.
Call graphs
• Recursive function invocation is managed by the stack discipline, but that is not always a useful way to picture what is occurring during a computation.
• A more useful model is the “call graph”.
• A call graph is a directed acyclic graph that shows the function invocations in a computation.
• Function invocations are the vertices of the graph, with edges showing caller/callee relationships.
int fib(int n) {
   if (n == 1 || n == 0)
      return 1;
   else
      return fib(n-1) + fib(n-2);
}
A call graph for invocation of fib(4)
Recursion and concurrency
• Recursion with multiple threads
  – A thread can be invoked for each recursive function call.
  – Easy to implement, as it leaves the details of managing concurrency to the OS.
  – Potentially high overhead … reduced scalability.
A call graph for invocation of fib(4)
Cactus stack … each box is a stack frame. Stack grows down with name of parent at the top.
• Recursion with a cactus stack (Cilk)
  – A tree with child nodes containing pointers to parent nodes.
  – A Cilk spawn generates a ref to the call stack with the child frame pushed to the top.
Outline
• Recursion concepts
• Recursion and the divide and conquer pattern
• Case study: sorting
• Case study: Sudoku
Divide and Conquer Pattern
• Use when:
  – A problem includes a method to divide it into subproblems and a way to recombine solutions of subproblems into a global solution.
• Solution:
  – Define a split operation.
  – Continue to split the problem until subproblems are small enough to solve directly.
  – Recombine solutions to subproblems to solve the original global problem.
• Note:
  – Computing may occur at each phase (split, leaves, recombine).
Source: Mattson and Keutzer, UCB CS294
Divide and conquer
• Split the problem into smaller sub-problems. Continue until the sub-problems can be solved directly.
• 3 options:
  – Do work as you split into sub-problems.
  – Do work only at the leaves.
  – Do work as you recombine.
FFT algorithm
[Figure: binary tree of FFT terms. FFT(0,1,2,…,15) = FFT(xxxx) splits into the even-indexed elements FFT(0,2,…,14) = FFT(xxx0) and the odd-indexed elements FFT(1,3,…,15) = FFT(xxx1); each half splits again on the next bit, down to the single-element transforms FFT(0) … FFT(15).]
Divide and conquer for the O(N log N) FFT algorithm
Binary tree of FFT terms from UCB CS267, 2007
Examples of Divide and conquer
• Backtracking
  – Depth first search to find the optimum.
  – On finding a provably sub-optimal value, backtrack and try another choice.
• Dynamic programming
  – Decompose into independent subproblems … but they overlap (the same subproblems appear across the splitting tree). Reuse solved subproblems to reduce total work.
• Branch and bound
  – A systematic enumeration of candidate solutions, where large subsets of fruitless candidates are discarded en masse by using upper and lower estimated bounds of the quantity being optimized.
Fork-Join for Divide and conquer
• The fork-join “supporting pattern” is ideal for divide and conquer.
  – Fork threads, each of which is using the same function … this creates a call graph of recursive function calls (as we showed earlier for the Fibonacci sequence).
  – Join “from the bottom” as you unwind the stack.
• Cilk is an ideal programming language for the fork-join pattern:
  – In many ways it acts like a high level framework for recursive algorithms.
Cilk in one slide:
• Extends C to create a parallel language but maintains serial semantics.
• A task oriented programming model perfect for recursive algorithms (e.g. branch-and-bound) … shared memory machines only!
• Solid theoretical foundation … can prove performance theorems.
• Keywords:
  – cilk: marks a function as a “cilk” function that can be spawned
  – spawn: spawns a cilk function … only 2 to 5 times the cost of a regular function call
  – sync: wait until immediate spawned children functions return
• “Advanced” keywords:
  – inlet: define a function to handle return values from a cilk task
  – cilk_fence: a portable memory fence
  – abort: terminate all currently existing spawned tasks
• Also includes locks and a few other odds and ends.
A simple Cilk example
• Compute Fibonacci numbers ... recursively split the problem until it's small enough to compute directly.
int fib (int n) {
   if (n<2) return (n);
   else {
      int x, y;
      x = fib(n-1);
      y = fib(n-2);
      return (x+y);
   }
}
C version
cilk int fib (int n) {
   if (n<2) return (n);
   else {
      int x, y;
      x = spawn fib(n-1);
      y = spawn fib(n-2);
      sync;
      return (x+y);
   }
}
Cilk version
Remove the cilk keywords and you produce the correct C program (the C elision).
Based on Charles E. Leiserson, multithreaded programming in Cilk, lecture 1, July 13, 2006
Cilk supports an incremental parallelism software methodology.
Recursion is at the heart of Cilk
• Cilk makes it inexpensive to spawn new tasks.
• Instead of loops, recursively generate lots of tasks.
• Creates nested queues of tasks. A scheduler intelligently uses work-stealing to keep all the cores busy as they work on these tasks.
With Cilk, the programmer worries about expressing concurrency, not the details of how it is implemented.
Common pattern for Cilk
• Start with a program with a loop.
• Convert to a recursive structure … splitting range in half until the remaining chunk is small enough to compute directly.
void vadd (real *A, real *B, int n) {
   int i;
   for (i=0; i<n; i++) A[i] += B[i];
}

void vadd (real *A, real *B, int n) {
   if (n<MIN) {
      int i;
      for (i=0; i<n; i++) A[i] += B[i];
   } else {
      vadd(A, B, n/2);
      vadd(A+n/2, B+n/2, n-n/2);
   }
}

• Add the Cilk keywords:

cilk void vadd (real *A, real *B, int n) {
   if (n<MIN) {
      int i;
      for (i=0; i<n; i++) A[i] += B[i];
   } else {
      spawn vadd(A, B, n/2);
      spawn vadd(A+n/2, B+n/2, n-n/2);
      sync;
   }
}
Recursive algorithms in OpenMP
• OpenMP 3.0 added constructs to support recursive algorithms.
• Consider the following example:
  – Count the incidence of a “key” in an array.
• We will solve this two different ways using OpenMP:
  – Geometric decomposition with SPMD
  – Divide and conquer with fork-join (fine grained)
Count keys: Main program

#define N 131072

int main()
{
   long a[N];
   int i;
   long key = 42, nkey = 0;

   // fill the array and make sure it has a few instances of the key
   for (i=0; i<N; i++) a[i] = random()%N;
   a[N%43] = key;  a[N%73] = key;  a[N%3] = key;

   // count key in a with geometric decomposition
   nkey = search_geom(N, a, key);

   // count key in a with divide and conquer (aka: recursive splitting)
   nkey = search_recur(N, a, key);
}
This is included for completeness … it just shows how we call the different functions to count instances of a key.
Count keys with OpenMP
// geometric decomposition implemented with the SPMD pattern
long search_geom(long Nlen, long *a, long key)
{
   long count = 0;
   #pragma omp parallel reduction(+:count)
   {
      int i, num_threads = omp_get_num_threads();
      int ID = omp_get_thread_num();
      int istart = ID * Nlen/num_threads;
      int iend = (ID+1) * Nlen/num_threads;
      if (ID == (num_threads-1)) iend = Nlen;
      for (i=istart; i<iend; i++)
         if (a[i]==key) count++;
   }
   return count;
}
• Design patterns used:
  – Geometric decomposition
  – SPMD
This is a common trick to handle the case when N is not evenly divided by the number of threads
Count keys: with OpenMP

long search_recur(long Nlen, long *a, long key)
{
   long count = 0;
   #pragma omp parallel reduction(+:count)
   {
      #pragma omp single
      count = search(Nlen, a, key);
   }
   return count;
}
• Design patterns used:
  – Divide and conquer
  – Fork-join
long search(long Nlen, long *a, long key)
{
   long count1 = 0, count2 = 0, Nl2;
   if (Nlen <= 2) {   /* base case: solve directly */
      if (Nlen >= 1 && a[0] == key) count1 = 1;
      if (Nlen == 2 && a[1] == key) count2 = 1;
      return count1 + count2;
   } else {
      Nl2 = Nlen/2;
      #pragma omp task shared(count1)
      count1 = search(Nl2, a, key);
      #pragma omp task shared(count2)
      count2 = search(Nlen-Nl2, a+Nl2, key);   /* upper half may be one larger */
      #pragma omp taskwait
      return count1 + count2;
   }
}
Count keys: Random number generator
// A simple random number generator
static unsigned long long MULTIPLIER = 764261123;
static unsigned long long PMOD = 2147483647;
unsigned long long random_last = 42;

long random()
{
   unsigned long long random_next;

   // compute an integer random number from zero to PMOD
   random_next = (unsigned long long)((MULTIPLIER * random_last) % PMOD);
   random_last = random_next;
   return (long) random_next;
}
I include this for completeness … it has nothing to do with any parallelism
Outline
• Recursion concepts
• Recursion and the divide and conquer pattern
• Case study: sorting
• Case study: Sudoku
Merge Sort
• Sorting: an important class of algorithms that take an input list and generate a sorted list.
• Merge sort
  – Split the list in two
  – Sort each half by a recursive call to merge sort
  – Continue until you hit a trivial base case
  – Unwind the recursive stack to generate the final sorted list
• Example
  – Starting point: [3 6 4 1 5 7 3 2]
  – Split in two: [3 6 4 1] [5 7 3 2]
  – Split in two: [3 6] [4 1] [5 7] [3 2]
  – Base case: [3 6] [1 4] [5 7] [2 3]
  – Sort on merge: [1 3 4 6] [2 3 5 7]
  – Sort on merge: [1 2 3 3 4 5 6 7]
Source: Mattson and Keutzer, UCB CS294
• Fill the stack of recursive calls to merge sort; after the base case (n<2), unwind the stack to generate the sorted list.
Serial Merge Sort:
void mergesort(int * X, int n, int * tmp)
{
   if (n < 2) return;

   /* recursively sort each half of the list */
   mergesort(X, n/2, tmp);
   mergesort(X+(n/2), n-(n/2), tmp);

   /* merge sorted halves into sorted list */
   merge(X, n, tmp);
}
Note: we include the merge function in a later slide … it’s the same for both the serial and parallel cases.
tmp points to space equal in size to X and is used as a buffer to sort into
• Each mergesort call is independent, so a parallel version is trivial to create.
Cilk parallel Merge Sort:
cilk void mergesort(int * X, int n, int * tmp)
{
   if (n < 2) return;

   /* recursively sort each half of the list */
   spawn mergesort(X, n/2, tmp);
   spawn mergesort(X+(n/2), n-(n/2), tmp);
   sync;

   /* merge sorted halves into sorted list */
   merge(X, n, tmp);
}
Parallel program created by inserting 4 cilk keywords
• OpenMP 3.0 tasks let you write the same algorithm as with Cilk, but …
  – OpenMP exposes threads … you must make the call inside a parallel region.
  – OpenMP has a more flexible data model … you must explicitly define how to scope data in the tasks.
OpenMP parallel Merge Sort:
void mergesort(int * X, int n, int * tmp)
{
   if (n < 2) return;

   #pragma omp task firstprivate (X, n, tmp)
   mergesort(X, n/2, tmp);

   #pragma omp task firstprivate (X, n, tmp)
   mergesort(X+(n/2), n-(n/2), tmp);

   #pragma omp taskwait

   /* merge sorted halves into sorted list */
   merge(X, n, tmp);
}
Main program for OpenMP merge sort
• The only way to create threads in OpenMP is with a parallel construct.
• Hence our parallel merge sort must occur within a parallel region.

#include <omp.h>
#define MAX_SIZE 1000

int main()
{
   int n = 100;
   int data[MAX_SIZE], tmp[MAX_SIZE];
   generate_list(data, n);
   #pragma omp parallel
   {
      #pragma omp single
      mergesort(data, n, tmp);
   }
}
Create a team of threads using the “default number” of threads
The single construct causes only one member of the team to call the first mergesort
Background for Merge Sort: The merge routine
#include <string.h>
#include <stdlib.h>

void merge(int * X, int n, int * tmp)
{
   int i = 0;
   int j = n/2;
   int ti = 0;

   while (i<n/2 && j<n) {
      if (X[i] < X[j]) {
         tmp[ti] = X[i]; ti++; i++;
      } else {
         tmp[ti] = X[j]; ti++; j++;
      }
   }
   while (i<n/2) { /* finish up lower half */
      tmp[ti] = X[i]; ti++; i++;
   }
   while (j<n) { /* finish up upper half */
      tmp[ti] = X[j]; ti++; j++;
   }
   memcpy(X, tmp, n*sizeof(int));
} // end of merge()
• This is the merge function used by both the serial and parallel versions of the program
Background for merge sort: Generate_list
• A function to generate a list of integers (included for completeness … this has nothing to do with sorting or parallelism)
void generate_list(int * x, int n)
{
   int i;
   srand(10000);
   for (i = 0; i < n; i++) {
      int val = n * ((double) rand() / ((double) RAND_MAX + (double) 1));
      x[i] = val;
   }
}
Outline
• Recursion concepts
• Recursion and the divide and conquer pattern
• Case study: sorting
• Case study: Sudoku
Sudoku
• A game where you fill in a grid with numbers
  – A number cannot appear more than once in any column
  – A number cannot appear more than once in any row
  – A number cannot appear more than once in any “region”
• Typically presented with a 9 by 9 grid … but for simplicity we'll consider a 4 by 4 grid
A 4 x 4 Sudoku puzzle with 11 open positions … we show three steps in the solution:
• Place a 1, since 1 is the only number missing in this column
• Place a 2, since 3 already appears in this region
• Place a 3, since 3 is the only number missing in this row
Sudoku Algorithm
• The two-dimensional Sudoku grid is flattened into a vector
  – Unsolved locations are filled with zeros
  – The first two rows of the initial 4 x 4 puzzle are shown
  – The current working location [loc=0] is shown in red; the subgrid size is 2, so each side of the grid has size*size = 4 cells
  – Initially call spawn solve(size=2, grid, loc=0)
grid: 3 0 0 4 0 0 0 2 …

• The first location already has a solution, so move to the next location
  – Recursively call spawn solve(size=2, grid, loc=loc+1)
Exhaustive Search
• The next location [loc=1] has no solution (‘0’ in the current cell) so …
  – Create 4 new grids and try each of the 4 possibilities (1,2,3,4) concurrently
  – Note: the search goes much faster if each guess is first tested to see if it is legal
  – Spawn a new search tree for each guess k
  – Call: spawn solve(size=2, grid[k], loc=loc+1)
new grids:
3 1 0 4 0 0 0 2 …
3 2 0 4 0 0 0 2 …
3 3 0 4 0 0 0 2 …   ← illegal, since 3 is already in the same row
3 4 0 4 0 0 0 2 …   ← illegal, since 4 is already in the same row
Cilk Sudoku solution (part 1 of 3)
cilk int solve(int size, int* grid, int loc)
{
   int i, k, solved, nGrids, solution[MAX_NUM];
   int* myGrid[MAX_NUM];
   int numNumbers = size*size;
   int Gridlen = numNumbers*numNumbers;

   if (loc == Gridlen) {
      /* maximum depth; reached the end of the puzzle */
      return check_solution(size, grid);
   }

   /* if this node has a solution (given by the puzzle) at this location */
   /* move to the next location */
   if (grid[loc] != 0) {
      solved = spawn solve(size, grid, loc+1);
      return solved;
   }
Cilk Sudoku solution (part 2 of 3)
   /* try each number (unique to row, col, sq) */
   nGrids = 0;
   for (i = 0, k = 0; i < MAX_NUM; i++) {
      k = next_guess(size, k, loc, grid);
      if (k == 0) break;   /* no more legal guesses at this location */

      /* need a new grid to work with */
      myGrid[i] = new_grid(size, grid);
      myGrid[i][loc] = k;
      solution[i] = spawn solve(size, myGrid[i], loc+1);
      nGrids += 1;
   }
   sync;
Cilk Sudoku solution (part 3 of 3)
   /* check to see if there is a solution */
   solved = 0;
   for (i = 0; i < nGrids; i++) {
      if (solution[i] == 1) {
         int n;
         /* found a solution, copy the result to the parent */
         for (n = loc; n < Gridlen; n++) {
            grid[n] = (myGrid[i])[n];
         }
         solved = 1;
      }
      free(myGrid[i]);
   }
   return solved;
}