PART 4: PARALLEL PATTERNS
WEEK 12:
Design of a Parallel Program
* Flynn’s Taxonomy
* Levels of Parallelism
* Principal Parallel Patterns
* Result Parallelism
* Agenda Parallelism
* Specialist Parallelism
CSC526: Parallel Processing
Fall 2016
Dr. Soha S. Zaghloul
FLYNN'S TAXONOMY
Flynn categorized computer architectures into four main classes according to the number of instruction and data streams. These are:
SISD: Single Instruction, Single Datum
SIMD: Single Instruction, Multiple Data
MISD: Multiple Instructions, Single Datum
MIMD: Multiple Instructions, Multiple Data
FLYNN'S TAXONOMY – SISD
One stream of instructions processes a single stream of data.
This architecture is shown in the figure below:
[Figure: a control unit issues the instruction stream to a single processor, which reads the input data and produces the output data.]
Obviously, this is the common model of single-processor computers.
FLYNN'S TAXONOMY – SIMD
A single instruction stream is broadcast to multiple processors, each with its own data
stream.
This architecture is shown in the figure below:
[Figure: one control unit broadcasts the instruction stream to four processors, each with its own input data and output data.]
Array (vector) processors follow this model; GPUs are a modern example.
FLYNN'S TAXONOMY – MISD
No well-known system fits this designation. It is mentioned only for the sake of
completeness.
FLYNN'S TAXONOMY – MIMD
Each processing element has its own stream of instructions operating on its own
data.
This architecture is shown in the figure below:
[Figure: four processing elements, each with its own control unit, instruction stream, input data, and output data, connected through an interconnection network.]
Obviously, this is the MPP architecture.
GRANULARITY
Granularity or grain size is a measure of the amount of computation involved in a software process.
In other words, the granularity defines the parallelism level of a process.
Three main grain sizes are identified:
Fine grain
Medium grain
Coarse grain
In general, the execution of a program may involve a combination of these levels.
The actual combination depends on many factors such as:
Algorithm
Language
Compiler support
Hardware limitations
PARALLELISM LEVELS
According to the grain size, five levels of parallelism are identified:
Instruction Level
Loop Level
Procedure Level
Subprogram Level
Job (Program) Level
The figure in the next slide shows the correspondence of parallelism levels to grain
sizes.
PARALLELISM LEVELS TO GRAIN SIZE
[Figure: the five parallelism levels arranged against grain size, with the degree of parallelism, the communication frequency, and the scheduling overhead as vertical axes.]
Level 5: Jobs/Programs (coarse grain; tens of thousands of instructions)
Level 4: Subprograms (coarse or medium grain; thousands of instructions)
Level 3: Procedures (medium grain; less than 2000 instructions)
Level 2: Loops (fine grain; less than 500 instructions)
Level 1: Instructions (fine grain; from 2 to thousands of instructions)
The fine-grain levels are supported by SMP; the medium- and coarse-grain levels are supported by MPP.
Moving from the coarse-grain levels down to the fine-grain ones, the degree of parallelism, the communication frequency, and the scheduling overhead all increase.
GRANULARITY - EXAMPLE
Consider the problem of calculating all the pixels in all the frames of a computer-animated film. This may be solved in one of two ways:
Assign a distinct processor to calculate each pixel: each result requires a small amount of computation. This is fine-grained parallelism.
Assign a distinct processor to render each entire frame: each result requires a large amount of computation. This is coarse-grained parallelism.
PARALLELISM PATTERNS
Three principal patterns for designing parallel programs are identified. These are:
Result Parallelism
Agenda Parallelism
Specialist Parallelism
Using the above patterns, the steps for designing a parallel program are:
Identify the pattern that best matches the problem
Take the pattern’s suggested design as a starting point
Implement the pattern using appropriate constructs in a parallel programming
language
RESULT PARALLELISM (1) – CONCEPT
Result Parallelism pattern has the following criteria:
There is a collection of multiple results
The individual results are all computed in parallel, each by its own processor
Each processor is able to carry out the complete computation to produce one
result
The conceptual parallel program design is as follows:
Processor 1: Compute Result 1
Processor 2: Compute Result 2
….
Processor N: Compute Result N
RESULT PARALLELISM (2) – EXAMPLE 1
Consider the problem of calculating the factorials of a set of numbers stored in an
array data of size N:
Processor 1 is assigned to compute the factorial of data[0]
Processor 2 is assigned to compute the factorial of data[1]
Processor N is assigned to compute the factorial of data[N-1]
The figure in the next slide illustrates the result pattern:
RESULT PARALLELISM (3) – FIGURE EXAMPLE 1
Result Parallelism is depicted in the following figure:
[Figure: Processors 1, 2, 3, …, 8 compute Factorial(data[0]), Factorial(data[1]), Factorial(data[2]), …, Factorial(data[7]) in parallel.]
All processors’ results are independent of each other.
We are concerned with the result calculated by each stand-alone processor.
Note that there is no data sharing between processors.
Conceptually speaking, all processors can start and finish at the same time.
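The pattern above can be sketched in Python. This is a minimal illustration, not part of the original slides: the array contents are hypothetical, and worker threads from `concurrent.futures` stand in for the conceptual one-processor-per-result.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input array; any list of non-negative integers works.
data = [3, 5, 7, 10, 1, 4, 6, 2]

# Result parallelism: one worker per result, each carrying out the
# complete factorial computation independently; no data is shared.
with ThreadPoolExecutor(max_workers=len(data)) as pool:
    results = list(pool.map(math.factorial, data))

print(results)  # one factorial per array element, in order
```

Because the results are independent, `map` can hand each element to its own worker, and the order of the output still matches the order of `data`.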
RESULT PARALLELISM (4) – SEQUENTIAL DEPENDENCY EXAMPLE 2
Recalculating the formulae in a spreadsheet is another example of Result Parallelism.
Conceptually, each cell has its own processor that computes the value of the
cell’s formula.
However, if the formula for cell B1 uses the value of cell A1, then B1 must wait
until A1 finishes: This is known as Result Parallelism with Sequential
Dependency.
The figure in the next slide depicts this concept.
RESULT PARALLELISM (5) – SEQUENTIAL DEPENDENCY EXAMPLE 2 FIGURE
[Figure: at time t1, Processors 1 to 4 compute Results 1 to 4 in parallel; at time t2, Processors 5 to 8 compute Results 5 to 8, which depend on the earlier results.]
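The spreadsheet dependency can be sketched with futures. This is a hypothetical two-cell example (the cell names A1 and B1 and the formulas are illustrative, not from the slides): B1's task blocks on A1's result before computing.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical spreadsheet: A1 holds a literal, B1's formula uses A1.
with ThreadPoolExecutor(max_workers=2) as pool:
    a1 = pool.submit(lambda: 21)               # cell A1: independent formula
    b1 = pool.submit(lambda: a1.result() * 2)  # cell B1: waits until A1 finishes

print(b1.result())  # 42
```

Both cells are submitted at once, but `a1.result()` inside B1's formula enforces the sequential dependency: B1's worker cannot finish before A1's does.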
AGENDA PARALLELISM (1) – CONCEPT
Agenda Parallelism pattern has the following criteria:
There is a collection of multiple tasks
We are interested in one result only, or a small number of results
Each processor is able to carry out the complete computations to produce one
result for the assigned task
The conceptual parallel program design is as follows:
Processor 1: Perform task 1
Processor 2: Perform task 2
….
Processor N: Perform task N
AGENDA PARALLELISM (2) – FIGURE
Agenda Parallelism is depicted in the following figure:
[Figure: Processors 1, 2, 3, …, 8 perform Tasks 1, 2, 3, …, 8 in parallel.]
AGENDA PARALLELISM (3) – SEQUENTIAL DEPENDENCY EXAMPLE 3
Consider the following problem for an array of numbers data[4]:
Phase 1: Get the factorial of each number in the array data
Phase 2: Get the Fibonacci of each factorial
Phase 3: Classify the results into three categories:
Numbers that are less than threshold1
Numbers between threshold1 and threshold2
Numbers that are greater than threshold2
The following code segment illustrates the problem:
//calculate factorials
for (i = 0; i < N; i++) factorial[i] = Facto(data[i]); //Facto is a method
//calculate Fibonacci numbers
for (i = 0; i < N; i++) fibonacci[i] = Fibo(factorial[i]); //Fibo is a method
//classify according to thresholds
x = 0; y = 0; z = 0;
for (i = 0; i < N; i++)
  if (fibonacci[i] < threshold1) {class1[x] = fibonacci[i]; x++;}
  else if (fibonacci[i] > threshold2) {class3[z] = fibonacci[i]; z++;}
  else {class2[y] = fibonacci[i]; y++;}
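A runnable sketch of the three phases is given below. The sample data and thresholds are hypothetical, and the slides' `Facto` and `Fibo` methods are stood in for by `math.factorial` and a small iterative `fibo`; each phase must finish completely before the next begins.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def fibo(n):
    """Iterative Fibonacci; fibo(0) = 0, fibo(1) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

data = [1, 2, 3, 4]             # hypothetical data[4]
threshold1, threshold2 = 2, 10  # hypothetical thresholds

with ThreadPoolExecutor() as pool:
    # Phase 1: factorials in parallel. Consuming the map acts as a
    # barrier, since Phase 2 needs all of Phase 1's results.
    factorial = list(pool.map(math.factorial, data))
    # Phase 2: Fibonacci of each factorial, in parallel.
    fibonacci = list(pool.map(fibo, factorial))

# Phase 3: classify against the thresholds.
class1 = [v for v in fibonacci if v < threshold1]
class3 = [v for v in fibonacci if v > threshold2]
class2 = [v for v in fibonacci if threshold1 <= v <= threshold2]
```

The sequential dependency between phases is what makes this agenda parallelism rather than pure result parallelism: parallelism exists only within a phase.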
AGENDA PARALLELISM (4) – FIGURE EXAMPLE 3
[Figure: Phase 1: Processors 1 to 4 compute Factorial(data[0]) to Factorial(data[3]). Phase 2: Processors 5 to 8 compute Fibonacci(facto[0]) to Fibonacci(facto[3]). Phase 3: Processors 9, 10, and 11 collect the numbers less than threshold1, between threshold1 and threshold2, and greater than threshold2, respectively.]
AGENDA PARALLELISM (5) – REDUCTION
When the output of an agenda parallel program is a summary of the individual tasks’
results, the program is following the so-called reduction pattern.
Consider the example of finding the product of factorials of a set of numbers stored in
an array data of size N:
Task 1: determine the factorial of data[0]
Task 2: determine the factorial of data[1]
Task N: determine the factorial of data[N-1]
Task N+1: find the product of all factorials
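The reduction above can be sketched as follows (the array contents are hypothetical): tasks 1 to N run in parallel, and task N+1 folds their results into a single product.

```python
import math
from functools import reduce
from operator import mul
from concurrent.futures import ThreadPoolExecutor

data = [2, 3, 4, 5]  # hypothetical input array

with ThreadPoolExecutor() as pool:
    # Tasks 1..N: the factorials, computed in parallel.
    factorials = list(pool.map(math.factorial, data))

# Task N+1: the reduction step combines all partial results into one.
product = reduce(mul, factorials, 1)
print(product)
```

Only the single summary value `product` is of interest here, which is the defining trait of the reduction pattern.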
The figure in the next slide depicts this pattern.
AGENDA PARALLELISM (6) – REDUCTION EXAMPLE 4
[Figure: Processors 1 to 4 compute Factorial(data[0]) to Factorial(data[3]) in parallel; Processor 5 then computes the product of the factorials.]
SPECIALIST PARALLELISM (1) – CONCEPT
Specialist Parallelism pattern has the following criteria:
There is a group of tasks that must be performed to solve the problem on a series of data items
Each processor performs only one task, applied to the whole series of data
The conceptual parallel program design is as follows:
Processor 1: For each item
Perform task 1 on the item
Processor 2: For each item
Perform task 2 on the item
….
Processor N: For each item
Perform task N on the item
The figure in the next slide depicts the Specialist Pattern.
SPECIALIST PARALLELISM (2) – FIGURE
Specialist Parallelism is depicted in the following figure:
[Figure: Processor 1 applies Task 1 to Items 1 to 5; Processor 2 applies Task 2 to Items 1 to 5; Processor 3 applies Task 3 to Items 1 to 5.]
SPECIALIST PARALLELISM (3) – EXAMPLE 5
Given an array data[8], we need to:
Count the number of positive elements (Processor 1)
Count the number of negative elements (Processor 2)
Count the number of zeroes (Processor 3)
A code segment of the sequential version of the above problem is shown below:
for (i = 0; i < N; i++)
  if (data[i] > 0) positive++;
  else if (data[i] < 0) negative++;
  else zero++;
A code segment of the parallel version of the above problem is shown below:
for (i = 0; i < N; i++) if (data[i] > 0) positive++; //Processor 1
for (i = 0; i < N; i++) if (data[i] < 0) negative++; //Processor 2
for (i = 0; i < N; i++) if (data[i] == 0) zero++; //Processor 3
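The three specialist loops can be sketched as a runnable example (the sample data is hypothetical, and threads stand in for the three processors): each worker applies its single task to the entire array.

```python
from concurrent.futures import ThreadPoolExecutor

data = [3, -1, 0, 7, -5, 0, 2, -8]  # hypothetical data[8]

# Specialist parallelism: each worker is a specialist that applies its
# one task (one kind of count) to the whole series of data.
def count_positive(a): return sum(1 for v in a if v > 0)
def count_negative(a): return sum(1 for v in a if v < 0)
def count_zero(a):     return sum(1 for v in a if v == 0)

with ThreadPoolExecutor(max_workers=3) as pool:
    positive = pool.submit(count_positive, data)  # Processor 1
    negative = pool.submit(count_negative, data)  # Processor 2
    zero = pool.submit(count_zero, data)          # Processor 3

print(positive.result(), negative.result(), zero.result())
```

Note how the sequential loop had to be split by task, not by data, to fit the specialist pattern.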
The figure in the next slide illustrates Example 5.
SPECIALIST PARALLELISM (4) – FIGURE EXAMPLE 5
[Figure: each of Processors 1, 2, and 3 scans the whole array data[0] to data[7]; Processor 1 counts the positive numbers, Processor 2 counts the negative numbers, and Processor 3 counts the zeroes.]
SPECIALIST PARALLELISM (5) – PIPELINE
When there are sequential dependencies between the tasks in a specialist parallel
problem, the program follows a pipelined pattern.
The output of one processor becomes the input for the next processor.
All processors work in parallel, each taking its input from the preceding processor’s
previous output.
Consider the following example in an image processing application:
Calculate all pixels of a frame (Processor 1)
Render the frame (Processor 2)
Compress the frame (Processor 3)
Store the frame (Processor 4)
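The four-stage pipeline can be sketched with threads connected by queues. This is a minimal illustration: the frame names are hypothetical, and each stage's "work" is a stand-in that just tags the frame rather than doing real image processing.

```python
import queue
import threading

SENTINEL = None  # marks the end of the frame stream

def stage(inbox, outbox, tag):
    """One pipeline specialist: repeatedly take a frame from the
    preceding stage's queue, apply this stage's task, and pass it on."""
    while True:
        frame = inbox.get()
        if frame is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown downstream
            return
        outbox.put(frame + " -> " + tag)

# Queues connect the stages: calculate -> render -> compress -> store.
q1, q2, q3, q4 = (queue.Queue() for _ in range(4))
workers = [
    threading.Thread(target=stage, args=(q1, q2, "calculated")),  # Processor 1
    threading.Thread(target=stage, args=(q2, q3, "rendered")),    # Processor 2
    threading.Thread(target=stage, args=(q3, q4, "compressed")),  # Processor 3
]
for t in workers:
    t.start()

for i in range(1, 6):
    q1.put("Frame %d" % i)
q1.put(SENTINEL)

# Processor 4: store the frames as they emerge from the pipeline.
stored = []
while True:
    frame = q4.get()
    if frame is SENTINEL:
        break
    stored.append(frame)
for t in workers:
    t.join()
print(stored[0])
```

All stages run concurrently, each consuming the previous stage's output, which is exactly the pipelined specialist pattern described above.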
SPECIALIST PARALLELISM (6) – FIGURE EXAMPLE 6
[Figure: four timelines, one per processor. Processor 1 calculates Frames 1 to 5, Processor 2 renders them, Processor 3 compresses them, and Processor 4 stores them; each timeline numbers the frames 1 to 5 in that processor's own time.]
Note that the time is relative to each processor.
The next figure depicts the example with respect to absolute time.
SPECIALIST PARALLELISM (6) – PIPELINE EXAMPLE 6: ABSOLUTE TIME
Cycle:    1   2   3   4   5   6   7   8
Frame 1:  CA  RE  CO  ST
Frame 2:      CA  RE  CO  ST
Frame 3:          CA  RE  CO  ST
Frame 4:              CA  RE  CO  ST
Frame 5:                  CA  RE  CO  ST
(CA = calculate on P1, RE = render on P2, CO = compress on P3, ST = store on P4)
Time in cycles: with four pipeline stages and five frames, all frames are finished after 8 cycles, compared with the 20 cycles (5 frames × 4 stages) a single processor would need.
NOTES
A sequential program may be completely rewritten to adapt it to a parallel pattern (see Example 5).
The difference between the parallelism patterns can be summarized as follows:
Result Parallelism: we are concerned with the result computed by each individual processor
Agenda Parallelism: we are concerned only with a combination of the results (sequential dependency) or with a summary of the individual results (reduction)
Specialist Parallelism: focuses on the tasks; each processor specializes in one task, and the tasks execute in parallel