
PART 4: PARALLEL PATTERNS

WEEK 12:

Design of a Parallel Program

* Flynn’s Taxonomy

* Levels of Parallelism

* Principal Parallel Patterns

* Result Parallelism

* Agenda Parallelism

* Specialist Parallelism

CSC526: Parallel Processing

Fall 2016

Dr. Soha S. Zaghloul 1


FLYNN’S TAXONOMY


Flynn categorized computer architectures into four main classes according to the number of instruction and data streams. These are:

SISD: Single Instruction, Single Datum

SIMD: Single Instruction, Multiple Data

MISD: Multiple Instructions, Single Datum

MIMD: Multiple Instructions, Multiple Data


FLYNN’S TAXONOMY – SISD


One stream of instructions processes a single stream of data.

This architecture is shown in the figure below:

[Figure: a single control unit sends instructions to a single processor, which transforms input data into output data.]

Obviously, this is the common model of single-processor computers.


FLYNN’S TAXONOMY – SIMD


A single instruction stream is broadcast to multiple processors, each with its own data stream.

This architecture is shown in the figure below:

[Figure: one control unit broadcasts the same instructions to four processors, each with its own input data and output data.]

This is the model of array and vector processors, where many processing elements execute the same instruction in lockstep.


FLYNN’S TAXONOMY – MISD


No well-known system fits this designation. It is mentioned only for the sake of

completeness.


FLYNN’S TAXONOMY – MIMD


Each processing element has its own stream of instructions operating on its own

data.

This architecture is shown in the figure below:

[Figure: four independent control-unit/processor pairs, each with its own instruction stream and its own input and output data, connected through an interconnection network.]

Obviously, this is the MPP architecture.


GRANULARITY


Granularity, or grain size, is a measure of the amount of computation involved in a software process.

In other words, the granularity defines the parallelism level of a process.

Three main grain sizes are identified:

Fine grain

Medium grain

Coarse grain

In general, the execution of a program may involve a combination of these levels.

The actual combination depends on many factors such as:

Algorithm

Language

Compiler support

Hardware limitations


PARALLELISM LEVELS


According to the grain size, five levels of parallelism are identified:

Instruction Level

Loop Level

Procedure Level

Subprogram Level

Job (Program) Level

The figure in the next slide shows the correspondence of parallelism levels to grain

sizes.

Fine grain is supported by SMP, while coarse grain is supported by MPP.


PARALLELISM LEVELS TO GRAIN SIZE


Level 5: Jobs/Programs (tens of thousands of instructions): coarse grain

Level 4: Subprograms (thousands of instructions): coarse or medium grain

Level 3: Procedures (less than 2000 instructions): medium grain

Level 2: Loops (less than 500 instructions): fine grain

Level 1: Instructions (from 2 to thousands of instructions): fine grain

Moving down toward finer grain increases the degree of parallelism, but also the communication frequency and the scheduling overhead.


GRANULARITY - EXAMPLE


Consider the problem of calculating all the pixels in all the frames of a computer-animated film. This may be solved in one of two ways:

Assign a distinct processor to calculate each pixel: each result requires a small amount of computation. This is fine-grained parallelism.

Assign a distinct processor to render each entire frame: each result requires a large amount of computation. This is coarse-grained parallelism.


PARALLELISM PATTERNS


Three principal patterns for designing parallel programs are identified. These are:

Result Parallelism

Agenda Parallelism

Specialist Parallelism

Using the above patterns, the steps for designing a parallel program are:

Identify the pattern that best matches the problem

Take the pattern’s suggested design as a starting point

Implement the pattern using appropriate constructs in a parallel programming

language


RESULT PARALLELISM (1) – CONCEPT


Result Parallelism pattern has the following criteria:

There is a collection of multiple results

The individual results are all computed in parallel, each by its own processor

Each processor is able to carry out the complete computation to produce one

result

The conceptual parallel program design is as follows:

Processor 1: Compute Result 1

Processor 2: Compute Result 2

….

Processor N: Compute Result N


RESULT PARALLELISM (2) – EXAMPLE 1


Consider the problem of calculating the factorials of a set of numbers stored in an

array data of size N:

Processor 1 is assigned to compute the factorial of data[0]

Processor 2 is assigned to compute the factorial of data[1]

Processor N is assigned to compute the factorial of data[N-1]
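This design can be sketched as follows (a minimal illustration, assuming Python's concurrent.futures with one worker per result rather than N physical processors; the array contents are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor
from math import factorial

data = [3, 5, 7, 10, 4, 6, 2, 8]   # hypothetical input array of size N
N = len(data)

with ThreadPoolExecutor(max_workers=N) as pool:
    # Worker i independently computes the factorial of data[i];
    # no data is shared between workers.
    results = list(pool.map(factorial, data))

# results[0] is Factorial(data[0]), ..., results[N-1] is Factorial(data[N-1])
```

Each call is independent, so all workers can start and finish without communicating, which is exactly the result pattern.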

The figure in the next slide illustrates the result pattern:

RESULT PARALLELISM (3) – FIGURE EXAMPLE 1

Result Parallelism is depicted in the following figure:

[Figure: Processor 1 computes Factorial(data[0]), Processor 2 computes Factorial(data[1]), Processor 3 computes Factorial(data[2]), ..., Processor 8 computes Factorial(data[7]).]

All processors’ results are independent of each other.

We are concerned with the result calculated by each stand-alone processor.

Note that there is no data sharing between processors.

Conceptually speaking, all processors can start and finish at the same time.

RESULT PARALLELISM (4) – SEQUENTIAL DEPENDENCY EXAMPLE 2


Recalculating the formulae in a spreadsheet is another example of Result Parallelism.

Conceptually, each cell has its own processor that computes the value of the

cell’s formula.

However, if the formula for cell B1 uses the value of cell A1, then B1 must wait

until A1 finishes: This is known as Result Parallelism with Sequential

Dependency.

The figure in the next slide depicts this concept.
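A minimal sketch of this dependency (illustrative Python; the formulas A1 = 2 + 3 and B1 = A1 * 10 are hypothetical): the worker for B1 blocks on A1's future, so B1 cannot finish before A1.

```python
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as pool:
    a1 = pool.submit(lambda: 2 + 3)             # cell A1: independent
    b1 = pool.submit(lambda: a1.result() * 10)  # cell B1: waits for A1's value
    b1_value = b1.result()                      # b1_value is 50
```

Both cells have their own worker, but the call to a1.result() inside B1's formula enforces the sequential dependency.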


RESULT PARALLELISM (5) – SEQUENTIAL DEPENDENCY EXAMPLE 2 FIGURE

[Figure: at time t1, Processors 1 through 4 compute Results 1 through 4; at time t2, Processors 5 through 8 compute Results 5 through 8, whose formulas depend on the earlier results.]


AGENDA PARALLELISM (1) – CONCEPT


Agenda Parallelism pattern has the following criteria:

There is a collection of multiple tasks

We are interested in one result only, or a small number of results

Each processor is able to carry out the complete computations to produce one

result for the assigned task

The conceptual parallel program design is as follows:

Processor 1: Perform task 1

Processor 2: Perform task 2

….

Processor N: Perform task N


AGENDA PARALLELISM (2) – FIGURE


Agenda Parallelism is depicted in the following figure:

[Figure: Processor 1 performs Task 1, Processor 2 performs Task 2, Processor 3 performs Task 3, ..., Processor 8 performs Task 8.]

AGENDA PARALLELISM (3) – SEQUENTIAL DEPENDENCY EXAMPLE 3


Consider the following problem for an array of numbers data[4]:

Phase 1: Get the factorial of each number in the array data

Phase 2: Get the Fibonacci of each factorial

Phase 3: Classify into three categories:

Numbers that are less than threshold1

Numbers between threshold1 and threshold2

Numbers that are greater than threshold2

The following code segment illustrates the problem:

//calculate Factorial
for (i=0; i < N; i++) factorial[i] = Facto(data[i]); //Facto is a method

//calculate Fibonacci
for (i=0; i < N; i++) fibonacci[i] = Fibo(factorial[i]); //Fibo is a method

//classify according to thresholds
x = 0; y = 0; z = 0;
for (i=0; i < N; i++)
    if (fibonacci[i] < threshold1) {class1[x] = fibonacci[i]; x++;}
    else if (fibonacci[i] > threshold2) {class3[z] = fibonacci[i]; z++;}
    else {class2[y] = fibonacci[i]; y++;}
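An agenda-parallel sketch of the same three phases (illustrative Python; Facto and Fibo are small stand-ins for the methods above, and data and the thresholds are hypothetical values): tasks within a phase run in parallel, and each phase starts only after the previous one completes.

```python
from concurrent.futures import ThreadPoolExecutor
from math import factorial as Facto   # stand-in for the Facto method

def Fibo(n):
    # Stand-in for the Fibo method: returns the n-th Fibonacci number.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

data = [1, 2, 3, 4]               # hypothetical data[4]
threshold1, threshold2 = 2, 100   # hypothetical thresholds

with ThreadPoolExecutor() as pool:
    # Phase 1: one task per factorial.
    facto = list(pool.map(Facto, data))
    # Phase 2: one task per Fibonacci; starts only after phase 1 is done.
    fibo = list(pool.map(Fibo, facto))
    # Phase 3: three classification tasks, run in parallel.
    f1 = pool.submit(lambda: [v for v in fibo if v < threshold1])
    f3 = pool.submit(lambda: [v for v in fibo if v > threshold2])
    f2 = pool.submit(lambda: [v for v in fibo if threshold1 <= v <= threshold2])
    class1, class2, class3 = f1.result(), f2.result(), f3.result()
```

The phase boundaries mirror the figure on the next slide: the Fibonacci tasks consume the factorial results, and the classification tasks consume the Fibonacci results.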


AGENDA PARALLELISM (4) – FIGURE EXAMPLE 3


[Figure: Phase 1: Processors 1 through 4 compute Factorial(data[0]) through Factorial(data[3]). Phase 2: Processors 5 through 8 compute Fibonacci(facto[0]) through Fibonacci(facto[3]). Phase 3: Processor 9 collects numbers less than threshold1, Processor 10 collects numbers between threshold1 and threshold2, and Processor 11 collects numbers greater than threshold2.]


AGENDA PARALLELISM (5) – REDUCTION


When the output of an agenda parallel program is a summary of the individual tasks’

results, the program is following the so-called reduction pattern.

Consider the example of finding the product of factorials of a set of numbers stored in

an array data of size N:

Task 1: determine the factorial of data[0]

Task 2: determine the factorial of data[1]

Task N: determine the factorial of data[N-1]

Task N+1: find the product of all factorials

The figure in the next slide depicts such pattern.
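The reduction can be sketched as follows (illustrative Python; the array contents are hypothetical): tasks 1 through N run in parallel, and task N+1 combines their results into a single product.

```python
from concurrent.futures import ThreadPoolExecutor
from math import factorial, prod

data = [2, 3, 4, 5]   # hypothetical array of size N

with ThreadPoolExecutor() as pool:
    # Tasks 1..N: independent factorials, computed in parallel.
    factorials = list(pool.map(factorial, data))
    # Task N+1: reduce the N results into one summary value.
    product = pool.submit(prod, factorials).result()

# product == 2! * 3! * 4! * 5! == 2 * 6 * 24 * 120 == 34560
```

Only the single summary value is of interest, which is what distinguishes the reduction pattern from result parallelism.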


AGENDA PARALLELISM (6) – REDUCTION EXAMPLE 4


[Figure: Processors 1 through 4 compute Factorial(data[0]) through Factorial(data[3]); Processor 5 then computes the product of the factorials.]


SPECIALIST PARALLELISM (1) – CONCEPT


Specialist Parallelism pattern has the following criteria:

There is a group of tasks that must be performed to solve the problem on a

series of (items) data

Each processor performs only one task on a series of data

The conceptual parallel program design is as follows:

Processor 1: For each item

Perform task 1 on the item

Processor 2: For each item

Perform task 2 on the item

….

Processor N: For each item

Perform task N on the item

The figure in the next slide depicts the Specialist Pattern.


SPECIALIST PARALLELISM (2) – FIGURE


Specialist Parallelism is depicted in the following figure:

[Figure: Processor 1 performs Task 1 on Items 1 through 5; Processor 2 performs Task 2 on Items 1 through 5; Processor 3 performs Task 3 on Items 1 through 5.]


SPECIALIST PARALLELISM (3) – EXAMPLE 5


Given an array data[8], we need to:

Count the number of positive elements (Processor 1)

Count the number of negative elements (Processor 2)

Count the number of zeroes (Processor 3)

A code segment of the sequential version of the above problem is shown below:

for (i=0; i < N; i++)
    if (data[i] > 0) positive++;
    else if (data[i] < 0) negative++;
    else zero++;

A code segment of the parallel version of the above problem is shown below:

for (i=0; i < N; i++) if (data[i] > 0) positive++; //Processor 1

for (i=0; i < N; i++) if (data[i] < 0) negative++; //Processor 2

for (i=0; i < N; i++) if (data[i] == 0) zero++; //Processor 3
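The parallel version can be sketched with one specialist worker per task (illustrative Python; the contents of data are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

data = [3, -1, 0, 7, 0, -5, 2, 0]   # hypothetical data[8]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Each worker performs its single task over the entire series of data.
    p = pool.submit(lambda: sum(1 for v in data if v > 0))   # Processor 1
    n = pool.submit(lambda: sum(1 for v in data if v < 0))   # Processor 2
    z = pool.submit(lambda: sum(1 for v in data if v == 0))  # Processor 3
    positive, negative, zero = p.result(), n.result(), z.result()
```

Each worker specializes in one task but scans every item, which is the defining property of the specialist pattern.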

The figure in the next slide illustrates Example 5.

SPECIALIST PARALLELISM (4) – FIGURE EXAMPLE 5


[Figure: Processors 1, 2, and 3 each scan data[0] through data[7]; Processor 1 counts positive numbers, Processor 2 counts negative numbers, and Processor 3 counts zeroes.]


SPECIALIST PARALLELISM (5) – PIPELINE


When there are sequential dependencies between the tasks in a specialist parallel

problem, the program follows a pipelined pattern.

The output of one processor becomes the input for the next processor.

All processors work in parallel, each taking its input from the preceding processor’s

previous output.

Consider the following example in an image processing application:

Processor 1: Calculate all pixels of a frame

Processor 2: Render the frame

Processor 3: Compress the frame

Processor 4: Store the frame
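The four-stage pipeline can be sketched with one thread per processor and a queue between consecutive stages (illustrative Python; calculate, render, compress, and store are hypothetical stand-ins for the real image operations):

```python
import queue
import threading

stored = []   # filled by the final stage

def calculate(frame): return frame * 10   # stand-in for Processor 1's task
def render(frame):    return frame + 1    # stand-in for Processor 2's task
def compress(frame):  return frame * 2    # stand-in for Processor 3's task
def store(frame):     stored.append(frame)  # stand-in for Processor 4's task

def stage(fn, inq, outq):
    # Each processor loops: take an item from the preceding stage,
    # perform its one task, and pass the result downstream.
    while True:
        item = inq.get()
        if item is None:              # end-of-stream marker
            if outq is not None:
                outq.put(None)
            return
        result = fn(item)
        if outq is not None:
            outq.put(result)

q1, q2, q3, q4 = queue.Queue(), queue.Queue(), queue.Queue(), queue.Queue()
threads = [threading.Thread(target=stage, args=s)
           for s in [(calculate, q1, q2), (render, q2, q3),
                     (compress, q3, q4), (store, q4, None)]]
for t in threads:
    t.start()
for frame in [1, 2, 3, 4, 5]:   # Frames 1 through 5 enter the pipeline
    q1.put(frame)
q1.put(None)
for t in threads:
    t.join()
# stored now holds the five processed frames, in order
```

All four stage threads run concurrently: while Processor 2 renders Frame 1, Processor 1 is already calculating Frame 2, exactly as in the timing figures that follow.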

SPECIALIST PARALLELISM (6) – FIGURE EXAMPLE 6


[Figure: four panels, one per processor (Processor 1: calculate, Processor 2: render, Processor 3: compress, Processor 4: store); each panel shows Frames 1 through 5 processed at that processor's local times 1 through 5.]

Note that the time is relative to each processor.

The next figure depicts the example with respect to absolute time.

SPECIALIST PARALLELISM (7) – PIPELINE EXAMPLE 6: ABSOLUTE TIME


[Figure: timing diagram, frames versus time in cycles (1 through 12). Each frame passes through CA (calculate, P1), RE (render, P2), CO (compress, P3), and ST (store, P4) in consecutive cycles, and successive frames enter the pipeline one cycle apart, so the four processors overlap their work on different frames.]


NOTES


A sequential program may have to be completely rewritten to adapt it to a parallel pattern (see Example 5).

The difference between the parallelism patterns can be summarized as follows:

Result Parallelism: We are concerned with the result of each processor.

Agenda Parallelism: We are concerned only with a combination of the results (sequential dependency) or a summary of the individual results (reduction).

Specialist Parallelism: We focus on the processors, each of which specializes in one task that it applies to every data item.
