
PART 4: PARALLEL PATTERNS

WEEK 12:

Design of a Parallel Program

* Flynn’s Taxonomy

* Levels of Parallelism

* Principal Parallel Patterns

* Result Parallelism

* Agenda Parallelism

* Specialist Parallelism

CSC526: Parallel Processing

Fall 2016

Dr. Soha S. Zaghloul 1


FLYNN’S TAXONOMY


Flynn categorized computer architectures into four main classes according to the number of instruction and data streams. These are:

SISD: Single Instruction, Single Datum

SIMD: Single Instruction, Multiple Data

MISD: Multiple Instructions, Single Datum

MIMD: Multiple Instructions, Multiple Data


FLYNN’S TAXONOMY – SISD


One stream of instructions processes a single stream of data.

This architecture is shown in the figure below:

[Figure: a single control unit sends instructions to a single processor, which transforms input data into output data.]

Obviously, this is the common model of single-processor computers.


FLYNN’S TAXONOMY – SIMD


A single instruction stream is broadcast to multiple processors, each with its own data stream.

This architecture is shown in the figure below:

[Figure: one control unit broadcasts the same instructions to four processors, each with its own input data and output data.]

This is the model of array and vector processors, where many processing elements execute the same instruction in lockstep.


FLYNN’S TAXONOMY – MISD


No well-known system fits this designation. It is mentioned only for the sake of

completeness.


FLYNN’S TAXONOMY – MIMD


Each processing element has its own stream of instructions operating on its own

data.

This architecture is shown in the figure below:

[Figure: four independent control-unit/processor pairs, each with its own instruction stream and its own input and output data, connected through an interconnection network.]

Obviously, this is the MPP architecture.


GRANULARITY


Granularity, or grain size, is a measure of the amount of computation involved in a software process.

In other words, the granularity defines the parallelism level of a process.

Three main grain sizes are identified:

Fine grain

Medium grain

Coarse grain

In general, the execution of a program may involve a combination of these levels.

The actual combination depends on many factors such as:

Algorithm

Language

Compiler support

Hardware limitations


PARALLELISM LEVELS


According to the grain size, five levels of parallelism are identified:

Instruction Level

Loop Level

Procedure Level

Subprogram Level

Job (Program) Level

The figure in the next slide shows the correspondence of parallelism levels to grain

sizes.

Fine grain is supported by SMP, while coarse grain is supported by MPP.


PARALLELISM LEVELS TO GRAIN SIZE


Level 5: Jobs/Programs (tens of thousands of instructions): coarse grain

Level 4: Subprograms (thousands of instructions): coarse or medium grain

Level 3: Procedures (less than 2000 instructions): medium grain

Level 2: Loops (less than 500 instructions): fine grain

Level 1: Instructions (from 2 to thousands of instructions): fine grain

Moving down toward finer grain increases the degree of parallelism, but also the communication frequency and the scheduling overhead.


GRANULARITY - EXAMPLE


Consider the problem of calculating all the pixels in all the frames of a computer-animated film. This may be solved in one of two ways:

Assign a distinct processor to calculate each pixel: each result requires a small amount of computation. This is fine-grained parallelism.

Assign a distinct processor to render each entire frame: each result requires a large amount of computation. This is coarse-grained parallelism.


PARALLELISM PATTERNS


Three principal patterns for designing parallel programs are identified. These are:

Result Parallelism

Agenda Parallelism

Specialist Parallelism

Using the above patterns, the steps for designing a parallel program are:

Identify the pattern that best matches the problem

Take the pattern’s suggested design as a starting point

Implement the pattern using appropriate constructs in a parallel programming

language


RESULT PARALLELISM (1) – CONCEPT


Result Parallelism pattern has the following criteria:

There is a collection of multiple results

The individual results are all computed in parallel, each by its own processor

Each processor is able to carry out the complete computation to produce one

result

The conceptual parallel program design is as follows:

Processor 1: Compute Result 1

Processor 2: Compute Result 2

….

Processor N: Compute Result N


RESULT PARALLELISM (2) – EXAMPLE 1


Consider the problem of calculating the factorials of a set of numbers stored in an

array data of size N:

Processor 1 is assigned to compute the factorial of data[0]

Processor 2 is assigned to compute the factorial of data[1]

Processor N is assigned to compute the factorial of data[N-1]
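This design can be sketched as follows (a minimal illustration, assuming Python's concurrent.futures with one worker per result rather than N physical processors; the array contents are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor
from math import factorial

data = [3, 5, 7, 10, 4, 6, 2, 8]   # hypothetical input array of size N
N = len(data)

with ThreadPoolExecutor(max_workers=N) as pool:
    # Worker i independently computes the factorial of data[i];
    # no data is shared between workers.
    results = list(pool.map(factorial, data))

# results[0] is Factorial(data[0]), ..., results[N-1] is Factorial(data[N-1])
```

Each call is independent, so all workers can start and finish without communicating, which is exactly the result pattern.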

The figure in the next slide illustrates the result pattern:

RESULT PARALLELISM (3) – FIGURE EXAMPLE 1

Result Parallelism is depicted in the following figure:

[Figure: Processor 1 computes Factorial(data[0]), Processor 2 computes Factorial(data[1]), Processor 3 computes Factorial(data[2]), ..., Processor 8 computes Factorial(data[7]).]

All processors’ results are independent of each other.

We are concerned with the result calculated by each stand-alone processor.

Note that there is no data sharing between processors.

Conceptually speaking, all processors can start and finish at the same time.

RESULT PARALLELISM (4) – SEQUENTIAL DEPENDENCY EXAMPLE 2


Recalculating the formulae in a spreadsheet is another example of Result Parallelism.

Conceptually, each cell has its own processor that computes the value of the

cell’s formula.

However, if the formula for cell B1 uses the value of cell A1, then B1 must wait

until A1 finishes: This is known as Result Parallelism with Sequential

Dependency.

The figure in the next slide depicts this concept.
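A minimal sketch of this dependency (illustrative Python; the formulas A1 = 2 + 3 and B1 = A1 * 10 are hypothetical): the worker for B1 blocks on A1's future, so B1 cannot finish before A1.

```python
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as pool:
    a1 = pool.submit(lambda: 2 + 3)             # cell A1: independent
    b1 = pool.submit(lambda: a1.result() * 10)  # cell B1: waits for A1's value
    b1_value = b1.result()                      # b1_value is 50
```

Both cells have their own worker, but the call to a1.result() inside B1's formula enforces the sequential dependency.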


RESULT PARALLELISM (5) – SEQUENTIAL DEPENDENCY EXAMPLE 2 FIGURE

[Figure: at time t1, Processors 1 through 4 compute Results 1 through 4; at time t2, Processors 5 through 8 compute Results 5 through 8, whose formulas depend on the earlier results.]


AGENDA PARALLELISM (1) – CONCEPT


Agenda Parallelism pattern has the following criteria:

There is a collection of multiple tasks

We are interested in one result only, or a small number of results

Each processor is able to carry out the complete computations to produce one

result for the assigned task

The conceptual parallel program design is as follows:

Processor 1: Perform task 1

Processor 2: Perform task 2

….

Processor N: Perform task N


AGENDA PARALLELISM (2) – FIGURE


Agenda Parallelism is depicted in the following figure:

[Figure: Processor 1 performs Task 1, Processor 2 performs Task 2, Processor 3 performs Task 3, ..., Processor 8 performs Task 8.]

AGENDA PARALLELISM (3) – SEQUENTIAL DEPENDENCY EXAMPLE 3


Consider the following problem for an array of numbers data[4]:

Phase 1: Get the factorial of each number in the array data

Phase 2: Get the Fibonacci of each factorial

Phase 3: Classify into three categories:

Numbers that are less than threshold1

Numbers between threshold1 and threshold2

Numbers that are greater than threshold2

The following code segment illustrates the problem:

//calculate Factorial
for (i=0; i < N; i++) factorial[i] = Facto(data[i]); //Facto is a method

//calculate Fibonacci
for (i=0; i < N; i++) fibonacci[i] = Fibo(factorial[i]); //Fibo is a method

//classify according to thresholds
x = 0; y = 0; z = 0;
for (i=0; i < N; i++)
    if (fibonacci[i] < threshold1) {class1[x] = fibonacci[i]; x++;}
    else if (fibonacci[i] > threshold2) {class3[z] = fibonacci[i]; z++;}
    else {class2[y] = fibonacci[i]; y++;}
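An agenda-parallel sketch of the same three phases (illustrative Python; Facto and Fibo are small stand-ins for the methods above, and data and the thresholds are hypothetical values): tasks within a phase run in parallel, and each phase starts only after the previous one completes.

```python
from concurrent.futures import ThreadPoolExecutor
from math import factorial as Facto   # stand-in for the Facto method

def Fibo(n):
    # Stand-in for the Fibo method: returns the n-th Fibonacci number.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

data = [1, 2, 3, 4]               # hypothetical data[4]
threshold1, threshold2 = 2, 100   # hypothetical thresholds

with ThreadPoolExecutor() as pool:
    # Phase 1: one task per factorial.
    facto = list(pool.map(Facto, data))
    # Phase 2: one task per Fibonacci; starts only after phase 1 is done.
    fibo = list(pool.map(Fibo, facto))
    # Phase 3: three classification tasks, run in parallel.
    f1 = pool.submit(lambda: [v for v in fibo if v < threshold1])
    f3 = pool.submit(lambda: [v for v in fibo if v > threshold2])
    f2 = pool.submit(lambda: [v for v in fibo if threshold1 <= v <= threshold2])
    class1, class2, class3 = f1.result(), f2.result(), f3.result()
```

The phase boundaries mirror the figure on the next slide: the Fibonacci tasks consume the factorial results, and the classification tasks consume the Fibonacci results.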


AGENDA PARALLELISM (4) – FIGURE EXAMPLE 3


[Figure: Phase 1: Processors 1 through 4 compute Factorial(data[0]) through Factorial(data[3]). Phase 2: Processors 5 through 8 compute Fibonacci(facto[0]) through Fibonacci(facto[3]). Phase 3: Processor 9 collects numbers less than threshold1, Processor 10 collects numbers between threshold1 and threshold2, and Processor 11 collects numbers greater than threshold2.]


AGENDA PARALLELISM (5) – REDUCTION


When the output of an agenda parallel program is a summary of the individual tasks’

results, the program is following the so-called reduction pattern.

Consider the example of finding the product of factorials of a set of numbers stored in

an array data of size N:

Task 1: determine the factorial of data[0]

Task 2: determine the factorial of data[1]

Task N: determine the factorial of data[N-1]

Task N+1: find the product of all factorials

The figure in the next slide depicts such pattern.
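The reduction can be sketched as follows (illustrative Python; the array contents are hypothetical): tasks 1 through N run in parallel, and task N+1 combines their results into a single product.

```python
from concurrent.futures import ThreadPoolExecutor
from math import factorial, prod

data = [2, 3, 4, 5]   # hypothetical array of size N

with ThreadPoolExecutor() as pool:
    # Tasks 1..N: independent factorials, computed in parallel.
    factorials = list(pool.map(factorial, data))
    # Task N+1: reduce the N results into one summary value.
    product = pool.submit(prod, factorials).result()

# product == 2! * 3! * 4! * 5! == 2 * 6 * 24 * 120 == 34560
```

Only the single summary value is of interest, which is what distinguishes the reduction pattern from result parallelism.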


AGENDA PARALLELISM (6) – REDUCTION EXAMPLE 4


[Figure: Processors 1 through 4 compute Factorial(data[0]) through Factorial(data[3]); Processor 5 then computes the product of the factorials.]


SPECIALIST PARALLELISM (1) – CONCEPT


Specialist Parallelism pattern has the following criteria:

There is a group of tasks that must be performed to solve the problem on a

series of (items) data

Each processor performs only one task on a series of data

The conceptual parallel program design is as follows:

Processor 1: For each item

Perform task 1 on the item

Processor 2: For each item

Perform task 2 on the item

….

Processor N: For each item

Perform task N on the item

The figure in the next slide depicts the Specialist Pattern.


SPECIALIST PARALLELISM (2) – FIGURE


Specialist Parallelism is depicted in the following figure:

[Figure: Processor 1 performs Task 1 on Items 1 through 5; Processor 2 performs Task 2 on Items 1 through 5; Processor 3 performs Task 3 on Items 1 through 5.]


SPECIALIST PARALLELISM (3) – EXAMPLE 5


Given an array data[8], we need to:

Count the number of positive elements (Processor 1)

Count the number of negative elements (Processor 2)

Count the number of zeroes (Processor 3)

A code segment of the sequential version of the above problem is shown below:

for (i=0; i < N; i++)
    if (data[i] > 0) positive++;
    else if (data[i] < 0) negative++;
    else zero++;

A code segment of the parallel version of the above problem is shown below:

for (i=0; i < N; i++) if (data[i] > 0) positive++; //Processor 1

for (i=0; i < N; i++) if (data[i] < 0) negative++; //Processor 2

for (i=0; i < N; i++) if (data[i] == 0) zero++; //Processor 3
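The parallel version can be sketched with one specialist worker per task (illustrative Python; the contents of data are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

data = [3, -1, 0, 7, 0, -5, 2, 0]   # hypothetical data[8]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Each worker performs its single task over the entire series of data.
    p = pool.submit(lambda: sum(1 for v in data if v > 0))   # Processor 1
    n = pool.submit(lambda: sum(1 for v in data if v < 0))   # Processor 2
    z = pool.submit(lambda: sum(1 for v in data if v == 0))  # Processor 3
    positive, negative, zero = p.result(), n.result(), z.result()
```

Each worker specializes in one task but scans every item, which is the defining property of the specialist pattern.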

The figure in the next slide illustrates Example 5.

SPECIALIST PARALLELISM (4) – FIGURE EXAMPLE 5


[Figure: Processors 1, 2, and 3 each scan data[0] through data[7]; Processor 1 counts positive numbers, Processor 2 counts negative numbers, and Processor 3 counts zeroes.]


SPECIALIST PARALLELISM (5) – PIPELINE


When there are sequential dependencies between the tasks in a specialist parallel

problem, the program follows a pipelined pattern.

The output of one processor becomes the input for the next processor.

All processors work in parallel, each taking its input from the preceding processor’s

previous output.

Consider the following example in an image processing application:

Processor 1: Calculate all pixels of a frame

Processor 2: Render the frame

Processor 3: Compress the frame

Processor 4: Store the frame
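The four-stage pipeline can be sketched with one thread per processor and a queue between consecutive stages (illustrative Python; calculate, render, compress, and store are hypothetical stand-ins for the real image operations):

```python
import queue
import threading

stored = []   # filled by the final stage

def calculate(frame): return frame * 10   # stand-in for Processor 1's task
def render(frame):    return frame + 1    # stand-in for Processor 2's task
def compress(frame):  return frame * 2    # stand-in for Processor 3's task
def store(frame):     stored.append(frame)  # stand-in for Processor 4's task

def stage(fn, inq, outq):
    # Each processor loops: take an item from the preceding stage,
    # perform its one task, and pass the result downstream.
    while True:
        item = inq.get()
        if item is None:              # end-of-stream marker
            if outq is not None:
                outq.put(None)
            return
        result = fn(item)
        if outq is not None:
            outq.put(result)

q1, q2, q3, q4 = queue.Queue(), queue.Queue(), queue.Queue(), queue.Queue()
threads = [threading.Thread(target=stage, args=s)
           for s in [(calculate, q1, q2), (render, q2, q3),
                     (compress, q3, q4), (store, q4, None)]]
for t in threads:
    t.start()
for frame in [1, 2, 3, 4, 5]:   # Frames 1 through 5 enter the pipeline
    q1.put(frame)
q1.put(None)
for t in threads:
    t.join()
# stored now holds the five processed frames, in order
```

All four stage threads run concurrently: while Processor 2 renders Frame 1, Processor 1 is already calculating Frame 2, exactly as in the timing figures that follow.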

SPECIALIST PARALLELISM (6) – FIGURE EXAMPLE 6


[Figure: four panels, one per processor (Processor 1: calculate, Processor 2: render, Processor 3: compress, Processor 4: store); each panel shows Frames 1 through 5 processed at that processor's local times 1 through 5.]

Note that the time is relative to each processor.

The next figure depicts the example with respect to absolute time.

SPECIALIST PARALLELISM (7) – PIPELINE EXAMPLE 6: ABSOLUTE TIME


[Figure: timing diagram, frames versus time in cycles (1 through 12). Each frame passes through CA (calculate, P1), RE (render, P2), CO (compress, P3), and ST (store, P4) in consecutive cycles, and successive frames enter the pipeline one cycle apart, so the four processors overlap their work on different frames.]


NOTES


A sequential program may have to be completely rewritten to adapt it to a parallel pattern (see Example 5).

The difference between the parallelism patterns can be summarized as follows:

Result Parallelism: We are concerned with the result of each processor.

Agenda Parallelism: We are concerned only with a combination of the results (sequential dependency) or a summary of the individual results (reduction).

Specialist Parallelism: We focus on the processors, each of which specializes in one task that it applies to every data item.
