PART 4: PARALLEL PATTERNS
WEEK 12:
Design of a Parallel Program
* Flynn’s Taxonomy
* Levels of Parallelism
* Principal Parallel Patterns
* Result Parallelism
* Agenda Parallelism
* Specialist Parallelism
CSC526: Parallel Processing
Fall 2016
Dr. Soha S. Zaghloul
FLYNN'S TAXONOMY
Flynn categorized computer architectures into four main classes according to the number of instruction and data streams. These are:
SISD: Single Instruction, Single Datum
SIMD: Single Instruction, Multiple Data
MISD: Multiple Instructions, Single Datum
MIMD: Multiple Instructions, Multiple Data
FLYNN'S TAXONOMY – SISD
One stream of instructions processes a single stream of data.
This architecture is shown in the figure below:
[Figure: a control unit issues the instruction stream to a single processor, which reads the input data and produces the output data.]
Obviously, this is the common model of single-processor computers.
FLYNN'S TAXONOMY – SIMD
A single instruction stream is broadcast to multiple processors, each with its own data
stream.
This architecture is shown in the figure below:
[Figure: one control unit broadcasts the instruction stream to four processors, each with its own input data and output data.]
Array (vector) processors follow this model; GPUs are a modern example.
FLYNN'S TAXONOMY – MISD
No well-known system fits this designation. It is mentioned only for the sake of
completeness.
FLYNN'S TAXONOMY – MIMD
Each processing element has its own stream of instructions operating on its own
data.
This architecture is shown in the figure below:
[Figure: four processing elements, each with its own control unit, instruction stream, input data, and output data, connected through an interconnection network.]
Obviously, this is the MPP architecture.
GRANULARITY
Granularity or grain size is a measure of the amount of computation involved in a software process.
In other words, the granularity defines the parallelism level of a process.
Three main grain sizes are identified:
Fine grain
Medium grain
Coarse grain
In general, the execution of a program may involve a combination of these levels.
The actual combination depends on many factors such as:
Algorithm
Language
Compiler support
Hardware limitations
PARALLELISM LEVELS
According to the grain size, five levels of parallelism are identified:
Instruction Level
Loop Level
Procedure Level
Subprogram Level
Job (Program) Level
The figure in the next slide shows the correspondence of parallelism levels to grain
sizes.
PARALLELISM LEVELS TO GRAIN SIZE
[Figure: the five parallelism levels arranged against grain size, with the degree of parallelism, the communication frequency, and the scheduling overhead as vertical axes.]
Level 5: Jobs/Programs (coarse grain; tens of thousands of instructions)
Level 4: Subprograms (coarse or medium grain; thousands of instructions)
Level 3: Procedures (medium grain; less than 2000 instructions)
Level 2: Loops (fine grain; less than 500 instructions)
Level 1: Instructions (fine grain; from 2 to thousands of instructions)
The fine-grain levels are supported by SMP; the medium- and coarse-grain levels are supported by MPP.
Moving from the coarse-grain levels down to the fine-grain ones, the degree of parallelism, the communication frequency, and the scheduling overhead all increase.
GRANULARITY - EXAMPLE
Consider the problem of calculating all the pixels in all the frames of a computer-animated film. This may be solved in one of two ways:
Assign a distinct processor to calculate each pixel: each result requires a small amount of computation. This is fine-grained parallelism.
Assign a distinct processor to render each entire frame: each result requires a large amount of computation. This is coarse-grained parallelism.
PARALLELISM PATTERNS
Three principal patterns for designing parallel programs are identified. These are:
Result Parallelism
Agenda Parallelism
Specialist Parallelism
Using the above patterns, the steps for designing a parallel program are:
Identify the pattern that best matches the problem
Take the pattern’s suggested design as a starting point
Implement the pattern using appropriate constructs in a parallel programming
language
RESULT PARALLELISM (1) – CONCEPT
Result Parallelism pattern has the following criteria:
There is a collection of multiple results
The individual results are all computed in parallel, each by its own processor
Each processor is able to carry out the complete computation to produce one
result
The conceptual parallel program design is as follows:
Processor 1: Compute Result 1
Processor 2: Compute Result 2
….
Processor N: Compute Result N
RESULT PARALLELISM (2) – EXAMPLE 1
Consider the problem of calculating the factorials of a set of numbers stored in an
array data of size N:
Processor 1 is assigned to compute the factorial of data[0]
Processor 2 is assigned to compute the factorial of data[1]
Processor N is assigned to compute the factorial of data[N-1]
The figure in the next slide illustrates the result pattern:
RESULT PARALLELISM (3) – FIGURE EXAMPLE 1
Result Parallelism is depicted in the following figure:
[Figure: Processors 1, 2, 3, …, 8 compute Factorial(data[0]), Factorial(data[1]), Factorial(data[2]), …, Factorial(data[7]) in parallel.]
All processors’ results are independent of each other.
We are concerned with the result calculated by each stand-alone processor.
Note that there is no data sharing between processors.
Conceptually speaking, all processors can start and finish at the same time.
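The pattern above can be sketched in Python. This is a minimal illustration, not part of the original slides: the array contents are hypothetical, and worker threads from `concurrent.futures` stand in for the conceptual one-processor-per-result.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input array; any list of non-negative integers works.
data = [3, 5, 7, 10, 1, 4, 6, 2]

# Result parallelism: one worker per result, each carrying out the
# complete factorial computation independently; no data is shared.
with ThreadPoolExecutor(max_workers=len(data)) as pool:
    results = list(pool.map(math.factorial, data))

print(results)  # one factorial per array element, in order
```

Because the results are independent, `map` can hand each element to its own worker, and the order of the output still matches the order of `data`.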
RESULT PARALLELISM (4) – SEQUENTIAL DEPENDENCY EXAMPLE 2
Recalculating the formulae in a spreadsheet is another example of Result Parallelism.
Conceptually, each cell has its own processor that computes the value of the
cell’s formula.
However, if the formula for cell B1 uses the value of cell A1, then B1 must wait
until A1 finishes: This is known as Result Parallelism with Sequential
Dependency.
The figure in the next slide depicts this concept.
RESULT PARALLELISM (5) – SEQUENTIAL DEPENDENCY EXAMPLE 2 FIGURE
[Figure: at time t1, Processors 1 to 4 compute Results 1 to 4 in parallel; at time t2, Processors 5 to 8 compute Results 5 to 8, which depend on the earlier results.]
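The spreadsheet dependency can be sketched with futures. This is a hypothetical two-cell example (the cell names A1 and B1 and the formulas are illustrative, not from the slides): B1's task blocks on A1's result before computing.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical spreadsheet: A1 holds a literal, B1's formula uses A1.
with ThreadPoolExecutor(max_workers=2) as pool:
    a1 = pool.submit(lambda: 21)               # cell A1: independent formula
    b1 = pool.submit(lambda: a1.result() * 2)  # cell B1: waits until A1 finishes

print(b1.result())  # 42
```

Both cells are submitted at once, but `a1.result()` inside B1's formula enforces the sequential dependency: B1's worker cannot finish before A1's does.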
AGENDA PARALLELISM (1) – CONCEPT
Agenda Parallelism pattern has the following criteria:
There is a collection of multiple tasks
We are interested in one result only, or a small number of results
Each processor is able to carry out the complete computations to produce one
result for the assigned task
The conceptual parallel program design is as follows:
Processor 1: Perform task 1
Processor 2: Perform task 2
….
Processor N: Perform task N
AGENDA PARALLELISM (2) – FIGURE
Agenda Parallelism is depicted in the following figure:
[Figure: Processors 1, 2, 3, …, 8 perform Tasks 1, 2, 3, …, 8 in parallel.]
AGENDA PARALLELISM (3) – SEQUENTIAL DEPENDENCY EXAMPLE 3
Consider the following problem for an array of numbers data[4]:
Phase 1: Get the factorial of each number in the array data
Phase 2: Get the Fibonacci of each factorial
Phase 3: Classify the results into three categories:
Numbers that are less than threshold1
Numbers between threshold1 and threshold2
Numbers that are greater than threshold2
The following code segment illustrates the problem:
//calculate factorials
for (i = 0; i < N; i++) factorial[i] = Facto(data[i]); //Facto is a method
//calculate Fibonacci numbers
for (i = 0; i < N; i++) fibonacci[i] = Fibo(factorial[i]); //Fibo is a method
//classify according to thresholds
x = 0; y = 0; z = 0;
for (i = 0; i < N; i++)
  if (fibonacci[i] < threshold1) {class1[x] = fibonacci[i]; x++;}
  else if (fibonacci[i] > threshold2) {class3[z] = fibonacci[i]; z++;}
  else {class2[y] = fibonacci[i]; y++;}
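A runnable sketch of the three phases is given below. The sample data and thresholds are hypothetical, and the slides' `Facto` and `Fibo` methods are stood in for by `math.factorial` and a small iterative `fibo`; each phase must finish completely before the next begins.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def fibo(n):
    """Iterative Fibonacci; fibo(0) = 0, fibo(1) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

data = [1, 2, 3, 4]             # hypothetical data[4]
threshold1, threshold2 = 2, 10  # hypothetical thresholds

with ThreadPoolExecutor() as pool:
    # Phase 1: factorials in parallel. Consuming the map acts as a
    # barrier, since Phase 2 needs all of Phase 1's results.
    factorial = list(pool.map(math.factorial, data))
    # Phase 2: Fibonacci of each factorial, in parallel.
    fibonacci = list(pool.map(fibo, factorial))

# Phase 3: classify against the thresholds.
class1 = [v for v in fibonacci if v < threshold1]
class3 = [v for v in fibonacci if v > threshold2]
class2 = [v for v in fibonacci if threshold1 <= v <= threshold2]
```

The sequential dependency between phases is what makes this agenda parallelism rather than pure result parallelism: parallelism exists only within a phase.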
AGENDA PARALLELISM (4) – FIGURE EXAMPLE 3
[Figure: Phase 1: Processors 1 to 4 compute Factorial(data[0]) to Factorial(data[3]). Phase 2: Processors 5 to 8 compute Fibonacci(facto[0]) to Fibonacci(facto[3]). Phase 3: Processors 9, 10, and 11 collect the numbers less than threshold1, between threshold1 and threshold2, and greater than threshold2, respectively.]
AGENDA PARALLELISM (5) – REDUCTION
When the output of an agenda parallel program is a summary of the individual tasks’
results, the program is following the so-called reduction pattern.
Consider the example of finding the product of factorials of a set of numbers stored in
an array data of size N:
Task 1: determine the factorial of data[0]
Task 2: determine the factorial of data[1]
Task N: determine the factorial of data[N-1]
Task N+1: find the product of all factorials
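The reduction above can be sketched as follows (the array contents are hypothetical): tasks 1 to N run in parallel, and task N+1 folds their results into a single product.

```python
import math
from functools import reduce
from operator import mul
from concurrent.futures import ThreadPoolExecutor

data = [2, 3, 4, 5]  # hypothetical input array

with ThreadPoolExecutor() as pool:
    # Tasks 1..N: the factorials, computed in parallel.
    factorials = list(pool.map(math.factorial, data))

# Task N+1: the reduction step combines all partial results into one.
product = reduce(mul, factorials, 1)
print(product)
```

Only the single summary value `product` is of interest here, which is the defining trait of the reduction pattern.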
The figure in the next slide depicts this pattern.
AGENDA PARALLELISM (6) – REDUCTION EXAMPLE 4
[Figure: Processors 1 to 4 compute Factorial(data[0]) to Factorial(data[3]) in parallel; Processor 5 then computes the product of the factorials.]
SPECIALIST PARALLELISM (1) – CONCEPT
Specialist Parallelism pattern has the following criteria:
There is a group of tasks that must be performed to solve the problem on a series of data items
Each processor performs only one task, applied to the whole series of data
The conceptual parallel program design is as follows:
Processor 1: For each item
Perform task 1 on the item
Processor 2: For each item
Perform task 2 on the item
….
Processor N: For each item
Perform task N on the item
The figure in the next slide depicts the Specialist Pattern.
SPECIALIST PARALLELISM (2) – FIGURE
Specialist Parallelism is depicted in the following figure:
[Figure: Processor 1 applies Task 1 to Items 1 to 5; Processor 2 applies Task 2 to Items 1 to 5; Processor 3 applies Task 3 to Items 1 to 5.]
SPECIALIST PARALLELISM (3) – EXAMPLE 5
Given an array data[8], we need to:
Count the number of positive elements (Processor 1)
Count the number of negative elements (Processor 2)
Count the number of zeroes (Processor 3)
A code segment of the sequential version of the above problem is shown below:
for (i = 0; i < N; i++)
  if (data[i] > 0) positive++;
  else if (data[i] < 0) negative++;
  else zero++;
A code segment of the parallel version of the above problem is shown below:
for (i = 0; i < N; i++) if (data[i] > 0) positive++; //Processor 1
for (i = 0; i < N; i++) if (data[i] < 0) negative++; //Processor 2
for (i = 0; i < N; i++) if (data[i] == 0) zero++; //Processor 3
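The three specialist loops can be sketched as a runnable example (the sample data is hypothetical, and threads stand in for the three processors): each worker applies its single task to the entire array.

```python
from concurrent.futures import ThreadPoolExecutor

data = [3, -1, 0, 7, -5, 0, 2, -8]  # hypothetical data[8]

# Specialist parallelism: each worker is a specialist that applies its
# one task (one kind of count) to the whole series of data.
def count_positive(a): return sum(1 for v in a if v > 0)
def count_negative(a): return sum(1 for v in a if v < 0)
def count_zero(a):     return sum(1 for v in a if v == 0)

with ThreadPoolExecutor(max_workers=3) as pool:
    positive = pool.submit(count_positive, data)  # Processor 1
    negative = pool.submit(count_negative, data)  # Processor 2
    zero = pool.submit(count_zero, data)          # Processor 3

print(positive.result(), negative.result(), zero.result())
```

Note how the sequential loop had to be split by task, not by data, to fit the specialist pattern.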
The figure in the next slide illustrates Example 5.
SPECIALIST PARALLELISM (4) – FIGURE EXAMPLE 5
[Figure: each of Processors 1, 2, and 3 scans the whole array data[0] to data[7]; Processor 1 counts the positive numbers, Processor 2 counts the negative numbers, and Processor 3 counts the zeroes.]
SPECIALIST PARALLELISM (5) – PIPELINE
When there are sequential dependencies between the tasks in a specialist parallel
problem, the program follows a pipelined pattern.
The output of one processor becomes the input for the next processor.
All processors work in parallel, each taking its input from the preceding processor’s
previous output.
Consider the following example in an image processing application:
Calculate all pixels of a frame (Processor 1)
Render the frame (Processor 2)
Compress the frame (Processor 3)
Store the frame (Processor 4)
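The four-stage pipeline can be sketched with threads connected by queues. This is a minimal illustration: the frame names are hypothetical, and each stage's "work" is a stand-in that just tags the frame rather than doing real image processing.

```python
import queue
import threading

SENTINEL = None  # marks the end of the frame stream

def stage(inbox, outbox, tag):
    """One pipeline specialist: repeatedly take a frame from the
    preceding stage's queue, apply this stage's task, and pass it on."""
    while True:
        frame = inbox.get()
        if frame is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown downstream
            return
        outbox.put(frame + " -> " + tag)

# Queues connect the stages: calculate -> render -> compress -> store.
q1, q2, q3, q4 = (queue.Queue() for _ in range(4))
workers = [
    threading.Thread(target=stage, args=(q1, q2, "calculated")),  # Processor 1
    threading.Thread(target=stage, args=(q2, q3, "rendered")),    # Processor 2
    threading.Thread(target=stage, args=(q3, q4, "compressed")),  # Processor 3
]
for t in workers:
    t.start()

for i in range(1, 6):
    q1.put("Frame %d" % i)
q1.put(SENTINEL)

# Processor 4: store the frames as they emerge from the pipeline.
stored = []
while True:
    frame = q4.get()
    if frame is SENTINEL:
        break
    stored.append(frame)
for t in workers:
    t.join()
print(stored[0])
```

All stages run concurrently, each consuming the previous stage's output, which is exactly the pipelined specialist pattern described above.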
SPECIALIST PARALLELISM (6) – FIGURE EXAMPLE 6
[Figure: four timelines, one per processor. Processor 1 calculates Frames 1 to 5, Processor 2 renders them, Processor 3 compresses them, and Processor 4 stores them; each timeline numbers the frames 1 to 5 in that processor's own time.]
Note that the time is relative to each processor.
The next figure depicts the example with respect to absolute time.
SPECIALIST PARALLELISM (6) – PIPELINE EXAMPLE 6: ABSOLUTE TIME
Cycle:    1   2   3   4   5   6   7   8
Frame 1:  CA  RE  CO  ST
Frame 2:      CA  RE  CO  ST
Frame 3:          CA  RE  CO  ST
Frame 4:              CA  RE  CO  ST
Frame 5:                  CA  RE  CO  ST
(CA = calculate on P1, RE = render on P2, CO = compress on P3, ST = store on P4)
Time in cycles: with four pipeline stages and five frames, all frames are finished after 8 cycles, compared with the 20 cycles (5 frames × 4 stages) a single processor would need.
NOTES
A sequential program may be completely rewritten to adapt it to a parallel pattern (see Example 5).
The difference between the parallelism patterns can be summarized as follows:
Result Parallelism: we are concerned with the result computed by each individual processor
Agenda Parallelism: we are concerned only with a combination of the results (sequential dependency) or with a summary of the individual results (reduction)
Specialist Parallelism: focuses on the tasks; each processor specializes in one task, and the tasks execute in parallel