TRANSCRIPT
PARALLEL PROCESSING
CS802
Part I
Instructor: Dr. C. N. Zhang
Department of Computer Science
University of Regina
Regina, Saskatchewan
Canada, S4S 0A2
Table of Contents
1. Introduction to Parallel Processing
2. Pipeline
2.1. Design of pipeline
2.2. RISC instruction pipeline
2.3. Mapping nested loop algorithms into systolic arrays
3. Parallelism Analysis
3.1. Data dependency analysis for non loops
3.2. Data dependency analysis for loops
4. Parallel Computer Architectures
4.1. Review of computer architecture
4.2. Computer classification
4.3. SIMD Computers
4.4. MIMD Computers
5. Parallel Programming
5.1. Parallelism for single loop algorithms
5.2. Parallelism for nested loop algorithms
5.3. Introduction to MultiPascal (Part II)
6. Examples of Parallel Processing
6.1. Intel MMX technology (Part III)
6.2. Cluster computing (Part IV)
6.3. Task scheduling in message passing MIMD
6.4. Associative processing
6.5. Data flow computing
1. Introduction to Parallel Processing
Parallel Processing: two or more tasks being performed (executed) simultaneously.

Basic techniques for parallel processing:
a. pipeline
b. replicated units
c. pipeline plus replicated units

Parallel processing can be implemented at the following levels:
a. at the processor level
b. at the instruction execution level
c. at the arithmetic and logic operation level
d. at the logic circuit level
Speed-up of the parallel processing:

    S = Ts / Tp

where Ts is the processing time in a sequential (non-parallel) manner and Tp is the processing time required by the parallel processing technique.

Efficiency of the parallel processing:

    E = S / n

where n is the number of units used to perform the parallel processing.
Note: 0 < S ≤ n and 0 < E ≤ 1
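As a quick numeric illustration (a sketch with made-up numbers, not from the notes), both measures can be computed directly:

```python
def speedup(ts, tp):
    # S = Ts / Tp: sequential time over parallel time
    return ts / tp

def efficiency(ts, tp, n):
    # E = S / n: speed-up per unit of hardware, 0 < E <= 1
    return speedup(ts, tp) / n

# hypothetical job: 120 time units sequentially, 20 units on 8 hardware units
print(speedup(120, 20))        # 6.0
print(efficiency(120, 20, 8))  # 0.75
```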
How to improve performance of computer systems?
a. by electronic technology, e.g., new VLSI development
b. by software development, e.g., better OS and compiler
c. by architecture enhancements

Architecture enhancements:
a. pipeline
b. parallelism
c. memory hierarchy
d. multiple processor computer systems
e. hardware support for OS and compiler
f. special purpose architectures

Example: compute the sums Yj = a1j + a2j + ... + anj for j = 1, 2, ..., m, where the aij are integers, i = 1, 2, ..., n, j = 1, 2, ..., m. Assume that the time delay of the adder is t.

Sequential processing: use one adder. Ts = (n-1) m t
Parallel Processing:
a. use two adders: Tp = (n-1) m t / 2, so S = 2 and E = 1
b. use m adders (one per column): Tp = (n-1) t, so S = m and E = 1
c. use n-1 adders (n = 2^k) as a pipelined adder tree, e.g., n = 8: Tp = (log n + m - 1) t
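The three schemes can be compared numerically; the sketch below (an illustration with t = 1) evaluates the formulas above for n = 8, m = 10:

```python
import math

def ts_one_adder(n, m, t=1):
    # sequential: (n - 1) additions per column, m columns
    return (n - 1) * m * t

def tp_two_adders(n, m, t=1):
    return (n - 1) * m * t / 2       # S = 2, E = 1

def tp_m_adders(n, m, t=1):
    return (n - 1) * t               # S = m, E = 1

def tp_adder_tree(n, m, t=1):
    # pipelined tree of n - 1 adders, n = 2^k
    return (int(math.log2(n)) + m - 1) * t

n, m = 8, 10
print(ts_one_adder(n, m))        # 70
print(tp_adder_tree(n, m))       # 12
```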
Input schedule for the tree (n = 8): the columns a1j a2j ... a8j, for j = 1, ..., m, enter the adder tree in successive clock periods.

Figure 1.1
2. Pipeline
2.1 Design of Pipeline
a. The task is able to be divided into a number of subtasks in sequence, and hopefully all subtasks require the same amount of time.
b. Design an independent hardware unit for each subtask.
c. Connect all units in order.
d. Arrange the input data and collect the output data.

First pipeline design (for Yj = a1j + a2j + a3j + a4j):
a. subtasks:
Subtask1: S1 = a1j + a2j
Subtask2: S2 = S1 + a3j
Subtask3: Yj = S2 + a4j
b. unit for subtask 1: an adder computing S1 from a1j and a2j
   unit for subtask 2: an adder computing S2 from S1 and a3j
   unit for subtask 3: an adder computing Yj from S2 and a4j
c. connect the three adders in a chain: (a1j, a2j) → S1, (S1, a3j) → S2, (S2, a4j) → Yj
d. input/output schedule (n = m = 4): the columns a1j, a2j, a3j, a4j are skewed with leading zeros so that each adder receives its operands in successive clock periods; the results Y1, Y2, Y3, Y4 emerge one per clock
Tp = ( n + m - 2 ) t
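A small numeric check of this result (a sketch; column j is taken to enter the first adder at clock j, with one clock per stage):

```python
def chain_pipeline_clocks(n, m):
    # chain of n - 1 adders; column j enters stage 1 at clock j and
    # leaves the last stage n - 2 clocks later
    finish = [j + (n - 2) for j in range(1, m + 1)]
    return max(finish)          # = n + m - 2

print(chain_pipeline_clocks(4, 4))   # 6
print(chain_pipeline_clocks(4, 10))  # 12
```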
Second pipeline design:
Subtask1: S1 = a1j + a2j and S2 = a3j + a4j
Subtask2: Yj = S1 + S2

b. unit for subtask 1: two adders in parallel compute S1 = a1j + a2j and S2 = a3j + a4j
   unit for subtask 2: one adder computes Yj = S1 + S2
c. connect the two stages: (a1j, a2j, a3j, a4j) → (S1, S2) → Yj
d. input/output schedule (n = m = 4): one full column a1j a2j a3j a4j enters per clock (a11 a21 a31 a41, then a12 a22 a32 a42, ...); results Y1, Y2, Y3, Y4 emerge one per clock
Tp = ( m -1 + log n ) t
2.2 RISC instruction pipeline
2.2.1 Features of RISC
• Register-oriented organization
  a) registers are the fastest available storage devices
  b) save memory access time for most data passing operations
  c) VLSI makes it possible to build a large number of registers on a single chip
• Few instructions:
  a) there are two groups of instructions: register to register, and memory to register or register to memory
  b) complex instructions are done by software
  c) fixed size instruction format
• Simple addressing modes
  a) simplify memory address calculation
  b) make it easy to pipeline
• Use of the register window technique
  a) optimizing compilers with fewer instructions
  b) based on a large register file
• CPU on a single VLSI chip
  a) avoids the long time delay of off-chip signals
  b) regular ALU and simple control unit
• Pipelined instruction execution
  a) two-, three-, or four-stage pipelines
  b) delayed branch and data forwarding
2.2.2 Two types of instruction in RISC
1) Register-to-register, each of which can be done by the following two phases:
   I: instruction fetch
   E: instruction execution
2) Memory-to-register (load) and register-to-memory (store), each of which can be done by the following three phases:
   I: instruction fetch
   E: memory address calculation
   D: memory operation

• Pipeline: a task or operation is divided into a number of subtasks that need to be performed in sequence, each of which requires the same amount of time. Speed-up = Ts / Tp, where Ts is the total time required for a job performed without pipelining and Tp is the time required with pipelining.

• Pipeline conflict: two subtasks on two pipeline units cannot be computed simultaneously; one task must wait until the other is completed.

• To resolve pipeline conflicts in the instruction pipeline the following techniques are used:
  a. insert a no-operation stage (N), used by the two-way pipeline
  b. insert a no-operation instruction (NOOP), used by the three-way and four-way pipelines

• Two-way pipeline: there is only one memory in the computer, which stores both instructions and data. Phases I and E can be overlapped. Example:
Load A ← M
Load B ← M
Add C ← A + B
Store M ← C
Branch X
X: Add B ← C + A
Halt
Figure 2.1 Sequential execution
Load A ← M
Load B ← M
Add C ← A + B
Store M ← C
Branch X
X: Add B ← C + A
Halt
Figure 2.2 Two-way pipelining
Speed-up = 17 / 11 = 1.54, where N represents the no-operation (waiting) stage that is inserted.

• Three-way pipeline: there are two memory units; one stores instructions, the other stores data. Phases I, E, and D can be overlapped. The NOOP instruction (two-stage) is used when a conflict would otherwise occur. Example:

Load A ← M
Load B ← M
NOOP
Add C ← A + B
Store M ← C
Branch X
NOOP
X: Add B ← C + A
Halt
Figure 2.3 Three-way pipelining
Speed-up = 17 / 10 = 1.7, where NOOP represents the no-operation instruction that is inserted.

• Four-way pipeline: same as three-way, but phase E is divided into E1 (register file read) and E2 (ALU operation and register write). Phases I, D, E1, and E2 can be overlapped. Example:

Load A ← M
Load B ← M
NOOP
NOOP
Add C ← A + B
Store M ← C
Branch X
NOOP
NOOP
X: Add B ← C + A
Halt
Figure 2.4 Four-way pipelining
Speed-up = 24 / 13 = 1.85
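The three speed-up figures can be reproduced by counting phases (a sketch; the pipelined cycle counts 11, 10, and 13 are read off Figures 2.2-2.4):

```python
def sequential_cycles(phases_per_instruction):
    # without pipelining, all phases simply execute back to back
    return sum(phases_per_instruction)

# program: Load, Load, Add, Store, Branch, Add, Halt
two_three_way = sequential_cycles([3, 3, 2, 3, 2, 2, 2])   # 17
four_way      = sequential_cycles([4, 4, 3, 4, 3, 3, 3])   # 24 (E split into E1, E2)

print(round(two_three_way / 11, 2))  # two-way
print(round(two_three_way / 10, 2))  # three-way
print(round(four_way / 13, 2))       # four-way
```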
• Optimization of pipelining: the goal is to reduce the number of NOOP instructions by reordering the sequence of the program. Example (three-way pipeline):
Inserted NOOP:
  100 Load A ← M
  101 Add B ← B + 1
  102 Jump 106
  103 NOOP
  106 Store M ← B

Reordered instructions (use of the delayed branch):
  100 Load A ← M
  101 Jump 105
  102 Add B ← B + 1
  105 Store M ← B

Figure 2.5
2.3 Mapping nested loop algorithms into systolic arrays
2.3.1 Systolic Array Processors
• A systolic array system consists of:
  (i) Host computer: controls the whole processing.
  (ii) Control unit: provides the system clock and control signals, inputs data to the systolic array, and collects results from the systolic array. Only boundary PEs are connected to the control unit.
  (iii) Systolic array: a multiple-processor network plus pipelining.
• Possible types of systolic arrays:
  (a) Linear array:
Figure 2.6
(b) Mesh connected 2-D array:
Figure 2.7
(c) 2-D hexagonal array
Figure 2.8

(d) Binary tree
Figure 2.9
(e) Triangular array
Figure 2.10
• Features of systolic arrays:
(i) PEs are simple, usually containing a few registers and simple ALU circuits
(ii) There are only a few types of PEs
(iii) Interconnections are local and regular
(iv) All PEs act rhythmically, controlled by a common clock
(v) The input data are used repeatedly as they pass through the PEs
Conclusion:
- systolic arrays are very suitable for VLSI design
- systolic arrays are recommended for computation-intensive algorithms
Example-1: matrix-vector multiplication: Y = A X, where A is an (n × n) matrix and Y and X are (n × 1) matrices.
Input schedule for the linear array (n = 4): the skewed diagonals of A enter over the time steps t0: a11; t1: a21, a12; t2: a31, a22, a13; t3: a41, a32, a23, a14; t4: a42, a33, a24; t5: a43, a34; t6: a44; with x1, x2, x3, x4 held in the PEs.
Figure 2.11
Example-2: matrix-matrix multiplication: C = A × B, where A, B, and C are (n × n) matrices.
The three data streams are skewed along time fronts so that matching elements meet in the proper PE: the aij enter diagonal by diagonal (a11; a21, a12; a31, a22, a13; ...), the bij likewise (b11; b12, b21; b13, b22, b31; ...), and the results cij flow out in the opposite direction (c11; c12, c21; ...).
A two-dimensional systolic array for matrix multiplication.
Figure 2.12
2.3.2 Mapping Nested Loop Algorithms onto Systolic Arrays
• Some design objective parameters:
  (i) Number of PEs (np): total number of PEs required.
  (ii) Computation time (tc): total number of clocks required for the systolic array to complete a computation.
  (iii) Pipeline period (p): time period of the pipelining.
  (iv) Space-time² (np tc²): used to measure cost and performance, supposing speed is more important than cost.

• In general, a mapping procedure involves the following steps:
  Step 1: algorithm normalization (all variables should carry all indices)
  Step 2: find the data dependence matrix of the algorithm, called D
  Step 3: pick a valid l × l transformation T (supposing the algorithm is an l-nested loop)
  Step 4: PE design and systolic array design
  Step 5: systolic array I/O scheduling
Step 1: Algorithm normalization
Some variables of a given nested algorithm may not carry all indices; these are called broadcasting variables. In order to find the data dependence matrix of the algorithm, it is necessary to eliminate all broadcasting variables by renaming or adding more variables.

Example: matrix-matrix multiplication C = A × B, which can be described by the following nested algorithm:

for i := 1 to n
  for j := 1 to n
    for k := 1 to n
      C(i, j) := C(i, j) + A(i, k) * B(k, j);

(suppose all C(i, j) = 0 before the computation), where A, B, and C are broadcasting variables.
Algorithm normalization:
for i := 1 to n
  for j := 1 to n
    for k := 1 to n
      begin
        A(i, j, k) := A(i, j-1, k);
        B(i, j, k) := B(i-1, j, k);
        C(i, j, k) := C(i, j, k-1) + A(i, j, k) * B(i, j, k);
      end

where A(i, 0, k) = aik, B(0, j, k) = bkj, C(i, j, 0) = 0, and C(i, j, n) = cij.
Step 2: Data dependence matrix
In a normalized algorithm, variables can be classified as:
(i) generated variables: those variables appearing on the left side of the assignment statements
(ii) used variables: those variables appearing on the right side of the assignment statements
Consider a normalized algorithm:
for i1 := lb1 to ub1
  for i2 := lb2 to ub2
    ...
    for il := lbl to ubl
      begin
        S1
        S2
        ...
        Sm
      end

Let A(i1', i2', ..., il') and B(i1'', i2'', ..., il'') be a pair of generated variable and used variable in a statement. The vector d = (i1' - i1'', i2' - i2'', ..., il' - il'') is called a data dependence vector, denoted dj, with respect to the pair A and B.
The data dependence matrix consists of all possible data dependence vectors of an algorithm.
Example: In the above example (matrix-matrix multiplication), the pairs for A, B, and C give

    D = [ dA  dB  dC ] = | 0  1  0 |
                         | 1  0  0 |
                         | 0  0  1 |

Step 3: Find a valid linear transformation matrix T
Suppose l = 3 (i1 = i, i2 = j, i3 = k) and

    T = | Π |
        | S |

where Π is a (1 × 3) matrix and S is a (2 × 3) matrix. T is a valid transformation if it meets the following conditions:
  (a) det(T) ≠ 0
  (b) Π di > 0 for all i

In fact, T can be viewed as a linear mapping which maps an index point (i, j, k) of the index space into (t, x, y):

    (t, x, y)ᵀ = T (i, j, k)ᵀ

Π is called the time transformation, S is called the space transformation. det(T) ≠ 0 implies that it is a one-to-one mapping. A computation originally performed at (i, j, k) is now computed at a PE located at (x, y) at time t, where t and (x, y) are determined as shown above. Note that for a given algorithm there may exist a large number of valid transformations. An important goal of mapping is to choose an "optimal" one, as defined by the user; for example, one reaching the minimum value of tc² np.
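Both validity conditions are easy to check mechanically. A minimal sketch (plain Python; D is the dependence matrix found above, and the particular T shown is an assumption, one valid choice consistent with the PE locations derived in Step 4):

```python
def det3(M):
    # determinant of a 3x3 matrix by cofactor expansion
    a, b, c = M[0]; d, e, f = M[1]; g, h, i = M[2]
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def is_valid_transformation(T, D):
    # (a) det(T) != 0;  (b) Pi . dj > 0 for every column dj of D
    if det3(T) == 0:
        return False
    pi = T[0]                       # first row of T is the time transformation
    cols = len(D[0])
    return all(sum(pi[r] * D[r][j] for r in range(3)) > 0 for j in range(cols))

D = [[0, 1, 0],     # columns: dA, dB, dC
     [1, 0, 0],
     [0, 0, 1]]
T = [[1, 1, 1],     # Pi
     [-1, 0, 0],    # S
     [0, 1, 0]]
print(is_valid_transformation(T, D))   # True
```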
Step 4.1: PE design, including:
(a) arithmetic unit design according to the computations required in the body of the loop
(b) inputs and outputs of the PE and their directions.

The number of inputs and outputs is always the same as the number of vectors in the matrix D. The direction of each pair of (input, output) in the X-Y plane is determined by the mapped vector S dj.

Step 4.2: Systolic array design, including:
(a) Locations of all PEs, which are obtained from the equation (x, y)ᵀ = S (i, j, k)ᵀ, where (i, j, k) ranges over all possible index points in the given algorithm. For instance, in the above example with n = 3, the set of all possible index points is J = {(1, 1, 1), (1, 1, 2), ..., (3, 3, 3)}. All possible different values of the pair (x, y) form the locations of the PEs.
(b) After having located all PEs in the X-Y plane, the interconnections between PEs are obtained according to the data path of each pair of input and output determined in the PE design stage.
Example: Consider the matrix-matrix multiplication again. We have:

    T = | Π | = |  1  1  1 |
        | S |   | -1  0  0 |
                |  0  1  0 |

• PE design: According to the body of the algorithm, the PE requires one adder and one multiplier. Since D has three vectors, there are three pairs of inputs and outputs: (ain, aout), (bin, bout), and (cin, cout).
PE function:
  aout := ain
  bout := bin
  c := c + ain * bin

Figure 2.13

The mapped vector S dC = (0, 0)ᵀ indicates that variable c will stay in the PE register.

• More detailed design of the PE: one multiplier and one adder with the c register, inputs ain, bin and outputs aout, bout.

Figure 2.14

• Array design (n = 3)
According to (x, y)ᵀ = S (i, j, k)ᵀ we have the following PE location table, where 1 ≤ i ≤ 3, 1 ≤ j ≤ 3, and 1 ≤ k ≤ 3:

  (i, j, k), k = 1, 2, 3    (x, y)
  (1, 1, k)                 (-1, 1)
  (1, 2, k)                 (-1, 2)
  (1, 3, k)                 (-1, 3)
  (2, 1, k)                 (-2, 1)
  (2, 2, k)                 (-2, 2)
  (2, 3, k)                 (-2, 3)
  (3, 1, k)                 (-3, 1)
  (3, 2, k)                 (-3, 2)
  (3, 3, k)                 (-3, 3)

Table 2.1
All possible PE locations are (-1, 1), (-1, 2), (-1, 3), (-2, 1), (-2, 2), (-2, 3), (-3, 1), (-3, 2), and (-3, 3).
Place all PEs on the X-Y plane according to those locations and connect them according to the data paths determined in the PE design. Thus, we have the following array ( np = 9 ):
Figure 2.15
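The whole mapping of Step 4 can be reproduced in a few lines (a sketch; T is the transformation assumed for this example):

```python
def apply_T(T, point):
    # (t, x, y) = T (i, j, k)
    return tuple(sum(T[r][c] * point[c] for c in range(3)) for r in range(3))

T = [[1, 1, 1],     # Pi: t = i + j + k
     [-1, 0, 0],    # S:  x = -i
     [0, 1, 0]]     #     y = j

n = 3
index_points = [(i, j, k) for i in range(1, n + 1)
                          for j in range(1, n + 1)
                          for k in range(1, n + 1)]
mapped = [apply_T(T, p) for p in index_points]

pes = sorted({(x, y) for (t, x, y) in mapped})   # distinct PE locations
times = [t for (t, x, y) in mapped]
tc = max(times) - min(times) + 1

print(len(pes))   # 9 PEs, from (-3, 1) to (-1, 3)
print(tc)         # 7 = 9 - 3 + 1
```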
Step5: I/O Scheduling: determine locations of all input and output data.
Computation time: tc = tmax - tmin + 1
In the above example, we have tmax = 9, tmin = 3, tc = 9 – 3 + 1 =7. Reference time tref = tmin – 1 = 2. Input data are: a11 a12 a13 a21 a22 a23 a31 a32 a33
b11 b12 b13 b21 b22 b23 b31 b32 b33
c11 c12 c13 c21 c22 c23 c31 c32 c33
Output data are: c11 c12 c13 c21 c22 c23 c31 c32 c33
Location of a11: according to the assumption of the normalized algorithm, a11 = A(1, 0, 1). Its time is t = Π (1, 0, 1)ᵀ = 2 = tref, so a11 should be placed at x = -1 and y = 0.

Location of a12: a12 = A(1, 0, 2), so t = Π (1, 0, 2)ᵀ = 3 > tref = 2; therefore a12 will be placed at x = -1 and y = -1 (shifted back by the difference along the direction of the "a" data path).

Location of c11: c11 = C(1, 1, 0), so t = Π (1, 1, 0)ᵀ = 2 = tref; therefore c11 is located in the PE at x = -1 and y = 1.

Continue this process. Finally, we have the following I/O-scheduled systolic design:
Input streams (skewed with leading zeros): the b columns (b11 b21 b31; 0 b12 b22 b32; 0 0 b13 b23 b33) enter along one direction and the skewed a stream (a11; 0 a21 a12; 0 0 a31 a22 a13; a32 a23; a33) along the other; the results c11 ... c33 stay in their PEs.
Figure 2.16
2.3.3 Approaches To Find Valid Transformations
For a given algorithm, there may exist a large number of valid transformations. An important task in the mapping problem is to find an efficient systematic approach to search all possible solutions and find an optimal one. In general, we hope to find a valid transformation such that:
(i) computation time is minimized (ii) total number of PEs is minimized (iii) pipeline period is 1
As shown in the above example, all these parameters ( tc, np, p ) can be calculated after the whole design steps are completed. In the following, we give simple formulae for those design parameters without proof.
Suppose the normalized algorithm has the form:
for i := 1 to l1
  for j := 1 to l2
    for k := 1 to l3
      begin
        S1
        S2
        ...
        Sm
      end

The valid transformation is T = [Π; S], with Π = (t11, t12, t13).

• Computation time (tc): tc = (l1 - 1)|t11| + (l2 - 1)|t12| + (l3 - 1)|t13| + 1
• Number of PEs ( np ):
Tij is the ( i, j )-cofactor of matrix T.
• Pipeline period ( p ):
• Approach-1:
Three basic row operations:
(1) rowi ⟷ rowj
(2) rowi := rowi + K · rowj, i ≠ j, K an integer
(3) rowi := rowi · (-1)

Each row operation can be done by multiplying an elementary matrix into the given matrix.

Suppose for a given matrix D, after applying basic row operations (R1, R2, ..., Rn) we obtain a matrix in which all elements of the first row are positive integers. We conclude that the matrix T = Rn ... R2 R1 is a valid transformation.

In fact, it is easy to verify:
(i) For T = Rn ... R2 R1, det(T) = det(Rn) ... det(R1); since each det(Ri) = ±1, det(T) ≠ 0.
(ii) Since all elements of the first row of T D are > 0, Π di > 0 for all i.

Besides, according to the formula for the pipeline period, we have p = 1.
Example: we use the same example again.
• Approach-2:
Given D = [d1, d2, ..., dm], where m is the number of vectors in the data dependence matrix D.
Define P, the (2 × 9) universal neighbour connection matrix, whose columns are the nine possible unit moves of a PE in the X-Y plane (the offsets with components in {-1, 0, 1}).
Step 1: select a Π such that Π di > 0 for 1 ≤ i ≤ m.
Step 2: pick a (9 × m) matrix K.
Step 3: find S by solving the equation S D = P K.
Step 4: if det(T) ≠ 0 then done; otherwise pick another Π and/or K and repeat these steps until a valid transformation matrix T is found.

Example: Given a dependence matrix D.
Step 1: pick Π = [1  0  -1]

From the equations:
  t21 - t22 = 0
  t21 - t23 = 0
  t21 + t22 - 2 t23 = 0
  3 t22 - 2 t23 = 1
and
  t31 - t32 = 0
  t31 - t33 = 0
  t31 + t32 - 2 t33 = 0
  3 t32 - 2 t33 = 1
3. Analysis of Parallelism in Computer Algorithms
3.1 Data dependences in non-loops
•> Data dependence of two computations: one computation cannot start until the other one is completed.
•> Parallelism detection: find out sets of computations that can be performed simultaneously.
•> Types of data dependencies in non-loop algorithms or programs: Suppose that Si and Sj are two computations (statements or instructions in a non-loop algorithm or program), and Ii, Ij and Oi, Oj are the inputs and outputs of Si and Sj, respectively.
a) Type-1: data flow dependence ( read after write, denoted by RAW )
•) condition: (1) j > i  (2) Ij ∩ Oi ≠ ∅
•) example-1:
  100 R1 ← R2 op R3
  ...
  105 R5 ← R1 + R4
•) example-2:
  Si: A := B + C
  Sj: D := A · E + 2
•) notation: Si → Sj (computation Sj depends on computation Si)
b) Type-2: data antidependence (write after read, denoted by WAR)
•) condition: (1) j > i  (2) Oj ∩ Ii ≠ ∅
•) example-1:
  200 R1 ← R2 op R3
  ...
  205 R5 ← R2 + R5
•) example-2:
  Si: A := B · C
  Sj: B := D + 5
•) notation: Si -→ Sj (computation Sj depends on computation Si)

c) Type-3: data-output dependence (write after write, denoted by WAW)
•) condition: (1) j > i  (2) Oi ∩ Oj ≠ ∅
•) example:
  Si: A := B + C
  ...
  Sj: A := D · E
•) notation: Si ○→ Sj (computation Sj depends on computation Si)
•> Data dependence graph (DDG): represents all data dependencies among statements (instructions). Example: find the DDG for the following program.

100 R1 ← R3 div R2
101 R4 ← R4 op R5
102 R6 ← R1 + R3
103 R4 ← R2 + R3
104 R2 ← R4 + R2
Solution:
  Si, Sj     Dependency           di
  100, 101   no
  100, 102   type-1               d1
  100, 103   no
  100, 104   type-2               d2
  101, 102   no
  101, 103   type-3 and type-2    d4 and d5
  101, 104   type-1               d6
  102, 103   no
  102, 104   no
  103, 104   type-1 and type-2    d7 and d8
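The whole table can be generated by testing the three conditions on register sets. A sketch (the unknown operator of instruction 101 is written as a generic "op"; only the registers matter here):

```python
def dependencies(stmts):
    # stmts: (outputs, inputs) register sets in program order
    deps = {}
    for i in range(len(stmts)):
        for j in range(i + 1, len(stmts)):
            oi, ii = stmts[i]
            oj, ij = stmts[j]
            found = []
            if oi & ij: found.append("type-1 (RAW)")   # Ij ∩ Oi
            if oj & ii: found.append("type-2 (WAR)")   # Oj ∩ Ii
            if oi & oj: found.append("type-3 (WAW)")   # Oi ∩ Oj
            if found:
                deps[(i, j)] = found
    return deps

prog = [({"R1"}, {"R3", "R2"}),   # 100: R1 <- R3 div R2
        ({"R4"}, {"R4", "R5"}),   # 101: R4 <- R4 op R5
        ({"R6"}, {"R1", "R3"}),   # 102: R6 <- R1 + R3
        ({"R4"}, {"R2", "R3"}),   # 103: R4 <- R2 + R3
        ({"R2"}, {"R4", "R2"})]   # 104: R2 <- R4 + R2

d = dependencies(prog)
print(d[(0, 2)])   # 100 -> 102
print(d[(1, 3)])   # 101 -> 103
```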
DDG: nodes 100-104, with edges d1 (100 → 102), d2 (100 → 104), d4 and d5 (101 → 103), d6 (101 → 104), d7 and d8 (103 → 104).
3.2 Data dependences in loops
Consider the case of a single loop in the form:

for l := l1 to ln        DO l = l1, ln
begin                      S1
  S1                       S2
  S2            or         ...
  ...                      SN
  SN                     ENDDO
end
A loop can be viewed as a sequence of statements in which the statements ( the body of an algorithm ) are repeated n times.
e.g., ( n = 3, N = 2 )
There are still three types of data dependencies, but the conditions must be modified. Suppose that Si(li) and Sj(lj) are two statements in the loop, where li and lj are the index values at which Si and Sj, respectively, reference the shared variable.

a) Type-1 (RAW):
•) condition: (1) li - lj > 0, or li = lj and j > i
              (2) Oi ∩ Ij ≠ ∅
•) notation: Si →(d) Sj, distance d = li - lj ≥ 0
•) example:
for l := 1 to 5
begin
  S1: B(l + 1) := A(l) · C(l + 1)
  S2: C(l + 4) := B(l) · A(l + 1)
end

  Si  Sj  shared variable  li - lj = d
  S1  S2  B                (l + 1) - l = 1          d = 1
  S2  S1  C                (l + 4) - (l + 1) = 3    d = 3
b) Type-2 (WAR):
•) condition: (1) li - lj > 0, or li = lj and j > i
              (2) Oj ∩ Ii ≠ ∅
•) notation: Si -→(d) Sj, distance d = li - lj ≥ 0

c) Type-3 (WAW) (i ≠ j):
•) condition: (1) li - lj > 0, or li = lj and j > i
              (2) Oi ∩ Oj ≠ ∅
•) notation: Si ○→(d) Sj, distance d = li - lj ≥ 0

•) example:
DO l = 1, 4
  S1: A(l) := B(l) + C(l)
  S2: D(l) := A(l - 1) + C(l)
  S3: E(l) := A(l - 1) + D(l - 2)
ENDDO
  Si  Sj  shared variable  li - lj = d        type    di
  S1  S1  no
  S1  S2  A                l - (l - 1) = 1    type-1  d1
  S1  S3  A                l - (l - 1) = 1    type-1  d2
  S2  S1  A                (l - 1) - l = -1   no
  S2  S2  no
  S2  S3  D                l - (l - 2) = 2    type-1  d3
  S3  S1  A                (l - 1) - l = -1   no
  S3  S2  D                (l - 2) - l = -2   no
  S3  S3  no
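Distances like these can be computed mechanically from the array references. A sketch (statements are numbered, and each reference A(l + offset) is recorded as its offset):

```python
def loop_distances(writes, reads):
    # writes/reads: {statement: [(array, index offset)]} for references A(l + offset)
    # a dependence Si -> Sj exists when d = offset_w - offset_r > 0,
    # or d == 0 and j > i (same iteration, Sj later in the body)
    deps = []
    for si, wlist in writes.items():
        for arr, ow in wlist:
            for sj, rlist in reads.items():
                for arr2, oread in rlist:
                    if arr2 != arr:
                        continue
                    d = ow - oread
                    if d > 0 or (d == 0 and sj > si):
                        deps.append((si, sj, arr, d))
    return sorted(deps)

# the DO-loop example above
writes = {1: [("A", 0)], 2: [("D", 0)], 3: [("E", 0)]}
reads  = {1: [("B", 0), ("C", 0)],
          2: [("A", -1), ("C", 0)],
          3: [("A", -1), ("D", -2)]}

print(loop_distances(writes, reads))
# [(1, 2, 'A', 1), (1, 3, 'A', 1), (2, 3, 'D', 2)]  -> d1, d2, d3
```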
DDG (l = 1 to 4): each iteration contains S1, S2, S3; edges d1 (S1 → S2, distance 1), d2 (S1 → S3, distance 1), and d3 (S2 → S3, distance 2) link the iterations.
3.4. Techniques for removing some Data Dependencies
• Variable renaming.
S1: A := B · C                S1: A := B · C
S2: D := A + 1    ======>     S2: D := A + 1
S3: A := A · D                S3: AA := A · D

Data dependencies: renaming the output of S3 to AA removes the output dependence S1 ○→ S3 (WAW on A) and the antidependence S2 -→ S3 (WAR on A); the flow dependencies S1 → S2, S1 → S3, and S2 → S3 remain.
• Node splitting (introducing new variable).
DO l = 1, N                          DO l = 1, N
                                       S0: AA(l) := A(l + 1)
  S1: A(l) := B(l) + C(l)   ======>    S1: A(l) := B(l) + C(l)
  S2: D(l) := A(l) + 2                 S2: D(l) := A(l) + 2
  S3: F(l) := D(l) + A(l+1)            S3: F(l) := D(l) + AA(l)
ENDDO                                ENDDO

Data dependencies: on the left, S1 → S2 and S2 → S3 have distance 0, and there is an antidependence on A with distance 1 (S3 reads A(l+1), which S1 writes in the next iteration); on the right, that antidependence is moved onto the new variable AA captured by S0, and S1, S2, S3 depend on each other only within the same iteration.

• Variable substitution.
S1: X := A + B                S1: X := A + B
S2: Y := C · D    ======>     S2: Y := C · D
S3: Z := E·X + F·Y            S3: Z := E(A + B) + F(C · D)

Data dependencies: after substitution, S3 no longer depends on S1 or S2, so all three statements can be executed in parallel.
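The effect of node splitting can be checked numerically: once S0 has captured the old A(l+1) values, the remaining statements carry only distance-0 dependencies, so their loop can run in any iteration order. A sketch with made-up data:

```python
N = 6
B = [float(i) for i in range(N + 2)]
C = [2.0 * i for i in range(N + 2)]
A0 = [10.0 * i for i in range(N + 2)]   # initial contents of A

# original loop, sequential order
A = A0[:]; D = [0.0] * (N + 2); F = [0.0] * (N + 2)
for l in range(1, N + 1):
    A[l] = B[l] + C[l]
    D[l] = A[l] + 2
    F[l] = D[l] + A[l + 1]              # reads the old A(l+1)
F_ref = F[:]

# after node splitting: S0 captures the old A(l+1) values first, then the
# remaining loop body is independent across iterations (run here reversed,
# as a DOALL scheduler might)
A = A0[:]; D = [0.0] * (N + 2); F = [0.0] * (N + 2)
AA = [0.0] * (N + 2)
for l in range(1, N + 1):
    AA[l] = A[l + 1]
for l in range(N, 0, -1):
    A[l] = B[l] + C[l]
    D[l] = A[l] + 2
    F[l] = D[l] + AA[l]

assert F[1:N + 1] == F_ref[1:N + 1]     # same results in any order
```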
4. Parallel Computer Architectures
4.1 Review of Computer Architecture
a. Basic Components of a Computer System
  CPU: includes ALU and Control Unit
  Memory: cache memory, main memory, secondary memory
  I/O: DMA, I/O interface, and I/O devices
  Bus: a set of communication pathways (data, address, and control signals) connecting two or more components
b. Single Processor Configurations
  Multiple bus connected computer systems.
Figure 4.1
Single bus connected systems: the CPU (with cache), memory, DMA, I/O units, and an arbiter all share one system bus.
Figure 4.2
Note: At any time only one unit can control the bus, though two or more units may request the bus at the same time. The arbiter is used to select which unit will use the bus next. BR (bus request), BG (bus grant), SACK (select acknowledgement), and BBSY (bus busy) are the most important control signals used by the arbiter. The timing diagram of these control signals is shown in Figure 4.3.
Figure 4.3
c. Central Processing Unit (CPU)
  ALU: performs all logic and arithmetic operations and has some registers which store operands and results.
  Control Unit: produces all control signals such that instructions can be fetched and executed one by one. The control unit can be implemented by logic circuits or by microprogramming.

An instruction cycle includes several smaller function cycles, e.g.:
1. instruction fetch cycle
2. indirect (operand fetch) cycle
3. execution cycle

d. Memory
• Stores instructions and data
• Memory configuration: cache ↔ main memory (hardware mapping), main memory ↔ secondary memory (DMA transfer)

Figure 4.4

• cache memory: a small fast memory which is placed between the CPU and main memory
• why cache? there is a big speed gap between the CPU and main memory
• hardware mapping: direct mapping, associative mapping, and set associative mapping

e. Inputs and Outputs (I/O)
• Programmed I/O: the CPU is idle while waiting
• Interrupt-driven I/O: the CPU is interrupted when I/O requires service
Figure 4.5
• DMA I/O: moves data directly to/from memory
Figure 4.6
4.2 Computer Classification
According to Flynn's classification all normal stored program computers (von Neumann computers) can be classified into:
• Single-instruction single-data computers (SISD)
• Single-instruction multiple-data computers (SIMD)
• Multiple-instruction multiple-data computers (MIMD)
• Multiple-instruction single-data computers (MISD)
Figure 4.7
4.3 SIMD Computers
A single control unit generates a single instruction stream and the instructions are broadcast to all other processors (PEs). Each processor executes the same instruction on different data.
There are two kinds of SIMD computers: (a) Shared memory SIMD
Figure 4.8
(i) Memories are separated from PEs (ii) There are several IN architectures for shared memory access:
(1) Single bus
Figure 4.9
(2) Multiple buses
Figure 4.10
(3) Cross-bar switch
Figure 4.11
(4) Multistage interconnection networks
Figure 4.12
(b) Local memory SIMD
Figure 4.13
(i) Each PE has its own memory (ii) The interconnection network (IN) can be:
(1) Mesh
Figure 4.14
(2) Nodes with eight links
Figure 4.15
(3) Cube
Figure 4.16
(4) Exhaustive
Figure 4.17
(5) Tree
Figure 4.18
(iii) There are two kinds of instructions:
  - vector instructions
  - non-vector instructions
(iv) All control instructions and scalar instructions (non vector) are executed by the host.
(v) If it is a vector instruction, the CU broadcasts:
  - the instruction to all PEs
  - a constant to all PEs (if it is needed)
  - an address to all PEs (if the instruction requires PEs to access data from their memories)
Note: Each PE has its own address index register
read address = address sent by CU + index register
Suppose n = 4 and each PE has three address index registers: RA, RB, and RC. For the matrix-matrix multiplication C = A × B, the data stored in the four PEs are as follows:
Figure 4.19
Program:
for i = 1 to n
  in parallel for all j, 1 ≤ j ≤ n
    RC(j) = 0                /* processors clear RC registers */
Memory Mj of PE j (j = 1, ..., 4) holds column j of each matrix: a1j ... a4j, b1j ... b4j, and c1j ... c4j.
  for k = 1 to n
    fetch a(i, k)            /* by control unit */
    broadcast a(i, k)        /* by control unit */
    RA(j) = a(i, k)          /* processors store broadcast data */
    RB(j) = b(k, j)          /* processors fetch local operands */
    RA(j) = RA(j) * RB(j)    /* processors multiply */
    RC(j) = RA(j) + RC(j)    /* processors update RC registers */
  end loop k
  c(i, j) = RC(j)            /* processors store RC registers */
end loop i
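The broadcast algorithm can be simulated directly (a sketch; PE j is modelled as index j of the register arrays, and the matrices are made-up test data):

```python
n = 4
# hypothetical input matrices
a = [[i + j for j in range(n)] for i in range(n)]
b = [[(i * j) % 3 + 1 for j in range(n)] for i in range(n)]
c = [[0] * n for _ in range(n)]

RA = [0] * n; RB = [0] * n; RC = [0] * n   # one register set per PE
for i in range(n):
    for j in range(n):                     # "in parallel for all j"
        RC[j] = 0
    for k in range(n):
        broadcast = a[i][k]                # CU fetches and broadcasts a(i, k)
        for j in range(n):                 # lock-step: same instruction on all PEs
            RA[j] = broadcast
            RB[j] = b[k][j]                # operand from PE j's local memory
            RA[j] = RA[j] * RB[j]
            RC[j] = RA[j] + RC[j]
    for j in range(n):
        c[i][j] = RC[j]

# check against the ordinary triple loop
ref = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
       for i in range(n)]
assert c == ref
```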
4.4 MIMD Computers
In a MIMD system, a number of independent processors operate upon separate data concurrently. There are two types of MIMD computers:
1. Shared memory MIMD
2. Message passing MIMD
4.4.1 Shared memory MIMD
Shared memory MIMD uses a centralized memory for communication purposes, as shown in Fig. 4.20: processors P1, P2, ..., Pn connect through an interconnection network to memories M1, M2, ..., Mn.

Figure 4.20 Shared memory MIMD

Possible interconnection networks (see shared memory SIMD):
  Single bus
  Multiple buses
  Cross-bar switch
  Multistage network
Advantages of shared memory MIMD: easy programming
Disadvantages of shared memory MIMD: memory contention, which happens when the shared memory receives many processor requests within a very short time period; some processors must wait for other processors to be served. Contention increases as the number of processors increases.

Some techniques helping reduce the memory contention:
  Local cache memory
  Shared memory with multiple memory modules
4.4.2 Message passing MIMD
Message passing MIMD consists of a number of independent processors, each of which has its own memory. All processors connect through an interconnection network, as shown in Fig. 4.21. Communications between processors are done by passing data through the direct links.
Figure 4.21 Message Passing MIMD
Possible interconnection networks (see local memory SIMD):
  Mesh
  Node with eight links
  Cube
  Exhaustive (star)
  Tree

Advantages of message passing MIMD: a large number of processors can be used; potentially higher performance.

Disadvantages of message passing MIMD: difficult programming; higher communication cost.
5. Parallel Programming
5.1 Parallelism for Single Loop Algorithms
FORALL (DOALL) and FORALL GROUPING Statements
• FORALL (DOALL) is the parallel form of FOR (DO) statement for loops.
• Condition for using FORALL (DOALL): there are no data dependences in the loop, or the distance of all data dependences is zero.

• Assumption: there are n + 1 processors, where n is the number of iterations in the loop.
• Consider a loop:
for l := l1 to ln         DOALL l = l1, ln
begin                       S1
  S1                        S2
  S2            ======>     ...
  ...                       Sm
  Sm                      ENDDO
end

It requires the following two steps:
1. A processor creates n processes by assigning one iteration to each process.
2. The n processors execute the n iterations at the same time.

Tp = n t2 + t1, where t1 is the processing time for one iteration and t2 is the time to create one process (the processes are created one after another, then run in parallel).

Example:
For i := 1 to 100
begin
  A(i) := SQRT(A(i));     (S1)
  B(i) := A(i) · B(i);    (S2)
end
Data dependence graph: S1 → S2 with d1 = 0 and d2 = 0 (all distances are zero, so FORALL applies).
applying FORALL:
FORALL i := 1 to 100 DO
begin
  A(i) := SQRT(A(i));
  B(i) := A(i) · B(i);
end

Assume the process creation time is 10 time units and the execution time of each iteration is 100 time units.
It requires 100 + 1 = 101 processors.
• Process granularity
Granularity of a process refers to the execution time of each process (iteration). In order to have meaningful speed-up, the execution time of the process must be larger than the creation time (overhead).

For example, in the previous example, if the creation time is 50 time units and the execution time of each process is also 50 time units, the overhead cancels the gain.

In parallel programming, the keyword GROUPING is used to group a number of iterations together to overcome this granularity problem. It can also be used when the number of processes is larger than the number of processors in the system.

Format: FORALL ... GROUPING G DO, where G is the number of processes (iterations) in a group. Applying FORALL GROUPING to the previous example:

FORALL i := 1 to 100 GROUPING 10 DO
begin
  A(i) := SQRT(A(i));
  B(i) := A(i) · B(i);
end

• Optimal group size
Let:
  n be the total number of iterations
  G be the group size
  C be the time to create one process
  T be the execution time of each iteration
The total time for FORALL ... GROUPING G is then (n/G) C + G T (the n/G processes are created one after another, then each executes its G iterations).
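Under this cost model (an assumption: n/G sequential creations followed by fully parallel groups), the minimizing group size follows from elementary calculus:

```python
import math

def total_time(n, G, C, T):
    # (n / G) sequential process creations, then each process runs G iterations
    return (n / G) * C + G * T

def best_group_size(n, C, T):
    # d/dG [(n/G)C + GT] = -nC/G^2 + T = 0  =>  G* = sqrt(nC / T)
    return math.sqrt(n * C / T)

# the earlier example: n = 100 iterations, creation C = 10, iteration T = 100
print(total_time(100, 10, 10, 100))      # 1100.0 with the G = 10 used above
print(best_group_size(100, 10, 100))     # sqrt(10), about 3.16
```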
DOACROSS Transformation for loops with no zero distance data dependence(s)
a) Assumptions:
• There are N processors.
• Each processor can communicate with the others.
42
• The single loop algorithm has been analyzed and some data dependencies have been removed. The algorithm will have the following form, and there is (are) data dependence(s) among the statements:

DO l = 1, N
  S1
  S2
  ...
  Sn
ENDDO
• Each statement, Si requires the same time, t.
b) DOACROSS is a parallel language construct to transform sequential loops into parallel forms. DOACROSS is used when there are data dependencies between iterations.
•> Each processor performs all computations that belong to the same iteration; e.g., all computations belonging to the first iteration are performed by processor 1, and processor n performs all computations belonging to the n-th iteration.

•> If there is a data dependency dk between Si and Sj with dk ≠ 0, then a special statement, synchronization dk, must be inserted before statement Sj. This means that statement Sj cannot be executed until the corresponding instance of statement Si is completed.
•> A special table, called space-time diagram, can be used to calculate the speed-up and other performance indices.
Example-1:
DO l = 1, 4
  S1 : A(l) := B(l) + C(l)
  S2 : D(l) := B(l-1) + C(l)
  S3 : E(l) := A(l-1) + D(l-2)
ENDDO
Data dependencies:
d1 = 1
d2 = 2
DDG: [figure omitted: nodes S1, S2, S3 for iterations l = 1 … 4, with edges S1 → S3 (distance d1 = 1) and S2 → S3 (distance d2 = 2).]
DOACROSS Transformation:
DOACROSS l = 1, 4
  S1 : A(l) := B(l) + C(l)
  S2 : D(l) := B(l-1) + C(l)
  Synchronization d1
  Synchronization d2
  S3 : E(l) := A(l-1) + D(l-2)
ENDDOACROSS
Space-time Diagram: (assuming no time is required for Synchronization)
[diagram omitted: processors 1–4 (one per iteration) on the vertical axis and time slots on the horizontal axis; each processor executes S1, S2, and then S3 once its synchronizations are satisfied.]
Example-2:
DO l = 1, 4
  S1 : B(l) := A(l-2) + 2
  S2 : A(l) := D(l) + C(l)
  S3 : C(l) := A(l-1) + 3
ENDDO
Data dependencies:
d1 = 2
d2 = 1
DDG: [figure omitted: nodes S1, S2, S3 for iterations l = 1 … 4, with edges S2 → S1 (distance d1 = 2) and S2 → S3 (distance d2 = 1).]
DOACROSS Transformation:
DOACROSS l = 1, 4
  Synchronization d1
  S1 : B(l) := A(l-2) + 2
  S2 : A(l) := D(l) + C(l)
  Synchronization d2
  S3 : C(l) := A(l-1) + 3
ENDDOACROSS
Space-time Diagram: (assuming no time is required for Synchronization)
[diagram omitted: processors 1–4 versus time slots 1–5.]
Note: If the order of statements S1 and S2 is exchanged (the algorithm remains equivalent to the original one), then the speed-up equals 4 (see the same example in the textbook on page 138).
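The completion times of Example-2 can be checked with a small simulation. This Python sketch encodes the model stated above (each statement takes one time unit t, Synchronization itself costs nothing, and processor l executes iteration l's statements in program order); the data structures and names are illustrative assumptions.

```python
# Sketch: completion time of a DOACROSS loop under the model above.
def doacross_time(n_iters, stmts):
    """stmts: list of (name, waits); waits maps a statement name S to a
    distance d, meaning this statement waits for S of iteration l-d."""
    finish = {}  # (statement, iteration) -> finish time
    for l in range(1, n_iters + 1):      # iteration l runs on processor l
        t = 0                            # local clock of processor l
        for name, waits in stmts:
            ready = t
            for src, d in waits.items():
                if (src, l - d) in finish:           # source iteration exists
                    ready = max(ready, finish[(src, l - d)])
            finish[(name, l)] = ready + 1
            t = ready + 1
    return max(finish.values())

# Original order: Synchronization d1 (S2, distance 2) guards S1,
# Synchronization d2 (S2, distance 1) guards S3.
orig = [("S1", {"S2": 2}), ("S2", {}), ("S3", {"S2": 1})]
# With S1 and S2 exchanged, as in the note above.
swapped = [("S2", {}), ("S1", {"S2": 2}), ("S3", {"S2": 1})]

print(doacross_time(4, orig))     # 5 time slots -> speed-up 12t/5t = 2.4
print(doacross_time(4, swapped))  # 3 time slots -> speed-up 12t/3t = 4
```

The swapped order completes in 3 time slots, reproducing the speed-up of 4 quoted from the textbook.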
Pipelining Transformation
a) Assumptions:
•> There are n processors and n statements in each iteration; each processor computes one statement of the loop. •> The n processors form a pipeline (ring).
Figure 5.1 [figure omitted: processors P1 → P2 → … → Pn connected in a pipeline; the last stage produces the output.]
Example:
DO l = 1, 4
  S1 : A(l) := A(l-1) + C(l)
  S2 : B(l) := C(l) + B(l-1)
  S3 : C(l) := A(l-2) + C(l-1)
ENDDO
Data dependencies: d1 = 1 , d2 = 1 , d3 = 2
DDG: [figure omitted: nodes S1, S2, S3 for iterations l = 1 … 4, with edges corresponding to d1 = 1 (S1 → S1), d2 = 1 (S2 → S2), and d3 = 2 (S1 → S3).]
Space-time Diagram: (by pipelining transformation)
[diagram omitted: three processors (P1 for S1, P2 for S2, P3 for S3) versus time slots 1–5; the pipeline fills stage by stage — first S1 only (A), then S1 and S2 (A, B), then all three statements (A, B, C).]
Compare with the space-time diagram by DOACROSS Transformation:
[diagram omitted: four processors versus time slots 1–6.]
a) Example of vector processor program: A + B = C, where A, B, and C are n-element vectors.
Vector load  VR1            ; A → VR1
Vector load  VR2            ; B → VR2
Vector add   VR1, VR2, VR3  ; VR1 + VR2 → VR3
Vector store VR3            ; VR3 → C
b) Loop vectorization: If there are no data dependence cycles in the body of a loop, then the vector computer can assign one iteration to each ALU such that all iterations can be executed simultaneously.
Note: In a high level language, vector operation for C = A + B is represented by
C(1:n) = A(1:n) + B(1:n)
If there is a data dependence cycle, then only statements outside the cycle can be vectorized.
Example:
DO l = 1, N
  S1 : D(l) := A(l+1) + 3
  S2 : A(l) := B(l-1) + C(l)
  S3 : B(l) := A(l) - 5
ENDDO
Data dependencies:
There is a cycle between S2 and S3. Only S1 can be vectorized.
S1 : D(1:N) := A(2:N+1) + 3
DO l = 1, N
  S2 : A(l) := B(l-1) + C(l)
  S3 : B(l) := A(l) - 5
ENDDO
Speed-up: T1 = 3 N t , Tp = (1 + 2N) t , Speed-up = T1 / Tp = 3N / (2N + 1) ≈ 1.5
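To see that hoisting S1 out as a vector statement preserves the result (only S2 and S3 form the dependence cycle), a quick check can be sketched with plain Python lists and made-up data; the function names and values here are illustrative assumptions, not part of the course material.

```python
# Sketch: the sequential loop versus the partially vectorized version above.
# Arrays are 1-based; index 0 holds the boundary value, and A has one extra
# element because S1 reads A(l+1).
def sequential(A, B, C, N):
    A, B = A[:], B[:]
    D = [0] * (N + 1)
    for l in range(1, N + 1):
        D[l] = A[l + 1] + 3        # S1
        A[l] = B[l - 1] + C[l]     # S2
        B[l] = A[l] - 5            # S3
    return A, B, D

def vectorized(A, B, C, N):
    A, B = A[:], B[:]
    D = [0] * (N + 1)
    D[1:N + 1] = [A[l + 1] + 3 for l in range(1, N + 1)]   # S1 vectorized
    for l in range(1, N + 1):          # cycle S2-S3 stays sequential
        A[l] = B[l - 1] + C[l]
        B[l] = A[l] - 5
    return A, B, D

N = 5
A = [float(i) for i in range(N + 2)]
B = [2.0 * i for i in range(N + 1)]
C = [i * i * 1.0 for i in range(N + 1)]
print(sequential(A, B, C, N) == vectorized(A, B, C, N))   # True
```

S1 can be hoisted because S1 of iteration l reads A(l+1) before any later iteration overwrites it, so the vector statement sees exactly the values the sequential loop would have seen.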
5.2 Parallelism for Nested Loop Algorithms
There are three kinds of approaches to parallel processing of nested loop algorithms.
(i) Apply FORALL (DOALL), if there are no data dependencies in the loop or the distance of every data dependence is zero.
Example:
for i := 1 to 100 do
  for j := 1 to 100 do
  begin
    a( i, j ) := a( i, j ) + b( i, j ) ;   S1
    b( i, j ) := a( i, j ) * b( i, j ) ;   S2
  end
d1 = 0
d2 = 0
Apply FORALL:
FORALL i := 1 to 100 do
  FORALL j := 1 to 100 do
  begin
    a( i, j ) := a( i, j ) + b( i, j ) ;
    b( i, j ) := a( i, j ) * b( i, j ) ;
  end
Assuming the time for each process creation is 10 t and the execution time of each process is 500 t.
(ii) Treat the nested loop as a single loop. All approaches introduced for single loop algorithms can then be applied. However, the degree of parallelism is limited, since only one loop can be parallelized.
Example: C = A + B, where A = (aij) and B = (bij) are two n × n matrices.
DO i = 1, n
  DO j = 1, n
    S1 : c(i, j) = a(i, j) + b(i, j) ;
  ENDDO
ENDDO
T1 = n² t
By using DOACROSS:
DO i = 1, n
  DOACROSS j = 1, n
    S1 : c(i, j) = a(i, j) + b(i, j) ;
  ENDDOACROSS
ENDDO
Tp = n t
Speed-up = T1 / Tp = (n² t) / (n t) = n
(iii) Map nested loop algorithms onto array processors.
Array processors:
• An array processor is a synchronous parallel computer which consists of a number of processing elements (PEs) controlled by a host computer (control unit).
• Array processors can be classified into three groups: (a) single-instruction multiple-data (SIMD) processors, (b) systolic arrays, (c) associative processors.
5.3 Introduction to MultiPascal (Part II)
6. Examples of Parallel Processing
6.1 Intel MMX Technology (Part III)
6.2 Cluster Computing (Part IV)
6.3 Task scheduling in message passing MIMD
6.3.1 Scheduling Model and Notations
A parallel program is represented by a directed acyclic graph (data dependence graph)
G = (V, E)
where V is a set of tasks, an edge (vi, vj) ∈ E implies that task vj depends on task vi, and the edge weight c(vi, vj) represents the communication cost between vi and vj. Each node vi is associated with a computation cost t(vi). The target machine is a message passing MIMD computer made up of an arbitrary number of PEs. Each PE can run one task at a time, and every task can be processed by any PE. All PEs are connected through an interconnection network (IN). Associated with each edge connecting two PEs is the transfer rate over the link.
Example: Figure 1 shows an example of a DDG and a target machine. There are 9 tasks in total; each node represents one task. The number shown in the upper portion of a node is the task number, and the number shown in the lower portion is its computation cost.
A Gantt chart (space-time diagram) can be used to represent the task schedule. It consists of a list of the PEs in the target machine and, for each PE, the list of tasks allocated to that PE, ordered by their execution times (computation costs).
The goal of task scheduling is to minimize the total completion time of a parallel program.
6.3.2 Communication Cost Model
There are different communication models; the following is one of them. We assume that each PE in the system has an I/O processor, so each PE can execute a task and communicate with another PE at the same time, and the communication time between two tasks allocated to the same PE is zero. If a task vi has been scheduled on some PE and task vj depends on vi, then vj can be scheduled either on the same PE as soon as vi completes, or on another PE after the additional delay of communicating vi's result over the link.
A. The target machine is an eight-node hypercube IN.
Example: Figure 2 shows the Gantt chart for an example under the above communication cost model. Assume the target machine has 3 PEs; tasks 1, 4, 5 and 8 are assigned to PE1, tasks 2 and 9 are assigned to PE2, and tasks 3, 6 and 7 are assigned to PE3, with the same communication cost for all i and j.
6.3.3 Optimal Scheduling Algorithms
It has been proven that the problem of finding an optimal schedule for a set of tasks is NP-complete in the general case. In this section we discuss the cases in which task graphs can be scheduled optimally in polynomial time.
When communication cost is ignored
Case 1: scheduling tree-structured task graphs.
Notations:
Task level: Let the level of a node x in a task graph be the maximum number of nodes (including x) on any path from x to a terminal task (leaf). In a tree, there is exactly one such path.
Ready tasks: A task is ready when it has no predecessors or when all of its predecessors have already been executed.
Algorithm
Step 1. Calculate the task level of each node (task) in the tree as the node's priority.
Step 2. Whenever a PE becomes available, assign it the unexecuted ready task with the highest priority.
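The two steps above can be sketched for unit-time tasks with communication cost ignored. In this Python sketch the task graph is an in-tree given as a dict mapping each task to its single successor; the representation and the small 7-task example graph are assumptions made for illustration (they are not Figure 3).

```python
# Sketch of the level algorithm: priority = level, highest level first.
def levels(succ):
    lv = {}
    def level(v):              # nodes on the unique path to the terminal task
        if v not in lv:
            lv[v] = 1 if succ[v] is None else 1 + level(succ[v])
        return lv[v]
    for v in succ:
        level(v)
    return lv

def schedule(succ, n_pes):
    lv = levels(succ)
    preds = {v: set() for v in succ}       # predecessors of each task
    for v, s in succ.items():
        if s is not None:
            preds[s].add(v)
    done, gantt = set(), []
    while len(done) < len(succ):
        ready = [v for v in succ if v not in done and preds[v] <= done]
        ready.sort(key=lambda v: -lv[v])   # highest level = highest priority
        step = ready[:n_pes]               # one task per available PE
        gantt.append(step)
        done |= set(step)
    return len(gantt), gantt               # completion time, schedule

# Made-up in-tree: 4 leaf tasks feed 2 middle tasks, which feed the root.
succ = {"t1": "t5", "t2": "t5", "t3": "t6", "t4": "t6",
        "t5": "t7", "t6": "t7", "t7": None}
print(schedule(succ, 3))                   # completes in 4 time units on 3 PEs
```

Each pass of the while loop is one time slot, matching one column of the Gantt chart described earlier.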
Example: Consider the tree-structured task graph in Figure 3 (a) to be scheduled on a fully connected system with 3 PEs. Applying the above algorithm:
Step 1. Calculate the task level of each node.
Step 2. Tasks 8, 9, 10, 11, 13, and 14 are ready tasks. Assign tasks 13 and 14 (both have priority 5) to two PEs, and assign task 11 to the third PE.
The completed Gantt chart is shown in Figure 3 (b).
Case 2: Scheduling an arbitrary task graph on two PEs
Algorithm (see reference).
Considering Communication Cost
Case 1. Scheduling a tree-structured task graph on two PEs
Algorithm (see reference).
6.3.4 Heuristic Algorithms
A heuristic algorithm uses less than exponential time but does not guarantee an optimal solution.
An example of a heuristic algorithm (list algorithm):
Step 1. Assign a priority to each node in the task graph.
Step 2. Initialize a priority queue with all ready tasks and sort the queue in descending order of priority.
Step 3. As long as the priority queue is not empty, do the following:
  3.1 Obtain a task from the front of the queue.
  3.2 Select an idle PE to run the task.
  3.3 Insert tasks into the queue as they become ready.
6.4 Associative Processing
• Associative Memory (AM): a special memory in which a word is addressed on the basis of its content rather than its memory address. It is also called Content Addressable Memory (CAM).
• Structure of Associative Processor:
Figure 6.1 [figure omitted: a host computer and control unit drive an m-bit comparand register C and mask register M, an n-word × m-bit AM array (words wi with bits bij), an n-bit tag register T, and data gathering registers D.]
(i) Host computer: the associative memory is controlled by a host computer, which sends data and control signals to the AM, collects results from the AM, and executes instructions that are not executable by the AM.
(ii) Control unit: accepts control signals from the host computer and controls associative processing.
(iii) Comparand register (C): stores the data (pattern) which is searched for in the AM.
(iv) Mask register (M): has the same number of bits as C. All bits of C whose corresponding bits in M are "1" participate in the search.
(v) AM: is composed of n words of m bits each ( wi = ( bi1 bi2 … bim ) ). Each bit bij can be read and written, and is compared with Cj if Mj = 1.
(vi) Tag register (T): indicates the result of a search or marks the operands in the AM. T = ( t1 t2 … tn ).
(vii) Data gathering registers (D): one or more registers ( D = ( d1 d2 … dn ) ) used to store results from T. T and D can perform some simple logic operations.
(viii) Flag some/none: set to one if at least one word in the AM matches C (under mask M), otherwise set to zero.
• Basic Operations:
(1) Set and Reset: to set/reset registers T, D, C, M.
(2) Load: to load registers C and M.
(3) Search: before searching, C and M should be loaded and all bits of T should be set to one. The search performs the following steps:
(a) for j = 1 to m, in parallel for all i: if bij ≠ Cj and Mj = 1, then set mismatchij = 1
(b) set mismatchi = mismatchi1 + mismatchi2 + … + mismatchim
(c) set ti = 0 if mismatchi = 1. The words whose tag bits remain one match C.
(4) Conditional branch: depends on the value of the flag some/none.
(5) Move: to move T to D.
(6) Read: read out the words whose tag bits are one, one by one, to the host computer.
(7) Write: write the content of C to all words whose tag bits are one, in parallel.
Examples of AM programs:
(i) Search for a pattern (101010…10)
1. Set T
2. Load C = 101010…10
3. Load M = 111111…11
4. Search
5. If some/none = 1 then Read, else done
(ii) Find the maximum number in the AM
1. Set D and T
2. Load C = 111…1
3. For j = m downto 1 do
   begin
     Load M = 2^j   (only bit j of M is one)
     Search
     if some/none = 1 then D = D · T and T = D
   end
4. Move D to T
5. If some/none = 1 then Read, else done
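The masked Search operation and the bit-serial maximum search of example (ii) can be modelled in a few lines of Python. The integer-word representation and the helper names here are illustrative assumptions, not AM hardware.

```python
# Toy model of an associative memory: words are m-bit integers, the tag
# register is a list of booleans, and the mask selects participating bits.
def am_search(words, C, M):
    """Set tag i iff word i matches comparand C on every bit where M is 1."""
    return [(w & M) == (C & M) for w in words]

def am_max(words, m):
    """Find the maximum by scanning bit positions from MSB to LSB."""
    D = [True] * len(words)                    # D marks surviving candidates
    for j in range(m - 1, -1, -1):
        Mreg = 1 << j                          # mask: only bit j participates
        T = am_search(words, C=Mreg, M=Mreg)   # words with bit j = 1
        T = [d and t for d, t in zip(D, T)]    # restrict to candidates (D . T)
        if any(T):                             # some/none flag
            D = T                              # keep candidates with bit j = 1
    return [i for i, d in enumerate(D) if d]   # indices holding the maximum

words = [0b1010, 0b0111, 0b1100, 0b1100]
print(am_search(words, C=0b1010, M=0b1111))    # [True, False, False, False]
print(am_max(words, 4))                        # [2, 3]: both copies of 0b1100
```

Note how every word is compared in one step of am_search: this is the content-addressed parallelism that distinguishes an AM from an ordinary, address-indexed memory.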
6.5 Data Flow Computing
• Idea of data flow computing: a computation takes place when all of its operands are available.
• Comparison between ordinary computers (von Neumann machines) and data flow computers:
. von Neumann computers: (i) program and data are stored in memory (ii) the sequence of instruction executions is controlled by a program counter (PC)
(iii) operands may be given as values or addresses
. Data flow computers: (i) a data-flow operation is executed when all its operands become available (ii) all operands must be given as values (iii) data-flow programs are described in terms of directed graphs called data flow graphs
Example of data flow graph: compute Result = A / B + B × C.
Figure 6.2 [figure omitted: a copy node duplicates B; one copy feeds a divide node computing A / B, the other feeds a multiply node computing B × C; both results feed an add node that produces Result.]
• Token: a token is placed on an input of a computation node to indicate that the operand is available. After the execution (called firing), the token moves to the output of the computation node.
Examples of token movement:
Figure 6.3 [figure omitted: four snapshots of the graph of Figure 6.2 — after the inputs A, B, C are applied; after the copy instruction executes; after the divide and multiply instructions execute (tokens A / B and B × C); and after the addition instruction executes, leaving the token A / B + B × C on Result.]
• Initially, tokens are placed on all inputs of a program. The program is complete when a token appears on the output of the program.
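The firing rule can be sketched as a toy simulation of the Figure 6.2 graph: a node fires as soon as tokens are present on all of its inputs, consuming them and emitting a token on each output. The node names and dict representation are made up for illustration.

```python
# Toy dataflow interpreter for Result = A / B + B * C.
# nodes: name -> (input token slots, output token slots, function)
graph = {
    "copyB": (["B"], ["B1", "B2"], lambda b: (b, b)),
    "div":   (["A", "B1"], ["q"], lambda a, b: (a / b,)),
    "mul":   (["B2", "C"], ["p"], lambda b, c: (b * c,)),
    "add":   (["q", "p"], ["Result"], lambda q, p: (q + p,)),
}

def run(tokens):
    fired = True
    while fired:                                   # repeat until no node can fire
        fired = False
        for name, (ins, outs, fn) in graph.items():
            if all(i in tokens for i in ins):      # all operands available
                vals = fn(*[tokens.pop(i) for i in ins])  # fire: consume inputs
                tokens.update(zip(outs, vals))            # emit output tokens
                fired = True
    return tokens

out = run({"A": 8.0, "B": 2.0, "C": 3.0})          # place tokens on all inputs
print(out["Result"])                               # 8/2 + 2*3 = 10.0
```

There is no program counter anywhere in this loop: execution order is determined entirely by token availability, which is the essential contrast with the von Neumann model above.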
Example of data flow program represented by data flow graph:
input ( x, y )
z := 0
while x < y do
begin
  if z < 10 then z := z + 1
  else z := z - 5 ;
  x := x + 1
end
output ( z )
Figure 6.4 [figure omitted: dataflow graph of the program — input tokens x and y, constant tokens 0 and 10, comparison nodes (<) whose T/F outputs gate the tokens through the two branches, operator nodes +1 and -5, and a loop-back path producing the output z.]