TRANSCRIPT
PARALLEL PROCESSING
CS802
Part I
Instructor: Dr. C. N. Zhang
Department of Computer Science
University of Regina
Regina, Saskatchewan
Canada, S4S 0A2
Table of Contents
1. Introduction to Parallel Processing
2. Pipeline
2.1. Design of pipeline
2.2. RISC instruction pipeline
2.3. Mapping nested loop algorithms into systolic arrays
3. Parallelism Analysis
3.1. Data dependency analysis for non loops
3.2. Data dependency analysis for loops
4. Parallel Computer Architectures
4.1. Review of computer architecture
4.2. Computer classification
4.3. SIMD Computers
4.4. MIMD Computers
5. Parallel Programming
5.1. Parallelism for single loop algorithms
5.2. Parallelism for nested loop algorithms
5.3. Introduction to MultiPascal (Part II)
6. Examples of Parallel Processing
6.1. Intel MMX technology (Part III)
6.2. Cluster computing (Part IV)
6.3. Task scheduling in message passing MIMD
6.4. Associative processing
6.5. Data flow computing
1. Introduction to Parallel Processing
Parallel Processing: two or more tasks being performed (executed) simultaneously.

Basic techniques for parallel processing:
a. pipeline
b. replicated units
c. pipeline plus replicated units

Parallel processing can be implemented at the following levels:
a. at the processor level
b. at the instruction execution level
c. at the arithmetic and logic operation level
d. at the logic circuit level
Speed-up of the parallel processing:

    S = Ts / Tp

where Ts is the processing time in a sequential (non-parallel) manner and Tp is the processing time required by the parallel processing technique.

Efficiency of the parallel processing:

    E = S / n

where n is the number of units used to perform the parallel processing.
Note: 0 < S ≤ n and 0 < E ≤ 1
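As a quick numeric illustration (a sketch with made-up numbers, not from the notes), both measures can be computed directly:

```python
def speedup(ts, tp):
    # S = Ts / Tp: sequential time over parallel time
    return ts / tp

def efficiency(ts, tp, n):
    # E = S / n: speed-up per unit of hardware, 0 < E <= 1
    return speedup(ts, tp) / n

# hypothetical job: 120 time units sequentially, 20 units on 8 hardware units
print(speedup(120, 20))        # 6.0
print(efficiency(120, 20, 8))  # 0.75
```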
How to improve performance of computer systems?
a. by electronic technology, e.g., new VLSI development
b. by software development, e.g., better OS and compiler
c. by architecture enhancements

Architecture enhancements:
a. pipeline
b. parallelism
c. memory hierarchy
d. multiple processor computer systems
e. hardware support for OS and compiler
f. special purpose architectures

Example: compute the sums Yj = a1j + a2j + ... + anj for j = 1, 2, ..., m, where the aij are integers, i = 1, 2, ..., n, j = 1, 2, ..., m. Assume that the time delay of the adder is t.

Sequential processing: use one adder. Ts = (n-1) m t
Parallel Processing:
a. use two adders: Tp = (n-1) m t / 2, so S = 2 and E = 1
b. use m adders (one per column): Tp = (n-1) t, so S = m and E = 1
c. use n-1 adders (n = 2^k) as a pipelined adder tree, e.g., n = 8: Tp = (log n + m - 1) t
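The three schemes can be compared numerically; the sketch below (an illustration with t = 1) evaluates the formulas above for n = 8, m = 10:

```python
import math

def ts_one_adder(n, m, t=1):
    # sequential: (n - 1) additions per column, m columns
    return (n - 1) * m * t

def tp_two_adders(n, m, t=1):
    return (n - 1) * m * t / 2       # S = 2, E = 1

def tp_m_adders(n, m, t=1):
    return (n - 1) * t               # S = m, E = 1

def tp_adder_tree(n, m, t=1):
    # pipelined tree of n - 1 adders, n = 2^k
    return (int(math.log2(n)) + m - 1) * t

n, m = 8, 10
print(ts_one_adder(n, m))        # 70
print(tp_adder_tree(n, m))       # 12
```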
Input schedule for the tree (n = 8): the columns a1j a2j ... a8j, for j = 1, ..., m, enter the adder tree in successive clock periods.

Figure 1.1
2. Pipeline
2.1 Design of Pipeline
a. The task is able to be divided into a number of subtasks in sequence, and hopefully all subtasks require the same amount of time.
b. Design an independent hardware unit for each subtask.
c. Connect all units in order.
d. Arrange the input data and collect the output data.

First pipeline design (for Yj = a1j + a2j + a3j + a4j):
a. subtasks:
Subtask1: S1 = a1j + a2j
Subtask2: S2 = S1 + a3j
Subtask3: Yj = S2 + a4j
b. unit for subtask 1: an adder computing S1 from a1j and a2j
   unit for subtask 2: an adder computing S2 from S1 and a3j
   unit for subtask 3: an adder computing Yj from S2 and a4j
c. connect the three adders in a chain: (a1j, a2j) → S1, (S1, a3j) → S2, (S2, a4j) → Yj
d. input/output schedule (n = m = 4): the columns a1j, a2j, a3j, a4j are skewed with leading zeros so that each adder receives its operands in successive clock periods; the results Y1, Y2, Y3, Y4 emerge one per clock
Tp = ( n + m - 2 ) t
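A small numeric check of this result (a sketch; column j is taken to enter the first adder at clock j, with one clock per stage):

```python
def chain_pipeline_clocks(n, m):
    # chain of n - 1 adders; column j enters stage 1 at clock j and
    # leaves the last stage n - 2 clocks later
    finish = [j + (n - 2) for j in range(1, m + 1)]
    return max(finish)          # = n + m - 2

print(chain_pipeline_clocks(4, 4))   # 6
print(chain_pipeline_clocks(4, 10))  # 12
```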
Second pipeline design:
Subtask1: S1 = a1j + a2j and S2 = a3j + a4j
Subtask2: Yj = S1 + S2

b. unit for subtask 1: two adders in parallel compute S1 = a1j + a2j and S2 = a3j + a4j
   unit for subtask 2: one adder computes Yj = S1 + S2
c. connect the two stages: (a1j, a2j, a3j, a4j) → (S1, S2) → Yj
d. input/output schedule (n = m = 4): one full column a1j a2j a3j a4j enters per clock (a11 a21 a31 a41, then a12 a22 a32 a42, ...); results Y1, Y2, Y3, Y4 emerge one per clock
Tp = ( m -1 + log n ) t
2.2 RISC instruction pipeline
2.2.1 Features of RISC
• Register-oriented organization
  a) registers are the fastest available storage devices
  b) save memory access time for most data passing operations
  c) VLSI makes it possible to build a large number of registers on a single chip
• Few instructions:
  a) there are two groups of instructions: register to register, and memory to register or register to memory
  b) complex instructions are done by software
  c) fixed size instruction format
• Simple addressing modes
  a) simplify memory address calculation
  b) make it easy to pipeline
• Use of the register window technique
  a) optimizing compilers with fewer instructions
  b) based on a large register file
• CPU on a single VLSI chip
  a) avoids the long time delay of off-chip signals
  b) regular ALU and simple control unit
• Pipelined instruction execution
  a) two-, three-, or four-stage pipelines
  b) delayed branch and data forwarding
2.2.2 Two types of instruction in RISC
1) Register-to-register, each of which can be done by the following two phases:
   I: instruction fetch
   E: instruction execution
2) Memory-to-register (load) and register-to-memory (store), each of which can be done by the following three phases:
   I: instruction fetch
   E: memory address calculation
   D: memory operation

• Pipeline: a task or operation is divided into a number of subtasks that need to be performed in sequence, each of which requires the same amount of time. Speed-up = Ts / Tp, where Ts is the total time required for a job performed without pipelining and Tp is the time required with pipelining.

• Pipeline conflict: two subtasks on two pipeline units cannot be computed simultaneously; one task must wait until the other is completed.

• To resolve pipeline conflicts in the instruction pipeline the following techniques are used:
  a. insert a no-operation stage (N), used by the two-way pipeline
  b. insert a no-operation instruction (NOOP), used by the three-way and four-way pipelines

• Two-way pipeline: there is only one memory in the computer, which stores both instructions and data. Phases I and E can be overlapped. Example:
Load A ← M
Load B ← M
Add C ← A + B
Store M ← C
Branch X
X: Add B ← C + A
Halt
Figure 2.1 Sequential execution
Load A ← M
Load B ← M
Add C ← A + B
Store M ← C
Branch X
X: Add B ← C + A
Halt
Figure 2.2 Two-way pipelining
Speed-up = 17 / 11 = 1.54, where N represents the no-operation (waiting) stage that is inserted.

• Three-way pipeline: there are two memory units; one stores instructions, the other stores data. Phases I, E, and D can be overlapped. The NOOP instruction (two-stage) is used when a conflict would otherwise occur. Example:

Load A ← M
Load B ← M
NOOP
Add C ← A + B
Store M ← C
Branch X
NOOP
X: Add B ← C + A
Halt
Figure 2.3 Three-way pipelining
Speed-up = 17 / 10 = 1.7, where NOOP represents the no-operation instruction that is inserted.

• Four-way pipeline: same as three-way, but phase E is divided into E1 (register file read) and E2 (ALU operation and register write). Phases I, D, E1, and E2 can be overlapped. Example:

Load A ← M
Load B ← M
NOOP
NOOP
Add C ← A + B
Store M ← C
Branch X
NOOP
NOOP
X: Add B ← C + A
Halt
Figure 2.4 Four-way pipelining
Speed-up = 24 / 13 = 1.85
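The three speed-up figures can be reproduced by counting phases (a sketch; the pipelined cycle counts 11, 10, and 13 are read off Figures 2.2-2.4):

```python
def sequential_cycles(phases_per_instruction):
    # without pipelining, all phases simply execute back to back
    return sum(phases_per_instruction)

# program: Load, Load, Add, Store, Branch, Add, Halt
two_three_way = sequential_cycles([3, 3, 2, 3, 2, 2, 2])   # 17
four_way      = sequential_cycles([4, 4, 3, 4, 3, 3, 3])   # 24 (E split into E1, E2)

print(round(two_three_way / 11, 2))  # two-way
print(round(two_three_way / 10, 2))  # three-way
print(round(four_way / 13, 2))       # four-way
```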
• Optimization of pipelining: the goal is to reduce the number of NOOP instructions by reordering the sequence of the program. Example (three-way pipeline):
Inserted NOOP:
  100 Load A ← M
  101 Add B ← B + 1
  102 Jump 106
  103 NOOP
  106 Store M ← B

Reordered instructions (use of the delayed branch):
  100 Load A ← M
  101 Jump 105
  102 Add B ← B + 1
  105 Store M ← B

Figure 2.5
2.3 Mapping nested loop algorithms into systolic arrays
2.3.1 Systolic Array Processors
• A systolic array system consists of:
  (i) Host computer: controls the whole processing.
  (ii) Control unit: provides the system clock and control signals, inputs data to the systolic array, and collects results from the systolic array. Only boundary PEs are connected to the control unit.
  (iii) Systolic array: a multiple-processor network plus pipelining.
• Possible types of systolic arrays:
  (a) Linear array:
Figure 2.6
(b) Mesh connected 2-D array:
Figure 2.7
(c) 2-D hexagonal array
Figure 2.8

(d) Binary tree
Figure 2.9
(e) Triangular array
Figure 2.10
• Features of systolic arrays:
(i) PEs are simple, usually containing a few registers and simple ALU circuits
(ii) There are only a few types of PEs
(iii) Interconnections are local and regular
(iv) All PEs act rhythmically, controlled by a common clock
(v) The input data are used repeatedly as they pass through the PEs
Conclusion:
- systolic arrays are very suitable for VLSI design
- systolic arrays are recommended for computation-intensive algorithms
Example-1: matrix-vector multiplication: Y = A X, where A is an (n × n) matrix and Y and X are (n × 1) matrices.
Input schedule for the linear array (n = 4): the skewed diagonals of A enter over the time steps t0: a11; t1: a21, a12; t2: a31, a22, a13; t3: a41, a32, a23, a14; t4: a42, a33, a24; t5: a43, a34; t6: a44; with x1, x2, x3, x4 held in the PEs.
Figure 2.11
Example-2: matrix-matrix multiplication: C = A × B, where A, B, and C are (n × n) matrices.
The three data streams are skewed along time fronts so that matching elements meet in the proper PE: the aij enter diagonal by diagonal (a11; a21, a12; a31, a22, a13; ...), the bij likewise (b11; b12, b21; b13, b22, b31; ...), and the results cij flow out in the opposite direction (c11; c12, c21; ...).
A two-dimensional systolic array for matrix multiplication.
Figure 2.12
2.3.2 Mapping Nested Loop Algorithms onto Systolic Arrays
• Some design objective parameters:
  (i) Number of PEs (np): total number of PEs required.
  (ii) Computation time (tc): total number of clocks required for the systolic array to complete a computation.
  (iii) Pipeline period (p): time period of the pipelining.
  (iv) Space-time² (np tc²): used to measure cost and performance, supposing speed is more important than cost.

• In general, a mapping procedure involves the following steps:
  Step 1: algorithm normalization (all variables should carry all indices)
  Step 2: find the data dependence matrix of the algorithm, called D
  Step 3: pick a valid l × l transformation T (supposing the algorithm is an l-nested loop)
  Step 4: PE design and systolic array design
  Step 5: systolic array I/O scheduling
Step 1: Algorithm normalization
Some variables of a given nested algorithm may not carry all indices; these are called broadcasting variables. In order to find the data dependence matrix of the algorithm, it is necessary to eliminate all broadcasting variables by renaming or adding more variables.

Example: matrix-matrix multiplication C = A × B, which can be described by the following nested algorithm:

for i := 1 to n
  for j := 1 to n
    for k := 1 to n
      C(i, j) := C(i, j) + A(i, k) * B(k, j);

(suppose all C(i, j) = 0 before the computation), where A, B, and C are broadcasting variables.
Algorithm normalization:
for i := 1 to n
  for j := 1 to n
    for k := 1 to n
      begin
        A(i, j, k) := A(i, j-1, k);
        B(i, j, k) := B(i-1, j, k);
        C(i, j, k) := C(i, j, k-1) + A(i, j, k) * B(i, j, k);
      end

where A(i, 0, k) = aik, B(0, j, k) = bkj, C(i, j, 0) = 0, and C(i, j, n) = cij.
Step 2: Data dependence matrix
In a normalized algorithm, variables can be classified as:
(i) generated variables: those variables appearing on the left side of the assignment statements
(ii) used variables: those variables appearing on the right side of the assignment statements
Consider a normalized algorithm:
for i1 := lb1 to ub1
  for i2 := lb2 to ub2
    ...
    for il := lbl to ubl
      begin
        S1
        S2
        ...
        Sm
      end

Let A(i1', i2', ..., il') and B(i1'', i2'', ..., il'') be a pair of generated variable and used variable in a statement. The vector d = (i1' - i1'', i2' - i2'', ..., il' - il'') is called a data dependence vector, denoted dj, with respect to the pair A and B.
The data dependence matrix consists of all possible data dependence vectors of an algorithm.
Example: In the above example (matrix-matrix multiplication), the pairs for A, B, and C give

    D = [ dA  dB  dC ] = | 0  1  0 |
                         | 1  0  0 |
                         | 0  0  1 |

Step 3: Find a valid linear transformation matrix T
Suppose l = 3 (i1 = i, i2 = j, i3 = k) and

    T = | Π |
        | S |

where Π is a (1 × 3) matrix and S is a (2 × 3) matrix. T is a valid transformation if it meets the following conditions:
  (a) det(T) ≠ 0
  (b) Π di > 0 for all i

In fact, T can be viewed as a linear mapping which maps an index point (i, j, k) of the index space into (t, x, y):

    (t, x, y)ᵀ = T (i, j, k)ᵀ

Π is called the time transformation, S is called the space transformation. det(T) ≠ 0 implies that it is a one-to-one mapping. A computation originally performed at (i, j, k) is now computed at a PE located at (x, y) at time t, where t and (x, y) are determined as shown above. Note that for a given algorithm there may exist a large number of valid transformations. An important goal of mapping is to choose an "optimal" one, as defined by the user; for example, one reaching the minimum value of tc² np.
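Both validity conditions are easy to check mechanically. A minimal sketch (plain Python; D is the dependence matrix found above, and the particular T shown is an assumption, one valid choice consistent with the PE locations derived in Step 4):

```python
def det3(M):
    # determinant of a 3x3 matrix by cofactor expansion
    a, b, c = M[0]; d, e, f = M[1]; g, h, i = M[2]
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def is_valid_transformation(T, D):
    # (a) det(T) != 0;  (b) Pi . dj > 0 for every column dj of D
    if det3(T) == 0:
        return False
    pi = T[0]                       # first row of T is the time transformation
    cols = len(D[0])
    return all(sum(pi[r] * D[r][j] for r in range(3)) > 0 for j in range(cols))

D = [[0, 1, 0],     # columns: dA, dB, dC
     [1, 0, 0],
     [0, 0, 1]]
T = [[1, 1, 1],     # Pi
     [-1, 0, 0],    # S
     [0, 1, 0]]
print(is_valid_transformation(T, D))   # True
```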
Step 4.1: PE design, including:
(a) arithmetic unit design according to the computations required in the body of the loop
(b) inputs and outputs of the PE and their directions.

The number of inputs and outputs is always the same as the number of vectors in the matrix D. The direction of each pair of (input, output) in the X-Y plane is determined by the mapped vector S dj.

Step 4.2: Systolic array design, including:
(a) Locations of all PEs, which are obtained from the equation (x, y)ᵀ = S (i, j, k)ᵀ, where (i, j, k) ranges over all possible index points in the given algorithm. For instance, in the above example with n = 3, the set of all possible index points is J = {(1, 1, 1), (1, 1, 2), ..., (3, 3, 3)}. All possible different values of the pair (x, y) form the locations of the PEs.
(b) After having located all PEs in the X-Y plane, the interconnections between PEs are obtained according to the data path of each pair of input and output determined in the PE design stage.
Example: Consider the matrix-matrix multiplication again. We have:

    T = | Π | = |  1  1  1 |
        | S |   | -1  0  0 |
                |  0  1  0 |

• PE design: According to the body of the algorithm, the PE requires one adder and one multiplier. Since D has three vectors, there are three pairs of inputs and outputs: (ain, aout), (bin, bout), and (cin, cout).
PE function:
  aout := ain
  bout := bin
  c := c + ain * bin

Figure 2.13

The mapped vector S dC = (0, 0)ᵀ indicates that variable c will stay in the PE register.

• More detailed design of the PE: one multiplier and one adder with the c register, inputs ain, bin and outputs aout, bout.

Figure 2.14

• Array design (n = 3)
According to (x, y)ᵀ = S (i, j, k)ᵀ we have the following PE location table, where 1 ≤ i ≤ 3, 1 ≤ j ≤ 3, and 1 ≤ k ≤ 3:

  (i, j, k), k = 1, 2, 3    (x, y)
  (1, 1, k)                 (-1, 1)
  (1, 2, k)                 (-1, 2)
  (1, 3, k)                 (-1, 3)
  (2, 1, k)                 (-2, 1)
  (2, 2, k)                 (-2, 2)
  (2, 3, k)                 (-2, 3)
  (3, 1, k)                 (-3, 1)
  (3, 2, k)                 (-3, 2)
  (3, 3, k)                 (-3, 3)

Table 2.1
All possible PE locations are (-1, 1), (-1, 2), (-1, 3), (-2, 1), (-2, 2), (-2, 3), (-3, 1), (-3, 2), and (-3, 3).
Place all PEs on the X-Y plane according to those locations and connect them according to the data paths determined in the PE design. Thus, we have the following array ( np = 9 ):
Figure 2.15
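The whole mapping of Step 4 can be reproduced in a few lines (a sketch; T is the transformation assumed for this example):

```python
def apply_T(T, point):
    # (t, x, y) = T (i, j, k)
    return tuple(sum(T[r][c] * point[c] for c in range(3)) for r in range(3))

T = [[1, 1, 1],     # Pi: t = i + j + k
     [-1, 0, 0],    # S:  x = -i
     [0, 1, 0]]     #     y = j

n = 3
index_points = [(i, j, k) for i in range(1, n + 1)
                          for j in range(1, n + 1)
                          for k in range(1, n + 1)]
mapped = [apply_T(T, p) for p in index_points]

pes = sorted({(x, y) for (t, x, y) in mapped})   # distinct PE locations
times = [t for (t, x, y) in mapped]
tc = max(times) - min(times) + 1

print(len(pes))   # 9 PEs, from (-3, 1) to (-1, 3)
print(tc)         # 7 = 9 - 3 + 1
```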
Step5: I/O Scheduling: determine locations of all input and output data.
Computation time: tc = tmax - tmin + 1
In the above example, we have tmax = 9, tmin = 3, tc = 9 – 3 + 1 =7. Reference time tref = tmin – 1 = 2. Input data are: a11 a12 a13 a21 a22 a23 a31 a32 a33
b11 b12 b13 b21 b22 b23 b31 b32 b33
c11 c12 c13 c21 c22 c23 c31 c32 c33
Output data are: c11 c12 c13 c21 c22 c23 c31 c32 c33
Location of a11: according to the assumption of the normalized algorithm, a11 = A(1, 0, 1). Its time is t = Π (1, 0, 1)ᵀ = 2 = tref, so a11 should be placed at x = -1 and y = 0.

Location of a12: a12 = A(1, 0, 2), so t = Π (1, 0, 2)ᵀ = 3 > tref = 2; therefore a12 will be placed at x = -1 and y = -1 (shifted back by the difference along the direction of the "a" data path).

Location of c11: c11 = C(1, 1, 0), so t = Π (1, 1, 0)ᵀ = 2 = tref; therefore c11 is located in the PE at x = -1 and y = 1.

Continue this process. Finally, we have the following I/O-scheduled systolic design:
Input streams (skewed with leading zeros): the b columns (b11 b21 b31; 0 b12 b22 b32; 0 0 b13 b23 b33) enter along one direction and the skewed a stream (a11; 0 a21 a12; 0 0 a31 a22 a13; a32 a23; a33) along the other; the results c11 ... c33 stay in their PEs.
Figure 2.16
2.3.3 Approaches To Find Valid Transformations
For a given algorithm, there may exist a large number of valid transformations. An important task in the mapping problem is to find an efficient systematic approach to search all possible solutions and find an optimal one. In general, we hope to find a valid transformation such that:
(i) computation time is minimized (ii) total number of PEs is minimized (iii) pipeline period is 1
As shown in the above example, all these parameters ( tc, np, p ) can be calculated after the whole design steps are completed. In the following, we give simple formulae for those design parameters without proof.
Suppose the normalized algorithm has the form:
for i := 1 to l1
  for j := 1 to l2
    for k := 1 to l3
      begin
        S1
        S2
        ...
        Sm
      end

The valid transformation is T = [Π; S], with Π = (t11, t12, t13).

• Computation time (tc): tc = (l1 - 1)|t11| + (l2 - 1)|t12| + (l3 - 1)|t13| + 1
• Number of PEs ( np ):
Tij is the ( i, j )-cofactor of matrix T.
• Pipeline period ( p ):
• Approach-1:
Three basic row operations:
(1) rowi ⟷ rowj
(2) rowi := rowi + K · rowj, i ≠ j, K an integer
(3) rowi := rowi · (-1)

Each row operation can be done by multiplying an elementary matrix into the given matrix.

Suppose for a given matrix D, after applying basic row operations (R1, R2, ..., Rn) we obtain a matrix in which all elements of the first row are positive integers. We conclude that the matrix T = Rn ... R2 R1 is a valid transformation.

In fact, it is easy to verify:
(i) For T = Rn ... R2 R1, det(T) = det(Rn) ... det(R1); since each det(Ri) = ±1, det(T) ≠ 0.
(ii) Since all elements of the first row of T D are > 0, Π di > 0 for all i.

Besides, according to the formula for the pipeline period, we have p = 1.
Example: we use the same example again.
• Approach-2:
Given D = [d1, d2, ..., dm], where m is the number of vectors in the data dependence matrix D.
Define P, the (2 × 9) universal neighbour connection matrix, whose columns are the nine possible unit moves of a PE in the X-Y plane (the offsets with components in {-1, 0, 1}).
Step 1: select a Π such that Π di > 0 for 1 ≤ i ≤ m.
Step 2: pick a (9 × m) matrix K.
Step 3: find S by solving the equation S D = P K.
Step 4: if det(T) ≠ 0 then done; otherwise pick another Π and/or K and repeat these steps until a valid transformation matrix T is found.

Example: Given a dependence matrix D.
Step 1: pick Π = [1  0  -1]

From the equations:
  t21 - t22 = 0
  t21 - t23 = 0
  t21 + t22 - 2 t23 = 0
  3 t22 - 2 t23 = 1
and
  t31 - t32 = 0
  t31 - t33 = 0
  t31 + t32 - 2 t33 = 0
  3 t32 - 2 t33 = 1
3. Analysis of Parallelism in Computer Algorithms
3.1 Data dependences in non-loops
•> Data dependence of two computations: one computation cannot start until the other one is completed.
•> Parallelism detection: find out sets of computations that can be performed simultaneously.
•> Types of data dependencies in non-loop algorithms or programs: Suppose that Si and Sj are two computations (statements or instructions in a non-loop algorithm or program), and Ii, Ij and Oi, Oj are the inputs and outputs of Si and Sj, respectively.
a) Type-1: data flow dependence ( read after write, denoted by RAW )
•) condition: (1) j > i  (2) Ij ∩ Oi ≠ ∅
•) example-1:
  100 R1 ← R2 op R3
  ...
  105 R5 ← R1 + R4
•) example-2:
  Si: A := B + C
  Sj: D := A · E + 2
•) notation: Si → Sj (computation Sj depends on computation Si)
b) Type-2: data antidependence (write after read, denoted by WAR)
•) condition: (1) j > i  (2) Oj ∩ Ii ≠ ∅
•) example-1:
  200 R1 ← R2 op R3
  ...
  205 R5 ← R2 + R5
•) example-2:
  Si: A := B · C
  Sj: B := D + 5
•) notation: Si -→ Sj (computation Sj depends on computation Si)

c) Type-3: data-output dependence (write after write, denoted by WAW)
•) condition: (1) j > i  (2) Oi ∩ Oj ≠ ∅
•) example:
  Si: A := B + C
  ...
  Sj: A := D · E
•) notation: Si ○→ Sj (computation Sj depends on computation Si)
•> Data dependence graph (DDG): represents all data dependencies among statements (instructions). Example: find the DDG for the following program.

100 R1 ← R3 div R2
101 R4 ← R4 op R5
102 R6 ← R1 + R3
103 R4 ← R2 + R3
104 R2 ← R4 + R2
Solution:
  Si, Sj     Dependency           di
  100, 101   no
  100, 102   type-1               d1
  100, 103   no
  100, 104   type-2               d2
  101, 102   no
  101, 103   type-3 and type-2    d4 and d5
  101, 104   type-1               d6
  102, 103   no
  102, 104   no
  103, 104   type-1 and type-2    d7 and d8
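The whole table can be generated by testing the three conditions on register sets. A sketch (the unknown operator of instruction 101 is written as a generic "op"; only the registers matter here):

```python
def dependencies(stmts):
    # stmts: (outputs, inputs) register sets in program order
    deps = {}
    for i in range(len(stmts)):
        for j in range(i + 1, len(stmts)):
            oi, ii = stmts[i]
            oj, ij = stmts[j]
            found = []
            if oi & ij: found.append("type-1 (RAW)")   # Ij ∩ Oi
            if oj & ii: found.append("type-2 (WAR)")   # Oj ∩ Ii
            if oi & oj: found.append("type-3 (WAW)")   # Oi ∩ Oj
            if found:
                deps[(i, j)] = found
    return deps

prog = [({"R1"}, {"R3", "R2"}),   # 100: R1 <- R3 div R2
        ({"R4"}, {"R4", "R5"}),   # 101: R4 <- R4 op R5
        ({"R6"}, {"R1", "R3"}),   # 102: R6 <- R1 + R3
        ({"R4"}, {"R2", "R3"}),   # 103: R4 <- R2 + R3
        ({"R2"}, {"R4", "R2"})]   # 104: R2 <- R4 + R2

d = dependencies(prog)
print(d[(0, 2)])   # 100 -> 102
print(d[(1, 3)])   # 101 -> 103
```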
DDG: nodes 100-104, with edges d1 (100 → 102), d2 (100 → 104), d4 and d5 (101 → 103), d6 (101 → 104), d7 and d8 (103 → 104).
3.2 Data dependences in loops
Consider the case of a single loop in the form:

for l := l1 to ln        DO l = l1, ln
begin                      S1
  S1                       S2
  S2            or         ...
  ...                      SN
  SN                     ENDDO
end
A loop can be viewed as a sequence of statements in which the statements ( the body of an algorithm ) are repeated n times.
e.g., ( n = 3, N = 2 )
There are still three types of data dependencies, but the conditions must be modified. Suppose that Si(li) and Sj(lj) are two statements in the loop, where li and lj are the index values at which Si and Sj, respectively, reference the shared variable.

a) Type-1 (RAW):
•) condition: (1) li - lj > 0, or li = lj and j > i
              (2) Oi ∩ Ij ≠ ∅
•) notation: Si →(d) Sj, distance d = li - lj ≥ 0
•) example:
for l := 1 to 5
begin
  S1: B(l + 1) := A(l) · C(l + 1)
  S2: C(l + 4) := B(l) · A(l + 1)
end

  Si  Sj  shared variable  li - lj = d
  S1  S2  B                (l + 1) - l = 1          d = 1
  S2  S1  C                (l + 4) - (l + 1) = 3    d = 3
b) Type-2 (WAR):
•) condition: (1) li - lj > 0, or li = lj and j > i
              (2) Oj ∩ Ii ≠ ∅
•) notation: Si -→(d) Sj, distance d = li - lj ≥ 0

c) Type-3 (WAW) (i ≠ j):
•) condition: (1) li - lj > 0, or li = lj and j > i
              (2) Oi ∩ Oj ≠ ∅
•) notation: Si ○→(d) Sj, distance d = li - lj ≥ 0

•) example:
DO l = 1, 4
  S1: A(l) := B(l) + C(l)
  S2: D(l) := A(l - 1) + C(l)
  S3: E(l) := A(l - 1) + D(l - 2)
ENDDO
  Si  Sj  shared variable  li - lj = d        type    di
  S1  S1  no
  S1  S2  A                l - (l - 1) = 1    type-1  d1
  S1  S3  A                l - (l - 1) = 1    type-1  d2
  S2  S1  A                (l - 1) - l = -1   no
  S2  S2  no
  S2  S3  D                l - (l - 2) = 2    type-1  d3
  S3  S1  A                (l - 1) - l = -1   no
  S3  S2  D                (l - 2) - l = -2   no
  S3  S3  no
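Distances like these can be computed mechanically from the array references. A sketch (statements are numbered, and each reference A(l + offset) is recorded as its offset):

```python
def loop_distances(writes, reads):
    # writes/reads: {statement: [(array, index offset)]} for references A(l + offset)
    # a dependence Si -> Sj exists when d = offset_w - offset_r > 0,
    # or d == 0 and j > i (same iteration, Sj later in the body)
    deps = []
    for si, wlist in writes.items():
        for arr, ow in wlist:
            for sj, rlist in reads.items():
                for arr2, oread in rlist:
                    if arr2 != arr:
                        continue
                    d = ow - oread
                    if d > 0 or (d == 0 and sj > si):
                        deps.append((si, sj, arr, d))
    return sorted(deps)

# the DO-loop example above
writes = {1: [("A", 0)], 2: [("D", 0)], 3: [("E", 0)]}
reads  = {1: [("B", 0), ("C", 0)],
          2: [("A", -1), ("C", 0)],
          3: [("A", -1), ("D", -2)]}

print(loop_distances(writes, reads))
# [(1, 2, 'A', 1), (1, 3, 'A', 1), (2, 3, 'D', 2)]  -> d1, d2, d3
```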
DDG (l = 1 to 4): each iteration contains S1, S2, S3; edges d1 (S1 → S2, distance 1), d2 (S1 → S3, distance 1), and d3 (S2 → S3, distance 2) link the iterations.
3.4. Techniques for removing some Data Dependencies
• Variable renaming.
S1: A := B · C                S1: A := B · C
S2: D := A + 1    ======>     S2: D := A + 1
S3: A := A · D                S3: AA := A · D

Data dependencies: renaming the output of S3 to AA removes the output dependence S1 ○→ S3 (WAW on A) and the antidependence S2 -→ S3 (WAR on A); the flow dependencies S1 → S2, S1 → S3, and S2 → S3 remain.
• Node splitting (introducing new variable).
DO l = 1, N                          DO l = 1, N
                                       S0: AA(l) := A(l + 1)
  S1: A(l) := B(l) + C(l)   ======>    S1: A(l) := B(l) + C(l)
  S2: D(l) := A(l) + 2                 S2: D(l) := A(l) + 2
  S3: F(l) := D(l) + A(l+1)            S3: F(l) := D(l) + AA(l)
ENDDO                                ENDDO

Data dependencies: on the left, S1 → S2 and S2 → S3 have distance 0, and there is an antidependence on A with distance 1 (S3 reads A(l+1), which S1 writes in the next iteration); on the right, that antidependence is moved onto the new variable AA captured by S0, and S1, S2, S3 depend on each other only within the same iteration.

• Variable substitution.
S1: X := A + B                S1: X := A + B
S2: Y := C · D    ======>     S2: Y := C · D
S3: Z := E·X + F·Y            S3: Z := E(A + B) + F(C · D)

Data dependencies: after substitution, S3 no longer depends on S1 or S2, so all three statements can be executed in parallel.
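The effect of node splitting can be checked numerically: once S0 has captured the old A(l+1) values, the remaining statements carry only distance-0 dependencies, so their loop can run in any iteration order. A sketch with made-up data:

```python
N = 6
B = [float(i) for i in range(N + 2)]
C = [2.0 * i for i in range(N + 2)]
A0 = [10.0 * i for i in range(N + 2)]   # initial contents of A

# original loop, sequential order
A = A0[:]; D = [0.0] * (N + 2); F = [0.0] * (N + 2)
for l in range(1, N + 1):
    A[l] = B[l] + C[l]
    D[l] = A[l] + 2
    F[l] = D[l] + A[l + 1]              # reads the old A(l+1)
F_ref = F[:]

# after node splitting: S0 captures the old A(l+1) values first, then the
# remaining loop body is independent across iterations (run here reversed,
# as a DOALL scheduler might)
A = A0[:]; D = [0.0] * (N + 2); F = [0.0] * (N + 2)
AA = [0.0] * (N + 2)
for l in range(1, N + 1):
    AA[l] = A[l + 1]
for l in range(N, 0, -1):
    A[l] = B[l] + C[l]
    D[l] = A[l] + 2
    F[l] = D[l] + AA[l]

assert F[1:N + 1] == F_ref[1:N + 1]     # same results in any order
```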
4. Parallel Computer Architectures
4.1 Review of Computer Architecture
a. Basic Components of a Computer System
  CPU: includes ALU and Control Unit
  Memory: cache memory, main memory, secondary memory
  I/O: DMA, I/O interface, and I/O devices
  Bus: a set of communication pathways (data, address, and control signals) connecting two or more components
b. Single Processor Configurations
  Multiple bus connected computer systems.
Figure 4.1
Single bus connected systems: the CPU (with cache), memory, DMA, I/O units, and an arbiter all share one system bus.
Figure 4.2
Note: At any time only one unit can control the bus, though two or more units may request the bus at the same time. The arbiter is used to select which unit will use the bus next. BR (bus request), BG (bus grant), SACK (select acknowledgement), and BBSY (bus busy) are the most important control signals used by the arbiter. The timing diagram of these control signals is shown in Figure 4.3.
Figure 4.3
c. Central Processing Unit (CPU)
  ALU: performs all logic and arithmetic operations and has some registers which store operands and results.
  Control Unit: produces all control signals such that instructions can be fetched and executed one by one. The control unit can be implemented by logic circuits or by microprogramming.

An instruction cycle includes several smaller function cycles, e.g.:
1. instruction fetch cycle
2. indirect (operand fetch) cycle
3. execution cycle

d. Memory
• Stores instructions and data
• Memory configuration: cache ↔ main memory (hardware mapping), main memory ↔ secondary memory (DMA transfer)

Figure 4.4

• cache memory: a small fast memory which is placed between the CPU and main memory
• why cache? there is a big speed gap between the CPU and main memory
• hardware mapping: direct mapping, associative mapping, and set associative mapping

e. Inputs and Outputs (I/O)
• Programmed I/O: the CPU is idle while waiting
• Interrupt-driven I/O: the CPU is interrupted when I/O requires service
Figure 4.5
• DMA I/O: moves data directly to/from memory
Figure 4.6
4.2 Computer Classification
According to Flynn's classification all normal stored program computers (von Neumann computers) can be classified into:
• Single-instruction single-data computers (SISD)
• Single-instruction multiple-data computers (SIMD)
• Multiple-instruction multiple-data computers (MIMD)
• Multiple-instruction single-data computers (MISD)
Figure 4.7
4.3 SIMD Computers
A single control unit generates a single instruction stream and the instructions are broadcast to all other processors (PEs). Each processor executes the same instruction on different data.
There are two kinds of SIMD computers: (a) Shared memory SIMD
Figure 4.8
(i) Memories are separated from PEs (ii) There are several IN architectures for shared memory access:
(1) Single bus
Figure 4.9
(2) Multiple buses
Figure 4.10
(3) Cross-bar switch
Figure 4.11
(4) Multistage interconnection networks
Figure 4.12
(b) Local memory SIMD
Figure 4.13
(i) Each PE has its own memory (ii) The interconnection network (IN) can be:
(1) Mesh
Figure 4.14
(2) Nodes with eight links
Figure 4.15
(3) Cube
Figure 4.16
(4) Exhaustive
Figure 4.17
(5) Tree
Figure 4.18
(iii) There are two kinds of instructions:
  - vector instructions
  - non-vector instructions
(iv) All control instructions and scalar instructions (non vector) are executed by the host.
(v) If it is a vector instruction, the CU broadcasts:
  - the instruction to all PEs
  - a constant to all PEs (if it is needed)
  - an address to all PEs (if the instruction requires PEs to access data from their memories)
Note: Each PE has its own address index register
read address = address sent by CU + index register
Suppose n = 4 and each PE has three address index registers: RA, RB, and RC. For the matrix-matrix multiplication C = A × B, the data stored in the four PEs are as follows:
Figure 4.19
Program:
for i = 1 to n
  in parallel for all j, 1 ≤ j ≤ n
    RC(j) = 0                /* processors clear RC registers */
Memory Mj of PE j (j = 1, ..., 4) holds column j of each matrix: a1j ... a4j, b1j ... b4j, and c1j ... c4j.
  for k = 1 to n
    fetch a(i, k)            /* by control unit */
    broadcast a(i, k)        /* by control unit */
    RA(j) = a(i, k)          /* processors store broadcast data */
    RB(j) = b(k, j)          /* processors fetch local operands */
    RA(j) = RA(j) * RB(j)    /* processors multiply */
    RC(j) = RA(j) + RC(j)    /* processors update RC registers */
  end loop k
  c(i, j) = RC(j)            /* processors store RC registers */
end loop i
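The broadcast algorithm can be simulated directly (a sketch; PE j is modelled as index j of the register arrays, and the matrices are made-up test data):

```python
n = 4
# hypothetical input matrices
a = [[i + j for j in range(n)] for i in range(n)]
b = [[(i * j) % 3 + 1 for j in range(n)] for i in range(n)]
c = [[0] * n for _ in range(n)]

RA = [0] * n; RB = [0] * n; RC = [0] * n   # one register set per PE
for i in range(n):
    for j in range(n):                     # "in parallel for all j"
        RC[j] = 0
    for k in range(n):
        broadcast = a[i][k]                # CU fetches and broadcasts a(i, k)
        for j in range(n):                 # lock-step: same instruction on all PEs
            RA[j] = broadcast
            RB[j] = b[k][j]                # operand from PE j's local memory
            RA[j] = RA[j] * RB[j]
            RC[j] = RA[j] + RC[j]
    for j in range(n):
        c[i][j] = RC[j]

# check against the ordinary triple loop
ref = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
       for i in range(n)]
assert c == ref
```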
4.4 MIMD Computers
In a MIMD system, a number of independent processors operate upon separate data concurrently. There are two types of MIMD computers:
1. Shared memory MIMD
2. Message passing MIMD
4.4.1 Shared memory MIMD
Shared memory MIMD uses a centralized memory for communication purposes, as shown in Fig. 4.20: processors P1, P2, ..., Pn connect through an interconnection network to memories M1, M2, ..., Mn.

Figure 4.20 Shared memory MIMD

Possible interconnection networks (see shared memory SIMD):
  Single bus
  Multiple buses
  Cross-bar switch
  Multistage network
Advantages of shared memory MIMD: easy programming
Disadvantages of shared memory MIMD: memory contention, which happens when the shared memory receives many processor requests within a very short time period; some processors must wait for other processors to be served. Contention increases as the number of processors increases.

Some techniques helping reduce the memory contention:
  Local cache memory
  Shared memory with multiple memory modules
4.4.2 Message passing MIMD
Message passing MIMD consists of a number of independent processors, each of which has its own memory. All processors connect through an interconnection network, as shown in Fig. 4.21. Communications between processors are done by passing data through the direct links.
Figure 4.21 Message Passing MIMD
Possible interconnection networks (see local memory SIMD):
  Mesh
  Node with eight links
  Cube
  Exhaustive (star)
  Tree

Advantages of message passing MIMD: a large number of processors can be used; potentially higher performance.

Disadvantages of message passing MIMD: difficult programming; higher communication cost.
5. Parallel Programming
5.1 Parallelism for Single Loop Algorithms
FORALL (DOALL) and FORALL GROUPING Statements
• FORALL (DOALL) is the parallel form of FOR (DO) statement for loops.
• Condition for using FORALL (DOALL): there are no data dependences in the loop, or the distance of all data dependences is zero.

• Assumption: there are n + 1 processors, where n is the number of iterations in the loop.
• Consider a loop:
for l := l1 to ln         DOALL l = l1, ln
begin                       S1
  S1                        S2
  S2            ======>     ...
  ...                       Sm
  Sm                      ENDDO
end

It requires the following two steps:
1. A processor creates n processes by assigning one iteration to each process.
2. The n processors execute the n iterations at the same time.

Tp = n t2 + t1, where t1 is the processing time for one iteration and t2 is the time to create one process (the processes are created one after another, then run in parallel).

Example:
For i := 1 to 100
begin
  A(i) := SQRT(A(i));     (S1)
  B(i) := A(i) · B(i);    (S2)
end
Data dependence graph: S1 → S2 with d1 = 0 and d2 = 0 (all distances are zero, so FORALL applies).
applying FORALL:
FORALL i := 1 to 100 DO
begin
  A(i) := SQRT(A(i));
  B(i) := A(i) · B(i);
end

Assume the process creation time is 10 time units and the execution time of each iteration is 100 time units.
It requires 100 + 1 = 101 processors.
• Process granularity
Granularity of a process refers to the execution time of each process (iteration). In order to have meaningful speed-up, the execution time of the process must be larger than the creation time (overhead).

For example, in the previous example, if the creation time is 50 time units and the execution time of each process is also 50 time units, the overhead cancels the gain.

In parallel programming, the keyword GROUPING is used to group a number of iterations together to overcome this granularity problem. It can also be used when the number of processes is larger than the number of processors in the system.

Format: FORALL ... GROUPING G DO, where G is the number of processes (iterations) in a group. Applying FORALL GROUPING to the previous example:

FORALL i := 1 to 100 GROUPING 10 DO
begin
  A(i) := SQRT(A(i));
  B(i) := A(i) · B(i);
end

• Optimal group size
Let:
  n be the total number of iterations
  G be the group size
  C be the time to create one process
  T be the execution time of each iteration
The total time for FORALL ... GROUPING G is then (n/G) C + G T (the n/G processes are created one after another, then each executes its G iterations).
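Under this cost model (an assumption: n/G sequential creations followed by fully parallel groups), the minimizing group size follows from elementary calculus:

```python
import math

def total_time(n, G, C, T):
    # (n / G) sequential process creations, then each process runs G iterations
    return (n / G) * C + G * T

def best_group_size(n, C, T):
    # d/dG [(n/G)C + GT] = -nC/G^2 + T = 0  =>  G* = sqrt(nC / T)
    return math.sqrt(n * C / T)

# the earlier example: n = 100 iterations, creation C = 10, iteration T = 100
print(total_time(100, 10, 10, 100))      # 1100.0 with the G = 10 used above
print(best_group_size(100, 10, 100))     # sqrt(10), about 3.16
```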
DOACROSS Transformation for loops with no zero distance data dependence(s)
a) Assumptions:
• There are N processors.
• Each processor can communicate with the others.
42
• The single loop algorithm has been analyzed and some data dependencies have been removed. The algorithm will have the following form, and there is (are) data dependence(s) among the statements:

DO l = 1, N
  S1
  S2
  ...
  Sn
ENDDO
• Each statement, Si requires the same time, t.
b) DOACROSS is a parallel language construct to transform sequential loops into parallel forms. DOACROSS is used when there are data dependencies between iterations.
•> Each processor performs all computations that belong to the same iteration; e.g., all computations belonging to the first iteration are performed by processor 1, and processor n performs all computations belonging to the n-th iteration.

•> If there is a data dependency dk between Si and Sj with dk ≠ 0, then a special statement, synchronization dk, must be inserted before statement Sj. This means that statement Sj cannot be executed until the corresponding instance of statement Si is completed.
•> A special table, called space-time diagram, can be used to calculate the speed-up and other performance indices.
Example-1:
DO l = 1, 4
  S1 : A(l) := B(l) + C(l)
  S2 : D(l) := B(l-1) + C(l)
  S3 : E(l) := A(l-1) + D(l-2)
ENDDO
Data dependencies:
d1 = 1
d2 = 2
DDG: [figure omitted: nodes S1, S2, S3 for iterations l = 1 … 4, with edges S1 → S3 (distance d1 = 1) and S2 → S3 (distance d2 = 2).]
DOACROSS Transformation:
DOACROSS l = 1, 4
  S1 : A(l) := B(l) + C(l)
  S2 : D(l) := B(l-1) + C(l)
  Synchronization d1
  Synchronization d2
  S3 : E(l) := A(l-1) + D(l-2)
ENDDOACROSS
Space-time Diagram: (assuming no time is required for Synchronization)
[diagram omitted: processors 1–4 (one per iteration) on the vertical axis and time slots on the horizontal axis; each processor executes S1, S2, and then S3 once its synchronizations are satisfied.]
Example-2:
DO l = 1, 4
  S1 : B(l) := A(l-2) + 2
  S2 : A(l) := D(l) + C(l)
  S3 : C(l) := A(l-1) + 3
ENDDO
Data dependencies:
d1 = 2
d2 = 1
DDG: [figure omitted: nodes S1, S2, S3 for iterations l = 1 … 4, with edges S2 → S1 (distance d1 = 2) and S2 → S3 (distance d2 = 1).]
DOACROSS Transformation:
DOACROSS l = 1, 4
  Synchronization d1
  S1 : B(l) := A(l-2) + 2
  S2 : A(l) := D(l) + C(l)
  Synchronization d2
  S3 : C(l) := A(l-1) + 3
ENDDOACROSS
Space-time Diagram: (assuming no time is required for Synchronization)
[diagram omitted: processors 1–4 versus time slots 1–5.]
Note: If the order of statements S1 and S2 is exchanged (the algorithm remains equivalent to the original one), then the speed-up equals 4 (see the same example in the textbook on page 138).
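The completion times of Example-2 can be checked with a small simulation. This Python sketch encodes the model stated above (each statement takes one time unit t, Synchronization itself costs nothing, and processor l executes iteration l's statements in program order); the data structures and names are illustrative assumptions.

```python
# Sketch: completion time of a DOACROSS loop under the model above.
def doacross_time(n_iters, stmts):
    """stmts: list of (name, waits); waits maps a statement name S to a
    distance d, meaning this statement waits for S of iteration l-d."""
    finish = {}  # (statement, iteration) -> finish time
    for l in range(1, n_iters + 1):      # iteration l runs on processor l
        t = 0                            # local clock of processor l
        for name, waits in stmts:
            ready = t
            for src, d in waits.items():
                if (src, l - d) in finish:           # source iteration exists
                    ready = max(ready, finish[(src, l - d)])
            finish[(name, l)] = ready + 1
            t = ready + 1
    return max(finish.values())

# Original order: Synchronization d1 (S2, distance 2) guards S1,
# Synchronization d2 (S2, distance 1) guards S3.
orig = [("S1", {"S2": 2}), ("S2", {}), ("S3", {"S2": 1})]
# With S1 and S2 exchanged, as in the note above.
swapped = [("S2", {}), ("S1", {"S2": 2}), ("S3", {"S2": 1})]

print(doacross_time(4, orig))     # 5 time slots -> speed-up 12t/5t = 2.4
print(doacross_time(4, swapped))  # 3 time slots -> speed-up 12t/3t = 4
```

The swapped order completes in 3 time slots, reproducing the speed-up of 4 quoted from the textbook.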
Pipelining Transformation
a) Assumptions:
•> There are n processors and n statements in each iteration; each processor computes one statement of the loop. •> The n processors form a pipeline (ring).
Figure 5.1 [figure omitted: processors P1 → P2 → … → Pn connected in a pipeline; the last stage produces the output.]
Example:
DO l = 1, 4
  S1 : A(l) := A(l-1) + C(l)
  S2 : B(l) := C(l) + B(l-1)
  S3 : C(l) := A(l-2) + C(l-1)
ENDDO
Data dependencies: d1 = 1 , d2 = 1 , d3 = 2
DDG: [figure omitted: nodes S1, S2, S3 for iterations l = 1 … 4, with edges corresponding to d1 = 1 (S1 → S1), d2 = 1 (S2 → S2), and d3 = 2 (S1 → S3).]
Space-time Diagram: (by pipelining transformation)
[diagram omitted: three processors (P1 for S1, P2 for S2, P3 for S3) versus time slots 1–5; the pipeline fills stage by stage — first S1 only (A), then S1 and S2 (A, B), then all three statements (A, B, C).]
Compare with the space-time diagram by DOACROSS Transformation:
[diagram omitted: four processors versus time slots 1–6.]
a) Example of vector processor program: A + B = C, where A, B, and C are n-element vectors.
Vector load  VR1            ; A → VR1
Vector load  VR2            ; B → VR2
Vector add   VR1, VR2, VR3  ; VR1 + VR2 → VR3
Vector store VR3            ; VR3 → C
b) Loop vectorization: If there are no data dependence cycles in the body of a loop, then the vector computer can assign one iteration to each ALU such that all iterations can be executed simultaneously.
Note: In a high level language, vector operation for C = A + B is represented by
C(1:n) = A(1:n) + B(1:n)
If there is a data dependence cycle, then only statements outside the cycle can be vectorized.
Example:
DO l = 1, N
  S1 : D(l) := A(l+1) + 3
  S2 : A(l) := B(l-1) + C(l)
  S3 : B(l) := A(l) - 5
ENDDO
Data dependencies:
There is a cycle between S2 and S3. Only S1 can be vectorized.
S1 : D(1:N) := A(2:N+1) + 3
DO l = 1, N
  S2 : A(l) := B(l-1) + C(l)
  S3 : B(l) := A(l) - 5
ENDDO
Speed-up: T1 = 3 N t , Tp = (1 + 2N) t , Speed-up = T1 / Tp = 3N / (2N + 1) ≈ 1.5
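To see that hoisting S1 out as a vector statement preserves the result (only S2 and S3 form the dependence cycle), a quick check can be sketched with plain Python lists and made-up data; the function names and values here are illustrative assumptions, not part of the course material.

```python
# Sketch: the sequential loop versus the partially vectorized version above.
# Arrays are 1-based; index 0 holds the boundary value, and A has one extra
# element because S1 reads A(l+1).
def sequential(A, B, C, N):
    A, B = A[:], B[:]
    D = [0] * (N + 1)
    for l in range(1, N + 1):
        D[l] = A[l + 1] + 3        # S1
        A[l] = B[l - 1] + C[l]     # S2
        B[l] = A[l] - 5            # S3
    return A, B, D

def vectorized(A, B, C, N):
    A, B = A[:], B[:]
    D = [0] * (N + 1)
    D[1:N + 1] = [A[l + 1] + 3 for l in range(1, N + 1)]   # S1 vectorized
    for l in range(1, N + 1):          # cycle S2-S3 stays sequential
        A[l] = B[l - 1] + C[l]
        B[l] = A[l] - 5
    return A, B, D

N = 5
A = [float(i) for i in range(N + 2)]
B = [2.0 * i for i in range(N + 1)]
C = [i * i * 1.0 for i in range(N + 1)]
print(sequential(A, B, C, N) == vectorized(A, B, C, N))   # True
```

S1 can be hoisted because S1 of iteration l reads A(l+1) before any later iteration overwrites it, so the vector statement sees exactly the values the sequential loop would have seen.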
5.2 Parallelism for Nested Loop Algorithms
There are three kinds of approaches to parallel processing of nested loop algorithms.
(i) Apply FORALL (DOALL), if there are no data dependencies in the loop or the distance of every data dependence is zero.
Example:
for i := 1 to 100 do
  for j := 1 to 100 do
  begin
    a( i, j ) := a( i, j ) + b( i, j ) ;   S1
    b( i, j ) := a( i, j ) * b( i, j ) ;   S2
  end
d1 = 0
d2 = 0
Apply FORALL:
FORALL i := 1 to 100 do
  FORALL j := 1 to 100 do
  begin
    a( i, j ) := a( i, j ) + b( i, j ) ;
    b( i, j ) := a( i, j ) * b( i, j ) ;
  end
Assuming the time for each process creation is 10 t and the execution time of each process is 500 t.
(ii) Treat the nested loop as a single loop. All approaches introduced for single loop algorithms can then be applied. However, the degree of parallelism is limited, since only one loop can be parallelized.
Example: C = A + B, where A = (aij) and B = (bij) are two n × n matrices.
DO i = 1, n
  DO j = 1, n
    S1 : c(i, j) = a(i, j) + b(i, j) ;
  ENDDO
ENDDO
T1 = n² t
By using DOACROSS:
DO i = 1, n
  DOACROSS j = 1, n
    S1 : c(i, j) = a(i, j) + b(i, j) ;
  ENDDOACROSS
ENDDO
Tp = n t
Speed-up = T1 / Tp = (n² t) / (n t) = n
(iii) Map nested loop algorithms onto array processors.
Array processors:
• An array processor is a synchronous parallel computer which consists of a number of processing elements (PEs) controlled by a host computer (control unit).
• Array processors can be classified into three groups: (a) single-instruction multiple-data (SIMD) processors, (b) systolic arrays, (c) associative processors.
5.3 Introduction to MultiPascal (Part II)
6. Examples of Parallel Processing
6.1 Intel MMX Technology (Part III)
6.2 Cluster Computing (Part IV)
6.3 Task scheduling in message passing MIMD
6.3.1 Scheduling Model and Notations
A parallel program is represented by a directed acyclic graph (data dependence graph)
G = (V, E)
where V is a set of tasks, an edge (vi, vj) ∈ E implies that task vj depends on task vi, and the edge weight c(vi, vj) represents the communication cost between vi and vj. Each node vi is associated with a computation cost t(vi). The target machine is a message passing MIMD computer made up of an arbitrary number of PEs. Each PE can run one task at a time, and every task can be processed by any PE. All PEs are connected through an interconnection network (IN). Associated with each edge connecting two PEs is the transfer rate over the link.
Example: Figure 1 shows an example of a DDG and a target machine. There are 9 tasks in total; each node represents one task. The number shown in the upper portion of a node is the task number, and the number shown in the lower portion is its computation cost.
A Gantt chart (space-time diagram) can be used to represent the task schedule. It consists of a list of the PEs in the target machine and, for each PE, the list of tasks allocated to that PE, ordered by their execution times (computation costs).
The goal of task scheduling is to minimize the total completion time of a parallel program.
6.3.2 Communication Cost Model
There are different communication models; the following is one of them. We assume that each PE in the system has an I/O processor, so each PE can execute a task and communicate with another PE at the same time, and the communication time between two tasks allocated to the same PE is zero. If a task vi has been scheduled on some PE and task vj depends on vi, then vj can be scheduled either on the same PE as soon as vi completes, or on another PE after the additional delay of communicating vi's result over the link.
A. The target machine is an eight-node hypercube IN.
Example: Figure 2 shows the Gantt chart for an example under the above communication cost model. Assume the target machine has 3 PEs; tasks 1, 4, 5 and 8 are assigned to PE1, tasks 2 and 9 are assigned to PE2, and tasks 3, 6 and 7 are assigned to PE3, with the same communication cost for all i and j.
6.3.3 Optimal Scheduling Algorithms
It has been proven that the problem of finding an optimal schedule for a set of tasks is NP-complete in the general case. In this section we discuss the cases in which task graphs can be scheduled optimally in polynomial time.
When communication cost is ignored
Case 1: scheduling tree-structured task graphs.
Notations:
Task level: Let the level of a node x in a task graph be the maximum number of nodes (including x) on any path from x to a terminal task (leaf). In a tree, there is exactly one such path.
Ready tasks: A task is ready when it has no predecessors or when all of its predecessors have already been executed.
Algorithm
Step 1. Calculate the task level of each node (task) in the tree as the node's priority.
Step 2. Whenever a PE becomes available, assign it the unexecuted ready task with the highest priority.
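The two steps above can be sketched for unit-time tasks with communication cost ignored. In this Python sketch the task graph is an in-tree given as a dict mapping each task to its single successor; the representation and the small 7-task example graph are assumptions made for illustration (they are not Figure 3).

```python
# Sketch of the level algorithm: priority = level, highest level first.
def levels(succ):
    lv = {}
    def level(v):              # nodes on the unique path to the terminal task
        if v not in lv:
            lv[v] = 1 if succ[v] is None else 1 + level(succ[v])
        return lv[v]
    for v in succ:
        level(v)
    return lv

def schedule(succ, n_pes):
    lv = levels(succ)
    preds = {v: set() for v in succ}       # predecessors of each task
    for v, s in succ.items():
        if s is not None:
            preds[s].add(v)
    done, gantt = set(), []
    while len(done) < len(succ):
        ready = [v for v in succ if v not in done and preds[v] <= done]
        ready.sort(key=lambda v: -lv[v])   # highest level = highest priority
        step = ready[:n_pes]               # one task per available PE
        gantt.append(step)
        done |= set(step)
    return len(gantt), gantt               # completion time, schedule

# Made-up in-tree: 4 leaf tasks feed 2 middle tasks, which feed the root.
succ = {"t1": "t5", "t2": "t5", "t3": "t6", "t4": "t6",
        "t5": "t7", "t6": "t7", "t7": None}
print(schedule(succ, 3))                   # completes in 4 time units on 3 PEs
```

Each pass of the while loop is one time slot, matching one column of the Gantt chart described earlier.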
Example: Consider the tree-structured task graph in Figure 3 (a) to be scheduled on a fully connected system with 3 PEs. Applying the above algorithm:
Step 1. Calculate the task level of each node.
Step 2. Tasks 8, 9, 10, 11, 13, and 14 are ready tasks. Assign tasks 13 and 14 (both have priority 5) to two PEs, and assign task 11 to the third PE.
The completed Gantt chart is shown in Figure 3 (b).
Case 2: Scheduling an arbitrary task graph on two PEs
Algorithm (see reference).
Considering Communication Cost
Case 1. Scheduling a tree-structured task graph on two PEs
Algorithm (see reference).
6.3.4 Heuristic Algorithms
A heuristic algorithm uses less than exponential time but does not guarantee an optimal solution.
An example of a heuristic algorithm (list algorithm):
Step 1. Assign a priority to each node in the task graph.
Step 2. Initialize a priority queue with all ready tasks and sort the queue in descending order of priority.
Step 3. As long as the priority queue is not empty, do the following:
  3.1 Obtain a task from the front of the queue.
  3.2 Select an idle PE to run the task.
  3.3 Insert tasks into the queue as they become ready.
6.4 Associative Processing
• Associative Memory (AM): a special memory in which a word is addressed on the basis of its content rather than its memory address. It is also called Content Addressable Memory (CAM).
• Structure of Associative Processor:
Figure 6.1 [figure omitted: a host computer and control unit drive an m-bit comparand register C and mask register M, an n-word × m-bit AM array (words wi with bits bij), an n-bit tag register T, and data gathering registers D.]
(i) Host computer: the associative memory is controlled by a host computer, which sends data and control signals to the AM, collects results from the AM, and executes instructions that are not executable by the AM.
(ii) Control unit: accepts control signals from the host computer and controls associative processing.
(iii) Comparand register (C): stores the data (pattern) which is searched for in the AM.
(iv) Mask register (M): has the same number of bits as C. All bits of C whose corresponding bits in M are "1" participate in the search.
(v) AM: is composed of n words of m bits each ( wi = ( bi1 bi2 … bim ) ). Each bit bij can be read and written, and is compared with Cj if Mj = 1.
(vi) Tag register (T): indicates the result of a search or marks the operands in the AM. T = ( t1 t2 … tn ).
(vii) Data gathering registers (D): one or more registers ( D = ( d1 d2 … dn ) ) used to store results from T. T and D can perform some simple logic operations.
(viii) Flag some/none: set to one if at least one word in the AM matches C (under mask M), otherwise set to zero.
• Basic Operations:
(1) Set and Reset: to set/reset registers T, D, C, M.
(2) Load: to load registers C and M.
(3) Search: before searching, C and M should be loaded and all bits of T should be set to one. The search performs the following steps:
(a) for j = 1 to m, in parallel for all i: if bij ≠ Cj and Mj = 1, then set mismatchij = 1
(b) set mismatchi = mismatchi1 + mismatchi2 + … + mismatchim
(c) set ti = 0 if mismatchi = 1. The words whose tag bits remain one match C.
(4) Conditional branch: depends on the value of the flag some/none.
(5) Move: to move T to D.
(6) Read: read out the words whose tag bits are one, one by one, to the host computer.
(7) Write: write the content of C to all words whose tag bits are one, in parallel.
Examples of AM programs:
(i) Search for a pattern (101010…10)
1. Set T
2. Load C = 101010…10
3. Load M = 111111…11
4. Search
5. If some/none = 1 then Read, else done
(ii) Find the maximum number in the AM
1. Set D and T
2. Load C = 111…1
3. For j = m downto 1 do
   begin
     Load M = 2^j   (only bit j of M is one)
     Search
     if some/none = 1 then D = D · T and T = D
   end
4. Move D to T
5. If some/none = 1 then Read, else done
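The masked Search operation and the bit-serial maximum search of example (ii) can be modelled in a few lines of Python. The integer-word representation and the helper names here are illustrative assumptions, not AM hardware.

```python
# Toy model of an associative memory: words are m-bit integers, the tag
# register is a list of booleans, and the mask selects participating bits.
def am_search(words, C, M):
    """Set tag i iff word i matches comparand C on every bit where M is 1."""
    return [(w & M) == (C & M) for w in words]

def am_max(words, m):
    """Find the maximum by scanning bit positions from MSB to LSB."""
    D = [True] * len(words)                    # D marks surviving candidates
    for j in range(m - 1, -1, -1):
        Mreg = 1 << j                          # mask: only bit j participates
        T = am_search(words, C=Mreg, M=Mreg)   # words with bit j = 1
        T = [d and t for d, t in zip(D, T)]    # restrict to candidates (D . T)
        if any(T):                             # some/none flag
            D = T                              # keep candidates with bit j = 1
    return [i for i, d in enumerate(D) if d]   # indices holding the maximum

words = [0b1010, 0b0111, 0b1100, 0b1100]
print(am_search(words, C=0b1010, M=0b1111))    # [True, False, False, False]
print(am_max(words, 4))                        # [2, 3]: both copies of 0b1100
```

Note how every word is compared in one step of am_search: this is the content-addressed parallelism that distinguishes an AM from an ordinary, address-indexed memory.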
6.5 Data Flow Computing
• Idea of data flow computing: a computation takes place when all of its operands are available.
• Comparison between ordinary computers (von Neumann machines) and data flow computers:
. von Neumann computers: (i) program and data are stored in memory (ii) the sequence of instruction executions is controlled by a program counter (PC)
(iii) operands may be given as values or addresses
. Data flow computers: (i) a data-flow operation is executed when all its operands become available (ii) all operands must be given as values (iii) data-flow programs are described in terms of directed graphs called data flow graphs
Example of data flow graph: compute Result = A / B + B × C.
Figure 6.2 [figure omitted: a copy node duplicates B; one copy feeds a divide node computing A / B, the other feeds a multiply node computing B × C; both results feed an add node that produces Result.]
• Token: a token is placed on an input of a computation node to indicate that the operand is available. After the execution (called firing), the token moves to the output of the computation node.
Examples of token movement:
Figure 6.3 [figure omitted: four snapshots of the graph of Figure 6.2 — after the inputs A, B, C are applied; after the copy instruction executes; after the divide and multiply instructions execute (tokens A / B and B × C); and after the addition instruction executes, leaving the token A / B + B × C on Result.]
• Initially, tokens are placed on all inputs of a program. The program is complete when a token appears on the output of the program.
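The firing rule can be sketched as a toy simulation of the Figure 6.2 graph: a node fires as soon as tokens are present on all of its inputs, consuming them and emitting a token on each output. The node names and dict representation are made up for illustration.

```python
# Toy dataflow interpreter for Result = A / B + B * C.
# nodes: name -> (input token slots, output token slots, function)
graph = {
    "copyB": (["B"], ["B1", "B2"], lambda b: (b, b)),
    "div":   (["A", "B1"], ["q"], lambda a, b: (a / b,)),
    "mul":   (["B2", "C"], ["p"], lambda b, c: (b * c,)),
    "add":   (["q", "p"], ["Result"], lambda q, p: (q + p,)),
}

def run(tokens):
    fired = True
    while fired:                                   # repeat until no node can fire
        fired = False
        for name, (ins, outs, fn) in graph.items():
            if all(i in tokens for i in ins):      # all operands available
                vals = fn(*[tokens.pop(i) for i in ins])  # fire: consume inputs
                tokens.update(zip(outs, vals))            # emit output tokens
                fired = True
    return tokens

out = run({"A": 8.0, "B": 2.0, "C": 3.0})          # place tokens on all inputs
print(out["Result"])                               # 8/2 + 2*3 = 10.0
```

There is no program counter anywhere in this loop: execution order is determined entirely by token availability, which is the essential contrast with the von Neumann model above.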
Example of data flow program represented by data flow graph:
input ( x, y )
z := 0
while x < y do
begin
  if z < 10 then z := z + 1
  else z := z - 5 ;
  x := x + 1
end
output ( z )
Figure 6.4 [figure omitted: dataflow graph of the program — input tokens x and y, constant tokens 0 and 10, comparison nodes (<) whose T/F outputs gate the tokens through the two branches, operator nodes +1 and -5, and a loop-back path producing the output z.]