
Microprocessing and Microprogramming 40 (1994) 499-520

Enhanced systolic array implementations of some graph problems

S. Sarkar a, A.K. Majumdar b, R.K. Sen b,*

a Control Section, Electrical Engg. Dept., Jadavpur University, Calcutta 700 032, India

b Computer Science & Engineering Dept., Indian Institute of Technology, Kharagpur 721 302, India

(Received 22 April 1992; revised 2 January 1993; accepted 18 April 1994)

Abstract

The Instruction Systolic Array (ISA) and the Tagged Systolic Array (TSA) are enhanced systolic architectures which have been used to implement some graph theoretic algorithms in this paper. Designs for testing a graph for connectedness, finding all paths from vertices to the root of a spanning tree, finding the lowest common ancestors of all vertex pairs of a spanning tree, finding a directed breadth first spanning tree of a graph, finding bridges of a graph and finding articulation points of a graph have been presented. The performances of these proposed designs have also been analysed. These design examples also demonstrate the suitability of the ISA and the TSA for the hardware implementation of graph theoretic algorithms.

Keywords: Graph theoretic algorithms; Hardware implementation of parallel algorithms; Enhanced systolic array; Systolic array; Recurrence equations; VLSI

1. Introduction

The systolic array [13, 14] is a special-purpose array processor architecture suitable for area-efficient layout in VLSI because it uses simple processors and local and regular interconnections between processors. Although systolic arrays have found wide-spread application in the hardware implementation of parallel algorithms, they often cannot satisfactorily deal with the non-local and irregular data dependencies [16] inherent in many algorithms, such as the Fast Fourier Transform, the Fast Hartley Transform, Viterbi decoding, etc. Additional architectural features such as buffers, multiplexers/demultiplexers, additional control for data routing and/or computation, increased local memory at a PE, etc. have been used in the literature [11, 12, 16-20, 29, 30, 32] to arrive at enhanced systolic designs for such algorithms. An Enhanced Systolic Array (ESA) is an array architecture that is obtained from a systolic array by these types of architectural modifications and/or enhancements. Examples of enhanced systolic arrays are: the Programmable Systolic Array [18], the Instruction Systolic Array (ISA) [19], the Brown Systolic Array [11], and the Tagged Systolic Array (TSA) [26].

* Corresponding author. Email: rsen@cse.iitkgp.ernet.in

It has been observed that some of the ESAs mentioned above can deal with a broader class of algorithms. All four types of ESA mentioned above are general-purpose arrays. These arrays are more cost-effective than a systolic array because the same array may be used for solving different problems. Since these arrays use identical processing elements (PEs), and local and regular interconnections exist between PEs, an efficient utilisation of the silicon area in VLSI can be accomplished and the design cost of the chip remains low [16, 22]. In this paper, the Instruction Systolic Array (ISA) and the Tagged Systolic Array (TSA) have been considered. It has been shown in [25-28] that the ISA and the TSA can deal with non-local and irregular data communications, and are suitable for implementing a large variety of algorithms.

Although parallel graph theoretic algorithms have been mostly investigated with reference to general-purpose SIMD or MIMD architectures [3, 5, 6, 10, 24], some systolic implementations of graph theoretic algorithms, such as finding the transitive closure, connected components, etc., have also been reported [4, 8, 9, 15]. Kung et al. [16] have examined systolic designs for the transitive closure (in general, the algebraic path) problem using the Warshall-Floyd algorithm [1]. These designs require O(n) time with n^2 processing elements (PEs) for an n vertex graph. An ISA implementation of the Warshall-Floyd algorithm has been reported in [21]. This design uses an n × n PE mesh-connected array and requires O(n) time. Since graph theoretic algorithms often involve non-local and irregular data communications, systolic implementations of such algorithms may be difficult to obtain. The suitability of the ISA and the TSA for implementing a number of graph theoretic algorithms is illustrated in this paper. The area-complexity and the time-complexity of these designs are comparable to those reported in the literature for general-purpose SIMD or MIMD architectures.

This paper is organised as follows. In Section 2, the general characteristics of the ISA and the TSA are briefly reviewed. The technique for converting an ISA design to a TSA design is also discussed. Section 3 presents ISA implementations of some basic operations which are frequently used in parallel graph theoretic algorithms. In Section 4, ISA implementations of parallel algorithms for the following graph problems are discussed:
(a) Testing a graph for connectedness,
(b) Finding all paths from vertices to the root of a spanning tree,
(c) Finding the lowest common ancestors for all vertex pairs of a spanning tree,
(d) Finding a directed breadth first spanning tree of a graph,
(e) Finding bridges of a graph,
(f) Finding articulation points of a graph.

2. General characteristics of instruction systolic array and tagged systolic array

2.1. Instruction systolic array

The ISA is characterized by a rhythmic flow of instructions through the array. The ISA is classified as an MIMD architecture [17]. The general architecture of an ISA may be a linear or mesh-connected array of identical processing elements (PEs). In an ISA, there is a single stream of instructions having n threads of control (for an n × n PE mesh-connected array or an n PE linear array) entering the array from one of its boundaries. Every instruction specifies an operation. Each operation is assumed to take equal time, called an instruction cycle or time-step. A PE in an ISA is capable of executing instructions belonging to a given set of instructions. A typical instruction set is given in Table 1.

All data communications are choreographed during the synthesis of an ISA. The mesh-connected ISA uses an orthogonal stream of control bits along with the instruction stream. The control bit stream entering an n × n PE mesh-connected ISA has n threads of control bits. A control bit (having value 0 or 1) is used for masking the execution of an instruction received by a PE. The data communications among PEs in the array are under the explicit control of the instruction stream (and the control bit stream, if used).

The ISA is suitable for VLSI implementation because all the PEs in an ISA are identical and local and regular interconnections are employed among PEs. This results in an efficient utilisation of the chip area and a reduced design cost.

Table 1
A typical instruction set of an ISA

Data routing instructions
RL: Read data from the left-side PE
RR: Read data from the right-side PE
RU: Read data from the upper PE
RD: Read data from the lower PE
MOV x, y: Copy the content of register x to register y

Computation instructions (in the following instructions, x is a general-purpose register)
ADD x: Adds A to x and puts the result in A
SUB x: Subtracts A from x and puts the result in A
MUL x: Multiplies A with x and puts the result in A
DIV x: Divides A with x and puts the result in A
AND x: ANDs A with x
OR x: ORs A with x
XOR x: XORs A with x
NOT: NOTs A
NOP: No operation

Instructions for flag-related operations
SET f: Sets a flag f to 1
RES f: Resets a flag f to 0
NOT f: Negates f
OR f, g: ORs flag f with flag g, result in f
AND f, g: ANDs flag f with flag g, result in f
XOR f, g: XORs flag f with flag g, result in f
TRF f, g: Transfers f to g

The architecture of a processing element in an ISA is shown in Fig. 1. A PE has a set of general-purpose registers and three special-purpose registers, viz. the communication register (K), the flag register (F) and the accumulator (A). The communication register is used for routing data from one PE to a neighboring PE. For a linear ISA, a PE can communicate with (at most) two of its neighboring PEs via the communication register. For a mesh-connected array, a PE can communicate with (at most) four of its neighboring PEs via the communication register.

The general-purpose registers are referred to as g1, g2, g3, etc. A PE also includes an instruction register (IR), a control bit register (CB), a control unit (CU) and an arithmetic unit (AU). The total number of general-purpose registers (i.e. the local memory) required will vary from algorithm to algorithm and also depends on the ISA synthesis technique used. The local memory in each PE may vary from a few bytes to a few hundred bytes.

Fig. 1. A processing element of an ISA.


The ISA proposed in [19] does not include the flag register F. However, it is felt that the register F, along with the associated operations involving flags, greatly increases the computational capability of a PE at the expense of a nominal increase in chip area per PE. Thus the F register and the associated flag-related operations have been included, following the approach proposed in [11] for the Brown systolic array, which is very similar to the ISA.

The 8-bit flag register includes a mask flag, a carry/borrow flag, a zero flag, a negative flag, a phase flag and three other flags left for use by the programmer. The mask flag is used to mask an instruction, i.e. if its content is zero, the instruction is not executed.

The control unit (CU) decodes the received instruction and decides whether this instruction is to be executed or not, depending on the status of the mask flag and the control bit (if any). The arithmetic unit (AU) performs primitive computations like addition, multiplication, AND, OR, etc.

Each of the general-purpose registers and the registers K and A is one byte (optionally a few bytes) long. The CB register is one bit only. The length of the IR register is estimated from the total number of instructions in the instruction set. If the instruction set of a PE consists of R instructions, then the length of the IR register is proportional to ⌈log2 R⌉ bits.

The operation of each PE in an instruction cycle may be described by:

begin  {beginning of an instruction cycle}
  send instruction;
  send control bit;     {for mesh-connected ISA only}
  receive instruction;
  receive control bit;  {for mesh-connected ISA only}
  execute instruction;
end.   {end of an instruction cycle}

A typical mesh-connected array of PEs is shown in Fig. 2. The same figure also shows the instruction stream matrix and the control bit stream matrix fed to the ISA. To execute an algorithm on a mesh-connected ISA, an instruction stream and a control bit stream are pumped into the array. The instructions and the control bits flow through the array rhythmically in a pipelined fashion. A PE receives an instruction from a neighboring PE at the beginning of an instruction cycle. A PE also receives a control bit from a neighboring PE at the beginning of an instruction cycle. In a PE, the instruction is executed only when the control bit to the PE is 1. The use of control bits results in a very flexible array processor architecture. In the case of a linear ISA, where no control bit stream is used, a PE always executes the instruction received.

In an ISA, data is routed from the source PE to the destination PE (which may not be a neighbor of the source) via other PEs and local interconnects by the execution of proper instructions at intermediate PEs. Therefore, an ISA can deal with both local and non-local data routings. This makes the ISA more suitable for implementing algorithms involving local and/or non-local data routings.

The instruction stream pumped into an n PE linear ISA or an n × n PE mesh-connected ISA may be expressed as an nl × n matrix of instructions. The length of the instruction stream is given by nl. The total time required to execute an algorithm is derived from the length of the instruction stream. Rows of instructions (from the first row to the nl-th row) enter the array at successive time-steps. The instruction in the ith row (1 ≤ i ≤ nl) and the jth column (1 ≤ j ≤ n) of the instruction stream matrix is denoted by R(i, j). The control bit stream can be similarly represented as an n × nl matrix of control bits. A control bit may be 0 or 1. Columns of control bits enter the array at successive time-steps. The control bit in the ith row (1 ≤ i ≤ n) and the jth column (1 ≤ j ≤ nl) of the control bit matrix is denoted by C(i, j).

Fig. 2. Instruction stream and control bit stream of a mesh-connected ISA.

A PE of a mesh-connected ISA is identified by the row and the column of the array to which it belongs. The PE in the ith row and the jth column is denoted by PE(i, j). A PE of a linear ISA is identified as PE(j).

In order to illustrate the operation of an ISA, consider the execution of instructions at PE(i, j) in Fig. 2. Suppose that the first row of instructions enters the first row of PEs at the first time-step. Each row of instructions enters the array at consecutive time-steps. So the pth row of instructions enters the first row of PEs at the pth time-step and reaches the ith-row PEs at the (p + i − 1)th time-step. Thus PE(i, j) receives instruction R(p, j) at the (p + i − 1)th time-step. If this instruction is required to be executed, then the control bit that enters PE(i, j) at the (p + i − 1)th time-step must be 1. The movement of the control bit stream can be analysed in a similar manner.

In a linear ISA, the ith row of instructions enters the array at the ith time-step, assuming that the first row of instructions enters the array at the first time-step. Therefore, R(p, j) is executed at PE(j) at the pth time-step.
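To make the timing concrete, the following Python sketch (ours, not from the paper) enumerates when each instruction of a mesh-connected ISA is executed. It assumes, as described above, that instruction row p reaches array row i at time-step p + i − 1; the exact alignment of the control bit stream (column q = t − j + 1 meeting PE(i, j) at time t) is our assumption for illustration.

def isa_schedule(instr, ctrl, n):
    """Log (time_step, i, j, instruction) for every executed instruction.

    instr: nl x n list of lists; instr[p-1][j-1] is R(p, j).
    ctrl:  n x nl list of lists; ctrl[i-1][q-1] is C(i, q).
    """
    nl = len(instr)
    log = []
    for p in range(1, nl + 1):
        for i in range(1, n + 1):
            t = p + i - 1                    # R(p, j) reaches array row i at time-step t
            for j in range(1, n + 1):
                q = t - j + 1                # assumed control-bit column at PE(i, j) at time t
                if 1 <= q <= nl and ctrl[i - 1][q - 1] == 1:
                    log.append((t, i, j, instr[p - 1][j - 1]))
    return sorted(log)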

2.2. Tagged systolic array

The major disadvantage of the ISA is that the instruction stream (and the control bit stream, if necessary) is supplied to the array from outside. Therefore, the I/O requirement may be a bottleneck in deriving a high-performance design. This problem is absent in another ESA, called the Tagged Systolic Array (TSA) [25, 26]. The necessary instructions for computation and data routing are stored in a PE of the TSA before starting the computation. It has been observed that the size of the local memory required at a PE in a TSA is considerably low (less than 100 bytes for many applications).

The basic design philosophy used in a TSA is that a tag is attached to each data item for routing it from the source PE to the destination PE. The tag gives the address of the destination PE. When a data routing originates at the source PE, a tag is attached to the data.

It may be mentioned here that the use of tags in a TSA is different from the concept of tag usage in a data flow architecture. The TSA uses distributed control, and all the computations and data routings are synchronised to global time-steps. Here, tags are used simply to achieve data routings. All the data communications among the PEs in the array are explicitly choreographed during the synthesis of a TSA from the algorithm specification. Thus, no data routing conflicts over data paths occur during the execution of an algorithm. Also, no wait queue is required at a PE.

The general architecture of a TSA is a linear or a mesh-connected array of PEs. The TSA may be classified as an MIMD architecture. The TSA uses identical PEs with local and regular interconnections among them. Therefore, it is suitable for VLSI implementation. A typical PE of a TSA is shown in Fig. 3.

The instructions required to execute an algorithm are stored in each PE. The instructions for computation, flag-related operations and register-to-register data movement are similar to those for an ISA. However, the data routing instruction is different. A data routing instruction contains information about which data is to be routed and which tag is to be attached to this data for its proper routing from the source PE to the destination PE. A data routing instruction may be represented by (ROU d, t), which implies that the data 'd' is to be routed and 't' is the tag to be used for this data routing.

Fig. 3. A processing element of a TSA.

Therefore, only one routing instruction is required to route any data from a source PE to a destination PE. This is in contrast to the ISA, where a single data routing may require more than one instruction, e.g. RL, RR, RU, etc. Hence, the total number of instructions required to be stored in a PE for data routings can be considerably reduced. However, the use of tags for data routing necessitates additional hardware logic, called the tag identifier (TI), to be included in each PE. The TI is associated with the communication register (which carries out data communications with neighboring PEs) to compare the tag of a data item with the PE identifier.

The data routing in a TSA with the help of tags is now examined. For the sake of illustration, it is assumed that a data item moves in the vertical direction first and then in the horizontal direction in a mesh-connected TSA. For a linear TSA, however, there is only one direction of movement, i.e. horizontal, to the left or to the right. It is also assumed that a data item always follows a minimum-length path between the source and the destination.


Let PE(i, j) of a mesh-connected TSA receive a data item with tag (x, y) from a neighboring PE. If x = i and y = j, then this data is stored in PE(i, j). If i < x then the data is sent to PE(i + 1, j), whereas if i > x then the data is sent to PE(i − 1, j). When i = x and j > y, the data is sent to PE(i, j − 1) and, when i = x and j < y, the data is sent to PE(i, j + 1).
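The routing rule above is easy to state as code. The following Python sketch (our illustration, not the authors' implementation) computes the neighbor to which PE(i, j) forwards a data item carrying tag (x, y) under the vertical-first convention:

def next_hop(i, j, tag):
    """Return the next PE for a data item with destination tag (x, y),
    or None when PE(i, j) is itself the destination (data is stored)."""
    x, y = tag
    if i < x:
        return (i + 1, j)    # move towards row x (increasing i)
    if i > x:
        return (i - 1, j)    # move towards row x (decreasing i)
    if j > y:
        return (i, j - 1)    # destination row reached: move towards column y
    if j < y:
        return (i, j + 1)    # destination row reached: move towards column y
    return None              # x = i and y = j: store the data here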

It may be mentioned here that, for a mesh-connected TSA, the movement of data as specified above is one of many possible ways of routing data from the source PE to the destination PE. If the address of the destination PE is used as a tag, then rules for data movement are required for reducing the tag length. Thus, for a mesh-connected TSA, one of the rules for data movement may be as discussed above. If no such rule is used, the tag must somehow encode the path information. This may result in an increase of the tag length.

A data routing conflict arises in a TSA/ISA when more than one data item tries to travel over the same interconnecting link between two PEs at the same time. Such routing conflicts must be resolved during the synthesis of the ISA/TSA. If the routing of a data item results in a data path conflict, it can be resolved either by following a different path for routing the data or by introducing additional delays during the routing of the data.

2.3. Converting an ISA design to a TSA design

It has been shown in [28] that a TSA implementation of an algorithm can be derived from the corresponding ISA implementation. Such a TSA implementation uses a similar type of array (i.e. linear or mesh) with an equal number of processing elements (PEs) and can complete the computation of an algorithm in the same order of time. The conversion of an ISA design to a TSA design is discussed briefly below.

Both the ISA and the TSA designs are based on mesh-connected arrays or linear arrays of PEs.

Hence, if the computations of an algorithm can be mapped onto a mesh-connected ISA of n × n PEs (or a linear ISA of n PEs) by a valid allocation function, the same algorithm can also be mapped onto a mesh-connected TSA of n × n PEs (or a linear TSA of n PEs) by the same allocation function. The instruction stream pumped into an ISA performs computations and data routings. The instructions used for performing computations only may be stored in the PEs of a TSA. Therefore, all such computations can be performed in the same order of time.

The times for data routings in these two architectures are now examined. In the following discussion, data routings in a mesh-connected array are considered. Similar steps can also be carried out for linear arrays.

Consider data routing from a source PE to a destination PE via intermediate PEs in an ISA. The data routing is achieved by moving data from a PE to a neighboring PE by the execution of any of the instructions RR, RL, RU and RD. The equivalent movement of data can be simulated in a TSA by a routing instruction, which routes the data with a tag identifying the particular neighboring PE. In this manner, the movement of the data from the source PE to the destination PE may be simulated. Since any valid data routing in an ISA implies that there is no data routing conflict, no path conflict will arise in the TSA either if the same path is followed. Therefore, it follows that all the data routings in an ISA implementation of an algorithm can be completed in the same order of time in a TSA.

Techniques similar to the above can be used to arrive at TSA implementations from the ISA implementations of the graph theoretic algorithms considered here. Regarding the ISA implementations, design details of simpler and obvious operations have been omitted for brevity. In the proposed ISA implementations of the different graph theoretic algorithms, mesh-connected arrays have been used, where a graph is represented in terms of its adjacency matrix. In an n × n mesh-connected ISA, the ith-row, jth-column processor PE(i, j) (where 1 ≤ i, j ≤ n) stores an element a(i, j) of the adjacency matrix A = [a(i, j)] of an n vertex graph. The element a(i, j), whose value is either 0 or 1, may be stored in the accumulator or in a general-purpose register of PE(i, j).

3. ISA implementations of some basic operations

The following basic operations are considered for execution in an n × n mesh-connected ISA. In the ensuing discussion, by the distribution of a data item from a PE (called the source PE) to several other PEs (called the destination PEs), it is meant that the data of the source PE is written into the destination PEs.

(1) Row/column distribution. In a row/column distribution, data from a row/column, say the ith row/column (1 ≤ i ≤ n), is distributed to all other rows/columns of the n × n mesh-connected ISA. Thus in a row distribution, PE(i, j) writes its data to PE(k, j) for all k, 1 ≤ k ≤ n, k ≠ i. The column distribution operation can be similarly described.

(2) Diagonal operation. The diagonal operation may be described as follows:

Step 1. A diagonal PE(i, i), 1 ≤ i ≤ n (for an n × n mesh), selects a particular element residing in the PEs of row (column) i.

Step 2. The selected value is now distributed by PE(i, i), 1 ≤ i ≤ n, to all other PEs in row i and column i.

The selection of an element in Step 1 usually involves finding the minimum or maximum element.

(3) Rotation. Data rotation in a mesh-connected array considered here may be either in the horizontal direction or in the vertical direction.

When the rotation is carried out in the horizontal direction, data flow may be either towards the left or towards the right. Similarly, in the vertical direction, the rotation may be carried out upwards or downwards. In a single step of a horizontal left rotation, columns of data are shifted left by one column. The first column of data is shifted to the last column. After the complete rotation, each column of data is in its original position. By a horizontal left rotation, each PE(i, j) receives the data residing in the other PEs of row i.

It may be noted that the rotation defined here is different from the rotation discussed in [3], where rows of data move left like a row of soldiers in a horizontal left rotation. There, the data at PE(i, 1) of a row i bounces back at the left boundary and starts moving right until it bounces back at the right boundary; after bouncing back at the right boundary, a data item once again starts moving left. By this type of rotation also, a PE(i, j) receives all the data residing in the other PEs of row i. Therefore, if PE(i, j) is to receive all the data residing in the PEs of row i, either of these two types of rotations may be used.

(4) Selecting minimum (maximum). This operation involves finding the minimum or maximum value of all the n x n data elements stored in a mesh array. The result can be stored in any arbitrarily selected PE. For some algorithms, it may also be required to select minimum or maximum value of all the data items stored in a row or column.

(5) Finding logical AND/OR. This operation finds the logical AND/OR of all the n x n data stored in a mesh array. The result can be stored in any arbitrarily selected PE. It may also be required in some applications to find AND/OR of a row (column) of data.

The following sub-sections illustrate how these basic operations can be implemented in an ISA (TSA).


3.1. Row/column distribution

The recurrence equation for distributing the data of the kth row to all other rows may be given by:

b(i, j) = a(k, j)   for i ≠ k and 1 ≤ i, j ≤ n,   (1)

where a(k,j) is the data stored in the row k and column j PE, and b(i,j) is the data received at PE(i, j) after this distribution operation.

The row distribution operation for a 4 x 4 mesh- connected array is shown in Fig. 4, where data from the third row is distributed to all other rows.

In order to distribute a row of data to all other rows, the particular row of data is transmitted to the neighboring row(s). A neighboring row of PEs, after receiving these data, sends the same to its neighboring row. Thus a row of data ripples from its source row, eventually reaching all other rows. Fig. 5 shows the instruction sequence for the row distribution operation shown in Fig. 4. Let the a(k, j) value be stored in a general-purpose register g1 of PE(k, j). The b(i, j) value is to be stored in another general-purpose register g2 of PE(i, j). It can be shown that the row distribution operation can be completed in an n × n mesh-connected ISA in O(n) time. A constant amount of local memory is required at each PE of this ISA. An n × n mesh-connected TSA implementation of this operation also requires O(n) time-steps. Since each PE of the TSA sends data to its neighboring PEs only, at most two tags and two data routing instructions need to be stored in a PE. The TSA implementation also requires a constant amount of local memory at a PE.

Fig. 4. Row distribution from the third row.

Fig. 5. Row distribution.

The column distribution operation from column k to all other columns can be performed in a similar manner in an n × n mesh-connected ISA. The column distribution operation also requires O(n) time-steps and constant local memory per PE.
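A functional Python sketch of the row distribution of Eq. (1) is given below (our illustration under the constraint that a PE only ever reads from an adjacent row, as in the ISA; it is not the instruction-level design):

def row_distribution(a, k):
    """Distribute row k of the n x n data array a to all other rows.

    The data ripples one row per time-step, upwards and downwards from
    row k, so only neighboring rows ever communicate. Returns b with
    b[i][j] = a[k][j] for every i.
    """
    n = len(a)
    b = [row[:] for row in a]
    for i in range(k - 1, -1, -1):   # ripple upwards, one row per step
        b[i] = b[i + 1][:]
    for i in range(k + 1, n):        # ripple downwards, one row per step
        b[i] = b[i - 1][:]
    return b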

3.2. Diagonal operation

The diagonal operation is frequently used in many mesh-array-based graph theoretic algorithms. Usually, this is accomplished in a general-purpose MIMD mesh array by row and column rotation [3]. The overall time complexity of such an implementation is O(n).

A straightforward implementation of this operation in an n × n mesh-connected ISA results in O(n^2) time-complexity. The instruction sequence for Step 1 of this operation is shown in Fig. 6. (The instructions required for computation in a diagonal PE(i, i) have not been shown; only the instructions required for data routing have been shown.) Note that for data routing in each row, a separate instruction has been used. This results in losing an important advantage of the ISA, i.e. instruction pipelining through the array. In instruction pipelining, an instruction entering the array is executed at multiple PE locations for data routing/computation. Such pipelining is effective in reducing the length of the instruction sequence as well as the total time of computation.

Fig. 6. Instruction sequence for Step 1 of the diagonal operation (data routing only).

An optimal n × n ISA implementation can be obtained as discussed next. Here, instead of selecting a value at PE(i, i) in Step 1, the same value is selected at PE(i, 1). Next, in a single row distribution operation, the selected value is distributed from PE(i, 1) to all PEs in row i. At this stage, by one vertical rotation, the data of a diagonal PE(i, i) can be selected at PE(1, i). The selected value at PE(1, i) is now distributed to all PEs in column i. Thus the algorithm involves the following steps:

Step 1. Send columns of data to the first column of PEs. As these data are received one after another at the first-column PEs, the selection operation is performed simultaneously. This operation takes O(n) time-steps in an ISA.

Step 2. Distribute the first column of selected data to all other columns. This is done by the column distribution operation and hence requires O(n) time.

Step 3. Collect the data of PE(i, i) (1 ≤ i ≤ n) at PE(1, i). This is achieved by vertical rotation along with the necessary computations. This step also requires O(n) time.

Step 4. Distribute the collected data from the first row to all other rows. This is implemented by the row distribution operation in O(n) time.

Therefore, the diagonal operation may be implemented in an ISA in O(n) time. The local memory per PE is also constant.
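The following Python sketch models the result of the four steps functionally (assuming, for illustration, that the selection in Step 1 is a row-wise minimum; it fixes only what the O(n) ISA schedule computes, not how):

def diagonal_operation(a, select=min):
    """After the diagonal operation, every PE of row i and column i holds
    the value selected for row i. PE(i, j) therefore ends up holding the
    pair (value selected for row i, value selected for row j)."""
    n = len(a)
    sel = [select(a[i]) for i in range(n)]         # Steps 1-2: select and spread along rows
    return [[(sel[i], sel[j]) for j in range(n)]   # Steps 3-4: diagonal values spread down columns
            for i in range(n)]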

3.3. Rotation

Horizontal rotation

The recurrence equation describing a step of a horizontal left rotation operation is given by:

for 1 ≤ i, j ≤ n; k = 1:
  if j = n then a(i, j, k) = a(i, 1, k − 1),
  if j ≠ n then a(i, j, k) = a(i, j + 1, k − 1),   (2)

where a(i, j, 0) is the value stored in PE(i, j) before rotation and a(i, j, 1) is the value at PE(i, j) after a step of rotation.

The complete rotation is obtained with n such steps for an n × n mesh. After the rotation is com- pleted, each column of data is in its original posi- tion (i.e. the position before rotation).
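As a quick functional check of Eq. (2), one step of horizontal left rotation is simply a cyclic left shift of every row, and applying it n times restores the original layout. A Python sketch (ours, not the paper's instruction-level design):

def left_rotate_step(a):
    """One step of horizontal left rotation: every column shifts left by one
    position and the first column wraps around to the last (Eq. (2))."""
    return [row[1:] + row[:1] for row in a]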

Suppose that a(i, j, 0) is stored in the general-purpose register g1 in PE(i, j) before rotation. The instruction sequence required for implementing a single step of horizontal left rotation in a 4 × 4 mesh-connected ISA is shown in Fig. 7. After the rotation, the data is available in the accumulator of a PE. From Fig. 7, it is found that this operation can be completed in O(n) time-steps with constant local memory per PE. A TSA implementation can be derived in a similar way. The TSA implementation can also perform a step of the rotation operation in O(n) time-steps. It is noted that a complete rotation can also be completed in O(n) time in an ISA/TSA.

The vertical rotation can be similarly performed in O(n) time-steps in an n × n mesh-connected ISA.

3.4. Finding minimum/maximum

The selection of the minimum or maximum value in an n × n ISA can be done as follows:

In the first phase, data residing in columns of PEs are sent to the kth column PEs by a horizontal rotation operation. Each PE of the kth column finds the minimum (or maximum) between the data stored in the PE and the data received in successive time-steps from other columns. The minimum value is stored in the PEs of this column. This operation can be completed in O(n) time-steps with constant local memory per PE.

In the second phase, the minimum (or maximum) values available in each PE of the kth column are sent to the rth-row PE in the same column (i.e. PE(r, k)). The PE(r, k) finds the minimum between the data stored in it and the data just received, and stores the minimum value in this PE. The operations of the second phase can also be completed in O(n) time-steps.

Fig. 7. Instruction sequence for a single step of horizontal left rotation in a 4 × 4 mesh-connected ISA.

Therefore, the whole operation requires O(n) time-steps in an n × n mesh-connected ISA. Each PE of the array requires local memory for storing only two data items. Therefore, the local memory per PE is constant.
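Functionally, the two phases amount to a row-wise reduction followed by a reduction along the collecting column, as in the following Python sketch (an illustration of what is computed, not of the systolic schedule):

def mesh_minimum(a):
    """Two-phase minimum over an n x n data array: phase 1 funnels each row
    through the kth column, keeping a running minimum per row; phase 2
    combines the per-row minima at PE(r, k)."""
    col_min = [min(row) for row in a]   # phase 1: running minima collected in column k
    return min(col_min)                 # phase 2: combined at PE(r, k)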

3.5. Finding logical AND/OR

The implementation of a logical AND/OR operation can be done in a manner very similar to finding the minimum or maximum. The data of all rows are first sent to a particular row. The PEs of this row compute the logical AND (OR) of the columns of data. The results of the computation are then sent to another PE of that row, which computes the logical AND (OR) of the results of this row. The final result is available in this PE. It is noted that the whole operation takes O(n) time-steps and requires only constant local memory in a PE.

The performances of the ISA implementations of the various operations discussed earlier are summarized in Table 2.

4. Implementation of some graph problems in ISA

A number of important parallel graph theoretic algorithms have been considered in this section for ISA implementation. Note that these algorithms use the previously discussed basic operations. Some of these algorithms require that the transitive closure of a graph be found. It has already been mentioned that the transitive closure of an n vertex graph can be found in O(n) time on an n × n mesh-connected ISA [21]. The ISA implementation of the transitive closure problem for a 4-vertex graph using the well-known Warshall-Floyd algorithm is shown in Fig. 8. The total computation time is O(n) for this design. Also, the local memory in a PE is constant because at most three data items need to be stored in a PE at any time for computation.
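For reference, the computation performed by the Fig. 8 design is the standard Warshall-Floyd transitive closure, sketched sequentially below; the n × n ISA pipelines the same recurrence in O(n) time-steps.

def transitive_closure(adj):
    """Warshall-Floyd transitive closure of a directed graph.

    adj is an n x n 0/1 adjacency matrix; a[i][j] becomes 1 iff vertex j
    is reachable from vertex i. Sequential reference code only.
    """
    n = len(adj)
    a = [row[:] for row in adj]
    for k in range(n):
        for i in range(n):
            if a[i][k]:
                for j in range(n):
                    a[i][j] = a[i][j] or a[k][j]
    return a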


Table 2
Performance measures for ISA/TSA implementations of different basic operations

Basic operation                      No. of PEs   Time-complexity   Cost      Memory (approx.)
Row (column) distribution            n^2          O(n)              O(n^3)    constant (≤ 256 bytes)
Diagonal operation                   n^2          O(n)              O(n^3)    constant (≤ 256 bytes)
Rotation                             n^2          O(n)              O(n^3)    constant (≤ 256 bytes)
Selecting minimum (maximum) value    n^2          O(n)              O(n^3)    constant (≤ 256 bytes)
Finding AND/OR                       n^2          O(n)              O(n^3)    constant (≤ 256 bytes)

Fig. 8. ISA for transitive closure. (Repeat the same instruction stream n times to complete the computation.)

4.1. Testing a graph for connectedness

A graph G(V, E) is said to be connected if there is at least one path between every pair of vertices in G. Connectivity testing of a graph is often used in more involved graph theoretic algorithms, such as testing a graph for separability, planarity and isomorphism with another graph [7].

Let A be the adjacency matrix of G and A* be its transitive closure. An element a*(i, j) in A* is 1 if there is a path from vertex i to vertex j. Therefore, in a connected graph, all the elements of A* are 1.


Hence, to find whether a graph is connected, the algorithm given below may be followed:

Algorithm 1
Step 1. Find A*.
Step 2. Check whether all the elements of A* are 1.
Step 3. END.

Step 1 requires finding the transitive closure of the given graph. As discussed earlier, the transitive closure problem can be implemented in an n × n mesh-connected ISA with constant local memory in a PE. Also, the computation of the transitive closure can be completed in O(n) time-steps. The Step 2 operation can be performed in O(n) time by a logical AND operation over all the data in A*. Such an operation has been discussed in Sub-section 3.5. The graph is connected if the final result of the AND operation is one. The operation in Step 2 can also be completed in O(n) time-steps with constant local memory in a PE. Therefore, Algorithm 1 can be completed in O(n) time-steps with constant memory in a PE.
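Algorithm 1 can be summarised by the following sequential sketch, reusing the transitive_closure reference code above; forcing the diagonal of A to 1 (every vertex trivially reaches itself) is our assumption about the convention for A*.

def is_connected(adj):
    """A graph is connected iff every entry of A* is 1 (Algorithm 1)."""
    n = len(adj)
    a = [[1 if i == j else adj[i][j] for j in range(n)] for i in range(n)]
    a_star = transitive_closure(a)              # Step 1: find A*
    return all(all(row) for row in a_star)      # Step 2: logical AND of all entries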

4.2. Finding all paths from the vertices to the root of a spanning tree

Consider a connected graph G(V, E) and a spanning tree T(V, E') of G, where |V| = n. Let r0 be the root of this spanning tree. It is required to construct a matrix P where each row contains a path from a vertex to the root of the tree. Matrix P may be used in a number of efficient parallel graph theoretic algorithms for SIMD shared memory architectures, as discussed in [31]. Some of these algorithms are: finding the lowest common ancestors of q vertex pairs in a directed tree, finding all fundamental cycles of a connected undirected graph, finding all bridges in a connected undirected graph, finding the bridge-connected components of a connected undirected graph, finding all biconnected components in a connected undirected graph, finding articulation points, and determining the biconnectivity of a connected undirected graph.

In order to find the ancestors of a vertex, a function F is defined as follows:

F(i) = father of vertex i in T, for i ≠ r0,
F(r0) = r0.

Let F^k be defined by:

F^0(i) = i,
F^k(i) = F(F^(k−1)(i)).

If i is a vertex in T, F^k(i) is the kth ancestor of i in T. The depth of the ith vertex is defined by:

depth(i) = min{k | F^k(i) = r0, 0 ≤ k ≤ n − 1}.
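The definitions above translate directly into code. The following Python sketch (ours) lists the ancestors of a vertex and derives its depth, given the father function F as a mapping with F[r0] = r0:

def ancestors(F, i, r0):
    """Return [i, F(i), F^2(i), ..., r0]; depth(i) is len(result) - 1."""
    path = [i]
    while path[-1] != r0:
        path.append(F[path[-1]])
    return path

def depth(F, i, r0):
    return len(ancestors(F, i, r0)) - 1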

Let the tree vertices be labeled with positive integers in such a manner that only a vertex with a lower integer label can be an ancestor of another vertex with a higher integer label. Given the function F, F^k can be computed as discussed next.

Suppose that the n × n adjacency matrix for T is stored in the n × n mesh-connected ISA such that PE(i, j) holds an element t(i, j) of T in a general-purpose register g1. The element t(i, j) is 1 only if there is an edge between the ith vertex and the jth vertex. Compute F^1(i) by using F^1(i) = {j | t(i, j) = 1}. This computation can be done by sending column values to the first column in O(n) time by horizontal rotation. However, for computing F^1(i), the accumulated values need not be stored in the first-column PEs. As a PE(i, 1) of the first column receives a value t(i, j) from PE(i, j), it immediately checks whether t(i, j) = 1 and accordingly finds F^1(i). The PE(i, 1) receives the t(i, 2) value first, the t(i, 3) value second, and so on. Therefore, the j value can be easily determined. As PE(i, 1) completes the computation with a particular t(i, j), it discards that data and is ready to receive the next data. Hence, no more than one data item needs to be stored in a PE at a time. This implies that only constant local memory is required at a PE. The computation of F^1(i) in PE(i, 1) can be completed in O(n) time.

A copy of the F^1(i) value computed at PE(i, 1) is sent to its right-side neighbor, i.e. PE(i, 2). This requires constant time. The PE(i, 1) sends a copy of the F^1(i) value to PE(i + 1, 1). This F^1(i) value then proceeds from PE(i + 1, 1) to PE(i + 2, 1), and thereafter from PE(i + 2, 1) to PE(i + 3, 1), and so on until it reaches PE(n, 1).

Therefore, a PE(j, 1) receives all the F^1(i) values for i < j. The order in which these F^1(i) values become available at PE(j, 1) is:

PE(j, 1) receives F^1(j − 1) first,
                  F^1(j − 2) second,
                  F^1(j − 3) third,
                  ...
                  F^1(1) at the end.

Initially PE(j, 1) stores F^1(j), i.e. the first ancestor of the jth vertex. At the first time-step, PE(j, 1) receives F^1(j − 1), which is the first ancestor of vertex (j − 1). If F^1(j) = (j − 1), then F^2(j) = F^1(j − 1). If F^2(j) is computed at this step, then a copy of it is sent to PE(j, 2). The PE(j, 2), which stores F^1(j), sends the same to PE(j, 3). The ancestor information can be sent to the right-side PEs by horizontal right rotation. The PE(j, 1) next receives F^1(j − 2) and repeats the steps outlined above. For the operations in this stage, a constant amount of local memory is required in a PE.

The entire operation of finding the ancestors of a vertex can be completed in O(n) time with constant local memory per PE. After the entire operation is over, the consecutive PE locations from PE(j, 2) to its right contain the path information from the root to the first ancestor of vertex j. This completes the construction of the matrix P.

The depth of the ith vertex in the tree can be found by counting the consecutive non-zero ancestors in the ith-row PEs from PE(i, 2). This operation can also be completed in O(n) time.
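Functionally, the matrix P produced above can be described by the following Python sketch (our model of the result, with vertices labelled 1..n, rows stored root-first as in the text, and unused positions padded with 0; the ISA builds the same rows in O(n) time):

def build_paths_matrix(F, r0, n):
    """Row i of P lists the path from the root to vertex i's first ancestor:
    r0, ..., second ancestor, first ancestor, padded with 0 to length n."""
    P = []
    for i in range(1, n + 1):
        row, v = [], i
        while v != r0:
            v = F[v]
            row.append(v)          # first ancestor, second ancestor, ..., r0
        row.reverse()              # store root-first, as described in the text
        P.append(row + [0] * (n - len(row)))
    return P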

4.3. Finding lowest common ancestors

Let T(V, E') be a spanning tree as discussed in Sub-section 4.2. The lowest common ancestor (LCA) of two vertices u and v (u, v ∈ V) in T is the vertex w ∈ V such that w is a common ancestor of both u and v, and any other common ancestor of u and v in T is also an ancestor of w in T. An O(log n) time algorithm for finding the LCA for q vertex pairs in an n vertex tree T has been discussed in [31] with reference to a SIMD PRAM architecture having n processors. The LCA may be used for finding all fundamental cycles of a connected undirected graph, finding all bridges in a connected undirected graph, etc. [31].

Let the vertex w be the LCA of vertices u and v, where u, v, w ∈ V. Suppose that the vertex r0 is the root of the tree T. The portion of the path from w to r0 is common to both the path from u to r0 and the path from v to r0. Let the paths from the root to all vertices in T be found as discussed in Sub-section 4.2. The matrix P is stored in the array of PEs. Therefore, row u and row v have identical entries in PEs from column 2 to the column of depth(w). Comparing row u and row v, one can easily find the LCA for the vertex pair (u, v). The computation of the algorithm proceeds as follows:

Suppose that the data of the ith row move downwards from row to row. When the jth row (j > i) receives the ith-row data, a PE(j, k) of the jth row compares its content (i.e. its ancestor information) with that of PE(i, k). If these two ancestor values are equal, then the content of a general-purpose register g4 (for a 4 × 4 array) is set to one. After this computation is over, it is counted, starting from PE(j, 2), how many consecutive PEs of the jth row have g4 = 1. This gives the depth of the path for the LCA. Quite evidently, the local memory requirement for this operation is constant.


Suppose that the g4 register contents of PE(j, 2) to PE(j, k) are all 1 and those of PE(j, k + 1) to PE(j, n) are all 0. Then the ancestor information stored in PE(j, k) gives the LCA for the vertex pair (i, j). The LCAs for all pairs of vertices can be found in O(n) time with constant local memory in a PE.
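The row comparison can be modelled sequentially as follows (our sketch, using the matrix P of build_paths_matrix above; like the text's method, it assumes neither vertex is an ancestor of the other):

def lca(P, u, v):
    """Deepest common entry of the root-first rows for vertices u and v."""
    w = None
    for a, b in zip(P[u - 1], P[v - 1]):
        if a == b and a != 0:
            w = a              # rows still agree: a is a common ancestor
        else:
            break
    return w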

4.4. Directed breadth first spanning tree

In [3], many important graph theoretic algorithms have been presented for execution on a mesh-connected general-purpose MIMD architecture. For an n vertex graph, these algorithms require O(n) time-steps in an n × n array of general-purpose processors. The problems considered in [3] include: finding the bridges and articulation points of an undirected graph, finding the length of a shortest cycle, finding a minimum spanning tree, and many others.

Many parallel algorithms for graph problems have been reported in the literature with reference to general-purpose SIMD and MIMD architectures [2, 23]. Some parallel algorithms have also been developed for linear arrays and two-dimensional meshes. The n × n mesh ISA/TSA also has a mesh interconnection. However, in an ISA the local memory at a processing element should be kept as low as possible. Thus, in an ISA, when a processing element sends a data item to its neighbor, normally, unlike in [3], no copy of the data is assumed to be retained unless the same data is required for further computation in this PE. In view of this, while attempting to implement graph theoretic algorithms in an ISA, adequate care should be taken to minimize local memory.

In this and the two subsequent sub-sections, algorithms for finding a directed breadth first spanning tree, the bridges of a graph and the articulation points of a graph have been considered for mesh-connected ISA implementation. It is seen that these algorithms can be executed in an ISA in O(n) time-steps with constant local memory in a PE.

Suppose that a directed breadth first spanning tree T(V, A) of a connected graph G(V, E) is to be constructed. Let the spanning tree be rooted at vertex 1. Let e_{i,j} represent a directed edge from vertex i to vertex j. Suppose that S_n(i, j) represents the shortest path from vertex i to vertex j such that the intermediate vertices belong to the set of vertices {v_1, v_2, ..., v_n}, v_i ∈ V for 1 ≤ i ≤ n. The ISA implementation discussed in this sub-section is based on the following algorithm due to Atallah et al. [3]:

Algorithm 2
Step 1. Compute S_n(i, j) for G.
Step 2. Find level(i) = S_n(i, 1) and level(j) = S_n(1, j).
Step 3. Distribute the level(j) value from PE(1, j) to all processors in the jth column. Similarly, the level(i) value is distributed to all processors in the ith row.
Step 4. Every processor PE(i, j) now contains the values of level(i) and level(j). For i ≠ 1, F(i) is computed at PE(i, i) using the formula F(i) = min{j | e_{i,j} ∈ E and level(i) = level(j) + 1}. The F(i) value at PE(i, i), for i ≠ 1, is sent to every processor in the ith row and the ith column.
Step 5. Check in which PEs the F(i) value matches the row number or the column number of the PE, and perform the following computations in those PEs:
  PE(F(i), i) decides that the edge e_{F(i),i} ∈ A and that the edge e_{F(i),i} ∈ E_A;
  PE(i, F(i)) decides that the edge e_{i,F(i)} ∈ E_A.

Here E_A is a subset of the edges of E. The same notation E_A shall also be used in the algorithm for determining bridges in the next sub-section.
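A sequential sketch of Algorithm 2 is given below (our illustration, with vertices indexed from 0 so that vertex 1 of the text is index 0; plain breadth first search stands in for the S_n computation of Step 1):

def directed_bfs_tree(adj):
    """Return (level, F) for the directed BFS spanning tree rooted at vertex 0.

    level[i] is the shortest-path distance from vertex 0;
    F[i] = min{ j : adj[i][j] = 1 and level[i] = level[j] + 1 } is the
    father of i; the tree arcs are (F[i], i) for i != 0.
    Assumes a connected undirected graph given as a 0/1 adjacency matrix.
    """
    n = len(adj)
    INF = n + 1
    level = [INF] * n
    level[0] = 0
    frontier = [0]
    while frontier:
        nxt = []
        for u in frontier:
            for v in range(n):
                if adj[u][v] and level[v] == INF:
                    level[v] = level[u] + 1
                    nxt.append(v)
        frontier = nxt
    F = {i: min(j for j in range(n) if adj[i][j] and level[i] == level[j] + 1)
         for i in range(1, n)}
    return level, F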

In an n × n mesh-connected ISA, the computations required in Step 1 can be performed in O(n) time-steps using the transitive closure algorithm, and require constant local memory in a PE. Let the S_n(i, j) value found after Step 1 be stored in a general-purpose register g2 in PE(i, j). The general-purpose register g1 in PE(i, j) stores the information that e_{i,j} ∈ E. Suppose that a third general-purpose register g3 in PE(i, 1) stores the level(i) value and yet another general-purpose register g4 in PE(1, j) stores the level(j) value. The Step 2 computation requires copying of data from g2 to g3 or g4, and this requires constant time. The computations in Step 3 of the algorithm can be achieved with row and column distribution operations, each of which can be completed in O(n) time with constant local memory in a PE. Suppose that the level(i) and level(j) values received by PE(i, j) at this step are stored in general-purpose registers g5 and g6 respectively.

In an n × n mesh ISA, Step 4 requires O(n) time with a constant memory per PE. This may be seen as follows:

Since each PE(i, j) stores the level(i) value in g5, the level(j) value in g6 and the e_{i,j} value in g1, the conditions e_{i,j} ∈ E and level(i) = level(j) + 1 may be checked at PE(i, j) in constant time. If both these conditions are satisfied, the content of another general-purpose register g7 is set equal to the content of the general-purpose register g8, which stores the j value. For computing F(i) at PE(i, i), the minimum of all the g7 register values of the ith-row PEs is first obtained. The diagonal operation discussed in Sub-section 3.2 may be used for this purpose.

Suppose that the general-purpose register g9 in PE(i, j) stores the i value. Then the first phase of the Step 5 computation, which requires checking whether F(i) is equal to the row or column number, can be performed in constant time. The PEs where this condition is not satisfied are masked using the flag M for the next phase of computation in Step 5. The next phase of computation can also be performed in constant time. The results of this computation are stored in the general-purpose registers g10 and g11.

Therefore, the overall algorithm can be executed in O(n) time with a constant local memory in a PE.

4.5. Bridges of a graph

The terminology used in this sub-section is the same as that used in Sub-section 4.4. The directed breadth first spanning tree algorithm discussed in the previous sub-section is used here to find the bridges of a graph. The algorithm for finding the bridges of a graph [3] is given by:

Algorithm 3
Step 1. Find a directed breadth first spanning tree T(V, A) of the given graph G(V, E).
Step 2. Find the directed graph G' = (V, E − E_A), where E_A is the set of edges found in Algorithm 2. Compute G1 = (V, E1) as

  e1_{i,j} = e'_{i,j} OR a_{i,j},

where e1_{i,j} ∈ E1, e'_{i,j} ∈ (E − E_A) and a_{i,j} ∈ A (for T(V, A)). Note that E1 contains all the edges in A and the set of directed edges obtained by replacing every edge in E − E_A by two opposite directed edges.
Step 3. Compute the transitive closures T* and G1* of T and G1 respectively.
Step 4. For each i ≠ 1, check whether the ith-row values of the adjacency matrices of T* and G1* are the same and collect the answer (yes/no) at PE(i, i). For a processor PE(i, i) containing the answer 'yes', PE(F(i), i) and PE(i, F(i)) are marked. The processors marked at this step give the edges of G that form the bridges of the graph, i.e. if processor PE(i, j) is marked then edge e_{i,j} is a bridge.

Step 1 of this algorithm can be implemented in an n × n mesh ISA in O(n) time-steps, as discussed in the previous sub-section. The memory required in a PE for this step is constant. For the Step 2 computations, both E and E_A are available in the mesh of PEs, and so (E − E_A) can be computed in constant time without any additional local memory in a PE. Now the adjacency matrices for both T and G' are available in the mesh array. Hence, the adjacency matrix for G1 can be computed in constant time. So Step 2 can be completed in constant time with constant local memory. In Step 3, the transitive closures of T and G1 can be computed in O(n) time-steps with constant local memory. Step 4 requires comparison of the ith-row contents of the adjacency matrices for T* and G1* and can be performed in constant time. An answer of 'yes' is indicated by setting the content of another general-purpose register in PE(i, i) to one. The operations required for collecting the answer can also be done in O(n) time-steps by the diagonal operation discussed in Sub-section 3.2. After the diagonal operation, the results at PE(i, i) are already distributed to all PEs in the ith row and the ith column. The marking of PE(F(j), j) or PE(j, F(j)) can be done in constant time by setting a flag register F1.

From the above, it is evident that in an n x n mesh ISA this algorithm can be completed in O(n) time with a constant local memory in a PE.
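The whole bridge-finding procedure can be checked against the following sequential sketch (ours, not the ISA schedule), which reuses directed_bfs_tree and transitive_closure from above:

def bridges(adj):
    """Bridges of a connected undirected graph, following Algorithm 3.

    T holds the BFS tree arcs (F(i), i); G1 adds both orientations of
    every non-tree edge. A tree edge (F(i), i) is a bridge iff row i of
    T* equals row i of G1* (Step 4).
    """
    n = len(adj)
    level, F = directed_bfs_tree(adj)                    # Step 1
    t = [[0] * n for _ in range(n)]
    for i, f in F.items():
        t[f][i] = 1                                      # tree arcs point away from the root
    g1 = [row[:] for row in t]
    for i in range(n):
        for j in range(n):
            if adj[i][j] and F.get(i) != j and F.get(j) != i:
                g1[i][j] = 1                             # Step 2: non-tree edges, both directions
    t_star = transitive_closure(t)                       # Step 3
    g1_star = transitive_closure(g1)
    return [(F[i], i) for i in range(1, n)               # Step 4: equal rows => bridge
            if t_star[i] == g1_star[i]]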

4.6. Articulation points of a graph

In this sub-section, an n × n mesh ISA implementation of the algorithm described in [3] for finding the articulation points of a graph has been considered. This algorithm is based on the construction of an undirected graph H(V, E2). An algorithm [3] for constructing H is discussed next. The notations used in this sub-section are the same as those used in the previous two sub-sections.

Algorithm 4: Algorithm for construction of H
Step 1. Construct T, T*, G' and G1 as in Algorithm 2 and in Algorithm 3.
Step 2. Create the adjacency matrix of a directed graph Z(V, X) by logical multiplication of the adjacency matrices of T* and G', which have been computed in Step 1. An edge between vertex i and vertex j in Z is computed as

  z(i, j) = Σ_{k=1..n} T*(i, k) · G'(k, j),

where addition represents logical OR and multiplication represents logical AND. Note that there is an edge (i, j) in Z iff there is an i−k path in T and an edge from k to j in G'.
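The Step 2 computation is an ordinary logical matrix product, as in this short sketch:

def bool_mat_mult(a, b):
    """z[i][j] = OR over k of (a[i][k] AND b[k][j]); with a = T* and
    b = G' this yields the adjacency matrix of Z."""
    n = len(a)
    return [[int(any(a[i][k] and b[k][j] for k in range(n)))
             for j in range(n)] for i in range(n)]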

Step 3. Every PE(x, y) creates a duplicate copy of T*(x, y) and F(x). Every row of processors now sends the copies of T*(x, y) and F(x) by vertical rotation. The computation per- formed in this stage is: When PE(i, y) receives the upward moving copy of T*(x, y) and F(x) from PE(x, y), the following conditions are checked for satisfaction: (1) F(i)= F(x) (2) The edge (x, y~ is an arc of T* (3) The edge (i, y ) is an arc of Z If all three conditions are satisfied then PE(i, y) decides that H(i,x) = 1.

With the completion of the above computations, the adjacency matrix for H has been generated.
Step 3(a). Decide in PE(i, j) whether vertex i is special or vertex j is special. This is done as follows. When the upward moving copy of the content of PE(x, y) is sweeping past PE(i, y), PE(i, y) checks whether the following conditions are satisfied:
(1) x = F(i)
(2) The edge (x, y) is not an arc of T*
(3) The edge (i, y) is an arc of Z
If all these conditions are satisfied then vertex i is special. PE(i, y) now sends this special vertex information to PE(i, i), which in turn relays it to all PEs in row i and column i.
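Sequentially, Steps 3 and 3(a) amount to the following condition checks; in the array, the loop over x is realised by the vertical rotation that sweeps each row past the others. A sketch under the same 0-based conventions as above (F[i] is the tree parent of i, F[0] = -1 marks the root; names are illustrative):

def build_H_and_specials(Ts, Z, F):
    # Ts and Z are the 0/1 adjacency matrices of T* and Z from Step 2.
    n = len(Ts)
    H = [[0] * n for _ in range(n)]
    special = [False] * n
    for i in range(1, n):                # the root has no parent, so skip it
        for x in range(n):
            for y in range(n):
                # Step 3: i and x are siblings, (x, y) is an arc of T*
                # and (i, y) is an arc of Z  =>  H(i, x) = 1
                if F[i] == F[x] and x != i and Ts[x][y] and Z[i][y]:
                    H[i][x] = 1
                # Step 3(a): x is i's parent, (x, y) is NOT an arc of T*
                # and (i, y) is an arc of Z  =>  vertex i is special
                if x == F[i] and not Ts[x][y] and Z[i][y]:
                    special[i] = True
    return H, special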


In a mesh-connected ISA, the computations in Step 1 of the above algorithm can be completed in O(n) time-steps with constant local memory per PE, as discussed in Sub-sections 3.4 and 3.5. In Step 2, the computation for finding an edge (i, j) in Z has the same formulation as the matrix multiplication problem discussed in Section 2. It has been shown in [3, 9, 16] that the matrix multiplication computation can be performed in O(n) time with constant local memory in an n x n mesh array. Following an approach very similar to matrix multiplication, it can be shown that the computations in Step 2 can be completed in O(n) time-steps with constant local memory per PE.

In Step 3, the creation of duplicate copies of data requires constant time and constant memory. The vertical rotation used to send these duplicate copies can be completed in O(n) time with constant local memory. Checking conditions (1) and (2) requires constant time, and condition (3) needs to be checked only once. The time required for Step 3(a) is examined now. All conditions in Step 3(a) can be checked in constant time. Therefore, deciding whether vertex i is special can be performed in O(n) time-steps with constant local memory per PE. After it has been decided in PE(i, y) whether vertex i is special, this information is sent to PE(i, i) by one horizontal rotation. PE(i, i) now stores the information whether vertex i is special. Next, by one horizontal followed by one vertical rotation, PE(i, i) relays this information to all other PEs in row i and column i. So after this operation, all the PEs in row i and column i know whether vertex i is special. This information is stored in a general purpose register by setting its content to one if vertex i is special; otherwise, the content of this register is set to zero. The computations of Step 3(a) can thus be completed in O(n) time-steps with constant local memory per PE. Therefore, Step 3 can be completed in O(n) time-steps with constant local memory in a PE.
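The horizontal and vertical rotations used above are the basic O(n) collect/broadcast primitives of the mesh. A toy model of one horizontal rotation, assuming each PE in a row holds a one-bit flag in a register that is shifted cyclically once per time-step (names and data layout are illustrative):

from collections import deque

def horizontal_rotation_or(row_flags):
    # After n cyclic shifts every register has visited every PE in the row,
    # so each PE -- in particular the diagonal PE(i, i) -- can accumulate
    # the OR of all flags in its row in O(n) time-steps.
    n = len(row_flags)
    regs = deque(row_flags)
    acc = [0] * n
    for _ in range(n):
        for p in range(n):
            acc[p] |= regs[p]        # each PE inspects the value passing by
        regs.rotate(1)               # one shift toward the neighbouring PE
    return acc                       # acc[p] == OR of the whole row, for every p

This is exactly how PE(i, i) learns in one rotation whether any processor in row i has raised the 'special vertex' flag.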

Hence, construction of the graph H can be completed in O(n) time-steps with constant local memory per PE.

The algorithm for finding articulation points is given below:

Algorithm 5: Articulation point algorithm
Step 1. Construct H.
Step 2. Find H*.
Step 3. For every PE(i, j), if F(i) = F(j) = 1, then test whether H*(i, j) = 1 and, if not, create a message saying that vertex 1 is an articulation point. Collect the answer at PE(1, 1) as follows: in one horizontal rotation, every PE(i, 1) checks if one of the processors in row i contains a message saying that vertex 1 is an articulation point. If so, then PE(i, 1) itself creates such a message. Next, in one vertical rotation, PE(1, 1) finds out if a processor in column 1 contains such a message and, if so, PE(1, 1) decides that vertex 1 is an articulation point.

Step 4. Every PE(i, 1) checks whether vertex i is a leaf of T. If so then vertex i is not an articulation point and this is indicated by setting a flag in PE(i, 1).

Step 5. Every PE(j, 1) for which F(j) ≠ 1 checks whether H*(j, k) = 1 for some k such that vertex k is special. If this is not satisfied, then PE(j, 1) indicates to PE(F(j), 1) that vertex F(j) is an articulation point.
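Again in sequential form, Algorithm 5 reduces to two connectivity tests on H*. A sketch reusing the 0-based conventions above (root = vertex 0, F[0] = -1; `special` is the vector from Step 3(a)); it models what the rotations compute, not how:

def articulation_points(H, F, special):
    n = len(H)
    # Step 2: reflexive-transitive closure of H (Warshall, as in the bridge
    # sketch earlier; the diagonal is set so every vertex reaches itself).
    Hs = [row[:] for row in H]
    for v in range(n):
        Hs[v][v] = 1
    for k in range(n):
        for i in range(n):
            if Hs[i][k]:
                for j in range(n):
                    Hs[i][j] = Hs[i][j] or Hs[k][j]
    arts = set()
    # Step 3: the root is an articulation point iff some pair of its
    # children is not connected in H*.
    kids = [v for v in range(1, n) if F[v] == 0]
    if any(not Hs[i][j] for i in kids for j in kids):
        arts.add(0)
    # Steps 4-5: a non-root parent F(j) is an articulation point iff row j
    # of H* reaches no special vertex.  Leaves are never parents, so
    # Step 4's exclusion of leaves is implicit here.
    for j in range(1, n):
        if F[j] > 0 and not any(Hs[j][k] and special[k] for k in range(n)):
            arts.add(F[j])
    return arts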

Step 1 of this algorithm can be performed in O(n) time-steps with constant local memory per PE, as discussed earlier. The Step 2 operation involves finding the transitive closure of H. The Step 2 computation can also be performed in O(n) time-steps with constant local memory per PE, as mentioned earlier in Section 2.

In Step 3, testing the conditions requires constant time.


Table 3
Performance measures for ISA/TSA implementations for different graph theoretic algorithms

Algorithm                                        No. of PEs  Time-complexity  Cost    Memory
Connectivity testing                             n^2         O(n)             O(n^3)  constant (< 256 bytes)
Paths from nodes to the root of a spanning tree  n^2         O(n)             O(n^3)  constant (< 256 bytes)
Finding LCA                                      n^2         O(n)             O(n^3)  constant (< 256 bytes)
Directed breadth first spanning tree             n^2         O(n)             O(n^3)  constant (< 256 bytes)
Bridges of a graph                               n^2         O(n)             O(n^3)  constant (< 256 bytes)
Articulation points of a graph                   n^2         O(n)             O(n^3)  constant (< 256 bytes)

All the vertical and horizontal rotations required for the Step 3 computations can be completed in O(n) time-steps with constant local memory per PE. Operations in Step 4 can be completed in constant time. For checking the condition in Step 5, horizontal rotation may be used. All the messages that need to be sent from PE(j, 1) to PE(F(j), 1) can be sent by a single vertical rotation. So the Step 5 computation can also be completed in O(n) time-steps with constant local memory per PE.

Therefore, an ISA implementation of the above algorithm for finding the articulation points of a graph requires O(n) time-steps with constant local memory per PE.

Table 3 summarizes the performance of the n x n ISA implementations discussed above.

5. Conclusion

In this paper, ISA implementations of some graph theoretic algorithms have been considered. It has been shown that many parallel graph theoretic algorithms developed for general-purpose SIMD or MIMD architectures can be implemented on an ISA/TSA with the same time-complexities and a constant local memory requirement per processing element. The algorithms considered here for ISA implementation are representative samples of a few types of graph theoretic algorithms that have been examined in the literature. It is expected that many other graph theoretic algorithms can be similarly implemented on an ISA. Since the local memory requirement is not very high, ISA implementations of graph theoretic algorithms appear to be attractive. As mentioned earlier, TSA implementations can easily be derived from the corresponding ISA implementations.

Acknowledgement

The authors are grateful to their colleagues at Jadavpur University and the Indian Institute of Technology for their suggestions.

References

[1] A.V. Aho, J.E. Hopcroft and J.D. Ullman, The Design and Analysis of Computer Algorithms (Addison-Wesley, Reading, MA, 1974).


[2] S.G. Akl, The Design and Analysis of Parallel Algorithms (Prentice Hall, Englewood Cliffs, NJ, 1989).

[3] M.J. Atallah and S.R. Kosaraju, Graph problems on a mesh-connected processor array, J. ACM 31(3) (July 1984) 649-667.

[4] A. Benaini, P. Quinton, Y. Robert, Y. Saouter and B. Tourancheau, Synthesis of a new systolic architecture for the algebraic path problem, IRISA research report, June 1989.

[5] F.Y. Chin, J. Lam and I-Ngo Chen, Optimal parallel algorithms for the connected component problem, Proc. Int. Conf. on Parallel Processing (1981) 170-175.

[6] F.Y. Chin, J. Lam and I-Ngo Chen, Efficient parallel algorithms for some graph problems, Commun. ACM 25 (1982) 659-665.

[7] N. Deo, Graph Theory (Prentice-Hall of India, New Delhi, 1987).

[8] K. Doshi and P. Varman, Determining biconnectivity on a systolic array, Proc. Int. Conf. on Parallel Processing (Aug. 1987) 848-850.

[9] L.J. Guibas, H.T. Kung and C.D. Thompson, Direct VLSI implementation of combinatorial algorithms, Proc. Caltech Conf. on VLSI (Jan. 1979) 509-525.

[10] D.S. Hirschberg, A.K. Chandra and D.V. Sarwate, Com- puting connected components on parallel computers, Commun. ACM 22 (1979) 461-464.

[11] R. Hughey and D.P. Lopresti, Architecture of a programmable systolic array, Proc. Int. Conf. on Systolic Arrays, San Diego, CA (May 1988), Editors: K. Bromley, S.Y. Kung and E. Swartzlander (Computer Society Press) 41-50.

[12] K.J. Jones, High-throughput, reduced hardware systolic solution to prime factor discrete Fourier transform algorithm, IEE Proc. Part E 137 (1990) 191-196.

[13] H.T. Kung and C.E. Leiserson, Algorithms for VLSI processor arrays, in: C. Mead and L. Conway, Introduction to VLSI Systems, Chapter 8, Section 3 (Addison-Wesley, 1980).

[14] H.T. Kung, Why systolic architectures?, Computer 15(1) (Jan. 1982) 37-46.

[15] S.Y. Kung, S.C. Lo and P.S. Lewis, Optimal systolic design for the transitive closure and the shortest path problem, IEEE Trans. Comput. 36(5) (May 1987) 603-614.

[16] S.Y. Kung, VLSI Array Processors (Prentice Hall, Englewood Cliffs, NJ, 1988).

[17] M. Kunde, H.W. Lang, M. Schimmler, H. Schmeck and H. Schroder, The instruction systolic array and its relation to other models of parallel computers, Parallel Comput. 7 (1988) 25-39.

[18] M.S. Lam, A Systolic Array Optimizing Compiler (Kluwer Academic, MA, USA, 1989).

[19] H.W. Lang, The instruction systolic array - a parallel architecture for VLSI, Integration, the VLSI Journal 4 (1986) 65-74.

[20] H.W. Lang, ISA and SISA: Two variants of a general-purpose systolic architecture, Proc. Second Int. Conf. on Supercomputing 1 (1987) 460-467.

[21] H.W. Lang, Transitive closure on an instruction systolic array, Proc. Int. Conf. on Systolic Arrays, San Diego, CA (May 1988), Editors: K. Bromley, S.Y. Kung and E. Swartzlander (Computer Society Press) 295-304.

[22] C. Mead and L. Conway, Introduction to VLSI Systems, (Addison-Wesley, 1980).

[23] P. Quinton, Automatic synthesis of systolic arrays from uniform recurrent equations, 11th Annual Symp. on Computer Architecture, Ann Arbor (June 1984) 208-214.

[24] E. Reghbati and D.G. Corneil, Parallel computations in graph theory, SIAM J. Comput. 7 (1978) 230-237.

[25] S. Sarkar and A.K. Majumdar, Fast Fourier transform using linear tagged systolic array, Proc. IEEE Region 10 Conf. on Computer and Communication Systems, Hong Kong (Sep. 1990) 289-293.

[26] S. Sarkar and A.K. Majumdar, Tagged systolic arrays, IEE Proc.-E 138(5) (Sep. 1991) 289-294.

[27] S. Sarkar and A.K. Majumdar, An instruction systolic array implementation of the two-dimensional fast Fourier transform, Microprocessing and Microprogramming 33 (1991) 101-110.

[28] S. Sarkar, Synthesis of enhanced systolic arrays, Ph.D. Thesis, Computer Sc. & Engg. Dept., IIT, Kharagpur, India, Oct. 1991.

[29] S. Sastry and V.K.P. Kumar, A general purpose VLSI array for efficient signal and image processing, Proc. Int. Conf. on Parallel Processing (Aug. 1987) 917420.

[30] K.W. Shin and M.K. Lee, A VLSI architecture for parallel computation of FFT, Proc. Int. Conf. on Systolic Arrays (1989) 116-125.

[31] Y.N. Tsin and F.Y. Chin, Efficient parallel algorithms for a class of graph theoretic problems, SIAM J. Comput. 13(3) (Aug. 1984) 580-599.

[32] T. Willey, R. Chapman, H. Yoho, T.S. Durrani and D. Preis, Systolic implementations for deconvolution, DFT and FFT, IEE Proc. Part F 132(6) (Oct. 1985) 466-472.


Susanta Sarkar received the B.E. degree in Electrical Engineering from Jadavpur University, Calcutta, India in 1985. From 1985 to 1988 he worked with the National Thermal Power Corporation, India. He joined the Indian Institute of Technology, Kharagpur, as a Research Scholar in 1988. Currently he is continuing with his Ph.D. program. His main areas of interest are parallel processing and VLSI.

A.K. Majumdar received M. Tech. and Ph.D. degrees in Applied Physics from the University of Calcutta, India in 1968 and 1973, respectively. He also obtained a Ph.D. degree in Electrical Engineering from the University of Florida, Gainesville, Florida, USA in 1976.

From 1976 to 1977 he was associated with the Electronics and Communication Sciences Unit, Indian Statistical Institute, Calcutta. He served as Associate Professor in the School of Computer & Systems Sciences, Jawaharlal Nehru University, New Delhi, from 1977 to 1980. Since 1980 he has been associated with the Indian Institute of Technology, Kharagpur, where he is a Professor in the Computer Science & Engineering Department. During 1986-87 he served as a visiting Professor at the Department of Computer and Information Science, University of Guelph, Ontario, Canada. His research interests are design automation, database management systems, VLSI and artificial intelligence.

R.K. Sen received the Ph.D. degree from the University of Calcutta, India in 1979. Presently he is associated with the Indian Institute of Technology, Kharagpur, where he is an Assoc. Professor in the Computer Science & Engineering Department. During 1988-90 he served as a visiting Asst. Professor at the Department of Computer Science, Hampton University, USA. He is a member of the Computer Society of India, the Association for Computing Machinery and the IEEE Computer Society. His main areas of interest are graph theory optimization problems and parallel processing.