
Computers Elect. Engng Vol. 22, No. 5, pp. 325-342, 1996
Copyright © 1996 Elsevier Science Ltd. Printed in Great Britain. All rights reserved.
PII: S0045-7906(96)00010-9
0045-7906/96 $15.00 + 0.00

MATERIAL IDENTIFICATION ALGORITHMS FOR PARALLEL SYSTEMS

MICHAEL KAHN and SOTIRIOS G. ZIAVRAS

Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102, U.S.A.

Abstract-The direct binary hypercube interconnection network has been widely used in the design of parallel computer systems, because it has low diameter and can effectively emulate commonly used structures such as binary trees, rings, and meshes. However, the hypercube has the disadvantage of high VLSI complexity, due to the increase in the number of communication ports and channels per PE (processing element) with an increase in the dimension of the hypercube. This complexity is a major drawback of the hypercube because it limits its potential for scalability. Ziavras has introduced the reduced hypercube (RH) interconnection network. The RH reduces VLSI complexity without compromising performance. The RH is obtained from a regular hypercube by a uniform reduction in the number of edges for each hypercube node. The main objective of this paper is to demonstrate the feasibility of the RH in implementing material identification algorithms using X-ray fluorescence. Relevant algorithms are also developed for the PRAM and the hypercube, for comparative analysis. The algorithms rely heavily on the binary tree emulation capabilities of these systems. The results indicate that the RH is capable of providing hypercube-like performance at a significantly reduced complexity/cost. Copyright © 1996 Elsevier Science Ltd

Key words: Parallel processing, computer architecture, hypercube, VLSI complexity, interconnection networks, material identification, performance analysis, reduced hypercube, scalability.

1. INTRODUCTION

In practical applications, accuracy and computation time are important considerations. Material identification algorithms are currently implemented on sequential computer systems. The processing power of sequential computers imposes limitations on performance. To overcome these limitations, it is necessary to employ parallel processing. This paper examines the feasibility of using the RH interconnection network to implement a parallel material identification algorithm. RHs [1] have lower VLSI complexity than hypercubes [2,3], and thus permit the construction of powerful massively parallel systems. In order to develop a parallel algorithm for the RH, the sequential algorithm is examined first. From that, an algorithm for a hypothetical machine, the PRAM (Parallel Random Access Machine), is developed [4]. The PRAM can be used to determine what parts of the sequential algorithm are parallelizable. Based on the PRAM, a hypercube model is obtained. This realistic model includes communication overheads. Finally, an RH case is considered. The RH is then compared to the regular hypercube model to see how much loss of performance, or 'degradation', occurs in the RH due to the missing links. The goal is to show that the degradation can be very small in practical cases.

The following describes some basics of material identification. Details of the theory are not relevant here, but a brief explanation would be useful in understanding the algorithm. The material identification algorithm uses X-ray fluorescence (XRF) to perform the identification. When high energy X-rays are applied to a material, each element in that material gives off X-rays of a lower energy. Every element has a unique X-ray energy, therefore the X-ray is characteristic of that particular element, hence the name ‘characteristic X-ray’.

An X-ray detector can detect the various X-rays from different elements in the sample. When this detector is connected to the proper electronics, the data can be fed into a computer for analysis.

The energy continuum is discretized into many levels, which are referred to as ‘channels’. The computer tabulates how many X-rays of various energies have been radiated in each channel and creates a histogram. This histogram (sometimes referred to as a ‘spectrum’) contains information about the material that can be used to identify it. To perform material identification, this histogram is compared to a library of cataloged histograms stored on disk. If certain criteria are met, then


the two histograms are considered to ‘match’, and the unknown material is thereby identified. Throughout the rest of this paper, the histogram of the unknown sample will be referred to simply as the ‘unknown histogram’. Figure 1 shows a typical histogram.

The paper is organized as follows. Section 2 presents a material identification algorithm. Section 3 discusses its sequential implementation. Section 4 introduces its PRAM (Parallel Random Access Machine) implementation which will become the starting point for the development of parallel algorithms. Section 5 presents its hypercube implementation. Section 6 is an introduction to the reduced hypercube (RH) topology. Section 7 presents the RH implementation of the material identification algorithm. Finally, Section 8 contains conclusions.

2. MATERIAL IDENTIFICATION ALGORITHM

2.1. An identification is determined in the following way:

On the unknown histogram, the algorithm selects R areas of interest [5]. These areas are integrated independently, and these integration results are compared with integration results from the library histograms, which use the same integration intervals. Finding the limits of integration requires finding the local maxima of all R areas. Determining the local maxima on the unknown histogram itself can sometimes lead to erroneous results due to statistical fluctuations in the histogram, which are caused by random 'noise' in the data acquisition electronics. Therefore, the unknown histogram is input into a non-causal filter function for 'smoothing'. This 'smoothed' histogram is used for finding the precise integration limits. The smoothing is performed as follows:

for d := b to C - 1 - b do

    z[d] = Σ_{i=0}^{b-1} S[d-b+i]/2^{b-i} + Σ_{i=0}^{b-1} S[d+b-i]/2^{b-i}    (2.1)

where: b is the level of smoothing; d is an index; z is an array storing the smoothed histogram; S is an array storing the unknown histogram.
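For concreteness, this smoothing step can be written out in Python. The sketch below assumes the symmetric weighting reconstructed above in equation (2.1) and mirrors the paper's names S, z, b and C; it is an illustration, not the authors' implementation:

    def smooth(S, b):
        # Non-causal smoothing per equation (2.1) (reconstructed form):
        # the neighbour at distance b - i on either side gets weight 1/2^(b-i).
        C = len(S)
        z = [0.0] * C        # same size as S, to keep the indices consistent
        for d in range(b, C - b):
            z[d] = sum(S[d - b + i] / 2 ** (b - i) + S[d + b - i] / 2 ** (b - i)
                       for i in range(b))
        return z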

The array z is the same size as S, which holds the unknown histogram. This is done to keep the indices consistent. However, z[b] is the first valid element of z, since for each element of the smoothed histogram there are b previous channels required as inputs. Using z, the algorithm then finds the R areas of interest. To accomplish this, first all local maxima are found; then the value of the histogram at each local maximum is compared to a predetermined constant, P. If the value at the local maximum is greater than P, then that local maximum represents one of the R areas; otherwise, it is discarded.

Fig. 1. Typical histogram: each area i is V_i channels wide; the horizontal axis is the channel number, with one set of integration limits marked.

In order to find the local maxima, array z1 is created from the smoothed spectrum z:

for d := 2b to C - 1 - 2b do

    z1[d] = Σ_{i=0}^{b-1} (i-b) z[d-b+i] + Σ_{i=0}^{b-1} (b-i) z[d+b-i]    (2.2)

where b is again the level of smoothing. z1 represents a function whose polarity and zero crossings match those of z' (the first derivative of z). The zero crossings in z1 represent relative extrema. By examining whether z1 was increasing or decreasing at the zero crossing, it can be determined whether the critical point was a local minimum or a local maximum. For this algorithm, only local maxima are of interest. Basically, a 'second derivative test' is performed to find the local maxima. The following pseudocode determines the local maxima and the number of integration areas R.

i := 0;
for d := 3b to C - 1 - 3b do
begin
    j := 0; (initialize index)
    test := true;
    repeat
        if z1[d - b + j] > 0 AND z1[d + b + j] < 0 then
            (Is this a negative-going zero crossing? If so, this is a local maximum.)
            test := true
        else
            test := false;
        inc(j);
    until (j = b OR test = false);
    if (test = true AND z[d] > P) then
        (Does this local maximum exceed the threshold to be included?)
    begin
        x_i := z[d]; O_i := d; i := i + 1;
    end;
end;
R := i; (number of integration areas)

x_i now represents the amplitude of the ith local maximum, and O_i represents the channel where the local maximum occurs. The lower and upper integration limits of each area, namely A_i and B_i, have amplitudes close to 1/2 the amplitude of the local maximum, as shown in the following pseudocode:

d := O_i - 1; while z[d] > x_i/2 do d := d - 1; A_i := d;

d := O_i + 1; while z[d] > x_i/2 do d := d + 1; B_i := d;

Each area† consists of V_i channels, where V_i = B_i - A_i + 1.

†V_i represents the number of channels in each area. This number can be different in each area. The largest area is a limiting factor in the calculations. For the sake of simplicity, throughout the rest of this paper only V_max (the largest integration interval) will be considered, and V will be assumed to be V_max unless otherwise stated.
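The peak-finding machinery of equation (2.2) and the two pseudocode fragments can be collected into one routine. The Python sketch below is an illustration only, under the reconstructed forms above; the names z1, O, A and B mirror the paper's notation:

    def find_areas(S, z, b, P):
        # Peak finding per equation (2.2) and the pseudocode above
        # (reconstructed logic). Returns peak channels O and limits A, B.
        C = len(z)
        z1 = [0.0] * C
        for d in range(2 * b, C - 2 * b):          # equation (2.2)
            z1[d] = sum((i - b) * z[d - b + i] + (b - i) * z[d + b - i]
                        for i in range(b))
        O, A, B = [], [], []
        for d in range(3 * b, C - 3 * b):
            # A negative-going zero crossing of z1 sustained over a window of
            # b channels marks a local maximum of z; it must also exceed P.
            if z[d] > P and all(z1[d - b + j] > 0 and z1[d + b + j] < 0
                                for j in range(b)):
                lo = d - 1
                while lo > 0 and z[lo] > z[d] / 2:   # walk down to half height
                    lo -= 1
                hi = d + 1
                while hi < C - 1 and z[hi] > z[d] / 2:
                    hi += 1
                O.append(d); A.append(lo); B.append(hi)
        return O, A, B                               # R = len(O)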


There are two levels of comparison used to determine if a library histogram matches an unknown histogram. The first level, referred to as a quick comparison, is performed on all N histograms in the library. If the quick comparison for a histogram in the library succeeds, then a full comparison is performed on that histogram. If the quick comparison fails, a full comparison is not performed on the library histogram. The full comparison involves many more computations than the quick comparison, so there is a time savings if a full comparison is not performed on all N library histograms, but only on F histograms, where F is the number of histograms in the library which passed the quick comparison test with the unknown histogram. Therefore, F ≤ N.

The quick comparison consists of comparing the largest local maximum to that of the library histogram, and is evaluated using the following pseudocode:

if |Y_max - K[O_max]| ≤ T·√Y_max then quick_compare := true else quick_compare := false;

if quick_compare then full_compare;

Here Y_max = S[O_max] is the amplitude of the largest local maximum of the unknown histogram.

K is the array of the library histogram. T is a predetermined ‘sensitivity’ parameter. T represents the desired confidence level. For example, if T = 1, 2, or 3 then the confidence level would be 68, 95 and 99.7%, respectively. The larger T is, the less sensitive the quick comparison will be. That is, more histograms will pass, hence F will be larger. In the worst case, F= N.

There are various types of error associated with any type of instrumentation. In XRF, there is a counting error, which is random in nature. The discrete counted events, i.e. the X-rays, may be treated as forming a random sequence which may be described by the Poisson distribution [5]. Since many of these events occur during the data acquisition, the central limit theorem can be applied, and the Poisson distribution may be replaced with a 'Normal' or Gaussian distribution. The details are beyond the scope of this paper; however, it can be shown that the counting error is approximately the square root of the count rate (counts per second). Therefore, the counting error or standard deviation associated with Y_max is approximately √Y_max.
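Combining the quick-comparison rule with this counting-error estimate gives a simple screening routine. The Python sketch below is an illustration under the reconstruction above; S and K are histogram arrays, o_max is the channel of the largest local maximum, and T the sensitivity parameter:

    import math

    def quick_compare(S, K, o_max, T):
        # Pass if the amplitudes at the largest peak agree to within T
        # standard deviations, the counting error being roughly sqrt(count).
        return abs(S[o_max] - K[o_max]) <= T * math.sqrt(S[o_max])

    def screen_library(S, library, o_max, T):
        # Returns the F candidate histograms that survive the quick comparison.
        return [K for K in library if quick_compare(S, K, o_max, T)]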

For the full comparison, the same integration limits determined for the unknown histogram are applied to the F histograms that passed the quick comparison. The integrations are performed on the unsmoothed histogram; the smoothed histogram z is used only for finding integration limits. The unknown and library histograms must be integrated, and R sums will be created for each histogram. U_i and L_i represent the sums for the unknown and library histograms being compared, respectively, as shown in the following formula:

    U_i = Σ_{j=A_i}^{B_i} S[j],    L_i = Σ_{j=A_i}^{B_i} K[j].    (2.3)

i is a number from 1 to R, corresponding to an integration area of the histogram. Assume that C is the total number of channels in the histogram. j is the channel number, which is the x-axis of the histogram graph, and may range from 0 to C - 1. A_i and B_i are the lower and upper limits, respectively, of integration for the ith window. S represents the array containing the histogram of the unknown sample. U_i and L_i represent integration results for the ith window of the unknown and library histograms, respectively. There are R integration results for each histogram, since there are R integration 'areas'.

The unknown and library histograms are compared by checking the condition:

    M > Σ_{i=1}^{R} (U_i - L_i)^2.    (2.4)

M is a predetermined constant. The more similar the unknown and library histograms are, the smaller the right side of condition 2.4 will be. When the right side is smaller than the predetermined


constant M, the library histogram is said to match the unknown histogram, thereby identifying the material. This formula must be applied to F histograms in the library.
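The full comparison is then equations (2.3) and (2.4) applied to each surviving candidate. A minimal Python sketch, assuming A and B hold the R lower and upper integration limits:

    def integrate(hist, A, B):
        # Equation (2.3): one sum per integration area [A_i, B_i].
        return [sum(hist[j] for j in range(A[i], B[i] + 1))
                for i in range(len(A))]

    def full_compare(S, K, A, B, M):
        # Condition (2.4): the histograms match if the summed squared
        # difference of their area integrals stays below the constant M.
        U = integrate(S, A, B)
        L = integrate(K, A, B)
        return sum((u - l) ** 2 for u, l in zip(U, L)) < M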

So, basically the material identification algorithm carries out the following tasks:

1. Create z, the smoothed version of the unknown histogram.
2. Create z1, the 'first derivative' type function of z.
3. Find local maxima and integration limits for all R integration areas of the unknown histogram, using z and z1, respectively.
4. Perform the quick comparison of the unknown histogram against all N library histograms to obtain the subset of F library histograms to be used in the full comparison.
5. Evaluate the integrations of the unknown histogram.
6. Using the same integration limits, evaluate the integrations of the F library histograms.
7. Using condition 2.4, compare the unknown histogram to the F library histograms to find a match. It is assumed that a single match exists.
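A top-level driver composing the seven tasks might look as follows; this is a sketch only, reusing the hypothetical helpers smooth, find_areas, screen_library and full_compare introduced above:

    def identify(S, library, b, P, T, M):
        z = smooth(S, b)                           # task 1
        O, A, B = find_areas(S, z, b, P)           # tasks 2-3
        if not O:
            return None
        o_max = max(O, key=lambda d: S[d])         # channel of the largest peak
        for K in screen_library(S, library, o_max, T):   # task 4
            if full_compare(S, K, A, B, M):        # tasks 5-7
                return K                           # a single match is assumed
        return None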

3. SEQUENTIAL ALGORITHM CONSIDERATIONS

Consider a hypothetical microprocessor in which a single addition, subtraction, or logic operation requires t_add clock cycles, while a multiplication requires t_mul clock cycles.

To obtain z, the smoothed version of the unknown histogram, requires 2b + 1 additions for each element of the histogram. However, not all C elements are used; b elements are omitted from each end of the histogram, which results in (2b + 1)(C - 2b) total additions. To find the integration limits for R integration areas requires the creation of the z1 array, and z1 must be scanned to find the local maxima. The total number of steps can be conservatively estimated as 3(2b + 1)(C - 2b). Actually, z1 uses C - 4b elements of the histogram; however, b is assumed to be small compared to C. The upper and lower integration limits are found by searching the area around the local maxima. This results in approximately:

    Σ_{i=1}^{R} (B_i - A_i)    (3.1)

addition/comparison steps. Since the above steps are applied only to the unknown histogram, and are therefore small in number in comparison to the number of steps required to process F library histograms, the time required to find the integration limits will be ignored when discussing a general case where the library size N is assumed to be large compared to the number of elements C in the histogram, and N ≥ F >> C.

Each area of integration requires V additions to be evaluated. There are R areas for each histogram. Therefore, VR additions are required to get the R sums of the unknown histogram. The quick comparison involves all N library histograms, while the full comparison is only performed on F library histograms. So, there are (F + 1)VR additions to get the sums of the unknown histogram and the F library histograms.

To evaluate condition 2.4 requires R subtractions, one for each area of integration. This subtracts the library histogram area from the unknown histogram area. Since each subtracted sum is squared, there are R multiplications. The results of the R multiplications are summed, requiring R additions. This results in a total of FR subtractions, FR additions, and FR multiplications. Since the time to process additions and subtractions is assumed to be the same, effectively the evaluation of the right term in condition 2.4 requires 2FR additions, and FR multiplications. The right term is compared each time to the value M, which is a given constant. If the inequality is satisfied, a ‘true’ is stored in a results array, indicating a match with the library histogram, otherwise, a ‘false’ is stored. This requires F comparisons.

In the general case, using a library of size N the processing time required is:

    T_sequential = ((F + 1)VR + 2FR + F + N)·t_add + (FR)·t_mul    (3.2)


which is on the order of O(FVR); word operations are assumed throughout the paper. The amount of time required to solve this problem depends on F. In the worst case, F = N; therefore, the upper bound on the time complexity for this problem is O(NVR).
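Equation (3.2) is straightforward to evaluate numerically; a small Python sketch, using the illustrative values t_add = 1 and t_mul = 5 adopted in Section 4:

    def t_sequential(F, N, V, R, t_add=1, t_mul=5):
        # Equation (3.2): sequential cycle count, O(FVR).
        return ((F + 1) * V * R + 2 * F * R + F + N) * t_add + F * R * t_mul

    # Worst case F = N gives the O(NVR) upper bound.
    print(t_sequential(F=100, N=100, V=64, R=10))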

4. PARALLEL RANDOM ACCESS MACHINE (PRAM) IMPLEMENTATION

In order to derive a general case for an SM (shared memory) SIMD (single instruction stream, multiple data stream) PRAM using n processors to search a library of size N, it is first necessary to determine how each part of the algorithm can be calculated in parallel. The very restrictive EREW (exclusive read, exclusive write) model is assumed [4].

In a general case, to use n processors to search a library of size N, we must distinguish among three cases. In the first case (n < V), there are not enough processors to work on more than one integration area at a time. In the second case (V ≤ n ≤ RV), multiple integration areas of the same histogram can be processed, but only one histogram can be processed at a time. In the third case (n > RV), more than one histogram can be processed at one time.

In order to simplify analysis, the following parameters are defined:

α: time required to integrate one area of a histogram.
β: time required to perform quick comparisons.
Γ: time to subtract and square all areas of the unknown and library histograms.
Δ: time to sum the squared terms to obtain the 'final' sum that will be compared with M in condition 2.4.
Ψ: time to evaluate condition 2.4 on F histograms.

Since there are three different cases to be considered for the PRAM, the above parameters will be subscripted as p1, p2, and p3 for each respective case.

Case 1. (n < V)

Each area of integration requires V additions to be evaluated. Let α_p1 be the processing time required to evaluate each area. With n processors, the evaluation of each area requires:

    α_p1 = (log n + ⌈V/n⌉)·t_add    (4.1)

steps. The first term in equation 4.1 represents additions through emulation of a binary tree of processing elements (PEs), while the second term represents internal PE additions.

To develop a general PRAM model, first the fine-grain calculations are considered. For all N histograms, a quick comparison is performed. The quick comparison must be performed before the full comparison. Let β_p1 be the processing time for the quick comparisons. Therefore:

    β_p1 = ⌈N/n⌉·t_add.    (4.2)

For each full comparison, there are R subtractions followed by R multiplications. The multiplications cannot be performed until the subtractions are complete, due to data dependency and the SIMD mode of computation. Let Γ_p1 be the processing time for these steps. Therefore:

    Γ_p1 = ⌈R/n⌉·(t_add + t_mul).    (4.3)

Then, to complete evaluation of the comparison formula (condition 2.4), R values must be summed. Let Δ_p1 be the processing time required for these steps. Therefore:

    Δ_p1 = (log(min{R,n}) + ⌈R/n⌉)·t_add.    (4.4)

Finally, this value must be compared to M, for all F histograms that are fully compared. Let Ψ_p1 be the processing time required for these steps. Therefore:

    Ψ_p1 = ⌈F/n⌉·t_add.    (4.5)


There are R areas per histogram. F histograms in the library must be searched, plus the unknown histogram for a total of F + 1 histograms to be searched.

    Time for all integrations = (F + 1)R·α_p1.    (4.6)

The unknown histogram must be compared with each library histogram, therefore F comparisons must be made. Equations (4.3) and (4.4) show the processing steps required for each histogram. Therefore:

    Time to compute F final sums = F(Γ_p1 + Δ_p1).    (4.7)

Adding equations (4.2) and (4.5)-(4.7) results in the total time to search an N-size library using n processors:

    T^F_PRAM = (F + 1)R·α_p1 + β_p1 + F(Γ_p1 + Δ_p1) + Ψ_p1.    (4.8)

Case 2. (V ≤ n ≤ RV)

In this case, one integration area can be completed in the minimum time. If enough processors are available, multiple areas can be calculated concurrently, since the areas have no data dependencies. The total processing time depends on how many areas can be calculated simultaneously. Basically, this time is given by:

    ⌈(total number of areas in a histogram)/(number of areas simultaneously processed)⌉ × (time to process one area).    (4.9)

This expression can be further evaluated since the number of areas that can be simultaneously processed is the number of processors available divided by the number of processors needed for the fastest possible evaluation of an area. This is given by:

    number of areas that can be simultaneously processed = ⌊n/V⌋.    (4.10)

In this case, processing one area would use V processors emulating a binary tree structure. Therefore, the time required to integrate a single area of the histogram is:

    α_p2 = (log V)·t_add.    (4.11)

Since this must be done for F library histograms plus the unknown sample, the complete expression for evaluating the integration areas is:

    Time for all integrations = (F + 1)·⌈R/⌊n/V⌋⌉·α_p2.    (4.12)

For the comparisons, Γ_p2 is evaluated by:

    Γ_p2 = t_add + t_mul,    (4.13)

assuming that n ≥ R. Δ_p2 is evaluated by:

    Δ_p2 = (log R)·t_add.    (4.14)

The total processing time is:

    T^F_PRAM = (F + 1)·⌈R/⌊n/V⌋⌉·α_p2 + F(Γ_p2 + Δ_p2) + β_p2 + Ψ_p2,    (4.15)

where β_p2 = β_p1 and Ψ_p2 = Ψ_p1.

Case 3. (n > RV)

In this case, for the integration results, all R areas are calculated at once in the minimum number of steps. For every RV processors, another histogram can be simultaneously processed. Therefore:

    T^F_PRAM = ⌈(F + 1)/⌊n/(RV)⌋⌉·α_p3 + ⌈F/⌊n/(RV)⌋⌉·(Γ_p3 + Δ_p3) + β_p3 + Ψ_p3.    (4.16)

(The p3 parameters are equal to the corresponding p2 parameters.)
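The three case expressions translate directly into a cost model. The Python sketch below is illustrative only; it assumes the reconstructed forms of equations (4.1)-(4.5) and selects the applicable case from n:

    from math import ceil, floor, log2

    def t_pram(n, N, F, V, R, t_add=1, t_mul=5):
        # Total PRAM time: equations (4.8), (4.15) and (4.16).
        beta = ceil(N / n) * t_add              # quick comparisons (4.2)
        psi = ceil(F / n) * t_add               # final comparisons against M (4.5)
        if n < V:                               # case 1
            alpha = (log2(n) + ceil(V / n)) * t_add
            gamma = ceil(R / n) * (t_add + t_mul)
            delta = (log2(min(R, n)) + ceil(R / n)) * t_add
            return (F + 1) * R * alpha + beta + F * (gamma + delta) + psi
        alpha = log2(V) * t_add                 # cases 2 and 3
        gamma = t_add + t_mul
        delta = log2(R) * t_add
        if n <= R * V:                          # case 2
            return ((F + 1) * ceil(R / floor(n / V)) * alpha
                    + F * (gamma + delta) + beta + psi)
        h = floor(n / (R * V))                  # case 3: histograms in parallel
        return (ceil((F + 1) / h) * alpha
                + ceil(F / h) * (gamma + delta) + beta + psi)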



Fig. 2. Speedup for the PRAM: fixed-workload (FW) and fixed-time (FT) speedup versus the number of processors (log scale), with the ideal speedup shown for reference.

To evaluate the theoretical performance of the parallel machine, speedup and efficiency are considered. Efficiency is defined as:

    Efficiency = speedup/(number of processors).    (4.17)

Efficiency is an indication of the actual degree of speedup performance achieved as compared with the ideal speedup. In order to make an evaluation, a hypothetical example was considered where t_add = 1, t_mul = 5, R = 10, and V = 64. Also, the fixed-workload (FW) and fixed-time (FT) models were considered. Figures 2 and 3 show the speedup and efficiency for each model. Using these values, first the fixed-workload model was considered. The efficiency dropped sharply as the number of processors increased. This indicates that adding extra processors did not significantly increase performance for n ≥ 4.

For the fixed-time model, the processing time was held constant for the parallel machine. The main goal in this case is to improve accuracy by increasing the integration limits (V) to include more of the histogram in the comparison. To calculate fixed-time speedup, a fixed-size library was considered; then the amount of time needed to process areas of size V on a sequential machine was calculated. V' is a scaled parameter reflecting the increased workload that requires the same amount of time on the parallel machine. The times required to process V' on the parallel and sequential machines were used to derive the speedup and efficiency for the fixed-time model. In this case, the efficiency remained close to 1 as the number of processors increased. This is an indication of the high scalability of this problem, which makes it a good candidate for implementation on a parallel machine.

Fig. 3. Efficiency for the PRAM: fixed-workload and fixed-time efficiency versus the number of processors (log scale).

5. HYPERCUBE IMPLEMENTATION

The hypercube model is based on the PRAM model discussed in the previous section. However, in this model there is the added overhead of t_com, the time for interprocessor communication between neighboring processing elements (PEs). Throughout this paper, it is assumed that each PE has a local memory which contains all initial data for the algorithm to begin. However, the time to send calculated results to other PEs as input will need to be considered.

This algorithm uses binary trees mapped onto the hypercube topology. The binary tree is well suited for tasks such as calculating sums and products of an array of numbers, which this algorithm uses. The binary tree can be mapped in an optimal manner onto the hypercube for the implementation of these operations [4].

As with the PRAM, there are three cases: in the first case (n < V), there are not enough processors to work on more than one integration area at a time. In the second case (V ≤ n ≤ RV), multiple integration areas of the same histogram can be processed, but only one histogram can be processed at a time. In the third case (n > RV), more than one histogram can be processed at one time. The subscripts h1, h2, and h3 are used to describe the previously defined parameters for the hypercube model.

Case 1. (n < V)

α_h1 is evaluated by:

    α_h1 = (log n + ⌈V/n⌉)·t_add + (log n)·t_com,    (5.1)

since for all but the final addition, all intermediate results must be passed to a neighboring PE for further processing.

Each processor must perform a quick comparison on the histograms it has in its local memory. This requires β_h1 processing time, which is equal to β_p1.

For the full comparison, Γ_h1 is evaluated by:

    Γ_h1 = ⌈R/n⌉·(t_add + t_mul).    (5.2)

To complete the evaluation, the R squared integration values must be summed. Δ_h1 is evaluated by:

    Δ_h1 = (log(min{R,n}) + ⌈R/n⌉)·t_add + log(min{R,n})·t_com,    (5.3)

since now the communication through the binary tree is considered. Each processor must now compare the output of equation (5.3) to the predetermined constant M. All information is within the local memory of the PE, so there is no communication overhead. This requires Ψ_h1 processing time, which is equal to Ψ_p1.

The total number of clock cycles required is:

    T^F_cube = (F + 1)R·α_h1 + β_h1 + Ψ_h1 + F(Γ_h1 + Δ_h1).    (5.4)

Case 2. (V ≤ n ≤ RV)

The time to process one area, α_h2, is evaluated by:

    α_h2 = (log V)·t_add + (log V)·t_com.    (5.5)


The complete expression for evaluating the integration areas is:

    Time for all integrations = (F + 1)·⌈R/⌊n/V⌋⌉·α_h2.    (5.6)

Γ_h2 is evaluated by:

    Γ_h2 = t_add + t_mul.    (5.7)

Δ_h2, the processing time for summing the R values, is evaluated by:

    Δ_h2 = (log R)·t_add + (log R)·t_com.    (5.8)

Therefore:

    T^F_cube = (F + 1)·⌈R/⌊n/V⌋⌉·α_h2 + F(Γ_h2 + Δ_h2) + β_h2 + Ψ_h2,    (5.9)

where β_h2 = β_h1 and Ψ_h2 = Ψ_h1.

Case 3. (n > RV)

This is similar to case 3 in the PRAM, except that the binary tree communication overhead has been included:

    T^F_cube = ⌈(F + 1)/⌊n/(RV)⌋⌉·α_h3 + β_h3 + Ψ_h3 + ⌈F/⌊n/(RV)⌋⌉·(Γ_h3 + Δ_h3).    (5.10)

(The h3 parameters are equal to the corresponding h2 parameters.)
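The hypercube model differs from the PRAM model only in the t_com terms, so the cost model extends naturally. Again a sketch under the reconstructed equations, with t_com defaulting to the computation/communication ratio of 25 used later in the paper:

    from math import ceil, floor, log2

    def t_hypercube(n, N, F, V, R, t_add=1, t_mul=5, t_com=25):
        # Total hypercube time: equations (5.4), (5.9) and (5.10).
        beta = ceil(N / n) * t_add               # local quick comparisons
        psi = ceil(F / n) * t_add                # local comparisons against M
        if n < V:                                # case 1
            alpha = (log2(n) + ceil(V / n)) * t_add + log2(n) * t_com
            gamma = ceil(R / n) * (t_add + t_mul)
            delta = ((log2(min(R, n)) + ceil(R / n)) * t_add
                     + log2(min(R, n)) * t_com)
            return (F + 1) * R * alpha + beta + psi + F * (gamma + delta)
        alpha = log2(V) * (t_add + t_com)        # cases 2 and 3
        gamma = t_add + t_mul
        delta = log2(R) * (t_add + t_com)
        if n <= R * V:                           # case 2
            return ((F + 1) * ceil(R / floor(n / V)) * alpha
                    + F * (gamma + delta) + beta + psi)
        h = floor(n / (R * V))                   # case 3
        return (ceil((F + 1) / h) * alpha + beta + psi
                + ceil(F / h) * (gamma + delta))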

6. THE REDUCED HYPERCUBE FAMILY OF INTERCONNECTION NETWORKS

The main objective of the reduced hypercube† is to offer regular hypercube-like performance at reduced cost, due to lower VLSI complexity. The VLSI complexity of RHs allows the construction of systems with more PEs than regular hypercubes. Although [1] introduces RHs with several levels, only two-level structures are considered here. Note: in the RH terminology, n is not the number of processors, but a parameter which determines the number of processors.

The reduced hypercube RH(k,n) contains D nodes, where D = 2^{k+2^n}, k ≥ n and n ≥ 1. Let v = k + 2^n. Each node of the RH(k,n) is attached to k + 1 edges. Each node in the regular hypercube with the same number of nodes is attached to v edges. Therefore, each node of the D-node RH has 2^n - 1 fewer edges than the nodes in the corresponding D-node regular hypercube.

The D-node RH(k,n) is constructed from the D-node regular hypercube by uniformly removing 2^n - 1 edges from each of its nodes. To accomplish this, the v-bit addresses of hypercube nodes are first partitioned into two fields, the 0th and the 1st fields, as follows. The 0th field contains the k least significant bits of the v-bit node address. This field represents the address of the node within a complete k-cube, which will be referred to as a building block (BB). The 1st field contains the 2^n most significant bits of the v-bit node address. It represents the address of the BB that contains the node. In addition, a subfield is identified in the 0th field, the 0th subfield. It contains the n most significant bits of the k-bit 0th field. It represents the address of a (k - n)-dimensional subcube, which will be referred to as a subblock (SB), within the k-cube BB that contains the node.

In order to reduce the v-cube into the RH(k,n), out of the v edges of each node the following two sets are kept, leaving k + 1 edges at each node.

Set 1. The k edges of the v-cube that traverse the k lowest dimensions (i.e. dimensions 0 through k - 1) and connect the referenced node with k distinct nodes are kept. As a result, a complete k-cube BB that includes the referenced node is maintained.

Set 2. This set contains only one edge, which is also present in the original v-cube. This edge is the one which directly connects the referenced node with the node whose address differs from this node's address only in the mth bit of the 1st field, where m is the decimal value in the 0th subfield and 0 ≤ m ≤ 2^n - 1.

The resultant RH(k,n) contains 2^{2^n} k-cube BBs. It can be viewed as a 2^n-cube of k-cubes. A BB address forms the 2^n most significant bits (i.e. the 1st field) of the v-bit addresses of the contained nodes. Each BB is divided into 2^n SBs, which are uniquely identified by the 0th subfield of the contained nodes. Connections between pairs of SBs in different BBs are as follows: a node in a particular SB of a particular BB is connected directly to the node with the same 0th field in the BB whose 2^n-bit address differs from the referenced node's 1st field only in the mth bit, where m is the value in the 0th subfield of the former node.

To conclude, the address of each node in the RH(k,n) is formed as shown in Fig. 4, where ○ denotes concatenation.

Fig. 4. Node addresses: (2^n-bit BB address, the 1st field) ○ (n-bit SB address, the 0th subfield) ○ ((k-n)-bit node address in the SB); the last two fields together form the k-bit 0th field.

†The information in this section was taken from [1].
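The two edge sets translate directly into an address computation. The following Python sketch is an illustration of the scheme described above (node addresses are v-bit integers, with v = k + 2^n):

    def rh_neighbors(node, k, n):
        # Neighbors of `node` in RH(k,n): k intra-BB edges plus one inter-BB edge.
        v = k + 2 ** n
        assert 0 <= node < 2 ** v
        # Set 1: the k edges within the node's k-cube building block (BB).
        neighbors = [node ^ (1 << d) for d in range(k)]
        # Set 2: one inter-BB edge. m is the value of the 0th subfield,
        # i.e. the n most significant bits of the k-bit 0th field.
        m = (node >> (k - n)) & (2 ** n - 1)
        neighbors.append(node ^ (1 << (k + m)))   # flip the mth bit of the 1st field
        return neighbors

For the RH(2,2) of Fig. 5, this gives every node k + 1 = 3 neighbors, consistent with the channel count quoted below (64 nodes with 3 edges each yields 96 edges, i.e. 192 directed channels).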

Fig. 5. The RH(2,2).


The RH(2,2) is identical to the cube-connected cycles network CCC(4) [6]. The popular CCC(k) is obtained from the k-cube by substituting a ring with k nodes for each node in the k-cube. Distinct nodes in each ring then implement connections in distinct dimensions. Despite its reduced hardware complexity, another impressive property of the RH(k,n) is its capability of emulating simultaneously, in an optimal manner, 2^{k-n} CCC(2^n)s. Figure 5 shows the structure of the RH(2,2). The 6-cube with 64 nodes contains 384 channels (two directions of data transfer for any two neighbors), whereas the RH(2,2), also with 64 nodes, contains only 192 channels.

Lower-cost RHs perform comparably to hypercubes [7,8]. Due to their success, a generalized family of RHs was investigated in [9].

7. REDUCED HYPERCUBE IMPLEMENTATION

In this paper, the special case of k = n is considered. However, it can easily be extended to the case where k > n. When k = n, each SB is simply a PE, and each BB becomes a k-cube of PEs. The RH can then be thought of as a 2^n-cube of n-cubes. In the rest of this section, k will be used instead of n. Since a hypercube can optimally emulate a binary tree with a many-to-one mapping [4] (distinct nodes from any single level of the binary tree are mapped to distinct hypercube nodes), the RH can represent a 'binary tree of binary trees' [7]. In the RH, there are 2^{2^k} BBs, and each BB has 2^k PEs. Therefore, up to k hops are required within a BB in order to reach a PE of a specific dimension. The dilation of a source edge is defined as the length of the shortest path that connects the images in the target topology of its two incident nodes. The worst-case dilation between two BBs is 2k + 1. Consider that each BB forms a (k + 1)-level binary tree. In order to form a (k + 2)-level binary tree, it is necessary to merge the roots of two (k + 1)-level binary trees from two different BBs into another BB. The communication overhead comes from two sources. Communication within the k-cube creates an overhead of log k. Communication between k-cubes creates an overhead of 2k + 1. Therefore, the total overhead is (log k + 2k + 1)t_com.

Communication overhead is higher in an RH than in a regular hypercube because some communication channels have been eliminated to reduce cost and complexity. This cost/complexity savings is the main motivation for using the RH topology. In order to optimize RH performance, it is necessary to distribute the processing load properly among the BBs. The workload can be distributed to a large number of BBs; this results in a smaller computational workload per PE than using a smaller number of BBs. However, the tradeoff is that the communication overhead is higher with a larger number of BBs. The goal is to choose the right number of BBs for the task such that the processing time is optimized.

Two parameters are used to determine the optimal number of BBs within an RH that should be used to emulate the binary tree:

1. Problem size, which is referred to in this paper as the 'data set'.
2. The ratio between the computation time of PEs and the communication time between neighboring PEs.

In order to determine the best distribution of the workload, a binary tree of k + 1 + y levels mapped onto an RH is considered, where y is a 'sharing parameter' and 0 ≤ y ≤ 2^k. y describes how many BBs are used for a binary tree mapping.

α_rh, the time to process one integration area on a binary tree on the RH, is given by:

    α_rh = [(k + y) + ⌈V/2^{k+y}⌉]·t_add + [y(2k + 1) + k]·t_com    (7.1)

where the rh subscript indicates parameters used in the RH implementation.

In this paper, three types of data sets are considered: small, medium, and large, which are one, two, and three orders of magnitude larger than the number of PEs, respectively.

In a multiprocessor example [10] using an M68000 microprocessor as the PE and high-speed serial links for inter-PE communication, the ratio of computation time to communication time was approximately 25. As VLSI technology improves and dedicated communication coprocessors become commonplace, this ratio is likely to decrease. Therefore, for this paper, three communication ratios were used: 25, 10 and 5.
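Equation (7.1) makes the computation/communication tradeoff explicit, and sweeping y over its range locates the fixed-workload optimum. The Python sketch below relies on the reconstructed form of equation (7.1); under these assumptions its optima appear consistent with the values reported below for the small and medium data sets:

    from math import ceil

    def alpha_rh(y, k, V, t_add=1, t_com=25):
        # Reconstructed equation (7.1): one integration area on a binary
        # tree of k + 1 + y levels mapped onto the RH.
        compute = (k + y + ceil(V / 2 ** (k + y))) * t_add
        communicate = (y * (2 * k + 1) + k) * t_com  # inter-BB merges + intra-BB hops
        return compute + communicate

    def best_y(k, V, t_com):
        # Fixed-workload optimization: the y minimizing processing time.
        return min(range(2 ** k + 1), key=lambda y: alpha_rh(y, k, V, t_com=t_com))

    # RH(2,2), small data set (V = 640): a larger t_com favors a smaller y.
    for t_com in (25, 10, 5):
        print(t_com, best_y(2, 640, t_com))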

For the following analysis with an example, an RH(2,2) was considered. This has the same number of PEs as a 6-dimensional regular hypercube (64 PEs), with half the number of communication channels. The small, medium and large data sets used in this example are 640, 6400 and 64000, respectively.

The equations that describe the RH model are based on the equations of the regular hypercube. However, the added communication overhead of the RH has been considered. Since most of the processing in this material identification algorithm involves integration of the R areas of the histograms using binary trees, examining the performance of these integrations alone will give an indication of how well suited the material identification algorithm is for the RH topology.

Communication overhead is defined as (total communication time)/(total processing time) × 100%. Figure 6 shows the communication overhead for the three data set sizes.

Fig. 6. Communication overhead for the RH (t_com = 25): overhead percentage versus the sharing parameter y for the small, medium and large data sets.

As the data set size increases, the percentage of time spent on communication decreases, since the increase in total time is due mostly to computational workload. The fact that communication overhead drops significantly as data set size increases indicates that a scaled workload may be necessary to obtain maximum benefit from the RH topology.

Fig. 7. Fixed workload optimization with small data set (normalized): processing time versus the sharing parameter y for t_com = 25, 10 and 5.

Fig. 8. Fixed workload optimization with medium data set (normalized): processing time versus the sharing parameter y for t_com = 25, 10 and 5.

The fixed workload (FW) optimization is defined as α_y/α_max. This is the time required to process a binary tree calculation for a particular value of y, divided by the worst-case binary tree calculation time. The minimum FW optimization indicates where processing time is minimized. It is normalized so that the worst case is equal to 1. Figure 7 shows fixed workload optimization using the small data set. As the communication time t_com changes, the value of y where the processing time is optimized may also change. Figure 8 shows fixed workload optimization using the medium data set. With the medium data set, the communication overhead is less of a factor, and y = 4 becomes the optimal value for the RH with a t_com of 10 or 5. With t_com = 25, y = 3 yields the best performance. With a large value of t_com, the communication overhead between BBs is the predominant factor in the total processing time. In this case, it is best to confine the binary tree to a small number of BBs. This forces each PE to perform more computations, but the savings in communication reduce the overall processing time. As shown in Fig. 7, when t_com = 25, the RH is optimized when y = 0; that is, the (k + 1)-level binary tree contained in one BB yields the best performance. As t_com is reduced, the RH is optimized at a larger y. For example, using the small data set and t_com = 5, the RH is optimized at y = 2. This means that 2^{k+y} = 2^4, i.e. 25% of the PEs in the RH, were used to form a (k + 3)-level binary tree. This yielded the best tradeoff between computation time and communication overhead.

Fig. 9. Fixed workload degradation as a function of data set size (worst case, t_com = 25): degradation versus the sharing parameter y for the small, medium and large data sets.

Comparative analysis of hypercube and RH performance for the material identification algorithm is now called for. The fixed workload degradation is defined as the inverse of the speedup of the regular hypercube compared with the corresponding RH with the same number of PEs. It is a measure of the decrease in performance, as compared to the regular hypercube, resulting from the missing communication channels. As shown in Fig. 9, as the data set increases, the computation time becomes the predominant factor in the total processing time. With the small data set, y = 0 is the optimal RH distribution; for the medium data set, y = 2; and for the large data set, y = 4. In the case of the large data set, the optimal FW processing time is achieved when the entire RH is used, since the computational time savings achieved by increasing the level of the binary tree are large compared to the added communication overhead.
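Under these definitions, the FW degradation can be estimated as the ratio of RH to hypercube times for the same tree reduction. A sketch, reusing alpha_rh from the previous fragment and assuming each tree level on the regular hypercube costs one addition plus one single-hop transfer:

    from math import ceil

    def alpha_hc(y, k, V, t_add=1, t_com=25):
        # Same-size binary tree on the regular hypercube: each of the
        # k + y tree levels costs one addition plus one neighbor transfer.
        levels = k + y
        return (levels + ceil(V / 2 ** levels)) * t_add + levels * t_com

    def fw_degradation(y, k, V, t_com):
        # Inverse speedup of the hypercube relative to the RH (1.0 = no loss).
        return alpha_rh(y, k, V, t_com=t_com) / alpha_hc(y, k, V, t_com=t_com)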

The material identification algorithm can be applied to the RH using the concepts of small, medium, and large data sets. Some parameters, such as R, will always be small data set parameters. Parameters such as V can be medium or large data set parameters. In the PRAM and hypercube, three cases were considered, based on the number of processors. In the RH, the BBs are composed of k-cubes.

The quick comparisons are performed on all N histograms. Considering N to be of medium/large data set size, the 'sharing parameter' y is incorporated. Therefore, β_rh is evaluated by:

    β_rh = ⌈N/2^{k+y}⌉·t_add.    (7.2)

For the full comparisons, the R integration areas of the unknown and library histograms are subtracted and squared. Since R typically ranges from 5 to 15, it is considered to be of small data set size. Therefore, no sharing would be implemented, since the inter-BB communication overhead would outweigh the decrease in computation time. Γ_rh is evaluated by:

    Γ_rh = ⌈R/2^k⌉·(t_add + t_mul).    (7.3)

To complete the evaluation, the R results are summed. Therefore, Δ_rh becomes:

    Δ_rh = (log(min{R, 2^k}) + ⌈R/2^k⌉)·t_add + log(min{R, 2^k})·t_com.    (7.4)

As before, y is not considered, since R is of small data set size. The F histograms are compared to M, a predetermined constant, as the final step in the material identification. F is considered to be of medium/large data set size, therefore Ψ_rh is evaluated by:

    Ψ_rh = ⌈F/2^{k+y}⌉·t_add.    (7.5)

Therefore:

    T^F_RH = (F + 1)R·α_rh + β_rh + Ψ_rh + F(Γ_rh + Δ_rh).    (7.6)

When y is less than 2^k, not all of the PEs in the RH are being used for the binary tree mapping. In this case, the other PEs can be used to form binary tree(s) to work on another 'instance' of data, i.e. another histogram. In the RH(2,2), when y = 0, only 1/16 of the RH is being used. There are 15 other identical k-cubes that can be used to process other data, if the workload can be scaled accordingly. To analyze the scaled workload degradation, the time for an RH to process multiple instances of data is compared to the time for a regular hypercube to process the same amount of data. Figure 10 shows the scaled workload degradation for the RH(2,2).

Fig. 10. Scaled workload degradation as a function of data set size (worst case, t_com = 25): degradation versus the sharing parameter y for the small and medium data sets.

Since the RH and its corresponding regular hypercube have the same number of processors, the difference in performance between the two is due mostly to inter-PE communication. Since with y = 0 (no BB sharing) inter-PE communication is minimized, the scaled workload degradation approaches 1 as the data set size increases. However, there are limits to which a problem can be scaled, so it may not always be possible to operate with a scaled workload and y = 0. Some problems may not have several instances of the same type of data. If there are a large number of data dependencies, many processors in the RH could be forced to be idle for a large portion of the total processing time. However, the material identification algorithm uses a library of many instances of the same type of data (histograms). Therefore, this problem can be operated in a scaled mode.

8. CONCLUSIONS

Material identification algorithms were presented in this paper for the PRAM, hypercube, and reduced hypercube parallel systems. In order to obtain maximum benefit from the RH topology, communication time and data set size are important factors that must be considered. To achieve the shortest processing time for a fixed workload, these factors are used to determine the binary tree level that will optimize performance by balancing the tradeoff between communication overhead and computation time. As technology improves, communication overhead will be reduced and RH feasibility will continue to improve. Nevertheless, the analysis here assumed packet-switched routing for data transfers. With the popular wormhole routing technique [11], the communication time does not depend heavily on the dilation; therefore, much better performance is achieved on the RH.

The following criteria are needed for a scaled workload:

1. The problem contains multiple instances of the same type of processing.
2. The problem contains a large data set size.

If these criteria are met, then it may be possible to scale the workload. With a large data set, a scaled workload will achieve the lowest degradation. The low degradation means that hypercube-like performance is achieved at significant cost/complexity reduction.

The RH topology enables the construction of massively parallel systems. Therefore, the RH topology benefits the material identification problem by enabling a larger library to be searched, and improving accuracy by using more integration areas, without compromising the computation time. The material identification problem fits the criteria listed above, and therefore is a good candidate for RH implementation. The RH has the potential to make parallel systems feasible for material identification, as well as many other practical applications.

Acknowledgements-The work presented in this research was supported in part by the National Science Foundation under Grants CCR-9109084, CDA-9121475, and DMI-9500260.


REFERENCES

1. Ziavras, S. G., RH: a versatile family of reduced hypercube interconnection networks. IEEE Trans. Parallel Distributed Systems, 1994, 5, 1210-1220.
2. Ziavras, S. G., Scalable multifolded hypercubes for versatile parallel computers. Parallel Processing Letters, 1995, 5, 241-250.
3. Ziavras, S. G., On the problem of expanding hypercube-based systems. Journ. Parallel Distributed Computing, 1992, 16, 41-53.
4. Akl, S. G., The Design and Analysis of Parallel Algorithms. Prentice Hall, Englewood Cliffs, NJ, 1989.
5. Woldseth, R., X-Ray Energy Spectrometry. Kevex Corp., Burlingame, CA, 1973.
6. Preparata, F. P. and Vuillemin, J., The cube-connected cycles: a versatile network for parallel computation. Comm. ACM, 1981, 24, 300-309.
7. Ziavras, S. G. and Sideras, M. A., Facilitating high-performance image analysis on reduced hypercube (RH) parallel computers. In Parallel Image Analysis: Theory and Applications, ed. Davis, L. S., Inoue, K., Nivat, M., Rosenfeld, A. and Wang, P. S. P., Series Mach. Perception Artif. Intell., Vol. 19, pp. 23-42. World Scientific, Singapore, 1996. Also, Intern. J. Pattern Recogn. Artif. Intell., 1995, 9, 679-698, Special Issue on Parallel Image Analysis.
8. Ziavras, S. G. and Mukherjee, A., Data broadcasting and reduction, prefix computation, and sorting on reduced hypercube parallel computers. Parallel Computing, 1996, 22, 595-606.
9. Ziavras, S. G., Generalized reduced hypercube interconnection networks for massively parallel computers. In Interconnection Networks and Mapping and Scheduling Parallel Computations, ed. Hsu, D. F., Rosenberg, A. and Sotteau, D., Amer. Math. Soc. Book Series Discr. Math. Theor. Computer Science, Vol. 21, pp. 307-325, 1995.
10. Hwang, K., Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, New York, 1993.
11. Dally, W. J. and Seitz, C. L., The torus routing chip. Distributed Computing, 1986, 1, 187-196.

AUTHORS' BIOGRAPHIES

Michael Kahn received the M.Sc. degree in Electrical Engineering from the New Jersey Institute of Technology in 1994.

Dr Sotirios G. Ziavras received the Ph.D. degree from George Washington University in 1990. From 1988 to 1989 he was also with the Center for Automation Research at the University of Maryland. He joined the ECE Department at NJIT in Fall 1990, where he is an Associate Professor. He also holds a joint appointment in the CS Department. He is a member of the Advisory Committee for the Computer and Information Science Section of the New York Academy of Sciences, and is an Associate Editor of the Pattern Recognition journal. He has authored more than 50 refereed papers.