

[IEEE Comput. Soc. Press, IEEE International Workshop on Defect and Fault Tolerance in VLSI Systems, Montreal, Que., Canada, 17-19 Oct. 1994]

Roundoff Error-Free Tests in Algorithm-Based Fault Tolerant Matrix Operations on 2-D Processor Arrays

Dah-Yea D. Wei, Jung H. Kim*, and T.R.N. Rao
Center for Advanced Computer Studies, University of Southwestern Louisiana
Lafayette, Louisiana 70504-4330

*School of Information Science, JAIST
Tatunokuchi, Ishikawa 923-12, JAPAN

Abstract

Assaad and Dutt [2] proposed the Hybrid Checksum test method for floating-point matrix-matrix multiplication in an ABFT environment, by which the error coverage can be greatly increased. However, the threshold test in their approach is still necessary in the floating-point addition part of the matrix multiplication, and the number of detected errors decreases as the dynamic range of the data increases. Here, instead of using the threshold floating-point checksum test, we present an effective method, called the Concurrent Floating-Point Checksum (CFPC) test. The proposed CFPC test provides complete error detection/correction capabilities in floating-point additions with less time latency and hardware overhead, regardless of the dynamic range of the input data.

1. Introduction

The floating-point detection/correction problem in an Algorithm-Based Fault-Tolerance (ABFT) computing environment [1] has long been unsolved, since the checksum tests it relies on are susceptible to roundoff inaccuracies caused by the limited precision of contemporary machines. Thus, a number of false alarms and undetected errors will occur because of roundoff. By a false alarm and an undetected error we mean, respectively, that in floating-point computations the checksum test might fail even in a fault-free case and pass in the presence of errors. Most researchers have tried to overcome these problems by adopting proper encoding schemes to minimize the ratio of numerical error and/or by choosing a suitable threshold that depends on the actual input data [1,5,6,7]. However, the selection of a proper threshold is generally a difficult issue, and the test results are not completely satisfactory in most cases. Choosing a small threshold results in many false alarms during the test; on the other hand, many errors may go undetected when a large threshold is used. Hence, a generally acceptable approach for completely checking floating-point computations in an ABFT environment has remained out of reach.
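The threshold dilemma can be reproduced in a few lines. The sketch below is illustrative only; the matrix size, value ranges, and threshold are our own choices, not taken from the paper. It compares the two sides of a floating-point column-checksum identity: an exact comparison flags a fault-free computation as faulty, while a threshold wide enough to absorb the roundoff also absorbs an injected error of comparable size.

```python
import random

def checksum_sides(A, x):
    """Return (sum of A@x, checksum-row @ x): the two sides of the
    floating-point column-checksum identity for an n x n matrix A."""
    n = len(A)
    y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    colsum = [sum(A[i][j] for i in range(n)) for j in range(n)]
    return sum(y), sum(colsum[j] * x[j] for j in range(n))

random.seed(1)
n = 40
# A wide dynamic range makes the roundoff on the two sides diverge.
A = [[random.uniform(-1e12, 1e12) for _ in range(n)] for _ in range(n)]
x = [random.uniform(-1e6, 1e6) for _ in range(n)]

lhs, rhs = checksum_sides(A, x)
noise = abs(lhs - rhs)
print(noise > 0.0)        # roundoff makes the sides differ: a false alarm

tau = 10 * noise          # a threshold wide enough to absorb the roundoff
err = noise               # an injected error no bigger than the noise
print(abs((lhs + err) - rhs) <= tau)   # True: the error goes undetected
```

Shrinking `tau` revives the false alarm; growing it hides larger errors. No fixed threshold serves both goals, which is precisely the dilemma described above.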

In this paper we present, instead of the threshold floating-point checksum test, another method, called the Concurrent Floating-Point Checksum (CFPC) test, to do complete checking of floating-point additions. With reduced hardware redundancy and time latency compared to the threshold floating-point checksum method, the CFPC approach provides complete error coverage in the floating-point addition part of a matrix multiplication. Notice that error detection as well as correction in the exponent part of a floating-point format is less error-prone and easy to achieve by any simple checking method, and is thus not covered in this discussion.

This paper is organized as follows. In Section 2, we briefly overview the Hybrid Checksum (HC) test proposed by Assaad and Dutt [2]. The new approach proposed in this paper, called the Modified Hybrid Checksum (MHC) test, is described in Section 3. It consists of the Integer Mantissa Checksum test (similar to that in the HC scheme) and the Concurrent Floating-Point Checksum (CFPC) test. The implementation of the MHC approach on 2-D unidirectional processor arrays and the analysis of its hardware and time complexities are also discussed in that section. Finally, conclusions are given in Section 4.

1063-6722/94 $4.00 © 1994 IEEE

Testable Architectures

2. An Overview of the Hybrid Checksum (HC) Test

Assaad and Dutt [2] improved floating-point checksum capabilities in matrix-matrix multiplication with the HC test, which comprises the Integer Mantissa Checksum test for floating-point multiplications and the Threshold Floating-Point Checksum test for floating-point additions. The basic idea of the HC test is that, on one hand, the mantissa checksum test extracts the mantissas of all elements of the two matrices to form two integer matrices, and then performs a modified integer mantissa checksum test in a fixed-point, double-precision fashion. The algorithm of the integer mantissa checksum test is given in [2]. For the test of the floating-point summations, a properly thresholded version of the floating-point checksum test [2] is performed. A threshold Δ in a floating-point equality test means that if the difference between the two floating-point numbers being compared is at most Δ, the computation is regarded as fault-free; otherwise it is regarded as faulty. The most difficult issue of this HC test is to select a proper threshold Δ that offers the highest error detection and the lowest false alarm probabilities. Since the size of the roundoff errors on the two sides of the checksum equality test is strongly correlated with the dynamic range of the input data, the optimal threshold can be found (empirically) only if the dynamic range of the input data elements is known before the computation.
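The exactness of the integer mantissa path can be sketched as follows. This is a simplified illustration using double-precision mantissas and Python's unbounded integers, not the 23-bit single-precision, fixed-point version used in [2]: once mantissas are extracted to integers, the column-checksum identity holds exactly, whatever the dynamic range of the inputs.

```python
import struct

def mant(x):
    """Signed integer mantissa (52 stored bits plus the hidden bit)
    of an IEEE 754 double."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    m = bits & ((1 << 52) - 1)
    if (bits >> 52) & 0x7FF:        # normalized: restore the hidden 1 bit
        m |= 1 << 52
    return -m if bits >> 63 else m

def colsum(M):
    return [sum(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Inputs spanning ~55 orders of magnitude -- hopeless for a threshold test.
A = [[1e30, -2.5e-20], [3.0, 4e15]]
B = [[-7e10, 1.5], [2e-8, -9e25]]

Am = [[mant(a) for a in row] for row in A]
Bm = [[mant(b) for b in row] for row in B]
G = matmul(Am, Bm)                  # integer mantissa product matrix
# Exact equality, independent of the dynamic range of A and B:
print(colsum(G) == matmul([colsum(Am)], Bm)[0])   # True
```

Because every operation after extraction is integer arithmetic, the equality is exact; the residual weakness of the HC scheme lies entirely in the floating-point addition part, which is what the threshold test — and, in this paper, the CFPC test — must cover.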

Their results showed that the error coverage of the HC test is significantly better than that of either the mantissa checksum test or the floating-point checksum test alone. However, the threshold checksum method is necessary in order to increase the rate of error coverage. Moreover, the dynamic range of the actual input data needs to be known before a proper threshold can be chosen, and this choice is machine and data dependent. The error coverage of the HC test decreases as the dynamic range of the input data increases [2], which is undesirable when a data set with a large dynamic range is processed or when its dynamic range is unknown.

3. Modified Hybrid Checksum (MHC) Test

The motivations of this research are as follows: (1) error detection should work effectively regardless of the dynamic range of the input data; (2) the dynamic range should not need to be known before each computation; and (3) error coverage should remain high as the dynamic range increases.

3.1 Concurrent Floating-Point Checksum (CFPC) Test

The basic idea of the CFPC is to test before roundoff errors are accumulated in each PE. Similarly to the integer-based mantissa checksum test [2], two 54-bit integer registers and the integer ALU are used to perform the necessary checking operations in each PE. Before performing a floating-point addition, the mantissa and sign bits of its two operands, say FA and FB, of the floating-point products a_{i,k} · b_{k,j} and a_{i,k+1} · b_{k+1,j}, are extracted into two 54-bit signed-magnitude integer registers, IA and IB, respectively, as shown in Figure 1. One position of IA is reserved for the carry of the result, which may occur when both operands of a binary addition have the same sign.
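The extraction step can be illustrated in software. The sketch below is our own model, using `math.frexp` and a Python integer in place of the 54-bit signed-magnitude register: a float is split losslessly into a signed integer mantissa and an exponent, so the integer side starts from an exact copy of the operand.

```python
import math

def extract(x):
    """Split a finite float into (signed integer mantissa, exponent) so
    that x == m * 2**e exactly; a software stand-in for loading FA's
    sign and mantissa bits into the integer register IA."""
    f, e = math.frexp(x)            # x = f * 2**e with 0.5 <= |f| < 1
    return int(f * 2 ** 53), e - 53 # 53 significant bits, sign kept in m

m, e = extract(-0.15625)
print(m * 2 ** e == -0.15625)       # True: the pair represents x exactly
```

The same decomposition works for any finite double (try `extract(0.1)`); no rounding occurs, because the mantissa of a double fits in 53 bits.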


Our modified integer addition is similar to the normalized floating-point addition described in [4]. Before a floating-point "add" instruction is executed, the sign bits and mantissas of FA and FB are loaded into their corresponding integer registers IA and IB, respectively (only from FB to IB for the remaining additions). Concurrently with the floating-point addition, an integer adder adds the contents of these two registers, and the resulting sum is loaded back into register IA. Conceptually, these operations can be written as

IA ← IA + IB.

Four major stages must be executed to complete the addition of the two integer numbers:

1. Initialization and roundoff-bit checking.
2. Align the smaller operand according to the difference of the exponents of FA and FB.
3. Add the integers IA and IB.
4. Postnormalize the accumulated sum in IA.

Thus, after several iterations, when the matrix multiplication has completed, the final result of the additions is stored in IA, at which time a fifth stage can be used for error checking against the final result of the floating-point additions stored in FA. These addition stages, as well as the error detection stage, are depicted in the flowchart of Figure 2. Because of limited space, the detailed operations within each stage are given as separate algorithms, listed in self-explanatory pseudo-code in [9]; the theorem proving the error detection and correction capabilities is given in [8]. Notice that error checking is needed in INIT&ROUND-CHECK only when the exponent of the accumulated sum in FA is less than that of the incoming new product FB. This is simply because some of the least significant bits of FA, which might contain an error from the previous floating-point additions, are going to be rounded off in the next addition. The remaining errors during the computation can be detected either by the mantissa checksum test [2] or in the final checking stage (FINAL-CHECKING).
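The stages above can be mimicked in software. The following sketch is our own simplified model, not the paper's algorithm: thanks to unbounded Python integers it aligns by left-shifting instead of right-shifting, so no bits are ever shifted out, the INIT&ROUND-CHECK stage becomes unnecessary, and only the FINAL-CHECKING comparison remains. (The hardware version, confined to 54-bit registers, must check the shifted-out bits as described above.) The integer path reproduces each floating-point addition exactly — round to nearest, ties to even, at 53 significant bits — so any mantissa corruption of FA is caught at the final check.

```python
import math

def to_int(x):
    """Exact (mantissa, exponent) pair for a finite float: x == m * 2**e."""
    f, e = math.frexp(x)               # x = f * 2**e, 0.5 <= |f| < 1
    return int(f * 2 ** 53), e - 53

def postnormalize(m, e):
    """Round m * 2**e to 53 significant bits, nearest-even: the integer
    mirror of the floating-point adder's postnormalization/rounding."""
    s, m = (-1, -m) if m < 0 else (1, m)
    excess = m.bit_length() - 53
    if excess > 0:
        q, r = divmod(m, 1 << excess)
        half = 1 << (excess - 1)
        if r > half or (r == half and q & 1):
            q += 1                     # round up (ties go to even)
        m, e = q, e + excess
    return s * m, e

def canon(m, e):
    """Canonical form so mantissa/exponent pairs compare by value."""
    if m == 0:
        return 0, 0
    while m % 2 == 0:
        m //= 2
        e += 1
    return m, e

def cfpc_accumulate(products, inject_at=None):
    FA = products[0]                   # floating-point accumulator
    IA = to_int(products[0])           # its integer shadow (Stage 1)
    for idx, p in enumerate(products[1:], 1):
        FA = FA + p                    # the floating-point addition
        (ma, ea), (mb, eb) = IA, to_int(p)
        e = min(ea, eb)                # Stage 2: align exponents (exact)
        m = (ma << (ea - e)) + (mb << (eb - e))   # Stage 3: integer add
        IA = postnormalize(m, e)       # Stage 4
        if idx == inject_at:           # simulate a fault in FA's mantissa
            FA *= 1 + 2 ** -40
    # Stage 5 (FINAL-CHECKING): FA's mantissa must match IA.
    return canon(*IA) == canon(*to_int(FA))

vals = [0.1 * k for k in range(1, 200)]
print(cfpc_accumulate(vals))                  # True: fault-free run passes
print(cfpc_accumulate(vals, inject_at=50))    # False: the fault is caught
```

Because the integer shadow tracks the rounded floating-point trajectory bit for bit, the final comparison is an exact equality — no threshold is involved at any point.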

3.2 Modified Mantissa Checksum Test

Based on the above approach, the original integer mantissa checksum test algorithm has to be slightly modified, as in Algorithm 1 below.

Algorithm 1 MOD-MANT-CHECKSUM(A, B)
// Same as Algorithm MANT-CHECKSUM(A, B) except for the modification in step 3.

3. Perform the floating-point matrix multiplication C = A · B and activate the CFPC(FA, FB, IA, IB) procedure simultaneously;

After this modification, we can see that the operations of CFPC are totally embedded in the modified mantissa checksum, MOD-MANT-CHECKSUM, where CFPC runs and tests concurrently within step 3 during the matrix multiplication C = A · B. No other procedure needs to be activated after the modified mantissa checksum test. Thus, the original HC algorithm must also be modified accordingly, as in Algorithm 2.

Algorithm 2 MOD-HYBRID-CHECKSUM(A, B)
// A is an n × m matrix and B an m × l matrix.

1. MOD-MANT-CHECKSUM(A, B);


3.3 Implementation of the Modified Hybrid Checksum (MHC) Test

Instead of letting the results stay within the PE's, as a mesh does, a 2-D unidirectional processor array [3] is capable of sending its result elements c_{i,j} out from the left end of the array at each step. In addition, without requiring a longer clock cycle, as the hexagonal processor array does [1], the unidirectional processor array can implement the proposed MHC scheme much more efficiently. The implementation of the MHC approach on a unidirectional processor array multiplying two n × n matrices A and B, for n = 3, is shown in Figure 3. There is an additional data path for sending integer elements g_{i,j} from one PE to its neighboring PE in the first n rows of the array. In addition, for checksum purposes, outputs from the right side of the array are connected to the two checksum checkers shown in Figure 3.

The data streams of matrices A, B, and C are fed into the processor array in the same way as in [3]. The column summation vector of A^mant (the matrix of the extracted mantissas of A), a'_{n+1,j} = Σ_k mant(a_{k,j}), 1 ≤ j ≤ n, is arranged in staggered form following the elements of matrix A from the top of the array. The row summation vector of B^mant (the matrix of the extracted mantissas of B), b'_{i,n+1} = Σ_k mant(b_{i,k}), 1 ≤ i ≤ n, is received from the upper input of the (n+1)th row of the array. Here the mantissa extractions mant(a_{i,k}) and mant(b_{k,j}) place the 23-bit mantissa of a single-precision floating-point number into the lower 23 bits of a 32-bit integer. The integer summation above would take at least 2^8 = 256 additions for a 32-bit integer to overflow; modulo 2^31 integer additions are used if the dimension of A or B is large. Let the final mantissa column summation vector be

X = colsum(A^mant) · B^mant = (x_{n+1,1}, x_{n+1,2}, ..., x_{n+1,n}),

and the final mantissa row summation vector

Y = A^mant · rowsum(B^mant) = (y_{1,n+1}, y_{2,n+1}, ..., y_{n,n+1})^T.

Each element x_{n+1,j}, 1 ≤ j ≤ n, of vector X is initially set to 0 and follows the input elements c_{i,j} of the jth row. Elements y_{i,n+1}, 1 ≤ i ≤ n, of vector Y, initially set to 0, are fed into the lower inputs of the PE's in the (n+1)th row.
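The checksum identity X = colsum(A^mant) · B^mant survives modulo-2^31 arithmetic, because reduction mod 2^31 is a ring homomorphism: sums and products commute with the reduction. A small sketch (the "mantissa" values are arbitrary integers below 2^23 that we made up for illustration):

```python
MOD = 1 << 31

def matmul_mod(A, B, mod=MOD):
    """Integer matrix product with every entry reduced modulo `mod`."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) % mod
             for j in range(len(B[0]))] for i in range(len(A))]

def colsum_mod(M, mod=MOD):
    """Column sums of M, reduced modulo `mod`."""
    return [sum(col) % mod for col in zip(*M)]

# Stand-ins for 23-bit extracted mantissas of single-precision inputs.
Am = [[8390001, 12000003, 9999999], [8388608, 16777215, 1], [5, 7, 11]]
Bm = [[16777215, 2], [3, 8388609], [4194304, 99]]

G = matmul_mod(Am, Bm)
X = matmul_mod([colsum_mod(Am)], Bm)[0]
print(X == colsum_mod(G))   # True: the identity holds modulo 2**31
```

Any single erroneous product perturbs both sides by different residues (unless the error is a multiple of 2^31), so the modular test loses essentially no detection power while bounding the register width.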

The mantissa checksum is accomplished as follows. Each output line of the jth row, 1 ≤ j ≤ n, carrying the outputs g_{i,j}, 1 ≤ i ≤ n, is connected through a switch sw_j to a column checksum checker, shown in Figure 3(a), which performs the integer column checksum test X = colsum(G). Initially the switching element sw_j connected to the jth row is set to the right, so that the integer outputs g_{i,j}, 1 ≤ i ≤ n, are accumulated into an integer buffer. The checking process is completed when the integer output x_{n+1,j} leaves the array. At this moment the switch sw_j is set to the left, and the integer mantissa checksum test is done by comparing the last output x_{n+1,j} against the accumulated Σ_i g_{i,j}. Therefore, each column checksum checker connected by a sw_j to the output of the jth row performs the mantissa column checksum test for the jth column of matrix G.

An integer row checksum test is accomplished by connecting every integer output line to its corresponding input line, as in Figure 3(b). The integer row summations are accumulated from each row alternately and finally compared for equality against the integer outputs y_{i,n+1}, 1 ≤ i ≤ n, from the (n+1)th row in the TSC comparator. Therefore, the row checksum checker in Figure 3(b) performs the mantissa row checksum test Y = rowsum(G).
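Together, as in standard ABFT, the column test X = colsum(G) and the row test Y = rowsum(G) not only detect a single erroneous element but also locate it: the failing row and failing column intersect at the faulty entry, and the column checksum supplies the correction. A small integer sketch (the matrix values are our own toy example):

```python
def colsum(M):
    return [sum(c) for c in zip(*M)]

def rowsum(M):
    return [sum(r) for r in M]

def locate_error(G, X, Y):
    """Compare G against its expected column/row checksums X and Y.
    A single corrupted element is located by the one failing row and
    the one failing column; return (row, col, corrected value)."""
    bad_cols = [j for j, (g, x) in enumerate(zip(colsum(G), X)) if g != x]
    bad_rows = [i for i, (g, y) in enumerate(zip(rowsum(G), Y)) if g != y]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        return i, j, X[j] - (colsum(G)[j] - G[i][j])
    return None                        # no single-element fault found

G = [[2, 7, 6], [9, 5, 1], [4, 3, 8]]
X, Y = colsum(G), rowsum(G)            # checksums held by the checkers
G[1][2] = -100                         # inject a single-element fault
print(locate_error(G, X, Y))           # (1, 2, 1): located and corrected
```

With integer checksums the comparisons are exact equalities, so a nonzero discrepancy is always a genuine fault, never roundoff.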

A unidirectional processor array works efficiently without requiring a longer clock cycle. As seen from Figure 3, PE's within the (n+1)th to the (2n−1)th columns of the first n rows (region R4) need the capability to process both floating-point and integer computations.


That is, besides the floating-point computation, whenever the last integer input a'_{n+1,k} from the top reaches such a PE, the operation a'_{n+1,k} · mant(b_{k,j}) has to be executed and accumulated into the jth element of vector X, x_{n+1,j}, in the jth row within region R4. As a result, the hardware redundancy of each PE within this area increases considerably. We discuss the structures of the PE's in every region in the next section.

3.4 Inside PE Structures for the Modified Hybrid Checksum (MHC) Test

According to the algorithms, CFPC does its error detection within each PE as follows. When, for instance in the first row of Figure 3, the two operands a_{1,3} and b_{3,1} reach PE_{1,1}, they are multiplied into c_{1,1} (or FA), which is sent to the second PE in the first row, PE_{1,2}. Also, the integer IA is formed by extracting the mantissa of c_{1,1} and is sent to PE_{1,2} at the same time. At the next step, two more operands, a_{1,2} and b_{2,1}, reach PE_{1,2}. Their product, say FB = a_{1,2} · b_{2,1}, is to be added to the previous product, FA = a_{1,3} · b_{3,1}, which has been stored in the FLP Buffer (see Figure 4). Before this floating-point addition, the mantissa of FB is extracted into the register IB for the integer addition IA ← IA + IB. Both additions then proceed independently until they are completed; the integer addition takes no extra time, since Stages 1 to 4 of CFPC are totally overlapped with the floating-point addition. The results of the two additions are sent to the register IA and the FLP Buffer of PE_{1,3} on the right. After the third partial product FB = a_{1,1} · b_{1,1} is completed and normalized, in case exp(FA) < exp(FB), the bits of the mantissa of FA to be shifted out and stored in the register z of the FLP ALU [4] must be compared against their corresponding bits in IA before the alignment operation starts. At that moment the switch sw_3 is set to the left and the TSC comparator does the error checking. Finally, after all floating-point and integer summations are completed, the TSC comparator compares the mantissa part of FA in the FLP Buffer against the contents of IA in Buffer 2 for the FINAL-CHECKING stage, with sw_3 set to the right. With the adoption of the CFPC test, the inside structure of a PE in the first n × n subarray is shown in Figure 4.

In Figure 4, the two additional blocks surrounded by dashed lines implement the CFPC operations. These two blocks contain three stages of the CFPC procedure — the initialization, alignment, and postnormalization stages — and can be easily realized with two 54-bit integer shift registers. The INT Adder serves two functions, marked as paths "1" and "2" in Figure 4. The first function is to accumulate the extracted mantissa into an integer buffer (Buffer 1), where the value of g_{i,j} is stored. The other function is the CFPC addition IA ← IA + IB after alignment. These two additions are independent even though they share the INT Adder, and they work concurrently with the floating-point normalization and the floating-point addition, respectively; thus the additional hardware and operations incur no time overhead. The TSC comparator does the error checking of the roundoff bits against FA as well as the final checking. The switch sw_3 above the comparator selects the data path, either from the rounding bits of IA in the INIT&ROUND-CHECK stage or from the accumulated inner product IA stored in Buffer 2 in the FINAL-CHECKING stage, to be checked against the final contents of the mantissa in the FLP Buffer, FA (or c_{i,j}).

Because the same INT Adder, Buffer 1, and data paths are shared by the integers g_{i,j} and IA, time multiplexing is used: the switches sw_1 and sw_2 send the two different integers at different instants of time. On one hand, setting sw_1 and sw_2 upward initially causes the mantissa checksum to use the INT Adder to add the extracted mantissa of the partial product to the g_{i,j} stored in Buffer 1 through data path "1"; the result is stored back into Buffer 1. At the next cycle, while CFPC is active, the accumulated g_{i,j} is sent to Buffer 1 of the next PE on the right through sw_2 and sw_1 (of the next PE). On the other hand, when CFPC completes its postnormalization and stores the result into Buffer 2, sw_1 and sw_2 are set downward, so that this result can flow to the right from Buffer 2 to the shift register IA of the next PE. At the same time, the floating-point number FA in the FLP Buffer moves to the FLP Buffer of the next PE; the input data a_{i,k} and b_{k,j} are also fed synchronously to its lower and right neighbors, respectively, from the buffers AR and BR.

In contrast to Figure 4, a PE located in region R4 must have the capability to process both floating-point and integer computations. An additional integer multiplier and some switching controls might be needed; however, to reduce the hardware overhead, one may consider the PE structure shown in Figure 5. This type of PE performs the same floating-point operations as the one in Figure 4 when the switches sw_4, sw_5, and sw_6 are set to the left; these settings operate just as described for Figure 4. Once the integer input a'_{n+1,k}, the floating-point input b_{k,j}, and the integer checksum element x_{n+1,j} of the jth column reach a PE in the jth row within region R4, the switches sw_4, sw_5, and sw_6 are set to the right so that the PE can perform the integer computations. Since sw_4 is set to the right, the mantissa of the floating-point input b_{k,j} is extracted (through data path "2") and multiplied with a'_{n+1,k} in the FLP Multiplier. The result is sent to the INT Adder to be accumulated into the value of x_{n+1,j}, which is then sent to the neighboring PE at the next clock cycle through the output line of g_{i,j}. Note that when x_{n+1,j} is fed into the leftmost PE in region R4, its sw_6 is set to the right (up), and this value then goes through the other line (used for sending g_{i,j}) to the INT Buffer (Buffer 1), waiting to be accumulated by the INT Adder through data path "2". Note also that, from this point on, the FLP Adder and FLP Buffer in the remaining PE's no longer function (c_{i,j} = 0 for i, j > n).

To perform the additional integer multiplication, the FLP Multiplier requires some extra hardware, switches, and wires to control the data flows, but what we gain is saving the hardware of a separate integer multiplier. A flag is set when the integer inputs a'_{n+1,k} and x_{n+1,j} reach the PE; the FLP Multiplier then changes its function to an integer multiplier, and the switches sw_4 and sw_5 are set to the right (only sw_6 in the leftmost PE's of region R4 is set to the right). Note that the sign-and-mantissa part of the FLP Multiplier must be extended to 56 bits (including the sign bit) instead of 53 bits; these extra 3 bits hold 0 while floating-point computations are being performed.

PE's in regions R1 and R2 process integer computations only; their structures, shown in Figure 6, are much simpler than those of the PE's described above. These PE's need only either extract the mantissas of the floating-point inputs b_{k,j} in the jth row, 1 ≤ j ≤ n, if they are located in region R1, to compute the elements of vector X, or extract the mantissas of the floating-point inputs a_{i,k} in the (n+1)th row, if they are located in region R2, to compute the elements of vector Y. These two extraction operations are controlled by an enable line e, as shown in Figure 6, depending on where the PE is located and what kind of inputs it has. An integer multiplier computes the product of the two integers, and the product is accumulated by the INT Adder into the value of x_{n+1,j} stored in the INT Buffer, from which element x_{n+1,j} is sent out at the next clock cycle. Note that the input line for the value of c_{i,j} is unused if the PE is located in region R2.

A PE located in region R3 acts only as a buffer, storing the data sent from its left (top) side and transferring them to its right (lower) output. The value of the scalar z is not computed in the (n+1)th row; an inconsistent row and column checksum can be detected only by the other two equality tests, X = colsum(G) and Y = rowsum(G) (similar to the implementation of the floating-point checksum test on hexagonal processor arrays in [1]).


3.5 Analysis of the Modified Hybrid Checksum Test

To simplify our analysis, we assume that C_{n×n} = A_{n×n} · B_{n×n}. The total number of PE's used in a floating-point matrix-matrix multiplication without a fault-tolerance technique is 2n^2 − n. For the MHC test, the total number of PE's used is 2n^2 + n − 1. However, each PE used in the MHC method has been modified to accommodate the hardware requirements of the integer mantissa checksum test as well as the CFPC test; hence, comparing the number of PE's in the two methods does not reflect the real hardware cost of the MHC test.

Let k be the ratio of the hardware complexity of an integer PE to that of a floating-point PE; then k < 1. These PE's are located in regions R1 and R2 of Figure 3. Let k' be the corresponding ratio for a PE located in region R3, acting as a buffer; then k' << 1. Let k'' be the ratio of the hardware complexity of a modified floating-point PE to that of an original floating-point PE; then k'' > 1, since each such PE has an additional INT Adder and INT Buffer attached for the mantissa checksum approach, plus two additional shift registers, three switches, a TSC comparator, and some more wires for the proposed CFPC test, as shown in Figure 4. These PE's are located in the first n × n subarray. Let k''' be the ratio of the hardware complexity of a modified floating-point PE capable of both floating-point and integer computations to that of an original floating-point PE; then k''' > k'', since each PE in region R4 has a modified FLP Multiplier, three additional switches, and some more wires compared to the k'' case.

The hardware complexity of the MHC test compared to that of the array without fault tolerance is listed in Table 1. The hardware complexity of the MHC test is obviously less than that of Assaad and Dutt's HC approach, since we have removed the row and column of PE's used for the threshold floating-point checksum test.

Regarding time overhead, since any integer operation is much faster than its floating-point counterpart, the integer-based computation in CFPC takes much less time than a floating-point addition. The only considerations are the time for error checking before alignment (necessary only when the exponent of FA is less than that of an incoming new product FB) and the time for the FINAL-CHECKING stage. The number of such error checkings is n/2 + 1 in the average case and, in fact, each of these comparisons can be finished within one cycle. Compared with the time the algorithm THRESH-FLOAT-CHECKSUM [2] spends calculating the two full checksum matrices C_f and D_f, this little extra time can be neglected, because a bit-string comparison between two registers within a PE is much faster than the checksum comparisons incurred by the HC approach. Note that the PE's we designed, especially those in the first n × n subarray and in R4, take almost the same computation time as the original PE's without fault-tolerance capability, since the additional integer operations execute concurrently with their floating-point counterparts without any time overhead. Note also that the row and column mantissa checksum tests add no extra checking time, since these operations complete before the data have run through the processor array. The overall execution time to complete a matrix multiplication on a unidirectional processor array using the MHC test is shown in Table 1.
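As a quick illustration of the execution-time entries of Table 1 (4n − 1 steps without fault tolerance, 5n + 1 steps with the MHC test), the relative time overhead approaches a constant as n grows; the snippet below simply evaluates the two formulas:

```python
def time_overhead(n):
    """Ratio of MHC execution time (5n + 1 steps) to that of the
    unprotected array (4n - 1 steps); both formulas from Table 1."""
    return (5 * n + 1) / (4 * n - 1)

for n in (3, 10, 100, 1000):
    print(n, round(time_overhead(n), 3))
# The ratio tends to 5/4: roughly 25% more steps, independent of n.
```

In other words, the latency cost of the MHC test is a bounded constant factor rather than a function of the problem size or of the data's dynamic range.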

4. Conclusion

The proposed CFPC uses the less error-prone integer ALU to provide concurrent error detection for floating-point matrix multiplications on ABFT processor arrays. By reducing the time and hardware overhead and avoiding the tedious threshold checksum method, the CFPC test is a cost-effective approach with 100% error detection/correction capability in floating-point additions. In conjunction with the modified mantissa checksum test, the floating-point matrix-matrix multiplication problem in ABFT has been completely solved by the Modified Hybrid Checksum (MHC) test.

The proposed MHC scheme is not only a robust and efficient approach to detect or correct errors (under the CFPC scheme) in unidirectional processor arrays, but also a flexible and versatile approach that can be applied to other multiple-processor architectures.

Table 1. Complexities of the original vs. the MHC test in a 2-D unidirectional processor array.

                             Original FLP matrix mult.   FLP matrix mult. using the
                             without fault tolerance     Modified Hybrid Checksum test
PE's                         2n^2 − n                    k(3n − 1) + k' + k''n^2 + k'''(n^2 − n)
Buffers between PE's         2n^2 − 2n                   2n^2 + n − 1
INT col. checksum checker    0                           n
INT row checksum checker     0                           1
Execution time               4n − 1                      5n + 1

References
[1] K.-H. Huang and J.A. Abraham, "Algorithm-Based Fault Tolerance for Matrix Operations," IEEE Trans. Computers, Vol. C-33, No. 6, pp. 518-528, June 1984.
[2] F.T. Assaad and S. Dutt, "More Robust Tests in Algorithm-Based Fault-Tolerant Matrix Multiplication," Proc. FTCS-22, pp. 430-439, July 1992.
[3] J.H. Kim and S.M. Reddy, "Easily Testable and Reconfigurable Two-dimensional Systolic Arrays," Computer System Science and Engineering, Vol. 7, No. 3, pp. 160-169, 1992.
[4] Kai Hwang, Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, New York, 1979.
[5] P. Banerjee, J.T. Rahmeh, C. Stunkel, V.S. Nair, K. Roy, V. Balasubramanian, and J.A. Abraham, "Algorithm-Based Fault Tolerance on a Hypercube Multiprocessor," IEEE Trans. Computers, Vol. 39, No. 9, pp. 1132-1145, Sep. 1990.
[6] V.S.S. Nair and J.A. Abraham, "Real-Number Codes for Fault-Tolerant Matrix Operations on Processor Arrays," IEEE Trans. Computers, Vol. C-39, No. 4, pp. 426-435, April 1990.
[7] A. Roy-Chowdhury and P. Banerjee, "Tolerance Determination for Algorithm-Based Checks Using Simplified Error Analysis Techniques," Proc. FTCS-23, pp. 290-298, June 1993.
[8] Dah-Yea D. Wei, "Complete Checking in Algorithm-Based Fault-Tolerant Matrix Operations on Processor Arrays," Ph.D. Dissertation, University of Southwestern Louisiana, December 1993.
[9] Dah-Yea D. Wei, Jung H. Kim, and T.R.N. Rao, "Complete Checking in Algorithm-Based Fault-Tolerant Matrix Operations on Processor Arrays," to appear in Journal of Microelectronic Systems Integration, Vol. 1, No. 4.
